super-fast triangle-triangle intersection algorithm

bootstrap · Post by **bootstrap** » Tue Aug 09, 2011 3:46 am

Some time ago one of the geniuses here mentioned a super-efficient way to implement an algorithm for triangle-triangle intersection. I vaguely recall it may have been a SAT variant, but I'm not sure, and for my purposes I don't care. So I ask the general question: what is the fastest known technique for triangle-triangle intersection?

PS: In case it matters, in my code I already know the triangle pairs I'm testing overlap on all three principle axes (x,y,z), so they are at least "fairly likely" to intersect. I can certainly imagine that knowing this from the start might allow the order of operations to be optimized, and perhaps some operations to be eliminated.

bootstrap · Post by **bootstrap** » Thu Aug 11, 2011 12:13 am

A followup post, after many more hours of searching for "the fastest technique".

I invented and implemented a triangle-triangle intersection routine for my engine, but want to speed up the performance of this speed-critical routine. After a lot of research I found what appears to be the fastest known algorithm, but the paper that describes it is... well... over my head. I can usually beat myself up sufficiently to eventually understand such things, but this time I'm too far out of my league (ie, much too stupid). A friend took at shot at converting it to code, but his attempt isn't quite working right.

My brain went out the window when the very first equation in the paper had three variables that appeared out of nowhere and were never described or shown in the illustrations (alpha1, alpha2, Beta-sub-i). Maybe every self-respecting math whiz knows these are standard terminology for something or other, but that also explains why I'm stumped, because I always have to grunt and groan my way through math.

So I'm looking for a volunteer to either explain this to me and/or convert to C code. I prefer both, of course, but I understand beggars can't be choosers. According to the article the algorithm is quite compact (my friend's function is short enough), so this appears to be more an issue of understanding the math than writing code. Specifically, the algorithm executes a total of 97 math opcodes (all being plus, minus, multiply and compares, (no divide, trig, abs or etc), plus some assignments of course. Much of the math is just subtracting positions/vectors and performing cross-products (one line function calls), so this routine isn't many lines of C code.

Any takers?

You're welcome to look at the functions I already have (my way that apparently works, plus my friend's attempt at this paper algorithm). I can't post these functions on a public forum, but I'm happy to email them to anyone willing to help. Still, I doubt they'd help you, and in fact might lure you into the same confusions that foiled us.

The article is attached, so you can figure out in advance whether you understand the math and technique or not. My technique is totally brain-dead obvious and intuitive, but their "brilliant math-whiz way" is not at all intuitive to me. But my code executes a total of 6 divide opcodes and theirs executes none --- so I'm sure theirs beats mine in a speed comparison by a fair margin.

If nothing else, I hope I save somebody out there the effort of searching for hours and hours for the fastest triangle-triangle intersection algorithm. And if this isn't the fastest algorithm, please say so!

PS: Apparently I can't attach PDF files! Yikes! Okay, find the article here:
http://www.iceapps.com/collision_detect ... p_0001.pdf

Remotion · Post by **Remotion** » Thu Aug 11, 2011 12:21 pm

Hello,

have you already read this post on Christer Ericson blog?

My brain went out the window when the very first equation in the paper had three variables that appeared out of nowhere and were never described or shown in the illustrations (alpha1, alpha2, Beta-sub-i).

You can rewrite this plane ray intersection equation this way.
P + u*p1 + v*p2 = Q + t*q
The left part of the equation is a parametric plane.
Using alpha1(u) and alpha2(v) scalars one can describe any point on the plane.
The right is the ray.
Q is the origin of the ray.
q is the direction of the ray.
And the beta(t) is the scalar that describes the intersection point.

Why not start from the authors code?
It is available here:
Ilan Shimshoni

It would be also interesting to see better (real case) comparison of all 4 algorithms on modern CPU and/or GPU too.
Robustness and handling of all degenerate cases is very important too.

Remo

bootstrap · Post by **bootstrap** » Thu Aug 11, 2011 8:26 pm

Remotion wrote:Hello,

Have you already read this post on Christer Ericson blog?

My brain went out the window when the very first equation in the paper had three variables that appeared out of nowhere and were never described or shown in the illustrations (alpha1, alpha2, Beta-sub-i).
You can rewrite this plane ray intersection equation this way.
P + u*p1 + v*p2 = Q + t*q
The left part of the equation is a parametric plane.
Using alpha1(u) and alpha2(v) scalars one can describe any point on the plane.
The right is the ray.
Q is the origin of the ray.
q is the direction of the ray.
And the beta(t) is the scalar that describes the intersection point.

Why not start from the authors code?
It is available here:
Ilan Shimshoni

It would be also interesting to see better (real case) comparison of all 4 algorithms on modern CPU and/or GPU too. Robustness and handling of all degenerate cases is very important too.

Remo

What a fabulously helpful reply! Thanks very much. Where do I start?

That is a very interesting blog, written from a very practical standpoint, which I much appreciate because I'm all about practical implemention, not theoretical trickiness. Not that I don't appreciate elegance, because I do - when I'm smart enough to recognize it. And yeah, I paid a lot of attention to avoiding numerical troubles when I implemented my straightforward algorithm, and that aspect is important to me too.

I can't believe the hours of search-and-read I went through didn't lead me to either that blog or to the page with the source code. Thanks much for both. I will compare his source with my friend's currently [slightly] malfunctioning attempt to implement the tropp algorithm. If the source code you pointed me to can be made compatible with our argument lists (can't imagine why not), then I'll create a version of his code and test against the other algorithms I have.

I already have my engine set up to test these algorithms. Currently I have the two algorithms extinguish (set to zero) the red and blue colors of the vertices of intersected triangles, so I can watch collisions happening and see whether neither, one, other, or both routines detect a collision on each triangle by the displayed color of each triangle. I will make his code extinguish the remaining blue channel and see what happens. And I'll perform benchmarks.

Yeah, I too have no problem writing SIMD/SSE3+ assembly language. I already have 3 or 4 of the most crucial routines implemented that way, and maybe I'll implement one of these eventually too. The biggest hassle with assembly language these days is the need for 4 versions (windoze form, linux form, and separate forms for each to support 32-bit and 64-bit mode).

I doubt I'll take time to write an article, but I will be sure to post whatever results and conclusions I come up with here in this forum. That's the least I can do to repay your help. And if you want to take my results and write an article, more power to you!

bootstrap · Post by **bootstrap** » Fri Aug 12, 2011 5:54 am

Okay, now I have three working functions: my original (that I fully understand), the code from the website you pointed me at, plus a variation that I tried and produces the same results as the other functions. All 3 functions report the same results... always. Maybe in 100s of billions of calls one might vary, but not after 100s of millions so far.

I captured the 64-bit CPU clock register immediately before and after each function and added each delay for each function to a separate variable. After calling each function 100,000,000 times each, here is the average time consumption in CPU cycles:

function #1: 1314 cycles // original
function #2: 0902 cycles // code with article
function #3: 0801 cycles // new variant

Somehow I doubt I can speed this up much more with different algorithms. Whether I decide it is crucial enough to convert to SIMD/SSE3+ assembly language remains to be seen. All versions clearly have some code that would be more efficient in SIMD/SSE3+ assembly language.

Thanks for all the great information and tips!

bootstrap · Post by **bootstrap** » Fri Aug 19, 2011 4:08 am

I hesitate to post the code here, but I'm happy to send these 3 functions to anyone who wants them.

Now I'm trying to find the fastest [combination-of] sort algorithms for another part of the engine. My plots of "performance" versus "# of elements" versus "degree of disorder" of array contents is very surprising. This engine is dragging me into lots of funky places!

Remotion · Post by **Remotion** » Sat Aug 20, 2011 1:05 pm

Well it would be very nice to test the speed in my environment.
I would be thankful if you could send it to me.

It would be also nice to rip of all of this defines and make the code more modern c++, template based.

Have you uses totally different triangles in all 100,000,000 tests, or the same triangle?

Probably the best way to code SIMD code today is to use intrinsics.
They should be pretty portable. At least much more portable as assembler code.

bootstrap · Post by **bootstrap** » Sun Aug 21, 2011 12:57 am

Remotion wrote:Well it would be very nice to test the speed in my environment. I would be thankful if you could send it to me.

It would be also nice to rip of all of this defines and make the code more modern c++, template based.

Have you uses totally different triangles in all 100,000,000 tests, or the same triangle?

Probably the best way to code SIMD code today is to use intrinsics. They should be pretty portable. At least much more portable as assembler code.

I'll prepare a ZIP file and send it to you via PM (personal message). Let me know whether you have different results.

My routines don't have any macros or defines. I'd say my routines are pretty much clean C code, but no templates.

BTW, the tropp code involves a bit of a scam, obviously concocted to boost speed by pushing part of the required code outside of his function by him requiring unnatural input arguments (four edge vectors and two triangle vertex positions instead of what every engine has available - the positions of the 6 triangle vertices). So I made sure ALL my timings were based on a level playing field - the timer starts with the same arguments - the 6 triangle vertices - and thus computing two edge vectors is overhead in his routine that he doesn't count. I hate when people pull scams like that. Still, I appreciate their work, because my improvement of their code is the fastest routine, and I would never have figured out their approach myself.

I ran my tests on my engine with a bunch of objects rotating and moving around in 3D space and suffering random collisions with each other. I had the collision detection code call all these routines with the same data each time it needed to determine whether two triangles were intersecting. In all cases, the vertices of both triangles had just been accessed by the collision detection routine, and therefore in all cases that information was in the cache. Therefore, every triangle-pair sent to these routines was unique (and fairly randomized), and no routine had any cache advantages or disadvantages.

I program in straight assembler. I dislike intrinsics for various reasons, but I won't get into that here (not relevant to this topic). Oh, but I do have a question related to this. Currently I have 4 versions of each assembly language routine: #1: 32-bit mode MASM syntax; #2: 32-bit mode linux syntax; #3: 64-bit mode MASM syntax; #4: 64-bit mode linux syntax. Does anyone know whether it is possible to compile the linux syntax code on windoze with the tools that would be running with CodeBlocks (on windoze)? Clearly this process generates an object file that gets linked into an executable that runs on windoze. My question is, if I was to take that object file and make it part of a VisualStudio project, would that link in correctly? If only one of the function protocols is supported in this mode (cdecl or stdcall or whatever), I can live with that. It would be nice to downshift from keeping 4 versions in sync to "only" 2 versions.

-----

Later: I found that PM doesn't let me attach files. Oh well. I created a ZIP file that contains two .cpp files and uploaded the ZIP file where you can download it: http://www.iceapps.com/triangle_triangl ... ection.zip.

One .cpp file is the tropp code exactly as his partner posted it on the internet, except for a few very minor formatting touch-ups. This is precisely the file I called to perform the speed tests I mentioned. Note again that you need to start the timer before you compute the four edge vectors his function requires as arguments, or you're not benchmarking on equal footing. You will find those four lines of code near the top of my "improved version" of tropp function in the other file. Just start your timer, execute those four functions, then call the tropp function, then stop the timer when it returns. That will be equivalent to calling the other intersection functions in the other file (that perform those operations inside those functions). Probably I should have put that code inside tropps function and changed his arguments, but I didn't feel right doing that (even with comments to that effect), so I didn't.

The other .cpp file contains functions from my engine, including two additional triangle-triangle routines. The one labeled "30% slower" is the one that makes perfect sense to me and is what I wrote from tabula rasa originally. It executes 6 divide operations which surely slows it down. I see how to eliminate the divide overhead on 4 of those divides by performing other non-math operations before I access the result of the divide. Furthermore, two divides can be performed in parallel in SIMD/SSE2+ assembly/intrinsics. But I haven't done any of this because the code code would become less readable and especially because I suspect the routine would probably still be 10% to 20% slower than my cleaned-up and slightly improved version of the tropp routine. I did find a couple other optimizations I could make in my original function, but I sorta lost interest because I have lots of other work to do on my engine, and suspect that I would never quite make it faster than the current fastest function.

ig_triangle_triangle_intersection() - fastest one, improvement of the tropp routine
ig_triangle_triangle_intersectiono() - my original tabula rasa implementation
tri_tri_intersect3D - tropp function in separate file

In the second file I included several other functions that these functions call, just in case replacing them isn't easy for you (or will be easier to replace with these to examine).

A few notes about my code. I have typedefs in a file to make consistent, readable type names, so you'll see lots of variables declared with types like f64, s32, s64, and so forth. I'm sure you can create equivalent typedefs in your program without any trouble, and I'm sure you know f64 is "double", f32 is "single" or "float" (gee, i forget which it is now!), s32 is a 32-bit signed integer, u32 is a 32-bit unsigned integer, and so forth. The only non-trivial type names might be "cpu" and "f64vec4". The "cpu" type is a 32-bit signed integer when compiling 32-bit mode applications, and a 64-bit signed integer when compiling 64-bit mode applications. f64vec4 is what it sounds like - a structure containing four f64 variables called .x .y .z .w defined with convenient unions so the variables can alternatively be accessed as s64 integers as .ix .iy .iz .iw or as array elements .a[0] ~ .a[3]. If you need or want my file that creates all these and many other useful type names, let me know.

You will also note my math function names are "excessively readable and descriptive"... as in "very long". At least you're unlikely to wonder what these functions do. You'll see what I mean. Also, the code is written to "align and look right" when tabs are displayed as 4 spaces wide. Please note the terms attached to this code as stated at the top of the file. Note that the tropp functions (in the separate .cpp file) are not my code, but was downloaded from the internet.

If you find these functions call any functions that I didn't include, let me know and I'll dig them up.

Let me know what are your impressions, opinions and independent timing results.

Remotion · Post by **Remotion** » Mon Sep 05, 2011 4:32 pm

I finally have time to lock at the code.
This is pretty unusual style but this should be not a problem.

The problem with Tropp intersection is the it only test if there is a intersection or not.
But usually one what to know more for example the segment of intersection.
Calculating this after boolean test will probably cost much more time as reusing data from boolean test.
For tropp function I see no way to reuse the data for segment of intersection calculation.

So if another function is only 30% slower but you can use it to calculation segment of intersection then this is preferable one.

bootstrap · Post by **bootstrap** » Thu Sep 08, 2011 6:13 pm

Remotion wrote:I finally have time to lock at the code. This is pretty unusual style but this should be not a problem.

The problem with Tropp intersection is the it only test if there is a intersection or not. But usually one what to know more for example the segment of intersection. Calculating this after boolean test will probably cost much more time as reusing data from boolean test.
For tropp function I see no way to reuse the data for segment of intersection calculation.

So if another function is only 30% slower but you can use it to calculation segment of intersection then this is preferable one.

I look forward to your results.

Depending on how many triangles you test for intersection versus how many triangles you need detailed intersection information for, perhaps running the faster routine to detect intersections, then re-running the 30% slower routine on those with intersections might be the fastest approach overall.

Let us all know what are your final conclusions.

bootstrap · Post by **bootstrap** » Tue Sep 13, 2011 1:44 am

jerry0503222 wrote:I hesitate to post the code here, but I'm happy to send these 3 functions to anyone who wants them.

Now I'm trying to find the fastest [combination-of] sort algorithms for another part of the engine. My plots of "performance" versus "# of elements" versus "degree of disorder" of array contents is very surprising. This engine is dragging me into lots of funky places!

Is that a question? A request? ?????

Real-Time Physics Simulation Forum

super-fast triangle-triangle intersection algorithm

super-fast triangle-triangle intersection algorithm

Re: super-fast triangle-triangle intersection algorithm

Re: super-fast triangle-triangle intersection algorithm

Re: super-fast triangle-triangle intersection algorithm

Re: super-fast triangle-triangle intersection algorithm

Re: super-fast triangle-triangle intersection algorithm

Re: super-fast triangle-triangle intersection algorithm

Re: super-fast triangle-triangle intersection algorithm

Re: super-fast triangle-triangle intersection algorithm

Re: super-fast triangle-triangle intersection algorithm

Re: super-fast triangle-triangle intersection algorithm