Sweep and Prune optimizations for colliding classes?

Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Just to make it clearer: no data duplication is necessary

Post by Erwin Coumans »

To make the physics simulation scalable, divide and conquer seems a natural way to go, both for collision detection and for constraint solving.

Splitting a universe into smaller spaces, where each space has its own broadphase, only requires boundary management, as I mentioned before. Objects in the boundary area only need to be inserted into all overlapping broadphases, but that doesn't require a real copy of the data: you can just increase the reference count (instancing), so there is no real duplication apart from the broadphase AABB entry.

So no collision shape data or rigid body is ever duplicated, except for a tiny AABB entry. Also, the instance disappears when the object's 'simulation island' no longer overlaps the boundary.
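
A minimal sketch of this instancing scheme (illustrative names; assuming each space owns a simple array-based broadphase):

Code:

#include <cstddef>
#include <vector>

struct RigidBody;  // shared body data, never duplicated

// One broadphase entry (proxy): just an AABB plus a pointer to the shared body.
struct BroadphaseProxy {
    float aabbMin[3], aabbMax[3];
    RigidBody* body;  // shared, reference-counted instance
};

struct BodyInstance {
    RigidBody* body;
    int refCount;  // one reference per broadphase the body is inserted in
};

// When a body's AABB starts overlapping a neighboring space's boundary,
// insert a proxy there and bump the refcount; no shape or body data is copied.
void addToSpace(BodyInstance& inst, std::vector<BroadphaseProxy>& spaceBroadphase,
                const float* aabbMin, const float* aabbMax) {
    BroadphaseProxy proxy;
    for (int i = 0; i < 3; ++i) {
        proxy.aabbMin[i] = aabbMin[i];
        proxy.aabbMax[i] = aabbMax[i];
    }
    proxy.body = inst.body;
    spaceBroadphase.push_back(proxy);
    ++inst.refCount;
}

// When the simulation island leaves the boundary, drop the proxy and the reference.
void removeFromSpace(BodyInstance& inst, std::vector<BroadphaseProxy>& spaceBroadphase,
                     std::size_t proxyIndex) {
    spaceBroadphase[proxyIndex] = spaceBroadphase.back();  // swap-with-last removal
    spaceBroadphase.pop_back();
    --inst.refCount;
}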

Erwin
Erin Catto
Posts: 316
Joined: Fri Jul 01, 2005 5:29 am
Location: Irvine

Post by Erin Catto »

Erwin,

I like your ideas. They make sense, even though they seem like they might be a bit painful to implement.

Do you have a sense of what conditions lead to the broadphase partitioning becoming necessary? Number of bodies, world size, etc.

In our game, dealing with a couple hundred broadphase boxes, the broadphase performance was never an issue (except for raycasts).
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

Erin Catto wrote: Erwin,
Do you have a sense of what conditions lead to the broadphase partitioning becoming necessary? Number of bodies, world size, etc.
As soon as far away objects cause noticeable performance decrease (like Pierre mentioned in this topic http://www.continuousphysics.com/Bullet ... .php?t=328) you can consider splitting the broadphase. With large streaming worlds it is very likely to be beneficial.
In our game, dealing with a couple hundred broadphase boxes, the broadphase performance was never an issue (except for raycasts).
Did you consider using the ray-through-SAP implementation? Especially on PlayStation 2 you can use the scratchpad, which makes it faster than brute-force (ray versus all AABBs) implementations, both worst case and average case. Although Pierre Terdiman has different results, I think he didn't make the algorithm cache friendly, incrementally updating bitfields (see http://www.continuousphysics.com/Bullet ... c.php?t=50 ).
John McCutchan
Posts: 133
Joined: Wed Jul 27, 2005 1:05 pm
Location: Berkeley, CA

Post by John McCutchan »

Erwin Coumans wrote: I'm only aware of the stabbing numbers that don't work nicely with this.
But as Gino and I already concluded, there are better methods than stabbing, which don't suffer from large AABBs. So apart from this single case, I would love to know a few of those 'many' optimizations that suffer from those large AABBs.
What alternatives are there to stabbing numbers?
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

You can add 'marker objects' distributed over the world into the SAP broadphase. Then, before inserting a batch of new objects (or doing one-shot queries), first search for the nearest marker object (not using the SAP broadphase, but your own method, like binary search or hash lookup), and start from the begin/end points of this marker. This can save a lot of swaps. Other optimizations include splitting the broadphase into multiple broadphases.
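
A rough sketch of the marker lookup (illustrative names; assuming the markers are also tracked in a small array sorted by coordinate, so the nearest one can be found by binary search):

Code:

#include <algorithm>
#include <vector>

// Marker endpoints along one SAP axis: a world-space value plus the endpoint's
// current index inside the SAP's sorted endpoint array for that axis. In a real
// SAP the markers are ordinary non-colliding proxies, so their endpoint indices
// stay correct as swaps happen.
struct Marker {
    float value;        // world-space coordinate of the marker endpoint
    int endpointIndex;  // where that endpoint currently sits in the SAP axis array
};

// Markers sorted by value; returns the SAP endpoint index to start the
// insertion scan from, instead of walking in from one end of the axis.
int findStartEndpoint(const std::vector<Marker>& markers, float insertValue) {
    auto it = std::lower_bound(markers.begin(), markers.end(), insertValue,
        [](const Marker& m, float v) { return m.value < v; });
    if (it == markers.begin()) return 0;  // insert before the first marker
    return (it - 1)->endpointIndex;       // nearest marker at or below insertValue
}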

Another useful consideration is the overlapping pair-list management: you can avoid removing overlapping pairs during the SAP swaps, and instead remove the pairs that are no longer overlapping during the pair-list traversal (needed for narrowphase). Then you can store the overlapping pair list as a flat contiguous array. The only caveat is that you have to remove duplicated pairs from this array, which can easily be done by sorting the array once at the end, and then pruning duplicates while traversing the array to dispatch the narrowphase calculations.
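
A compact sketch of this deferred cleanup (illustrative code; the overlap test and narrowphase dispatch are assumed hooks into the rest of the engine, stubbed out here):

Code:

#include <algorithm>
#include <vector>

struct Pair {
    int a, b;  // proxy ids, stored with a < b so duplicates sort together
    bool operator<(const Pair& o) const { return a < o.a || (a == o.a && b < o.b); }
    bool operator==(const Pair& o) const { return a == o.a && b == o.b; }
};

// Assumed hooks into the rest of the engine (placeholders here):
bool aabbsStillOverlap(int /*a*/, int /*b*/) { return true; }
void dispatchNarrowphase(int /*a*/, int /*b*/) {}

// During the SAP swaps we only *append* new pairs; removal is deferred to here.
void processPairs(std::vector<Pair>& pairs) {
    std::sort(pairs.begin(), pairs.end());  // duplicates become adjacent
    std::size_t write = 0;
    for (std::size_t i = 0; i < pairs.size(); ++i) {
        if (i > 0 && pairs[i] == pairs[i - 1]) continue;          // prune duplicate
        if (!aabbsStillOverlap(pairs[i].a, pairs[i].b)) continue; // prune stale pair
        pairs[write++] = pairs[i];
        dispatchNarrowphase(pairs[i].a, pairs[i].b);
    }
    pairs.resize(write);  // the array stays flat and contiguous
}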

I might add those samples to the public version of the Bullet physics library at some stage, if there is interest.

Hope this helps,
Erwin
Pierre
Posts: 67
Joined: Mon Jul 25, 2005 8:56 am

Post by Pierre »

Did you consider using the ray-through-SAP implementation? Especially on PlayStation 2 you can use the scratchpad, which makes it faster than brute-force (ray versus all AABBs) implementations, both worst case and average case. Although Pierre Terdiman has different results, I think he didn't make the algorithm cache friendly, incrementally updating bitfields
My implementation was cache-friendly, yet maybe not cache-friendly enough :)

There are two things that I really don't like with the ray-through-SAP approach:

- if the ray happens to cross the whole world without touching anything, it *will* be slower than brute-force (simply because you're touching all objects anyway, and there's the DDA overhead). I agree it's a pathological case, and most of the time it's not an issue, but it still "feels" wrong.

- if I don't already have a shape inside the SAP, from which I can start the ray, then finding this starting point is painful/slow.

On top of that, the more I use it, the more I dislike the whole SAP anyway. It just doesn't scale well. If all the objects are moving, a stupid radix-based brute-force approach outperforms the SAP, no matter how you implement it. So on one hand there is the SAP, great when nothing is moving, lame when everything is moving. And on the other hand there is the radix approach, taking pretty much the same time no matter what. I think it's a well-known fact that we should optimize for the worst case, right? (Making the best case better and the worst case worse is not a good approach; a constant 30 FPS is better than constantly dropping from 60 to 10, blah, blah.) So... the more I think about it... the more I like the brute-force box pruning. If I had to start a new engine again, from scratch, I think I would go this way, with "all shapes are moving" in mind. At least then the system wouldn't collapse when they actually do.
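
For reference, brute-force box pruning is roughly the following (a sketch: sort the boxes along one axis, then sweep; a radix sort would stand in for std::sort in the radix-based variant):

Code:

#include <algorithm>
#include <utility>
#include <vector>

struct AABB { float min[3], max[3]; };

// Sort box indices by min.x once, then sweep: each box only needs to be tested
// against the boxes whose min.x lies inside its own x-interval.
void boxPruning(const std::vector<AABB>& boxes,
                std::vector<std::pair<int, int>>& pairsOut) {
    std::vector<int> order(boxes.size());
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
        [&](int a, int b) { return boxes[a].min[0] < boxes[b].min[0]; });

    for (std::size_t i = 0; i < order.size(); ++i) {
        const AABB& a = boxes[order[i]];
        for (std::size_t j = i + 1; j < order.size(); ++j) {
            const AABB& b = boxes[order[j]];
            if (b.min[0] > a.max[0]) break;  // no later box can overlap on x
            if (a.min[1] <= b.max[1] && b.min[1] <= a.max[1] &&
                a.min[2] <= b.max[2] && b.min[2] <= a.max[2])
                pairsOut.push_back({order[i], order[j]});
        }
    }
}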

Then for raycasts (and all the other scene queries like frustum culling, sweep tests, overlap tests for the AI, whatever) I would maintain a separate pruning structure, almost unrelated to the physics engine. And for this I would use a dynamic AABB tree for all the scene objects. I recently added that to the Novodex SDK and it gave huge speedups over the previous loose octree/quadtree stuff!

- Pierre
dog
Posts: 40
Joined: Fri Jul 22, 2005 8:00 pm
Location: Santa Monica

Post by dog »

<removed some insanity>
Last edited by dog on Sat Jun 23, 2007 5:51 pm, edited 1 time in total.
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

I agree that different environments (static, slow moving, fast moving) might require different broadphase optimizations, but I think SAP is a great general-purpose broadphase when used in combination with an AABB tree for static objects.
Pierre wrote:
There are two things that I really don't like with the ray-through-SAP approach:

- if the ray happens to cross the whole world without touching anything, it *will* be slower than brute-force (simply because you're touching all objects anyway, and there's the DDA overhead). I agree it's a pathological case, and most of the time it's not an issue, but it still "feels" wrong.
It should be similar in speed; the computation and memory usage are O(n) in both cases. Then it's mostly a matter of implementation details. Why would a brute-force approach be faster, if both are implemented cache-friendly? I would say that cache misses are the bottleneck, not computation.
- if I don't already have a shape inside the SAP, from which I can start the ray, then finding this starting point is painful/slow.
It doesn't need to be slow: you just insert nicely distributed marker objects into the broadphase. Then for a raycast (or other single query) you first search for the closest marker object (with any fast method you like), and start from there.

- SAP should be great when many objects move slowly. If there are too many non-moving objects, I wouldn't add them into the SAP, but into a separate structure like an AABB tree, and only insert one AABB into the SAP for that whole tree.

- Improving broadphase performance and avoiding the cost of swaps for far-away objects:
The broadphase can be split into multiple broadphases, with some transition area at the beginning and end of each axis to prevent allocations when objects move from one broadphase to the other.

We found that overlapping pair management, narrowphase and meshes cause most collision detection performance issues, so we started optimizing those for Bullet PS3. Instead of removing pairs during the SAP processing, we remove redundant (non-overlapping and duplicate) pairs during the pair-array traversal. So we keep the overlapping pair array as a flat array, instead of a complicated structure, because we don't need a 'find' anymore. Parallelizing as much work as possible and managing the contact points in a more cache-friendly way can also help a lot.

One nice thing about loose octrees is that you can recompute the overlapping pair array from scratch, instead of maintaining it incrementally as SAP does. This can be a benefit when many objects overlap, because you don't need to keep the memory for all overlapping pairs.

Pierre, I'm interested to hear about your dynamic AABB tree. Can you give some more information about that? Is it spatial partitioning or object partitioning? How do you deal with the re-balancing cost when many objects move fast?

Thanks,
Erwin
Pierre
Posts: 67
Joined: Mon Jul 25, 2005 8:56 am

Post by Pierre »

It should be similar in speed; the computation and memory usage are O(n) in both cases. Then it's mostly a matter of implementation details. Why would a brute-force approach be faster, if both are implemented cache-friendly? I would say that cache misses are the bottleneck, not computation.
Exactly. And this is why the brute-force approach "won" in those bad cases: for me all the AABBs are always stored in a linear array (never in the objects themselves), so the brute-force version was just parsing a linear array of AABBs, which is the fastest you can get. On the other hand, the SAP version kept jumping from one box to the next in "random" order. Combined with the extra DDA logic needed to know where to "jump", the SAP version just ended up slower. I'm not sure what "cache friendly" would be here, for the SAP. The order in which you need to fetch the boxes always changes with each ray, while in the brute-force version it's always the same and always the "best". I don't see how the SAP version can compete.
It doesn't need to be slow: you just insert nicely distributed marker objects into the broadphase.
Yeah, I know this approach, and I absolutely hate it :) Some customers are already stressing the BP with 50,000 (!) objects or similar numbers. I really don't want to pollute the SAP with even more "junk" objects...

How many of those marker objects do you typically use? Doesn't it make the SAP slower for everything else, due to the increased number of swaps?
Instead of removing pairs during the SAP processing, we remove redundant (non-overlapping and duplicate) pairs during the pair-array traversal. So we keep the overlapping pair array as a flat array, instead of a complicated structure, because we don't need a 'find' anymore.
But... you can have a quick "find" even on a linear array, with some hashing. The "PairManager" class in our codebase (I think you still have access, right?) does exactly this. Colliding pairs are stored in a contiguous linear array, and all operations (add/remove/find) are O(1).

(Just saying.... The other approach certainly works as well)
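
Roughly the idea (not the actual PairManager code, just a sketch of the technique: a hash keyed on the two ids maps into a contiguous pair array, and remove swaps the last pair into the hole):

Code:

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Pair { int a, b; /* per-pair cached data can live here too */ };

struct PairManager {
    std::vector<Pair> pairs;                      // contiguous, cache-friendly
    std::unordered_map<uint64_t, std::size_t> lookup;  // (a,b) -> index in 'pairs'

    static uint64_t key(int a, int b) {
        if (a > b) std::swap(a, b);               // order-independent key
        return (uint64_t(uint32_t(a)) << 32) | uint32_t(b);
    }
    Pair* find(int a, int b) {
        auto it = lookup.find(key(a, b));
        return it == lookup.end() ? nullptr : &pairs[it->second];
    }
    void add(int a, int b) {
        if (find(a, b)) return;                   // already colliding
        lookup[key(a, b)] = pairs.size();
        pairs.push_back({a, b});
    }
    void remove(int a, int b) {
        auto it = lookup.find(key(a, b));
        if (it == lookup.end()) return;
        std::size_t idx = it->second;
        lookup.erase(it);
        if (idx != pairs.size() - 1) {            // swap last pair into the hole
            pairs[idx] = pairs.back();
            lookup[key(pairs[idx].a, pairs[idx].b)] = idx;
        }
        pairs.pop_back();
    }
};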

This can be a benefit when many objects overlap, because you don't need to keep the memory for all overlapping pairs.
Don't you need some kind of persistent per-pair structure anyway, to cache various things?
Pierre, I'm interested to hear about your dynamic AABB tree. Can you give some more information about that? It is spatial partitioning or object partitioning? How do you deal with re-balancing cost when many objects move fast?
It's a BV-tree, so not spatial partitioning. It's exactly the same as Opcode's AABB tree. Rebalancing is automatic since I "simply" keep recomputing a new tree in the background, over N frames, while the previous tree is still used. After N frames I switch to the new tree and start again. The current tree is "refit" when objects move, and some leaves are marked as invalid when objects are deleted while the tree is still in use. The tree never gets "too degenerated" by the refit operation, since it only lives for N frames, so the objects never go very far anyway. The theory is really easy, but the implementation was rather tricky and painful. Well worth it in the end: it's an order of magnitude faster than our previous octree stuff, and as a bonus it uses a lot less memory.
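
A minimal sketch of such a refit pass (assuming a plain binary AABB tree where internal nodes have two children; leaf bounds are assumed to have been refreshed from the moved objects first, and the background rebuild is omitted):

Code:

#include <algorithm>

struct Node {
    float min[3], max[3];
    Node* left;   // nullptr for leaves; internal nodes have both children
    Node* right;
    int object;   // leaf payload: object index (-1 when the leaf was invalidated)
};

// After objects move, walk the tree bottom-up and grow parent bounds to fit.
// The topology is untouched; the tree only lives for N frames before the
// background rebuild replaces it, so it never degenerates too badly.
void refit(Node* node) {
    if (!node->left) return;  // leaf: bounds already updated from its object
    refit(node->left);
    refit(node->right);
    for (int i = 0; i < 3; ++i) {
        node->min[i] = std::min(node->left->min[i], node->right->min[i]);
        node->max[i] = std::max(node->left->max[i], node->right->max[i]);
    }
}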

- Pierre
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

On the other hand, the SAP version kept jumping from one box to the next in "random" order.
No, it shouldn't be random access: the next 3 candidate axes should be predictable and hence cache friendly. The SAP should be 3 contiguous arrays, one for each axis, and you just traverse each array in a predictable manner. You can keep the current 'overlapping' status in a bit-array, which can fit in fast Local Store / L1 cache / scratchpad memory.
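
A fragment illustrating that bit-array bookkeeping (a sketch only, not the full ray-through-SAP query: it assumes the per-object status bits were initialized at the ray origin, and a real ray query interleaves the three per-axis scans in ray-parameter order, the DDA part, which is omitted here; all names are illustrative):

Code:

#include <algorithm>
#include <cstdint>
#include <vector>

// One sorted endpoint array per SAP axis; walking it in order is a linear,
// prefetch-friendly scan, not random access.
struct Endpoint { float value; uint32_t object; };

// 'status' holds a 3-bit mask per object: bit i is set while the current
// sweep position is inside the object's [min,max] interval on axis i.
// Crossing any endpoint (min or max) toggles that bit.
void sweepAxis(const std::vector<Endpoint>& axis, float from, float to,
               std::vector<uint8_t>& status, int axisIndex,
               std::vector<uint32_t>& candidates) {
    auto it = std::lower_bound(axis.begin(), axis.end(), from,
        [](const Endpoint& e, float v) { return e.value < v; });
    for (; it != axis.end() && it->value <= to; ++it) {
        status[it->object] ^= uint8_t(1u << axisIndex);
        if (status[it->object] == 7)           // inside on all three axes:
            candidates.push_back(it->object);  // candidate for the exact ray test
    }
}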
How many of those marker objects do you typically use? Doesn't it make the SAP slower for everything else, due to the increased number of swaps?
Fewer than a few hundred marker objects should be fine. 50k objects in the SAP sounds bad; are they all potentially moving? Can't you move them to static AABB trees? For this, multiple broadphases would be better. Also, an optimized batch-add for large numbers of added objects might be good.

Thanks for the AABB tree idea. So you re-calculate the quantized AABB tree for the scene, over some frames. Is it exposed to the user/API to choose how many frames to rebuild/replace?

Which part was painful? The refit, or flagging 'deleted' objects, or adding new objects?

Thanks,
Erwin

PS: We just added quantized nodes to the Bullet PS3 stackless AABB tree traversal, which works very nicely on the PS3 SPU. We quantize the query object's AABB before traversal, so we just check the quantized AABBs for overlap. We use 16 bytes per node, but with the restriction that we only have 16-bit triangle indices (though we allow multiple triangles per node). We could add 32-bit triangle index support, but I'm not sure we can cram that into 16-byte nodes. Hopefully I will get time soon to port this over to the public version of Bullet; it might be useful for deformable objects etc. (I think the quantized OPCODE tree doesn't have refit).
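
For illustration, a 16-byte quantized node and the one-time AABB quantization could look roughly like this (a sketch under the constraints described above; field names are illustrative, not necessarily Bullet's actual layout):

Code:

#include <cstdint>

// 16 bytes per node: quantized bounds (6 x 16-bit) plus one 32-bit field.
// A non-negative value is a triangle index (hence the 16-bit index limit,
// with room to pack a small per-node triangle count); a negative value is
// the escape offset the stackless traversal uses to skip a whole subtree.
struct QuantizedNode {
    uint16_t quantizedAabbMin[3];
    uint16_t quantizedAabbMax[3];
    int32_t  escapeIndexOrTriangleIndex;
};

// Quantize the query AABB once before traversal; every node test then becomes
// a cheap integer compare instead of a float one.
inline uint16_t quantize(float v, float bvhMin, float bvhInvExtent) {
    float t = (v - bvhMin) * bvhInvExtent;  // normalize to [0,1]
    if (t < 0.f) t = 0.f;                   // clamp to the BVH bounds
    if (t > 1.f) t = 1.f;
    return uint16_t(t * 65535.0f);
}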