Performance Options

bbangerter
Posts: 6
Joined: Thu Apr 03, 2008 4:45 pm

Post by bbangerter »

I've been put on a project to do some performance improvements. After fixing several other issues I'm currently stuck with poor performance from the Bullet physics engine. I'm just not familiar enough with the API, or with collision detection theory in general, to know where to begin looking or what parameters I might be able to tweak to get better performance. I've tried searching the forums, but thus far my search terms have either produced too broad a list of posts to search through meaningfully, or have returned only a few posts that don't seem to apply to my problem. I expect this is just because I don't really know which terms to search for to narrow things down correctly.

Scenario is as follows:
Worst case we have around 4500 boxes scattered more or less evenly throughout our world.
At any given time there are maybe as many as 500 that have enough velocity or angular velocity to be graphically visible as moving from one frame to the next. In testing, though, it seems many more are making small (graphically non-noticeable) adjustments to position and orientation per frame.

1) Is this too many objects (all boxes) to handle in a real time scenario?
2) If not, can you please suggest what I might tweak, or what search terms I might use to get more meaningful search results on the forums?

If you need more details about our specific world I'll be happy to answer them of course. Thanks.
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Re: Performance Options

Post by Erwin Coumans »

bbangerter wrote: Worst case we have around 4500 boxes scattered more or less evenly throughout our world.

Are all those boxes active? Usually Bullet will deactivate non-moving objects to improve performance. 4500 active boxes is challenging, and the PC version of Bullet needs more optimization. Check out the Bullet 2.68 Benchmarks (in AllBulletDemos); one of the benchmarks is 3000 boxes all falling in one pile. We will use these benchmarks to improve performance in the future.

For Bullet 2.68 we just added the ODE box-box collision detector, which should be a bit faster than Bullet's generic GJK convex collision detector.

Any help in making Bullet faster is welcome,
Erwin
RobW
Posts: 33
Joined: Fri Feb 01, 2008 9:44 am

Re: Performance Options

Post by RobW »

A few things to check:

1) When you create the overlapping pair cache, you pass in the maximum world extents. Make sure these are not larger than necessary as bounding boxes are quantized to fit in the range, so become looser fitting and report more false positive overlaps.
2) If your objects are failing to sleep when 'inactive', check that sleeping isn't accidentally disabled. If it isn't, enabling low pass damping should help them off to sleep.
3) Check the internal timestep hasn't been set very low; 60 steps per second (a timestep of 1/60 s) is usually sufficient.
4) Iterating 4500 objects on PS3 or Xbox 360 (and probably PC, too) will touch an awful lot of L2 cache. There are some things you can strip down to reduce the memory footprint in the constraint solver, amongst other places, I can describe this in more detail if it would be useful.
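
Point 1 can be illustrated with a rough sketch. It assumes a broadphase that stores bounding box endpoints as 16-bit integers spread across the world extents; the exact scheme Bullet uses may differ. `quantizationStep` and `quantizedWidth` are hypothetical helpers written for this illustration, not Bullet API.

```cpp
#include <cmath>

// Hypothetical helper (not Bullet API): size of one quantization cell when
// a world axis of length worldExtent is mapped onto 16-bit integers.
inline double quantizationStep(double worldExtent) {
    return worldExtent / 65535.0;
}

// Width of a box along one axis after its minimum is rounded down and its
// maximum rounded up to the quantization grid. The rounding is conservative,
// so the quantized box is never smaller than the real one.
inline double quantizedWidth(double worldMin, double worldExtent,
                             double boxMin, double boxMax) {
    const double step = quantizationStep(worldExtent);
    const double qMin = std::floor((boxMin - worldMin) / step);
    const double qMax = std::ceil((boxMax - worldMin) / step);
    return (qMax - qMin) * step;
}
```

With these assumptions, a box 1 unit wide in a world 1000 units across grows only slightly when quantized, while the same box in a world 262140 units across is padded out to 4 units, so it overlaps far more neighbours in the broadphase.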

I'll probably add some more things as I remember them; I'm not at work right now.

I have done some fairly brutal optimisation to Bullet, for example, rewriting the simulation island construction, as I found this to be the bottleneck in simulating large numbers of objects. Much of what I've done is very platform specific, but I would certainly like to discuss some things which could be harvested back.

Cheers,
Rob
bbangerter
Posts: 6
Joined: Thu Apr 03, 2008 4:45 pm

Re: Performance Options

Post by bbangerter »

For reference, this is being done on an Xbox 360. I have seen some references to multi-threaded Bullet code for the Xbox 360, but we are not currently using it.

I added some code to give me a count of active objects by iterating through all the btRigidBody objects and checking the isActive method. This generally sits around 120 active objects, and peaks as high as 300. This was with ~1800 in-game objects at the time.
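
The counting described above might look roughly like the sketch below. `BodySketch` is a stand-in type so the example compiles on its own; only its `isActive()` query mirrors the method named in the post, and in a real project the loop would run over whatever container the application keeps its rigid bodies in.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a rigid body (hypothetical type, not Bullet's btRigidBody).
// Only the activity query is modelled.
struct BodySketch {
    bool active;
    bool isActive() const { return active; }
};

// Counts the bodies that currently report themselves as active.
inline std::size_t countActive(const std::vector<BodySketch>& bodies) {
    std::size_t count = 0;
    for (const BodySketch& body : bodies) {
        if (body.isActive()) {
            ++count;
        }
    }
    return count;
}
```

Logging this count once per frame next to the frame time makes it easier to see whether slow frames line up with spikes in active bodies.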

If I let everything settle down my active box count drops to under 30 (we have a few AI objects that are constantly moving around and bumping into things).

With everything settled I am getting ~20 FPS (according to the PIX tool). Bullet is consuming about 2/5 of our CPU time (with rendering calls taking another 2/5 and various other application calls taking the last 1/5).

I noted that btDiscreteDynamicsWorld::solveConstraints accounts for slightly less than half of the processing done by bullet. Do sleeping objects still run through constraint resolution? I'm guessing yes given all these boxes have gravity acting upon them, and are stacked haphazardly - and respond correctly if something is moved. Is there any way to put a group of objects (that share a local proximity) to sleep to avoid constraint resolution until an outside object enters their proximity?

Another item of note that I picked up on while looking through the user manual again is sharing of collision shapes. We are not currently doing this (though I will add that in the near future). Currently each object has its own collision shape, even though most of them are identical in size. Is the performance gain from switching to shared shapes likely to be significant?
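
For reference, sharing shapes could be arranged with a small cache keyed on the box half extents, as in the sketch below. `BoxShapeSketch` and `BoxShapeCache` are hypothetical names used only for illustration; the stand-in struct takes the place of the real box shape class so the example is self-contained. Whether sharing gives a measurable speed-up is not established in this thread; the certain effect is fewer shape objects in memory.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <tuple>

// Stand-in for a box collision shape (hypothetical, not Bullet's class).
struct BoxShapeSketch {
    float hx, hy, hz;  // half extents
};

// Hands out one shared shape per distinct set of half extents.
class BoxShapeCache {
public:
    std::shared_ptr<BoxShapeSketch> get(float hx, float hy, float hz) {
        const auto key = std::make_tuple(hx, hy, hz);
        const auto found = shapes_.find(key);
        if (found != shapes_.end()) {
            return found->second;  // reuse the existing shape
        }
        auto shape = std::make_shared<BoxShapeSketch>(BoxShapeSketch{hx, hy, hz});
        shapes_.emplace(key, shape);
        return shape;
    }

    // Number of distinct shapes created so far.
    std::size_t size() const { return shapes_.size(); }

private:
    std::map<std::tuple<float, float, float>,
             std::shared_ptr<BoxShapeSketch>> shapes_;
};
```

With ~4500 boxes that are mostly identical in size, a cache like this would hold only a handful of shapes instead of thousands.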
RobW wrote: A few things to check:

1) When you create the overlapping pair cache, you pass in the maximum world extents. Make sure these are not larger than necessary as bounding boxes are quantized to fit in the range, so become looser fitting and report more false positive overlaps.
2) If your objects are failing to sleep when 'inactive', check that sleeping isn't accidentally disabled. If it isn't, enabling low pass damping should help them off to sleep.
3) Check the internal timestep hasn't been set very low; 60 steps per second (a timestep of 1/60 s) is usually sufficient.
4) Iterating 4500 objects on PS3 or Xbox 360 (and probably PC, too) will touch an awful lot of L2 cache. There are some things you can strip down to reduce the memory footprint in the constraint solver, amongst other places, I can describe this in more detail if it would be useful.

1) World extents are matched exactly to our world size.
2) Objects appear to be going to sleep correctly (and staying asleep until bumped by something else), so I believe I'm good here.
3) Time step has been left at the default of 60 steps per second. Max steps is set to 2 (rather than the default of 1).
4) A description of how to strip down the memory footprint (or an online reference, if you have one) would likely be useful. In general I don't expect to have more than 1500-2000 objects, but I need to account for the worst case scenario of ~4500.
RobW
Posts: 33
Joined: Fri Feb 01, 2008 9:44 am

Re: Performance Options

Post by RobW »

It sounds like most of your objects are sleeping, so that's good.

After collision detection, islands of interdependent objects are created. If an entire island is asleep, it doesn't get passed to the constraint solver. It might be worth checking how many calls are made to 'ProcessIsland' to see how many islands are getting through to the constraint solver, and how many objects are in each island.
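
The island idea can be sketched with a union-find structure: bodies joined by a contact end up in the same island, and an island is only worth solving if at least one of its bodies is awake. This is a simplified illustration and not Bullet's island code; `IslandSketch` and `countAwakeIslands` are hypothetical names.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Minimal union-find over body indices (illustrative only).
struct IslandSketch {
    std::vector<int> parent;

    explicit IslandSketch(std::size_t bodyCount) : parent(bodyCount) {
        std::iota(parent.begin(), parent.end(), 0);  // each body starts alone
    }

    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving
            x = parent[x];
        }
        return x;
    }

    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Returns how many islands contain at least one awake body, i.e. how many
// islands would have to be handed to the constraint solver.
inline std::size_t countAwakeIslands(
        std::size_t bodyCount,
        const std::vector<std::pair<int, int>>& contacts,
        const std::vector<bool>& awake) {
    IslandSketch islands(bodyCount);
    for (const auto& contact : contacts) {
        islands.unite(contact.first, contact.second);
    }
    std::vector<bool> islandAwake(bodyCount, false);
    for (std::size_t i = 0; i < bodyCount; ++i) {
        if (awake[i]) {
            islandAwake[islands.find(static_cast<int>(i))] = true;
        }
    }
    std::size_t count = 0;
    for (std::size_t i = 0; i < bodyCount; ++i) {
        const bool isRoot = islands.find(static_cast<int>(i)) == static_cast<int>(i);
        if (isRoot && islandAwake[i]) {
            ++count;
        }
    }
    return count;
}
```

A settled stack of boxes forms one sleeping island and costs nothing in the solver under this scheme, but a single awake body touching the stack marks the whole island as needing a solve.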
bbangerter
Posts: 6
Joined: Thu Apr 03, 2008 4:45 pm

Re: Performance Options

Post by bbangerter »

Using the same scenario as above (~1800 objects averaging ~120 active objects) I ended up with ~1600 islands. At its peak, the number of processed islands was ~60 per internalSingleStepSimulation call. If left to settle, the number of processed islands dropped to ~5.
RobW
Posts: 33
Joined: Fri Feb 01, 2008 9:44 am

Re: Performance Options

Post by RobW »

If only 5 islands are being processed once everything settles down, I'm surprised you're still having a performance problem, and even more surprised that half the processing time would be in the constraint solver; that seems to be a very different experience to mine.

I take it the cache friendly constraint solver is being used? (It is set by a flag when you create the SequentialImpulse constraint solver.) Do you have any constraints active other than contact and friction constraints which are perhaps proving expensive?

One simple idea I had, but have not actually done, is to reduce the number of iterations done by the constraint solver when an island consists of a low number of bodies (1 or 2, say), with only friction and contact constraints, on the basis that this scenario should converge easily. This could be achieved with a couple of lines of code in 'cacheFriendlyIterations' (sorry, can't remember the exact name). As I mentioned, I haven't tried this and there may be a good reason not to do it, but I guess it can't do any harm to try it :)
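
As a sketch of that idea only: the function below picks an iteration count per island. The name `solverIterationsFor`, the threshold of 2 bodies and the reduced count of 4 are all assumptions made for illustration; none of them come from Bullet, and the values would need tuning against visible stability.

```cpp
#include <algorithm>

// Hypothetical helper (not Bullet API): choose how many solver iterations
// to spend on an island.
inline int solverIterationsFor(int bodiesInIsland,
                               bool onlyContactAndFriction,
                               int defaultIterations) {
    const int smallIslandBodies = 2;      // assumption: "1 or 2, say"
    const int smallIslandIterations = 4;  // assumption: arbitrary reduced count
    if (onlyContactAndFriction && bodiesInIsland <= smallIslandBodies) {
        // Never raise the count above what the caller configured.
        return std::min(defaultIterations, smallIslandIterations);
    }
    return defaultIterations;
}
```

With ~1600 islands for ~1800 objects, as reported earlier in the thread, most islands would fall into the small category, which is why the idea looks attractive here.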

Besides this, there are lots of quite simple platform specific optimisations you could do. For instance, Bullet does no data prefetching, so you could insert this when it is iterating over collision shapes preparing simulation islands (actually, there are lots of places where they are iterated).

There is a function called btSelect which can be configured to use a platform specific branchless floating point select (I think it is in btScalar.h). btSelect isn't really used anywhere, and I actually think the implementation of the function is incorrect. The point is, though, there are many, many floating point compares and branches you could eliminate with a couple of hours' work, including in the friction constraint solver, which is used twice for every contact point on an object, so potentially each object could have 8 friction constraints, thus with 10 solver iterations...
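
To make the select idea concrete, here is a minimal sketch of the semantics involved. `selectU32` and `fselSketch` are hypothetical names and are not the Bullet functions mentioned above. The integer version is genuinely branch-free; the float version is only a portable fallback written as a ternary, which a compiler may or may not turn into a branch, and which a platform with a hardware floating point select could replace with the matching intrinsic.

```cpp
#include <cstdint>

// Branch-free integer select: returns ifNonZero when condition != 0,
// otherwise ifZero.
inline std::uint32_t selectU32(std::uint32_t condition,
                               std::uint32_t ifNonZero,
                               std::uint32_t ifZero) {
    // mask is all ones when condition != 0 and all zeros otherwise.
    const std::uint32_t mask =
        static_cast<std::uint32_t>(0) - static_cast<std::uint32_t>(condition != 0);
    return (ifNonZero & mask) | (ifZero & ~mask);
}

// Floating point select: returns b when a >= 0, otherwise c.
// Portable fallback only; see the note above about hardware selects.
inline float fselSketch(float a, float b, float c) {
    return a >= 0.0f ? b : c;
}
```

A clamp such as "if (x < 0) x = 0;" can then be written as `x = fselSketch(x, x, 0.0f);`, which is the kind of compare-and-branch replacement being described.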

When the next version of Bullet comes out, I shall implement my rewrite of the simulation island construction, and other generic optimisations, and look at the performance benefit on the 3000 object stress test that Erwin mentions. I did the rewrite in 2 passes, but I believe the total speed increase is around 2x.

It would be nice to propagate the use of btSelect as I think most platforms could benefit from this, and there could even be something like 'btPrefetch' although platform specific details make it hard to organise prefetching in a generic way that is optimal.
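
A 'btPrefetch'-style wrapper might look something like the sketch below. `PREFETCH_SKETCH`, `ShapeSketch` and `sumHalfExtentsX` are hypothetical names; the macro maps to the GCC/Clang `__builtin_prefetch` builtin where available and to nothing elsewhere, and a console build would substitute its own cache-touch intrinsic. The lookahead distance of 4 is an assumption that would need tuning per platform.

```cpp
#include <cstddef>
#include <vector>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_SKETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH_SKETCH(addr) ((void)(addr))  // no-op fallback
#endif

// Stand-in for a collision shape reached through a pointer.
struct ShapeSketch {
    float halfExtents[3];
};

// Walks an array of shape pointers, asking the cache to start loading the
// shape a few elements ahead while the current one is being used.
inline float sumHalfExtentsX(const std::vector<const ShapeSketch*>& shapes) {
    const std::size_t lookahead = 4;  // assumption: tune per platform
    float sum = 0.0f;
    for (std::size_t i = 0; i < shapes.size(); ++i) {
        if (i + lookahead < shapes.size()) {
            PREFETCH_SKETCH(shapes[i + lookahead]);
        }
        sum += shapes[i]->halfExtents[0];
    }
    return sum;
}
```

The prefetch only pays off when the objects are scattered in memory and the loop body is long enough to hide the load latency, which is part of why a generic, optimal version is hard to write.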
bbangerter
Posts: 6
Joined: Thu Apr 03, 2008 4:45 pm

Re: Performance Options

Post by bbangerter »

Just the default friction and contact constraints are active.

I did find another issue, though, in our own code that was compounding the problem in the physics system. We had another piece of internal code that was also doing a fixed time step sort of behavior. While it did not take up as much overall time as the physics system (which is why I hadn't looked at it closely initially), the physics system and this other component were feeding off each other and causing poor performance (because I'm using a max of 2 internal step simulations rather than the default of 1). So one frame the physics system would run 2 internalStepSimulations (taking double the normal processing time, of course), which would lower the frame rate and cause this other system to take multiple steps internally on the next frame, making that frame take more time, which then fed back to the physics simulation doubling up again. I reworked the other piece of code and got some decent performance gains.
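
The feedback loop is easier to see with a fixed-step accumulator written out. The sketch below is an assumption about how such a stepper typically behaves (accumulate frame time, run whole fixed steps, clamp to a maximum); `StepperSketch` is a hypothetical type and is not taken from Bullet's source.

```cpp
#include <algorithm>

// Hypothetical fixed-step accumulator (illustrative only).
struct StepperSketch {
    double fixedStep;    // length of one internal step, in seconds
    int maxSubSteps;     // upper bound on internal steps per frame
    double accumulator;  // frame time not yet consumed by whole steps

    StepperSketch(double step, int maxSteps)
        : fixedStep(step), maxSubSteps(maxSteps), accumulator(0.0) {}

    // Returns how many internal steps would run for a frame of this length.
    int advance(double frameTime) {
        accumulator += frameTime;
        const int steps = static_cast<int>(accumulator / fixedStep);
        accumulator -= steps * fixedStep;
        // Steps beyond the clamp are dropped, so simulated time falls
        // behind real time on very slow frames.
        return std::min(steps, maxSubSteps);
    }
};
```

Once a frame takes longer than one fixed step, the next call runs two internal steps, which makes that frame slower still. Two systems stepping this way on the same frame time can keep each other in the slow state, as described above.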

I still get under 30 FPS with lots of objects - but project managers have decided to use overall smaller worlds for various reasons (one of which was performance concerns). So for the time being this is a non-issue for me.

Thanks for your help with this Rob. I may come back to doing some platform specific optimizations in the future on this so will ask then in a new thread if I get stuck.