Page 1 of 1

VFP optimizations for iPhone ARM

Posted: Wed Jul 01, 2009 4:34 pm
by SouthpawBob
Has Anyone done VFP optimizations for any of the (SIMD) Math Classes? I'm not sure where to start and haven't been able to find any information. Any help would be greatly appreciated.

Re: VFP optimizations for iPhone ARM

Posted: Thu Jul 02, 2009 5:36 am
by warmi
It is going to be hard to implement for VFP since the way VFP works doesn't work out too well for short burst of vectorized code ( you need to switch VFP in/out of vector mode which is slow)

If anything, I would rather concentrate on the latest Cortex/Neon ( iPhone 3gs).

Re: VFP optimizations for iPhone ARM

Posted: Thu Jul 02, 2009 9:57 pm
by Erwin Coumans
Moving the math library to VFP won't help performance indeed.

There is already a SSE optimized constraint solver available. Converting this SSE code into VFP should make Bullet much faster for iPhone.
See http://code.google.com/p/bullet/source/ ... ver.cpp#60

Thanks,
Erwin

Re: VFP optimizations for iPhone ARM

Posted: Fri Jul 10, 2009 1:37 pm
by PfhorSlayer
I think I am going to port the SSE-optimized code over to the VFP. Speeding up Bullet on the iPhone is something I would very much like to see happen!

However, I'm not too knowledgeable about how the physics simulation actually works under the hood; is there any way that I can confirm that my port is behaving properly, or within whatever "reasonable" limits there may be?

I will most certainly submit my changes once I am done.

Re: VFP optimizations for iPhone ARM

Posted: Fri Sep 11, 2009 6:52 pm
by Francescu
Hi PfhorSlayer,

Have you been able to make progress with the SSE port to VFP?

Cheers

--Frank
PfhorSlayer wrote:I think I am going to port the SSE-optimized code over to the VFP. Speeding up Bullet on the iPhone is something I would very much like to see happen!

However, I'm not too knowledgeable about how the physics simulation actually works under the hood; is there any way that I can confirm that my port is behaving properly, or within whatever "reasonable" limits there may be?

I will most certainly submit my changes once I am done.

Re: VFP optimizations for iPhone ARM

Posted: Wed Jun 02, 2010 5:41 am
by noangel
I have some progress with converting SSE->VFP.
I can share the source code, but I still not verified it.

btSequentialImpulseConstraintSolver::resolveSingleConstraintRowGenericSIMD:
{SSE}

Code: Select all

__m128 cpAppliedImp = _mm_set1_ps(c.m_appliedImpulse);
	__m128	lowerLimit1 = _mm_set1_ps(c.m_lowerLimit);
	__m128	upperLimit1 = _mm_set1_ps(c.m_upperLimit);
	__m128 deltaImpulse = _mm_sub_ps(_mm_set1_ps(c.m_rhs), _mm_mul_ps(_mm_set1_ps(c.m_appliedImpulse),_mm_set1_ps(c.m_cfm)));
	__m128 deltaVel1Dotn	=	_mm_add_ps(_vmathVfDot3(c.m_contactNormal.mVec128,body1.m_deltaLinearVelocity.mVec128), _vmathVfDot3(c.m_relpos1CrossNormal.mVec128,body1.m_deltaAngularVelocity.mVec128));
	__m128 deltaVel2Dotn	=	_mm_add_ps(_vmathVfDot3((-c.m_contactNormal).mVec128,body2.m_deltaLinearVelocity.mVec128) ,_vmathVfDot3(c.m_relpos2CrossNormal.mVec128,body2.m_deltaAngularVelocity.mVec128));
	deltaImpulse	=	_mm_sub_ps(deltaImpulse,_mm_mul_ps(deltaVel1Dotn,_mm_set1_ps(c.m_jacDiagABInv)));
	deltaImpulse	=	_mm_sub_ps(deltaImpulse,_mm_mul_ps(deltaVel2Dotn,_mm_set1_ps(c.m_jacDiagABInv)));
	btSimdScalar sum = _mm_add_ps(cpAppliedImp,deltaImpulse);
	btSimdScalar resultLowerLess,resultUpperLess;
	resultLowerLess = _mm_cmplt_ps(sum,lowerLimit1);
	resultUpperLess = _mm_cmplt_ps(sum,upperLimit1);
	__m128 lowMinApplied = _mm_sub_ps(lowerLimit1,cpAppliedImp);
	deltaImpulse = _mm_or_ps( _mm_and_ps(resultLowerLess, lowMinApplied), _mm_andnot_ps(resultLowerLess, deltaImpulse) );
	c.m_appliedImpulse = _mm_or_ps( _mm_and_ps(resultLowerLess, lowerLimit1), _mm_andnot_ps(resultLowerLess, sum) );
	__m128 upperMinApplied = _mm_sub_ps(upperLimit1,cpAppliedImp);
	deltaImpulse = _mm_or_ps( _mm_and_ps(resultUpperLess, deltaImpulse), _mm_andnot_ps(resultUpperLess, upperMinApplied) );
	c.m_appliedImpulse = _mm_or_ps( _mm_and_ps(resultUpperLess, c.m_appliedImpulse), _mm_andnot_ps(resultUpperLess, upperLimit1) );
	__m128	linearComponentA = _mm_mul_ps(c.m_contactNormal.mVec128,_mm_set1_ps(body1.m_invMass));
	__m128	linearComponentB = _mm_mul_ps((-c.m_contactNormal).mVec128,_mm_set1_ps(body2.m_invMass));
	__m128 impulseMagnitude = deltaImpulse;
	body1.m_deltaLinearVelocity.mVec128 = _mm_add_ps(body1.m_deltaLinearVelocity.mVec128,_mm_mul_ps(linearComponentA,impulseMagnitude));
	body1.m_deltaAngularVelocity.mVec128 = _mm_add_ps(body1.m_deltaAngularVelocity.mVec128 ,_mm_mul_ps(c.m_angularComponentA.mVec128,impulseMagnitude));
	body2.m_deltaLinearVelocity.mVec128 = _mm_add_ps(body2.m_deltaLinearVelocity.mVec128,_mm_mul_ps(linearComponentB,impulseMagnitude));
	body2.m_deltaAngularVelocity.mVec128 = _mm_add_ps(body2.m_deltaAngularVelocity.mVec128 ,_mm_mul_ps(c.m_angularComponentB.mVec128,impulseMagnitude));
{VFP}

Code: Select all

float32x4_t cpAppliedImp=vdupq_n_f32(c.m_appliedImpulse);
		float32x4_t lowerLimit1=vdupq_n_f32(c.m_lowerLimit);
		float32x4_t upperLimit1=vdupq_n_f32(c.m_upperLimit);
		float32x4_t deltaImpulse=vsubq_f32(vdupq_n_f32(c.m_rhs),vmulq_f32(vdupq_n_f32(c.m_appliedImpulse),vdupq_n_f32(c.m_cfm)));
		float32x4_t deltaVel1Dotn=vaddq_f32(vmathVfDot3(c.m_contactNormal.getf32x4(),body1.m_deltaLinearVelocity.getf32x4()),
											vmathVfDot3(c.m_relpos1CrossNormal.getf32x4(),body1.m_deltaAngularVelocity.getf32x4()));
		float32x4_t deltaVel2Dotn=vaddq_f32(vmathVfDot3((-c.m_contactNormal).getf32x4(),body2.m_deltaLinearVelocity.getf32x4()),
											vmathVfDot3(c.m_relpos2CrossNormal.getf32x4(),body2.m_deltaAngularVelocity.getf32x4()));
		deltaImpulse=vsubq_f32(deltaImpulse,vmulq_f32(deltaVel1Dotn,vdupq_n_f32(c.m_jacDiagABInv)));
		deltaImpulse=vsubq_f32(deltaImpulse,vmulq_f32(deltaVel2Dotn,vdupq_n_f32(c.m_jacDiagABInv)));
		float32x4_t sum=vaddq_f32(cpAppliedImp,deltaImpulse);
		uint32x4_t resultLowerLess=vcltq_f32(sum,lowerLimit1);
		uint32x4_t resultUpperLess=vcltq_f32(sum,upperLimit1);
		float32x4_t lowMinApplied=vsubq_f32(lowerLimit1,cpAppliedImp);
		deltaImpulse=vreinterpretq_f32_u32(vorrq_u32(vandq_u32(resultLowerLess,vreinterpretq_u32_f32(lowMinApplied)),
													 vbicq_u32(vreinterpretq_u32_f32(deltaImpulse),resultLowerLess)));
		c.m_appliedImpulse=vreinterpretq_f32_u32(vorrq_u32(vandq_u32(resultLowerLess,vreinterpretq_u32_f32(lowerLimit1)),
														   vbicq_u32(vreinterpretq_u32_f32(sum),resultLowerLess)));
		float32x4_t upperMinApplied=vsubq_f32(upperLimit1,cpAppliedImp);
		deltaImpulse=vreinterpretq_f32_u32(vorrq_u32(vandq_u32(resultUpperLess,vreinterpretq_u32_f32(deltaImpulse)),
													 vbicq_u32(vreinterpretq_u32_f32(upperMinApplied),resultUpperLess)));
		c.m_appliedImpulse=vreinterpretq_f32_u32(vorrq_u32(vandq_u32(resultUpperLess,vreinterpretq_u32_f32(c.m_appliedImpulse)),
														   vbicq_u32(vreinterpretq_u32_f32(upperLimit1),resultUpperLess)));
		float32x4_t linearComponentA=vmulq_f32(c.m_contactNormal.getf32x4(),vdupq_n_f32(body1.m_invMass));
		float32x4_t linearComponentB=vmulq_f32((-c.m_contactNormal).getf32x4(),vdupq_n_f32(body2.m_invMass));
		float32x4_t impulseMagnitude=deltaImpulse;
		body1.m_deltaLinearVelocity.setVector(vaddq_f32(body1.m_deltaLinearVelocity.getf32x4(),vmulq_f32(linearComponentA,impulseMagnitude)));
		body1.m_deltaAngularVelocity.setVector(vaddq_f32(body1.m_deltaAngularVelocity.getf32x4(),vmulq_f32(c.m_angularComponentA.getf32x4(),impulseMagnitude)));
		body2.m_deltaLinearVelocity.setVector(vaddq_f32(body2.m_deltaLinearVelocity.getf32x4(),vmulq_f32(linearComponentB,impulseMagnitude)));
		body2.m_deltaAngularVelocity.setVector(vaddq_f32(body2.m_deltaAngularVelocity.getf32x4(),vmulq_f32(c.m_angularComponentB.getf32x4(),impulseMagnitude)));
btSequentialImpulseConstraintSolver::resolveSingleConstraintRowLowerLimitSIMD:
{SSE}

Code: Select all

__m128 cpAppliedImp = _mm_set1_ps(c.m_appliedImpulse);
	__m128	lowerLimit1 = _mm_set1_ps(c.m_lowerLimit);
	__m128	upperLimit1 = _mm_set1_ps(c.m_upperLimit);
	__m128 deltaImpulse = _mm_sub_ps(_mm_set1_ps(c.m_rhs), _mm_mul_ps(_mm_set1_ps(c.m_appliedImpulse),_mm_set1_ps(c.m_cfm)));
	__m128 deltaVel1Dotn	=	_mm_add_ps(_vmathVfDot3(c.m_contactNormal.mVec128,body1.m_deltaLinearVelocity.mVec128), _vmathVfDot3(c.m_relpos1CrossNormal.mVec128,body1.m_deltaAngularVelocity.mVec128));
	__m128 deltaVel2Dotn	=	_mm_add_ps(_vmathVfDot3((-c.m_contactNormal).mVec128,body2.m_deltaLinearVelocity.mVec128) ,_vmathVfDot3(c.m_relpos2CrossNormal.mVec128,body2.m_deltaAngularVelocity.mVec128));
	deltaImpulse	=	_mm_sub_ps(deltaImpulse,_mm_mul_ps(deltaVel1Dotn,_mm_set1_ps(c.m_jacDiagABInv)));
	deltaImpulse	=	_mm_sub_ps(deltaImpulse,_mm_mul_ps(deltaVel2Dotn,_mm_set1_ps(c.m_jacDiagABInv)));
	btSimdScalar sum = _mm_add_ps(cpAppliedImp,deltaImpulse);
	btSimdScalar resultLowerLess;
	resultLowerLess = _mm_cmplt_ps(sum,lowerLimit1);
	__m128 lowMinApplied = _mm_sub_ps(lowerLimit1,cpAppliedImp);
	deltaImpulse = _mm_or_ps( _mm_and_ps(resultLowerLess, lowMinApplied), _mm_andnot_ps(resultLowerLess, deltaImpulse) );
	c.m_appliedImpulse = _mm_or_ps( _mm_and_ps(resultLowerLess, lowerLimit1), _mm_andnot_ps(resultLowerLess, sum) );
	__m128	linearComponentA = _mm_mul_ps(c.m_contactNormal.mVec128,_mm_set1_ps(body1.m_invMass));
	__m128	linearComponentB = _mm_mul_ps((-c.m_contactNormal).mVec128,_mm_set1_ps(body2.m_invMass));
	__m128 impulseMagnitude = deltaImpulse;
	body1.m_deltaLinearVelocity.mVec128 = _mm_add_ps(body1.m_deltaLinearVelocity.mVec128,_mm_mul_ps(linearComponentA,impulseMagnitude));
	body1.m_deltaAngularVelocity.mVec128 = _mm_add_ps(body1.m_deltaAngularVelocity.mVec128 ,_mm_mul_ps(c.m_angularComponentA.mVec128,impulseMagnitude));
	body2.m_deltaLinearVelocity.mVec128 = _mm_add_ps(body2.m_deltaLinearVelocity.mVec128,_mm_mul_ps(linearComponentB,impulseMagnitude));
	body2.m_deltaAngularVelocity.mVec128 = _mm_add_ps(body2.m_deltaAngularVelocity.mVec128 ,_mm_mul_ps(c.m_angularComponentB.mVec128,impulseMagnitude));
{VFP}

Code: Select all

float32x4_t cpAppliedImp=vdupq_n_f32(c.m_appliedImpulse);
		float32x4_t lowerLimit1=vdupq_n_f32(c.m_lowerLimit);
		float32x4_t upperLimit1=vdupq_n_f32(c.m_upperLimit);
		float32x4_t deltaImpulse=vsubq_f32(vdupq_n_f32(c.m_rhs),vmulq_f32(vdupq_n_f32(c.m_appliedImpulse),vdupq_n_f32(c.m_cfm)));
		float32x4_t deltaVel1Dotn=vaddq_f32(vmathVfDot3(c.m_contactNormal.getf32x4(),body1.m_deltaLinearVelocity.getf32x4()),
											vmathVfDot3(c.m_relpos1CrossNormal.getf32x4(),body1.m_deltaAngularVelocity.getf32x4()));
		float32x4_t deltaVel2Dotn=vaddq_f32(vmathVfDot3((-c.m_contactNormal).getf32x4(),body2.m_deltaLinearVelocity.getf32x4()),
											vmathVfDot3(c.m_relpos2CrossNormal.getf32x4(),body2.m_deltaAngularVelocity.getf32x4()));
		deltaImpulse=vsubq_f32(deltaImpulse,vmulq_f32(deltaVel1Dotn,vdupq_n_f32(c.m_jacDiagABInv)));
		deltaImpulse=vsubq_f32(deltaImpulse,vmulq_f32(deltaVel2Dotn,vdupq_n_f32(c.m_jacDiagABInv)));
		float32x4_t sum=vaddq_f32(cpAppliedImp,deltaImpulse);
		uint32x4_t resultLowerLess=vcltq_f32(sum,lowerLimit1);
		float32x4_t lowMinApplied=vsubq_f32(lowerLimit1,cpAppliedImp);
		deltaImpulse=vreinterpretq_f32_u32(vorrq_u32(vandq_u32(resultLowerLess,vreinterpretq_u32_f32(lowMinApplied)),
													 vbicq_u32(vreinterpretq_u32_f32(deltaImpulse),resultLowerLess)));
		c.m_appliedImpulse=vreinterpretq_f32_u32(vorrq_u32(vandq_u32(resultLowerLess,vreinterpretq_u32_f32(lowerLimit1)),
														   vbicq_u32(vreinterpretq_u32_f32(sum),resultLowerLess)));
		float32x4_t linearComponentA=vmulq_f32(c.m_contactNormal.getf32x4(),vdupq_n_f32(body1.m_invMass));
		float32x4_t linearComponentB=vmulq_f32((-c.m_contactNormal).getf32x4(),vdupq_n_f32(body2.m_invMass));
		float32x4_t impulseMagnitude=deltaImpulse;
		body1.m_deltaLinearVelocity.setVector(vaddq_f32(body1.m_deltaLinearVelocity.getf32x4(),vmulq_f32(linearComponentA,impulseMagnitude)));
		body1.m_deltaAngularVelocity.setVector(vaddq_f32(body1.m_deltaAngularVelocity.getf32x4(),vmulq_f32(c.m_angularComponentA.getf32x4(),impulseMagnitude)));
		body2.m_deltaLinearVelocity.setVector(vaddq_f32(body2.m_deltaLinearVelocity.getf32x4(),vmulq_f32(linearComponentB,impulseMagnitude)));
		body2.m_deltaAngularVelocity.setVector(vaddq_f32(body2.m_deltaAngularVelocity.getf32x4(),vmulq_f32(c.m_angularComponentB.getf32x4(),impulseMagnitude)));
btVector3.h:
{VFP}

Code: Select all

SIMD_FORCE_INLINE float32x4_t getf32x4(void) const
		{
			return vld1q_f32(m_floats);
		}

		SIMD_FORCE_INLINE void setVector(float32x4_t v)
		{
			vst1q_f32(m_floats,v);
		}
btSolverBody.h:
{SSE}

Code: Select all

struct	btSimdScalar
{
	SIMD_FORCE_INLINE	btSimdScalar()
	{

	}

	SIMD_FORCE_INLINE	btSimdScalar(float	fl)
	:m_vec128 (_mm_set1_ps(fl))
	{
	}

	SIMD_FORCE_INLINE	btSimdScalar(__m128 v128)
		:m_vec128(v128)
	{
	}
	union
	{
		__m128	m_vec128;
		float	m_floats[4];
		int		m_ints[4];
	};
	SIMD_FORCE_INLINE	__m128	get128()
	{
		return m_vec128;
	}

	SIMD_FORCE_INLINE	const __m128	get128() const
	{
		return m_vec128;
	}

	SIMD_FORCE_INLINE	void	set128(__m128 v128)
	{
		m_vec128 = v128;
	}

	SIMD_FORCE_INLINE	operator       __m128()       
	{ 
		return m_vec128; 
	}
	SIMD_FORCE_INLINE	operator const __m128() const 
	{ 
		return m_vec128; 
	}
	
	SIMD_FORCE_INLINE	operator float() const 
	{ 
		return m_floats[0]; 
	}

};

///@brief Return the elementwise product of two btSimdScalar
SIMD_FORCE_INLINE btSimdScalar 
operator*(const btSimdScalar& v1, const btSimdScalar& v2) 
{
	return btSimdScalar(_mm_mul_ps(v1.get128(),v2.get128()));
}

///@brief Return the elementwise product of two btSimdScalar
SIMD_FORCE_INLINE btSimdScalar 
operator+(const btSimdScalar& v1, const btSimdScalar& v2) 
{
	return btSimdScalar(_mm_add_ps(v1.get128(),v2.get128()));
}
{VFP}

Code: Select all

struct	btSimdScalar
{
	SIMD_FORCE_INLINE	btSimdScalar()
	{

	}

	SIMD_FORCE_INLINE	btSimdScalar(float	fl)
	:m_vec128 (vdupq_n_f32(fl))
	{
	}

	SIMD_FORCE_INLINE	btSimdScalar(float32x4_t v128)
		:m_vec128(v128)
	{
	}
	union
	{
		float32x4_t	m_vec128;
		float	m_floats[4];
		int		m_ints[4];
	};
	SIMD_FORCE_INLINE	float32x4_t	get128()
	{
		return m_vec128;
	}

	SIMD_FORCE_INLINE	const float32x4_t	get128() const
	{
		return m_vec128;
	}

	SIMD_FORCE_INLINE	void	set128(float32x4_t v128)
	{
		m_vec128 = v128;
	}

	SIMD_FORCE_INLINE	operator       float32x4_t()
	{ 
		return m_vec128; 
	}
	SIMD_FORCE_INLINE	operator const float32x4_t() const 
	{ 
		return m_vec128; 
	}
	
	SIMD_FORCE_INLINE	operator float() const 
	{ 
		return m_floats[0];
	}

};

///@brief Return the elementwise product of two btSimdScalar
SIMD_FORCE_INLINE btSimdScalar 
operator*(const btSimdScalar& v1, const btSimdScalar& v2) 
{
	return btSimdScalar(vmulq_f32(v1.get128(),v2.get128()));
}

///@brief Return the elementwise product of two btSimdScalar
SIMD_FORCE_INLINE btSimdScalar 
operator+(const btSimdScalar& v1, const btSimdScalar& v2) 
{
	return btSimdScalar(vaddq_f32(v1.get128(),v2.get128()));
}

Re: VFP optimizations for iPhone ARM

Posted: Wed Jun 02, 2010 5:44 am
by noangel
I want to make it fastest on ARM with VFP. What else can be optimized to get higher performance?

Re: VFP optimizations for iPhone ARM

Posted: Wed Jun 02, 2010 5:51 am
by noangel
{SSE}

Code: Select all

#define vec_splat(x, e) _mm_shuffle_ps(x, x, _MM_SHUFFLE(e,e,e,e))
static inline __m128 _vmathVfDot3( __m128 vec0, __m128 vec1 )
{
    __m128 result = _mm_mul_ps( vec0, vec1);
    return _mm_add_ps( vec_splat( result, 0 ), _mm_add_ps( vec_splat( result, 1 ), vec_splat( result, 2 ) ) );
}
{VFP}

Code: Select all

SIMD_FORCE_INLINE float32x4_t vmathVfDot3(float32x4_t v0,float32x4_t v1)
{
	float32x4_t result=vmulq_f32(v0,v1);
	return vaddq_f32(vdupq_n_f32(vgetq_lane_f32(result,0)),vaddq_f32(vdupq_n_f32(vgetq_lane_f32(result,1)),vdupq_n_f32(vgetq_lane_f32(result,2))));
}
Is it OK?

Re: VFP optimizations for iPhone ARM

Posted: Wed Jun 02, 2010 6:02 pm
by Erwin Coumans
Haven't had time yet to look into ARM/Neon optimizations, but I hope to get time for this at some stage.

We could create some unit tests, to test the SIMD optimizations on various platforms (an issue request is here http://code.google.com/p/bullet/issues/detail?id=384)

Thanks,
Erwin

Re: VFP optimizations for iPhone ARM

Posted: Thu Jun 03, 2010 4:50 pm
by noangel
Thank You for answer!
I tried code above. It seems to be working at least for my test program.
Now I need to make engine as fast as possible on ARM using VFP SIMD.
I think I should use code profiler to find what else to optimize.
But any hint will be helpful.

Re: VFP optimizations for iPhone ARM

Posted: Thu Jun 10, 2010 4:02 pm
by warmi
Use Shark .. it will identify bottleneck code in terms of ARM asm code listing annotated with stalls and other stuff on per instruction basis.

Something like http://www.warmi.net/tmp/fast.png

BTW ... your code is not really handling VFP but NEON ( the latest iPhones/iPad - the original iPhone/iPods are VFP only)

Re: VFP optimizations for iPhone ARM

Posted: Fri Jun 11, 2010 9:04 am
by kester
If you are porting the constraint solver, there are a couple of improvements you can make for most vector units:

1. The dot products are not usually implemented in hardware very well.

Instead of dot(n, v1) - dot(n, v2) + dot(pxn1, w1) + dot(pxn2, w2), you can write:

dot( n * v1 - n * v2 + pxn1 * w1 + pxn2 * w2, 1) which makes better use of the vector units.

2. The limits can be implemented with min and max instead of the compare and masks.