SIMD Vector3 Math

Ernegien · Post by **Ernegien** » Sat Jun 23, 2007 5:41 pm

For those that wish to see for themselves the performance gains associated with SIMD operations, here's a few functions to benchmark against the other standard ones. I'm still fairly new to SIMD operations in assembly, but if anyone needs help with some of the math optimisation, feel free to ask

Code: Select all

#pragma once

static const float SZero = 0;

 __declspec(align(16)) struct Vector3D
{
	#pragma region Constructor
	public: Vector3D()
			{
				X = 0;
				Y = 0;
				Z = 0;
				W = 0;
			}
	public: Vector3D(float x, float y, float z)
			{
				X = x;
				Y = y;
				Z = z;
				W = 0;
			}
	#pragma endregion

	#pragma region Destructor
	public: ~Vector3D(void)
			{

			}
	#pragma endregion

	#pragma region Fields
	public: float X;
	public: float Y;
	public: float Z;
	private: float W;
	#pragma endregion

	#pragma region Properties

	#pragma endregion

	#pragma region Operator Overloads

	#pragma endregion

	#pragma region Methods
	public: void NormalizePrecise()
			{
				_asm
				{
					//get length
					mov		eax, this
					movd	xmm0, dword ptr ds:[eax]
					movd	xmm1, dword ptr ds:[eax + 4]
					movd	xmm2, dword ptr ds:[eax + 8]
					mulss	xmm0, xmm0
					mulss	xmm1, xmm1
					mulss	xmm2, xmm2
					addss	xmm0, xmm1
					addss	xmm0, xmm2
					sqrtss	xmm0, xmm0
					ucomiss	xmm0, SZero
					je		Done

					//duplicate length across register
					pshufd	xmm0, xmm0, 0

					//divide by length
					movdqa	xmm1, xmmword ptr ds:[eax]
					divps	xmm1, xmm0

					//store result
					movdqa	xmmword ptr ds:[eax], xmm1
					Done:
				}
			}
	public: void Normalize()
			{
				_asm
				{
					//get recipricol length
					mov		eax, this
					movd	xmm0, dword ptr ds:[eax]
					movd	xmm1, dword ptr ds:[eax + 4]
					movd	xmm2, dword ptr ds:[eax + 8]
					mulss	xmm0, xmm0
					mulss	xmm1, xmm1
					mulss	xmm2, xmm2
					addss	xmm0, xmm1
					addss	xmm0, xmm2
					ucomiss	xmm0, SZero
					je		Done
					rsqrtss	xmm0, xmm0

					//duplicate length across register
					pshufd	xmm0, xmm0, 0

					//multiply by recipricol length (divide by length)
					movdqa	xmm1, xmmword ptr ds:[eax]
					mulps	xmm1, xmm0

					//store result
					movdqa	xmmword ptr ds:[eax], xmm1
					Done:
				}
			}
	public: void Absolute()
			{
				_asm
				{
					mov		eax, this
					and		dword ptr ds:[eax], 07FFFFFFFh
					and		dword ptr ds:[eax + 4], 07FFFFFFFh
					and		dword ptr ds:[eax + 8], 07FFFFFFFh
				}
			}
	public: void Maximize(const Vector3D &v1)
			{
				_asm
				{
					//get parameter information
					mov		eax, v1
					movdqa	xmm0, xmmword ptr ds:[eax]
					mov		eax, this

					//compute maximum
					maxps	xmm0, xmmword ptr ds:[eax]

					//store result
					movdqa	xmmword ptr ds:[eax], xmm0
				}
			}
	public: void Minimize(const Vector3D &v1)
			{
				_asm
				{
					//get parameter information
					mov		eax, v1
					movdqa	xmm0, xmmword ptr ds:[eax]
					mov		eax, this

					//compute minimum
					minps	xmm0, xmmword ptr ds:[eax]

					//store result
					movdqa	xmmword ptr ds:[eax], xmm0
				}
			}
	public: void Cross(const Vector3D &v1)
			{
				_asm
				{
					//get parameter information
					mov		eax, v1
					movdqa	xmm0, xmmword ptr ds:[eax]
					mov		eax, this
					movdqa	xmm1, xmmword ptr ds:[eax]

					//align vectors to be multiplied
					pshufd	xmm2, xmm0, 11001001b	//(v1.Y, v1.Z, v1.X)
					pshufd	xmm4, xmm1,	11001001b	//(Y, Z, X)
					pshufd	xmm3, xmm0, 11010010b	//(v1.Z, v1.X, v1.Y)
					pshufd	xmm5, xmm1,	11010010b	//(Z, X, Y)

					//perform cross-product
					mulps	xmm4, xmm3
					mulps	xmm5, xmm2
					subps	xmm4, xmm5

					//store result
					movdqa	xmmword ptr ds:[eax], xmm4
				}
			}
	public: void Lerp(const Vector3D &v1, float interpolator)
			{
				_asm
				{
					//get parameter information
					mov		eax, v1
					movdqa	xmm0, xmmword ptr ds:[eax]
					mov		eax, this
					movdqa	xmm1, xmmword ptr ds:[eax]

					//duplicate interpolator across register
					pshufd	xmm2, interpolator, 0

					//interpolate
					subps	xmm0, xmm1
					mulps	xmm0, xmm2
					addps	xmm0, xmm1

					//store result
					movdqa	xmmword ptr ds:[eax], xmm0
				}
			}

	#pragma endregion
};

Please note that this requires your vector to be alligned on a 16-byte boundary. Also, this was compiled as a win32 project using Visual Studios 2005. You may need to modify the assembly a bit depending on which compiler you use.

Dirk Gregorius · Post by **Dirk Gregorius** » Sun Jun 24, 2007 10:27 am

What do you want to show with this? Do you expect that replacing your math library with as a generic SIMD implementation will give huge performance improvements?

Ernegien · Post by **Ernegien** » Sun Jun 24, 2007 4:05 pm

Huge, no. The compiler does a fairly decent job...but I've experienced a 50-100% increase in speed in methods that could directly benefit from such optimisations. It isn't much, but every little tick counts ;P

Dirk Gregorius · Post by **Dirk Gregorius** » Sun Jun 24, 2007 5:22 pm

Did you experience these improvements in a real application or just by measuring a SIMD cross product vs. a usual implementation? My experience is that a SIMD implementation as you suggest here brings pretty much nothing on the PC. Even worth it can actually slow down things. On the other hand using SIMD for time critical code can bring huge improvements, e.g.

http://www.intel.com/cd/ids/developer/a ... .htm?prn=Y

Implementing a SIMD math library looks like a trivial straight forward thing, but actually it isn't. Also there is a huge difference between PC and PPC.

So again, what is the sense of this post? Do you want to teach people writing assembly code?

Ernegien · Post by **Ernegien** » Sun Jun 24, 2007 6:54 pm

Interesting article

Anyways, I guess the purpose of my post is to inform people about the benefits SIMD has to offer. Sure, you won't notice any solid performance gains if you're only executing these every once in a while...but when you are calling them on a continuous basis, there will be a noticeable difference (like any other optimised piece of code). Even if its only a few thousand more executions per second, it's still worth it in my opinion. Now I'm not saying to go crazy and convert everything to SIMD, because for the most part you're right, there's no need. But little things like avoiding a square root when normalizing a vector are too good to pass up

I'll update my post above with some more I've managed to convert, all of which offer considerable performance gains over their C++ counterpart.

Pierre · Post by **Pierre** » Mon Jun 25, 2007 7:32 am

http://www.farbrausch.de/~fg/articles/u ... ector.html

*yawn*

Dirk Gregorius · Post by **Dirk Gregorius** » Mon Jun 25, 2007 8:46 am

From the text:

On programming forums, every once in a while someone appears who wants to optimize his program by replacing his current 3D vector class by one that uses SSE opcodes, in the hope he'll make his program run 4 times as fast.

What is new is that now people who state that they are fairly new to SIMD operations start giving advices on programming forums. I wonder when we get the first post here that states using C++ is idiotic and that we should use Java or C# instead "since it is only 5% slower"....

Ernegien · Post by **Ernegien** » Mon Jun 25, 2007 6:23 pm

I never made such claims, other than providing a few examples of SIMD implementation in vector math operations, which do offer a small performance increase (the Normalize() and Cross() methods in particular) over other standard methods.

The article brings up some good points, but intrinsics aren't any better than what I'm currently doing, aside from eliminating a single procedure call, which will most likely get canceled out by poor compiler optimisation anyways. That and maybe a few other dependencies that can be easily avoided by switching up the opcode orders or register assignments in some of my functions, but for the most part, I believe I did a fairly good job doing it by hand...feel free to correct me if I'm wrong though ;P

And again, I'm not here to argue that SIMD solves everything...but these general concepts can be easily used to help speed up certain segments of code that have lots of calculations and take more time to execute.

Dirk Gregorius · Post by **Dirk Gregorius** » Tue Jun 26, 2007 8:32 am

May I ask on which experience you base all your statements here? Did you test this in a MLOC project or did you write some simple testbed like this:

Code: Select all

int main( void )
{
BEGIN_PROFILE( "NON_SSE_CROSS" );
v = cross( v1, v2 );
END_PROFILE();

BEGIN_PROFILE( "SSE_CROSS" ); 
v = cross_sse( v1, v2 );
END_PROFILE();

if ( time_non_sse < time_sse )
  printf( "I did a pretty good job!" );

return 0;
}