Vnorm = V / |V|
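For reference, here is that formula as plain scalar code (a minimal sketch; a bare `Vector3` with public `x`, `y`, `z` floats is assumed):

```cpp
#include <cmath>

struct Vector3 {
    float x, y, z;

    // Scalar normalization: divide every component by the length |V|.
    void normalize()
    {
        const float len = std::sqrt(x * x + y * y + z * z);
        x /= len;
        y /= len;
        z /= len;
    }
};
```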
Now, I figured I'd do this with SIMD, since I need all the speed I can get (though of course, optimization isn't a top priority this early on!).
Here's what I came up with:
void Vector3::normalize(void)
{
    // Must pad with a trailing 0, to fill the 128-bit register
    ALIGNED_16 platform::F32_t vector[] = {this->x, this->y, this->z, 0};
    __m128 simdvector;
    __m128 result;

    simdvector = _mm_load_ps(vector);

    // (X^2, Y^2, Z^2, 0^2)
    result = _mm_mul_ps(simdvector, simdvector);

    // Add all elements together, giving us X^2 + Y^2 + Z^2 + 0^2
    result = _mm_hadd_ps(result, result);
    result = _mm_hadd_ps(result, result);

    // Square root, giving us sqrt(X^2 + Y^2 + Z^2 + 0^2)
    result = _mm_sqrt_ps(result);

    // Reciprocal, giving us 1 / sqrt(X^2 + Y^2 + Z^2 + 0^2)
    result = _mm_rcp_ps(result);

    // Finally, multiply the reciprocal with our original vector.
    simdvector = _mm_mul_ps(simdvector, result);

    _mm_store_ps(vector, simdvector);
    this->x = vector[0];
    this->y = vector[1];
    this->z = vector[2];
}
What I'm worried about, though, is that I might be losing precision by using floats for the intermediate calculations. For instance, XNA uses doubles for the intermediate calculations but stores the result as a float. I tried normalizing the same vector in my engine and in XNA, and noticed slightly varying results. Nothing big, but by the 4th or 5th decimal, the normalized vectors would differ.
Does anyone have any input on this? Can I keep it as is? Or should I use two __m128d registers, each holding two 64-bit doubles, instead of one __m128 holding four 32-bit floats as I currently do? This would double the number of intrinsics in my code, but give me better precision.
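For what it's worth, the double-precision route could look roughly like this (a sketch under my own assumptions: the name `normalize_double` is hypothetical, and I pass the components by reference rather than using the `Vector3` method form, just to keep it standalone):

```cpp
#include <emmintrin.h>  // SSE2 double-precision intrinsics
#include <cmath>

// Widen the three floats to doubles, normalize in double precision,
// then narrow the result back to float -- the XNA-style approach.
void normalize_double(float& x, float& y, float& z)
{
    const double dx = x, dy = y, dz = z;
    __m128d xy = _mm_set_pd(dy, dx);   // low lane = dx, high lane = dy
    __m128d z0 = _mm_set_pd(0.0, dz);  // low lane = dz, high lane = 0

    // Squared lanes: (dx^2 + dz^2, dy^2 + 0)
    __m128d sq = _mm_add_pd(_mm_mul_pd(xy, xy), _mm_mul_pd(z0, z0));
    // Horizontal add: low lane becomes dx^2 + dy^2 + dz^2
    __m128d sum = _mm_add_sd(sq, _mm_unpackhi_pd(sq, sq));

    const double invlen = 1.0 / std::sqrt(_mm_cvtsd_f64(sum));
    xy = _mm_mul_pd(xy, _mm_set1_pd(invlen));
    z0 = _mm_mul_pd(z0, _mm_set1_pd(invlen));

    double out[2];
    _mm_storeu_pd(out, xy);
    x = static_cast<float>(out[0]);
    y = static_cast<float>(out[1]);
    z = static_cast<float>(_mm_cvtsd_f64(z0));
}
```

As the question above suggests, this roughly doubles the intrinsic count (two registers instead of one, plus the widen/narrow conversions), in exchange for double-precision intermediates.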
(Thanks Andrew for helping me with the subscript formatting!)