Saturday, 13 August 2011

SIMD Vector normalization.

I've begun work on one of the Camera classes in the engine, and while thinking about how to build my view matrices I realized that I don't yet have any methods for normalizing my 3D vectors. Normalizing a vector means finding its length and then dividing each element of the vector by that length:

V_norm = V / |V|
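For reference, here is the same formula as a plain scalar sketch. The `Vec3` struct and the `normalized` function are hypothetical stand-ins for illustration, not the engine's actual `Vector3` class, and the zero-length check is an assumption about how that case should be handled.

```cpp
#include <cmath>

// Hypothetical stand-in for the engine's Vector3; only the fields used here.
struct Vec3 {
    float x, y, z;
};

// Scalar reference: V_norm = V / |V|.
// Returns the input unchanged when the length is zero, to avoid dividing by zero.
inline Vec3 normalized(const Vec3& v) {
    const float length = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    if (length == 0.0f) {
        return v;
    }
    return Vec3{v.x / length, v.y / length, v.z / length};
}
```

A version like this is handy as a baseline to compare any SIMD implementation against.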

Now, I figured I'd do this with SIMD because I need all the speed I can get (though of course, optimization is not a top priority this early on!).
Here's what I came up with:

#include <pmmintrin.h> // SSE3, needed for _mm_hadd_ps

void Vector3::normalize(void)
{
    // Must pad with a trailing 0, to store in 128-bit register
    ALIGNED_16 platform::F32_t vector[] = {this->x, this->y, this->z, 0};
    __m128 simdvector;
    __m128 result;
    simdvector = _mm_load_ps(vector);
    
    // (X^2, Y^2, Z^2, 0^2)
    result = _mm_mul_ps(simdvector, simdvector);

    // Two horizontal adds sum all elements, leaving (X^2 + Y^2 + Z^2 + 0^2) in every lane
    result = _mm_hadd_ps(result, result);
    result = _mm_hadd_ps(result, result);
    
    // Calculate square root, giving us sqrt(X^2 + Y^2 + Z^2 + 0^2)
    result = _mm_sqrt_ps(result);

    // Calculate reciprocal, giving us roughly 1 / sqrt(X^2 + Y^2 + Z^2 + 0^2)
    // (_mm_rcp_ps is a fast approximation, not an exact divide)
    result = _mm_rcp_ps(result);

    // Finally, multiply the result with our original vector.
    simdvector = _mm_mul_ps(simdvector, result);

    _mm_store_ps(vector, simdvector);

    this->x = vector[0];
    this->y = vector[1];
    this->z = vector[2];
}

What I'm worried about, though, is that I might be losing precision when using floats to calculate the normalized vector. For instance, XNA uses doubles for the intermediate calculations, but stores the result as a float. I tried normalizing the same vector in my engine and in XNA, and noticed slightly varying results. Nothing big, but by the 4th or 5th decimal, the normalized vectors would differ.

Does anyone have any input on this? Can I keep it as is? Or should I use two __m128d registers storing two 64-bit doubles each (instead of one __m128 storing four 32-bit floats, as I currently do)? This would double the number of intrinsics in my code, but give me better precision.

(Thanks Andrew for helping me with the subscript formatting!)
