Resume Writing Projects
Writing

Modernizing SFMT-19937 for Zig

A development note on moving SFMT-19937 from a C-shaped port to Zig vector code, and why the useful benchmark only appeared in ReleaseFast.

View on GitHub

The first version of SFMT-19937 in zig-prng was correct, but it still thought like C. It carried a 128-bit word as a struct of four u32 values, walked loops with manual indices, and used scalar helper functions to imitate operations that the original algorithm expected the CPU to do as SIMD.

That is a reasonable starting point for a port. It is also a good way to hide the actual shape of the algorithm from Zig.

SFMT is not a scalar generator with a SIMD paint job. The name is literal: SIMD-oriented Fast Mersenne Twister. If a modern implementation makes the vector path slower, something specific has gone wrong.

ReleaseFast throughput

The takeoff only appears when the optimizer can see the SIMD shape.

5.0x

Struct baselineOriginal C-shaped state representation332 M/s
@Vector rewriteByte-lane shifts expressed with @shuffle1,618 M/s
Documented finalCorrectness held against SFMT 1.5.1 vectors1,675 M/s

Debug mode told the wrong story: the vector rewrite looked slower because the conversions and helper calls were still visible to the compiler.

Baseline82 M/sNaive vector48 M/sPointer rollback57 M/s@shuffle70 M/s
One billion generated u32 values per run. The checksum stayed fixed at 0x7c7d388d.

Establishing the invariant

Before changing the implementation, I added a benchmark that generated one billion u32 values and reported throughput. The first measurement was boring in the useful way: 82 M/s in Debug mode, with the reference vectors still passing.

The important number was not the throughput. It was the checksum, 0x7c7d388d. That value became the guardrail. Every refactor could move the code around, but it had to keep producing the same stream.

The easy cleanup did not move speed

The first pass replaced C-style while loops with Zig range for loops where the iteration count was fixed. That made the recurrence code read like Zig instead of a translation unit with different punctuation.

It did not change performance. That was expected. Loop syntax was not the bottleneck, and the benchmark stayed at 82 M/s.

I left the initByArray loops alone. They carry interdependent counters and modular arithmetic, so the explicit while form is clearer there. Idiomatic code is not code that removes every older-looking construct. It is code that uses the control flow that matches the job.

The vector rewrite got slower first

The next pass replaced the 128-bit word struct with @Vector(4, u32). On paper, that was the right move. The state now had a type that matched the algorithm.

The benchmark disagreed. Debug throughput dropped to 48 M/s.

My first guess was that passing 128-bit vectors by value had introduced extra copying. Changing the recursion helper back to pointer parameters improved the number to 57 M/s, but that was still slower than the baseline. The signature was not the real problem.

The real problem was a shift.

The byte-lane shift was the leak

The SSE2 reference uses two different kinds of right shift. One shifts each u32 element. The other shifts the whole 128-bit register by bytes. Those are different operations, even though they sit near each other in the recurrence.

The Zig code was simulating the byte-lane shift with scalar u64 composition. It packed halves of the register, shifted them, then unpacked the result. That preserved the bits, but it destroyed the compiler’s ability to see a single vector operation.

The fix was to express the operation as a byte shuffle. Reinterpret the vector as sixteen bytes, use @shuffle with a zero vector as the second input, and let negative mask lanes pull zeros into the vacated bytes. That matches the _mm_srli_si128 semantics without routing through scalar arithmetic.

Once that changed, the structure of the code matched the structure of the algorithm. The ReleaseFast benchmark finally showed it: 332 M/s for the old struct-based version, 1,618 M/s for the vector implementation, and 1,675 M/s after the final cleanup.

Debug mode was useful, then misleading

Debug mode helped catch regressions quickly, but it was the wrong place to judge the vector rewrite. The extra conversions and helper calls had not been inlined away, so the benchmark was measuring scaffolding that ReleaseFast would remove.

That does not make the Debug measurements useless. They still told me something was off during the false starts. They just could not answer the final performance question.

The lesson is narrow but durable: if a SIMD-designed algorithm gets slower after moving to vector types, inspect the places where the abstraction leaks. In this case, correctness hid the problem. The byte-lane shift was bit-exact, but the implementation no longer looked like SIMD to the compiler.

The finished version keeps compatibility with the SFMT 1.5.1 reference vectors for both init_gen_rand and init_by_array. It also reads like the implementation it wanted to be: vector state, vector recurrence, byte-lane shifts written as byte-lane shifts.