Versioned Read Consistency Test
This project implements a small C11 benchmark to test the correctness of a lock-free cache read pattern using a versioned read (seqlock-style) approach.
Overview
We simulate a cache bucket where:
- A single writer thread repeatedly updates a key-value tuple and a version field.
- A reader thread tries to read the tuple without locking, using the version field to detect concurrent writes.
There are different possible patterns depending on the memory model and compiler optimizations.
The most correct version following C11 semantics appears to be:
- An acquire read of the version field (s1)
- A load of the key/value data
- A memory barrier (atomic_thread_fence) of order seq_cst
- A second relaxed read of the version field (s2)
- If s1 != s2, the reader discards the result and retries
This pattern should ensure that the reader only observes consistent data if no write occurred in between.
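The read sequence above can be sketched roughly as follows. This is a minimal illustration, not the project's actual source: the names bucket_t, bucket_init, bucket_write and bucket_try_read are hypothetical, the tuple is assumed to be two 64-bit words stored as relaxed atomics (so the sketch itself has no formal data race), and the writer is assumed to follow the usual seqlock convention of an odd version while a write is in progress, which the description above does not spell out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bucket layout; field names are illustrative only. */
typedef struct {
    _Atomic uint64_t version; /* assumed: even = stable, odd = write in progress */
    _Atomic uint64_t key;
    _Atomic uint64_t value;
} bucket_t;

static void bucket_init(bucket_t *b)
{
    atomic_init(&b->version, 0);
    atomic_init(&b->key, 0);
    atomic_init(&b->value, 0);
}

/* Single writer: mark the bucket busy, update the tuple, then publish. */
static void bucket_write(bucket_t *b, uint64_t key, uint64_t value)
{
    uint64_t v0 = atomic_load_explicit(&b->version, memory_order_relaxed);
    assert((v0 & 1u) == 0); /* one writer only, so no write can be pending */
    atomic_store_explicit(&b->version, v0 + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&b->key, key, memory_order_relaxed);
    atomic_store_explicit(&b->value, value, memory_order_relaxed);
    atomic_store_explicit(&b->version, v0 + 2, memory_order_release);
}

/* One read attempt; returns false when the caller should retry. */
static bool bucket_try_read(bucket_t *b, uint64_t *key, uint64_t *value)
{
    /* Acquire read of the version field (s1). */
    uint64_t s1 = atomic_load_explicit(&b->version, memory_order_acquire);
    if (s1 & 1u)
        return false; /* assumed convention: a write is in progress */

    /* Load of the key/value data. */
    uint64_t k = atomic_load_explicit(&b->key, memory_order_relaxed);
    uint64_t v = atomic_load_explicit(&b->value, memory_order_relaxed);

    /* Fence so the data loads cannot sink below the second version read. */
    atomic_thread_fence(memory_order_seq_cst);

    /* Second, relaxed read of the version field (s2). */
    uint64_t s2 = atomic_load_explicit(&b->version, memory_order_relaxed);
    if (s1 != s2)
        return false; /* a write happened in between: discard and retry */

    *key = k;
    *value = v;
    return true;
}
```

A reader would typically call bucket_try_read in a loop until it returns true, and only then use the key and value it received.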
Purpose
This project serves as a litmus test for:
- Memory consistency under relaxed atomics
- Correct usage of atomic_thread_fence
- Portability across Linux, macOS (including ARM/M1), and Windows
It is useful for validating memory ordering assumptions on your system and compiler.
Findings
Apple M1 Pro
We try different variations on an Apple M1 Pro chip.
Here are versions that did not work (a data race was detected):
- relaxed first read, acquire second read, no fence
- relaxed first read, seq_cst second read, no fence
- acquire first read, relaxed second read, no fence
- acquire first read, acquire second read, no fence
- acquire first read, seq_cst second read, no fence
Here are versions where no data race was detected:
- relaxed first read, relaxed second read, no fence (??)
- relaxed first read, relaxed second read, acquire fence
- relaxed first read, relaxed second read, seq_cst fence
- acquire first read, relaxed second read, acquire fence
- acquire first read, relaxed second read, seq_cst fence
Performance-wise, the versions with a relaxed first read run in 0.21 seconds, while the versions with an acquire first read run in 0.58 seconds.
These results seem strange. One would expect that with no enforced ordering at all, the compiler and CPU would be allowed to move the data loads before the first version read or after the second version read, resulting in race conditions. The first read might tolerate relaxed ordering because it is immediately followed by a branch on the loaded value, so perhaps this disables speculative execution? But one should not count on that.
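One way to run such variations from a single reader routine is to pass the memory orders in as data. The sketch below is an assumption about how that could look, not the project's code: read_variant_t, VARIANTS, slot_t and slot_try_read are hypothetical names, and the table simply transcribes the ten combinations listed above for the M1 Pro.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one reader variant. */
typedef struct {
    memory_order first;  /* order of the first version read */
    memory_order second; /* order of the second version read */
    bool use_fence;      /* false means "no fence" */
    memory_order fence;  /* ignored when use_fence is false */
} read_variant_t;

static const read_variant_t VARIANTS[] = {
    /* Reported above as failing on the M1 Pro. */
    { memory_order_relaxed, memory_order_acquire, false, memory_order_relaxed },
    { memory_order_relaxed, memory_order_seq_cst, false, memory_order_relaxed },
    { memory_order_acquire, memory_order_relaxed, false, memory_order_relaxed },
    { memory_order_acquire, memory_order_acquire, false, memory_order_relaxed },
    { memory_order_acquire, memory_order_seq_cst, false, memory_order_relaxed },
    /* Reported above with no data race detected on the M1 Pro. */
    { memory_order_relaxed, memory_order_relaxed, false, memory_order_relaxed },
    { memory_order_relaxed, memory_order_relaxed, true,  memory_order_acquire },
    { memory_order_relaxed, memory_order_relaxed, true,  memory_order_seq_cst },
    { memory_order_acquire, memory_order_relaxed, true,  memory_order_acquire },
    { memory_order_acquire, memory_order_relaxed, true,  memory_order_seq_cst },
};
enum { VARIANT_COUNT = sizeof VARIANTS / sizeof VARIANTS[0] };
static_assert(VARIANT_COUNT == 10, "ten variants are listed in the findings");

/* Hypothetical slot; the tuple is assumed to be two 64-bit atomics. */
typedef struct {
    _Atomic uint64_t version;
    _Atomic uint64_t key;
    _Atomic uint64_t value;
} slot_t;

/* One read attempt under the given variant; false means "retry". */
static bool slot_try_read(slot_t *s, const read_variant_t *rv,
                          uint64_t *key, uint64_t *value)
{
    uint64_t s1 = atomic_load_explicit(&s->version, rv->first);
    *key = atomic_load_explicit(&s->key, memory_order_relaxed);
    *value = atomic_load_explicit(&s->value, memory_order_relaxed);
    if (rv->use_fence)
        atomic_thread_fence(rv->fence);
    uint64_t s2 = atomic_load_explicit(&s->version, rv->second);
    return s1 == s2;
}
```

Passing the order at run time keeps the table compact, but a compiler may then emit a stronger ordering than requested, so a real benchmark would more likely instantiate each variant with compile-time constants.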
Intel x86
We try different variations on an Intel Core i5-13600KF.
In this test, there is no data race detected in any of the cases.
All versions appear to have the same performance.
Building
You need a C11-compatible compiler and CMake 3.1+.
mkdir build
cd build
cmake ..
make
Running
./test_atomics
You should see output like:
Detected no data races!
Reader time: 1.234567 seconds (324783.22 iterations/sec)
Portability
- Works on Linux, macOS (Intel and ARM/M1), and Windows
License
This code is released under the MIT License.