GitHunt

Cuckoo Hashing Performance Benchmarks

This repository contains a comparative study of various Cuckoo Hashing implementations. The project uses a dataset of 200 model names fetched from Hugging Face to evaluate the load factor, timing, and displacement (collision) characteristics of different hashing variants.

Overview of Variants

Variant Number of Tables Number of Hash Functions Key Features Theoretical Load Factor
Original Cuckoo Hashing 2 2 Basic displacement of keys, constant-time lookups ~50%
d-ary Cuckoo Hashing 3 3 Uses multiple tables and hash functions, higher stability ~91%
Cuckoo Hashing with a Stash 2 2 Uses a small auxiliary stash for overflow keys ~99%
Blocked Cuckoo Hashing 2 2 Buckets hold multiple keys, improves cache locality ~95%

Benchmark Results

Testing was performed by inserting 200 items into a system with a total capacity of approximately 400 slots, targeting a 50% load factor.

GCC Performance Summary (-O3)

Variant Avg Time (s) Avg Load Factor Avg Displacements
Blocked Cuckoo (b=4) 0.00000690 0.5000 41.00
d-ary Cuckoo (d=3) 0.00000820 0.4975 143.00
Cuckoo Hashing with Stash 0.00001350 0.5000 346.00
Original Cuckoo Hashing 0.00001440 0.4975 308.00

Compiler Comparison: GCC vs. CCC

A comparison was conducted between the standard GCC and the Claude C Compiler (CCC).

Variant GCC Avg Time (s) CCC Avg Time (s) Performance Gap
Original 0.00001440 0.00004230 ~2.9x (GCC faster)
d-ary 0.00000820 0.00003150 ~3.8x (GCC faster)
Stash 0.00001350 0.00004120 ~3.0x (GCC faster)
Blocked 0.00000690 0.00001900 ~2.7x (GCC faster)

Requirements

To run the full benchmark and automation suite, you need the following:

  • GCC: The standard GNU C Compiler.
  • CCC: The Claude C Compiler (optional, script will auto-detect if missing).
  • Python 3.x: Used for data generation and report formatting.
  • Python requests library: Required for generate_csv.py to fetch model names.
pip install requests

Getting Started

The easiest way to regenerate the data, rebuild the binaries, and run the benchmarks is via the master automation script:

./run.sh

Options

  • --no-ccc: Force skip the CCC compiler benchmarks even if ccc is found in your path.

Repository Structure

  • run.sh: Master automation script that manages data generation, compilation, and reporting.
  • generate_csv.py: Python script used to fetch 200 model names from Hugging Face and generate test data.
  • models_test_data.csv: The generated dataset used for testing.
  • variants/:
    • original.c, dary.c, stash.c, blocked.c: C implementations of each variant.
    • report.py: Format the raw execution results into a side-by-side compiler comparison.
    • asm/: Assembly outputs generated by both GCC and CCC.
    • compare_compilers.sh: Legacy shell-based benchmark script.
  • cuckoo_testbed.c: Primary C testbed for initial variant benchmarking.
  • LICENSE: MIT License.

Assembly Analysis

Assembly outputs are included in variants/asm/ to allow for close inspection of how each compiler optimizes the tight hashing and eviction loops. GCC generally produces more compact code (7-9KB) compared to CCC (14-16KB), reflecting its more aggressive optimization strategies.

License

This project is licensed under the MIT License - see the LICENSE file for details.

jon2allen/cuckoo_hash | GitHunt