# Surfing DB
Surfing DB is a performance lab for moving complex Thrift payloads into Apache Arrow, shuffling them across MPI ranks, and benchmarking multi-language decoders.
## Highlights

- **Nested Thrift ⇢ Arrow at speed** – the JNI bridge (`surfingthriftjni`) decodes deeply nested payloads (e.g. `DeepEvent`) into Arrow `VectorSchemaRoot`s with SIMD-aware native code and Java fallbacks.
- **GPU acceleration with CUDA** – native CUDA kernels deliver 2-13x speedups for moving-average computations; optimized shared-memory usage achieves a 13.51x speedup on large windows (RTX 3060). See SIMD_OPTIMIZATIONS.md for details.
- **MPI shuffle & cogroup primitives** – one-sided RMA or two-sided send/recv data movers balance thousands of Arrow `RecordBatch`es across ranks with optional Thrift ingest.
- **End-to-end benchmarks** – repeatable runners contrast JNI vs. Java decode, throttle multi-rank shuffle/cogroup, and emit size + throughput summaries for easy regression tracking.
- **Automatic memory management** – configurable disk spilling prevents OOM errors when processing large datasets, with multi-directory spilling for 2-4x I/O throughput. See MEMORY_MANAGEMENT.md for details.
## Quick Start

### Standard Build

```bash
./scripts/build_install.sh   # C++ libraries, Arrow deps (APT or source)
./scripts/run_tests.sh       # C++ tests, MPI tests, Java tests
```

### GPU-Accelerated Build

```bash
./scripts/build_install.sh --with-cuda --with-rapids   # Install CUDA + RAPIDS
./build/cuda/moving_average_benchmark 1000000 100 1024 # 13.51x GPU speedup
./scripts/run_rapids_moving_avg.sh                     # Full benchmark suite
```

**GPU Requirements:**
- NVIDIA GPU with CUDA support (tested on RTX 3060)
- CUDA Toolkit 12.2+ (auto-installed with `--with-cuda`)
- Apache Spark 3.5+ (optional, for RAPIDS demos)
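For reference, the computation the CUDA kernel accelerates is a trailing moving average. The sketch below is an illustrative CPU version in plain Python, not the project's implementation (which lives in `src/cuda/moving_average_kernel.cu`):

```python
# Illustrative CPU reference for a trailing moving average
# (a sketch only; the project's kernel is the CUDA implementation).

def moving_average(values, window):
    """output[i] averages values[i-window+1 .. i], truncated at the start."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # slide the window forward
        out.append(running / min(i + 1, window))
    return out

print(moving_average([1.0, 2.0, 3.0, 4.0], 2))  # [1.0, 1.5, 2.5, 3.5]
```

The GPU version computes output elements in parallel and, per SIMD_OPTIMIZATIONS.md, reuses overlapping window data through shared memory, which is where the large-window speedup comes from.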
**Note:** All Java modules are now compatible with Arrow 12.0. The build system handles JNI library compilation and dependency management automatically.
**JDK:** Scripts default to JDK 11 (bundled under `.jdks/jdk-11.0.2` if present). JDK 8 remains supported for builds, but MPI Java runners should use JDK 11 for stability.
## Key Workloads

### 1. DeepEvent JNI Benchmark
Profiles nested Thrift payload decode via JNI vs pure Java.
```bash
scripts/run_deep_event_bench.sh
```

Outputs under `build/deep_event_*.txt` include:

- JNI & Java best latencies (array / mmap / pooled `ByteBuffer`)
- Payload byte + MB sizes and row counts

Tune runs with environment variables:

- `DEEP_EVENT_SMALL_ROWS`, `DEEP_EVENT_LARGE_ROWS`
- `DEEP_EVENT_SMALL_ITERS`, `DEEP_EVENT_LARGE_ITERS`
- `DEEP_EVENT_FORCE_REGEN=1` to rebuild datasets
### 2. MPI Shuffle & Cogroup
High-cardinality join/load tests that optionally source Thrift payloads.
```bash
mpiexec -np 4 java -Djava.library.path=$PWD/build \
  -cp drsquirrel-java-project/target/drsquirrel-java-1.0-SNAPSHOT-jar-with-dependencies.jar \
  org.surfing.drsquirrel.jni.JniCogroupLoadRunner --two
```

Scripted runners (preferred) with stable JVM + OpenMPI flags:
```bash
# One-sided (RMA) shuffle: 1M rows/rank, 12 ranks, 10 iters
JAVA_EXTRA_OPTS="-Xint -XX:-TieredCompilation \
  -XX:CompileCommand=exclude,java/util/Formatter$FormatSpecifier.localizedMagnitude \
  -XX:CompileCommand=exclude,java/util/Formatter$FormatSpecifier.appendJustified \
  -XX:CompileCommand=exclude,java/lang/ThreadLocal.get -Xrs" \
MPIRUN_EXTRA_OPTS="--mca pml ob1 --mca btl tcp,self --mca mtl ^ofi --mca coll ^hcoll" \
scripts/run_fake_cogroup_example.sh --np 12 --mode one --rows 1000000 --iters 10

# Two-sided shuffle: 1M rows/rank, 12 ranks, 10 iters
JAVA_EXTRA_OPTS="..." MPIRUN_EXTRA_OPTS="..." \
scripts/run_fake_cogroup_example.sh --np 12 --mode two --rows 1000000 --iters 10
```

Recent runs (JDK 11) completed without segfaults: one-sided best ~8.74 s for 24M rows (~2.75M rows/s); two-sided best ~8.61 s for 24M rows (~2.78M rows/s). A larger two-sided run at 2M rows/rank finished in ~17.67 s for 48M rows (~2.68M rows/s).
Use Thrift inputs:
```bash
mpiexec -np 4 java -Djava.library.path=$PWD/build \
  -cp drsquirrel-java-project/target/drsquirrel-java-1.0-SNAPSHOT-jar-with-dependencies.jar \
  org.surfing.drsquirrel.jni.JniCogroupLoadRunner --two \
  --thrift-path src/bench/deep_event.thrift \
  --thrift-struct DeepEvent \
  --payload-left /tmp/deep_left.bin \
  --payload-right /tmp/deep_right.bin \
  --key-field event_id
```

Rank 0 prints decode throughput, global row counts, and iteration stats; append `SHUFFLE_LOAD_OUT=/tmp/cogroup.csv` for CSV log aggregation.
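Once several runs have written CSV logs via `SHUFFLE_LOAD_OUT`, they can be merged for regression tracking. A minimal sketch, assuming each log shares a header row (the exact columns are whatever the runner emits):

```python
# Hedged sketch: merge per-run CSV logs into one file, keeping a
# single header row. Column names are assumed, not the runner's spec.
import csv
from pathlib import Path

def merge_csv_logs(inputs, output):
    header_written = False
    with open(output, "w", newline="") as out_f:
        writer = csv.writer(out_f)
        for path in inputs:
            with open(path, newline="") as in_f:
                rows = list(csv.reader(in_f))
            if not rows:
                continue
            if not header_written:
                writer.writerow(rows[0])  # keep the first header only
                header_written = True
            writer.writerows(rows[1:])

# Toy example with hypothetical log files:
Path("/tmp/run_a.csv").write_text("iter,rows\n1,1000\n")
Path("/tmp/run_b.csv").write_text("iter,rows\n1,2000\n")
merge_csv_logs(["/tmp/run_a.csv", "/tmp/run_b.csv"], "/tmp/merged.csv")
print(Path("/tmp/merged.csv").read_text())
```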
### 3. All-cores Helpers
Local stress tests pin every hardware thread.
```bash
./scripts/run_shuffle_all_cores.sh
./scripts/run_cogroup_all_cores.sh
./scripts/run_all_load_all_cores.sh
```

### 4. Flink DataStream (DeepEvent datagen)
Demonstrates how to feed Flink's datagen source into the generated DeepEvent Java stub and access nested fields downstream.
```bash
mvn -q -f drsquirrel-java-project/pom.xml -Darrow.version=12.0.0 -DskipTests package
FLINK_HOME=/path/to/flink \
"$FLINK_HOME/bin/flink" run \
  --class org.surfing.drsquirrel.flink.DeepEventDataStreamExample \
  drsquirrel-java-project/target/drsquirrel-java-1.0-SNAPSHOT-jar-with-dependencies.jar
```

Tune with JVM system properties, e.g. `-Ddeep.event.count=100000 -Ddeep.event.rate=20000`, to control record count and generator rate.
## Tools & Integrations
- **Thrift schema conversion** – `thrift2arrow` (C++) and `org.apache.thrift.ext.*` (Java) translate `.thrift` IDL into Arrow schemas / builders.
- **Parquet / CSV dumps** – `org.surfing.drsquirrel.tools.ThriftToParquet` converts Thrift payload batches with either Java or JNI decoder paths.
- **Parquet folder operations** – read/write collections of Parquet files across C++, Java, and Python. See PARQUET_FOLDER_USAGE.md for details.
- **Spark RAPIDS & CUDA GPU acceleration** – `scripts/run_rapids_groupby.sh` showcases GPU-accelerated group-by on decoded Arrow columns. Native CUDA kernels (`src/cuda/moving_average_kernel.cu`) deliver 2-13x speedups for moving-average computations with optimized shared-memory usage. Run `./build/cuda/moving_average_benchmark 1000000 100 1024` for a 13.51x speedup on large windows.
- **Python shim** – `examples/python/cogroup_benchmark.py` drives the native processors through the Arrow C Data interface using PyArrow (MPI optional).
## Building Blocks

### Native (C++)

```bash
cmake -S . -B build -GNinja
cmake --build build
```

Key targets:

- `surfingdb_core` – Arrow shuffle/cogroup kernels
- `thrift2arrow` – Thrift ⇢ Arrow CLI
- `parquet_folder_tool` – Parquet folder read/write utility
- `GenMabsPayloads*` – dataset generators for benchmarks
### Java / JNI

**Arrow 12.0 Compatible Build:**

```bash
mvn clean install   # Builds all modules: surfingthriftjni, drsquirrel-java
```

**Individual Modules:**

```bash
mvn -f surfingthriftjni/pom.xml package        # Thrift→Arrow JNI + Parquet I/O
mvn -f drsquirrel-java-project/pom.xml package # MPI/Flink workflows
```

**Artifacts:**

- `surfingthriftjni/target/surfingthriftjni-1.0-SNAPSHOT.jar` + native lib
- `drsquirrel-java-project/target/drsquirrel-java-1.0-SNAPSHOT-jar-with-dependencies.jar`
Key improvements:

- All modules now compatible with the Arrow 12.0 API
- Unified Parquet functionality in the `surfingthriftjni` module via JNI
- Java 8 compatible (no Java 11+ APIs)
- Automatic dependency resolution through the Maven reactor build
See CLAUDE.md for detailed build instructions, troubleshooting, and architecture documentation.
### Python

```bash
python -m venv .venv && source .venv/bin/activate
pip install pyarrow
cmake --build build --target surfingprocessorspy
PYTHONPATH=build python examples/python/cogroup_benchmark.py --rows 100000 --iters 3
```

Add `mpiexec -np <ranks>` to distribute across a cluster.
## Configuration Cheatsheet

### General Configuration

| Variable | Purpose |
|---|---|
| `SHUFFLE_TEST_SEED` | Seed randomized MPI loads |
| `COGROUP_THRIFT_FILE` / `COGROUP_THRIFT_STRUCT` | Enable Thrift ingestion in MPI runners |
| `COGROUP_THRIFT_PAYLOAD[_LEFT/_RIGHT]` | Provide pre-generated Thrift frames (`[u32 len][payload]` format) |
| `DEEP_EVENT_SMALL_ROWS`, `DEEP_EVENT_LARGE_ROWS` | Control dataset sizes for the JNI benchmark |
| `SURF_SIMD=ON` | Enable SIMD build for the JNI decoder |
| `SURF_THRIFT_DECODE_THREADS` | Override parallel decode threads in JNI |
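The `[u32 len][payload]` framing can be produced in a few lines of Python. Byte order is an assumption here (little-endian shown); verify against the project's payload generators before relying on it:

```python
# Sketch of the [u32 len][payload] frame format for pre-generated
# Thrift payload files. Little-endian length is an assumption; confirm
# against the project's generators before use.
import struct

def write_frames(path, payloads):
    with open(path, "wb") as f:
        for p in payloads:
            f.write(struct.pack("<I", len(p)))  # u32 length prefix
            f.write(p)                          # raw payload bytes

def read_frames(path):
    frames = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:  # EOF (or truncated trailing header)
                break
            (n,) = struct.unpack("<I", header)
            frames.append(f.read(n))
    return frames

write_frames("/tmp/frames.bin", [b"abc", b"hello"])
print(read_frames("/tmp/frames.bin"))  # [b'abc', b'hello']
```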
### Memory Management & Disk Spilling

Configure automatic disk spilling to handle datasets larger than available RAM:

| Variable | Purpose | Default |
|---|---|---|
| `SURFING_MAX_BATCH_MEMORY` | Maximum memory per batch (bytes) | 512MB |
| `SURFING_TEMP_DIR` | Single temp directory for spilled files | `/tmp/surfing_spill` |
| `SURFING_TEMP_DIRS` | Multiple temp directories (comma-separated) for parallel I/O | - |
| `SURFING_LOAD_BALANCING` | Load-balancing strategy: `ROUND_ROBIN`, `SPACE_AWARE`, `RANDOM` | `ROUND_ROBIN` |
| `SURFING_ENABLE_SPILLING` | Enable/disable disk spilling | `1` (enabled) |
Example – multi-directory spilling for 2-4x I/O throughput:

```bash
# Distribute spill across multiple NVMe drives
export SURFING_TEMP_DIRS="/mnt/nvme0/spill,/mnt/nvme1/spill,/mnt/nvme2/spill"
export SURFING_LOAD_BALANCING=ROUND_ROBIN
export SURFING_MAX_BATCH_MEMORY=1073741824  # 1GB threshold

# Java API alternative
java -cp ... -Djava.library.path=... YourApp
```

See MEMORY_MANAGEMENT.md for detailed configuration, performance tuning, and troubleshooting.
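The three `SURFING_LOAD_BALANCING` strategies can be modeled with a small sketch. This is only an illustration of what each name implies; the real selection logic lives in the native core:

```python
# Illustrative model of the spill-directory selection strategies named
# by SURFING_LOAD_BALANCING. Not the native implementation.
import itertools
import random
import shutil

def make_selector(strategy, dirs):
    if strategy == "ROUND_ROBIN":
        cycle = itertools.cycle(dirs)
        return lambda: next(cycle)          # rotate through directories
    if strategy == "RANDOM":
        return lambda: random.choice(dirs)  # uniform random pick
    if strategy == "SPACE_AWARE":
        # One plausible interpretation: pick the directory with the most
        # free bytes at spill time (dirs must exist for disk_usage).
        return lambda: max(dirs, key=lambda d: shutil.disk_usage(d).free)
    raise ValueError(f"unknown strategy: {strategy}")

pick = make_selector("ROUND_ROBIN", ["/mnt/a", "/mnt/b", "/mnt/c"])
print([pick() for _ in range(4)])  # ['/mnt/a', '/mnt/b', '/mnt/c', '/mnt/a']
```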
## Documentation
- CLAUDE.md - Comprehensive guide for AI assistants and developers (build commands, architecture, troubleshooting)
- MEMORY_MANAGEMENT.md - Disk spilling configuration and performance tuning
- MODULAR_BUILD.md - Java module structure and dependency management
- PARQUET_FOLDER_USAGE.md - Parquet folder operations across C++, Java, and Python
- SIMD_OPTIMIZATIONS.md - SIMD optimization details for JNI decoder
## Contributing

- Fork / branch from `main`.
- Run `./scripts/run_tests.sh` (or targeted modules) before submitting.
- Open a PR describing the workload and datasets touched.
Recent improvements:
- ✅ Arrow 12.0 API compatibility (all Java modules)
- ✅ Java 8 compatibility maintained
- ✅ Unified Parquet functionality via JNI
- ✅ Comprehensive test coverage (C++, MPI, Java)
Need help reproducing a benchmark or integrating an Arrow workflow? Check the scripts under `scripts/` and open an issue with the command you ran and the payload description.