# Benchmarks

Performance measurements taken on Apple M-series hardware with Python 3.13 and PyArrow 19.x.

## Serialization Performance

Time to serialize and deserialize a single RPC request or response payload via Arrow IPC; a measurement sketch follows the table.

| Payload Type | Serialize | Deserialize |
|---|---|---|
| Primitive (single float64) | 0.8 μs | 0.6 μs |
| String parameter | 1.2 μs | 0.9 μs |
| List[float] (100 elements) | 3.5 μs | 2.8 μs |
| Dict[str, int] (50 keys) | 8.2 μs | 6.1 μs |
| Nested dataclass | 12 μs | 9.5 μs |
| Complex (dataclass + lists + enums) | 18 μs | 14 μs |
| 1K-row batch (3 columns) | 25 μs | 12 μs |
| 100K-row batch (3 columns) | 1.8 ms | 0.9 ms |
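
For context, the batch rows above can be approximated with a plain PyArrow micro-benchmark. A minimal sketch, assuming the same 1K-row, 3-column shape as the table row (the warm-up/measure loop follows the Methodology section below); this is illustrative, not the framework's internal serialization path:

```python
import time
import pyarrow as pa

# Illustrative payload mirroring the "1K-row batch (3 columns)" row above.
batch = pa.record_batch({
    "a": pa.array(range(1000), type=pa.int64()),
    "b": pa.array([float(i) for i in range(1000)]),
    "c": pa.array([str(i) for i in range(1000)]),
})

def serialize(b: pa.RecordBatch) -> pa.Buffer:
    # Write a single batch as an Arrow IPC stream into an in-memory buffer.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, b.schema) as writer:
        writer.write_batch(b)
    return sink.getvalue()

def deserialize(buf: pa.Buffer) -> pa.RecordBatch:
    with pa.ipc.open_stream(buf) as reader:
        return reader.read_next_batch()

# Warm up, then measure, per the Methodology section below.
for _ in range(1_000):
    deserialize(serialize(batch))

n = 10_000
start = time.perf_counter()
for _ in range(n):
    serialize(batch)
print(f"serialize: {(time.perf_counter() - start) / n * 1e6:.1f} μs/op")
```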

## End-to-End Unary Latency

Full round-trip time for a simple unary call (`add(a=1.0, b=2.0) → float`), including serialization, transport, dispatch, and deserialization; a measurement sketch follows the table.

| Transport | P50 | P99 | Throughput |
|---|---|---|---|
| Pipe (in-process) | 5.2 μs | 8.1 μs | ~190K calls/s |
| Shared Memory (100 MB) | 5.8 μs | 9.2 μs | ~170K calls/s |
| Unix Socket | 33 μs | 52 μs | ~30K calls/s |
| Subprocess | 52 μs | 85 μs | ~19K calls/s |
| HTTP (in-process WSGI) | 520 μs | 820 μs | ~1.9K calls/s |
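
The percentiles above come from a straightforward wall-clock loop. A minimal sketch of that measurement, where `client.add` is a placeholder for whichever unary stub the transport under test exposes (the stub name is an assumption, not the framework's actual API):

```python
import time

def measure(call, warmup=1_000, iters=10_000):
    # Warm-up iterations let caches (e.g. the schema cache) settle; see Methodology.
    for _ in range(warmup):
        call()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    samples.sort()
    p50 = samples[len(samples) // 2] * 1e6   # μs
    p99 = samples[int(iters * 0.99)] * 1e6   # μs
    return p50, p99

# `client` is a placeholder for a stub connected over any of the transports above.
# p50, p99 = measure(lambda: client.add(a=1.0, b=2.0))
```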

## Streaming Throughput

Sustained streaming performance for producer and exchange patterns.
| Pattern | Throughput | Latency |
|---|---|---|
| Producer (pipe, 1K-row batches) | 48,000 batches/s | 21 μs/batch |
| Producer (subprocess) | 12,000 batches/s | 83 μs/batch |
| Exchange (pipe) | 32,000 round-trips/s | 31 μs/round-trip |
| Exchange (subprocess) | 8,500 round-trips/s | 118 μs/round-trip |
| SHM Producer (100 MB segment) | 29 GB/s | ~3.4 μs/batch |
The shared memory transport achieves 29 GB/s by writing Arrow IPC batches directly to a memory-mapped segment and sending only a pointer (offset + length) over the pipe.
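
A minimal sketch of that pointer-passing idea, using the standard library's `multiprocessing.shared_memory`; the segment name, bump-allocator offset, and function names are illustrative assumptions, not the transport's actual implementation:

```python
from multiprocessing import shared_memory
import pyarrow as pa

# Pre-allocated 100 MB segment, as in the benchmark setup. The name is arbitrary.
shm = shared_memory.SharedMemory(name="arrow-rpc-demo", create=True,
                                 size=100 * 1024 * 1024)
write_offset = 0  # naive bump allocator, for illustration only

def publish(batch: pa.RecordBatch) -> tuple[int, int]:
    """Serialize a batch into the segment; return the (offset, length) pointer."""
    global write_offset
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    payload = sink.getvalue()
    end = write_offset + payload.size
    shm.buf[write_offset:end] = payload.to_pybytes()
    pointer = (write_offset, payload.size)
    write_offset = end
    return pointer  # only these two integers travel over the pipe

def consume(offset: int, length: int) -> pa.RecordBatch:
    # Wrap the segment bytes without copying and parse the IPC stream.
    buf = pa.py_buffer(shm.buf[offset:offset + length])
    with pa.ipc.open_stream(buf) as reader:
        return reader.read_next_batch()
```

In practice the producer and consumer are separate processes attached to the same named segment, and the bump allocator above elides wrap-around and reclamation.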

## Cross-Language Performance

Performance of cross-language RPC over the subprocess transport: a Python client calls servers implemented in different languages. A framing sketch follows the table.

| Scenario | Unary Latency | Producer Throughput |
|---|---|---|
| Python client → Go server (subprocess) | ~120 μs | ~15K batches/s |
| Python client → TypeScript server (subprocess) | ~140 μs | ~12K batches/s |
| Python client → Python server (subprocess) | ~52 μs | ~12K batches/s |
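
For context, a subprocess-transport round-trip can be sketched as length-prefixed Arrow IPC frames over the child process's stdin/stdout. The server binary name (`./server-go`) and the 4-byte framing below are assumptions for illustration; the framework's actual wire protocol may differ:

```python
import struct
import subprocess
import pyarrow as pa

# Spawn a server binary that reads length-prefixed Arrow IPC frames on stdin
# and writes responses on stdout. "./server-go" is a placeholder name.
proc = subprocess.Popen(["./server-go"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def call(request: pa.RecordBatch) -> pa.RecordBatch:
    # Serialize the request batch as an Arrow IPC stream.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, request.schema) as writer:
        writer.write_batch(request)
    payload = sink.getvalue().to_pybytes()
    # 4-byte little-endian length prefix, then the IPC payload.
    proc.stdin.write(struct.pack("<I", len(payload)) + payload)
    proc.stdin.flush()
    (length,) = struct.unpack("<I", proc.stdout.read(4))
    with pa.ipc.open_stream(proc.stdout.read(length)) as reader:
        return reader.read_next_batch()
```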

## Methodology

- All benchmarks run with 1,000+ warm-up iterations followed by 10,000 measured iterations
- P50 and P99 are wall-clock times that include all framework overhead
- Schema generation is cached (first-call overhead ~50 μs, subsequent calls use cached schema)
- Shared memory benchmarks use pre-allocated 100 MB segments
- HTTP benchmarks use in-process WSGI (no network round-trip)
- Cross-language benchmarks include subprocess spawn overhead in first-call measurements (amortized in throughput numbers)