Benchmarks
Reproducible performance comparisons between four libraries:
| Library | Language | Engine |
|---|---|---|
| rpdfium (latest) | Ruby | PDFium (FFI) |
| pypdfium2 | Python | PDFium (FFI) — the “pure PDFium speed floor” |
| pdfplumber | Python | pdfminer.six (pure Python) |
| hexapdf | Ruby | pure Ruby |
Three metrics per combination:
- Execution time — minimum of 3 timed runs after a warm-up
- Memory — peak RSS of the isolated runner process, max of 3 runs
- Correctness — fraction of known ground-truth data actually recovered
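The timing rule above (one warm-up, then the minimum of the timed runs) can be sketched as follows. This is a minimal illustration and not the code in benchmark/run.rb; the helper name `min_time` and its keyword arguments are hypothetical.

```ruby
# Hypothetical helper illustrating "minimum of N timed runs after a warm-up".
# It is not taken from benchmark/run.rb.
def min_time(runs: 3, warmups: 1)
  # Warm-up calls: executed but not timed.
  warmups.times { yield }

  # Time each run with a monotonic clock and keep the fastest one.
  Array.new(runs) {
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  }.min
end

calls = 0
elapsed = min_time { calls += 1; 10_000.times.sum }
puts format("best of 3: %.6f s (%d calls including warm-up)", elapsed, calls)
```

Taking the minimum rather than the mean reduces the influence of one-off noise such as a cold file cache or a scheduler hiccup. Memory is not covered by this sketch, since peak RSS has to be read from a separate runner process.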
The benchmark scripts and sample files live in the benchmark/ directory of the repository.
Sample files — 5 tiers of complexity
All sample PDFs are synthetic, generated by benchmark/generate_pdfs.rb:
| File | Pages | Complexity |
|---|---|---|
| 01_simple.pdf | 1 | Short text + one small ruled table |
| 02_medium.pdf | 6 | Text + one ruled table per page |
| 03_complex.pdf | 16 | Mixed: text, ruled tables, borderless columns, prestamped form |
| 04_heavy.pdf | 60 | Dense text + a ruled table on every page |
| 05_academic.pdf | 520 | Simulated journal article: condensed two-column body, ruled + borderless tables, embedded figures, footnotes, academic annotations |
Correctness scoring
Because the PDFs are generated, the ground truth is known exactly
(pdfs/expected.json):
- text — unique sentinel tokens embedded in the running text; score = fraction recovered. Whitespace-insensitive, so libraries are not penalized for different word-spacing reconstruction.
- tables — unique cell values (SKU codes, row totals) of every ruled table; score = fraction recovered as table cells with default settings.
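The whitespace-insensitive "fraction recovered" score can be sketched as below. This is an assumption-laden illustration: the helper `recovery_score` and the `ZQX-…` tokens are hypothetical, and the real check and the layout of pdfs/expected.json may differ.

```ruby
# Hypothetical scoring helper: fraction of expected tokens found in the
# extracted text, ignoring all whitespace on both sides of the comparison.
def recovery_score(expected_tokens, extracted_text)
  tokens = expected_tokens.uniq
  return 0.0 if tokens.empty?

  # Strip whitespace so differing word-spacing reconstruction is not penalized.
  haystack = extracted_text.gsub(/\s+/, "")
  found = tokens.count { |tok| haystack.include?(tok.gsub(/\s+/, "")) }
  found.to_f / tokens.size
end

# Made-up sentinel tokens; the third one is missing from the extracted text,
# and the second one was split by a stray space during extraction.
expected  = %w[ZQX-0001 ZQX-0002 ZQX-0003 ZQX-0004]
extracted = "Intro ZQX-0001 body ZQX- 0002\nmore ZQX-0004 text"
score = recovery_score(expected, extracted)
puts score # prints 0.75
```

The same function would apply to the tables metric if the extracted cell values were joined into one string, though a cell-level comparison is probably closer to what "recovered as table cells" means.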
The check runs outside the timed section. pypdfium2 has no table layer and
runs only the text task. hexapdf has no built-in table extraction either,
but it exposes the needed primitives (glyph boxes + path segments), so the
tables row uses the benchmark/examples/hexapdf_table_extraction.rb
reference, a minimal lines-based extractor built on them.
Running the suite yourself
export PDFIUM_LIBRARY_PATH=/path/to/libpdfium.so # or .dylib / .dll, depending on your platform
pip install pdfplumber pypdfium2 # optional baselines
gem install hexapdf # optional baseline
ruby benchmark/run.rb # Markdown results table on stdout
ruby benchmark/run.rb --runs 5 # more timed runs per combination
ruby benchmark/run.rb --json # machine-readable output
Results
All numbers come from the five synthetic PDFs generated by
benchmark/generate_pdfs.rb, so the suite is fully reproducible from the
repository with no external data.