Benchmarks

Sample files — 5 tiers of complexity
Correctness scoring
Running the suite yourself
Results

Reproducible performance comparisons between four libraries:

Library	Language	Engine
rpdfium (latest)	Ruby	PDFium (FFI)
pypdfium2	Python	PDFium (FFI) — the “pure PDFium speed floor”
pdfplumber	Python	pdfminer.six (pure Python)
hexapdf	Ruby	pure Ruby

Three metrics per combination:

Execution time — minimum of 3 timed runs after a warm-up
Memory — peak RSS of the isolated runner process, max of 3 runs
Correctness — fraction of known ground-truth data actually recovered
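
The timing rule above (one untimed warm-up, then the minimum of N timed runs) can be sketched roughly as follows. This is only an illustration of the technique: `min_elapsed` is a hypothetical helper, not a function taken from benchmark/run.rb, and the real runner isolates each combination in its own process, which this sketch does not do.

```ruby
# Hypothetical helper (not from benchmark/run.rb): runs the block once as an
# untimed warm-up, then returns the minimum wall-clock time over `runs` runs.
def min_elapsed(runs: 3)
  yield # warm-up run, result and duration discarded

  Array.new(runs) do
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  end.min
end

# Usage with a stand-in workload instead of a real PDF extraction.
elapsed = min_elapsed(runs: 3) { 10_000.times.sum }
puts format("%.6f s", elapsed)
```

Taking the minimum rather than the mean reduces the influence of unrelated system noise, since interference can only make a run slower.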

The benchmark scripts and sample files live in the benchmark/ directory of the repository.

Sample files — 5 tiers of complexity

All sample PDFs are synthetic, generated by benchmark/generate_pdfs.rb:

File	Pages	Complexity
`01_simple.pdf`	1	Short text + one small ruled table
`02_medium.pdf`	6	Text + one ruled table per page
`03_complex.pdf`	16	Mixed: text, ruled tables, borderless columns, prestamped form
`04_heavy.pdf`	60	Dense text + a ruled table on every page
`05_academic.pdf`	520	Simulated journal article: condensed two-column body, ruled + borderless tables, embedded figures, footnotes, academic annotations

Correctness scoring

Because the PDFs are generated, the ground truth is known exactly (pdfs/expected.json):

text — unique sentinel tokens embedded in the running text; score = fraction recovered. Whitespace-insensitive, so libraries are not penalized for different word-spacing reconstruction.
tables — unique cell values (SKU codes, row totals) of every ruled table; score = fraction recovered as table cells with default settings.
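
A whitespace-insensitive recovery score of the kind described above might look like the sketch below. The function name `recovery_score`, the token format, and the substring-matching strategy are all assumptions made for illustration; the actual scorer and the layout of expected.json may differ.

```ruby
# Hypothetical scorer: removes all whitespace from the extracted text and from
# each expected token, then counts the tokens that appear as substrings.
def recovery_score(extracted_text, expected_tokens)
  return 1.0 if expected_tokens.empty?

  haystack = extracted_text.gsub(/\s+/, "")
  found = expected_tokens.count { |token| haystack.include?(token.gsub(/\s+/, "")) }
  found.to_f / expected_tokens.size
end

# Made-up sentinel tokens; the second is split by a space in the extracted
# text, the fourth is missing entirely.
tokens = %w[ZQX-0001 ZQX-0002 ZQX-0003 ZQX-0004]
text   = "Intro ZQX-0001 body ZQX- 0002\nmore ZQX-0003 text"

puts recovery_score(text, tokens) # prints 0.75
```

Because whitespace is stripped on both sides, a library that reconstructs word spacing differently still gets credit for a token it recovered.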

The correctness check runs outside the timed section. pypdfium2 has no table layer, so it runs only the text task. hexapdf has no built-in table extraction either, but it exposes the needed primitives (glyph boxes and path segments), so its tables row uses the reference script benchmark/examples/hexapdf_table_extraction.rb, a minimal lines-based extractor built on those primitives.

Running the suite yourself

export PDFIUM_LIBRARY_PATH=/path/to/libpdfium.so   # .so on Linux, .dylib on macOS, .dll on Windows
pip install pdfplumber pypdfium2    # optional baselines
gem install hexapdf                 # optional baseline

ruby benchmark/run.rb               # Markdown results table on stdout
ruby benchmark/run.rb --runs 5      # more timed runs per combination
ruby benchmark/run.rb --json        # machine-readable output

Results

Raw text: plain-text extraction across the five tiers
Tables: table-detection pipelines compared

All numbers come from the five synthetic PDFs generated by benchmark/generate_pdfs.rb, so the suite is fully reproducible from the repository with no external data.
