Benchmarks

  1. Sample files — 5 tiers of complexity
  2. Correctness scoring
  3. Running the suite yourself
  4. Results

Reproducible performance comparisons between four libraries:

Library           Language  Engine
rpdfium (latest)  Ruby      PDFium (FFI)
pypdfium2         Python    PDFium (FFI), the “pure PDFium speed floor”
pdfplumber        Python    pdfminer.six (pure Python)
hexapdf           Ruby      pure Ruby

Three metrics per combination:

  • Execution time — minimum of 3 timed runs after a warm-up
  • Memory — peak RSS of the isolated runner process, max of 3 runs
  • Correctness — fraction of known ground-truth data actually recovered
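The timing rule above (one untimed warm-up, then the minimum of N timed runs) can be sketched in a few lines of Ruby. This is a minimal illustration only: `min_time` is a hypothetical helper name, not a function taken from benchmark/run.rb, and the real runner may measure differently (for example, by timing an isolated child process).

```ruby
# Hypothetical helper, not part of the repository: returns the fastest of
# `runs` timed executions of the block, after one untimed warm-up call.
def min_time(runs: 3)
  yield # warm-up: loads code and fills caches, result is discarded

  timings = Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end

  timings.min # the minimum is the run least disturbed by system noise
end

# Usage: time an arbitrary workload (a stand-in for a PDF extraction call).
elapsed = min_time(runs: 3) { (1..100_000).sum }
puts format("%.6f s", elapsed)
```

Taking the minimum rather than the mean is a common choice for this kind of comparison, because scheduler and I/O noise can only make a run slower, never faster.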

The benchmark scripts and sample files live in the benchmark/ directory of the repository.

Sample files — 5 tiers of complexity

All sample PDFs are synthetic, generated by benchmark/generate_pdfs.rb:

File             Pages  Complexity
01_simple.pdf    1      Short text + one small ruled table
02_medium.pdf    6      Text + one ruled table per page
03_complex.pdf   16     Mixed: text, ruled tables, borderless columns, prestamped form
04_heavy.pdf     60     Dense text + a ruled table on every page
05_academic.pdf  520    Simulated journal article: condensed two-column body, ruled + borderless tables, embedded figures, footnotes, academic annotations

Correctness scoring

Because the PDFs are generated, the ground truth is known exactly (pdfs/expected.json):

  • text — unique sentinel tokens embedded in the running text; score = fraction recovered. Whitespace-insensitive, so libraries are not penalized for different word-spacing reconstruction.
  • tables — unique cell values (SKU codes, row totals) of every ruled table; score = fraction recovered as table cells with default settings.
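The two scores above can be sketched as follows. This is an illustration under stated assumptions, not the suite's actual checker: `text_score` and `table_score` are hypothetical names, the layout of pdfs/expected.json is not shown in this document, and "whitespace-insensitive" is modeled here by stripping all whitespace from both sides before matching.

```ruby
require "set"

# Hypothetical helper: fraction of sentinel tokens found in the extracted
# text, ignoring all whitespace (an assumed reading of "whitespace-insensitive").
def text_score(extracted_text, sentinels)
  return 0.0 if sentinels.empty?

  squashed = extracted_text.gsub(/\s+/, "")
  found = sentinels.count { |token| squashed.include?(token.gsub(/\s+/, "")) }
  found.fdiv(sentinels.size)
end

# Hypothetical helper: fraction of expected cell values that appear as a
# whole cell in any extracted table. Tables are assumed to be arrays of
# rows, and rows arrays of cell strings (nil for empty cells).
def table_score(extracted_tables, expected_cells)
  return 0.0 if expected_cells.empty?

  cells = extracted_tables.flatten.compact.map { |cell| cell.to_s.strip }.to_set
  expected_cells.count { |value| cells.include?(value) }.fdiv(expected_cells.size)
end

# Usage with made-up sentinel and cell values.
puts text_score("intro ZQX 17 body\nZQX18 end", %w[ZQX17 ZQX18 ZQX19])
puts table_score([[["SKU-001", "12.50"], ["SKU-002", nil]]],
                 %w[SKU-001 SKU-002 SKU-003 12.50])
```

Matching table values against whole cells, rather than against the page text, means a library only scores when it actually recovered the value as part of a table.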

The correctness check runs outside the timed section. pypdfium2 has no table layer, so it runs only the text task. hexapdf has no built-in table extraction either, but it exposes the needed primitives (glyph boxes and path segments), so the tables row for hexapdf uses the reference script benchmark/examples/hexapdf_table_extraction.rb, a minimal lines-based extractor built on those primitives.

Running the suite yourself

export PDFIUM_LIBRARY_PATH=/path/to/libpdfium.so  # or .dylib / .dll, depending on your platform
pip install pdfplumber pypdfium2    # optional baselines
gem install hexapdf                 # optional baseline

ruby benchmark/run.rb               # Markdown results table on stdout
ruby benchmark/run.rb --runs 5      # more timed runs per combination
ruby benchmark/run.rb --json        # machine-readable output

Results

  • Raw text — plain-text extraction across the five tiers
  • Tables — table-detection pipelines compared

All numbers come from the five synthetic PDFs generated by benchmark/generate_pdfs.rb, so the suite is fully reproducible from the repository with no external data.

