Table benchmark

Synthetic suite

The suite runs table detection on every page with default settings and scores the fraction of known table cells (SKU codes and row totals from the ruled tables) that each library recovers. pypdfium2 has no table layer and is excluded.
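The correctness score described above can be sketched as a simple recall over known cell values. This is an illustrative sketch only, not the code in benchmark/run.rb: the method name `cell_recall`, the whitespace normalisation, and the sample data are all assumptions.

```ruby
require "set"

# Fraction of known cell values that appear among the extracted cells.
# `expected` is the ground-truth list (e.g. SKU codes and row totals);
# `tables` is an array of tables, each an array of rows of cell strings.
# Hypothetical helper: the real scorer may normalise cells differently.
def cell_recall(expected, tables)
  return 1.0 if expected.empty?

  found = tables.flatten.compact.map { |cell| cell.to_s.strip }.to_set
  hits = expected.count { |value| found.include?(value.to_s.strip) }
  hits.to_f / expected.size
end

# Made-up sample data, for illustration only.
expected = ["SKU-001", "SKU-002", "19.98", "5.00"]
tables = [[["SKU-001", "Widget", "19.98"], ["SKU-002", nil, " 5.00 "]]]
puts format("%.0f%%", cell_recall(expected, tables) * 100)
```

Scoring by exact cell value keeps the metric independent of how each library groups cells into tables, at the cost of ignoring cell position.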

hexapdf has no built-in table extraction, but it exposes the two primitives the pipeline needs — per-glyph boxes (decode_text_with_positioning) and stroked path segments (Content::Processor). The benchmark/examples/hexapdf_table_extraction.rb reference (~120 lines) builds a minimal lines-based extractor on them, and the hexapdf row below uses it. It is a proof of concept — :lines only, a single snap epsilon, no :text fallback or join tolerances — not a peer of rpdfium’s full pdfplumber port.
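To make the "lines only, single snap epsilon" idea concrete, here is a minimal pure-Ruby sketch of the grid-building step. It is not the reference extractor and makes no hexapdf calls: it assumes the stroked segments and glyph centre points have already been collected, that every rule spans the whole grid, and that `EPSILON`, `snap`, and `extract_grid` are hypothetical names.

```ruby
EPSILON = 1.0 # hypothetical snap tolerance, in PDF points

# Collapse coordinates that lie within `eps` of their predecessor into one
# value (the cluster mean), so slightly offset rules share a grid line.
def snap(values, eps = EPSILON)
  clusters = []
  values.sort.each do |v|
    if clusters.any? && (v - clusters.last.last) <= eps
      clusters.last << v
    else
      clusters << [v]
    end
  end
  clusters.map { |c| c.sum / c.size.to_f }
end

# segments: [[x0, y0, x1, y1], ...] stroked line segments
# glyphs:   [[char, x, y], ...]     glyph centre points
# Returns rows of cell strings in ascending y order (reverse the rows for
# PDF's bottom-left origin). Assumes a full grid with no merged cells.
def extract_grid(segments, glyphs, eps = EPSILON)
  vertical   = segments.select { |x0, _, x1, _| (x0 - x1).abs <= eps }
  horizontal = segments.select { |_, y0, _, y1| (y0 - y1).abs <= eps }
  xs = snap(vertical.map { |s| s[0] }, eps)
  ys = snap(horizontal.map { |s| s[1] }, eps)

  ys.each_cons(2).map do |low, high|
    xs.each_cons(2).map do |left, right|
      glyphs.select { |_, x, y| x.between?(left, right) && y.between?(low, high) }
            .sort_by { |_, x, _| x }
            .map(&:first).join
    end
  end
end

# Made-up 2x2 grid whose middle vertical rule is drawn in two slightly
# offset pieces (x = 50.4 and x = 49.8), which the snap step merges.
segments = [
  [0, 0, 100, 0], [0, 20, 100, 20], [0, 40, 100, 40],
  [0, 0, 0, 40], [50.4, 0, 50.4, 20], [49.8, 20, 49.8, 40], [100, 0, 100, 40]
]
glyphs = [["S", 10, 10], ["K", 16, 10], ["U", 22, 10], ["7", 70, 10],
          ["x", 30, 30], ["9", 60, 30]]
p extract_grid(segments, glyphs)
```

A single epsilon is the simplest possible choice. Separate snap and join tolerances, as in a fuller pdfplumber-style pipeline, would be needed for dashed rules or partial borders.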

Apple M-series (arm64, macOS), best of 3 runs after a warm-up. Reproduce with ruby benchmark/run.rb.
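The "best of 3 runs after a warm-up" protocol can be sketched as follows. This is an assumed shape for illustration, not the contents of benchmark/run.rb. The helper name `best_of` is hypothetical, and the sketch measures wall-clock time only, not peak RSS.

```ruby
# Run the block once as a warm-up, then `runs` more times, and return the
# fastest wall-clock time in milliseconds. Hypothetical helper.
def best_of(runs = 3)
  yield # warm-up run: result and timing are discarded
  Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
  end.min
end

# Stand-in workload; a real run would call a table extractor here.
ms = best_of(3) { (1..50_000).sum { |i| i * i } }
puts format("best of 3: %.2f ms", ms)
```

Taking the minimum rather than the mean reduces the influence of scheduler noise and one-off cache misses on short runs.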

PDF	Library	Time	Peak RSS	Correctness
`01_simple.pdf` (1 pg, 1 table)	rpdfium	15 ms	34 MB	100%
	pdfplumber	17 ms	42 MB	100%
	hexapdf	24 ms	25 MB	100%
`02_medium.pdf` (6 pg, 6 tables)	rpdfium	38 ms	35 MB	100%
	pdfplumber	111 ms	57 MB	100%
	hexapdf	54 ms	25 MB	100%
`03_complex.pdf` (16 pg, mixed)	rpdfium	124 ms	38 MB	100%
	pdfplumber	187 ms	71 MB	100%
	hexapdf	88 ms	26 MB	100%
`04_heavy.pdf` (60 pg, 60 tables)	rpdfium	496 ms	39 MB	100%
	pdfplumber	3.05 s	442 MB	100%
	hexapdf	779 ms	29 MB	100%
`05_academic.pdf` (520 pg, ~104 ruled tables)	rpdfium	15.46 s	104 MB	100%
	pdfplumber	68.04 s	5179 MB	100%
	hexapdf	13.22 s	37 MB	100%

Observations:

All three recover 100% of the ruled-table cells on every tier — these are clean generated grids, the easy case. Correctness diverges on real-world tables (dashed rules, partial borders, misaligned cells), which is exactly where rpdfium’s snap/join tolerances and :text fallback earn their cost and the 120-line reference would start dropping cells.
rpdfium is the fastest on the simple, medium, and heavy tiers (496 ms vs hexapdf’s 779 ms and pdfplumber’s 3.05 s on 04_heavy). Two layers earn this. First, the table/word pipeline pulls chars through a geometry-only fast path that skips the FFI reads and per-char allocation the cell filter never uses. Second, the batch helpers (extract_tables, extract_text) now stream pages: each page is closed the moment its data is read, freeing its native handles and char caches instead of retaining every visited page for the document’s lifetime. Peak RSS on the heavy tier fell from 119 MB to 39 MB and no longer grows with the page count.
On the 520-page academic tier the minimal hexapdf reference edges rpdfium out on time (13.22 s vs 15.46 s). At that scale the full pipeline’s per-page cost (borderless :text attempts, rectangle / multi-table search, annotation parsing on figure/footnote-heavy pages) dominates, while the :lines-only reference skips all of it. rpdfium still recovers the same cells and stays ~4.4× faster than pdfplumber while using ~50× less memory (104 MB vs 5.2 GB). hexapdf also leads on 03_complex (88 ms vs 124 ms). This comparison is fair only on clean ruled grids.
rpdfium stays linear and robust: ~6.1× faster than pdfplumber on the heavy tier (3.05 s vs 496 ms), ~4.4× on the academic tier, with ~11–50× less memory.
03_complex.pdf also contains borderless tables and a prestamped form — neither counts toward the ground truth (recovering them needs the :text strategy or font filtering, not default settings).
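
The page-streaming behaviour noted in the observations (close each page as soon as its data is read) can be illustrated with a small sketch. Everything here is a hypothetical stand-in: `FakePage`, `each_page_streaming`, and `extract_all_tables` are not rpdfium’s API, and the class-level counters exist only to show that at most one page is open at a time.

```ruby
# Hypothetical stand-in for a native page handle; not rpdfium's API.
class FakePage
  @open_count = 0
  @peak_open = 0
  class << self
    attr_accessor :open_count, :peak_open
  end

  attr_reader :index

  def initialize(index)
    @index = index
    self.class.open_count += 1
    self.class.peak_open = [self.class.peak_open, self.class.open_count].max
  end

  # Pretend each page holds one single-cell table.
  def tables
    [[["page #{index}"]]]
  end

  def close
    self.class.open_count -= 1
  end
end

# Open one page at a time, hand it to the block, and close it even if the
# block raises, so only one page's resources are alive at any moment.
def each_page_streaming(page_count)
  page_count.times do |i|
    page = FakePage.new(i)
    begin
      yield page
    ensure
      page.close
    end
  end
end

# Collect the extracted data only; the pages themselves are not retained.
def extract_all_tables(page_count)
  results = []
  each_page_streaming(page_count) { |page| results.concat(page.tables) }
  results
end

tables = extract_all_tables(60)
puts "tables: #{tables.size}, peak open pages: #{FakePage.peak_open}"
```

Because only the extracted data outlives each iteration, memory held by page handles stays bounded by one page rather than growing with the page count.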