Table benchmark
Runs table detection on every page with default settings and scores the fraction of known table cells (SKU codes and row totals from the ruled tables) recovered. pypdfium2 has no table layer and is excluded.
hexapdf has no built-in table extraction, but it exposes the two primitives
the pipeline needs — per-glyph boxes (decode_text_with_positioning) and
stroked path segments (Content::Processor). The
benchmark/examples/hexapdf_table_extraction.rb
reference (~120 lines) builds a minimal lines-based extractor on them, and the
hexapdf row below uses it. It is a proof of concept — :lines only, a single
snap epsilon, no :text fallback or join tolerances — not a peer of
rpdfium’s full pdfplumber port.
Apple M-series (arm64, macOS), best of 3 runs after a warm-up. Reproduce
with ruby benchmark/run.rb.
Synthetic suite
| Library | Time | Peak RSS | Correctness | |
|---|---|---|---|---|
01_simple.pdf (1 pg, 1 table) |
rpdfium | 15 ms | 34 MB | 100% |
| pdfplumber | 17 ms | 42 MB | 100% | |
| hexapdf | 24 ms | 25 MB | 100% | |
02_medium.pdf (6 pg, 6 tables) |
rpdfium | 38 ms | 35 MB | 100% |
| pdfplumber | 111 ms | 57 MB | 100% | |
| hexapdf | 54 ms | 25 MB | 100% | |
03_complex.pdf (16 pg, mixed) |
rpdfium | 124 ms | 38 MB | 100% |
| pdfplumber | 187 ms | 71 MB | 100% | |
| hexapdf | 88 ms | 26 MB | 100% | |
04_heavy.pdf (60 pg, 60 tables) |
rpdfium | 496 ms | 39 MB | 100% |
| pdfplumber | 3.05 s | 442 MB | 100% | |
| hexapdf | 779 ms | 29 MB | 100% | |
05_academic.pdf (520 pg, ~104 ruled tables) |
rpdfium | 15.46 s | 104 MB | 100% |
| pdfplumber | 68.04 s | 5179 MB | 100% | |
| hexapdf | 13.22 s | 37 MB | 100% |
Observations:
- All three recover 100% of the ruled-table cells on every tier — these
are clean generated grids, the easy case. Correctness diverges on
real-world tables (dashed rules, partial borders, misaligned cells), which
is exactly where rpdfium’s snap/join tolerances and
:textfallback earn their cost and the 120-line reference would start dropping cells. - rpdfium is the fastest up to the heavy tier (496 ms vs hexapdf’s 779 ms
and pdfplumber’s 3.05 s on
04_heavy). Two layers earn this. First, the table/word pipeline pulls chars through a geometry-only fast path that skips the FFI reads and per-char allocation the cell filter never uses. Second, the batch helpers (extract_tables,extract_text) now stream pages — each page is closed the moment its data is read, freeing its native handles and char caches instead of retaining every visited page for the document’s lifetime. Peak RSS on the heavy tier fell from 119 MB to 39 MB and no longer grows with the page count. - On the 520-page academic tier the minimal hexapdf reference edges rpdfium
out on time (13.22 s vs 15.46 s). At that scale the full pipeline’s
per-page cost — borderless
:textattempts, rectangle / multi-table search, annotation parsing on figure/footnote-heavy pages — dominates, while the:lines-only reference skips all of it. rpdfium still recovers the same cells and stays ~4.4× faster than pdfplumber while using ~50× less memory (104 MB vs 5.2 GB). hexapdf also leads on03_complex(88 ms). A fair comparison only on clean ruled grids. - rpdfium stays linear and robust: ~5.9× faster than pdfplumber on the heavy tier, ~4.4× on the academic tier, with ~11–50× less memory.
03_complex.pdfalso contains borderless tables and a prestamped form — neither counts toward the ground truth (recovering them needs the:textstrategy or font filtering, not default settings).