rpdfium
Ruby bindings for PDFium, the PDF engine that powers Chrome’s viewer. They cover text extraction with character-level metadata, vector path access, image extraction, form fields, page rendering, and pdfplumber-style table detection.
Try it on the example PDF used throughout these guides:
require "rpdfium"

Rpdfium.open("example.pdf") do |doc|
  doc.each do |page|
    # Plain text of the page
    puts page.text

    # Each detected table yields its rows one at a time
    Rpdfium::Table::Extractor.new(page).extract.each do |table|
      table.each { |row| puts row.inspect }
    end
  end
end
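As a follow-up to the example above, here is a minimal sketch of writing detected tables out as CSV. It relies only on the calls already shown (`Rpdfium.open`, `doc.each`, `Rpdfium::Table::Extractor#extract`, `table.each`) plus Ruby's standard `csv` library. The helper names `table_to_csv` and `print_tables_as_csv` are hypothetical, and the sketch assumes each row is an array of cell values (strings or `nil`), which the example above suggests but does not confirm.

```ruby
require "csv"

# Serialize one table, given as an array of row arrays, to a CSV string.
# nil cells stay empty; every other cell is converted to a stripped string.
def table_to_csv(rows)
  CSV.generate do |csv|
    rows.each do |row|
      csv << row.map { |cell| cell.nil? ? nil : cell.to_s.strip }
    end
  end
end

# Hypothetical helper (not part of rpdfium): print every table in a PDF as CSV.
# It uses only the calls from the example above and ASSUMES each row is an
# array of cell values.
def print_tables_as_csv(pdf_path)
  require "rpdfium" # loaded lazily so table_to_csv works without the gem
  Rpdfium.open(pdf_path) do |doc|
    doc.each do |page|
      Rpdfium::Table::Extractor.new(page).extract.each do |table|
        rows = []
        table.each { |row| rows << row }
        puts table_to_csv(rows)
      end
    end
  end
end
```

Calling `print_tables_as_csv("example.pdf")` would then emit one CSV block per detected table. Keeping the serialization in its own method means it can be checked with plain arrays, without a PDF at hand.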
Why rpdfium
The Ruby ecosystem already has pdf-reader (text only, slow on complex
documents), origami (focused on security research), and hexapdf. hexapdf is a
capable library: it extracts text with character-level positioning and exposes
the vector-path primitives you need to build table extraction yourself (the
benchmark suite ships a ~120-line reference that does exactly this), but it is
licensed under the AGPL or a commercial license. rpdfium is an Apache-2.0
alternative that ships those higher-level pipelines out of the box:
pdfplumber-style table detection and page rendering on top of character
metadata. It binds the same battle-tested C++ engine that powers Chrome’s PDF
viewer, so it stays fast and light on large, complex documents.
In practice it matches the speed of Python’s pypdfium2 on text extraction
and is up to ~52× faster than pdfplumber while using up to ~13× less
memory on dense documents. See Benchmarks for the reproducible
suite.
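For a quick local sanity check of text-extraction speed, something like the following sketch may be enough. It is not the project's benchmark suite: `best_of` and `time_text_extraction` are hypothetical helpers, the timing uses Ruby's standard `benchmark` library, and only the rpdfium calls from the example at the top of this page are assumed. It measures wall-clock time only and says nothing about memory.

```ruby
require "benchmark"

# Run the block `runs` times and return the fastest wall-clock time in seconds.
# Taking the minimum reduces noise from other processes and cold caches.
def best_of(runs = 3)
  Array.new(runs) { Benchmark.realtime { yield } }.min
end

# Hypothetical helper (not part of rpdfium): time full-document text extraction.
def time_text_extraction(pdf_path, runs: 3)
  require "rpdfium" # loaded lazily so best_of works without the gem
  best_of(runs) do
    Rpdfium.open(pdf_path) do |doc|
      doc.each { |page| page.text }
    end
  end
end
```

A call such as `time_text_extraction("example.pdf")` returns seconds as a Float. For comparisons against other libraries, use the reproducible suite instead.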
At a glance
| Capability | Where |
|---|---|
| Open documents, metadata, bookmarks, attachments | Documents & pages |
| Plain + bbox-bounded text, per-character metadata | Text & characters |
| Vector path geometry (lines, segments) | Vector paths |
| Embedded image extraction | Images |
| Annotations, links | Annotations & links |
| Page rendering to PNG / raw bytes | Rendering |
| pdfplumber-style table detection | Tables |
| AcroForm / XFA fields | Interactive forms |
| Data extraction from filled forms | Form-aware extraction |
| Tagged-PDF logical structure | Struct tree |
| Performance vs pypdfium2 / pdfplumber | Benchmarks |
License
Apache-2.0, same as PDFium itself.