rpdfium

Ruby bindings for PDFium, the PDF engine that powers Chrome’s viewer. rpdfium provides text extraction with character-level metadata, vector path access, image extraction, form fields, page rendering, and pdfplumber-style table detection.

Try it on the example PDF used throughout these guides:

require "rpdfium"

Rpdfium.open("example.pdf") do |doc|
  doc.each do |page|
    # Plain text of the page
    puts page.text
    # Detected tables, printed one row per line
    Rpdfium::Table::Extractor.new(page).extract.each do |table|
      table.each { |row| puts row.inspect }
    end
  end
end
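
For intuition about what pdfplumber-style detection involves, here is a minimal, self-contained sketch of the lattice idea: ruling lines define a grid, and each character is placed in the cell that contains its centre point. This is not rpdfium's implementation. `Char`, `Rule`, and `lattice_table` are hypothetical names used only for illustration, and the sketch assumes a fully ruled table with single-line cells.

```ruby
# Hypothetical stand-ins for illustration; these are NOT rpdfium classes.
Char = Struct.new(:text, :x, :y)        # a glyph and its centre point
Rule = Struct.new(:x0, :y0, :x1, :y1)   # a straight ruling line

# Build rows of cell strings from ruling lines and positioned characters.
def lattice_table(rules, chars, tolerance: 0.5)
  # Horizontal rules give row boundaries, vertical rules give column boundaries.
  ys = rules.select { |r| (r.y0 - r.y1).abs <= tolerance }.map(&:y0)
  xs = rules.select { |r| (r.x0 - r.x1).abs <= tolerance }.map(&:x0)
  ys = ys.uniq.sort.reverse # PDF origin is bottom-left, so the top row has the largest y
  xs = xs.uniq.sort

  ys.each_cons(2).map do |top, bottom|
    xs.each_cons(2).map do |left, right|
      chars.select { |c| c.x.between?(left, right) && c.y.between?(bottom, top) }
           .sort_by(&:x)
           .map(&:text)
           .join
    end
  end
end

# A 2x2 grid drawn with three horizontal and three vertical rules.
rules = [0, 10, 20].flat_map do |v|
  [Rule.new(0, v, 20, v), Rule.new(v, 0, v, 20)]
end
chars = [
  Char.new("A", 2, 15), Char.new("B", 5, 15), Char.new("C", 15, 15),
  Char.new("1", 3, 5),  Char.new("2", 14, 5)
]

table = lattice_table(rules, chars)
p table # => [["AB", "C"], ["1", "2"]]
```

A production detector additionally has to merge near-duplicate edges, cope with tables that have no ruling lines, and order multi-line cell text.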

Why rpdfium

The Ruby ecosystem already has pdf-reader (text only, slow on complex documents), origami (focused on security research), and hexapdf. hexapdf is a capable library: it extracts text with character-level positioning and exposes the vector-path primitives you need to build table extraction yourself (the benchmark suite ships a ~120-line reference that does exactly this), but it is licensed under the AGPL or a commercial license. rpdfium is an Apache-2.0 alternative that ships those higher-level pipelines out of the box: pdfplumber-style table detection and page rendering on top of character metadata. It binds the same battle-tested C++ engine that powers Chrome’s PDF viewer, so it stays fast and light on large, complex documents.

In practice it matches the speed of Python’s pypdfium2 on text extraction and is up to ~52× faster than pdfplumber while using up to ~13× less memory on dense documents. See Benchmarks for the reproducible suite.
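
To get a rough feel for timings on your own documents, a small wall-clock helper built on Ruby's standard `Benchmark` module is enough. This is only a sketch and not the project's benchmark suite: `best_time` is a hypothetical helper, the workload shown is a stand-in, and memory use is not measured.

```ruby
require "benchmark"

# Run the given block several times and return the best wall-clock time in seconds.
def best_time(runs: 5)
  Array.new(runs) { Benchmark.realtime { yield } }.min
end

# Stand-in workload; replace it with the code you want to time,
# for example the Rpdfium.open loop shown at the top of this page.
elapsed = best_time { 10_000.times.sum }
puts format("best of 5 runs: %.4fs", elapsed)
```

Taking the minimum over several runs reduces noise from caching and garbage collection, but it is no substitute for the reproducible suite.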

At a glance

| Capability | Where |
| --- | --- |
| Open documents, metadata, bookmarks, attachments | Documents & pages |
| Plain + bbox-bounded text, per-character metadata | Text & characters |
| Vector path geometry (lines, segments) | Vector paths |
| Embedded image extraction | Images |
| Annotations, links | Annotations & links |
| Page rendering to PNG / raw bytes | Rendering |
| pdfplumber-style table detection | Tables |
| AcroForm / XFA fields | Interactive forms |
| Data extraction from filled forms | Form-aware extraction |
| Tagged-PDF logical structure | Struct tree |
| Performance vs pypdfium2 / pdfplumber | Benchmarks |

License

Apache-2.0, same as PDFium itself.