Top-Level Functions

Start here for quick file-level tasks: opening, merging, and batch rendering.

Open a PDF document

Opens a PDF document and returns a Document instance. Signature: sopdf.open(path, password, *, stream)
sopdf.open(
    path: str | pathlib.Path | None = None,
    password: str | None = None,
    *,
    stream: bytes | None = None,
) -> Document
path (str | Path | None, default: None): File-system path to the PDF. Mutually exclusive with stream.
password (str | None, default: None): Password for encrypted PDFs. Pass None if no password is required.
stream (bytes | None, default: None): Open from raw bytes in memory instead of a file. Mutually exclusive with path.
Returns: Document, the opened document object.
Raises:
PasswordError: The document requires a password that was not provided or is incorrect.
FileDataError: The file is corrupted or cannot be parsed as a valid PDF.
# Open from a file path
doc = sopdf.open("report.pdf")

# Open an encrypted document
doc = sopdf.open("secure.pdf", password="hunter2")

# Open from raw bytes in memory
with open("report.pdf", "rb") as f:
    doc = sopdf.open(stream=f.read())

# Recommended: use a context manager for automatic resource cleanup
with sopdf.open("report.pdf") as doc:
    print(doc.page_count)
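
Because path and stream are mutually exclusive, a caller that accepts either may want to validate its own inputs before calling sopdf.open(). The sketch below is a hypothetical helper (build_open_kwargs is not part of sopdf) that only builds the keyword arguments. How sopdf itself reacts when both or neither are given is not documented above, so the ValueError here is this helper's own choice.

```python
from pathlib import Path


def build_open_kwargs(path=None, data=None, password=None):
    """Return keyword arguments for sopdf.open(), enforcing path XOR stream.

    Hypothetical helper: the name and the ValueError are assumptions,
    not part of the sopdf API.
    """
    if (path is None) == (data is None):
        raise ValueError("provide exactly one of path or data")
    kwargs = {"password": password}
    if path is not None:
        kwargs["path"] = Path(path)
    else:
        kwargs["stream"] = bytes(data)
    return kwargs


# Usage (requires sopdf): doc = sopdf.open(**build_open_kwargs(path="report.pdf"))
print(build_open_kwargs(data=b"%PDF-1.7"))
```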

Merge multiple PDF files

Merges multiple PDF files into a single output file, in the order provided. Signature: sopdf.merge(inputs, output)
sopdf.merge(
    inputs: list[str | pathlib.Path],
    output: str | pathlib.Path,
) -> None
inputs (list[str | Path]): Ordered list of PDF file paths to concatenate.
output (str | Path): Destination file path for the merged PDF.
Raises:
ValueError: The inputs list is empty.
PasswordError: One of the input files requires a password.
FileDataError: One of the input files cannot be read.
sopdf.merge(
    ["intro.pdf", "body.pdf", "appendix.pdf"],
    output="book.pdf",
)
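
Since merge() concatenates in list order and rejects an empty list, it can help to build the list deliberately. A minimal sketch, assuming the inputs live in one directory and should be ordered by file name; collect_pdf_inputs is a hypothetical helper, not a sopdf function.

```python
from pathlib import Path
import tempfile


def collect_pdf_inputs(directory):
    """Return the .pdf files in `directory`, sorted by file name.

    Hypothetical helper: it raises ValueError itself when nothing is found,
    mirroring the documented ValueError for an empty inputs list.
    """
    paths = sorted(p for p in Path(directory).iterdir() if p.suffix.lower() == ".pdf")
    if not paths:
        raise ValueError(f"no PDF files found in {directory}")
    return paths


# Demonstrate the ordering with empty placeholder files (not valid PDFs).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("02_body.pdf", "01_intro.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    names = [p.name for p in collect_pdf_inputs(tmp)]

print(names)

# Usage (requires sopdf):
# sopdf.merge(collect_pdf_inputs("chapters/"), output="book.pdf")
```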

Render multiple pages to image bytes

Renders a list of pages to encoded image bytes. Signature: sopdf.render_pages(pages, *, dpi, format, alpha, parallel)
sopdf.render_pages(
    pages: list[Page],
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
    parallel: bool = False,
) -> list[bytes]
pages (list[Page]): List of page objects to render, typically from doc.pages.
dpi (int, default: 72): Rendering resolution in dots per inch. Common values: 72 (screen preview), 150 (high quality), 300 (print quality).
format (str, default: "png"): Output image format: "png" or "jpeg".
alpha (bool, default: False): Whether to include an alpha (transparency) channel. Only effective for PNG.
parallel (bool, default: False): Whether to use multiprocessing for rendering. Bypasses the GIL for significant speedup on multi-core machines.
Recommended parameter presets
Screen preview: dpi=72, format="png", alpha=False, parallel=False
High-quality export: dpi=150, format="png", alpha=False, parallel=False
Large-document throughput: dpi=300, format="png", alpha=False, parallel=True
Returns: list[bytes], a list of encoded image bytes, one entry per page, in the same order as pages.
with sopdf.open("report.pdf") as doc:
    # Sequential rendering
    images = sopdf.render_pages(doc.pages, dpi=150)

    # Parallel rendering with multiprocessing (recommended for large documents)
    images = sopdf.render_pages(doc.pages, dpi=300, parallel=True)
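
To get a feel for what a dpi value means for output size, the arithmetic can be sketched as follows. A PDF point is 1/72 inch, so the scale factor is dpi / 72. The pixel_size helper is hypothetical, and rounding to the nearest pixel is an assumption; the renderer may round differently.

```python
def pixel_size(width_pt, height_pt, dpi):
    """Convert a page size in PDF points to approximate pixels at the given DPI."""
    scale = dpi / 72  # 72 points per inch
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points.
for dpi in (72, 150, 300):
    print(dpi, pixel_size(612, 792, dpi))
```

Pixel count grows with the square of the dpi, which is worth keeping in mind when rendering many pages into memory at once.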

Render multiple pages and write files

Renders pages and writes the results to a directory as page_0.png, page_1.png, etc. Signature: sopdf.render_pages_to_files(pages, output_dir, *, dpi, format, alpha, parallel)
sopdf.render_pages_to_files(
    pages: list[Page],
    output_dir: str | pathlib.Path,
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
    parallel: bool = False,
) -> None
pages (list[Page]): List of page objects to render.
output_dir (str | Path): Output directory path. Created automatically if it does not exist.
dpi (int, default: 72): Rendering resolution in dots per inch.
format (str, default: "png"): Output image format: "png" or "jpeg".
alpha (bool, default: False): Whether to include an alpha channel (PNG only).
parallel (bool, default: False): Whether to use multiprocessing for rendering.
Recommended parameter presets
Preview thumbnails: dpi=72, format="png", parallel=False
Archive snapshots: dpi=150, format="png", parallel=False
Multi-core batch output: dpi=300, format="png", parallel=True
with sopdf.open("report.pdf") as doc:
    sopdf.render_pages_to_files(doc.pages, "output/", dpi=150, parallel=True)
# Produces: output/page_0.png, output/page_1.png, ...
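
After a batch run it can be useful to check that every expected file exists. A minimal sketch, assuming the full doc.pages list was rendered as PNG so that files are numbered from 0 in list order; the file naming for other formats or for a subset of pages is not documented above, and expected_png_paths is a hypothetical helper.

```python
from pathlib import Path


def expected_png_paths(output_dir, page_count):
    """List the page_<i>.png paths expected for `page_count` pages (assumed naming)."""
    return [Path(output_dir) / f"page_{i}.png" for i in range(page_count)]


paths = expected_png_paths("output", 3)
missing = [p for p in paths if not p.exists()]
print([p.name for p in paths])
print(f"{len(missing)} file(s) not found")
```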

Document Object Operations

Focus on this section after you have a Document instance and need page management, splitting, merging, or saving. Document represents an open PDF document. It should never be constructed directly — always obtain one via sopdf.open().

Properties

Total page count

Member: doc.page_count or len(doc)
doc.page_count -> int
The total number of pages in the document (read-only).
len(doc) -> int
len(doc) is equivalent to doc.page_count.

Metadata

doc.metadata -> Metadata
Document metadata — readable and writable via a Metadata proxy object.
# Read
print(doc.metadata.title)
print(doc.metadata.creation_datetime)  # Python datetime

# Write (lazily initialises pikepdf, marks document dirty)
doc.metadata.title  = "Annual Report 2025"
doc.metadata.author = "Kevin Qiu"
doc.save("updated.pdf")

Document outline

doc.outline -> Outline
Document outline (table of contents) as an Outline tree. Returns an object with len == 0 when the document has no bookmarks. Uses pypdfium2 — no pikepdf cost for read-only access.
for item in doc.outline.items:
    print(f"[p{item.page + 1}] {item.title}")

flat = doc.outline.to_list()  # PyMuPDF-compatible flat list

Encryption status

doc.is_encrypted -> bool
Whether the document is password-protected (read-only). Returns True even when the correct password has been provided and the document opened successfully.

Page sequence

doc.pages -> _PageList
Lazy sequence of all pages (read-only). Supports iteration and slicing. Commonly used with render_pages().

Page Access

Access page by index

Signature: doc[index] / doc.load_page(index)
doc[index: int] -> Page
doc.load_page(index: int) -> Page
Retrieves a page by 0-based index. Negative indices are supported (doc[-1] returns the last page).
Raises:
PageError: Index is out of range.
first_page = doc[0]
last_page  = doc[-1]
third_page = doc.load_page(2)
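
The index rules above (0-based, negative indices counted from the end) follow the usual Python sequence convention. As an illustration only, the mapping can be sketched like this; normalize_index is a hypothetical helper that raises IndexError, whereas sopdf raises PageError for out-of-range indices.

```python
def normalize_index(index, page_count):
    """Map a possibly negative page index to its 0-based position."""
    if not -page_count <= index < page_count:
        raise IndexError(f"page index {index} out of range for {page_count} pages")
    return index % page_count


print(normalize_index(-1, 10))
print(normalize_index(2, 10))
```

Validating user-supplied indices this way before calling doc[index] matches the handling guidance for PageError in the Exceptions section.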

Iteration

for page in doc:
    print(page.number)

Split

Split document by pages

Signature: doc.split(pages, output)
doc.split(
    pages: list[int],
    output: str | pathlib.Path | None = None,
) -> Document
Extracts specified pages from the current document and returns a new Document object.
pages (list[int]): List of 0-based page indices to extract. The output order matches the list order.
output (str | Path | None, default: None): If provided, the new document is also written to this path. Otherwise, it is returned in memory only.
Returns: Document, a new document containing the specified pages.
# Extract the first 3 pages and save to disk
chapter = doc.split(pages=[0, 1, 2], output="chapter1.pdf")

# Extract to memory only, no disk write
excerpt = doc.split(pages=[4, 5, 6])
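
A common use of split() is cutting a long document into fixed-size parts. The page lists for that can be computed up front; the sketch below uses a hypothetical chunk_indices helper and plain integers, so it runs without a PDF.

```python
def chunk_indices(page_count, chunk_size):
    """Return consecutive lists of 0-based page indices, chunk_size pages each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        list(range(start, min(start + chunk_size, page_count)))
        for start in range(0, page_count, chunk_size)
    ]


print(chunk_indices(7, 3))

# Usage (requires sopdf and an open document `doc`):
# for n, pages in enumerate(chunk_indices(doc.page_count, 10)):
#     doc.split(pages=pages, output=f"part_{n}.pdf")
```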

Split into single-page files

Signature: doc.split_each(output_dir)
doc.split_each(output_dir: str | pathlib.Path) -> None
Saves each page as a separate PDF file. Files are named page_0.pdf, page_1.pdf, etc.
output_dir (str | Path): Output directory path. Created automatically if it does not exist.
doc.split_each("pages/")
# Produces: pages/page_0.pdf, pages/page_1.pdf, ...

Merge

Append pages from another document

Signature: doc.append(other)
doc.append(other: Document) -> None
Appends all pages of another document to the end of this document. After calling this method, the document is marked as modified and must be saved via save() or to_bytes() to persist the change.
other (Document): The document whose pages will be appended.
with sopdf.open("part1.pdf") as doc_a, sopdf.open("part2.pdf") as doc_b:
    doc_a.append(doc_b)
    doc_a.save("combined.pdf")

Save

Save to file

Signature: doc.save(path, *, compress, garbage, linearize)
doc.save(
    path: str | pathlib.Path,
    *,
    compress: bool = True,
    garbage: bool = False,
    linearize: bool = False,
) -> None
Writes the document to disk.
path (str | Path): Destination file path.
compress (bool, default: True): Whether to compress content streams. Can significantly reduce file size.
garbage (bool, default: False): Whether to generate object streams for additional structural compression.
linearize (bool, default: False): Whether to linearize the PDF for optimized sequential network access (Fast Web View).
# Basic save (compression enabled by default)
doc.save("output.pdf")

# Maximum compression
doc.save("output.pdf", compress=True, garbage=True)

# Strip encryption (open with the correct password, then save)
doc.save("unlocked.pdf")

Export as bytes

Signature: doc.to_bytes(*, compress)
doc.to_bytes(compress: bool = True) -> bytes
Serializes the document to bytes without writing to disk. Useful for in-memory processing or serving a PDF over a network.
compress (bool, default: True): Whether to compress content streams.
Returns: bytes, the complete PDF file contents.
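pdf_bytes = doc.to_bytes()

# Return directly as a Flask HTTP response.
# The return statement must sit inside a function; pdf_response is an illustrative name.
from flask import Response

def pdf_response(doc):
    return Response(doc.to_bytes(), mimetype="application/pdf")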

Lifecycle

Close document

Signature: doc.close()
doc.close() -> None
Closes the document and releases all file handles and memory resources. Using a with statement is recommended over calling this directly.

Context Manager

with sopdf.open("file.pdf") as doc:
    ...
# close() is called automatically on exit

Page Object Operations

Use this section for single-page workflows such as rendering, text extraction, and text search. Page represents a single page within a document. Obtained via doc[i] or doc.load_page(i) — never constructed directly.

Properties

Page index

Member: page.number
page.number -> int
The 0-based index of this page (read-only).

Page dimensions

Member: page.rect
page.rect -> Rect
The page dimensions as a Rect in PDF points (1 pt = 1/72 inch) (read-only). Use rect.width and rect.height to get the page size.

Page rotation

Member: page.rotation
page.rotation -> int          # read current rotation
page.rotation = degrees: int  # set rotation
The page rotation in degrees. Must be one of 0, 90, 180, 270 (read/write).
Raises:
PageError: Set to a value other than 0, 90, 180, or 270.
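
When rotating relative to the current orientation (for example "turn 90 degrees clockwise"), the new value has to be wrapped back into the allowed set. A minimal sketch, assuming a hypothetical rotated helper that does the arithmetic before assigning to page.rotation:

```python
VALID_ROTATIONS = (0, 90, 180, 270)


def rotated(current, delta):
    """Return `current` rotated by `delta` degrees, wrapped into 0/90/180/270."""
    if current not in VALID_ROTATIONS:
        raise ValueError(f"invalid current rotation: {current}")
    if delta % 90 != 0:
        raise ValueError("delta must be a multiple of 90")
    return (current + delta) % 360


print(rotated(270, 90))
print(rotated(0, -90))

# Usage (requires sopdf): page.rotation = rotated(page.rotation, 90)
```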

Rendering

Render to image bytes

Signature: page.render(*, dpi, format, alpha)
page.render(
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
) -> bytes
Renders the page to encoded image bytes.
dpi (int, default: 72): Rendering resolution in dots per inch. Use 72 for screen preview, 300 for print quality.
format (str, default: "png"): Output image format: "png" or "jpeg".
alpha (bool, default: False): Whether to include an alpha (transparency) channel. Only effective for PNG; JPEG does not support transparency.
Recommended parameter presets
On-screen preview: dpi=72, format="png", alpha=False
Crisp snapshot: dpi=150, format="png", alpha=False
Print-grade image: dpi=300, format="png", alpha=False
Returns: bytes, the encoded image bytes (PNG or JPEG).
png_bytes  = page.render(dpi=150)
jpeg_bytes = page.render(dpi=150, format="jpeg")
png_alpha  = page.render(dpi=72, alpha=True)

Render and save image

Signature: page.render_to_file(path, *, dpi, format, alpha)
page.render_to_file(
    path: str | pathlib.Path,
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
) -> None
Renders the page and writes the image to a file. Apart from path, the parameters are identical to render().
path (str | Path): Output file path (including extension).
dpi (int, default: 72): Rendering resolution in dots per inch.
format (str, default: "png"): Output image format: "png" or "jpeg".
alpha (bool, default: False): Whether to include an alpha channel (PNG only).
page.render_to_file("page0.png", dpi=300)
page.render_to_file("page0.jpg", dpi=150, format="jpeg")

Text Extraction

Extract plain text

Signature: page.get_text(*, rect)
page.get_text(
    *,
    rect: Rect | None = None,
) -> str
Extracts plain text from the page.
rect (Rect | None, default: None): Restrict extraction to this rectangular region. Extracts the full page when None.
Returns: str, the extracted plain text.
full_text = page.get_text()

# Extract from a specific region only
region = Rect(0, 0, 300, 100)
header_text = page.get_text(rect=region)

Extract text blocks

Signature: page.get_text_blocks(*, rect, format)
page.get_text_blocks(
    *,
    rect: Rect | None = None,
    format: str = "list",
) -> list
Extracts structured text blocks with bounding boxes.
rect (Rect | None, default: None): Restrict extraction to this rectangular region. Extracts the full page when None.
format (str, default: "list"): Return format. "list" returns a list of TextBlock objects; "dict" returns a list of plain dictionaries with "text" and "rect" keys.
Returns: list[TextBlock] when format="list"; list[dict] when format="dict", each of the form {"text": "...", "rect": {"x0": ..., "y0": ..., "x1": ..., "y1": ...}}
blocks = page.get_text_blocks()
for block in blocks:
    print(block.text, block.rect)

# Return as dictionaries (convenient for JSON serialization)
dicts = page.get_text_blocks(format="dict")
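
The dictionary form is plain data, so it can be post-processed without sopdf types. The sketch below sorts blocks top-to-bottom, then left-to-right, and serializes them to JSON. The sample data is made up for illustration, and the order in which sopdf returns blocks is not stated above; this naive sort also ignores multi-column layouts.

```python
import json


def sort_reading_order(blocks):
    """Sort dict-format text blocks by top edge, then left edge (y grows downward)."""
    return sorted(blocks, key=lambda b: (b["rect"]["y0"], b["rect"]["x0"]))


# Hand-written sample in the documented {"text": ..., "rect": {...}} shape.
sample = [
    {"text": "Footer", "rect": {"x0": 72, "y0": 720, "x1": 300, "y1": 740}},
    {"text": "Title", "rect": {"x0": 72, "y0": 40, "x1": 400, "y1": 70}},
    {"text": "Body", "rect": {"x0": 72, "y0": 100, "x1": 540, "y1": 700}},
]

ordered = sort_reading_order(sample)
print(json.dumps([block["text"] for block in ordered]))
```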

Search text positions

Signature: page.search(query, *, match_case)
page.search(
    query: str,
    *,
    match_case: bool = False,
) -> list[Rect]
Searches the page for a text string and returns the bounding rectangles of all matches.
query (str): The text string to search for.
match_case (bool, default: False): Whether the search is case-sensitive. Case-insensitive by default.
Returns: list[Rect], the bounding rectangles for each match. Returns an empty list if no matches are found.
hits = page.search("invoice")
for rect in hits:
    print(f"Match at {rect}")

# Case-sensitive search
hits = page.search("PDF", match_case=True)
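
Search results are in PDF points, so highlighting a hit on a rendered image requires scaling by dpi / 72. A minimal sketch, assuming the image covers the whole page, shares the top-left origin described in the Rect section, and the page is not rotated; rect_to_pixels is a hypothetical helper that accepts any four-value sequence, such as an unpacked Rect.

```python
def rect_to_pixels(rect, dpi):
    """Scale an (x0, y0, x1, y1) rectangle from PDF points to pixel coordinates."""
    scale = dpi / 72  # 72 points per inch
    x0, y0, x1, y1 = rect
    return tuple(round(value * scale) for value in (x0, y0, x1, y1))


# A hit one inch from the left and half an inch from the top, rendered at 144 dpi.
print(rect_to_pixels((72, 36, 144, 72), 144))
```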

Search text with context blocks

Signature: page.search_text_blocks(query, *, match_case)
page.search_text_blocks(
    query: str,
    *,
    match_case: bool = False,
) -> list[dict]
Searches for text and returns each match along with the surrounding text block for context.
query (str): The text string to search for.
match_case (bool, default: False): Whether the search is case-sensitive.
Returns: list[dict]. Each element contains these keys:
"text" (str): Full text content of the block containing the match.
"rect" (Rect): Bounding rectangle of the containing text block.
"match_rect" (Rect): Precise bounding rectangle of the matched keyword itself.
results = page.search_text_blocks("total amount")
for r in results:
    print(r["text"])        # full paragraph containing the keyword
    print(r["match_rect"])  # exact position of the keyword

Data Types

Refer to this section when you need to understand response structures (for example Rect, TextBlock, and Metadata) for downstream processing.

Rect

Represents a rectangular region. Coordinates are in PDF points (pt, where 1 pt = 1/72 inch). The coordinate system has its origin at the top-left corner of the page, with x increasing rightward and y increasing downward.
Rect(x0: float, y0: float, x1: float, y1: float)
Constructor Parameters
x0 (float): Left edge (x-coordinate of the top-left corner).
y0 (float): Top edge (y-coordinate of the top-left corner).
x1 (float): Right edge (x-coordinate of the bottom-right corner).
y1 (float): Bottom edge (y-coordinate of the bottom-right corner).
Core properties (common)
x0 (float): Left edge.
y0 (float): Top edge.
x1 (float): Right edge.
y1 (float): Bottom edge.
width (float): Rectangle width, equal to x1 - x0.
height (float): Rectangle height, equal to y1 - y0.
All geometric operations return new Rect instances — the original is immutable.
r = Rect(10, 20, 200, 300)
print(r.width)    # 190.0
print(r.height)   # 280.0

# Containment check
print(r.contains(Rect(50, 50, 100, 100)))  # True
print(r.contains((50, 50)))                # True (point)

# Intersection
a = Rect(0, 0, 100, 100)
b = Rect(50, 50, 150, 150)
print(a.intersect(b))  # Rect(50, 50, 100, 100)

# Unpack
x0, y0, x1, y1 = r
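
For readers who want to see the geometry spelled out, here is a small stand-in that reproduces the width, containment, and intersection results shown above. It is an illustrative model named Box, not sopdf's Rect implementation; in particular, returning None for non-overlapping rectangles is an assumption of this sketch, and rectangle-in-rectangle is the only containment case modelled.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """Illustrative immutable rectangle in PDF points (top-left origin)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    def contains(self, other: "Box") -> bool:
        """True when `other` lies entirely inside this box."""
        return (
            self.x0 <= other.x0
            and self.y0 <= other.y0
            and other.x1 <= self.x1
            and other.y1 <= self.y1
        )

    def intersect(self, other: "Box") -> Optional["Box"]:
        """Return the overlapping region as a new Box, or None if there is none."""
        x0, y0 = max(self.x0, other.x0), max(self.y0, other.y0)
        x1, y1 = min(self.x1, other.x1), min(self.y1, other.y1)
        if x0 >= x1 or y0 >= y1:
            return None
        return Box(x0, y0, x1, y1)


overlap = Box(0, 0, 100, 100).intersect(Box(50, 50, 150, 150))
print(overlap)
```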

TextBlock

Represents a single block of text on a page, together with its bounding box.
TextBlock(text: str, rect: Rect)
Attributes and methods
text (str): The text content of the block.
rect (Rect): Bounding rectangle of the block on the page.
to_dict() -> dict: Converts to {"text": ..., "rect": {"x0": ..., "y0": ..., "x1": ..., "y1": ...}}.
blocks = page.get_text_blocks()
for block in blocks:
    print(block.text)
    print(block.rect.width, block.rect.height)
    print(block.to_dict())

Metadata

Read/write proxy for the PDF Document Info dictionary. Obtained via doc.metadata; never constructed directly.
Read path (zero pikepdf cost): each property calls pypdfium2.get_metadata_dict() after auto-syncing.
Write path (lazy pikepdf init): each setter calls _ensure_pike(), writes to pike_doc.docinfo, and marks the document dirty. The next read auto-syncs.
Core fields (common)
title (str | None): Document title (/Title). Read/write.
author (str | None): Author name (/Author). Read/write.
subject (str | None): Document subject (/Subject). Read/write.
keywords (str | None): Search keywords (/Keywords). Read/write.
PDF date string format: D:YYYYMMDDHHmmSSOHH'mm' (prefix D: and timezone optional).
with sopdf.open("report.pdf") as doc:
    meta = doc.metadata

    # Read individual fields
    print(meta.title)
    print(meta.creation_datetime)   # datetime(2024, 1, 1, 12, 0, tzinfo=...)

    # Write
    meta.title  = "New Title"
    meta.author = "Kevin Qiu"
    doc.save("updated.pdf")

    # Dict-style read (backward compat)
    d = meta.to_dict()
    print(d["title"])
    print(meta["author"])
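
The PDF date string format noted above can be parsed with the standard library when a raw string turns up outside sopdf (for example in a log or another tool's output). This is a sketch of one possible parser, not the code behind creation_datetime. The parse_pdf_date name is hypothetical, and treating missing components as their minimum value is an assumption of the sketch.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSSOHH'mm' where everything after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?"
    r"(?P<year>\d{4})"
    r"(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<tz>[Zz](?:00'?00'?)?|[+-]\d{2}(?:'?\d{2}'?)?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string into a datetime (naive if no timezone is given)."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = match.groupdict()

    tz = parts["tz"]
    if tz is None:
        tzinfo = None
    elif tz[0] in "Zz":
        tzinfo = timezone.utc
    else:
        digits = re.sub(r"\D", "", tz)  # "+05'30'" -> "0530"
        sign = -1 if tz[0] == "-" else 1
        offset = timedelta(hours=int(digits[:2]), minutes=int(digits[2:4] or 0))
        tzinfo = timezone(sign * offset)

    return datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
        tzinfo=tzinfo,
    )


parsed = parse_pdf_date("D:20240101120000+08'00'")
print(parsed.isoformat())
```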

OutlineItem

An immutable bookmark node in the document outline.
@dataclass(frozen=True)
class OutlineItem:
    title:    str
    page:     int                          # 0-based; -1 = no destination page
    level:    int                          # 0 = top-level
    children: tuple[OutlineItem, ...] = ()
Attributes and methods
title (str): Bookmark label as displayed in the reader TOC panel.
page (int): 0-based target page index; -1 when the item has no destination.
level (int): Nesting depth; 0 = top-level item.
children (tuple[OutlineItem, ...]): Nested child items (frozen tuple).
to_dict() -> dict: Serialize to a plain dict (recursive).

Outline

Read-only bookmark tree manager. Obtained via doc.outline — never constructed directly. The tree is built once on first access using pypdfium2’s TOC data — no pikepdf initialisation needed.
Members
items -> list[OutlineItem]: Top-level outline items (each may have nested children).
to_list() -> list[dict]: Flat DFS traversal. Each entry: {"level": int, "title": str, "page": int}. Compatible with PyMuPDF get_toc() output.
len(outline) -> int: Total number of nodes across all nesting levels.
iter(outline): Iterate over top-level items.
bool(outline) -> bool: True when the document has at least one outline item.
with sopdf.open("textbook.pdf") as doc:
    outline = doc.outline
    print(outline)          # Outline(top_level=2, total=4)
    print(bool(outline))    # True

    # Recursive tree traversal
    def print_tree(items, indent=0):
        for item in items:
            print("  " * indent + f"[p{item.page + 1}] {item.title}")
            print_tree(item.children, indent + 1)

    print_tree(outline.items)

    # Flat list (PyMuPDF-compatible)
    for row in outline.to_list():
        print(f"{'  ' * row['level']}{row['title']}  →  p{row['page'] + 1}")
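
The relationship between the nested items and the flat to_list() rows is a depth-first, parent-before-children walk. The sketch below shows that traversal on a stand-in dataclass with the same four fields as OutlineItem; Node and flatten are illustrative names, and this is not the library's own implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Stand-in with the same fields as OutlineItem, for illustration only."""

    title: str
    page: int
    level: int
    children: tuple = ()


def flatten(items):
    """Depth-first walk yielding one {"level", "title", "page"} row per node."""
    rows = []
    for item in items:
        rows.append({"level": item.level, "title": item.title, "page": item.page})
        rows.extend(flatten(item.children))
    return rows


tree = [
    Node("Chapter 1", 0, 0, (Node("1.1 Basics", 2, 1),)),
    Node("Chapter 2", 9, 0),
]

rows = flatten(tree)
for row in rows:
    print(row)
```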

Exceptions

Read this section first when integrating PDF processing into production services and designing robust error handling. All exceptions inherit from PDFError, which inherits from the built-in RuntimeError.
RuntimeError
└── PDFError
    ├── PasswordError
    ├── FileDataError
    └── PageError
| Exception | When raised | Handling guidance | Recoverable |
| --- | --- | --- | --- |
| PDFError | Base class for all sopdf exceptions. Catch this to handle any sopdf error. | Use as a top-level fallback for logging and unified user messaging. | Depends on subtype |
| PasswordError | Opening an encrypted PDF with a missing or incorrect password. | Prompt again for password and limit retries. | Yes |
| FileDataError | PDF file is corrupted, has an invalid format, or cannot be parsed. | Ask user to re-upload or replace the source file. | No |
| PageError | Page index is out of range, or rotation is set to an invalid value (not 0/90/180/270). | Validate index/range and rotation before calling. | Yes |
Recommended catch order: catch specific exceptions first (PasswordError / FileDataError / PageError), then PDFError as a final fallback.
import sopdf

try:
    doc = sopdf.open("file.pdf", password="wrong")
except sopdf.PasswordError:
    print("Incorrect password")
except sopdf.FileDataError:
    print("File is corrupted")
except sopdf.PDFError as e:
    print(f"PDF error: {e}")
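
The "prompt again and limit retries" guidance for PasswordError can be sketched as a bounded loop over candidate passwords. To keep the example runnable without sopdf, the opener and the exception class are passed in as arguments; open_with_passwords, fake_open, and FakePasswordError are all hypothetical names used only for this demonstration.

```python
def open_with_passwords(open_func, path, candidates, password_error):
    """Try each candidate password in order; return the first document that opens.

    Re-raises the last password error once all candidates are exhausted.
    """
    last_error = None
    for password in candidates:
        try:
            return open_func(path, password=password)
        except password_error as exc:
            last_error = exc
    if last_error is None:
        raise ValueError("candidates must not be empty")
    raise last_error


# Stand-ins so the sketch runs on its own.
class FakePasswordError(RuntimeError):
    pass


def fake_open(path, password=None):
    if password != "s3cret":
        raise FakePasswordError(f"wrong password for {path}")
    return f"<document {path}>"


result = open_with_passwords(fake_open, "secure.pdf", [None, "guess", "s3cret"], FakePasswordError)
print(result)

# Usage (requires sopdf):
# doc = open_with_passwords(sopdf.open, "secure.pdf", [None, "hunter2"], sopdf.PasswordError)
```

Other sopdf errors, such as FileDataError, are deliberately not caught here, so a corrupted file fails immediately instead of consuming retries.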