API Reference - SoMark Docs

Top-Level Functions
Document Object Operations
Page Object Operations
Data Types
Exceptions

Top-Level Functions

Start here when you need quick file-level tasks: open, merge, and batch rendering.

Open a PDF document

Opens a PDF document and returns a Document instance. Signature: sopdf.open(path, password, *, stream)

sopdf.open(
    path: str | pathlib.Path | None = None,
    password: str | None = None,
    *,
    stream: bytes | None = None,
) -> Document

path

str | Path | None

default:"None"

File-system path to the PDF. Mutually exclusive with stream.

password

str | None

default:"None"

Password for encrypted PDFs. Pass None if no password is required.

stream

bytes | None

default:"None"

Open from raw bytes in memory instead of a file. Mutually exclusive with path.

Returns: Document — The opened document object.

Exception	Condition
`PasswordError`	The document requires a password that was not provided or is incorrect.
`FileDataError`	The file is corrupted or cannot be parsed as a valid PDF.

# Open from a file path
doc = sopdf.open("report.pdf")

# Open an encrypted document
doc = sopdf.open("secure.pdf", password="hunter2")

# Open from raw bytes in memory
with open("report.pdf", "rb") as f:
    doc = sopdf.open(stream=f.read())

# Recommended: use a context manager for automatic resource cleanup
with sopdf.open("report.pdf") as doc:
    print(doc.page_count)
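
Because path and stream are mutually exclusive, code that accepts either a file path or in-memory bytes has to choose one keyword per call. The following is a minimal sketch of one way to do that. `open_kwargs` is a hypothetical helper, not part of sopdf, and it only builds the keyword arguments so that it runs without the library installed.

```python
from pathlib import Path


def open_kwargs(source, password=None):
    """Build keyword arguments for sopdf.open() from a path or raw bytes.

    Hypothetical helper: it ensures that only one of `path` / `stream`
    is ever passed, since the two are documented as mutually exclusive.
    """
    if isinstance(source, (bytes, bytearray)):
        kwargs = {"stream": bytes(source)}
    elif isinstance(source, (str, Path)):
        kwargs = {"path": source}
    else:
        raise TypeError(f"unsupported source type: {type(source).__name__}")
    if password is not None:
        kwargs["password"] = password
    return kwargs


from_path = open_kwargs("report.pdf")
from_bytes = open_kwargs(b"%PDF-1.7 ...", password="hunter2")
# Intended use (requires sopdf): doc = sopdf.open(**open_kwargs(source))
```

Keeping the decision in one place avoids accidentally passing both arguments when the source type varies at runtime.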

Merge multiple PDF files

Merges multiple PDF files into a single output file, in the order provided. Signature: sopdf.merge(inputs, output)

sopdf.merge(
    inputs: list[str | pathlib.Path],
    output: str | pathlib.Path,
) -> None

inputs

list[str | Path]

Ordered list of PDF file paths to concatenate.

output

str | Path

Destination file path for the merged PDF.

Exception	Condition
`ValueError`	`inputs` list is empty.
`PasswordError`	One of the input files requires a password.
`FileDataError`	One of the input files cannot be read.

sopdf.merge(
    ["intro.pdf", "body.pdf", "appendix.pdf"],
    output="book.pdf",
)

Render multiple pages to image bytes

Renders a list of pages to encoded image bytes. Signature: sopdf.render_pages(pages, *, dpi, format, alpha, parallel)

sopdf.render_pages(
    pages: list[Page],
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
    parallel: bool = False,
) -> list[bytes]

pages

list[Page]

List of page objects to render, typically from doc.pages.

dpi

int

default:"72"

Rendering resolution in dots per inch. Common values: 72 (screen preview), 150 (high quality), 300 (print quality).

format

str

default:"png"

Output image format: "png" or "jpeg".

alpha

bool

default:"False"

Whether to include an alpha (transparency) channel. Only effective for PNG.

parallel

bool

default:"False"

Whether to render in multiple processes. Separate processes are not constrained by the GIL, which can give a significant speedup on multi-core machines.

Recommended parameter presets

Scenario	Recommended parameters
Screen preview	`dpi=72, format="png", alpha=False, parallel=False`
High-quality export	`dpi=150, format="png", alpha=False, parallel=False`
Large-document throughput	`dpi=300, format="png", alpha=False, parallel=True`

Returns: list[bytes] — A list of encoded image bytes, one entry per page, in the same order as pages.

with sopdf.open("report.pdf") as doc:
    # Sequential rendering
    images = sopdf.render_pages(doc.pages, dpi=150)

    # Parallel rendering with multiprocessing (recommended for large documents)
    images = sopdf.render_pages(doc.pages, dpi=300, parallel=True)
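
Since render_pages() returns one bytes object per page in page order, persisting the result is a matter of enumerating the list. The sketch below shows that step under stated assumptions: `write_images` is a hypothetical helper, the page_N naming simply mirrors the convention described for render_pages_to_files(), and placeholder byte strings stand in for real image data so the example runs without sopdf.

```python
import tempfile
from pathlib import Path


def write_images(images, output_dir, ext="png"):
    """Write each bytes object to output_dir as page_<index>.<ext>."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    paths = []
    for index, data in enumerate(images):
        target = out / f"page_{index}.{ext}"
        target.write_bytes(data)
        paths.append(target)
    return paths


# Placeholder bytes stand in for the list returned by sopdf.render_pages().
fake_images = [b"first page", b"second page"]

with tempfile.TemporaryDirectory() as tmp:
    written = write_images(fake_images, Path(tmp) / "output")
    names = [p.name for p in written]
    sizes = [p.stat().st_size for p in written]
```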

Render multiple pages and write files

Renders pages and writes the results to a directory as page_0.png, page_1.png, etc. Signature: sopdf.render_pages_to_files(pages, output_dir, *, dpi, format, alpha, parallel)

sopdf.render_pages_to_files(
    pages: list[Page],
    output_dir: str | pathlib.Path,
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
    parallel: bool = False,
) -> None

pages

list[Page]

List of page objects to render.

output_dir

str | Path

Output directory path. Created automatically if it does not exist.

dpi

int

default:"72"

Rendering resolution in dots per inch.

format

str

default:"png"

Output image format: "png" or "jpeg".

alpha

bool

default:"False"

Whether to include an alpha channel (PNG only).

parallel

bool

default:"False"

Whether to use multiprocessing for rendering.

Recommended parameter presets

Scenario	Recommended parameters
Preview thumbnails	`dpi=72, format="png", parallel=False`
Archive snapshots	`dpi=150, format="png", parallel=False`
Multi-core batch output	`dpi=300, format="png", parallel=True`

with sopdf.open("report.pdf") as doc:
    sopdf.render_pages_to_files(doc.pages, "output/", dpi=150, parallel=True)
# Produces: output/page_0.png, output/page_1.png, ...

Document Object Operations

Focus on this section after you have a Document instance and need page management, splitting, merging, or saving. Document represents an open PDF document. It should never be constructed directly — always obtain one via sopdf.open().

Properties

Total page count

Member: doc.page_count or len(doc)

doc.page_count -> int

The total number of pages in the document (read-only).

len(doc) -> int

len(doc) is equivalent to doc.page_count.

Metadata

doc.metadata -> Metadata

Document metadata — readable and writable via a Metadata proxy object.

# Read
print(doc.metadata.title)
print(doc.metadata.creation_datetime)  # Python datetime

# Write (lazily initialises pikepdf, marks document dirty)
doc.metadata.title  = "Annual Report 2025"
doc.metadata.author = "Kevin Qiu"
doc.save("updated.pdf")

Document outline

doc.outline -> Outline

Document outline (table of contents) as an Outline tree. Returns an object with len == 0 when the document has no bookmarks. Uses pypdfium2 — no pikepdf cost for read-only access.

for item in doc.outline.items:
    print(f"[p{item.page + 1}] {item.title}")

flat = doc.outline.to_list()  # PyMuPDF-compatible flat list

Encryption status

doc.is_encrypted -> bool

Whether the document is password-protected (read-only). Returns True even when the correct password has been provided and the document opened successfully.

Page sequence

doc.pages -> _PageList

Lazy sequence of all pages (read-only). Supports iteration and slicing. Commonly used with render_pages().

Page Access

Access page by index

Signature: doc[index] / doc.load_page(index)

doc[index: int] -> Page
doc.load_page(index: int) -> Page

Retrieves a page by 0-based index. Negative indices are supported (doc[-1] returns the last page).

Exception	Condition
`PageError`	Index is out of range.

first_page = doc[0]
last_page  = doc[-1]
third_page = doc.load_page(2)
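
Negative indices follow the usual Python sequence convention. The arithmetic can be sketched as follows, assuming the library resolves doc[-1] the way a list does. `resolve_index` is a hypothetical function, and it raises the built-in IndexError where sopdf is documented to raise PageError.

```python
def resolve_index(index: int, page_count: int) -> int:
    """Map a possibly negative index onto the range 0..page_count-1."""
    resolved = index + page_count if index < 0 else index
    if not 0 <= resolved < page_count:
        raise IndexError(f"page index {index} out of range for {page_count} pages")
    return resolved


last = resolve_index(-1, 10)   # last page of a 10-page document
third = resolve_index(2, 10)   # unchanged for non-negative indices
```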

Iteration

for page in doc:
    print(page.number)

Split

Split document by pages

Signature: doc.split(pages, output)

doc.split(
    pages: list[int],
    output: str | pathlib.Path | None = None,
) -> Document

Extracts specified pages from the current document and returns a new Document object.

pages

list[int]

List of 0-based page indices to extract. The output order matches the list order.

output

str | Path | None

default:"None"

If provided, the new document is also written to this path. Otherwise, it is returned in memory only.

Returns: Document — A new document containing the specified pages.

# Extract the first 3 pages and save to disk
chapter = doc.split(pages=[0, 1, 2], output="chapter1.pdf")

# Extract to memory only, no disk write
excerpt = doc.split(pages=[4, 5, 6])
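
When a document has to be cut into fixed-size parts, the index lists passed to split() can be generated up front. This is a minimal sketch of that bookkeeping; `chunk_indices` is a hypothetical helper and does not call sopdf itself.

```python
def chunk_indices(page_count: int, chunk_size: int) -> list[list[int]]:
    """Return consecutive 0-based page index lists of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        list(range(start, min(start + chunk_size, page_count)))
        for start in range(0, page_count, chunk_size)
    ]


chunks = chunk_indices(page_count=7, chunk_size=3)
# Intended use (requires sopdf):
#   for n, pages in enumerate(chunks):
#       doc.split(pages=pages, output=f"part_{n}.pdf")
```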

Split into single-page files

Signature: doc.split_each(output_dir)

doc.split_each(output_dir: str | pathlib.Path) -> None

Saves each page as a separate PDF file. Files are named page_0.pdf, page_1.pdf, etc.

output_dir

str | Path

Output directory path. Created automatically if it does not exist.

doc.split_each("pages/")
# Produces: pages/page_0.pdf, pages/page_1.pdf, ...

Merge

Append pages from another document

Signature: doc.append(other)

doc.append(other: Document) -> None

Appends all pages of another document to the end of this document. After calling this method, the document is marked as modified and must be saved via save() or to_bytes() to persist the change.

other

Document

The document whose pages will be appended.

with sopdf.open("part1.pdf") as doc_a, sopdf.open("part2.pdf") as doc_b:
    doc_a.append(doc_b)
    doc_a.save("combined.pdf")

Save

Save to file

Signature: doc.save(path, *, compress, garbage, linearize)

doc.save(
    path: str | pathlib.Path,
    *,
    compress: bool = True,
    garbage: bool = False,
    linearize: bool = False,
) -> None

Writes the document to disk.

path

str | Path

Destination file path.

compress

bool

default:"True"

Whether to compress content streams. Can significantly reduce file size.

garbage

bool

default:"False"

Whether to generate object streams for additional structural compression.

linearize

bool

default:"False"

Whether to linearize the PDF for optimized sequential network access (Fast Web View).

# Basic save (compression enabled by default)
doc.save("output.pdf")

# Maximum compression
doc.save("output.pdf", compress=True, garbage=True)

# Strip encryption (open with the correct password, then save)
doc.save("unlocked.pdf")

Export as bytes

Signature: doc.to_bytes(*, compress)

doc.to_bytes(compress: bool = True) -> bytes

Serializes the document to bytes without writing to disk. Useful for in-memory processing or serving a PDF over a network.

compress

bool

default:"True"

Whether to compress content streams.

Returns: bytes — The complete PDF file contents as bytes.

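pdf_bytes = doc.to_bytes()

# Return directly as a Flask HTTP response (the route name is illustrative)
from flask import Flask, Response

app = Flask(__name__)

@app.route("/report.pdf")
def report():
    return Response(doc.to_bytes(), mimetype="application/pdf")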

Lifecycle

Close document

Signature: doc.close()

doc.close() -> None

Closes the document and releases all file handles and memory resources. Using a with statement is recommended over calling this directly.

Context Manager

with sopdf.open("file.pdf") as doc:
    ...
# close() is called automatically on exit

Page Object Operations

Use this section for single-page workflows such as rendering, text extraction, and text search. Page represents a single page within a document. Obtained via doc[i] or doc.load_page(i) — never constructed directly.

Properties

Page index

Member: page.number

page.number -> int

The 0-based index of this page (read-only).

Page dimensions

Member: page.rect

page.rect -> Rect

The page dimensions as a Rect in PDF points (1 pt = 1/72 inch) (read-only). Use rect.width and rect.height to get the page size.

Page rotation

Member: page.rotation

page.rotation -> int          # read current rotation
page.rotation = degrees: int  # set rotation

The page rotation in degrees. Must be one of 0, 90, 180, 270 (read/write).

Exception	Condition
`PageError`	Set to a value other than 0, 90, 180, or 270.
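
Because only 0, 90, 180, and 270 are accepted, relative rotations need to be wrapped before they are assigned. A minimal sketch of that calculation follows; `next_rotation` is a hypothetical helper that raises ValueError, not the library's PageError.

```python
VALID_ROTATIONS = (0, 90, 180, 270)


def next_rotation(current: int, delta: int) -> int:
    """Return the rotation reached by turning `delta` degrees from `current`."""
    if delta % 90 != 0:
        raise ValueError("rotation changes must be multiples of 90 degrees")
    # Python's % always yields a non-negative result for a positive modulus,
    # so counter-clockwise (negative) deltas wrap correctly.
    return (current + delta) % 360


clockwise = next_rotation(270, 90)
counter_clockwise = next_rotation(0, -90)
# Intended use (requires sopdf): page.rotation = next_rotation(page.rotation, 90)
```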

Rendering

Render to image bytes

Signature: page.render(*, dpi, format, alpha)

page.render(
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
) -> bytes

Renders the page to encoded image bytes.

dpi

int

default:"72"

Rendering resolution in dots per inch. Use 72 for screen preview, 300 for print quality.

format

str

default:"png"

Output image format: "png" or "jpeg".

alpha

bool

default:"False"

Whether to include an alpha (transparency) channel. Only effective for PNG; JPEG does not support transparency.

Recommended parameter presets

Scenario	Recommended parameters
On-screen preview	`dpi=72, format="png", alpha=False`
Crisp snapshot	`dpi=150, format="png", alpha=False`
Print-grade image	`dpi=300, format="png", alpha=False`

Returns: bytes — Encoded image bytes (PNG or JPEG).

png_bytes  = page.render(dpi=150)
jpeg_bytes = page.render(dpi=150, format="jpeg")
png_alpha  = page.render(dpi=72, alpha=True)
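
Page sizes are given in points (1 pt = 1/72 inch), so the expected pixel dimensions of a rendered image follow from the dpi value. The sketch below illustrates that arithmetic; `pixel_size` is a hypothetical helper, and the rounding used here is an assumption, so the library's actual output may differ by a pixel.

```python
POINTS_PER_INCH = 72


def pixel_size(width_pt: float, height_pt: float, dpi: int = 72) -> tuple[int, int]:
    """Estimate the image size in pixels for a page measured in points."""
    # Multiply before dividing to keep exact results for common page sizes.
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points.
preview = pixel_size(612, 792, dpi=72)
export = pixel_size(612, 792, dpi=150)
print_grade = pixel_size(612, 792, dpi=300)
```

Pixel count grows with the square of the dpi, which is worth keeping in mind when choosing between the presets above for large documents.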

Render and save image

Signature: page.render_to_file(path, *, dpi, format, alpha)

page.render_to_file(
    path: str | pathlib.Path,
    *,
    dpi: int = 72,
    format: str = "png",
    alpha: bool = False,
) -> None

Renders the page and writes the image to a file. The rendering parameters (dpi, format, alpha) are identical to those of render().

path

str | Path

Output file path (including extension).

dpi

int

default:"72"

Rendering resolution in dots per inch.

format

str

default:"png"

Output image format: "png" or "jpeg".

alpha

bool

default:"False"

Whether to include an alpha channel (PNG only).

page.render_to_file("page0.png", dpi=300)
page.render_to_file("page0.jpg", dpi=150, format="jpeg")

Text Extraction

Extract plain text

Signature: page.get_text(*, rect)

page.get_text(
    *,
    rect: Rect | None = None,
) -> str

Extracts plain text from the page.

rect

Rect | None

default:"None"

Restrict extraction to this rectangular region. Extracts the full page when None.

Returns: str — The extracted plain text.

full_text = page.get_text()

# Extract from a specific region only
region = Rect(0, 0, 300, 100)
header_text = page.get_text(rect=region)

Extract text blocks

Signature: page.get_text_blocks(*, rect, format)

page.get_text_blocks(
    *,
    rect: Rect | None = None,
    format: str = "list",
) -> list

Extracts structured text blocks with bounding boxes.

rect

Rect | None

default:"None"

Restrict extraction to this rectangular region. Extracts the full page when None.

format

str

default:"list"

Return format. "list" returns a list of TextBlock objects; "dict" returns a list of plain dictionaries with "text" and "rect" keys.

Returns: format="list" → list[TextBlock]; format="dict" → list[dict], each of the form {"text": "...", "rect": {"x0": ..., "y0": ..., "x1": ..., "y1": ...}}

blocks = page.get_text_blocks()
for block in blocks:
    print(block.text, block.rect)

# Return as dictionaries (convenient for JSON serialization)
dicts = page.get_text_blocks(format="dict")
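
The dictionary form can be passed straight to the json module. The sketch below uses hand-written sample data in the documented {"text", "rect"} shape instead of a real page, and it sorts blocks top-to-bottom, then left-to-right before serializing. That ordering is a simple heuristic chosen for illustration and will not reconstruct multi-column layouts.

```python
import json

# Sample data in the shape documented for format="dict" (values are made up).
blocks = [
    {"text": "Total: 42.00", "rect": {"x0": 72.0, "y0": 700.0, "x1": 200.0, "y1": 714.0}},
    {"text": "Invoice #1001", "rect": {"x0": 72.0, "y0": 72.0, "x1": 240.0, "y1": 90.0}},
]

# The origin is the top-left corner and y increases downward,
# so sorting by y0 and then x0 approximates reading order.
ordered = sorted(blocks, key=lambda b: (b["rect"]["y0"], b["rect"]["x0"]))

payload = json.dumps(ordered, indent=2)
restored = json.loads(payload)
```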

Text Search

Search text positions

Signature: page.search(query, *, match_case)

page.search(
    query: str,
    *,
    match_case: bool = False,
) -> list[Rect]

Searches the page for a text string and returns the bounding rectangles of all matches.

query

str

The text string to search for.

match_case

bool

default:"False"

Whether the search is case-sensitive. Case-insensitive by default.

Returns: list[Rect] — Bounding rectangles for each match. Returns an empty list if no matches are found.

hits = page.search("invoice")
for rect in hits:
    print(f"Match at {rect}")

# Case-sensitive search
hits = page.search("PDF", match_case=True)

Search text with context blocks

Signature: page.search_text_blocks(query, *, match_case)

page.search_text_blocks(
    query: str,
    *,
    match_case: bool = False,
) -> list[dict]

Searches for text and returns each match along with the surrounding text block for context.

query

str

The text string to search for.

match_case

bool

default:"False"

Whether the search is case-sensitive.

Returns: list[dict], each element contains:

Key	Type	Description
`"text"`	`str`	Full text content of the block containing the match.
`"rect"`	`Rect`	Bounding rectangle of the containing text block.
`"match_rect"`	`Rect`	Precise bounding rectangle of the matched keyword itself.

results = page.search_text_blocks("total amount")
for r in results:
    print(r["text"])        # full paragraph containing the keyword
    print(r["match_rect"])  # exact position of the keyword

Data Types

Refer to this section when you need to understand response structures (for example Rect, TextBlock, and Metadata) for downstream processing.

Rect

Represents a rectangular region. Coordinates are in PDF points (pt, where 1 pt = 1/72 inch). The coordinate system has its origin at the top-left corner of the page, with x increasing rightward and y increasing downward.

Rect(x0: float, y0: float, x1: float, y1: float)

Constructor Parameters

Parameter	Type	Description
`x0`	`float`	Left edge (x-coordinate of the top-left corner).
`y0`	`float`	Top edge (y-coordinate of the top-left corner).
`x1`	`float`	Right edge (x-coordinate of the bottom-right corner).
`y1`	`float`	Bottom edge (y-coordinate of the bottom-right corner).

Core properties (common)

Property	Type	Description
`x0`	`float`	Left edge.
`y0`	`float`	Top edge.
`x1`	`float`	Right edge.
`y1`	`float`	Bottom edge.
`width`	`float`	Rectangle width, equal to `x1 - x0`.
`height`	`float`	Rectangle height, equal to `y1 - y0`.

All geometric operations return new Rect instances — the original is immutable.

r = Rect(10, 20, 200, 300)
print(r.width)    # 190.0
print(r.height)   # 280.0

# Containment check
print(r.contains(Rect(50, 50, 100, 100)))  # True
print(r.contains((50, 50)))                # True (point)

# Intersection
a = Rect(0, 0, 100, 100)
b = Rect(50, 50, 150, 150)
print(a.intersect(b))  # Rect(50, 50, 100, 100)

# Unpack
x0, y0, x1, y1 = r
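
To make the geometry concrete, the following stand-in class reproduces the behaviour shown in the example above with a frozen dataclass. `SketchRect` is purely illustrative and is not sopdf's implementation. In particular, returning None for rectangles that do not overlap is an assumption, as the behaviour of intersect() in that case is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    """Immutable rectangle with a top-left origin (illustrative stand-in)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    def __iter__(self):
        # Allows unpacking: x0, y0, x1, y1 = rect
        return iter((self.x0, self.y0, self.x1, self.y1))

    def contains(self, other) -> bool:
        """Accept either another SketchRect or an (x, y) point."""
        if isinstance(other, SketchRect):
            return (self.x0 <= other.x0 and self.y0 <= other.y0
                    and other.x1 <= self.x1 and other.y1 <= self.y1)
        x, y = other
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1

    def intersect(self, other):
        """Return the overlapping region as a new rectangle, or None."""
        x0, y0 = max(self.x0, other.x0), max(self.y0, other.y0)
        x1, y1 = min(self.x1, other.x1), min(self.y1, other.y1)
        if x0 >= x1 or y0 >= y1:
            return None  # assumption: no overlap yields None in this sketch
        return SketchRect(x0, y0, x1, y1)


a = SketchRect(0, 0, 100, 100)
b = SketchRect(50, 50, 150, 150)
overlap = a.intersect(b)
```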

TextBlock

Represents a single block of text on a page, together with its bounding box.

TextBlock(text: str, rect: Rect)

Attribute / Method	Type	Description
`text`	`str`	The text content of the block.
`rect`	`Rect`	Bounding rectangle of the block on the page.
`to_dict()`	`dict`	Converts to `{"text": ..., "rect": {"x0": ..., "y0": ..., "x1": ..., "y1": ...}}`.

blocks = page.get_text_blocks()
for block in blocks:
    print(block.text)
    print(block.rect.width, block.rect.height)
    print(block.to_dict())

Metadata

Read/write proxy for the PDF Document Info dictionary. Obtained via doc.metadata; never constructed directly.

Read path (zero pikepdf cost): each property calls pypdfium2.get_metadata_dict() after auto-syncing.

Write path (lazy pikepdf init): each setter calls _ensure_pike(), writes to pike_doc.docinfo, and marks the document dirty. The next read auto-syncs.

Core fields (common)

Property	Type	Description
`title`	`str | None`	Document title (`/Title`). Read/write.
`author`	`str | None`	Author name (`/Author`). Read/write.
`subject`	`str | None`	Document subject (`/Subject`). Read/write.
`keywords`	`str | None`	Search keywords (`/Keywords`). Read/write.

PDF date string format: D:YYYYMMDDHHmmSSOHH'mm' (prefix D: and timezone optional).

with sopdf.open("report.pdf") as doc:
    meta = doc.metadata

    # Read individual fields
    print(meta.title)
    print(meta.creation_datetime)   # datetime(2024, 1, 1, 12, 0, tzinfo=...)

    # Write
    meta.title  = "New Title"
    meta.author = "Kevin Qiu"
    doc.save("updated.pdf")

    # Dict-style read (backward compat)
    d = meta.to_dict()
    print(d["title"])
    print(meta["author"])
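
The PDF date string format noted above can be converted to a Python datetime with a small parser. The sketch below is one possible reading of that format, not the code sopdf uses internally. `parse_pdf_date` is a hypothetical function, and treating missing trailing fields as their lowest value (month 1, day 1, time 00:00:00) is an assumption of this sketch.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSSOHH'mm' where everything after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tz_hour>\d{2})'?(?:(?P<tz_minute>\d{2})'?)?)?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string into a datetime (naive if no timezone is given)."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = match.groupdict()

    tz = None
    sign = parts["sign"]
    if sign == "Z":
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(
            hours=int(parts["tz_hour"] or 0),
            minutes=int(parts["tz_minute"] or 0),
        )
        tz = timezone(offset if sign == "+" else -offset)

    return datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
        tzinfo=tz,
    )


with_offset = parse_pdf_date("D:20240101120000+08'00'")
date_only = parse_pdf_date("20240101")
```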

OutlineItem

An immutable bookmark node in the document outline.

@dataclass(frozen=True)
class OutlineItem:
    title:    str
    page:     int                          # 0-based; -1 = no destination page
    level:    int                          # 0 = top-level
    children: tuple[OutlineItem, ...] = ()

Attribute / Method	Type	Description
`title`	`str`	Bookmark label as displayed in the reader TOC panel.
`page`	`int`	0-based target page index; `-1` when the item has no destination.
`level`	`int`	Nesting depth; `0` = top-level item.
`children`	`tuple[OutlineItem, ...]`	Nested child items (frozen tuple).
`to_dict()`	`dict`	Serialize to a plain dict (recursive).

Outline

Read-only bookmark tree manager. Obtained via doc.outline — never constructed directly. The tree is built once on first access using pypdfium2’s TOC data — no pikepdf initialisation needed.

Member	Returns	Description
`items`	`list[OutlineItem]`	Top-level outline items (each may have nested `children`).
`to_list()`	`list[dict]`	Flat DFS traversal. Each entry: `{"level": int, "title": str, "page": int}`. Compatible with PyMuPDF `get_toc()` output.
`len(outline)`	`int`	Total number of nodes across all nesting levels.
`iter(outline)`	—	Iterate over top-level items.
`bool(outline)`	`bool`	`True` when the document has at least one outline item.

with sopdf.open("textbook.pdf") as doc:
    outline = doc.outline
    print(outline)          # Outline(top_level=2, total=4)
    print(bool(outline))    # True

    # Recursive tree traversal
    def print_tree(items, indent=0):
        for item in items:
            print("  " * indent + f"[p{item.page + 1}] {item.title}")
            print_tree(item.children, indent + 1)

    print_tree(outline.items)

    # Flat list (PyMuPDF-compatible)
    for row in outline.to_list():
        print(f"{'  ' * row['level']}{row['title']}  →  p{row['page'] + 1}")
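
The flat list produced by to_list() corresponds to a depth-first walk of the tree. The sketch below shows that traversal on a stand-in dataclass that copies the OutlineItem fields listed above. `Item` and `flatten` are illustrative names and this is not the library's own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Stand-in with the same fields as the documented OutlineItem."""
    title: str
    page: int
    level: int
    children: "tuple[Item, ...]" = ()


def flatten(items):
    """Depth-first traversal: each parent row is followed by its descendants."""
    rows = []
    for item in items:
        rows.append({"level": item.level, "title": item.title, "page": item.page})
        rows.extend(flatten(item.children))
    return rows


tree = [
    Item("Chapter 1", 0, 0, (Item("1.1 Basics", 1, 1), Item("1.2 Details", 4, 1))),
    Item("Chapter 2", 9, 0),
]

rows = flatten(tree)
total_nodes = len(rows)  # counts every node across all nesting levels
```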

Exceptions

Read this section first when integrating PDF processing into production services and designing robust error handling. All exceptions inherit from PDFError, which inherits from the built-in RuntimeError.

RuntimeError
└── PDFError
    ├── PasswordError
    ├── FileDataError
    └── PageError

Exception	When Raised	Handling Guidance	Recoverable
`PDFError`	Base class for all sopdf exceptions. Catch this to handle any sopdf error.	Use as a top-level fallback for logging and unified user messaging.	Depends on subtype
`PasswordError`	Opening an encrypted PDF with a missing or incorrect password.	Prompt again for password and limit retries.	Yes
`FileDataError`	PDF file is corrupted, has an invalid format, or cannot be parsed.	Ask user to re-upload or replace the source file.	No
`PageError`	Page index is out of range, or rotation is set to an invalid value (not 0/90/180/270).	Validate index/range and rotation before calling.	Yes

Recommended catch order: catch specific exceptions first (PasswordError / FileDataError / PageError), then PDFError as a final fallback.

import sopdf

try:
    doc = sopdf.open("file.pdf", password="wrong")
except sopdf.PasswordError:
    print("Incorrect password")
except sopdf.FileDataError:
    print("File is corrupted")
except sopdf.PDFError as e:
    print(f"PDF error: {e}")
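
The handling guidance for PasswordError (prompt again, limit retries) can be sketched as a bounded loop. To keep the example runnable without sopdf, it defines stand-in exception classes with the same hierarchy and takes the opening function as a parameter. `open_with_retries` and `fake_open` are hypothetical names introduced only for this illustration.

```python
from itertools import islice


class PDFError(RuntimeError):
    """Stand-in for sopdf.PDFError."""


class PasswordError(PDFError):
    """Stand-in for sopdf.PasswordError."""


def open_with_retries(opener, path, passwords, max_attempts=3):
    """Try candidate passwords in order, giving up after max_attempts."""
    for password in islice(passwords, max_attempts):
        try:
            return opener(path, password=password)
        except PasswordError:
            continue  # wrong password: move on to the next candidate
    raise PasswordError(f"no valid password within {max_attempts} attempts")


def fake_open(path, password=None):
    """Simulated opener that accepts a single known password."""
    if password != "hunter2":
        raise PasswordError("incorrect password")
    return f"<Document {path}>"


doc = open_with_retries(fake_open, "secure.pdf", ["guess", "hunter2"])
```

In real code the opener would be sopdf.open and the candidate passwords would typically come from user prompts rather than a fixed list.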
