POST /parse/sync
Python
import json
import requests

url = "https://somark.tech/api/v1/parse/sync"

data = {
    "output_formats": ["markdown", "json"],
    "api_key": "sk-***",
    "element_formats": json.dumps({
        "image": "url",
        "formula": "latex",
        "table": "html",
        "cs": "image",
    }),
    "feature_config": json.dumps({
        "enable_text_cross_page": False,
        "enable_table_cross_page": False,
        "enable_title_level_recognition": False,
        "enable_inline_image": True,
        "enable_table_image": True,
        "enable_image_understanding": True,
        "keep_header_footer": False,
    }),
}

# Open the file in a context manager so the handle is closed after the upload.
with open("example.pdf", "rb") as f:
    files = {"file": ("example.pdf", f)}
    response = requests.post(url, data=data, files=files)

print(response.json())
{
  "code": 0,
  "message": "任务成功",
  "data": {
    "task_id": "a1b2c3d4e5f6",
    "error": null,
    "metadata": {
      "page_num": 10,
      "file_type": ".pdf"
    },
    "result": {
      "file_name": "document.pdf",
      "outputs": {
        "markdown": "# 第一章 引言\n\n本文档介绍了...",
        "json": {
          "pages": [
            {
              "page_num": 0,
              "blocks": [
                {
                  "idx": 0,
                  "type": "title",
                  "bbox": [
                    72,
                    50,
                    540,
                    80
                  ],
                  "content": "第一章 引言",
                  "format": "text",
                  "captions": [],
                  "img_url": "",
                  "title_level": 1
                },
                {
                  "idx": 1,
                  "type": "text",
                  "bbox": [
                    72,
                    100,
                    540,
                    200
                  ],
                  "content": "本文档介绍了...",
                  "format": "text",
                  "captions": [],
                  "img_url": ""
                }
              ],
              "page_size": {
                "h": 1684,
                "w": 1190
              },
              "merge_content_from_pre_page": false
            }
          ]
        }
      }
    }
  }
}
Path change: The endpoint path has changed from /extract/acc_sync to /parse/sync. The old path will be discontinued on December 31, 2026; migrate to the new path before then.

Parameter change: extract_config has been renamed to feature_config. Replace extract_config with feature_config in your requests.
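The two changes above can be applied mechanically to existing client code. The following is a minimal sketch, not an official migration tool: `migrate_request` is a hypothetical helper, and the full legacy URL in the usage note is an assumption built by combining the base URL from the sample request with the old path.

```python
OLD_PATH = "/extract/acc_sync"
NEW_PATH = "/parse/sync"


def migrate_request(url: str, form: dict) -> tuple[str, dict]:
    """Return a copy of (url, form) using the new path and parameter name."""
    new_url = url.replace(OLD_PATH, NEW_PATH)
    new_form = dict(form)  # do not mutate the caller's dict
    if "extract_config" in new_form:
        legacy = new_form.pop("extract_config")
        # Keep an explicit feature_config if the caller already set one.
        new_form.setdefault("feature_config", legacy)
    return new_url, new_form
```

For example, calling `migrate_request("https://somark.tech/api/v1/extract/acc_sync", {"api_key": "sk-***", "extract_config": "{}"})` yields the `/parse/sync` URL and a form whose only config key is `feature_config`.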
output_formats

| Available output formats | Default | Description |
| --- | --- | --- |
| json / markdown / zip | ["markdown", "json"] | Multiple selections supported. Uses the default when omitted. zip packages the Markdown output and all image files into an archive. When output_formats includes zip, element_formats.image must be file |

element_formats
| Field | Available output formats | Default | Description |
| --- | --- | --- | --- |
| image | url / base64 / file / none | url | Single selection only. When image is set to file, output_formats must include zip. none means images are not returned |
| formula | latex / mathml / ascii | latex | Single selection only. Specifies the output format for formulas |
| table | markdown / html / image | html | Single selection only. In markdown mode, merged cells are automatically split into independent cells and filled with the same content |
| cs | image | image | Single selection only. Output format for chemical structures; smiles format is coming soon |
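The zip and image=file options depend on each other in both directions, so it can help to check the pairing on the client before sending a request. The sketch below assumes nothing about how the server reports a mismatch; `check_formats` is a hypothetical helper and the message wording is illustrative.

```python
def check_formats(output_formats, element_formats):
    """Return a list of problems with the zip / image=file pairing."""
    problems = []
    wants_zip = "zip" in output_formats
    # The documented default for element_formats.image is "url".
    image = element_formats.get("image", "url")
    if wants_zip and image != "file":
        problems.append(
            "output_formats includes zip, so element_formats.image must be file"
        )
    if image == "file" and not wants_zip:
        problems.append(
            "element_formats.image is file, so output_formats must include zip"
        )
    return problems
```

An empty list means the combination satisfies the documented rule; for instance `check_formats(["zip"], {"image": "file"})` reports no problems, while `check_formats(["zip"], {})` reports one because image falls back to url.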

feature_config

| Field | Default | Description |
| --- | --- | --- |
| enable_text_cross_page | false | Cross-page text merging: merge text blocks spanning pages into continuous paragraphs |
| enable_table_cross_page | false | Cross-page table merging: merge tables spanning pages into a single table |
| enable_title_level_recognition | false | Heading level recognition: detect document heading hierarchy (H1/H2/H3…) |
| enable_inline_image | true | Inline images: return images inside text paragraphs |
| enable_table_image | true | Images in tables: return images inside table cells |
| enable_image_understanding | true | Image understanding: perform semantic understanding and structured description of document images |
| keep_header_footer | false | Keep headers and footers: headers and footers are filtered by default; enable this if you need to preserve them |
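Because feature_config is sent as a JSON-encoded string in the sample request, one convenient pattern is to start from the documented defaults and override only the flags you need. This is a sketch under that assumption; `FEATURE_DEFAULTS` and `build_feature_config` are hypothetical names, and rejecting unknown fields is a client-side choice rather than documented server behaviour.

```python
import json

# Defaults copied from the feature_config table.
FEATURE_DEFAULTS = {
    "enable_text_cross_page": False,
    "enable_table_cross_page": False,
    "enable_title_level_recognition": False,
    "enable_inline_image": True,
    "enable_table_image": True,
    "enable_image_understanding": True,
    "keep_header_footer": False,
}


def build_feature_config(**overrides) -> str:
    """Merge overrides into the defaults and return a JSON string."""
    unknown = set(overrides) - set(FEATURE_DEFAULTS)
    if unknown:
        # Catch typos locally instead of sending an unrecognised flag.
        raise ValueError(f"unknown feature_config fields: {sorted(unknown)}")
    return json.dumps({**FEATURE_DEFAULTS, **overrides})
```

The returned string can be passed directly as the `feature_config` form field, for example `build_feature_config(enable_table_cross_page=True, keep_header_footer=True)`.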

Body

multipart/form-data
file
file
required

The file to parse. Supports PDF, image, and Office formats.

api_key
string
required

API key, in the format sk-***

output_formats
enum<string>[]

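Output formats; multiple selections supported. Defaults to ["markdown", "json"] when omitted. Supports json / markdown / zip, where zip packages all output files into an archive.

Available options: json, markdown, zip

element_formats
object

Element format configuration; controls the output format of each element type.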

feature_config
object

Feature configuration (this parameter was renamed from extract_config to feature_config).

Response

200 - application/json

Parsing succeeded

code
integer

Status code. 0 indicates success; any non-zero value is an error code.

Example:

0

message
string
Example:

"任务成功"

data
object
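A client will typically check code before reading data. The sketch below follows the field layout of the example response on this page (data.result.outputs.markdown and data.result.outputs.json.pages); `extract_markdown` and `iter_blocks` are hypothetical helpers, and the shape of an error response is an assumption, since only the success case is documented here.

```python
def extract_markdown(payload: dict) -> str:
    """Return the Markdown output, or raise if code is non-zero."""
    if payload.get("code") != 0:
        raise RuntimeError(
            f"parse failed: code={payload.get('code')} "
            f"message={payload.get('message')}"
        )
    return payload["data"]["result"]["outputs"]["markdown"]


def iter_blocks(payload: dict):
    """Yield (page_num, block type, content) for every block in the JSON output."""
    pages = payload["data"]["result"]["outputs"]["json"]["pages"]
    for page in pages:
        for block in page["blocks"]:
            yield page["page_num"], block["type"], block["content"]
```

With the response from the sample request, `extract_markdown(response.json())` returns the Markdown string, and `list(iter_blocks(response.json()))` gives one tuple per block in reading order.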