Sync parsing - SoMark Docs

POST /parse/sync

Python

import json
import requests

url = "https://somark.tech/api/v1/parse/sync"

data = {
    "output_formats": ["markdown", "json"],
    "api_key": "sk-***",
    "element_formats": json.dumps({
        "image": "url",
        "formula": "latex",
        "table": "html",
        "cs": "image",
    }),
    "feature_config": json.dumps({
        "enable_text_cross_page": False,
        "enable_table_cross_page": False,
        "enable_title_level_recognition": False,
        "enable_inline_image": False,
        "enable_table_image": True,
        "enable_image_understanding": True,
        "keep_header_footer": False,
    }),
}

# Open the file in a context manager so the handle is closed after the upload.
with open("example.pdf", "rb") as f:
    files = {"file": ("example.pdf", f)}
    response = requests.post(url, data=data, files=files)

print(response.json())

{
  "code": 0,
  "message": "任务成功",
  "data": {
    "task_id": "a1b2c3d4e5f6",
    "error": null,
    "metadata": {
      "page_num": 10,
      "file_type": ".pdf"
    },
    "result": {
      "file_name": "document.pdf",
      "outputs": {
        "markdown": "# 第一章 引言\n\n本文档介绍了...",
        "json": {
          "pages": [
            {
              "page_num": 0,
              "blocks": [
                {
                  "idx": 0,
                  "type": "title",
                  "bbox": [
                    72,
                    50,
                    540,
                    80
                  ],
                  "content": "第一章 引言",
                  "format": "text",
                  "captions": [],
                  "img_url": "",
                  "title_level": 1
                },
                {
                  "idx": 1,
                  "type": "text",
                  "bbox": [
                    72,
                    100,
                    540,
                    200
                  ],
                  "content": "本文档介绍了...",
                  "format": "text",
                  "captions": [],
                  "img_url": ""
                }
              ],
              "page_size": {
                "h": 1684,
                "w": 1190
              },
              "merge_content_from_pre_page": false
            }
          ]
        }
      }
    }
  }
}
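The `json` output in the response above nests blocks under pages. The following is a minimal sketch of walking that structure, assuming real responses follow the example shape shown here. The names `iter_blocks` and `outline` and the trimmed `sample` dict are hypothetical helpers for illustration, not part of the SoMark API.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple


def iter_blocks(data: Dict[str, Any]) -> Iterator[Tuple[Optional[int], Dict[str, Any]]]:
    """Yield (page_num, block) pairs from the `data` object of a response."""
    outputs = data.get("result", {}).get("outputs", {})
    for page in outputs.get("json", {}).get("pages", []):
        for block in page.get("blocks", []):
            yield page.get("page_num"), block


def outline(data: Dict[str, Any]) -> List[Tuple[Optional[int], str]]:
    """Collect (title_level, content) for every block whose type is 'title'."""
    return [
        (block.get("title_level"), block.get("content", ""))
        for _, block in iter_blocks(data)
        if block.get("type") == "title"
    ]


# Trimmed copy of the `data` object from the example response (assumed shape).
sample = {
    "result": {
        "outputs": {
            "json": {
                "pages": [
                    {
                        "page_num": 0,
                        "blocks": [
                            {"idx": 0, "type": "title",
                             "content": "第一章 引言", "title_level": 1},
                            {"idx": 1, "type": "text",
                             "content": "本文档介绍了..."},
                        ],
                    }
                ]
            }
        }
    }
}

for level, text in outline(sample):
    print(level, text)
```

Using `.get` with defaults throughout means a response without a `json` output (for example, when only `markdown` was requested) yields no blocks instead of raising.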

Path change: This endpoint path has been changed from /extract/acc_sync to /parse/sync. The old path will be discontinued on December 31, 2026. Please migrate to the new path before then.

Parameter change: extract_config has been renamed to feature_config. Please replace extract_config with feature_config in your requests.
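For callers that still build requests in the old style, the two renames can be applied in one place. This is a minimal sketch, assuming the old endpoint lived under the same base URL as the new one (the page states only the path suffixes). `migrate_request` is a hypothetical helper, not something SoMark provides.

```python
from typing import Any, Dict, Tuple

OLD_PATH = "/extract/acc_sync"
NEW_PATH = "/parse/sync"


def migrate_request(url: str, data: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (url, data) rewritten to the new path and parameter name."""
    # Swap the path suffix only; the base URL is left untouched.
    if url.endswith(OLD_PATH):
        url = url[: -len(OLD_PATH)] + NEW_PATH

    # Copy so the caller's dict is not mutated.
    migrated = dict(data)
    if "extract_config" in migrated:
        legacy = migrated.pop("extract_config")
        # If both keys are present, the new name wins.
        migrated.setdefault("feature_config", legacy)
    return url, migrated


# Assumed full legacy URL, for illustration only.
new_url, new_data = migrate_request(
    "https://somark.tech/api/v1/extract/acc_sync",
    {"api_key": "sk-***", "extract_config": "{}"},
)
print(new_url)
print(sorted(new_data))
```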

output_formats

Available output formats	Default	Description
`json` / `markdown` / `zip`	`["markdown", "json"]`	Multiple selections supported. Uses the default when omitted. `zip` packages the Markdown output and all image files into an archive. When `output_formats` includes `zip`, `element_formats.image` must be `file`

element_formats

Field	Available values	Default	Description
`image`	`url` / `base64` / `file` / `none`	`url`	Single selection only. When `image` is set to `file`, `output_formats` must include `zip`. `none` means images are not returned
`formula`	`latex` / `mathml` / `ascii`	`latex`	Single selection only. Specifies the output format for formulas
`table`	`markdown` / `html` / `image`	`html`	Single selection only. In `markdown` mode, merged cells are automatically split into independent cells and filled with the same content
`cs`	`image`	`image`	Single selection only. Output format for chemical structures; `smiles` format is coming soon
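The tables above imply a pairing rule: `zip` output and `element_formats.image = "file"` must be used together. The following is a sketch of a client-side check run before sending, assuming the allowed values are exactly those listed in the tables. `build_format_fields` and the constant names are hypothetical, and the server may validate differently.

```python
import json
from typing import Any, Dict, Iterable, Optional

OUTPUT_FORMATS = {"json", "markdown", "zip"}
ELEMENT_DEFAULTS = {"image": "url", "formula": "latex", "table": "html", "cs": "image"}
ELEMENT_ALLOWED = {
    "image": {"url", "base64", "file", "none"},
    "formula": {"latex", "mathml", "ascii"},
    "table": {"markdown", "html", "image"},
    "cs": {"image"},
}


def build_format_fields(
    output_formats: Optional[Iterable[str]] = None,
    element_formats: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Validate format choices and return form fields for the request."""
    outputs = list(output_formats) if output_formats else ["markdown", "json"]
    elements = {**ELEMENT_DEFAULTS, **(element_formats or {})}

    unknown = set(outputs) - OUTPUT_FORMATS
    if unknown:
        raise ValueError(f"unknown output formats: {sorted(unknown)}")
    for field, value in elements.items():
        if field not in ELEMENT_ALLOWED:
            raise ValueError(f"unknown element field: {field}")
        if value not in ELEMENT_ALLOWED[field]:
            raise ValueError(f"{field!r} does not accept {value!r}")

    # zip output and image="file" are only valid as a pair.
    if ("zip" in outputs) != (elements["image"] == "file"):
        raise ValueError("'zip' output and image='file' must be used together")

    # element_formats is sent as a JSON string, as in the request example.
    return {"output_formats": outputs, "element_formats": json.dumps(elements)}


fields = build_format_fields(["markdown", "zip"], {"image": "file"})
print(fields["output_formats"])
```

The returned dict can be merged into the `data` argument of `requests.post` alongside `api_key`.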

feature_config

Field	Default	Description
`enable_text_cross_page`	`false`	Cross-page text merging: merge text blocks spanning pages into continuous paragraphs
`enable_table_cross_page`	`false`	Cross-page table merging: merge tables spanning pages into a single table
`enable_title_level_recognition`	`false`	Heading level recognition: detect document heading hierarchy (H1/H2/H3…)
`enable_inline_image`	`false`	Inline images: return images inside text paragraphs
`enable_table_image`	`true`	Images in tables: return images inside table cells
`enable_image_understanding`	`true`	Image understanding: perform semantic understanding and structured description of document images
`keep_header_footer`	`false`	Keep headers and footers: headers and footers are filtered by default; enable this if you need to preserve them
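Since every flag has a documented default, a request only needs to state what it changes. The following is a minimal sketch that merges overrides onto the defaults from the table above and serializes the result as a JSON string, as the request example does. `build_feature_config` is a hypothetical helper, and sending the full set of flags rather than only the overrides is a design choice made here, not a documented requirement.

```python
import json

FEATURE_DEFAULTS = {
    "enable_text_cross_page": False,
    "enable_table_cross_page": False,
    "enable_title_level_recognition": False,
    "enable_inline_image": False,
    "enable_table_image": True,
    "enable_image_understanding": True,
    "keep_header_footer": False,
}


def build_feature_config(**overrides: bool) -> str:
    """Merge overrides onto the documented defaults and return a JSON string."""
    unknown = set(overrides) - set(FEATURE_DEFAULTS)
    if unknown:
        # Catch typos locally instead of sending an unrecognized flag.
        raise KeyError(f"unknown feature flags: {sorted(unknown)}")
    for name, value in overrides.items():
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be a bool, got {type(value).__name__}")
    return json.dumps({**FEATURE_DEFAULTS, **overrides})


# Turn on cross-page merging and heading detection; keep everything else default.
feature_config = build_feature_config(
    enable_text_cross_page=True,
    enable_table_cross_page=True,
    enable_title_level_recognition=True,
)
print(feature_config)
```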

If you need auth, usage limits, or sync vs async guidance, read the API overview first. For large files and batch jobs, switch to Async parsing — Submit Task.

Body

multipart/form-data

file

required

The file to parse. Supported formats: PDF, images, Word, PPT, and Excel

api_key

string

required

API key, in the format sk-***

output_formats

enum<string>[]

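Output formats; multiple values may be selected. Defaults to ["markdown", "json"] when omitted. Supported values are json / markdown / zip, where zip packages all output files into an archive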

Available options: json, markdown, zip

element_formats

object

Element format configuration; controls the output format of each element type

feature_config

object

Feature configuration (this parameter has been renamed from extract_config to feature_config)

Response

200 - application/json

Parsing succeeded

code

integer

Status code. 0 indicates success; for non-zero values, see Error Codes

Example:

0

message

string

Example:

"任务成功"

data

object
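Because a non-zero `code` signals failure, callers usually check the envelope before touching `data`. The following is a sketch of that check, assuming every response carries the `code` / `message` / `data` fields described above. `SoMarkError` and `unwrap` are hypothetical names, and the failing payload in the usage lines is made up for illustration (see Error Codes for the real values).

```python
from typing import Any, Dict


class SoMarkError(Exception):
    """Hypothetical exception raised when the response code is not 0."""

    def __init__(self, code: Any, message: str) -> None:
        super().__init__(f"SoMark request failed (code={code}): {message}")
        self.code = code
        self.message = message


def unwrap(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the `data` object on success, raise SoMarkError otherwise."""
    code = payload.get("code")
    if code != 0:
        raise SoMarkError(code, payload.get("message", ""))
    return payload.get("data") or {}


ok = unwrap({"code": 0, "message": "任务成功", "data": {"task_id": "a1b2c3d4e5f6"}})
print(ok["task_id"])

try:
    # Illustrative failure payload; the code and message here are invented.
    unwrap({"code": 1, "message": "example failure"})
except SoMarkError as exc:
    print(exc.code, exc.message)
```

In a real client, `payload` would be `response.json()` from the request example.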
