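name: deepread
description: OCR that never fails silently. A multi-pass document processing API with intelligent quality-review flags. Extracts text and structured data from PDFs using AI-driven confidence scoring. Free tier: 2,000 pages per month.
OCR that never fails silently. Process PDFs and extract structured data with AI-driven confidence scoring that tells you exactly which fields need human review.
DeepRead is a production-grade document processing API that uses intelligent quality assessment to cut the human review requirement from 100% to roughly 10%.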
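Core features:
- Text extraction: convert PDFs to clean Markdown
- Structured data: extract JSON fields with confidence scoring
- Quality flags: the AI decides which fields need human verification (hil_flag)
- Multi-pass processing: several validation passes for maximum accuracy
- Multi-model consensus: cross-validation between models for reliability
- Free tier: 2,000 pages per month (no credit card required)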
Sign up and create an API key:
# Open the dashboard
https://www.deepread.tech/dashboard
# Or use this direct link
https://www.deepread.tech/dashboard/?utm_source=clawdhub
Save your API key:
export DEEPREAD_API_KEY="sk_live_your_key_here"
Add it to your clawdbot.config.json5 file:
{
skills: {
entries: {
"deepread": {
enabled: true,
apiKey: "sk_live_your_key_here"
}
}
}
}
Option A: Webhook (recommended)
# Upload a PDF and register a webhook for notification
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@document.pdf" \
-F "webhook_url=https://your-app.com/webhooks/deepread"
# Returns immediately
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"status": "queued"
}
# When processing finishes (2-5 minutes), your webhook receives the result
Option B: Poll for the result
# Upload a PDF without a webhook
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@document.pdf"
# Returns immediately
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"status": "queued"
}
# Poll until processing completes
curl https://api.deepread.tech/v1/jobs/550e8400-e29b-41d4-a716-446655440000 \
-H "X-API-Key: $DEEPREAD_API_KEY"
Extract text as clean Markdown:
# With a webhook (recommended)
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@invoice.pdf" \
-F "webhook_url=https://your-app.com/webhook"
# Or poll until the job completes
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@invoice.pdf"
# Then poll
curl https://api.deepread.tech/v1/jobs/JOB_ID \
-H "X-API-Key: $DEEPREAD_API_KEY"
Response once processing completes:
{
"id": "550e8400-...",
"status": "completed",
"result": {
"text": "# Invoice\n\n**Vendor:** Acme Corp\n**Total:** $1,250.00..."
}
}
Extract specific fields with confidence scoring:
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@invoice.pdf" \
-F 'schema={
"type": "object",
"properties": {
"vendor": {
"type": "string",
"description": "Vendor company name"
},
"total": {
"type": "number",
"description": "Invoice total amount"
},
"invoice_date": {
"type": "string",
"description": "Invoice date in MM/DD/YYYY format"
}
}
}'
The response includes confidence flags:
{
"status": "completed",
"result": {
"text": "# Invoice\n\n**Vendor:** Acme Corp...",
"data": {
"vendor": {
"value": "Acme Corp",
"hil_flag": false,
"found_on_page": 1
},
"total": {
"value": 1250.00,
"hil_flag": false,
"found_on_page": 1
},
"invoice_date": {
"value": "2024-10-??",
"hil_flag": true,
"reason": "Date is partially illegible",
"found_on_page": 1
}
},
"metadata": {
"fields_requiring_review": 1,
"total_fields": 3,
"review_percentage": 33.3
}
}
}
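The review metadata in the response above can be recomputed client-side from the per-field flags. The following is a minimal sketch, assuming `result.data` has the shape shown in the sample; `summarize_review` is a hypothetical helper, not part of the DeepRead API.

```python
def summarize_review(data):
    """Count flagged fields in a `result.data` mapping shaped like the sample response."""
    total = len(data)
    # Treat a missing hil_flag as "not flagged"
    flagged = [name for name, field in data.items() if field.get("hil_flag")]
    percentage = round(100 * len(flagged) / total, 1) if total else 0.0
    return {
        "fields_requiring_review": len(flagged),
        "total_fields": total,
        "review_percentage": percentage,
        "flagged_fields": flagged,
    }


# Same three fields as the sample response
sample = {
    "vendor": {"value": "Acme Corp", "hil_flag": False, "found_on_page": 1},
    "total": {"value": 1250.00, "hil_flag": False, "found_on_page": 1},
    "invoice_date": {"value": "2024-10-??", "hil_flag": True, "found_on_page": 1},
}
summary = summarize_review(sample)
```

For the sample, one of three fields is flagged, which matches the `metadata` block in the response.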
Extract arrays and nested objects:
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@invoice.pdf" \
-F 'schema={
"type": "object",
"properties": {
"vendor": {"type": "string"},
"total": {"type": "number"},
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": {"type": "string"},
"quantity": {"type": "number"},
"price": {"type": "number"}
}
}
}
}
}'
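Hand-quoting a nested schema inside a shell string is easy to get wrong. One option is to build the same schema as a Python dictionary and serialize it with `json.dumps`; this is only a sketch, and the variable names `line_item`, `schema` and `schema_field` are illustrative.

```python
import json

# Schema for one entry of the line_items array
line_item = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "number"},
        "price": {"type": "number"},
    },
}

# Same structure as the inline schema in the curl example
schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {"type": "array", "items": line_item},
    },
}

# String to send as the `schema` form field
schema_field = json.dumps(schema)
```

With the `requests` library, the string could then be sent as `data={"schema": schema_field}` alongside `files={"file": ...}`, mirroring the two `-F` arguments in the curl call.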
Get per-page OCR results with quality flags:
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@contract.pdf" \
-F "include_pages=true"
Response:
{
"result": {
"text": "Combined text from all pages...",
"pages": [
{
"page_number": 1,
"text": "# Contract Agreement\n\n...",
"hil_flag": false
},
{
"page_number": 2,
"text": "Terms and condi??...",
"hil_flag": true,
"reason": "Multiple unrecognized characters"
}
],
"metadata": {
"pages_requiring_review": 1,
"total_pages": 2
}
}
}
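To route only the problem pages to a reviewer, the `pages` array can be filtered on `hil_flag`. A minimal sketch, assuming the per-page shape shown in the response; `pages_needing_review` is a hypothetical helper name.

```python
def pages_needing_review(result):
    """Return (page_number, reason) pairs for pages whose hil_flag is set."""
    return [
        (page["page_number"], page.get("reason"))
        for page in result.get("pages", [])
        if page.get("hil_flag")
    ]


# Trimmed-down copy of the two-page sample
sample_result = {
    "pages": [
        {"page_number": 1, "text": "...", "hil_flag": False},
        {"page_number": 2, "text": "...", "hil_flag": True,
         "reason": "Multiple unrecognized characters"},
    ]
}
flagged_pages = pages_needing_review(sample_result)
```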
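PDF → Convert → Rotation correction → OCR → Multi-model validation → Extraction → Done
The pipeline automatically handles:
- Document rotation and orientation correction
- Multi-pass validation for accuracy
- Cross-model consensus for reliability
- Field-level confidence scoring
The AI compares the extracted text against the original image and sets hil_flag:
- hil_flag: false = clear, confident extraction → process automatically
- hil_flag: true = uncertain extraction → needs human review
The AI flags an extraction when:
- The text is handwritten, blurry, or low quality
- Multiple interpretations are possible
- Characters are partially visible or unclear
- The field was not found in the document
This is a multimodal AI judgment, not a rule-based one.
Create reusable, optimized schemas for specific document types:
# List your blueprints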
curl https://api.deepread.tech/v1/blueprints \
-H "X-API-Key: $DEEPREAD_API_KEY"
# Use a blueprint instead of an inline schema
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@invoice.pdf" \
-F "blueprint_id=660e8400-e29b-41d4-a716-446655440001"
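Benefits:
- 20-30% higher accuracy than a baseline schema
- Reusable across similar documents
- Supports versioning and rollback
How to create a blueprint:
# Create a blueprint from training data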
curl -X POST https://api.deepread.tech/v1/optimize \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "utility_invoice",
"description": "Optimized for utility invoices",
"document_type": "invoice",
"initial_schema": {
"type": "object",
"properties": {
"vendor": {"type": "string", "description": "Vendor name"},
"total": {"type": "number", "description": "Total amount"}
}
},
"training_documents": ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
"ground_truth_data": [
{"vendor": "Acme Power", "total": 125.50},
{"vendor": "City Electric", "total": 89.25}
],
"target_accuracy": 95.0,
"max_iterations": 5
}'
# Returns: {"job_id": "...", "blueprint_id": "...", "status": "pending"}
# Check optimization status
curl https://api.deepread.tech/v1/blueprints/jobs/JOB_ID \
-H "X-API-Key: $DEEPREAD_API_KEY"
# Use the blueprint (once optimization completes)
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@invoice.pdf" \
-F "blueprint_id=BLUEPRINT_ID"
Get notified when processing completes, with no polling:
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@invoice.pdf" \
-F "webhook_url=https://your-app.com/webhooks/deepread"
When processing completes, your webhook receives this payload:
{
"job_id": "550e8400-...",
"status": "completed",
"created_at": "2025-01-27T10:00:00Z",
"completed_at": "2025-01-27T10:02:30Z",
"result": {
"text": "...",
"data": {...}
},
"preview_url": "https://preview.deepread.tech/abc1234"
}
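On the receiving side, the handler mostly needs to decode the body and pull out the job identifier, status and extracted data. The sketch below assumes the payload shape shown above; `parse_webhook` is a hypothetical helper. How webhook requests are authenticated, and whether failed jobs also trigger a webhook, is not described here, so neither is handled.

```python
import json


def parse_webhook(body: bytes):
    """Decode a webhook body and return (job_id, status, data)."""
    payload = json.loads(body)
    # `result` may be absent for non-completed jobs (an assumption)
    result = payload.get("result") or {}
    return payload["job_id"], payload["status"], result.get("data", {})


# Example body modelled on the sample payload
example_body = json.dumps({
    "job_id": "550e8400-...",
    "status": "completed",
    "result": {"text": "...", "data": {"vendor": {"value": "Acme Corp", "hil_flag": False}}},
    "preview_url": "https://preview.deepread.tech/abc1234",
}).encode("utf-8")

job_id, status, data = parse_webhook(example_body)
```

In a web framework, `parse_webhook` would be called with the raw request body inside the route registered as `webhook_url`.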
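Benefits:
- No polling required
- Immediate notification on completion
- Lower latency
- Better suited to production workflows
Share OCR results without authentication:
# Request a preview URL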
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@document.pdf" \
-F "include_images=true"
# The response includes the preview URL
{
"result": {
"text": "...",
"data": {...}
},
"preview_url": "https://preview.deepread.tech/Xy9aB12"
}
Public preview endpoint:
# No authentication required
curl https://api.deepread.tech/v1/preview/Xy9aB12
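In the examples above, the token at the end of `preview_url` (`Xy9aB12`) is the same one used in the public endpoint path. Assuming that always holds, which the examples suggest but do not state, the endpoint can be derived from the preview URL. `public_preview_endpoint` is a hypothetical helper.

```python
from urllib.parse import urlparse


def public_preview_endpoint(preview_url):
    """Build the public API endpoint from a preview URL.

    Assumes the last path segment of preview_url is the preview token.
    """
    token = urlparse(preview_url).path.strip("/").split("/")[-1]
    return f"https://api.deepread.tech/v1/preview/{token}"


endpoint = public_preview_endpoint("https://preview.deepread.tech/Xy9aB12")
```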
Upgrade: https://www.deepread.tech/dashboard/billing?utm_source=clawdhub
Every response includes quota information in these headers:
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1847
X-RateLimit-Used: 153
X-RateLimit-Reset: 1730419200
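A client can read these headers to decide whether to keep submitting documents. The sketch below assumes `X-RateLimit-Reset` is a Unix timestamp in seconds, which the sample value is consistent with but which is not stated explicitly; `parse_quota` is a hypothetical helper.

```python
from datetime import datetime, timezone


def parse_quota(headers):
    """Convert the X-RateLimit-* headers into integers and a UTC datetime.

    `headers` is any mapping; with a plain dict the names must match exactly,
    while HTTP client libraries usually provide case-insensitive lookups.
    """
    return {
        "limit": int(headers["X-RateLimit-Limit"]),
        "remaining": int(headers["X-RateLimit-Remaining"]),
        "used": int(headers["X-RateLimit-Used"]),
        # Assumption: the reset value is seconds since the Unix epoch
        "resets_at": datetime.fromtimestamp(
            int(headers["X-RateLimit-Reset"]), tz=timezone.utc
        ),
    }


quota = parse_quota({
    "X-RateLimit-Limit": "2000",
    "X-RateLimit-Remaining": "1847",
    "X-RateLimit-Used": "153",
    "X-RateLimit-Reset": "1730419200",
})
```

Under that assumption, the sample reset value corresponds to 2024-11-01 00:00 UTC, the start of a calendar month.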
✅ Recommended: webhook notifications
curl -X POST https://api.deepread.tech/v1/process \
-H "X-API-Key: $DEEPREAD_API_KEY" \
-F "file=@document.pdf" \
-F "webhook_url=https://your-app.com/webhook"
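Use polling only when:
- You are in a test or development environment
- You cannot expose a webhook endpoint
- You need a synchronous response
✅ Good: descriptive field descriptions
{
"vendor": {
"type": "string",
"description": "Vendor company name. Usually in the header or top-left corner of the invoice."
}
}
❌ Bad: no description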
{
"vendor": {"type": "string"}
}
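A schema can be checked for missing descriptions before it is sent. This is a small lint sketch, not a DeepRead feature; `fields_missing_description` is a hypothetical helper that walks `properties`, nested objects and array `items`.

```python
def fields_missing_description(schema, prefix=""):
    """Return dotted paths of schema properties that have no description."""
    missing = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if not spec.get("description"):
            missing.append(path)
        if spec.get("type") == "object":
            # Recurse into nested objects
            missing.extend(fields_missing_description(spec, f"{path}."))
        elif spec.get("type") == "array":
            # Recurse into the array's item schema, if it defines properties
            missing.extend(fields_missing_description(spec.get("items", {}), f"{path}[]."))
    return missing


example_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string", "description": "Vendor company name"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {"type": "object", "properties": {"quantity": {"type": "number"}}},
        },
    },
}
undocumented = fields_missing_description(example_schema)
```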
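Poll every 5-10 seconds, and only when webhooks are not an option:
import time

import requests


def wait_for_result(job_id, api_key):
    """Poll the job endpoint until the job completes or fails."""
    while True:
        response = requests.get(
            f"https://api.deepread.tech/v1/jobs/{job_id}",
            headers={"X-API-Key": api_key},
            timeout=30,
        )
        response.raise_for_status()  # surface HTTP errors early
        result = response.json()
        if result["status"] == "completed":
            return result["result"]
        elif result["status"] == "failed":
            raise RuntimeError(f"Job failed: {result.get('error')}")
        time.sleep(5)  # wait 5 seconds between polls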
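The loop above runs forever if a job never reaches a terminal state. A hedged variant adds an overall deadline and takes the status lookup as a callable, so the loop can be exercised without network access. `poll_until_done` and its parameters are illustrative names, and the 600-second default is an arbitrary choice rather than a documented limit.

```python
import time


def poll_until_done(fetch_status, interval=5.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the job completes, fails, or the deadline passes.

    fetch_status must return a dict with a "status" key, as the jobs endpoint does.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_status()
        if job["status"] == "completed":
            return job["result"]
        if job["status"] == "failed":
            raise RuntimeError(f"Job failed: {job.get('error')}")
        if clock() >= deadline:
            raise TimeoutError("Job did not finish before the deadline")
        sleep(interval)
```

In real use, `fetch_status` would wrap the GET request to `/v1/jobs/{job_id}`; in tests it can be a stub that returns canned statuses.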
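Handle confident fields separately from uncertain ones:
def process_extraction(data):
    """Split extracted fields by hil_flag and route each group."""
    confident = {}
    needs_review = []
    for field, field_data in data.items():
        if field_data["hil_flag"]:
            needs_review.append({
                "field": field,
                "value": field_data["value"],
                "reason": field_data.get("reason"),
            })
        else:
            confident[field] = field_data["value"]
    # Process confident fields automatically
    # (save_to_database is your application's own function)
    save_to_database(confident)
    # Send uncertain fields to the review queue
    # (send_to_review_queue is likewise application-specific)
    if needs_review:
        send_to_review_queue(needs_review)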
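quota_exceeded
{"detail": "Monthly page quota exceeded"}
Solution: Upgrade to the PRO plan or wait for the next billing cycle.
invalid_schema
{"detail": "Schema must be valid JSON Schema"}
Solution: Make sure the schema is valid JSON and includes type and properties.
file_too_large
{"detail": "File size exceeds 50MB limit"}
Solution: Compress the PDF or split it into smaller files.
failed
{"status": "failed", "error": "Unable to process PDF"}
Common causes:
- Corrupted PDF file
- Password-protected PDF
- Unsupported PDF version
- Image quality too low for OCR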
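Two of the errors above, file_too_large and invalid_schema, can be caught locally before an upload. The sketch below is a client-side pre-flight check under stated assumptions: it treats "50MB" as 50 × 1024 × 1024 bytes and only checks that the schema parses as JSON and has `type` and `properties`, so the server may still reject inputs that pass. `preflight` and `MAX_FILE_BYTES` are hypothetical names.

```python
import json

# Assumption: the documented 50MB limit is interpreted as 50 MiB
MAX_FILE_BYTES = 50 * 1024 * 1024


def preflight(file_size_bytes, schema_json=None):
    """Return the error codes a request would likely trigger; empty list if none."""
    problems = []
    if file_size_bytes > MAX_FILE_BYTES:
        problems.append("file_too_large")
    if schema_json is not None:
        try:
            schema = json.loads(schema_json)
        except json.JSONDecodeError:
            schema = None
        # The schema must be a JSON object containing both type and properties
        if not isinstance(schema, dict) or "type" not in schema or "properties" not in schema:
            problems.append("invalid_schema")
    return problems
```

A caller would typically pass `os.path.getsize(path)` as `file_size_bytes` and skip the upload when the returned list is not empty.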
```json
{
"type": "object",
"properties": {
"invoice_number": {
"type": "string",
"description": "Unique invoice ID"
},
"invoice_date": {
"type": "string",
"description": "Invoice date in MM/DD/YYYY format"
},
"vendor": {
"type": "string",
"description": "Vendor company name"
},
"total": {
"type": "number",
"description": "Total amount including tax"
},
},
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": {"type": "string"},
"quantity": {"type