
anycrawl: an all-in-one web content tool that combines scraping and crawling

mobile · 2026-02-03 14:30:40

AnyCrawl Skill

Integrates the AnyCrawl API with OpenClaw: scrape, crawl, and search web content through a high-performance multi-threaded crawler.

Setup

Method 1: Environment variable (recommended)

export ANYCRAWL_API_KEY="your-api-key"

To make it permanent, add it to ~/.bashrc or ~/.zshrc:

echo 'export ANYCRAWL_API_KEY="your-api-key"' >> ~/.bashrc
source ~/.bashrc

Get your API key at https://anycrawl.dev

Method 2: OpenClaw gateway configuration

openclaw config.patch --set ANYCRAWL_API_KEY="your-api-key"
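
With the key exported as in Method 1, calling code can check for it up front instead of failing later with a 401. A minimal sketch, assuming a Node.js environment; `requireApiKey` is a hypothetical helper name, not part of AnyCrawl or OpenClaw:

```javascript
// Hypothetical helper: read ANYCRAWL_API_KEY and fail fast if it is missing
// or still set to the placeholder used in the setup commands above.
function requireApiKey(env = process.env) {
  const key = env.ANYCRAWL_API_KEY;
  if (!key || key === "your-api-key") {
    throw new Error("ANYCRAWL_API_KEY is not set; export it before calling the skill");
  }
  return key;
}
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real shell environment.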

Functions

1. anycrawl_scrape

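Scrapes a single URL and converts it into LLM-ready structured data.

Parameters:
- url (string, required): the URL to scrape
- engine (string, optional): scraping engine - "cheerio" (default), "playwright", "puppeteer"
- formats (array, optional): output formats - ["markdown"], ["html"], ["text"], ["json"], ["screenshot"]
- timeout (number, optional): timeout in milliseconds (default: 30000)
- wait_for (number, optional): delay before extraction, in milliseconds (browser engines only)
- wait_for_selector (string/object/array, optional): wait for a CSS selector to appear
- include_tags (array, optional): include only these HTML tags (e.g. ["h1", "p", "article"])
- exclude_tags (array, optional): exclude these HTML tags
- proxy (string, optional): proxy URL (e.g. "http://proxy:port")
- json_options (object, optional): JSON extraction driven by a schema and/or prompt
- extract_source (string, optional): "markdown" (default) or "html"

Examples:

// Basic scrape with the default cheerio engine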
anycrawl_scrape({ url: "https://example.com" })

// Scrape a single-page application (SPA) with Playwright
anycrawl_scrape({ 
  url: "https://spa-example.com",
  engine: "playwright",
  formats: ["markdown", "screenshot"]
})

// Extract structured JSON
anycrawl_scrape({
  url: "https://product-page.com",
  engine: "cheerio",
  json_options: {
    schema: {
      type: "object",
      properties: {
        product_name: { type: "string" },
        price: { type: "number" },
        description: { type: "string" }
      },
      required: ["product_name", "price"]
    },
    user_prompt: "Extract the product details from this page"
  }
})
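
Because several of the parameters above only accept a fixed set of values, it can help to check an argument object before sending it. The following is a rough sketch; `checkScrapeArgs` is a hypothetical helper, and the allowed values are copied from the parameter list rather than from the API itself:

```javascript
// Allowed values as listed in the parameter description above.
const ENGINES = ["cheerio", "playwright", "puppeteer"];
const FORMATS = ["markdown", "html", "text", "json", "screenshot"];

// Hypothetical helper: returns a list of problems found in a scrape argument object.
function checkScrapeArgs(args) {
  const problems = [];
  if (typeof args.url !== "string" || args.url === "") {
    problems.push("url is required");
  }
  const engine = args.engine ?? "cheerio"; // cheerio is the documented default
  if (!ENGINES.includes(engine)) {
    problems.push(`unknown engine: ${engine}`);
  }
  for (const format of args.formats ?? []) {
    if (!FORMATS.includes(format)) problems.push(`unknown format: ${format}`);
  }
  // wait_for is documented as applying to browser engines only.
  if (args.wait_for !== undefined && engine === "cheerio") {
    problems.push("wait_for requires a browser engine");
  }
  return problems;
}
```

An empty array means the arguments are consistent with the list above; it does not guarantee the API will accept them.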

2. anycrawl_search

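Searches Google and returns structured results.

Parameters:
- query (string, required): the search query
- engine (string, optional): search engine - "google" (default)
- limit (number, optional): maximum number of results per page (default: 10)
- offset (number, optional): number of results to skip (default: 0)
- pages (number, optional): number of pages to retrieve (default: 1, maximum: 20)
- lang (string, optional): language locale (e.g. "en", "zh", "vi")
- safe_search (number, optional): 0 (off), 1 (moderate), 2 (strict)
- scrape_options (object, optional): scrape each result URL with these options

Examples:

// Basic search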
anycrawl_search({ query: "OpenAI ChatGPT" })

// Multi-page search in Vietnamese
anycrawl_search({ 
  query: "hướng dẫn Node.js",
  pages: 3,
  lang: "vi"
})

// Search and automatically scrape the results
anycrawl_search({
  query: "best AI tools 2026",
  limit: 5,
  scrape_options: {
    engine: "cheerio",
    formats: ["markdown"]
  }
})
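
The documented bounds on `pages` and `safe_search` can be applied before a search is issued. This is only a sketch: `buildSearchArgs` is a hypothetical helper, and clamping `pages` into range (rather than rejecting it) is a design choice made here, not documented behavior:

```javascript
const MAX_PAGES = 20; // documented maximum for `pages`

// Hypothetical helper: build an anycrawl_search argument object,
// applying the documented defaults and bounds.
function buildSearchArgs(query, { limit = 10, pages = 1, lang, safe_search } = {}) {
  if (!query) throw new Error("query is required");
  if (safe_search !== undefined && ![0, 1, 2].includes(safe_search)) {
    throw new Error("safe_search must be 0, 1, or 2");
  }
  const args = { query, limit, pages: Math.min(Math.max(pages, 1), MAX_PAGES) };
  if (lang) args.lang = lang;
  if (safe_search !== undefined) args.safe_search = safe_search;
  return args;
}
```

For example, `buildSearchArgs("best AI tools 2026", { pages: 50 })` produces an object with `pages: 20`.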

3. anycrawl_crawl_start

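Starts crawling an entire website (asynchronous job).

Parameters:
- url (string, required): the seed URL to start crawling from
- engine (string, optional): "cheerio" (default), "playwright", "puppeteer"
- strategy (string, optional): "all", "same-domain" (default), "same-hostname", "same-origin"
- max_depth (number, optional): maximum depth from the seed URL (default: 10)
- limit (number, optional): maximum number of pages to crawl (default: 100)
- include_paths (array, optional): path patterns to include (e.g. ["/blog/*"])
- exclude_paths (array, optional): path patterns to exclude (e.g. ["/admin/*"])
- scrape_paths (array, optional): only scrape URLs matching these patterns
- scrape_options (object, optional): scrape options applied to each page

Examples:

// Crawl an entire site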
anycrawl_crawl_start({ 
  url: "https://docs.example.com",
  engine: "cheerio",
  max_depth: 5,
  limit: 50
})

// Crawl blog posts only
anycrawl_crawl_start({
  url: "https://example.com",
  strategy: "same-domain",
  include_paths: ["/blog/*"],
  exclude_paths: ["/blog/tags/*"],
  scrape_options: {
    formats: ["markdown"]
  }
})

// Crawl the shop, but only scrape product pages
anycrawl_crawl_start({
  url: "https://shop.example.com",
  strategy: "same-domain",
  scrape_paths: ["/products/*"],
  limit: 200
})
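
To get a feel for how `include_paths` and `exclude_paths` might combine, here is a small local matcher. It rests on two assumptions that the text above does not confirm: that `*` matches any run of characters (including `/`), and that exclusions take precedence over inclusions. Both function names are hypothetical.

```javascript
// Assumption: "*" matches any run of characters; everything else is literal.
function globToRegExp(pattern) {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// Assumption: an excluded path is rejected even if it also matches an include pattern.
function pathAllowed(path, includePaths = [], excludePaths = []) {
  const matches = (patterns) => patterns.some((p) => globToRegExp(p).test(path));
  if (matches(excludePaths)) return false;
  return includePaths.length === 0 || matches(includePaths);
}
```

Running candidate paths through a matcher like this before starting a large crawl is a cheap way to catch a pattern that is too broad or too narrow, though the service's real matching rules may differ.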

4. anycrawl_crawl_status

Checks the status of a crawl job.

Parameters:
- job_id (string, required): the crawl job ID

Example:

anycrawl_crawl_status({ job_id: "7a2e165d-8f81-4be6-9ef7-23222330a396" })

5. anycrawl_crawl_results

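Retrieves crawl results (paginated).

Parameters:
- job_id (string, required): the crawl job ID
- skip (number, optional): number of results to skip (default: 0)

Examples:

// Get the first 100 results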
anycrawl_crawl_results({ job_id: "xxx", skip: 0 })

// Get the next 100 results
anycrawl_crawl_results({ job_id: "xxx", skip: 100 })
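
The two calls above suggest a page size of 100, so collecting a whole job amounts to stepping `skip` until a short page comes back. A minimal sketch under that assumption: `fetchPage` is a hypothetical adapter that calls `anycrawl_crawl_results` and resolves to the array of results for one page, since the exact shape of `data` is not shown here.

```javascript
// Assumption: each call returns at most 100 results, as the examples above imply.
const PAGE_SIZE = 100;

// fetchPage is a hypothetical adapter: (jobId, skip) => Promise<Array>.
// maxPages is a safety cap so a misbehaving adapter cannot loop forever.
async function collectCrawlResults(fetchPage, jobId, maxPages = 50) {
  const all = [];
  for (let page = 0; page < maxPages; page++) {
    const batch = await fetchPage(jobId, page * PAGE_SIZE);
    all.push(...batch);
    if (batch.length < PAGE_SIZE) break; // short page: nothing left to fetch
  }
  return all;
}
```

If a job holds an exact multiple of 100 results, the loop makes one extra call that returns an empty page and then stops.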

6. anycrawl_crawl_cancel

Cancels a running crawl job.

Parameters:
- job_id (string, required): the crawl job ID

7. anycrawl_search_and_scrape

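Convenience helper: searches Google, then scrapes the top results.

Parameters:
- query (string, required): the search query
- max_results (number, optional): maximum number of results to scrape (default: 3)
- scrape_engine (string, optional): engine used for scraping (default: "cheerio")
- formats (array, optional): output formats (default: ["markdown"])
- lang (string, optional): search language

Example: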

anycrawl_search_and_scrape({
  query: "latest AI news",
  max_results: 5,
  formats: ["markdown"]
})

Engine selection guide

Engine	Best for	Speed	JS rendering
`cheerio`	Static HTML, news, blogs	⚡ Fastest	❌ No
`playwright`	SPAs, complex web apps	🐢 Slower	✅ Yes
`puppeteer`	Chrome-specific scenarios, metrics	🐢 Slower	✅ Yes
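
The table can be read as a simple decision rule. One possible encoding, with `chooseEngine` and its two flags being hypothetical names introduced for illustration:

```javascript
// Hypothetical helper encoding the guide above:
// static pages -> cheerio, JS-rendered pages -> playwright,
// Chrome-specific needs -> puppeteer.
function chooseEngine({ needsJs = false, chromeSpecific = false } = {}) {
  if (chromeSpecific) return "puppeteer";
  return needsJs ? "playwright" : "cheerio";
}
```

Defaulting to `cheerio` mirrors the documented default engine and keeps the slower browser engines for pages that actually need them.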

Response format

All responses follow this structure:

{
  "success": true,
  "data": { ... },
  "message": "optional message"
}

Error response:

{
  "success": false,
  "error": "error type",
  "message": "human-readable message"
}
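
Since both shapes share the `success` flag, a caller can funnel every response through one function. A small sketch, where `unwrap` is a hypothetical name and the fields used are only those shown in the two structures above:

```javascript
// Hypothetical helper: return `data` on success, otherwise throw an Error
// carrying the `message` and `error` fields of the error response.
function unwrap(response) {
  if (response && response.success === true) return response.data;
  const err = new Error(response?.message ?? "unknown AnyCrawl error");
  err.type = response?.error; // the "error type" string, if present
  throw err;
}
```

Usage might look like `const page = unwrap(anycrawl_scrape({ url: "https://example.com" }))`, assuming the call returns the envelope directly.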

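Common error codes

400 - Bad request (validation error)
401 - Unauthorized (invalid API key)
402 - Payment required (insufficient credits)
404 - Not found
429 - Rate limit exceeded
500 - Internal server error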
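
These codes can be kept in a lookup table so that error handling stays in one place. In the sketch below, the descriptions come from the list above, while treating 429 and 500 as retryable is an assumption of this example, not something the list states; `describeStatus` is a hypothetical helper.

```javascript
// Descriptions taken from the list above.
const ERROR_MEANINGS = {
  400: "Bad request (validation error)",
  401: "Unauthorized (invalid API key)",
  402: "Payment required (insufficient credits)",
  404: "Not found",
  429: "Rate limit exceeded",
  500: "Internal server error",
};

// Assumption: only rate limiting and server errors are worth retrying.
const RETRYABLE = new Set([429, 500]);

function describeStatus(status) {
  return {
    status,
    meaning: ERROR_MEANINGS[status] ?? "Unknown status",
    retryable: RETRYABLE.has(status),
  };
}
```

A 401 or 402 usually points to a key or credit problem that a retry will not fix, which is why they are left out of the retryable set here.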

API limits

Rate limits apply according to your plan
Crawl jobs expire after 24 hours
Maximum crawl limit: depends on your credits

Links

API documentation: https://docs.anycrawl.dev
Website: https://anycrawl.dev
Playground: https://anycrawl.dev/playground

Skill package: https://github.com/openclaw/skills/tree/main/skills/techlaai/anycrawl/SKILL.md
