OneFileLLM: Pack a Codebase or Documentation into a Single File for LLMs

ball · 2026-05-01 11:00:18

OneFileLLM

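A content aggregator for LLMs: it gathers data from multiple sources and structures it into a single XML file for use as LLM context.

Description

OneFileLLM is a command-line tool that automatically aggregates data from a variety of sources (local files, GitHub repositories, web pages, PDFs, YouTube transcripts, and more) into one structured XML output. The output is copied to the clipboard automatically so it can be pasted straight into a large language model (LLM).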

Installation

git clone https://github.com/jimmc414/onefilellm.git
cd onefilellm
pip install -r requirements.txt

Installing with pip

OneFileLLM can also be installed as a pip package. This gives you both the CLI and the Python API without cloning the repository:

pip install onefilellm

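Command-Line Interface (CLI)

The project can also be installed as a command-line tool, so you can run onefilellm directly from the terminal.

CLI Installation

To install the CLI, run the following command from the project root: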

pip install -e .

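This installs the package in "editable" mode, meaning any change you make to the source code takes effect in the command-line tool immediately.

CLI Usage

Once installed, you can use the onefilellm command in place of python onefilellm.py.

Synopsis:
onefilellm [options] [inputs...]

Example: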

onefilellm ./docs/ https://github.com/user/project/issues/123

All other command-line arguments and options are the same as in the script-based approach.

For GitHub API access (recommended):

export GITHUB_TOKEN="your_personal_access_token"

Python API

After installing via pip, you can call OneFileLLM directly from Python code.

from onefilellm import run

# Process inputs programmatically
run(["./docs/"])

Command Help

usage: onefilellm.py [-h] [-c]
                     [-f {text,markdown,json,html,yaml,doculing,markitdown}]
                     [--alias-add NAME [COMMAND_STRING ...]]
                     [--alias-remove NAME] [--alias-list] [--alias-list-core]
                     [--crawl-max-depth CRAWL_MAX_DEPTH]
                     [--crawl-max-pages CRAWL_MAX_PAGES]
                     [--crawl-user-agent CRAWL_USER_AGENT]
                     [--crawl-delay CRAWL_DELAY]
                     [--crawl-include-pattern CRAWL_INCLUDE_PATTERN]
                     [--crawl-exclude-pattern CRAWL_EXCLUDE_PATTERN]
                     [--crawl-timeout CRAWL_TIMEOUT] [--crawl-include-images]
                     [--crawl-no-include-code] [--crawl-no-extract-headings]
                     [--crawl-follow-links] [--crawl-no-clean-html]
                     [--crawl-no-strip-js] [--crawl-no-strip-css]
                     [--crawl-no-strip-comments] [--crawl-respect-robots]
                     [--crawl-concurrency CRAWL_CONCURRENCY]
                     [--crawl-restrict-path] [--crawl-no-include-pdfs]
                     [--crawl-no-ignore-epubs] [--help-topic [TOPIC]]
                     [inputs ...]

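OneFileLLM - Content Aggregator for LLMs

positional arguments:
  inputs                Input paths, URLs, or aliases to process

options:
  -h, --help            Show this help message and exit
  -c, --clipboard       Process text from the clipboard
  -f, --format {text,markdown,json,html,yaml,doculing,markitdown}
                        Override format detection for text input
  --help-topic [TOPIC]  Show help for a specific topic (basic, aliases, crawling, pipelines, examples, config)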

## Quick Start Examples

### Local Files and Directories
```bash
python onefilellm.py research_paper.pdf config.yaml src/
python onefilellm.py *.py requirements.txt docs/ README.md
python onefilellm.py notebook.ipynb --format json
python onefilellm.py large_dataset.csv logs/ --format text
```

GitHub Repositories and Issues

python onefilellm.py https://github.com/microsoft/vscode
python onefilellm.py https://github.com/openai/whisper/tree/main/whisper
python onefilellm.py https://github.com/microsoft/vscode/pull/12345
python onefilellm.py "https://github.com/kubernetes/kubernetes/issues?state=all"
python onefilellm.py "https://github.com/kubernetes/kubernetes/issues?state=open"
python onefilellm.py "https://github.com/kubernetes/kubernetes/issues?state=closed"

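You can retrieve a repository's issues by specifying the state query parameter. Use state=all (the default) for all issues, state=open for open issues only, and state=closed for closed issues. Quote the URL if your shell treats ? as a glob character.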
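
The state handling described above can be sketched in a few lines of Python. This is only an illustration, not OneFileLLM's actual code: the helper name `issue_state` is hypothetical, and rejecting unknown values is an assumption about reasonable behavior.

```python
from urllib.parse import urlparse, parse_qs

VALID_STATES = {"all", "open", "closed"}

def issue_state(url, default="all"):
    """Return the issue state requested by a GitHub issues URL.

    Hypothetical helper: falls back to `default` when the URL carries no
    state parameter and rejects values outside VALID_STATES (an assumption).
    """
    query = parse_qs(urlparse(url).query)
    state = query.get("state", [default])[0]
    if state not in VALID_STATES:
        raise ValueError(f"unsupported issue state: {state!r}")
    return state
```

A URL without a query string falls through to the default, which matches the documented state=all behavior.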

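Using a Specific Branch or Tag

Can this tool process a different branch of a GitHub repository?

Yes. When you supply a GitHub URL that includes a branch (for example https://github.com/openai/whisper/tree/main/whisper), the tool parses the tree/ portion and adds a ref parameter to its requests, so the specified branch or tag is retrieved.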
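
To make the tree/ parsing concrete, here is a minimal sketch of how such a URL could be split into owner, repository, ref, and sub-path. The function name `split_github_tree_url` is hypothetical and the real tool may parse URLs differently. One known limitation of this simple approach is that a branch name containing a slash cannot be told apart from a sub-path.

```python
from urllib.parse import urlparse

def split_github_tree_url(url):
    """Split a GitHub URL into (owner, repo, ref, subpath).

    Hypothetical helper. `ref` is None when the URL has no tree/ segment,
    in which case the repository's default branch would apply.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a repository URL: {url}")
    owner, repo = parts[0], parts[1]
    ref, subpath = None, ""
    if len(parts) >= 4 and parts[2] == "tree":
        ref = parts[3]                    # branch or tag name
        subpath = "/".join(parts[4:])     # directory inside the repository
    return owner, repo, ref, subpath
```

The returned `ref` is the value that would be passed along as the ref parameter mentioned above.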

Web Documentation and APIs

python onefilellm.py https://docs.python.org/3/tutorial/
python onefilellm.py https://react.dev/learn/thinking-in-react
python onefilellm.py https://docs.stripe.com/api
python onefilellm.py https://kubernetes.io/docs/concepts/

Multimedia and Academic Sources

python onefilellm.py https://www.youtube.com/watch?v=dQw4w9WgXcQ
python onefilellm.py https://arxiv.org/abs/2103.00020
python onefilellm.py arxiv:1706.03762 PMID:35177773
python onefilellm.py doi:10.1038/s41586-021-03819-2

Multiple Inputs

python onefilellm.py https://github.com/jimmc414/hey-claude https://modelcontextprotocol.io/llms-full.txt https://github.com/anthropics/anthropic-sdk-python https://github.com/anthropics/anthropic-cookbook
python onefilellm.py https://github.com/openai/whisper/tree/main/whisper https://www.youtube.com/watch?v=dQw4w9WgXcQ ALIAS_MCP
python onefilellm.py https://github.com/microsoft/vscode/pull/12345 https://arxiv.org/abs/2103.00020 
python onefilellm.py https://github.com/kubernetes/kubernetes/issues https://pytorch.org/docs

Input Streams

python onefilellm.py --clipboard --format markdown
cat large_dataset.json | python onefilellm.py - --format json
curl -s https://api.github.com/repos/microsoft/vscode | python onefilellm.py -
echo 'Quick analysis task' | python onefilellm.py -

Alias System

Creating Simple and Complex Aliases

python onefilellm.py --alias-add mcp "https://github.com/anthropics/mcp"
python onefilellm.py --alias-add modern-web \
  "https://github.com/facebook/react https://reactjs.org/docs/ https://github.com/vercel/next.js"

Dynamic Placeholders

# Create aliases that contain a {} placeholder
python onefilellm.py --alias-add gh-search "https://github.com/search?q={}"
python onefilellm.py --alias-add gh-user "https://github.com/{}"
python onefilellm.py --alias-add arxiv-search "https://arxiv.org/search/?query={}"

# Fill in the placeholder at call time
python onefilellm.py gh-search "machine learning transformers"
python onefilellm.py gh-user "microsoft"
python onefilellm.py arxiv-search "attention mechanisms"
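
The placeholder mechanism can be pictured as plain string substitution. The sketch below is an assumption about how expansion might work, not the tool's implementation: `expand_alias` is a hypothetical name, and percent-encoding the value (so that a query containing spaces still forms one valid URL) is a design choice made here for illustration.

```python
from urllib.parse import quote

def expand_alias(command_string, value):
    """Fill every {} placeholder with `value` and split the result into sources.

    Hypothetical helper. The value is percent-encoded (slashes are kept) so
    that spaces in a search query do not break the URL apart.
    """
    encoded = quote(value, safe="/")
    return command_string.replace("{}", encoded).split()
```

Because the result is split on whitespace afterwards, an alias that bundles several sources still expands to one entry per source.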

Complex Ecosystem Aliases

python onefilellm.py --alias-add ai-research \
  "arxiv:1706.03762 https://github.com/huggingface/transformers https://pytorch.org/docs"
python onefilellm.py --alias-add k8s-ecosystem \
  "https://github.com/kubernetes/kubernetes https://kubernetes.io/docs/ https://github.com/istio/istio"

# Combine multiple aliases with live sources
python onefilellm.py ai-research k8s-ecosystem modern-web \
  conference_notes.pdf local_experiments/

Alias Management

python onefilellm.py --alias-list              # Show all aliases
python onefilellm.py --alias-list-core         # Show core aliases only
python onefilellm.py --alias-remove old-alias  # Remove a user alias
cat ~/.onefilellm_aliases/aliases.json         # View the raw JSON

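Alias options:
  --alias-add NAME [COMMAND_STRING ...]
                        Add or update a user-defined alias. Multiple arguments after NAME are joined into COMMAND_STRING.
  --alias-remove NAME   Remove a user-defined alias.
  --alias-list          List all effective aliases (user-defined aliases override core aliases).
  --alias-list-core     List only the pre-installed (core) aliases.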
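
The override rule (user-defined aliases win over core aliases) amounts to a dictionary merge. The sketch below assumes that the aliases file is a flat JSON object mapping alias names to command strings. That layout and the name `effective_aliases` are assumptions for illustration only.

```python
import json
from pathlib import Path

def effective_aliases(core_aliases, user_file):
    """Merge core aliases with user aliases, letting user entries win.

    Hypothetical helper. Assumes `user_file` holds a flat JSON object of
    {alias_name: command_string}; a missing file means no user aliases.
    """
    path = Path(user_file).expanduser()
    user_aliases = {}
    if path.is_file():
        user_aliases = json.loads(path.read_text(encoding="utf-8"))
    # Later keys override earlier ones, so user definitions take precedence.
    return {**core_aliases, **user_aliases}
```

Calling it with the documented storage path, `effective_aliases(core, "~/.onefilellm_aliases/aliases.json")`, would give the view that --alias-list is described as printing.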

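Web crawler options:
  --crawl-max-depth CRAWL_MAX_DEPTH
                        Maximum crawl depth (default: 3)
  --crawl-max-pages CRAWL_MAX_PAGES
                        Maximum number of pages to crawl (default: 1000)
  --crawl-user-agent CRAWL_USER_AGENT
                        User agent for web requests (default: OneFileLLMCrawler/1.1)
  --crawl-delay CRAWL_DELAY
                        Delay between requests in seconds (default: 0.25)
  --crawl-include-pattern CRAWL_INCLUDE_PATTERN
                        Regex pattern for URLs to include
  --crawl-exclude-pattern CRAWL_EXCLUDE_PATTERN
                        Regex pattern for URLs to exclude
  --crawl-timeout CRAWL_TIMEOUT
                        Request timeout in seconds (default: 20)
  --crawl-include-images
                        Include image URLs in the output
  --crawl-no-include-code
                        Exclude code blocks from the output
  --crawl-no-extract-headings
                        Disable heading extraction
  --crawl-follow-links  Follow links to external domains
  --crawl-no-clean-html Disable readability cleaning
  --crawl-no-strip-js   Keep JavaScript code
  --crawl-no-strip-css  Keep CSS styles
  --crawl-no-strip-comments
                        Keep HTML comments
  --crawl-respect-robots
                        Respect robots.txt (default: ignored, for backward compatibility)
  --crawl-concurrency CRAWL_CONCURRENCY
                        Number of concurrent requests (default: 3)
  --crawl-restrict-path
                        Restrict the crawl to paths under the start URL
  --crawl-no-include-pdfs
                        Skip PDF files
  --crawl-no-ignore-epubs
                        Include EPUB files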

Advanced Web Crawling

Comprehensive Documentation Sites

python onefilellm.py https://docs.python.org/3/ \
  --crawl-max-depth 4 --crawl-max-pages 800 \
  --crawl-include-pattern ".*/(tutorial|library|reference)/" \
  --crawl-exclude-pattern ".*/(whatsnew|faq)/"
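
It can help to check the include and exclude patterns against a few sample URLs before starting a long crawl. The sketch below shows one plausible way to combine the two patterns. Whether OneFileLLM uses search or full-match semantics, and whether exclusion takes precedence, are assumptions here, and `should_crawl` is a hypothetical name.

```python
import re

def should_crawl(url, include_pattern=None, exclude_pattern=None):
    """Decide whether a URL passes the include/exclude filters.

    Hypothetical helper. Assumes exclusion is checked first and that a
    pattern matching anywhere in the URL counts as a match.
    """
    if exclude_pattern and re.search(exclude_pattern, url):
        return False
    if include_pattern and not re.search(include_pattern, url):
        return False
    return True

# Patterns taken from the documentation-site example above.
INCLUDE = r".*/(tutorial|library|reference)/"
EXCLUDE = r".*/(whatsnew|faq)/"

for candidate in (
    "https://docs.python.org/3/tutorial/index.html",
    "https://docs.python.org/3/whatsnew/3.12.html",
    "https://docs.python.org/3/howto/regex.html",
):
    print(candidate, should_crawl(candidate, INCLUDE, EXCLUDE))
```

Under these assumptions only the tutorial page passes: the whatsnew page is excluded, and the howto page matches no include pattern.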

Enterprise API Documentation

python onefilellm.py https://docs.aws.amazon.com/ec2/ \
  --crawl-max-depth 3 --crawl-max-pages 500 \
  --crawl-include-pattern ".*/(UserGuide|APIReference)/" \
  --crawl-respect-robots --crawl-delay 0.5

Academic and Research Sites

python onefilellm.py https://arxiv.org/list/cs.AI/recent \
  --crawl-max-depth 2 --crawl-max-pages 100 \
  --crawl-include-pattern ".*/(abs|pdf)/" \
  --crawl-delay 1.0

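Integration with LLM Tools

Multi-Stage Research Analysis

python onefilellm.py ai-research protein-folding | \
  llm -m claude-3-haiku "Extract key methods and datasets" | \
  llm -m claude-3-sonnet "Identify experimental approaches" | \
  llm -m gpt-4o "Compare methodologies across papers" | \
  llm -m claude-3-opus "Generate new research directions"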

Automated Competitive Analysis

python onefilellm.py \
  https://github.com/competitor1/product \
  https://competitor1.com/docs/ \
  https://competitor2.com/api/ | \
  llm -m claude-3-haiku "Extract features and capabilities" | \
  llm -m gpt-4o "Compare and identify gaps" | \
  llm -m claude-3-opus "Generate strategic recommendations"

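Daily Research Monitoring (cron job)

# A crontab entry must fit on a single line; cron does not honor backslash continuations.
0 9 * * * python onefilellm.py https://arxiv.org/list/cs.AI/recent https://arxiv.org/list/cs.LG/recent | llm -m claude-3-haiku "Extract significant papers" | llm -m claude-3-sonnet "Summarize key developments" | mail -s "Daily AI Research Brief" researcher@company.com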

Output Format

All output is wrapped in XML for better LLM processing:

<onefilellm_output>
  <source type="[source_type]" [additional_attributes]>
    <[content_type]>
      [extracted content]
    </[content_type]>
  </source>
</onefilellm_output>
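
As a rough illustration of the template above, the following sketch builds a document of the same shape with Python's standard library. It is not how OneFileLLM produces its output: the function name `wrap_sources`, the input dictionary layout, and the sample attribute and tag names are all hypothetical. Note also that ElementTree escapes characters such as < in text, which the real tool may or may not do.

```python
import xml.etree.ElementTree as ET

def wrap_sources(sources):
    """Wrap extracted sources in an <onefilellm_output> document.

    Hypothetical helper. Each source is a dict with "type", "content_tag",
    "text", and optional "attrs" (extra attributes on <source>).
    """
    root = ET.Element("onefilellm_output")
    for src in sources:
        attrs = {"type": src["type"], **src.get("attrs", {})}
        node = ET.SubElement(root, "source", attrs)
        content = ET.SubElement(node, src["content_tag"])
        content.text = src["text"]
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")

# Sample values below are illustrative only.
print(wrap_sources([
    {"type": "local_file", "attrs": {"path": "README.md"},
     "content_tag": "content", "text": "# Project notes"},
]))
```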

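Supported Input Types

Local: files and directories
GitHub: repositories, issues, pull requests
Web: pages, with advanced crawling options
Academic: ArXiv papers, DOIs, PMIDs
Multimedia: YouTube transcripts
Streams: stdin, clipboard

Core Aliases

ofl_repo - the OneFileLLM GitHub repository
ofl_readme - the OneFileLLM README file
gh_search - GitHub search with a placeholder
arxiv_search - ArXiv search with a placeholder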

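Configuration

Alias storage: ~/.onefilellm_aliases/aliases.json
Environment variables:
GITHUB_TOKEN - GitHub API access token
OFFLINE_MODE - set to 1 to skip network operations
A .env file in the project root can also be used.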
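
For scripts that wrap the tool, the two environment variables can be read as shown below. This is a minimal sketch under stated assumptions: `load_settings` and `github_headers` are hypothetical helpers, the sketch does not load a .env file, and the header layout follows GitHub's general REST API conventions rather than anything confirmed about OneFileLLM's internals.

```python
import os

def load_settings(env=None):
    """Read GITHUB_TOKEN and OFFLINE_MODE from the environment.

    Hypothetical helper. `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    return {
        "github_token": env.get("GITHUB_TOKEN") or None,
        "offline": env.get("OFFLINE_MODE") == "1",
    }

def github_headers(settings):
    """Build request headers, adding authorization only when a token is set."""
    headers = {"Accept": "application/vnd.github+json"}
    if settings["github_token"]:
        headers["Authorization"] = f"Bearer {settings['github_token']}"
    return headers
```

Treating only the literal value 1 as "offline" mirrors the wording above; other truthy spellings are deliberately not accepted in this sketch.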

Additional Help

python onefilellm.py --help-topic basic      # Input sources and basic usage
python onefilellm.py --help-topic aliases    # Alias system with practical examples
python onefilellm.py --help-topic crawling   # Web crawling modes and ethics
python onefilellm.py --help-topic pipelines  # Integration workflows with the 'llm' tool
python onefilellm.py --help-topic examples   # Advanced usage patterns
python onefilellm.py --help-topic config     # Environment and configuration

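Troubleshooting

YouTube transcript errors: fetching YouTube transcripts requires the yt-dlp tool. If you see an error saying that yt-dlp was not found or that it failed, install it with:

pip install yt-dlp

Project page: https://github.com/jimmc414/onefilellm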
