llama.cpp

llama

宣言 / ggml / 运算

使用 C/C++ 进行 LLM 推理

近期 API 变更

热点话题

指南：使用 llama.cpp 的新 WebUI
指南：使用 llama.cpp 运行 gpt-oss
[反馈] 为 llama.cpp 提供更好的打包方案以支持下游用户 🤗
已添加对原生 MXFP4 格式 gpt-oss 模型的支持 | PR | 与 NVIDIA 的合作 | 评论
多模态支持已登陆 llama-server：#12898 | 文档
用于 FIM 补全的 VS Code 扩展：https://github.com/ggml-org/llama.vscode
用于 FIM 补全的 Vim/Neovim 插件：https://github.com/ggml-org/llama.vim
Hugging Face Inference Endpoints 现已开箱即用地支持 GGUF！https://github.com/ggml-org/llama.cpp/discussions/9669
Hugging Face GGUF 编辑器：讨论 | 工具

快速开始

开始使用 llama.cpp 非常简单。以下是几种在您的机器上安装的方法：

使用 brew, nix 或 winget 安装 llama.cpp
使用 Docker 运行 - 查看我们的 Docker 文档
从发布页面下载预编译的二进制文件
通过克隆此仓库从源代码构建 - 查看我们的构建指南

安装完成后，您需要一个模型来运行。请前往获取与量化模型部分了解更多信息。

示例命令：

# 使用本地模型文件
llama-cli -m my_model.gguf

# 或直接从 Hugging Face 下载并运行模型
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# 启动 OpenAI 兼容的 API 服务器
llama-server -hf ggml-org/gemma-3-1b-it-GGUF

项目描述

llama.cpp 的主要目标是在广泛的硬件上（本地和云端）以最少的设置实现最先进的 LLM 推理性能。

纯 C/C++ 实现，无任何依赖
Apple Silicon 是一等公民 - 通过 ARM NEON、Accelerate 和 Metal 框架进行优化
对 x86 架构支持 AVX、AVX2、AVX512 和 AMX
对 RISC-V 架构支持 RVV、ZVFH、ZFH、ZICBOP 和 ZIHINTPAUSE
支持 1.5 位、2 位、3 位、4 位、5 位、6 位和 8 位整数量化，以加速推理并减少内存使用
自定义 CUDA 内核，用于在 NVIDIA GPU 上运行 LLM（通过 HIP 支持 AMD GPU，通过 MUSA 支持摩尔线程 GPU）
支持 Vulkan 和 SYCL 后端
CPU+GPU 混合推理，以部分加速超过总 VRAM 容量的模型

llama.cpp 项目是开发 ggml 库新功能的主要试验场。

支持的模型

通常，以下基础模型的微调版本也受支持。添加对新模型支持的说明：[HOWTO-add-model.md](docs/development/HOWTO-add-model.md) #### 纯文本模型 - [X] LLaMA 🦙 - [x] LLaMA 2 🦙🦙 - [x] LLaMA 3 🦙🦙🦙 - [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) - [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral) - [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct) - [x] [Jamba](https://huggingface.co/ai21labs) - [X] [Falcon](https://huggingface.co/models?search=tiiuae/falcon) - [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) 和 [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2) - [X] [Vigogne (法语)](https://github.com/bofenghuang/vigogne) - [X] [BERT](https://github.com/ggml-org/llama.cpp/pull/5423) - [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/) - [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [衍生版本](https://huggingface.co/hiyouga/baichuan-7b-sft) - [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila) - [X] [Starcoder 模型](https://github.com/ggml-org/llama.cpp/pull/3187) - [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim) - [X] [MPT](https://github.com/ggml-org/llama.cpp/pull/3417) - [X] [Bloom](https://github.com/ggml-org/llama.cpp/pull/3553) - [x] [Yi 模型](https://huggingface.co/models?search=01-ai/Yi) - [X] [StableLM 模型](https://huggingface.co/stabilityai) - [x] [Deepseek 模型](https://huggingface.co/models?search=deepseek-ai/deepseek) - [x] [Qwen 模型](https://huggingface.co/models?search=Qwen/Qwen) - [x] [PLaMo-13B](https://github.com/ggml-org/llama.cpp/pull/3557) - [x] [Phi 模型](https://huggingface.co/models?search=microsoft/phi) - [x] [PhiMoE](https://github.com/ggml-org/llama.cpp/pull/11003) - [x] [GPT-2](https://huggingface.co/gpt2) - [x] [Orion 14B](https://github.com/ggml-org/llama.cpp/pull/5118) - [x] [InternLM2](https://huggingface.co/models?search=internlm2) - [x] [CodeShell](https://github.com/WisdomShell/codeshell) - [x] [Gemma](https://ai.google.dev/gemma) - [x] [Mamba](https://github.com/state-spaces/mamba) - [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf) - [x] [Xverse](https://huggingface.co/models?search=xverse) - [x] [Command-R 模型](https://huggingface.co/models?search=CohereForAI/c4ai-command-r) - [x] [SEA-LION](https://huggingface.co/models?search=sea-lion) - [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B) - [x] [OLMo](https://allenai.org/olmo) - [x] [OLMo 2](https://allenai.org/olmo) - [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924) - [x] [Granite 模型](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330) - [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia) - [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520) - [x] [Smaug](https://huggingface.co/models?search=Smaug) - [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B) - [x] [Bitnet b1.58 模型](https://huggingface.co/1bitLLM) - [x] [Flan T5](https://huggingface.co/models?search=flan-t5) - [x] [Open Elm 模型](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca) - [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b) + [GLMEdge-1.5b](https://huggingface.co/THUDM/glm-edge-1.5b-chat) + [GLMEdge-4b](https://huggingface.co/THUDM/glm-edge-4b-chat) - [x] [GLM-4-0414](https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e) - [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966) - [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct) - [x] [FalconMamba 模型](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a) - [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat) - [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a) - [x] [RWKV-7](https://huggingface.co/collections/shoumenchougou/rwkv7-gxx-gguf) - [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM) - [x] [QRWKV-6](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1) - [x] [GigaChat-20B-A3B](https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct) - [X] [Trillion-7B-preview](https://huggingface.co/trillionlabs/Trillion-7B-preview) - [x] [Ling 模型](https://huggingface.co/collections/inclusionAI/ling-67c51c85b34a7ea0aba94c32) - [x] [LFM2 模型](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38) - [x] [Hunyuan 模型](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7) - [x] [BailingMoeV2 (Ring/Ling 2.0) 模型](https://huggingface.co/collections/inclusionAI/ling-v2-68bf1dd2fc34c306c1fa6f86) #### 多模态模型 - [x] [LLaVA 1.5 模型](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 模型](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2) - [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava) - [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5) - [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V) - [x] [MobileVLM 1.7B/3B 模型](https://huggingface.co/models?search=mobileVLM) - [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL) - [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM) - [x] [Moondream](https://huggingface.co/vikhyatk/moondream2) - [x] [Bunny](https://github.com/BAAI-DCAI/Bunny) - [x] [GLM-EDGE](https://huggingface.co/models?search=glm-edge) - [x] [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d) - [x] [LFM2-VL](https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa)

语言绑定

- Python: [ddh0/easy-llama](https://github.com/ddh0/easy-llama) - Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python) - Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp) - Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp) - JS/TS (llama.cpp 服务器客户端): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp) - JS/TS (可编程提示引擎 CLI): [offline-ai/cli](https://github.com/offline-ai/cli) - JavaScript/Wasm (可在浏览器中运行): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm) - Typescript/Wasm (更好的 API，可通过 npm 获取): [ngxson/wllama](https://github.com/ng

项目地址：https://github.com/ggerganov/llama.cpp

29 次点击 ∙ 0 人收藏

登录后收藏

0 条回复

llama.cpp — LLM 本地推理引擎

llama.cpp

近期 API 变更

热点话题

快速开始

项目描述