AirLLM: Large-Model Inference Optimization for Low-VRAM Environments

Quickstart | Configurations | MacOS | Example Notebooks | FAQ

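AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. You can now also run the 405B Llama3.1 on 8GB of VRAM.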

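Updates

[2024/08/20] v2.11.0: Support for Qwen2.5.

[2024/08/18] v2.10.1: Support for CPU inference and for non-sharded models. Thanks to @NavodPeiris for the great work!

[2024/07/30] Support for Llama3.1 405B (example notebook). Support for 8bit/4bit quantization.

[2024/04/20] AirLLM supports Llama3 natively. Run Llama3 70B on a single 4GB GPU.

[2023/12/25] v2.8.2: Support for running 70B large language models on MacOS.

[2023/12/20] v2.7: Support for AirLLMMixtral.

[2023/12/20] v2.6: Added AutoModel, which detects the model type automatically, so no model class needs to be provided when initializing a model.

[2023/12/18] v2.5: Added prefetching to overlap model loading and compute. 10% speed improvement.

[2023/12/03] Added support for ChatGLM, QWen, Baichuan, Mistral, and InternLM!

[2023/12/02] Added support for safetensors. All top 10 models on the Open LLM Leaderboard are now supported.

[2023/12/01] airllm 2.0: support for compression, with a 3x inference speedup!

[2023/11/20] Initial release of airllm!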

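Quickstart

1. Install the package

First, install the airllm pip package.

pip install airllm

2. Inference

Then initialize AutoModel, passing the Hugging Face repo ID of the model or its local path, and run inference just as with a regular Transformers model.

(When initializing AutoModel, you can also set layer_shards_saving_path to choose where the layer-wise split model is saved.)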

from airllm import AutoModel

MAX_LENGTH = 128
# A Hugging Face model repo ID can be used:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# Or use the model's local path...
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
        'What is the capital of United States?',
        #'I like',
    ]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH, 
    padding=False)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)

Note: during inference, the original model is first split layer by layer and saved. Make sure the Hugging Face cache directory has enough disk space.
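
The layer-by-layer splitting mentioned in the note can be pictured with a toy sketch. This is not AirLLM's code: the "layers" are plain scale-and-shift functions stored as JSON files, and split_and_save and run_layer_by_layer are hypothetical names. It only illustrates why peak memory can track one layer rather than the whole model, and why the split copy needs disk space of its own.

```python
import json
import tempfile
from pathlib import Path


def split_and_save(layers, out_dir):
    """Write each layer's parameters to its own file and return the paths in order."""
    paths = []
    for index, params in enumerate(layers):
        path = Path(out_dir) / f"layer_{index:03d}.json"
        path.write_text(json.dumps(params))
        paths.append(path)
    return paths


def run_layer_by_layer(paths, x):
    """Apply the layers in order, holding only one layer's parameters at a time."""
    for path in paths:
        params = json.loads(path.read_text())  # load a single layer from disk
        x = [params["scale"] * value + params["shift"] for value in x]
        del params  # drop the layer before loading the next one
    return x


# Toy "model": three scale-and-shift layers standing in for transformer blocks.
toy_layers = [
    {"scale": 2.0, "shift": 1.0},
    {"scale": 0.5, "shift": 0.0},
    {"scale": 1.0, "shift": -1.0},
]

with tempfile.TemporaryDirectory() as shard_dir:
    shard_paths = split_and_save(toy_layers, shard_dir)
    result = run_layer_by_layer(shard_paths, [1.0, 2.0])

print(result)
```

Every layer is read from disk once per forward pass in this sketch, which is also why disk loading becomes the bottleneck that the compression section addresses.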

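Model Compression: 3x Inference Speedup!

We have just added model compression based on block-wise quantization. It can speed up inference by up to 3x with almost negligible accuracy loss! (For more performance evaluation and the reasons for using block-wise quantization, see this paper.)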

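How to enable model compression speedup:

Step 1. Make sure bitsandbytes is installed: pip install -U bitsandbytes
Step 2. Make sure the airllm version is later than 2.0.0: pip install -U airllm
Step 3. When initializing the model, pass the compression argument ('4bit' or '8bit'):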

model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
                     compression='4bit' # pass '8bit' for 8-bit block-wise quantization
                    )

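How does model compression differ from quantization?

Quantization normally has to quantize both weights and activations to deliver a real speedup, which makes it harder to preserve accuracy and to avoid the effect of outliers across all kinds of inputs.

In our case the bottleneck is mainly disk loading, so we only need to shrink what is loaded from disk. We can therefore quantize the weights alone, which makes accuracy easier to preserve.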
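
To make the weight-only, block-wise idea concrete, here is a minimal sketch in pure Python. It uses a simple absmax int8 scheme with one scale per block, which is an assumption chosen for illustration; AirLLM relies on bitsandbytes, whose actual formats and kernels differ. The function names and the block size of 4 are hypothetical.

```python
def quantize_blockwise(weights, block_size=4):
    """Quantize a flat list of floats to int8 values with one absmax scale per block."""
    quantized, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block) or 1.0  # fall back to 1.0 for an all-zero block
        scale = absmax / 127.0
        scales.append(scale)
        quantized.extend(round(w / scale) for w in block)
    return quantized, scales


def dequantize_blockwise(quantized, scales, block_size=4):
    """Restore approximate float weights; activations are never quantized."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]


# The second block holds a large outlier; the first block keeps its own small scale.
weights = [0.02, -0.01, 0.03, 0.015, 8.0, -4.0, 2.0, 1.0]
q, s = quantize_blockwise(weights)
restored = dequantize_blockwise(q, s)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

Because each block carries its own scale, the outlier in the second block does not degrade the precision of the small weights in the first block, and only the stored weights shrink while the computation itself stays in floating point.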

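Configurations

The following options are supported when initializing the model:

compression: supported options are 4bit and 8bit for 4-bit or 8-bit block-wise quantization, or None (the default) for no compression.
profiling_mode: supported options are True to output time consumption, or False (the default).
layer_shards_saving_path: optional; an alternative path for saving the split model.
hf_token: a Hugging Face token, which can be provided here to download gated models such as meta-llama/Llama-2-7b-hf.
prefetching: prefetching to overlap model loading and compute. Enabled by default. Currently only AirLLMLlama2 supports it.
delete_original: if disk space is short, set delete_original to True to delete the originally downloaded Hugging Face model and keep only the converted one, saving half the disk space.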
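
The options above can be collected in one place before initialization. The helper below is a hypothetical convenience, not part of airllm: build_airllm_kwargs and the example path are illustrative names, and the False default for delete_original is an assumption, since the list does not state it.

```python
def build_airllm_kwargs(compression=None, profiling_mode=False,
                        layer_shards_saving_path=None, hf_token=None,
                        prefetching=True, delete_original=False):
    """Collect the listed options into keyword arguments, validating compression."""
    if compression not in (None, "4bit", "8bit"):
        raise ValueError(f"unsupported compression: {compression!r}")
    kwargs = {
        "compression": compression,
        "profiling_mode": profiling_mode,
        "prefetching": prefetching,
        "delete_original": delete_original,  # default of False is an assumption
    }
    # Optional values are passed only when set, so the library defaults apply otherwise.
    if layer_shards_saving_path is not None:
        kwargs["layer_shards_saving_path"] = layer_shards_saving_path
    if hf_token is not None:
        kwargs["hf_token"] = hf_token
    return kwargs


kwargs = build_airllm_kwargs(compression="4bit",
                             layer_shards_saving_path="/data/airllm_shards",  # hypothetical path
                             delete_original=True)
print(kwargs)

# With airllm installed, the options would then be passed at initialization:
# from airllm import AutoModel
# model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct", **kwargs)
```

Validating compression up front surfaces a typo such as '4-bit' before any model download or splitting begins.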

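MacOS

Just install airllm and run the same code as on Linux. See Quickstart for details.

Make sure mlx and torch are installed.
You may need to install a native Python build; see here for details.
Only Apple silicon is supported.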
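
A quick way to check these requirements before installing is to ask the interpreter what platform it sees. This is a small sketch using only the standard library; is_apple_silicon is a hypothetical helper name, not an airllm function.

```python
import platform


def is_apple_silicon(system=None, machine=None):
    """Return True for macOS running on an arm64 (Apple silicon) interpreter."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"


print("Apple silicon:", is_apple_silicon())
```

An x86_64 Python build running under Rosetta typically reports x86_64 here even on Apple silicon hardware, which is one reason a native Python build may be needed.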

Example Python Notebooks

Example Colab link:

Examples for other models (ChatGLM, QWen, Baichuan, Mistral, etc.):

* ChatGLM:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("THUDM/chatglm3-6b-base")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH, 
    padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

* QWen:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

* Baichuan, InternLM, Mistral, etc.:

from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("baichuan-inc/Baichuan2-7B-Base")
#model = AutoModel.from_pretrained("internlm/internlm-20b")
#model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(), 
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])

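To request support for other models: click here

Supported Models

AirLLM supports all mainstream models, including but not limited to:
Llama, Llama2, Llama3, Llama3.1, ChatGLM, QWen, Qwen2.5, Baichuan, Mistral, InternLM, Mixtral, Platypus2, CodeLlama, Vicuna, LongChat, StarCoder, Orca, Vigogne, WizardCoder, WizardLM, and others.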

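Acknowledgements

Much of the code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg:

GitHub account @SimJeg,
the code on Kaggle,
the associated discussion.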

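FAQ

1. MetadataIncompleteBuffer

safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

If you run into this error, the most likely cause is running out of disk space. Splitting the model is very disk-intensive. See this discussion. You may need to extend your disk space, clear the Hugging Face .cache, and rerun.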
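
Free space can be checked before the split starts. The sketch below uses only the standard library; has_enough_disk is a hypothetical helper, and the safety factor of 2 is an assumption that the original download and the split copy coexist on the same disk.

```python
import shutil


def has_enough_disk(path, model_size_bytes, safety_factor=2.0):
    """Return True if free space at `path` is at least safety_factor times the model size."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= model_size_bytes * safety_factor


# Rough figure: a 70B-parameter model in 16-bit weights takes about 140 GB.
model_size = 140 * 10**9
print(has_enough_disk(".", model_size))
```

Point the path argument at the disk that holds the Hugging Face cache, or at layer_shards_saving_path if one is set.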

2. ValueError: max() arg is an empty sequence

Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following:

For QWen models:

from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)

For ChatGLM models:

from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)

3. 401 Client Error....Repo model ... is gated.

Some models are gated and require a Hugging Face API token. You can provide it via hf_token:

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", hf_token='HF_API_TOKEN')
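
Rather than hardcoding the token in source, it can be read from the environment. This is a minimal sketch: resolve_hf_token is a hypothetical helper, and using the HF_TOKEN variable name here is an assumption about how the token is stored on your machine.

```python
import os


def resolve_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Read the Hugging Face token from an environment variable, or fail clearly."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"set {var_name} before loading a gated model")
    return token


# Usage with airllm installed (not executed here):
# from airllm import AutoModel
# model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", hf_token=resolve_hf_token())
```

Failing early with a clear message avoids starting a large download that would end in the 401 error above.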

4. ValueError: Asking to pad but the tokenizer does not have a padding token.

Some models' tokenizers have no padding token, so you can either set a padding token or simply turn padding off:

input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH, 
    padding=False  #<-----------   turn off padding
)
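
For the other option, setting a padding token, a common convention with Hugging Face tokenizers is to reuse the end-of-sequence token. The sketch below shows that convention on a stand-in object; ensure_pad_token is a hypothetical helper, and whether reusing the EOS token suits a given model is an assumption to verify.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Reuse the end-of-sequence token for padding when no pad token is defined."""
    if getattr(tokenizer, "pad_token", None) is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in object for demonstration; with airllm this would be model.tokenizer.
demo_tokenizer = SimpleNamespace(pad_token=None, eos_token="</s>")
ensure_pad_token(demo_tokenizer)
print(demo_tokenizer.pad_token)
```

After this call, padding=True can be passed to the tokenizer without triggering the error.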

Citing AirLLM

If you find AirLLM useful in your research and wish to cite it, please use the following BibTeX entry:

@software{airllm2023,
  author = {Gavin Li},
  title = {AirLLM: scaling large language models on low-end commodity computers},
  url = {https://github.com/lyogavin/airllm/},
  version = {0.0},
  year = {2023},
}

Contributing

Contributions, ideas, and discussions are welcome!

If you find it useful, please ⭐ or buy me a coffee! 🙏

Project page: https://github.com/lyogavin/airllm
