LMCache 核心库

安装

前提条件： Python >= 3.10

pip install -e .

演示

欢迎亲自尝试我们的 Docker 演示！所有演示内容均可在此仓库获取。

快速开始：

前提条件：要运行快速开始演示，您的服务器应具备 1 块 GPU 并已安装 Docker 环境。

步骤 1： 拉取 Docker 镜像

docker pull apostacyh/vllm:lmcache-0.1.0

步骤 2： 启动 vLLM + LMCache

model=mistralai/Mistral-7B-Instruct-v0.2    # 替换为您的模型名称
sudo docker run --runtime nvidia --gpus '"device=0"' \
    -v <本机 Huggingface 缓存目录>:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_TOKEN=<您的 Huggingface 访问令牌>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.6 --port 8000 \
    --lmcache-config-file /lmcache/LMCache/examples/example-local.yaml

请将上述命令中的 <本机 Huggingface 缓存目录> 和 <您的 Huggingface 访问令牌> 替换为实际值。

您也可以修改 model 变量来使用不同的模型。

当您看到如下日志时，表示 vLLM 引擎已准备就绪：

INFO:     Started server process [865615]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

步骤 3： 运行演示应用

现在，您可以在 LMCache 仓库中运行演示应用。请在服务器上执行以下命令：

git clone https://github.com/LMCache/LMCache
cd LMCache/examples/

# 安装 OpenAI 客户端库
pip install openai

# 启动演示聊天应用
python openai_chat_completion_client.py 8000

该演示是一个基于长上下文（examples/f.txt）的问答应用。从第二轮问答开始，TTFT（首词生成时间）将显著缩短。

用例 1：通过 LMCache 在不同 vLLM 实例之间共享前缀 KV 缓存

以下说明将帮助您使用 Docker 容器部署 LMCache 后端和多个 vLLM 实例。演示应用的架构如下图所示：

前提条件：要运行快速开始演示，您的服务器必须配备 2 块 GPU 并已安装 Docker 环境。

步骤 1： 拉取 Docker 镜像

docker pull apostacyh/lmcache-server:0.1.0
docker pull apostacyh/vllm:lmcache-0.1.0

步骤 2： 启动 LMCache 后端服务器

docker run --name lmcache-server --network host -d apostacyh/lmcache-server:0.1.0 0.0.0.0 65432

步骤 3： 启动 2 个 vLLM 实例

# 第一个 vLLM 实例监听 8000 端口
model=mistralai/Mistral-7B-Instruct-v0.2    # 替换为您的模型名称
sudo docker run --runtime nvidia --gpus '"device=0"' \
    -v <本机 Huggingface 缓存目录>:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_TOKEN=<您的 Huggingface 令牌>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.7 --port 8000 \
    --lmcache-config-file /lmcache/LMCache/examples/example.yaml

现在，打开另一个终端并启动另一个 vLLM 实例：

# 第二个 vLLM 实例监听 8001 端口
model=mistralai/Mistral-7B-Instruct-v0.2    # 替换为您的模型名称
sudo docker run --runtime nvidia --gpus '"device=1"' \
    -v <本机 Huggingface 缓存目录>:/root/.cache/huggingface \
    -p 8001:8001 \
    --env "HF_TOKEN=<您的 Huggingface 令牌>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.7 --port 8001 \
    --lmcache-config-file /lmcache/LMCache/examples/example.yaml

请记得替换命令中的 <本机 Huggingface 缓存目录> 和 <您的 Huggingface 令牌>。

当您看到如下日志时，表示 vLLM 引擎已准备就绪：

INFO:     Started server process [865615]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

步骤 4： 运行演示应用
您可以在 LMCache 仓库中运行演示应用。请在服务器上执行以下命令：

git clone https://github.com/LMCache/LMCache
cd LMCache/examples/

# 安装 OpenAI 客户端库
pip install openai

在一个终端中：

# 连接到第一个 vLLM 引擎
python openai_chat_completion_client.py 8000

在另一个终端中：

# 连接到第二个 vLLM 引擎
python openai_chat_completion_client.py 8001

您应该能够看到第二个 vLLM 引擎的响应延迟显著降低。
这是因为通过 LMCache，长上下文的 KV 缓存可以在两个 vLLM 引擎之间共享。

项目地址：https://github.com/LMCache/LMCache

51 次点击 ∙ 0 人收藏

登录后收藏

0 条回复

Lmcache — 为大模型推理提供 KV Cache 复用能力

LMCache 核心库

安装

演示

快速开始：

用例 1：通过 LMCache 在不同 vLLM 实例之间共享前缀 KV 缓存