Infinity: a high-performance inference server for embedding and reranking models

hash · 2026-05-13 11:00:23 · 69 views · 0 comments

Infinity ♾️

Infinity is a high-throughput, low-latency REST API for serving text-embedding models, reranking models, CLIP, CLAP, and ColPali. Infinity is developed under the MIT License.

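Why Infinity

Deploy any model from HuggingFace: deploy any embedding, reranking, CLIP, or sentence-transformer model from HuggingFace.
Fast inference backends: the inference server is built on PyTorch, optimum (ONNX/TensorRT), and CTranslate2, and uses FlashAttention to get the most out of NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, or Apple MPS accelerators. Infinity uses dynamic batching and runs tokenization in dedicated worker threads.
Multi-modal and multi-model: mix and match multiple models; Infinity orchestrates them.
Tested implementation: covered by unit and end-to-end tests. Embeddings served through Infinity are computed correctly, so API users can keep generating them continuously.
Easy to use: built on FastAPI. The Infinity CLI v2 lets you set every option through environment variables or command-line arguments. The OpenAPI schema is aligned with OpenAI's API specification. See the docs at https://michaelfeil.github.io/infinity to get started.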

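Getting started

Install via pip and launch the CLI

pip install "infinity-emb[all]"

After installation, with the virtual environment activated, you can run the CLI directly:

infinity_emb v2 --model-id BAAI/bge-small-en-v1.5

Use v2 --help to see a description of all arguments:

infinity_emb v2 --help

Launch the CLI with a prebuilt Docker container (recommended)

Instead of installing the CLI via pip, you can run michaelf34/infinity with Docker. Make sure your accelerator is mounted into the container (for example, install nvidia-docker and enable it with --gpus all).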

port=7997
model1=michaelfeil/bge-small-en-v1.5
model2=mixedbread-ai/mxbai-rerank-xsmall-v1
volume=$PWD/data

docker run -it --gpus all \
 -v $volume:/app/.cache \
 -p $port:$port \
 michaelf34/infinity:latest \
 v2 \
 --model-id $model1 \
 --model-id $model2 \
 --port $port

The cache path inside the Docker container is set by the environment variable HF_HOME.
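Once the container is up, the server can be queried over HTTP. The following is a minimal sketch of an OpenAI-style embeddings call using only the Python standard library. It assumes the server listens on localhost:7997 as in the command above and that the route is `/embeddings` with `model` and `input` fields; neither the route nor the field names are stated in this post, so confirm them in the Swagger UI. The helper names `build_embeddings_request` and `send` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7997"  # assumption: server started with the Docker command above


def build_embeddings_request(model: str, texts: list[str]) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings request."""
    payload = {"model": model, "input": texts}  # assumed field names
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",  # assumed route; confirm in the Swagger UI
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


embeddings_request = build_embeddings_request(
    "michaelfeil/bge-small-en-v1.5", ["Paris is in France."]
)
print(embeddings_request.full_url)
print(embeddings_request.data.decode("utf-8"))
```

With a server running, `send(embeddings_request)` performs the actual call; building and sending are kept separate so the payload can be inspected without network access.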

Specialized Docker images

Docker container for CPU

Use the `latest-cpu` image or `x.x.x-cpu` for a slimmer image. Run it like any other CPU-only Docker image. Optimum/ONNX is usually the preferred engine.

```
docker run -it \
 -v $volume:/app/.cache \
 -p $port:$port \
 michaelf34/infinity:latest-cpu \
 v2 \
 --engine optimum \
 --model-id $model1 \
 --model-id $model2 \
 --port $port
```

Docker container for ROCm (MI200 and MI300 series)

Use the `latest-rocm` image or `x.x.x-rocm` for ROCm-compatible inference. **This image is currently not built via CI/CD (it is too large); consider pinning to a specific version.** Make sure ROCm is correctly installed and ready to use with Docker. See the [docs](https://michaelfeil.github.io/infinity) for more information.

Docker container for ONNX-GPU, CUDA extensions, and TensorRT

Use the `latest-trt-onnx` image or `x.x.x-trt-onnx` for NVIDIA-compatible inference. **This image is currently not built via CI/CD (it is too large); consider pinning to a specific version.** This image supports:

- ONNX-CUDA "CUDAExecutionProvider"
- ONNX-TensorRT "TensorrtExecutionProvider" (may not always work because of ORT version mismatches)
- CUDA extensions and packages, e.g. Tri Dao's `pip install flash-attn` package when using PyTorch
- the nvcc compiler

```
docker run -it \
 -v $volume:/app/.cache \
 -p $port:$port \
 michaelf34/infinity:latest-trt-onnx \
 v2 \
 --engine optimum \
 --device cuda \
 --model-id $model1 \
 --port $port
```

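Using local models with the Docker container

To deploy a local model with the Docker container, mount the model into the container and pass its in-container path in the launch command.

Example: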

git lfs install
cd /tmp
mkdir models && cd models && git clone https://huggingface.co/BAAI/bge-small-en-v1.5
docker run -it   -v /tmp/models:/models  -p 8081:8081  michaelf34/infinity:latest v2  --model-id "/models/bge-small-en-v1.5" --port 8081

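Advanced CLI usage

Launching multiple models at the same time

Since `infinity_emb>=0.0.34`, the CLI `v2` method can launch multiple models at once. See `infinity_emb v2 --help` for all arguments and their validation. Rules for the multi-model CLI:

1. CLI options can be repeated, e.g. `v2 --model-id model/id1 --model-id model/id2 --batch-size 8 --batch-size 4`. This creates two models, `model/id1` and `model/id2`.
2. Alternatively, adjust the defaults by setting `;`-separated environment variables: `INFINITY_MODEL_ID="model/id1;model/id2;" && INFINITY_BATCH_SIZE="8;4;"`
3. A single argument is broadcast to all `--model-id` entries, e.g. `v2 --model-id model/id1 --model-id model/id2 --batch-size 8` gives both models a batch size of 8.
4. All arguments are broadcast to the number of `--model-id` entries, and API requests are routed by `--served-model-name/--model-id`.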
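The broadcast rule for per-model options can be pictured with a few lines of Python. This is an illustrative sketch of the rule as described in this post, not Infinity's actual implementation; the function name `broadcast` is hypothetical, and rejecting mismatched counts is an assumption about how validation might behave.

```python
def broadcast(values: list, n_models: int) -> list:
    """Expand a per-model option to exactly one value per model.

    A single value is repeated for every model; otherwise the number of
    values has to equal the number of models.
    """
    if len(values) == 1:
        return values * n_models
    if len(values) != n_models:
        raise ValueError(f"expected 1 or {n_models} values, got {len(values)}")
    return list(values)


model_ids = ["model/id1", "model/id2"]
shared = broadcast([8], len(model_ids))        # one --batch-size for both models
per_model = broadcast([8, 4], len(model_ids))  # one --batch-size per model
print(dict(zip(model_ids, shared)))
print(dict(zip(model_ids, per_model)))
```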

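Using environment variables instead of CLI arguments

Every CLI argument can also be set through an environment variable. Environment variables start with `INFINITY_{UPPER_SNAKE_CASE}` and usually correspond to the `--{lower-kebab-case}` CLI argument. The following two are equivalent:

- CLI: `infinity_emb v2 --model-id BAAI/bge-base-en-v1.5`
- ENV-CLI: `export INFINITY_MODEL_ID="BAAI/bge-base-en-v1.5" && infinity_emb v2`

Multiple values use the `;` syntax: `INFINITY_MODEL_ID="model/id1;model/id2;"`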
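The naming convention above can be expressed as a small helper, which is handy when generating deployment configs. This is a sketch of the convention as described here; the post says the mapping holds "usually", so treat the result as a starting point and check `infinity_emb v2 --help`. Both function names are hypothetical.

```python
def cli_flag_to_env_var(flag: str) -> str:
    """Map a CLI flag such as --model-id to its INFINITY_* environment variable."""
    return "INFINITY_" + flag.lstrip("-").replace("-", "_").upper()


def join_values(values: list[str]) -> str:
    """Join several per-model values with ';', keeping the trailing ';' shown above."""
    return "".join(f"{value};" for value in values)


env = {
    cli_flag_to_env_var("--model-id"): join_values(["model/id1", "model/id2"]),
    cli_flag_to_env_var("--batch-size"): join_values(["8", "4"]),
}
for name, value in env.items():
    print(f'export {name}="{value}"')
```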

API Key

Pass `--api-key secret123` on the CLI, or set the environment variable `INFINITY_API_KEY="secret123"`.
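A client then has to present the key with each request. The sketch below assumes the key is sent as an HTTP bearer token in the `Authorization` header and uses a `/models` route as the target; both are assumptions not stated in this post, so verify them against the Swagger UI of your instance. The helper name `authorized_request` is hypothetical.

```python
import urllib.request


def authorized_request(url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request that carries the API key as a bearer token (assumed scheme)."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Bearer {api_key}")
    return request


# Hypothetical target URL; the request is only built here, not sent.
models_request = authorized_request("http://localhost:7997/models", "secret123")
print(models_request.get_header("Authorization"))
```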

Choosing the fastest engine

With `--engine torch`, the model must be compatible with https://github.com/UKPLab/sentence-transformers/ and AutoModel. With `--engine optimum`, ONNX files must be present; models from https://huggingface.co/Xenova are recommended. With `--engine ctranslate2`, only `BERT` models are supported.

Opting out of telemetry collection

See which telemetry is collected: https://michaelfeil.eu/infinity/main/telemetry/

```
# disable telemetry
export INFINITY_ANONYMOUS_USAGE_STATS="0"
```

Tasks and models supported by Infinity

Infinity aims to be the inference server that supports the widest range of embedding, reranking, and other RAG-related tasks. Infinity tests 15+ architectures and all of the cases below in its GitHub CI.
The sections below list each task together with validated example models.

Text embeddings

Text embeddings measure the relatedness of text strings. Embeddings are commonly used for search, clustering, recommendations, and similar tasks. Think of this as a self-hosted counterpart to OpenAI's text embeddings (https://platform.openai.com/docs/guides/embeddings).

Validated embedding models:

- [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1)
- [WhereIsAI/UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1)
- [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5)
- [Alibaba-NLP/gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5)
- [jinaai/jina-embeddings-v2-base-code](https://huggingface.co/jinaai/jina-embeddings-v2-base-code)
- [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)
- [jinaai/jina-embeddings-v3](nomic-ai/nomic-embed-text-v1.5)
- [BAAI/bge-m3, no sparse](https://huggingface.co/BAAI/bge-m3)
- Decoder-based models. Note that these are roughly 20-100x larger (and slower) than bert-small models:
  - [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct/discussions/20)
  - [Salesforce/SFR-Embedding-2_R](https://huggingface.co/Salesforce/SFR-Embedding-2_R/discussions/6)
  - [Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct/discussions/39)

Other models:

- Most embedding models are likely supported: https://huggingface.co/models?pipeline_tag=feature-extraction&other=text-embeddings-inference&sort=trending
- Check the MTEB leaderboard to find models: https://huggingface.co/spaces/mteb/leaderboard
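Relatedness between two embeddings is commonly measured with cosine similarity. The sketch below shows the computation in plain Python; the three-dimensional vectors are made-up toy values for illustration, not output from any of the models listed above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
query = [0.1, 0.9, 0.0]
doc_close = [0.2, 0.8, 0.1]
doc_far = [0.9, 0.0, 0.4]

print(cosine_similarity(query, doc_close))  # higher: similar direction
print(cosine_similarity(query, doc_far))    # lower: different direction
```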

Reranking

Given a query and a set of documents, reranking orders the documents from most to least semantically relevant to the query. It is similar to a self-hosted version of https://docs.cohere.com/reference/rerank.

Validated reranking models:

- [mixedbread-ai/mxbai-rerank-xsmall-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1)
- [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base)
- [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base)
- [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large)
- [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)
- [jinaai/jina-reranker-v1-turbo-en](https://huggingface.co/jinaai/jina-reranker-v1-turbo-en)

Other reranking models:

- The reranking models Infinity supports are BERT-style classification models with a single class.
- Most reranking models are likely supported: https://huggingface.co/models?pipeline_tag=text-classification&other=text-embeddings-inference&sort=trending
- https://huggingface.co/models?pipeline_tag=text-classification&sort=trending&search=rerank
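Conceptually, a reranker assigns one relevance score per document and the caller sorts by that score. The sketch below shows only the sorting step, with made-up scores; the function name `rank_documents` and the `top_n` parameter are hypothetical and not part of Infinity's API.

```python
from typing import Optional


def rank_documents(
    docs: list[str], scores: list[float], top_n: Optional[int] = None
) -> list[tuple[str, float]]:
    """Return (document, score) pairs ordered from most to least relevant."""
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    ranked = [(docs[i], scores[i]) for i in order]
    return ranked if top_n is None else ranked[:top_n]


docs = ["unrelated text", "the best match", "a partial match"]
scores = [0.1, 0.9, 0.5]  # hypothetical relevance scores, one per document

for doc, score in rank_documents(docs, scores):
    print(f"{score:.2f}  {doc}")
```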

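Multi-modal and cross-modal: image and audio embeddings

Specialized embedding models that allow image<->text or image<->audio search. These models typically support text<->text, text<->other, and other<->other search, with an accuracy trade-off when going across modalities.

Image<->text models can be used, for example, for photo-gallery search, where users type keywords to find photos or use a photo to find related images. Audio<->text models are less common; they can be used, for example, to find music tracks from a text description or from related music.

Tested image<->text models:

- [wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M](https://huggingface.co/wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M)
- [jinaai/jina-clip-v1](https://huggingface.co/jinaai/jina-clip-v1)
- [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- Models of type `ClipModel` / `SiglipModel` in `config.json`

Tested audio<->text models:

- [CLAP models from LAION](https://huggingface.co/collections/laion/clap-contrastive-language-audio-pretraining-65415c0b18373b607262a490)
- Only a limited number of open-source organizations train these models
- Note: the sampling rate of the audio data must match the model

Not supported:

- Plain vision models, e.g. nomic-ai/nomic-embed-vision-v1.5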

ColBERT-style late-interaction embeddings

ColBERT embeddings apply no special pooling; they return the raw **token embeddings**. These **token embeddings** then need to be scored with the MaxSim metric in a vector DB (Qdrant / Vespa). When using the REST API, late-interaction embeddings are best transported with `base64` encoding. Example notebook: https://colab.research.google.com/drive/14FqLc0N_z92_VgL_zygWV5pJZkaskyk7?usp=sharing

Tested ColBERT models:

- [colbert-ir/colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0)
- [jinaai/jina-colbert-v2](https://huggingface.co/jinaai/jina-colbert-v2)
- [mixedbread-ai/mxbai-colbert-large-v1](https://huggingface.co/mixedbread-ai/mxbai-colbert-large-v1)
- [answerai-colbert-small-v1 - see the link for instructions](https://huggingface.co/answerdotai/answerai-colbert-small-v1/discussions/14)
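For readers unfamiliar with MaxSim: each query token takes the similarity of its best-matching document token, and these maxima are summed. The plain-Python sketch below illustrates the metric on toy two-dimensional token vectors; in practice the scoring runs inside the vector DB, and the values here are invented for illustration.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (hypothetical values, one vector per token).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # contains an exact match per query token
doc_b = [[0.5, 0.5]]                          # only a partial match

print(maxsim(query_tokens, doc_a))
print(maxsim(query_tokens, doc_b))
```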

ColPali-style late-interaction image<->text embeddings

Usage is similar to ColBERT, but the model scans image<->text pairs rather than text only. When using the REST API, late-interaction embeddings are best transported with `base64` encoding. Example notebook: https://colab.research.google.com/drive/14FqLc0N_z92_VgL_zygWV5pJZkaskyk7?usp=sharing

Tested ColPali/ColQwen models:

- [vidore/colpali-v1.2-merged](https://huggingface.co/michaelfeil/colpali-v1.2-merged)
- [michaelfeil/colqwen2-v0.1](https://huggingface.co/michaelfeil/colqwen2-v0.1)
- LoRA adapters are not supported, only "merged" models.
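On the client side, a `base64` payload has to be turned back into token vectors. The sketch below assumes the payload is a flat buffer of little-endian float32 values and that the embedding dimension is known to the caller; both points are assumptions not stated in this post, so check them against the example notebook. The function name `decode_token_embeddings` is hypothetical.

```python
import base64
import struct


def decode_token_embeddings(encoded: str, dim: int) -> list[list[float]]:
    """Decode base64 text into rows of `dim` floats (assumed little-endian float32)."""
    raw = base64.b64decode(encoded)
    count = len(raw) // 4  # 4 bytes per float32 value
    if len(raw) % 4 or count % dim:
        raise ValueError("buffer size does not match the embedding dimension")
    flat = struct.unpack(f"<{count}f", raw)
    return [list(flat[i:i + dim]) for i in range(0, count, dim)]


# Round trip with toy values that float32 represents exactly.
tokens = [[0.5, -1.0], [2.0, 0.25]]
flat_values = [value for token in tokens for value in token]
encoded = base64.b64encode(struct.pack(f"<{len(flat_values)}f", *flat_values)).decode("ascii")

print(decode_token_embeddings(encoded, dim=2))
```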

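Text classification

A BERT-style multi-label text classifier that sorts text into distinct categories.

Tested models:

- [ProsusAI/finbert](https://huggingface.co/ProsusAI/finbert), financial news classification
- [SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions), text to emotion categories
- BERT-style text-classification models with more than one label in `config.json`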
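A multi-label classifier returns a score for every label, and callers often want only the strongest one. The sketch below assumes each prediction is a list of `{"label": ..., "score": ...}` entries; that shape is an assumption for illustration and is not described in this post. The scores are invented, and the helper name `top_label` is hypothetical.

```python
def top_label(prediction: list[dict]) -> tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    best = max(prediction, key=lambda item: item["score"])
    return best["label"], best["score"]


# Hypothetical predictions for two sentences (assumed response shape).
predictions = [
    [{"label": "admiration", "score": 0.93}, {"label": "neutral", "score": 0.04}],
    [{"label": "neutral", "score": 0.31}, {"label": "disappointment", "score": 0.55}],
]

for prediction in predictions:
    print(top_label(prediction))
```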

Using Infinity through the Python API

Besides the CLI and REST API, Infinity can be used through its Python API, which offers the most flexibility. The Python API builds on asyncio and its async/await features to handle requests concurrently. The CLI arguments are also available in Python.

Embeddings

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["Embed this sentence via Infinity.", "Paris is in France."]
array = AsyncEngineArray.from_args([
  EngineArgs(model_name_or_path = "BAAI/bge-small-en-v1.5", engine="torch", embedding_dtype="float32", dtype="auto")
])

async def embed_text(engine: AsyncEmbeddingEngine): 
    # Option 1: the context manager starts and stops the engine
    async with engine: 
        embeddings, usage = await engine.embed(sentences=sentences)
    # Option 2: handle the async start / stop yourself
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    await engine.astop()
asyncio.run(embed_text(array[0]))

Reranking

Reranking gives you a similarity score between a query and multiple documents. Use it together with a vector DB plus embeddings, or standalone for a small number of documents. Select a single-class classification model from HuggingFace that is compatible with AutoModelForSequenceClassification.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
query = "What is the python package infinity_emb?"
docs = ["This is a document not related to the python package infinity_emb, hence...", 
    "Paris is in France!",
    "infinity_emb is a Python package for sentence embeddings and reranking with transformer models!"]
# from_args takes a list of EngineArgs and belongs to AsyncEngineArray
array = AsyncEngineArray.from_args(
  [EngineArgs(model_name_or_path = "mixedbread-ai/mxbai-rerank-xsmall-v1", engine="torch")]
)

async def rerank(engine: AsyncEmbeddingEngine): 
    # Option 1: the context manager starts and stops the engine
    async with engine:
        ranking, usage = await engine.rerank(query=query, docs=docs)
        print(list(zip(ranking, docs)))
    # Option 2: handle the async start / stop yourself
    await engine.astart()
    ranking, usage = await engine.rerank(query=query, docs=docs)
    await engine.astop()

asyncio.run(rerank(array[0]))

Launch the reranker with the CLI:

infinity_emb v2 --model-id mixedbread-ai/mxbai-rerank-xsmall-v1
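A reranker launched this way is reachable over the REST API. The sketch below builds such a request with the standard library only. The `/rerank` route and the `model`, `query`, and `documents` fields are assumptions not stated in this post, as is the default port 7997 on localhost; check the Swagger UI of your instance before relying on them. The helper name `build_rerank_request` is hypothetical.

```python
import json
import urllib.request


def build_rerank_request(
    base_url: str, model: str, query: str, documents: list[str]
) -> urllib.request.Request:
    """Build (but do not send) a rerank request for a running server."""
    payload = {"model": model, "query": query, "documents": documents}  # assumed fields
    return urllib.request.Request(
        url=f"{base_url}/rerank",  # assumed route; confirm in the Swagger UI
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


rerank_request = build_rerank_request(
    "http://localhost:7997",
    "mixedbread-ai/mxbai-rerank-xsmall-v1",
    "What is the python package infinity_emb?",
    ["Paris is in France!", "infinity_emb is a Python package for embeddings and reranking."],
)
print(rerank_request.full_url)
print(rerank_request.data.decode("utf-8"))
```

Passing `rerank_request` to `urllib.request.urlopen` sends it once the server is running.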

Image embeddings: CLIP models

CLIP models can encode both images and text.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
engine_args = EngineArgs(
    model_name_or_path = "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M", 
    engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine): 
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    embeddings_image, _ = await engine.image_embed(images=images)
    await engine.astop()

asyncio.run(embed(array["wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"]))

Audio embeddings: CLAP models

CLAP models can encode both audio and text.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
import requests
import soundfile as sf
import io

sentences = ["This is awesome.", "I am bored."]

url = "https://bigsoundbank.com/UPLOAD/wav/2380.wav"
raw_bytes = requests.get(url, stream=True).content

audios = [raw_bytes]
engine_args = EngineArgs(
    model_name_or_path = "laion/clap-htsat-unfused",
    dtype="float32", 
    engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine): 
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    # audio_embed returns (embeddings, usage), like embed and image_embed
    embedding_audios, _ = await engine.audio_embed(audios=audios)
    await engine.astop()

asyncio.run(embed(array["laion/clap-htsat-unfused"]))

Text classification

Use Infinity's classify feature for text classification tasks such as sentiment analysis and emotion detection.

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

sentences = ["This is awesome.", "I am bored."]
engine_args = EngineArgs(
    model_name_or_path = "SamLowe/roberta-base-go_emotions", 
    engine="torch", model_warmup=True)
array = AsyncEngineArray.from_args([engine_args])

async def classifier(engine: AsyncEmbeddingEngine): 
    # Option 1: the context manager starts and stops the engine
    async with engine:
        predictions, usage = await engine.classify(sentences=sentences)
    # Option 2: handle the async start / stop yourself
    await engine.astart()
    predictions, usage = await engine.classify(sentences=sentences)
    await engine.astop()
asyncio.run(classifier(array["SamLowe/roberta-base-go_emotions"]))

Using Infinity through the Python client

Infinity ships a generated client for calling the REST API.

To call a remote Infinity instance over the REST API, install the following package locally:

pip install infinity_client

See the client README for more information:
https://github.com/michaelfeil/infinity/tree/main/libs/client_infinity/infinity_client

Integrations:

Documentation

See the docs at https://michaelfeil.github.io/infinity to get started. Once the server is running, the Swagger UI is available at {url}:{port}/docs, in this case http://localhost:7997/docs. An interactive preview is also available at https://infinity.modal.michaelfeil.eu/docs (and at https://michaelfeil-infinity.hf.space/docs).

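Contributing and development

Install with Poetry 1.8.1 and Python 3.11 on Ubuntu 22.04:

cd libs/infinity_emb
poetry install --extras all --with lint,test

To pass the CI checks:

cd libs/infinity_emb
make precommit

All contributions must be made in a way that is compatible with this repository's MIT License.

Citation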

@software{feil_2023_11630143,
  author       = {Feil, Michael},
  title        = {Infinity - To Embeddings and Beyond},
  month        = oct,
  year         = 2023,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.11630143},
  url          = {https://doi.org/10.5281/zenodo.11630143}
}

Project page: https://github.com/michaelfeil/infinity
