Sentence-Transformers: a leading tool for computing semantic text similarity

four · 2025-12-13 03:02:23

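Sentence Transformers: Text Embeddings, Retrieval, and Reranking

This framework provides an easy way to access, use, and train state-of-the-art embedding and reranker models. It can compute embeddings with Sentence Transformer models, similarity scores with Cross-Encoder (also known as reranker) models, and sparse embeddings with Sparse Encoder models; the documentation has a quickstart for each. This unlocks a wide range of applications, including semantic search, semantic textual similarity, and paraphrase mining.

More than 15,000 pre-trained Sentence Transformers models are available for immediate use on 🤗 Hugging Face, including many of the top models from the Massive Text Embeddings Benchmark (MTEB) leaderboard. Sentence Transformers also makes it easy to train or fine-tune your own embedding, reranker, or sparse encoder models, so you can build custom models for your specific use case.

For the full documentation, see www.SBERT.net.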

Installation

We recommend Python 3.10+, PyTorch 1.11.0+, and transformers v4.34.0+.
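
A minimal sketch for checking an environment against these recommended minimums, using only the standard library. The helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the version parsing is deliberately simple (it keeps only the leading numeric components), so treat it as an illustration rather than a robust version comparator.

```python
import re
import sys
from importlib import metadata

# Recommended minimums quoted from the installation notes above.
MINIMUMS = {"torch": "1.11.0", "transformers": "4.34.0"}


def parse_version(version: str) -> tuple:
    """Keep the leading dotted integers, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings component by component, padding with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Report whether Python and each package satisfy the recommendation."""
    report = {"python": sys.version_info >= (3, 10)}
    for package, minimum in MINIMUMS.items():
        try:
            report[package] = meets_minimum(metadata.version(package), minimum)
        except metadata.PackageNotFoundError:
            report[package] = False  # package is not installed at all
    return report


print(check_environment())
```

A `False` entry means the component is either missing or older than the recommended version.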

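Install with pip

pip install -U sentence-transformers

Install with conda

conda install -c conda-forge sentence-transformers

Install from source

Alternatively, you can clone the latest version from the repository and install it directly from the source code by running the following from the repository root:

pip install -e .

PyTorch with CUDA

To use a GPU / CUDA, you must install PyTorch with a matching CUDA version. See PyTorch - Get Started for details on how to install PyTorch.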
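
After installing, it can help to confirm that PyTorch actually sees the GPU. The sketch below is one way to do that; `pick_device` is a hypothetical helper name, and it falls back to "cpu" when PyTorch is missing or was built without CUDA support.

```python
import importlib.util


def pick_device() -> str:
    """Return 'cuda' when a CUDA-enabled PyTorch build sees a GPU, else 'cpu'."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

The resulting string can be passed on when loading a model, for example `SentenceTransformer("all-MiniLM-L6-v2", device=device)`.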

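Getting Started

See the Quickstart in our documentation.

Embedding Models

First download a pretrained embedding model, a.k.a. a Sentence Transformer model.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

Then provide some texts to the model.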

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# => (3, 384)

And that's it. We now have a numpy array with an embedding for each text, which we can use to compute similarities.

similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])
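
To make the matrix above less opaque: unless a model is configured otherwise, `model.similarity` scores Sentence Transformer embeddings with cosine similarity. The sketch below computes the same kind of pairwise matrix in plain Python on tiny hand-made 3-dimensional vectors, not real model output; `cosine_similarity_matrix` is a hypothetical helper name.

```python
import math


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity: dot(a, b) / (|a| * |b|)."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [
            sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
            for b, norm_b in zip(vectors, norms)
        ]
        for a, norm_a in zip(vectors, norms)
    ]


# Toy vectors chosen by hand for illustration; the real embeddings above
# have 384 dimensions.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
matrix = cosine_similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(value, 4) for value in row])
```

As in the real output, the diagonal is 1 (each vector compared with itself) and the matrix is symmetric.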

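Reranker Models

First download a pretrained reranker model, a.k.a. a Cross Encoder model.

from sentence_transformers import CrossEncoder

# 1. Load a pretrained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

Then provide some texts to the model.

# The texts for which to predict similarity scores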
query = "How many people live in Berlin?"
passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin has a yearly total of about 135 million day visitors, making it one of the most-visited cities in the European Union.",
    "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
]

# 2a. Predict scores for pairs of texts
scores = model.predict([(query, passage) for passage in passages])
print(scores)
# => [8.607139 5.506266 6.352977]
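
Ranking the passages by hand from these scores amounts to sorting the passage indices by descending score. A minimal sketch in plain Python that reuses the three scores printed above, so no model is loaded; `rank_by_score` is a hypothetical helper name.

```python
def rank_by_score(scores):
    """Return (corpus_id, score) pairs sorted from most to least relevant."""
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Scores copied from the predict() output above.
scores = [8.607139, 5.506266, 6.352977]
for corpus_id, score in rank_by_score(scores):
    print(f"#{corpus_id} ({score:.2f})")
# #0 (8.61)
# #2 (6.35)
# #1 (5.51)
```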

And that's it. You can also use model.rank to avoid performing the reranking manually:

# 2b. Rank a list of passages for a query
ranks = model.rank(query, passages, return_documents=True)

print("Query:", query)
for rank in ranks:
    print(f"- #{rank['corpus_id']} ({rank['score']:.2f}): {rank['text']}")
"""
Query: How many people live in Berlin?
- #0 (8.61): Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
- #2 (6.35): In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
- #1 (5.51): Berlin has a yearly total of about 135 million day visitors, making it one of the most-visited cities in the European Union.
"""

Sparse Encoder Models

First download a pretrained sparse embedding model, a.k.a. a Sparse Encoder model.

from sentence_transformers import SparseEncoder

# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate sparse embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522] - sparse representation with vocabulary size dimensions

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[   35.629,     9.154,     0.098],
#         [    9.154,    27.478,     0.019],
#         [    0.098,     0.019,    29.553]])

# 4. Check sparsity statistics
stats = SparseEncoder.sparsity(embeddings)
print(f"Sparsity: {stats['sparsity_ratio']:.2%}")
# Sparsity: 99.84%
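
As a rough illustration of what a sparsity ratio measures, the sketch below counts the fraction of zero entries in a small hand-written matrix and computes a dot product over sparse vectors stored as index-to-weight dictionaries. This is not the library's implementation: `sparsity_ratio` and `sparse_dot` are hypothetical helpers, and the toy values are made up.

```python
def sparsity_ratio(rows):
    """Fraction of entries that are exactly zero across all rows."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return zeros / total


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(weight * b[index] for index, weight in a.items() if index in b)


# Toy 2 x 10 "embeddings": only 3 of the 20 entries are non-zero.
dense_rows = [
    [0, 0, 1.5, 0, 0, 0, 0, 2.0, 0, 0],
    [0, 0, 0.5, 0, 0, 0, 0, 0, 0, 0],
]
print(f"Sparsity: {sparsity_ratio(dense_rows):.2%}")
# Sparsity: 85.00%

# The same rows in sparse form; only the shared index 2 contributes.
print(sparse_dot({2: 1.5, 7: 2.0}, {2: 0.5}))
# 0.75
```

Storing only the non-zero entries is what makes vocabulary-sized vectors such as the 30522-dimensional ones above practical to keep and compare.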

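Pre-Trained Models

We provide a large list of pretrained models for more than 100 languages. Some are general-purpose models, while others produce embeddings for specific use cases.

Training

This framework lets you fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You can choose from a range of options to obtain embeddings well suited to your task.

Embedding Models
Sentence Transformer > Training Overview
Sentence Transformer > Training Examples, or the training examples on GitHub.
Reranker Models
Cross Encoder > Training Overview
Cross Encoder > Training Examples, or the training examples on GitHub.
Sparse Embedding Models
Sparse Encoder > Training Overview
Sparse Encoder > Training Examples, or the training examples on GitHub.

Some highlights across the different types of training:

Support for various transformer networks, including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, and others.
Multilingual and multi-task learning.
Evaluation during training to find the optimal model.
20+ loss functions for embedding models, 10+ loss functions for reranker models, and 10+ loss functions for sparse embedding models, allowing you to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss, and more.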
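
One of the objectives named in the last item is easy to write out. As a hedged illustration (not the library's code), a triplet loss asks that an anchor sit closer to a positive example than to a negative one by at least a margin: max(d(a, p) - d(a, n) + margin, 0). The sketch uses Euclidean distance and a margin of 1.0, both chosen here for illustration, and the helper names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # distance 1 from the anchor
negative = [3.0, 4.0]  # distance 5 from the anchor

print(triplet_loss(anchor, positive, negative))  # 0.0
print(triplet_loss(anchor, negative, positive))  # 5.0
```

The first call returns zero because the triplet is already well separated; swapping the positive and negative yields a positive loss that training would try to reduce.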

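Application Examples

You can use this framework for semantic search, semantic textual similarity, paraphrase mining, and many more use cases.

For all examples, see examples/sentence_transformer/applications.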
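
As one example of these applications, semantic search reduces to ranking corpus embeddings by their similarity to a query embedding. A minimal plain-Python sketch, with hand-made 2-dimensional vectors standing in for model output; `cosine` and `top_k` are hypothetical helpers, not functions from the library.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query, corpus, k=2):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = ((index, cosine(query, vector)) for index, vector in enumerate(corpus))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy "embeddings"; real ones would come from model.encode().
corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.2]

hits = top_k(query, corpus, k=2)
for index, score in hits:
    print(f"corpus[{index}] score={score:.3f}")
```

Using `heapq.nlargest` avoids sorting the whole corpus when only a few top hits are needed.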

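Development Setup

After cloning the repository (or your fork) to your machine, run the following in a virtual environment:

python -m pip install -e ".[dev]"

pre-commit install

To test your changes, run: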

pytest

Citing & Authors

If you find this repository helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

If you use one of the multilingual models, feel free to cite our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}

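See the Publications page for our other publications that are integrated into SentenceTransformers.

Maintainers

Maintainer: Tom Aarsen, 🤗 Hugging Face

Don't hesitate to open an issue if something is broken (which it shouldn't be) or if you have further questions.

This project was originally developed by the Ubiquitous Knowledge Processing (UKP) Lab at TU Darmstadt. We are grateful for their foundational work and continued contributions to the field.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Project URL: https://github.com/UKPLab/sentence-transformers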
