Instructor-XL: A Research Project for Instruction Finetuning and Data Construction

lemon · 2026-06-04 11:00:24

One Embedder, Any Task: Instruction-Finetuned Text Embeddings

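This repository contains the code and pretrained models for our paper "One Embedder, Any Task: Instruction-Finetuned Text Embeddings". Please refer to our project page for a quick overview of the project.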

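We introduce Instructor👨‍🏫, an instruction-finetuned text embedding model. It can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation) and any domain (e.g., science, finance) simply by providing the task instruction, without any further finetuning. Instructor👨‍🏫 achieves state-of-the-art results on 70 diverse embedding tasks!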

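**** Updates ****

01/21: We updated the code structure to support easy package installation.
12/28: We updated the model checkpoint using hard negatives.
12/20: We released our paper, code, project page, and model checkpoint. Check them out!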

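Installation

Using INSTRUCTOR for text embeddings is straightforward, and you can easily try it out in a Colab notebook. On a local machine, we recommend creating a virtual environment first: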

conda create -n instructor python=3.7
git clone https://github.com/HKUNLP/instructor-embedding
cd instructor-embedding
pip install -r requirements.txt

This creates the instructor environment we used. To use the embedding tool, first install the InstructorEmbedding package from PyPI:

pip install InstructorEmbedding

Or install it directly from our code:

pip install -e .

Environment setup

Activate the environment by running:

conda activate instructor

Getting Started

First, download a pretrained model (see the model list for details):

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')

Then provide the sentences and customized instructions to the model:

# prepare texts with instructions
text_instruction_pairs = [
    {"instruction": "Represent the Science title:", "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
    {"instruction": "Represent the Medicine sentence for retrieving a duplicate sentence:", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear."}
]

# postprocess into [instruction, text] pairs
texts_with_instructions = []
for pair in text_instruction_pairs:
    texts_with_instructions.append([pair["instruction"], pair["text"]])

# calculate embeddings
customized_embeddings = model.encode(texts_with_instructions)

That's it. We now have the embeddings as numpy arrays, one per instruction-text pair.

for pair, embedding in zip(text_instruction_pairs, customized_embeddings):
    print("Instruction: ", pair["instruction"])
    print("text: ", pair["text"])
    print("Embedding: ", embedding)
    print("")
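
The post-processing loop above can also be wrapped in a small helper. The following is a minimal sketch: `to_encode_input` is a hypothetical name, not part of the InstructorEmbedding package, and it only reshapes the data without calling the model.

```python
from typing import Dict, List


def to_encode_input(pairs: List[Dict[str, str]]) -> List[List[str]]:
    """Convert {"instruction", "text"} dicts into [instruction, text] lists."""
    encode_input = []
    for index, pair in enumerate(pairs):
        # Fail early with a clear message if a required key is missing.
        missing = {"instruction", "text"} - set(pair)
        if missing:
            raise KeyError(f"pair {index} is missing keys: {sorted(missing)}")
        encode_input.append([pair["instruction"], pair["text"]])
    return encode_input


# One pair taken from the example above.
example_pairs = [
    {"instruction": "Represent the Science title:",
     "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
]
print(to_encode_input(example_pairs))
```

The returned list has the shape that `model.encode` expects for its `sentences` argument.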

The `encode` function

Users of the model only need the encode function:

model.encode( sentences,
              batch_size: int = 32,
              show_progress_bar: bool = None,
              output_value: str = 'sentence_embedding',
              convert_to_numpy: bool = True,
              convert_to_tensor: bool = False,
              device: str = None,
              normalize_embeddings: bool = False)

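sentences: The sentences to embed, in the format [["instruction prompt 0", "text to be embedded 0"], ["instruction prompt 1", "text to be embedded 1"], ...].
batch_size (default: 32): The batch size used for the computation. It determines how many sentences are processed together in each batch.
show_progress_bar (default: None): If set to True, displays a progress bar while encoding.
output_value (default: 'sentence_embedding'): The desired output type. The default returns sentence embeddings; 'token_embeddings' returns wordpiece token embeddings; None returns all output values.
convert_to_numpy (default: True): If True, the output is a list of numpy vectors; if False, it is a list of PyTorch tensors.
convert_to_tensor (default: False): If True, a single stacked tensor is returned. This parameter overrides convert_to_numpy.
device (default: None): The torch.device to use for the computation. If not specified, the default device is used.
normalize_embeddings (default: False): If True, the returned vectors have length 1, i.e., they are normalized. In that case, similarity search can use the faster dot product (util.dot_score) instead of cosine similarity.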
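
To illustrate why normalized embeddings allow the dot product to stand in for cosine similarity, here is a minimal sketch in plain Python. The vectors are toy values rather than model output, and the helper names (`l2_normalize`, `dot`, `cosine`) are hypothetical.

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


u = [3.0, 4.0]
v = [1.0, 2.0]
# The two printed values agree up to floating-point rounding.
print(cosine(u, v))
print(dot(l2_normalize(u), l2_normalize(v)))
```

Once the vectors are unit length, the division by the norms in `cosine` becomes a division by 1, which is why the cheaper dot product gives the same ranking.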

Model List

We released a series of INSTRUCTOR checkpoints of different sizes. You can easily load them with the InstructorEmbedding package.

Model	Avg. Score
hkunlp/instructor-base	55.9
hkunlp/instructor-large	58.4
hkunlp/instructor-xl	58.8

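Use Cases

Below are a few specific use cases. For more examples and applications, refer to our paper.

Calculate embeddings for your customized texts

If you want to calculate customized embeddings for specific sentences, you can write the instructions following this unified template: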

Represent the domain text_type for task_objective:

domain (optional): specifies the domain of the text, e.g., science, finance, medicine.
text_type (required): specifies the encoding unit, e.g., sentence, document, paragraph.
task_objective (optional): specifies the objective of the embedding, e.g., retrieve a document, classify the sentence.
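
The template can be filled programmatically. The following is a minimal sketch: `build_instruction` is a hypothetical helper, not part of the InstructorEmbedding package, and it simply skips the optional parts when they are not given.

```python
from typing import Optional


def build_instruction(text_type: str,
                      domain: Optional[str] = None,
                      task_objective: Optional[str] = None) -> str:
    """Fill the template 'Represent the domain text_type for task_objective:'."""
    if not text_type:
        raise ValueError("text_type is required")
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    if task_objective:
        parts.append(f"for {task_objective}")
    return " ".join(parts) + ":"


print(build_instruction("title", domain="Science"))
print(build_instruction("sentence", domain="Medicine",
                        task_objective="retrieving a duplicate sentence"))
```

These two calls reproduce the instructions used in the Getting Started example.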

Compute similarities between texts

You can use INSTRUCTOR to compute similarities between two groups of sentences with customized embeddings:

from sklearn.metrics.pairwise import cosine_similarity
sentences_a = [['Represent the Science sentence: ','Parton energy loss in QCD matter'], 
               ['Represent the Financial statement: ','The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ','The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ','The funds rose less than 0.5 per cent on Friday']]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
similarities = cosine_similarity(embeddings_a,embeddings_b)
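
The result of `cosine_similarity` is a matrix with one row per entry of `sentences_a` and one column per entry of `sentences_b`. As a rough sketch of how to read it, the snippet below picks the best-matching column for each row. The scores are placeholders for illustration, not model output, and `best_matches` is a hypothetical helper.

```python
from typing import List


def best_matches(similarities: List[List[float]]) -> List[int]:
    """For each row, return the column index with the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarities]


# Placeholder scores (rows: sentences_a, columns: sentences_b).
toy_similarities = [
    [0.8, 0.1],
    [0.2, 0.7],
]
for row_index, col_index in enumerate(best_matches(toy_similarities)):
    print(f"sentences_a[{row_index}] is closest to sentences_b[{col_index}]")
```

With the NumPy array returned by scikit-learn, `similarities.argmax(axis=1)` gives the same indices.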

Use customized embeddings for information retrieval

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
query  = [['Represent the Wikipedia question for retrieving supporting documents: ','where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ','Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ',"The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans\u2014and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ','Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings,corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
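
`np.argmax` returns only the single best document. If several candidates are needed, the scores can be ranked instead. The following is a minimal sketch with placeholder scores; `top_k` is a hypothetical helper, not part of the package.

```python
from typing import List


def top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Placeholder similarity scores for one query against three documents.
toy_scores = [0.12, 0.45, 0.30]
print(top_k(toy_scores, k=2))
```

With the matrix from the example above, the scores for the single query would be `similarities[0].tolist()`.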

Use customized embeddings for clustering

import sklearn.cluster
sentences = [['Represent the Medicine sentence for clustering: ','Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
             ['Represent the Medicine sentence for clustering: ','Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
             ['Represent the Medicine sentence for clustering: ','Fermion Bags in the Massive Gross-Neveu Model'],
             ['Represent the Medicine sentence for clustering: ',"QCD corrections to Associated t-tbar-H production at the Tevatron"],
             ['Represent the Medicine sentence for clustering: ','A New Analysis of the R Measurements: Resonance Parameters of the Higher,  Vector States of Charmonium']]
embeddings = model.encode(sentences)
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)
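
The printed labels are easier to inspect when the texts are grouped by cluster. A minimal sketch follows; the texts and labels are placeholders rather than real clustering output, and `group_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_cluster(texts: Sequence[str], labels: Sequence[int]) -> Dict[int, List[str]]:
    """Collect texts under the cluster label assigned to each of them."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        # int() also accepts NumPy integer labels such as those in labels_.
        groups[int(label)].append(text)
    return dict(groups)


toy_texts = ["paper A", "paper B", "paper C"]
toy_labels = [0, 1, 0]  # placeholder labels
print(group_by_cluster(toy_texts, toy_labels))
```

In the example above, the texts would be `[pair[1] for pair in sentences]` and the labels would be `cluster_assignment`.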

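Training

Data

We construct Multitask Embeddings Data with Instructions (MEDI), a collection of 330 datasets from Super-NI, sentence-transformer embedding training data, KILT, and MedMCQA, spanning a wide range of domains and tasks. For data that does not come with positive and negative pairs, we construct them, and we store everything in the following unified format: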

[
    {'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'big little lies season 2 how many episodes'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Big Little Lies (TV series) series garnered several accolades. It received 16 Emmy Award nominations and won eight, including Outstanding Limited Series and acting awards for Kidman, Skarsgård, and Dern. The trio also won Golden Globe Awards in addition to a Golden Globe Award for Best Miniseries or Television Film win for the series. Kidman and Skarsgård also received Screen Actors Guild Awards for their performances. Despite originally being billed as a miniseries, HBO renewed the series for a second season. Production on the second season began in March 2018 and is set to premiere in 2019. All seven episodes are being written by Kelley'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Little People, Big World final minutes of the season two-A finale, "Farm Overload". A crowd had gathered around Jacob, who was lying on the ground near the trebuchet. The first two episodes of season two-B focus on the accident, and how the local media reacted to it. The first season of "Little People, Big World" generated solid ratings for TLC (especially in the important 18–49 demographic), leading to the show\'s renewal for a second season. Critical reviews of the series have been generally positive, citing the show\'s positive portrayal of little people. Conversely, other reviews have claimed that the show has a voyeuristic bend'], 'task_id': 1}
    {'query': ['Represent the Wikipedia question for retrieving relevant documents;', 'who sang waiting for a girl like you'], 'pos': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You Waiting for a Girl Like You "Waiting for a Girl Like You" is a 1981 power ballad by the British-American rock band Foreigner. The distinctive synthesizer theme was performed by the then-little-known Thomas Dolby, and this song also marked a major departure from their earlier singles because their previous singles were mid to upper tempo rock songs while this song was a softer love song with the energy of a power ballad. It was the second single released from the album "4" (1981) and was co-written by Lou Gramm and Mick Jones. It has become one of the band\'s most'], 'neg': ['Represent the Wikipedia document for retrieval;', 'Waiting for a Girl Like You held off the number 1 spot by Olivia Newton-John\'s single "Physical" for nine consecutive weeks, and then by Hall & Oates\' "I Can\'t Go for That (No Can Do)" for a tenth week on January 30, 1982. Because of its chart longevity, it ended up being the number 19 song on the Top 100 singles of 1982. The song was the band\'s biggest hit until "I Want to Know What Love Is" hit number 1 in 1985. The song lists at number 100 on ""Billboard"\'s Greatest Songs of All Time". Waiting for a Girl Like You "Waiting for a Girl'], 'task_id': 1}
    ...
    {'query': ['Represent the Wikipedia sentence for retrieving relevant documents;', 'i LOVE sweet martini drinks!'], 'pos': ['Represent the Wikipedia document for retrieval;', "Appletini Appletini\nAn Apple martini (Appletini for short) is a cocktail containing vodka and one or more of apple juice, apple cider, apple liqueur, or apple brandy.\nThis drink, originally called an Adam's Apple Martini because the bartender who created it was named Adam, was created in 1996 at Lola's West Hollywood restaurant.\nThe drink, Adam's Apple was advertised by Smirnoff in the July 1972 issue of Playboy Magazine to the inside front cover. The recipe called for an ounce or so of Smirnoff"], 'neg': ['Represent the Wikipedia document for retrieval;', "Aromatised wine similar beverages described in this legislation are 'aromatised wine-based drinks' (non-fortified) and 'aromatised wine-product cocktail' (blended, lower alcohol drink under 7% ABV).\nVarieties of aromatised wine.\nVarieties of aromatised wine Vermouth.\nVermouth is the most widely used aromatised wine due to its use in cocktails and famous commercial brands such as Martini and Cinzano which are commonplace around the world. Vermouth can be sweet or dry and red, white, pink or orange. It is traditionally"], 'task_id': 300}
]

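Each instance consists of a query, a positive pair (pos), a negative pair (neg), and the task ID (task_id), which is used to ensure that the data in the same training batch come from the same task.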
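
One way to picture the role of `task_id` is a batcher that never mixes tasks. The following is only a sketch under that assumption, not the sampler used in the repository's training code; `batches_by_task` and the shortened instances are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List


def batches_by_task(instances: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield batches in which every instance shares the same task_id."""
    by_task: Dict[int, List[dict]] = defaultdict(list)
    for instance in instances:
        by_task[instance["task_id"]].append(instance)
    for task_instances in by_task.values():
        for start in range(0, len(task_instances), batch_size):
            yield task_instances[start:start + batch_size]


# Shortened placeholder instances; real MEDI entries also carry query/pos/neg.
toy_data = [
    {"task_id": 1, "name": "a"},
    {"task_id": 300, "name": "b"},
    {"task_id": 1, "name": "c"},
    {"task_id": 1, "name": "d"},
]
for batch in batches_by_task(toy_data, batch_size=2):
    print([item["name"] for item in batch])
```

A real training loop would typically also shuffle within and across tasks; that is left out here to keep the grouping logic visible.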

The MEDI data is available for download at this link.

Train INSTRUCTOR

We provide an example script for training INSTRUCTOR. First download the MEDI data, unzip it, and put medi-data.json under --cache_dir:

python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir {output_directory} --cache_dir {cache_directory} --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.1 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir

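The arguments are explained below:

--model_name_or_path: The pretrained checkpoint to start from. Both model IDs (e.g., sentence-transformers/gtr-t5-large, sentence-transformers/sentence-t5-large) and checkpoint paths are supported.
--cl_temperature: The temperature of the contrastive loss.
--cache_dir: The directory used to cache downloaded models and data. The downloaded MEDI data (medi-data.json) should be placed under this directory.
--output_dir: The directory where the trained models (checkpoints) are stored for evaluation.

All other arguments are standard Huggingface transformers training arguments, such as --overwrite_output_dir, --num_train_epochs, and --learning_rate. For details, refer to the Huggingface transformers documentation.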
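
To give a feel for what --cl_temperature controls, the sketch below computes a softmax-style contrastive loss over one positive and a list of negatives. This form is an assumption chosen for illustration; the exact loss in train.py is not shown in this post. The similarity values are placeholders and `contrastive_loss` is a hypothetical helper.

```python
import math
from typing import List


def contrastive_loss(sim_pos: float, sim_negs: List[float], temperature: float) -> float:
    """Negative log-probability of the positive under a temperature-scaled softmax."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]


# Placeholder similarities: the positive is closer to the query than the negative.
print(contrastive_loss(0.8, [0.3], temperature=0.1))
print(contrastive_loss(0.8, [0.3], temperature=1.0))
```

A lower temperature stretches the gap between the positive and the negatives, so the same pair of similarities yields a smaller loss when the positive already ranks first.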

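Evaluation

We evaluate INSTRUCTOR extensively on 70 diverse tasks across a wide range of domains. Specifically, we build the evaluation on three benchmarks: MTEB, Billboard, and Prompt Retrieval. The details of running the evaluation scripts are explained below.

MTEB

To evaluate model performance on the MTEB benchmark datasets, first install the MTEB library: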

cd evaluation/MTEB
pip install -e .

Then run the following command:

python examples/evaluate_model.py --model_name hkunlp/instructor-large --output_dir outputs --task_name ArguAna --result_file results

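You can evaluate your own trained checkpoints by specifying --model_name, and run all MTEB datasets by changing --task_name. For the evaluation metrics of each task, see our paper or the MTEB benchmark.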
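
Running several tasks means repeating the command with a different --task_name. The sketch below only builds the argument lists and prints them. `mteb_command` is a hypothetical helper; ArguAna comes from the command above, while SciFact and NFCorpus are MTEB task names picked for illustration, and that the script accepts them is an assumption.

```python
from typing import List


def mteb_command(task_name: str,
                 model_name: str = "hkunlp/instructor-large",
                 output_dir: str = "outputs",
                 result_file: str = "results") -> List[str]:
    """Build the argument list for examples/evaluate_model.py."""
    return [
        "python", "examples/evaluate_model.py",
        "--model_name", model_name,
        "--output_dir", output_dir,
        "--task_name", task_name,
        "--result_file", result_file,
    ]


# Print one command per task instead of executing it.
for task in ["ArguAna", "SciFact", "NFCorpus"]:
    print(" ".join(mteb_command(task)))
```

To actually run a command, the list can be passed to `subprocess.run(..., check=True)` from inside evaluation/MTEB.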

Billboard

To evaluate model performance on Billboard, run the following command:

cd evaluation/text_evaluation
python main.py --model_name hkunlp/instructor-large --task mscoco --add_prompt

You can evaluate your own trained checkpoints by specifying --model_name, and run all Billboard datasets by changing --task. For all three Billboard datasets, we report the Pearson correlation.
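
For readers unfamiliar with the metric, the Pearson correlation measures how linearly two score lists move together. Here is a minimal plain-Python sketch; the score lists are placeholders, and the evaluation script may compute the value with a library routine instead.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


model_scores = [0.2, 0.5, 0.9]  # placeholder model similarity scores
human_scores = [1.0, 3.0, 4.0]  # placeholder human judgments
print(pearson(model_scores, human_scores))
```

A value near 1 means the model's scores rise and fall with the human judgments; a value near 0 means no linear relationship.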

Prompt Retrieval

To evaluate model performance on Prompt Retrieval, run the following command:

cd evaluation/prompt_retrieval
python main.py --embedding_model hkunlp/instructor-large --task rte --model_cache_dir {cache_dir} --output_dir {output_dir} --add_prompt

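You can evaluate your own trained checkpoints by specifying --embedding_model, and run the Prompt Retrieval datasets by changing --task. To keep the metric consistent, we cast all tasks in Prompt Retrieval into a "text-to-text" format and report the Rouge-L score.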
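
Rouge-L is based on the longest common subsequence (LCS) between a generated text and a reference. The sketch below computes an F1 variant with lowercase whitespace tokenization. Both choices are assumptions made for illustration, so the evaluation script's numbers may differ; `lcs_length` and `rouge_l_f1` are hypothetical helpers.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 of LCS-based precision and recall over whitespace tokens."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

In this example the LCS is "the cat on the mat" (5 tokens out of 6 on each side), so precision, recall, and F1 all come out to 5/6.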

Quantization

To quantize the INSTRUCTOR embedding model, run the following code:

import torch
from InstructorEmbedding import INSTRUCTOR

# load the model (this can be done on CPU or GPU)
model = INSTRUCTOR('hkunlp/instructor-large', device='cpu')

# quantize the model dynamically (only the Linear layers, to int8)
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# inference
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"

embeddings = qmodel.encode([[instruction, sentence]])

print(f"Quantized Embeddings:\n {embeddings}")

Quantization reduces the model size by 10x, and inference is also faster than with the unquantized model.

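Bugs or questions?

If you have any questions about the code or the paper, feel free to email Hongjin (hjsu@cs.hku.hk) or Weijia (swj0419@cs.washington.edu). Please describe the problem in as much detail as possible so that we can help you faster and more effectively.

Citation

If you find our work helpful, please cite us: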

@inproceedings{INSTRUCTOR,
  title={One Embedder, Any Task: Instruction-Finetuned Text Embeddings},
  author={Su, Hongjin and Shi, Weijia and Kasai, Jungo and Wang, Yizhong and Hu, Yushi and  Ostendorf, Mari and Yih, Wen-tau and Smith, Noah A. and  Zettlemoyer, Luke and Yu, Tao},
  url={https://arxiv.org/abs/2212.09741},
  year={2022},
}

INSTRUCTOR Elsewhere

We thank the community for their efforts in extending INSTRUCTOR!

LangChain supports INSTRUCTOR models through InstructEmbeddings.
MosaicML has integrated Instructor-Large and Instructor-XL.
embaas has integrated Instructor-Large.
Haystack includes the InstructorTextEmbedder and InstructorDocumentEmbedder components.

Project repository: https://github.com/HKUNLP/instructor-embedding
