name: azure-ai-evaluation-py
description: |
  Azure AI Evaluation SDK for Python. Evaluate generative AI applications with quality, safety, and custom evaluators.
  Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics".
package: azure-ai-evaluation
Evaluate the performance of generative AI applications with built-in and custom evaluators.
pip install azure-ai-evaluation
# For remote evaluation support
pip install azure-ai-evaluation[remote]
# For AI-assisted evaluators
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
# For Foundry project integration
AIPROJECT_CONNECTION_STRING=<your-connection-string>
import os
from azure.ai.evaluation import (
GroundednessEvaluator,
RelevanceEvaluator,
CoherenceEvaluator,
FluencyEvaluator,
SimilarityEvaluator,
RetrievalEvaluator
)
# Initialize with an Azure OpenAI model configuration
model_config = {
"azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
"api_key": os.environ["AZURE_OPENAI_API_KEY"],
"azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"]
}
groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)
from azure.ai.evaluation import (
F1ScoreEvaluator,
RougeScoreEvaluator,
BleuScoreEvaluator,
GleuScoreEvaluator,
MeteorScoreEvaluator
)
f1 = F1ScoreEvaluator()
rouge = RougeScoreEvaluator()
bleu = BleuScoreEvaluator()
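NLP evaluators need no model configuration; they compare a response against a reference answer. A minimal single-row sketch (the sample strings are illustrative):
# Returns a dict keyed by the metric name
result = f1(
    response="Azure AI provides AI services and tools.",
    ground_truth="Azure AI is Microsoft's platform for AI services and tooling."
)
print(result["f1_score"])  # 0-1; higher means more overlap with the reference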
from azure.ai.evaluation import (
ViolenceEvaluator,
SexualEvaluator,
SelfHarmEvaluator,
HateUnfairnessEvaluator,
IndirectAttackEvaluator,
ProtectedMaterialEvaluator
)
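# project_scope identifies your Azure AI project, e.g.
# {"subscription_id": "...", "resource_group_name": "...", "project_name": "..."}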
violence = ViolenceEvaluator(azure_ai_project=project_scope)
sexual = SexualEvaluator(azure_ai_project=project_scope)
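Safety evaluators score a query/response pair on a 0-7 severity scale; a minimal single-row sketch (the sample strings are illustrative):
result = violence(
    query="What is Azure AI?",
    response="Azure AI provides AI services and tools."
)
print(result)  # severity label, numeric score, and reasoning for the violence metric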
from azure.ai.evaluation import GroundednessEvaluator
groundedness = GroundednessEvaluator(model_config)
result = groundedness(
    query="What is Azure AI?",
    context="Azure AI is Microsoft's AI platform...",
    response="Azure AI provides AI services and tools."
)
print(f"Groundedness score: {result['groundedness']}")
print(f"Reason: {result['groundedness_reason']}")
from azure.ai.evaluation import evaluate
result = evaluate(
data="test_data.jsonl",
evaluators={
"groundedness": groundedness,
"relevance": relevance,
"coherence": coherence
},
evaluator_config={
"default": {
"column_mapping": {
"query": "${data.query}",
"context": "${data.context}",
"response": "${data.response}"
}
}
}
)
print(result["metrics"])
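The column mapping above assumes each line of test_data.jsonl provides query, context, and response fields; a sketch of writing one such line:
import json
# Each JSONL line supplies the columns referenced in the column mapping
row = {
    "query": "What is Azure AI?",
    "context": "Azure AI is Microsoft's AI platform...",
    "response": "Azure AI provides AI services and tools."
}
with open("test_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(row) + "\n")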
from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator
# All quality metrics in a single evaluator
qa_evaluator = QAEvaluator(model_config)
# All safety metrics in a single evaluator
safety_evaluator = ContentSafetyEvaluator(azure_ai_project=project_scope)
result = evaluate(
data="data.jsonl",
evaluators={
"qa": qa_evaluator,
"content_safety": safety_evaluator
}
)
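Composite evaluators also run on a single row; a sketch of a QAEvaluator call, assuming it takes the union of inputs needed by the quality evaluators it wraps (values are illustrative):
result = qa_evaluator(
    query="What is Azure AI?",
    context="Azure AI is Microsoft's AI platform...",
    response="Azure AI provides AI services and tools.",
    ground_truth="Azure AI is Microsoft's platform for AI services and tooling."
)
print(result)  # one score per wrapped quality metric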
from azure.ai.evaluation import evaluate
from my_app import chat_app  # your application
result = evaluate(
data="queries.jsonl",
    target=chat_app,  # callable that takes a query and returns a response
evaluators={
"groundedness": groundedness
},
evaluator_config={
"default": {
"column_mapping": {
"query": "${data.query}",
"context": "${outputs.context}",
"response": "${outputs.response}"
}
}
}
)
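The target is any callable: evaluate() passes data columns (here query) as keyword arguments, and the keys of the returned dict become ${outputs.*} in the column mapping. A hypothetical sketch of my_app.chat_app under that assumption:
def chat_app(query: str) -> dict:
    # Placeholder retrieval and generation; replace with your RAG pipeline
    context = "Azure AI is Microsoft's AI platform..."
    response = "Azure AI provides AI services and tools."
    return {"context": context, "response": response}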
# A code-based custom evaluator is any callable that returns a dict of metrics
def word_count_evaluator(response: str) -> dict:
    return {"word_count": len(response.split())}
# Use it in evaluate()
result = evaluate(
data="data.jsonl",
evaluators={"word_count": word_count_evaluator}
)
from openai import AzureOpenAI

# Prompt-based custom evaluator: a callable class that asks an LLM judge for a score.
# Sketch using the openai package with the same model_config keys defined above.
class CustomEvaluator:
    def __init__(self, model_config):
        self.client = AzureOpenAI(azure_endpoint=model_config["azure_endpoint"],
                                  api_key=model_config["api_key"],
                                  api_version="2024-06-01")
        self.deployment = model_config["azure_deployment"]
    def __call__(self, query: str, response: str) -> dict:
        prompt = f"Rate this response from 1 to 5 (reply with the number only). Query: {query} Response: {response}"
        completion = self.client.chat.completions.create(model=self.deployment,
                                                         messages=[{"role": "user", "content": prompt}])
        return {"custom_score": int(completion.choices[0].message.content.strip())}
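The class plugs into evaluate() like any built-in evaluator; a brief usage sketch:
custom = CustomEvaluator(model_config)
# Single-row call
print(custom(query="What is Azure AI?", response="Azure AI provides AI services and tools."))
# Or alongside built-in evaluators in a batch run
result = evaluate(
    data="data.jsonl",
    evaluators={"custom_score": custom, "groundedness": groundedness}
)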
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
project = AIProjectClient.from_connection_string(
conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
credential=DefaultAzureCredential()
)
result = evaluate(
data="data.jsonl",
evaluators={"groundedness": groundedness},
    azure_ai_project=project.scope  # log results to Azure AI Foundry
)
print(f"View results: {result['studio_url']}")
| Evaluator | Type | Metrics |
|---|---|---|
| GroundednessEvaluator | AI-assisted | Groundedness (1-5) |
| RelevanceEvaluator | AI-assisted | Relevance (1-5) |
| CoherenceEvaluator | AI-assisted | Coherence (1-5) |
| FluencyEvaluator | AI-assisted | Fluency (1-5) |
| SimilarityEvaluator | AI-assisted | Similarity (1-5) |
| RetrievalEvaluator | AI-assisted | Retrieval quality (1-5) |
| F1ScoreEvaluator | NLP | f1_score (0-1) |
| RougeScoreEvaluator | NLP | ROUGE scores |
| ViolenceEvaluator | Safety | Violence (0-7) |
| SexualEvaluator | Safety | Sexual content (0-7) |
| SelfHarmEvaluator | Safety | Self-harm (0-7) |
| HateUnfairnessEvaluator | Safety | Hate/unfairness (0-7) |
| QAEvaluator | Composite | All quality metrics |
| ContentSafetyEvaluator | Composite | All safety metrics |
| File | Contents |
|---|---|
| references/built-in-evaluators.md | Detailed schemas and configuration tables for AI-assisted, NLP-based, and safety evaluators |
| references/custom-evaluators.md | Creating code-based and prompt-based custom evaluators; testing patterns |
| scripts/run_batch_evaluation.py | CLI tool for running batch evaluations with quality, safety, and custom evaluators |