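`textdescriptives`: Add basic descriptive features as metadata#

In this tutorial, we will use the TextDescriptivesExtractor integrated into Argilla to add text descriptives as metadata.

We will cover the following topics:

📂 Load an example dataset
📃 Add text descriptives to records
🗒️ Add text descriptives to a FeedbackDataset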

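Introduction#

Text descriptives are methods for analyzing and describing the features of a text. They range from simple metrics, such as word counts, to more complex ones, such as sentiment analysis or topic modeling, and they turn unstructured text into structured data that is easier to understand. For annotation projects, they provide information that annotators do not capture, and, once added as metadata, they help with filtering the dataset and creating subsets of it.

To obtain the text descriptives, we will use the TextDescriptivesExtractor, which is based on the TextDescriptives library. By default, this extractor adds the following basic metrics:

n_tokens: The number of tokens in the text.
n_unique_tokens: The number of unique tokens in the text.
n_sentences: The number of sentences in the text.
perplexity: Measures the text's complexity, vocabulary diversity, and unpredictability. A lower score indicates that the model finds the text more predictable, whereas a higher score means that the model finds it less predictable.
entropy: Indicates the randomness or uncertainty of the text. A higher score denotes diverse, unpredictable language use.
flesch_reading_ease: A readability test designed to indicate how easy an English text is to understand, based on sentence length and the number of syllables per word. A higher score means the text is easier to read, whereas a lower score indicates complexity.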
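
To make these definitions more concrete, here is a minimal sketch. It is not how TextDescriptives computes the metrics: the library relies on a spaCy pipeline, whereas this example assumes naive regex tokenization and empirical token frequencies. The helper names (`basic_counts`, `flesch_reading_ease`, `empirical_entropy`) are hypothetical. The Flesch formula is the standard published one, and perplexity is taken here as the exponential of the entropy.

```python
import math
import re
from collections import Counter


def basic_counts(text):
    """Count tokens, unique tokens, and sentences with naive regex rules."""
    tokens = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_tokens": len(tokens),
        "n_unique_tokens": len(set(tokens)),
        "n_sentences": len(sentences),
    }


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch formula: higher scores mean easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def empirical_entropy(tokens):
    """Shannon entropy (in nats) of the empirical token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


counts = basic_counts("The cat sat. The dog barked!")
print(counts)

# Perplexity as the exponential of entropy (assumption of this sketch)
perplexity = math.exp(empirical_entropy(["a", "b"]))
print(round(perplexity, 2))
```

The values produced this way will generally differ from those of the extractor, because tokenization, sentence splitting, and token probabilities all come from the language model there. In the sample output later in this tutorial, the reported perplexity (1.27) is close to exp(0.24), which is consistent with this exponential relationship.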

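Running Argilla#

For this tutorial, you will need a running Argilla server. There are two main options for deploying and running Argilla:

Deploy Argilla on Hugging Face Spaces: If you want to run the tutorial with an external notebook (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks.

For details about configuring your deployment, check the official Hugging Face Hub guide.

Launch Argilla using Argilla's quickstart Docker image: This is the recommended option if you want to run Argilla on your local machine. Note that this option only lets you run the tutorial locally, not with an external notebook service.

For more information on deployment options, check the Deployment section of the documentation.

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

Use the "Open in Colab" button at the top of this page. This option lets you run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
Download the .ipynb file by clicking the "View source" link at the top of the page. This option lets you download the notebook and run it on your local machine or in a Jupyter Notebook tool of your choice.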

Set up the environment#

To complete this tutorial, you need to install the Argilla client and a few third-party libraries using pip:

[ ]:

# %pip install --upgrade pip
%pip install argilla -qqq
%pip install datasets
%pip install textdescriptives

Let's make the required imports:

[4]:

import argilla as rg
from argilla.client.feedback.integrations.textdescriptives import TextDescriptivesExtractor

from datasets import load_dataset

If you are running Argilla using the Docker quickstart image or a public Hugging Face Space, you need to initialize the Argilla client with the URL and API_KEY:

[ ]:

# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
# Replace workspace with the name of your workspace
rg.init(
    api_url="https://:6900",
    api_key="owner.apikey",
    workspace="admin"
)

If you are running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:

[ ]:

# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"

# # Replace api_url with the url to your HF Spaces URL
# # Replace api_key if you configured a custom API key
# rg.init(
#     api_url="https://[your-owner-name]-[your_space_name].hf.space",
#     api_key="admin.apikey",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )

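Enable telemetry#

We gain valuable insights from how you interact with our tutorials. To improve the content we offer, the following lines of code help us understand whether this tutorial is serving you effectively. Although this is entirely anonymous, you can skip this step if you prefer. For more information, check the Telemetry page.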

[ ]:

try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")

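Load the dataset#

For this example, we will use the squad dataset from Hugging Face, a reading comprehension dataset composed of questions about a collection of Wikipedia articles, the given context, and the answers.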

[5]:

# Load the dataset and select the first 100 examples
hf_dataset = load_dataset("squad", split="train").select(range(100))

[6]:

hf_dataset

[6]:

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 100
})

Create the FeedbackDataset#

To create the FeedbackDataset, we pick the TaskTemplate for question answering with its default configuration, so no metadata properties are added.

[7]:

# Create a FeedbackDataset
dataset = rg.FeedbackDataset.for_question_answering(
    use_markdown=True,
    guidelines=None,
    metadata_properties=None,
    vectors_settings=None,
)
dataset

[7]:

FeedbackDataset(
   fields=[TextField(name='question', title='Question', required=True, type='text', use_markdown=True), TextField(name='context', title='Context', required=True, type='text', use_markdown=True)]
   questions=[TextQuestion(name='answer', title='Answer', description='Answer the question. Note that the answer must exactly be in the context.', required=True, type='text', use_markdown=True)]
   guidelines=This is a question answering dataset that contains questions and contexts. Please answer the question by using the context.)
   metadata_properties=[])
)

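We will also define our initial list of records, matching the features of the dataset with those of the task template. For the purposes of this tutorial, however, we will not add them to our dataset yet.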

[8]:

# Create our list of records
records = [
    rg.FeedbackRecord(
        fields={"question": record["question"], "context": record["context"]},
    )
    for record in hf_dataset
]

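Add text descriptives#

Our dataset currently has no metadata. To address this, we will add text descriptives as metadata using the TextDescriptivesExtractor, which takes the following arguments:

model: The language of the model.
metrics: The metrics to extract.
visible_for_annotators: Whether the metadata is visible to annotators.
show_progress: Whether to show a progress bar.

For more information about the TextDescriptivesExtractor, check the practical guide. Metadata can be added to local or remote records, or to local or remote datasets. Let's see how to do both.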

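To records#

First, we will add the text descriptives as metadata to the records we defined above. To do so, we initialize the TextDescriptivesExtractor and compute the default metrics for the question field only. Note that, since this happens at the record level, the metadata will not be visible to annotators in the UI.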

[9]:

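# Initialize the TextDescriptivesExtractor
tde = TextDescriptivesExtractor(
    model="en",  # language of the model
    metrics=None,  # None computes the default metrics
    visible_for_annotators=False,  # hide the metadata from annotators in the UI
    show_progress=True,  # display a progress bar
)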

[ ]:

# Update the records
updated_records = tde.update_records(records, fields=["question"])

As we can see below, the default metrics for the specified field have been added to the records as metadata.

[11]:

updated_records[0].metadata

[11]:

{'question_n_tokens': 13,
 'question_n_unique_tokens': 12,
 'question_n_sentences': 1,
 'question_perplexity': 1.27,
 'question_entropy': 0.24,
 'question_flesch_reading_ease': 89.52}
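
Metadata such as the above is mainly useful for filtering and creating subsets. The following is a minimal sketch of that idea in plain Python, not the Argilla filtering API: records are represented as hypothetical dictionaries standing in for FeedbackRecord objects, and `filter_by_metadata` is an illustrative helper name.

```python
def filter_by_metadata(records, key, min_value=None, max_value=None):
    """Keep records whose metadata[key] lies within [min_value, max_value]."""
    kept = []
    for record in records:
        value = record.get("metadata", {}).get(key)
        if value is None:
            continue  # skip records without this metric
        if min_value is not None and value < min_value:
            continue
        if max_value is not None and value > max_value:
            continue
        kept.append(record)
    return kept


# Hypothetical records standing in for FeedbackRecord objects
sample_records = [
    {"fields": {"question": "first"}, "metadata": {"question_n_tokens": 13}},
    {"fields": {"question": "second"}, "metadata": {"question_n_tokens": 25}},
    {"fields": {"question": "third"}, "metadata": {}},
]

# Select only the records whose question has at least 20 tokens
long_questions = filter_by_metadata(sample_records, "question_n_tokens", min_value=20)
print(len(long_questions))
```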

We can now add the updated records, along with their metadata, to our dataset and push it to Argilla.

[12]:

# Add the updated records to the dataset
dataset.add_records(updated_records)

[ ]:

# Push the dataset to Argilla
remote_dataset = dataset.push_to_argilla(name="squad_tutorial", workspace="argilla")

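To a dataset#

Now, we will update our dataset with the text descriptives for the context field. In this case, we initialize the TextDescriptivesExtractor and specify that we want to extract the metrics related to descriptive_stats and readability. We also set the visible_for_annotators argument to True, so that the metadata is visible to annotators in the UI.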

[14]:

# Initialize the TextDescriptivesExtractor
tde = TextDescriptivesExtractor(
    model="en",
    metrics=["descriptive_stats", "readability"],  # metric groups to extract
    visible_for_annotators=True,  # show the metadata to annotators in the UI
    show_progress=True,
)

[ ]:

# Update the dataset
tde.update_dataset(remote_dataset, fields=["context"])

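In this case, the dataset is remote, so it is updated directly on Argilla. As we can see below, the metrics have been added to the dataset as metadata and are visible to annotators.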

[16]:

remote_dataset.records[0].metadata

[16]:

{'question_n_tokens': 13,
 'question_n_unique_tokens': 12,
 'question_n_sentences': 1,
 'question_perplexity': 1.27,
 'question_entropy': 0.24,
 'question_flesch_reading_ease': 89.52,
 'context_flesch_reading_ease': 76.96,
 'context_flesch_kincaid_grade': 6.93,
 'context_smog': 8.84,
 'context_gunning_fog': 9.34,
 'context_automated_readability_index': 8.43,
 'context_coleman_liau_index': 8.75,
 'context_lix': 34.65,
 'context_rix': 3.0,
 'context_token_length_mean': 4.46,
 'context_token_length_median': 4.0,
 'context_token_length_std': 2.55,
 'context_sentence_length_mean': 17.71,
 'context_sentence_length_median': 14.0,
 'context_sentence_length_std': 7.46,
 'context_syllables_per_token_mean': 1.32,
 'context_syllables_per_token_median': 1.0,
 'context_syllables_per_token_std': 0.69,
 'context_n_tokens': 124,
 'context_n_unique_tokens': 68,
 'context_proportion_unique_tokens': 0.55,
 'context_n_characters': 572,
 'context_n_sentences': 7}
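
In the output above, every metadata key combines the field name and the metric name, as in `question_n_tokens` or `context_lix`. As a minimal sketch, assuming this `<field>_<metric>` naming pattern holds and that the field names are known in advance, the flat metadata can be regrouped per field. The helper `group_by_field` is hypothetical and not part of Argilla.

```python
def group_by_field(metadata, field_names):
    """Regroup flat '<field>_<metric>' keys into one dict of metrics per field."""
    grouped = {name: {} for name in field_names}
    for key, value in metadata.items():
        for name in field_names:
            prefix = name + "_"
            if key.startswith(prefix):
                # Strip the field prefix so that only the metric name remains
                grouped[name][key[len(prefix):]] = value
                break
    return grouped


# A few values copied from the output above
metadata = {
    "question_n_tokens": 13,
    "context_n_tokens": 124,
    "context_lix": 34.65,
}
grouped = group_by_field(metadata, ["question", "context"])
print(grouped["context"])
```

Grouping the metrics this way makes it easier to compare the same metric across fields, for example the token count of the question against that of the context.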

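Conclusions#

In this tutorial, we explored how to use the TextDescriptivesExtractor integrated into Argilla to add text descriptives as metadata to records and datasets, which is useful for annotation projects.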