🩹 Delete labels from a Token or Text Classification dataset#

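It is common to want to delete a label from your dataset, perhaps because you changed your mind or because you want to correct the label's name. This is not a trivial change, however: if the dataset already has annotations, the change affects later operations and can trigger errors.

In this tutorial, you will learn how to remove, modify, or merge labels in Token and Text Classification datasets to handle this situation.

Let's get started!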

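Note

This tutorial is a Jupyter Notebook. There are two ways to run it:

Use the "Open in Colab" button at the top of this page. This option lets you run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
Download the .ipynb file by clicking the "View source" link at the top of the page. This option lets you download the notebook and run it on your local machine or in a Jupyter notebook tool of your choice.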

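Setup#

For this tutorial, you need a running Argilla server. If you don't have one yet, check out our Quickstart or Installation pages. Once that is done, complete the following steps:

Install the Argilla client and the required third-party libraries using pip: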

[ ]:

%pip install --upgrade argilla -qqq

Let's make the necessary imports:

[ ]:

import argilla as rg

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API_KEY:

[ ]:

# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900",
    api_key="admin.apikey"
)

If you are running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:

[ ]:

# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"

# # Replace api_url with the url to your HF Spaces URL
# # Replace api_key if you configured a custom API key
# rg.init(
#     api_url="https://[your-owner-name]-[your_space_name].hf.space",
#     api_key="admin.apikey",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )

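Enable Telemetry#

We gain valuable insights from how you interact with our tutorials. To help us improve and offer you the most suitable content, running the following lines of code lets us know whether this tutorial is serving you effectively. This is completely anonymous, and you can skip this step if you prefer. For more information, check out the Telemetry page.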

[ ]:

try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")

First steps#

Let's set some variables to avoid errors in the following steps.

[ ]:

# save the name of the dataset that we will be working with
dataset_name = "my_dataset"

# and set the workspace where the dataset is located
rg.set_workspace("my_workspace")

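Optionally, you can create a backup of the dataset in case you want to revert the changes. To do so, you may want to create a workspace dedicated to backups and copy the dataset there.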

[ ]:

# optional: create a new workspace for the backups.
backups_ws = rg.Workspace.create("backups")

[ ]:

# optional: if you want users without the owner role to have access to this workspace
# change `username` and run this cell.
user = rg.User.from_name("username")
backups_ws.add_user(user.id)

[ ]:

# copy the dataset in the new workspace
rg.copy(dataset_name, name_of_copy=f"{dataset_name}_backup", workspace=backups_ws.name)

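Let's load the settings and look at the available labels.

Tip

Copy and paste the names of the labels you want to use from the output to avoid mistakes.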

[ ]:

settings = rg.load_dataset_settings(dataset_name)

[ ]:

# run this cell if you need to read or copy the labels
settings.label_schema

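Now, save the label you want to change (old_label) and what you want to change it to (new_label) as variables. Depending on your intent, choose one of the following options:

To change the text of a label, save the new text in new_label.
To merge the annotations of one label into another existing label, save the label you want to remove in old_label and the label that will now hold the annotations in new_label.
To remove a label and all of its annotations, set new_label to None.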

[ ]:

# set the old and new labels as variables, to avoid errors down the line
old_label = "old_label"
# set to None if you want to remove the label (later cells expect the variable to exist)
new_label = "new_label"

If you are using new_label to add a label that is not in the current schema, you need to add it now. Otherwise, skip the following cell.

[ ]:

# add any labels that were not present in the original settings
settings.label_schema.append(new_label)
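
The three options above can be told apart by comparing the two variables with the current schema. The following is a minimal sketch, not part of Argilla: `describe_change` is a hypothetical helper, and the toy `schema` list stands in for `settings.label_schema`.

```python
def describe_change(schema_labels, old_label, new_label):
    """Classify the planned change as 'delete', 'merge', or 'rename'."""
    if old_label not in schema_labels:
        raise ValueError(f"{old_label!r} is not in the current schema")
    if new_label == old_label:
        raise ValueError("old_label and new_label are identical")
    if new_label is None:
        return "delete"  # the label and its annotations are removed
    if new_label in schema_labels:
        return "merge"   # both labels already exist, nothing to append
    return "rename"      # new_label must be appended to the schema first


# toy schema standing in for settings.label_schema (assumed values)
schema = ["PER", "ORG", "LOCATION"]

print(describe_change(schema, "LOCATION", "LOC"))  # rename
print(describe_change(schema, "ORG", "PER"))       # merge
print(describe_change(schema, "ORG", None))        # delete
```

Only the "rename" case requires appending new_label to the schema before continuing.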

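Remove the unwanted label from the records#

Before you can change the dataset settings, you need to remove the label you want to delete from all annotations and predictions in the records; otherwise, you will get an error. To do this, first use a query to retrieve all the records that contain the label.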

[ ]:

# get all records with the old label in the annotations or predictions
records = rg.load(dataset_name, query=f"annotated_as:{old_label} OR predicted_as:{old_label}")
len(records)
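
If a label contains spaces or other special characters, the unquoted query above may not match it as intended. The sketch below builds a quoted version of the same query. It assumes the server accepts quoted terms, as Elasticsearch-style query strings generally do; `build_label_query` is a hypothetical helper, not an Argilla function.

```python
def build_label_query(label):
    """Build a query matching records annotated or predicted with `label`."""
    # escape backslashes and double quotes, then wrap the label in quotes
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return f'annotated_as:"{escaped}" OR predicted_as:"{escaped}"'


print(build_label_query("old label"))
# annotated_as:"old label" OR predicted_as:"old label"
```

The resulting string could then be passed as the `query` argument of `rg.load`.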

Now, you can clear every occurrence of the label from the annotations and predictions.

[ ]:

def cleaning_function(labels, old_label, new_label):
    """Replace old_label with new_label, or remove it if new_label is None."""

    # replaces / removes string labels (e.g. TextClassification)
    if isinstance(labels, str):
        if labels == old_label:
            labels = new_label

    elif isinstance(labels, list) and labels:
        # replaces / removes labels in a list (e.g. multi-label TextClassification)
        if isinstance(labels[0], str):
            if new_label is None:
                labels = [label for label in labels if label != old_label]
            else:
                labels = [new_label if label == old_label else label for label in labels]

        # replaces / removes labels in a list of tuples (e.g. Predictions, TokenClassification)
        elif isinstance(labels[0], tuple):
            if new_label is None:
                # build a new list instead of removing items while iterating
                labels = [label for label in labels if label[0] != old_label]
            else:
                # keep the rest of each tuple (score or span) and swap only the label
                labels = [(new_label, *label[1:]) if label[0] == old_label else label for label in labels]

    return labels

[ ]:

# loop over the records and make the correction in the predictions and annotations
for record in records:
    if record.prediction:
        record.prediction = cleaning_function(record.prediction, old_label, new_label)
    if record.annotation:
        record.annotation = cleaning_function(record.annotation, old_label, new_label)
        record.status = "Default"
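
Before logging the records back, it can be worth checking that no value still carries the old label. The following is a minimal sketch on toy data: `contains_label` is a hypothetical helper, and the sample values only mimic the shapes used for annotations and predictions (a string, a list of strings, or a list of tuples whose first item is the label).

```python
def contains_label(value, label):
    """Return True if `label` appears in an annotation or prediction value."""
    if isinstance(value, str):
        return value == label
    if isinstance(value, list):
        for item in value:
            if isinstance(item, str) and item == label:
                return True
            if isinstance(item, tuple) and item and item[0] == label:
                return True
    return False


# toy values in the shapes described above (assumed for illustration)
values = [
    "SPAM",                          # single-label annotation
    ["HAM", "SPAM"],                 # multi-label annotation
    [("SPAM", 0.9), ("HAM", 0.1)],   # text classification prediction
    [("PER", 0, 5)],                 # token classification annotation
    None,                            # record without annotation
]

leftovers = [value for value in values if contains_label(value, "SPAM")]
print(len(leftovers))  # 3
```

With real records, the same check could be applied to each `record.annotation` and `record.prediction` after the cleaning loop; an empty result suggests the label was removed everywhere.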

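Tip

If you are renaming a label to fix a typo, or removing a label from a Token Classification or multi-label Text Classification dataset, you can skip changing the record status to Default.

Warning

If you are replacing one label with another, it is highly recommended to change the status to Default so that you can double-check during annotation that the new label applies in every case. If you are removing a label from a single-label Text Classification dataset, you always need to set the record status to Default.

After modifying the records, log them back to the original dataset to save the changes.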

[ ]:

# log the corrected records
rg.log(records, name=dataset_name)

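Update the dataset settings#

Now that the label no longer appears in the records, you can modify the dataset settings, remove the unwanted label, and save the dataset's new configuration.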

[ ]:

# remove the unwanted label from the labelling schema
settings.label_schema.remove(old_label)

[ ]:

# change the configuration of the dataset
rg.configure_dataset_settings(name=dataset_name, settings=settings)

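The unwanted label should now be gone from the annotations, the predictions, and the dataset settings.

Summary#

In this tutorial, you learned how to remove or modify labels in a Token or Text Classification dataset that already has annotations. This notebook includes the code to rename a label, merge its annotations into another existing label, or remove the label entirely.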