📑 Making the Most of Markdown within Argilla TextFields#

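As you may have noticed, Argilla supports Markdown in its text fields. This means you can easily add formatting such as bold, italic, or highlighted text and links, and even insert HTML elements such as images, audio, video, and iframes. Let's dive in!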
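As a minimal sketch of the kind of content such a field can hold, the snippet below composes a string that mixes Markdown formatting with a raw HTML element. The helper name `build_markdown_field` and the example.com URLs are placeholders chosen for illustration; they are not part of Argilla.

```python
def build_markdown_field(title: str, body: str, link_url: str, image_url: str) -> str:
    """Compose a string that mixes Markdown formatting and inline HTML."""
    lines = [
        f"**{title}**",                          # bold
        "",
        f"*{body}*",                             # italic
        "",
        f"[Source]({link_url})",                 # link
        "",
        f'<img src="{image_url}" width="200">',  # raw HTML element
    ]
    return "\n".join(lines)


# Placeholder values, for illustration only
content = build_markdown_field(
    title="Peacock",
    body="A bird with colourful tail feathers.",
    link_url="https://example.com/peacock",
    image_url="https://example.com/peacock.jpg",
)
print(content)
```

A string like this can be passed as the value of a TextField configured with use_markdown=True.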

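In this notebook, we go over the basics of Markdown and how to use it in Argilla:

Exploiting displaCy for NER and relation extraction.
Exploring multi-modality: video, audio, and images.
Checking PDFs.

Let's get started!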

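Note: Multimedia support in Markdown is available but still experimental. At this early stage, file size is limited by ElasticSearch constraints, and rendering and loading times may vary with your browser. We are working on improving this and welcome your feedback and suggestions! 🌟🚀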

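Running Argilla#

For this tutorial, you need a running Argilla server. There are two main options for deploying and running Argilla:

Deploy Argilla on Hugging Face Spaces: If you want to run the tutorial from an external notebook (e.g., Google Colab) and you have a Hugging Face account, you can deploy Argilla on Spaces with a few clicks.

For details about configuring your deployment, check the official Hugging Face Hub guide.

Launch Argilla using Argilla's quickstart Docker image: This is the recommended option if you want to run Argilla on your local machine. Note that this option only lets you run the tutorial locally, not with an external notebook service.

For more information on deployment options, check the Deployment section of the documentation.

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

Use the "Open in Colab" button at the top of this page. This option lets you run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
Download the .ipynb file by clicking the "View source" link at the top of the page. This option lets you download the notebook and run it on your local machine or in a Jupyter Notebook tool of your choice.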

Setting up the environment#

To complete this tutorial, you need to install the Argilla client and a few third-party libraries using pip:

[ ]:

# %pip install --upgrade pip
%pip install argilla
%pip install datasets
%pip install spacy spacy-transformers
%pip install Pillow
%pip install span_marker
%pip install soundfile librosa
!python -m spacy download en_core_web_sm

Let's make the required imports:

[ ]:

import argilla as rg
from argilla.client.feedback.utils import audio_to_html, image_to_html, video_to_html, pdf_to_html

import re
import os
import pandas as pd
import span_marker
import tarfile
import glob
import subprocess
import random

import spacy
from spacy import displacy

from datasets import load_dataset

from huggingface_hub import hf_hub_download

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API_KEY:

[ ]:

# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
# Replace workspace with the name of your workspace
rg.init(
    api_url="https://:6900",
    api_key="owner.apikey",
    workspace="admin"
)

If you are running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:

[ ]:

# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"

# # Replace api_url with the url to your HF Spaces URL
# # Replace api_key if you configured a custom API key
# rg.init(
#     api_url="https://[your-owner-name]-[your_space_name].hf.space",
#     api_key="admin.apikey",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )
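The commented cell above reads the token straight from os.environ, which raises a bare KeyError when the variable is missing. As a small sketch, the hypothetical helper below (the name `build_hf_auth_headers` is an assumption, not an Argilla function) builds the same header and fails with a clearer message.

```python
import os


def build_hf_auth_headers(env_var: str = "HF_TOKEN") -> dict:
    """Return the Authorization header for a private Space, read from the environment."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set.")
    return {"Authorization": f"Bearer {token}"}


# Usage (hypothetical helper; pass the result to rg.init as in the cell above):
# rg.init(api_url=..., api_key=..., extra_headers=build_hf_auth_headers())
```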

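Exploiting `displaCy`#

spaCy is a well-known open-source library for Natural Language Processing (NLP). It provides a wide range of models for different languages and is easy to use. One of its features is displaCy (https://spacy.io/usage/visualizers), a visualizer for the output of NLP models. In this tutorial, we use it to visualize the output of an NER model.

Using `displaCy`#

First, we explain how displaCy works by loading the English spaCy pipeline (en_core_web_sm) while excluding its default NER component. We then use the add_pipe method to add a new span_marker component at the end of the pipeline in its place. This new component performs NER using the specified model.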

[ ]:

# Load the custom pipeline
nlp = spacy.load(
    "en_core_web_sm",
    exclude=["ner"]
)
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-bert-tiny-fewnerd-coarse-super"})

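Now you can see how the displacy.render function takes the Doc produced by the pipeline and renders it as HTML (with jupyter=False, it returns the HTML as a string). Two examples follow: the first shows the dependency tree of a sentence, and the second shows the entities found by the NER component.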

[3]:

# Show the dependency parse
doc = nlp("Rats are various medium-sized, long-tailed rodents.")
displacy.render(doc, style="dep")

[12]:

# Show the entity recognition
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
doc2 = nlp(text)
displacy.render(doc2, style="ent")

When Sebastian Thrun [person] started working on self-driving cars at Google [organization] in 2007, few people outside of the company took him seriously.

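You can also add custom highlights to a text using create_token_highlights with a custom color map. For example:

from argilla.client.feedback.utils import create_token_highlights

tokens = ["This", "is", "a", "test"]
weights = [0.1, 0.2, 0.3, 0.4]
# custom_RGB is not defined in this snippet; replace it with your own color map
html = create_token_highlights(tokens, weights, c_map=custom_RGB) # 'viridis' by default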
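To illustrate the idea behind weighted token highlighting, here is a minimal stdlib-only sketch. It is not Argilla's implementation of create_token_highlights: the function name `highlight_tokens`, the single-colour opacity scheme, and the assumption of non-negative weights are all choices made for this example.

```python
from html import escape


def highlight_tokens(tokens, weights, rgb=(255, 200, 0)):
    """Wrap each token in a <span> whose background opacity follows its weight."""
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    max_weight = max(weights) if weights else 0.0
    r, g, b = rgb
    spans = []
    for token, weight in zip(tokens, weights):
        # Scale the weight into [0, 1]; a zero maximum means no highlighting
        alpha = weight / max_weight if max_weight > 0 else 0.0
        style = f"background-color: rgba({r}, {g}, {b}, {alpha:.2f})"
        spans.append(f'<span style="{style}">{escape(token)}</span>')
    return " ".join(spans)


highlighted = highlight_tokens(["This", "is", "a", "test"], [0.1, 0.2, 0.3, 0.4])
print(highlighted)
```

The resulting string can be used as the value of a Markdown-enabled TextField, in the same way as the output of create_token_highlights.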

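Example: Creating a `FeedbackDataset` with displaCy output#

In this example, we show how to create an Argilla FeedbackDataset and add the displaCy output to it. This lets annotators check the model's accuracy and decide whether to correct the dependencies and/or the entities.

First, we configure the FeedbackDataset (see /practical_guides/create_dataset.html#configure-the-dataset). For the fields, we use three TextFields to show the plain text, the dependency tree, and the entities. For the questions, we add one LabelQuestion, one MultiLabelQuestion, and two TextQuestions.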

[2]:

# Create the FeedbackDataset configuration
dataset_spacy = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="text", required= True, use_markdown=True),
        rg.TextField(name="dependency-tree", required= True, use_markdown=True),
        rg.TextField(name="entities", required= True, use_markdown=True)
    ],
    questions=[
        rg.LabelQuestion(name="relevant", title="Is the text relevant?", labels=["Yes", "No"], required=True),
        rg.MultiLabelQuestion(name="question-multi", title="Mark which is correct", labels=["flag-pos", "flag-ner"], required=True),
        rg.TextQuestion(name="dependency-correction", title="Write the correct answer if needed", use_markdown=True),
        rg.TextQuestion(name="ner-correction", title="Write the correct answer if needed", use_markdown=True)
    ]
)
dataset_spacy

[2]:

FeedbackDataset(
    fields=[TextField(name='text', title='Text', required=True, type='text', use_markdown=True), TextField(name='dependency-tree', title='Dependency-tree', required=True, type='text', use_markdown=True), TextField(name='entities', title='Entities', required=True, type='text', use_markdown=True)]
    questions=[LabelQuestion(name='relevant', title='Is the text relevant?', description=None, required=True, type='label_selection', labels=['Yes', 'No'], visible_labels=None), MultiLabelQuestion(name='question-multi', title='Mark which is correct', description=None, required=True, type='multi_label_selection', labels=['flag-pos', 'flag-ner'], visible_labels=None), TextQuestion(name='dependency-correction', title='Write the correct answer if needed', description=None, required=True, type='text', use_markdown=True), TextQuestion(name='ner-correction', title='Write the correct answer if needed', description=None, required=True, type='text', use_markdown=True)]
    guidelines=None)
)

Now, we load the few-nerd dataset from Hugging Face. It contains sentences together with their NER tags. We use the first 20 training examples to show how to use displaCy in Argilla.

[ ]:

# Read the HF dataset
dataset_fewnerd = load_dataset("DFKI-SLT/few-nerd", "supervised", split="train[:20]")

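Next, we use this dataset to populate our Argilla FeedbackDataset. We render the displaCy output as HTML with the displacy.render function, setting jupyter=False, and add it to the FeedbackDataset together with the text. Finally, we add Markdown-formatted tables as suggestions to provide basic support for correcting the NER and dependency annotations.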

[15]:

# Load the custom pipeline
nlp = spacy.load(
    "en_core_web_sm",
    exclude=["ner"]
)
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-bert-tiny-fewnerd-coarse-super"})

# Read the dataset and run the pipeline
texts = [" ".join(x["tokens"]) for x in dataset_fewnerd]
docs = nlp.pipe(texts)

[ ]:

# Define the function to set the correct width and height of the SVG element
def wrap_in_max_width(html):
    html = html.replace("max-width: none;", "")

    # Remove existing fixed width and height attributes (e.g. width="750")
    html = re.sub(r"width=\"\d+\"", "", html)
    html = re.sub(r"height=\"\d+\"", "", html)

    # Find the SVG element in the HTML output
    svg_start = html.find("<svg")
    svg_end = html.find("</svg>") + len("</svg>")
    svg = html[svg_start:svg_end]

    # Set the width and height attributes of the SVG element to 100%
    svg = svg.replace("<svg", "<svg width='100%' height='100%'")

    # Wrap the SVG element in a div with max-width and horizontal scrolling
    return f"<div style='max-width: 100%; overflow-x: auto;'>{svg}</div>"
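To see what a helper of this kind does, the following sketch applies the same idea to a small hand-written SVG string. The function name `make_svg_scrollable` and the sample markup are illustrative assumptions; they are not part of Argilla or spaCy.

```python
import re


def make_svg_scrollable(html: str) -> str:
    """Strip fixed pixel sizes from an <svg> and wrap it in a scrollable <div>."""
    start = html.find("<svg")
    end = html.find("</svg>")
    if start == -1 or end == -1:
        return html  # no SVG element found, nothing to wrap
    svg = html[start:end + len("</svg>")]
    # Drop fixed width/height attributes such as width="750"
    svg = re.sub(r'\s(?:width|height)="\d+"', "", svg)
    # Let the SVG scale with its container instead
    svg = svg.replace("<svg", "<svg width='100%' height='100%'", 1)
    return f"<div style='max-width: 100%; overflow-x: auto;'>{svg}</div>"


sample = '<svg class="displacy" width="750" height="200"><text>Rats</text></svg>'
wrapped = make_svg_scrollable(sample)
print(wrapped)
```

The fixed pixel sizes are replaced by percentages, so the drawing follows the width of the field that contains it.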

[ ]:

# Add the records to the FeedbackDataset
records = []
for doc in docs:
    record = rg.FeedbackRecord(
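        fields={
            "text": doc.text,
            # Wrap the dependency SVG with the helper defined above so it fits the field width
            "dependency-tree": wrap_in_max_width(displacy.render(doc, style="dep", jupyter=False)),
            "entities": displacy.render(doc, style="ent", jupyter=False)
        },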
        suggestions=[
            {
                "question_name": "dependency-correction",
                "value": pd.DataFrame([{"Label": token.dep_, "Text": token.text} for token in doc]).to_markdown(index=False),
            },
            {
                "question_name": "ner-correction",
                "value": pd.DataFrame([{"Label": ent.label_, "Text": ent.text} for ent in doc.ents]).to_markdown(index=False),
            },
        ]
    )
    records.append(record)

dataset_spacy.add_records(records)
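The Markdown tables used as suggestions above are produced by pandas' DataFrame.to_markdown, which relies on the optional tabulate package. If tabulate is not available in your environment, a small stdlib-only helper can produce a similar table. The sketch below uses a hypothetical helper name, `rows_to_markdown`, and its output layout may differ slightly from what pandas produces.

```python
def rows_to_markdown(rows, columns):
    """Render a list of dicts as a Markdown table with the given column order."""

    def cell(value):
        # Escape pipes so they do not break the table layout
        return str(value).replace("|", "\\|")

    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(cell(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


rows = [
    {"Label": "person", "Text": "Sebastian Thrun"},
    {"Label": "organization", "Text": "Google"},
]
table = rows_to_markdown(rows, ["Label", "Text"])
print(table)
```

The returned string can be used as the "value" of a suggestion for a TextQuestion with use_markdown=True.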

[ ]:

# Push the dataset to Argilla
dataset_spacy = dataset_spacy.push_to_argilla(name="exploiting_displacy", workspace="admin")

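Exploring multi-modality: video, audio, and images#

As mentioned above, Argilla can display video, audio, and images in Markdown fields, provided they are formatted as HTML. To make this easier, we provide three functions: video_to_html, audio_to_html, and image_to_html. They take either a file path or the file's byte data and return the corresponding HTML to render the media file in the Argilla UI. You can also set the width and height in pixels or as a percentage for video and images (defaults to the original dimensions), and set the autoplay and loop attributes to True for audio and video (defaults to False).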
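One common way to turn raw media bytes into HTML is to embed them as a base64 data URL. The sketch below illustrates that technique for images. It is an illustration under stated assumptions, not Argilla's implementation of image_to_html: the function name `image_bytes_to_img_tag`, its parameters, and the placeholder bytes are all made up for this example.

```python
import base64
from typing import Optional


def image_bytes_to_img_tag(
    data: bytes,
    mime_type: str = "image/jpeg",
    width: Optional[str] = None,
    height: Optional[str] = None,
) -> str:
    """Embed raw image bytes in an <img> tag using a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    attrs = [f'src="data:{mime_type};base64,{encoded}"']
    if width is not None:
        attrs.append(f'width="{width}"')
    if height is not None:
        attrs.append(f'height="{height}"')
    return f"<img {' '.join(attrs)}>"


# Placeholder bytes standing in for the contents of a real image file
placeholder = b"placeholder-bytes"
html_snippet = image_bytes_to_img_tag(placeholder, width="50%", height="50%")
print(html_snippet)
```

Because the whole file ends up inside the field value, large media files produce very long strings, which is one reason file size matters here.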

We define our FeedbackDataset with a TextField to hold the media content, and add a question asking the user to describe the video, audio, or image file.

[4]:

# Configure the FeedbackDataset
ds_multi_modal = rg.FeedbackDataset(
    fields=[rg.TextField(name="content", use_markdown=True, required=True)],
    questions=[rg.TextQuestion(name="description", title="Describe the content of the media:", use_markdown=True, required=True)],
)
ds_multi_modal

[4]:

FeedbackDataset(
    fields=[TextField(name='content', title='Content', required=True, type='text', use_markdown=True)]
    questions=[TextQuestion(name='description', title='Describe the content of the media:', description=None, required=True, type='text', use_markdown=True)]
    guidelines=None)
)

We then build the records with the corresponding functions and add them to the FeedbackDataset with the add_records method.

[24]:

# Add the records
records = [
    rg.FeedbackRecord(fields={"content": video_to_html("/content/snapshot.mp4", autoplay=True)}),
    rg.FeedbackRecord(fields={"content": audio_to_html("/content/sea.wav", autoplay=True, loop=True)}),
    rg.FeedbackRecord(fields={"content": image_to_html("/content/peacock.jpg", width="50%", height="50%")}),
]
ds_multi_modal.add_records(records)

[ ]:

# Push the dataset to Argilla
ds_multi_modal = ds_multi_modal.push_to_argilla("multi-modal-basic", workspace="admin")

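Checking PDFs#

Argilla also supports PDFs. You can use the pdf_to_html function to add a PDF to a TextField, much as we did above. This function takes a file path, a URL, or the file's byte data and returns the corresponding HTML to render the PDF in the Argilla UI. Optionally, you can set the width and height in pixels or as a percentage (defaults to 1000px). The example below shows how to use this function.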

[ ]:

# In this case, we will use the URL of the PDF file
file_url = "https://arxiv.org/pdf/2310.06825.pdf"

# Configure the FeedbackDataset
ds_pdf = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="content", use_markdown=True, required=True),
    ],
    questions=[
        rg.TextQuestion(name="description", use_markdown=True, required=True)
    ],
)

# Push the dataset to Argilla
ds_pdf = ds_pdf.push_to_argilla(name='analyze_pdf_dataset', workspace='argilla')

# Add the records using pdf_to_html
records = [
    rg.FeedbackRecord(fields={"content": pdf_to_html(file_source=file_url, width="700px", height="700px")})]
ds_pdf.add_records(records)

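Tip

You can also view .docx, .pptx, or .xlsx files in the Argilla UI by setting use_markdown=True and embedding them, as long as they are available at a public URL. For example, you can upload them to your Google Drive and get a public URL by changing the sharing settings to "Anyone with the link can view".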

file_url = "your-sharable-link.xlsx"
# Quote the attribute values so that URLs containing special characters stay intact
html = f'<embed src="{file_url}" type="application/pdf" width="700px" height="700px"></embed>'

ds = rg.FeedbackDataset(...)

records = [
    rg.FeedbackRecord(fields={"xlsx_file": html})]
ds.add_records(records)

ds = ds.push_to_argilla(name='xlsx', workspace='argilla')
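Sharing links often contain characters such as & or quotes that can break a hand-written HTML attribute. As a hedged sketch, the hypothetical helper below (`embed_tag` is not an Argilla function, and the example.com URL is a placeholder) escapes each attribute value before building the tag.

```python
from html import escape


def embed_tag(
    url: str,
    mime_type: str = "application/pdf",
    width: str = "700px",
    height: str = "700px",
) -> str:
    """Build an <embed> tag with quoted, HTML-escaped attribute values."""
    attrs = {"src": url, "type": mime_type, "width": width, "height": height}
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<embed {rendered}>"


# Placeholder URL, for illustration only
tag = embed_tag("https://example.com/report.xlsx?a=1&b=2")
print(tag)
```

The returned string can be used as the field value in place of the hand-written f-string.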

Now that you know the Markdown tricks, it's your turn to create your own multi-modal FeedbackDataset! 🔥
