🚀 Run Argilla with a Transformer in an active learning loop and a free GPU in your browser

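In this tutorial, you will learn how to set up a complete active learning loop on Google Colab with a GPU backend. The tutorial is based on the small-text active learning tutorial. The main difference is that this one is designed to run in a Google Colab notebook with a GPU backend, which makes the active learning loop with Transformer models more efficient. We recommend following the tutorial directly on Google Colab. You can open the Colab notebook via this hyperlink, create your own copy, and adapt it to your own use case.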

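⚠️ Note that this notebook requires manual input to start Argilla in a terminal and to enter an ngrok token. Please read the instructions for each cell. If you do not follow the instructions and run everything in the correct order, the code will fail. If you run into an error, restarting the runtime can resolve some issues. ⚠️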

🙋🏼‍♂️ This notebook was contributed by Moritz Laurer

Initial setup on Google Colab

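In the Colab interface, you can choose a CPU (for initial testing) or a GPU (for an efficient active learning loop) by clicking "Runtime" > "Change runtime type" > "Hardware accelerator" in the top-left menu. Once you have selected your hardware, install the required packages.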

[ ]:

%pip install "argilla[server, listeners]==1.16.0"
%pip install "transformers[sentencepiece]~=4.25.1"
%pip install "datasets~=2.7.1"
%pip install "small-text[transformers]~=1.3.2"
%pip install "colab-xterm~=0.1.2"
%pip install "pyngrok~=5.2.1"

[ ]:

# info on the hardware you are using - either a CPU or GPU
!nvidia-smi
# info on available ram
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('\n\nYour runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

Enable telemetry

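We gain valuable insights from how you interact with our tutorials. To improve ourselves and offer you the most suitable content, running the following lines of code will help us understand whether this tutorial is serving you effectively. This is completely anonymous, but you can skip this step if you prefer. For more information, see the telemetry page.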

[ ]:

try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")

Install Elasticsearch

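Elasticsearch is a requirement for using Argilla. The Docker-based Elasticsearch installation that Argilla recommends does not work in Google Colab because Colab does not support Docker. Elasticsearch therefore needs to be installed "manually" with the following code.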

[ ]:

%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.10.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.2

[ ]:

%%bash --bg

sudo -u daemon -- elasticsearch-7.10.2/bin/elasticsearch

[ ]:

import time
time.sleep(30)  # give Elasticsearch time to start; otherwise downstream code will fail
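
A fixed 30-second sleep can be too short on a slow runtime and needlessly long on a fast one. As an alternative, the sketch below polls until the service answers. It assumes Elasticsearch listens on its default HTTP port 9200 on localhost; the helper names `http_ready` and `wait_until_ready` are hypothetical and are not part of Argilla or Elasticsearch.

```python
import time
import urllib.error
import urllib.request


def http_ready(url, timeout=2.0):
    """Return True if an HTTP GET to `url` succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_until_ready(probe, max_wait=120.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` repeatedly until it returns True or `max_wait` seconds pass.

    Returns True as soon as the probe succeeds, False on timeout.
    """
    deadline = clock() + max_wait
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example call (assumes the default Elasticsearch address):
# wait_until_ready(lambda: http_ready("http://localhost:9200"))
```

The probe is passed in as a function so that the same waiting logic can be reused for other services, for example the Argilla server started in the next step.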

Start the Argilla localhost in a terminal

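You now need to start the Argilla localhost in a separate terminal. We cannot simply run !argilla server start in a Colab code cell, because that cell would run indefinitely and block us from running any other cells. We therefore need to open a separate terminal to run Argilla.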

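Option with Colab Pro: open the Colab Pro terminal (button in the bottom left) and type into the terminal: argilla server start
Option without Colab Pro: run the following code cell to get a free terminal window inside the code cell via xterm, then type argilla server start into that terminal window.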

[ ]:

# create a terminal to run Argilla with, in case you don't have Colab Pro.
# type "argilla server start" into the terminal that appears below this code cell.
%load_ext colabxterm
%xterm

The terminal window above should now display something like the following:

“... INFO: Application startup complete.

INFO: Uvicorn running on http://0.0.0.0:6900 (Press CTRL+C to quit)”
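
If you want to confirm from a code cell that the server is accepting connections before creating the public link, a small TCP check is enough. This is a minimal sketch, assuming Argilla listens on port 6900 as in the sample output above; `port_is_open` is a hypothetical helper, not an Argilla function.

```python
import socket


def port_is_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing is listening (yet).
        return False


# Example call (assumes the port shown in the terminal output):
# port_is_open(6900)
```

If the check returns False, look at the terminal output again: the server may still be starting, or it may be running on a different port.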

Create a public link to the Argilla localhost with ngrok

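We now have a virtual machine from Google running the Argilla localhost, but we cannot access it yet. ngrok is a service designed to create public links to a localhost, so we can use it to create a public link to the Argilla localhost running on the Google machine. Note that anyone with this (temporary) public link can access the (temporary) localhost. To use ngrok, you need a free account, which takes only a minute to create by following the instructions here. With the free account, you receive an access token. Once you have the token, run the following cell and paste the token into the input prompt.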

[ ]:

import getpass
from pyngrok import ngrok, conf

print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth")
print("You need to create a free ngrok account to get an authtoken. The token looks something like this: ASDO1283YZaDu95vysXYIUXZXYRR_54YfASDIb8cpNfVoz349587")
conf.get_default().auth_token = getpass.getpass()
# if the above does not work, you can try:
#ngrok.set_auth_token("<INSERT_YOUR_NGROK_AUTHTOKEN>")

[ ]:

# disconnect all existing tunnels to avoid issues when rerunning cells
for tunnel in ngrok.get_tunnels():
    ngrok.disconnect(tunnel.public_url)

# create the public link
# ! check whether this is actually the localhost port Argilla is running on via the terminal above
ngrok_tunnel = ngrok.connect(6900)  # insert the port number Argilla is running on. e.g. 6900 if the terminal displays something like "Uvicorn running on http://0.0.0.0:6900"
print("You can now access the Argilla localhost with the public link below. (It should look something like 'http://X03b-34-XXX-237-25.ngrok.io')\n")
print(f"Your ngrok public link: {ngrok_tunnel}\n")
print("After clicking on the link, there will be a warning, which you can ignore")
print("You can then login with the default argilla username 'argilla' and password '1234'")

Log data to Argilla and start your active learning loop with small-text

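If you click the public link above, you should be able to access Argilla, but no data has been logged to Argilla yet. The following code downloads an example dataset and logs it to Argilla. You can change the code to download any other dataset you want to annotate. The code follows the active learning with small-text tutorial and therefore contains fewer explanations.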

[ ]:

# load dataset
import datasets
dataset_name = "trec"
dataset_hf = datasets.load_dataset(dataset_name, version=datasets.Version("2.0.0"))
# we work with only a sixth of the texts of the dataset for faster testing
dataset_hf["train"] = dataset_hf["train"].shard(num_shards=6, index=0)
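
A note on what the sharding call selects: with the datasets version pinned above, shard(num_shards=6, index=0) is, to our understanding, non-contiguous by default, so it keeps every sixth row starting at row 0 rather than the first sixth of the dataset. Newer datasets releases may default to contiguous shards, so check the documentation of your installed version. The pure-Python sketch below illustrates the non-contiguous behaviour on a toy list; the `shard` function here is a hypothetical stand-in, not the library implementation.

```python
def shard(rows, num_shards, index):
    """Non-contiguous sharding: keep rows index, index + num_shards, ..."""
    return rows[index::num_shards]


rows = list(range(10))  # toy stand-in for the dataset rows
first_shard = shard(rows, num_shards=3, index=0)
print(first_shard)  # [0, 3, 6, 9]
```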

[ ]:

## choose the transformer and load tokenizer
import torch
from transformers import AutoTokenizer

# Choose transformer model: In non-gpu environments we use a tiny model to increase efficiency
if not torch.cuda.is_available():
    transformer_model = "prajjwal1/bert-tiny"
    print(f"No GPU is available, we therefore use the small model '{transformer_model}' for the active learning loop.\n")
else:
    transformer_model = "microsoft/deberta-v3-xsmall"  #"bert-base-uncased"
    print(f"A GPU is available, we can therefore use '{transformer_model}' for the active learning loop.\n")

# Init tokenizer
tokenizer = AutoTokenizer.from_pretrained(transformer_model)

[ ]:

## create small_text transformersdataset object
import numpy as np
from small_text import TransformersDataset

num_classes = dataset_hf["train"].features["coarse_label"].num_classes
target_labels = np.arange(num_classes)

train_text = [row["text"] for row in dataset_hf["train"]]
train_labels = np.array([row["coarse_label"] for row in dataset_hf["train"]])

# Create the dataset for small-text
dataset_st = TransformersDataset.from_arrays(
    train_text, train_labels, tokenizer, target_labels=target_labels
)

# Create test dataset
test_text = [row["text"] for row in dataset_hf["test"]]
test_labels = np.array([row["coarse_label"] for row in dataset_hf["test"]])

dataset_test = TransformersDataset.from_arrays(
    test_text, test_labels, tokenizer, target_labels=np.arange(num_classes)
)

[ ]:

## setting up the active learner
from small_text import (
    BreakingTies,
    PoolBasedActiveLearner,
    TransformerBasedClassificationFactory,
    TransformerModelArguments,
)

# Define our classifier
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device: ", device)

num_epochs = 5  # higher values of around 40 will probably improve performance on small datasets, but the active learning loop will take longer
clf_factory = TransformerBasedClassificationFactory(
    TransformerModelArguments(transformer_model),
    num_classes=num_classes,
    kwargs={"device": device, "num_epochs": num_epochs, "lr": 2e-05, "mini_batch_size": 8,
            "early_stopping_no_improvement": 5}
)


# Define our query strategy
query_strategy = BreakingTies()

# Use the active learner with a pool containing all unlabeled data
active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, dataset_st)
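
The breaking-ties query strategy prefers examples where the classifier hesitates between its two most likely classes. The sketch below illustrates that idea in plain Python on made-up probabilities; it is not small-text's implementation, and `breaking_ties_query` is a hypothetical name.

```python
def breaking_ties_query(probabilities, num_samples):
    """Return the indices of the `num_samples` rows with the smallest margin
    between the highest and second-highest predicted class probability."""
    margins = []
    for idx, probs in enumerate(probabilities):
        top, second = sorted(probs, reverse=True)[:2]
        margins.append((top - second, idx))
    margins.sort()  # smallest margin (most uncertain) first
    return [idx for _, idx in margins[:num_samples]]


# Made-up class probabilities for four unlabeled texts and three classes.
predicted = [
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # close call
    [0.50, 0.30, 0.20],
    [0.34, 0.33, 0.33],  # closest call
]
selected = breaking_ties_query(predicted, num_samples=2)
print(selected)  # [3, 1]
```

Sending the closest calls to the annotator first is what makes the loop sample-efficient: labels are spent where the model is least decided.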

[ ]:

## draw an initial sample for the first annotation round
# https://small-text.readthedocs.io/en/v1.1.1/components/initialization.html
from small_text import random_initialization, random_initialization_stratified, random_initialization_balanced
import numpy as np

# Fix seed for reproducibility
np.random.seed(42)

# Number of samples in our queried batches
NUM_SAMPLES = 10

# Draw an initial subset from the data pool
#initial_indices = random_initialization(dataset_st, NUM_SAMPLES)
#initial_indices = random_initialization_balanced(train_labels, NUM_SAMPLES)
initial_indices = random_initialization_stratified(train_labels, NUM_SAMPLES)
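
Stratified initialization draws the first batch so that its class proportions roughly mirror those of the full label array. The sketch below shows one way to do this in plain Python with proportional quotas; it illustrates the idea only and is not how small-text implements random_initialization_stratified. The function name and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, n_samples, seed=42):
    """Pick `n_samples` indices so that each class is represented in
    proportion to its frequency in `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Proportional quota per class, rounded with the largest-remainder method.
    exact = {c: n_samples * len(ix) / len(labels) for c, ix in by_class.items()}
    quota = {c: int(v) for c, v in exact.items()}
    leftover = n_samples - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1

    chosen = []
    for c, ix in by_class.items():
        chosen.extend(rng.sample(ix, quota[c]))
    return sorted(chosen)


toy_labels = [0] * 60 + [1] * 30 + [2] * 10  # imbalanced toy labels
picked = stratified_sample(toy_labels, n_samples=10)
print(Counter(toy_labels[i] for i in picked))
```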

[ ]:

### log the first data to Argilla
import argilla as rg

# Choose a name for the dataset
DATASET_NAME = f"{dataset_name}-with-active-learning"

# Define labeling schema
labels = dataset_hf["train"].features["coarse_label"].names
settings = rg.TextClassificationSettings(label_schema=labels)

# Create dataset with a label schema
rg.configure_dataset_settings(name=DATASET_NAME, settings=settings)

# Create records from the initial batch
records = [
    rg.TextClassificationRecord(
        text=dataset_hf["train"]["text"][idx],
        metadata={"batch_id": 0},
        id=idx.item(),
    )
    for idx in initial_indices
]

# Log initial records to Argilla
rg.log(records, DATASET_NAME)

[ ]:

### create active learning loop
from argilla.listeners import listener
from sklearn.metrics import accuracy_score

# Define some helper variables
LABEL2INT = dataset_hf["train"].features["coarse_label"].str2int
ACCURACIES = []

# Set up the active learning loop with the listener decorator
@listener(
    dataset=DATASET_NAME,
    query="status:Validated AND metadata.batch_id:{batch_id}",
    condition=lambda search: search.total == NUM_SAMPLES,
    execution_interval_in_seconds=3,
    batch_id=0,
)
def active_learning_loop(records, ctx):
    # 1. Update active learner
    print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
    y = np.array([LABEL2INT(rec.annotation) for rec in records])

    # initial update
    if ctx.query_params["batch_id"] == 0:
        indices = np.array([rec.id for rec in records])
        active_learner.initialize_data(indices, y)
    # update with the prior queried indices
    else:
        active_learner.update(y)
    print("Done!")

    # 2. Query active learner
    print("Querying new data points ...")
    queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
    new_batch = ctx.query_params["batch_id"] + 1
    new_records = [
        rg.TextClassificationRecord(
            text=dataset_hf["train"]["text"][idx],
            metadata={"batch_id": new_batch},
            id=idx.item(),
        )
        for idx in queried_indices
    ]

    # 3. Log the batch to Argilla
    rg.log(new_records, DATASET_NAME)

    # 4. Evaluate current classifier on the test set
    print("Evaluating current classifier ...")
    accuracy = accuracy_score(
        dataset_test.y,
        active_learner.classifier.predict(dataset_test),
    )

    ACCURACIES.append(accuracy)
    ctx.query_params["batch_id"] = new_batch
    print("Done!")

    print("Waiting for annotations ...")



active_learning_loop.start()

Start annotating in your browser via the ngrok link

[ ]:

print("You can now start annotating with active learning in the background!")
print(f"The public link for accessing the annotation interface is: {ngrok_tunnel}")

You can now start annotating with active learning in the background!
The public link for accessing the annotation interface is: NgrokTunnel: "http://30b0-34-124-178-185.ngrok.io" -> "https://#:6900"

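After every 10 newly annotated texts, the active learner retrains and recommends a new batch of 10 texts. This means you need to manually annotate exactly 10 texts to receive new ones.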

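⚠️ Note that the active learner needs some time to retrain and to analyze all remaining data before it can recommend new texts. This can take several minutes. Refresh the Argilla window after a few minutes and a new batch of 10 texts should appear in the interface automatically. If it does not work right away, double-check that you really annotated all 10 new texts and wait a bit longer. ⚠️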
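
The reason for annotating exactly 10 texts is the listener's condition: the loop only fires once the number of validated records in the current batch equals the batch size. The toy simulation below mimics one polling step with plain dictionaries. It is a sketch of the triggering logic only, not Argilla's listener implementation, and `listener_tick` and the record fields are hypothetical.

```python
def listener_tick(records, batch_id, batch_size, on_complete):
    """Run one polling step and return the batch id to watch next.

    `on_complete` is only called when exactly `batch_size` records of the
    current batch are validated, mirroring `search.total == NUM_SAMPLES`.
    """
    validated = [
        rec for rec in records
        if rec["status"] == "Validated" and rec["batch_id"] == batch_id
    ]
    if len(validated) != batch_size:
        return batch_id  # batch incomplete: keep waiting
    on_complete(validated)
    return batch_id + 1


# Toy pool: a batch of 3 records, of which only 2 are annotated so far.
records = [
    {"id": 0, "batch_id": 0, "status": "Validated"},
    {"id": 1, "batch_id": 0, "status": "Validated"},
    {"id": 2, "batch_id": 0, "status": "Default"},
]
completed = []

current = listener_tick(records, batch_id=0, batch_size=3, on_complete=completed.append)
print(current)  # 0, because one record is still unannotated

records[2]["status"] = "Validated"
current = listener_tick(records, batch_id=0, batch_size=3, on_complete=completed.append)
print(current)  # 1, the full batch triggered the update
```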

[ ]:

# when you are done, stop active learning loop
active_learning_loop.stop()

[ ]:

# plot learning progress over different active learning iterations
import pandas as pd
pd.Series(ACCURACIES).plot(xlabel="Iteration", ylabel="Accuracy")

Extract the annotated data for downstream use

[ ]:

## https://docs.v1.argilla.com.cn/en/latest/getting_started/quickstart.html#Manual-extraction

# load your annotations
dataset_annotated = rg.load(DATASET_NAME)
# convert to Hugging Face format
dataset_annotated = dataset_annotated.prepare_for_training()
# now you can write your annotations to .csv, use them for training etc.
df_annotations = pd.DataFrame(dataset_annotated)
df_annotations.head()

Summary

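In this tutorial, we saw how to embed Argilla in an active learning loop on a GPU in Google Colab. We relied on small-text to use a Hugging Face transformer within an active learning setup. In the end, we gathered a sample-efficient dataset by annotating only the records that were most informative for the model.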

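Argilla makes it very easy to use a dedicated annotation team or subject matter experts as the oracle for your active learning system. They only interact with the Argilla UI and do not have to worry about training or querying the system. We encourage you to try out active learning in your next project and make your life, and that of your annotators, a little easier.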