
🚀 Run Argilla with a Transformer in an active learning loop with a free GPU in your browser#

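In this tutorial, you will learn how to set up a complete active learning loop in Google Colab with a GPU in the backend. It is based on the small-text active learning tutorial. The main difference is that this tutorial is designed to run in a Google Colab notebook with a GPU backend, so that the active learning loop runs more efficiently with Transformer models. We recommend following this tutorial directly on Google Colab. You can open the Colab notebook via this hyperlink, create your own copy, and modify it for your own use case.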

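⚠️ Note that this notebook requires manual input to start Argilla in a terminal and to enter an ngrok token. Please read the instructions for each cell. If you do not follow the instructions and run everything in the correct order, the code will raise errors. If you do run into errors, restarting the runtime can resolve some of them. ⚠️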

🙋🏼‍♂️ This notebook was contributed by Moritz Laurer

Initial setup on Google Colab#

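In the Colab interface, you can choose either a CPU (for initial testing) or a GPU (for an efficient active learning loop) by clicking "Runtime" > "Change runtime type" > "Hardware accelerator" in the top-left menu. Once you have chosen your hardware, install the required packages.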

[ ]:
%pip install "argilla[server, listeners]==1.16.0"
%pip install "transformers[sentencepiece]~=4.25.1"
%pip install "datasets~=2.7.1"
%pip install "small-text[transformers]~=1.3.2"
%pip install "colab-xterm~=0.1.2"
%pip install "pyngrok~=5.2.1"
[ ]:
# info on the hardware you are using - either a CPU or GPU
!nvidia-smi
# info on available ram
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('\n\nYour runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

Enable Telemetry#

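We gain valuable insights from how you interact with our tutorials. To improve and offer you the most suitable content, running the following lines of code helps us understand whether this tutorial is serving you effectively. This is completely anonymous, and you can skip this step if you prefer. For more information, see the Telemetry page.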

[ ]:
try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")

Start Argilla localhost in a terminal#

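You now need to start the Argilla localhost in a separate terminal. We cannot simply run !argilla server start in a Colab code cell, because that cell would run indefinitely and block every other cell. We therefore need to open a separate terminal to run Argilla.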

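  1. Option with Colab Pro: open the Colab Pro terminal (the button in the bottom-left corner) and type: argilla server start

  2. Option without Colab Pro: run the following code cell to get a free terminal window inside the code cell using xterm. Then type argilla server start in that terminal window.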

[ ]:
# create a terminal to run Argilla with, in case you don't have Colab Pro.
# type "argilla server start" into the terminal that appears below this code cell.
%load_ext colabxterm
%xterm

The terminal window above should now show something like this:

“... INFO: Application startup complete.

INFO: Uvicorn running on http://0.0.0.0:6900 (Press CTRL+C to quit)”
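
Before moving on, it can help to confirm from a code cell that something is actually listening on the port shown in the log. The following is a minimal sketch using only the Python standard library. It assumes the server runs on port 6900 as in the log output above, and `is_port_open` is a hypothetical helper, not part of Argilla.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable
        return False


# Port 6900 is taken from the Uvicorn log line above (assumption: default setup)
if is_port_open("localhost", 6900):
    print("Something is listening on port 6900; Argilla appears to be up.")
else:
    print("Nothing is listening on port 6900 yet; check the terminal above.")
```

A successful connection only shows that the port is open, not that the Argilla application finished starting, so the log line in the terminal remains the authoritative signal.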

Log data to Argilla and start your active learning loop with small-text#

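If you click the public link above, you should be able to access Argilla, but no data has been logged to it yet. The following code downloads an example dataset and logs it to Argilla. You can change the code to download any other dataset you want to annotate. It follows the active learning with small-text tutorial and therefore contains fewer explanations.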

[ ]:
# load dataset
import datasets
dataset_name = "trec"
dataset_hf = datasets.load_dataset(dataset_name, version=datasets.Version("2.0.0"))
# we work with only a sixth of the texts of the dataset for faster testing
dataset_hf["train"] = dataset_hf["train"].shard(num_shards=6, index=0)
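
To make the sharding step concrete: with the `datasets` release pinned above, `shard(num_shards=6, index=0)` is assumed here to use its non-contiguous default, which keeps every sixth row starting at row 0 rather than the first sixth of the data. The sketch below mimics that behavior on a plain Python list. `shard_rows` is a hypothetical helper for illustration, not part of `datasets`.

```python
def shard_rows(rows, num_shards, index):
    """Keep every num_shards-th row, starting at position index."""
    return rows[index::num_shards]


# Toy stand-in for the training texts
toy_rows = [f"question {i}" for i in range(12)]

shard = shard_rows(toy_rows, num_shards=6, index=0)
print(shard)  # ['question 0', 'question 6']
```

Picking rows at a regular stride keeps the shard spread across the whole dataset, which matters if the rows are ordered by label or source.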

[ ]:
## choose the transformer and load tokenizer
import torch
from transformers import AutoTokenizer

# Choose transformer model: In non-gpu environments we use a tiny model to increase efficiency
if not torch.cuda.is_available():
    transformer_model = "prajjwal1/bert-tiny"
    print(f"No GPU is available, we therefore use the small model '{transformer_model}' for the active learning loop.\n")
else:
    transformer_model = "microsoft/deberta-v3-xsmall"  #"bert-base-uncased"
    print(f"A GPU is available, we can therefore use '{transformer_model}' for the active learning loop.\n")

# Init tokenizer
tokenizer = AutoTokenizer.from_pretrained(transformer_model)

[ ]:
## create small_text transformersdataset object
import numpy as np
from small_text import TransformersDataset

num_classes = dataset_hf["train"].features["coarse_label"].num_classes
target_labels = np.arange(num_classes)

train_text = [row["text"] for row in dataset_hf["train"]]
train_labels = np.array([row["coarse_label"] for row in dataset_hf["train"]])

# Create the dataset for small-text
dataset_st = TransformersDataset.from_arrays(
    train_text, train_labels, tokenizer, target_labels=target_labels
)

# Create test dataset
test_text = [row["text"] for row in dataset_hf["test"]]
test_labels = np.array([row["coarse_label"] for row in dataset_hf["test"]])

dataset_test = TransformersDataset.from_arrays(
    test_text, test_labels, tokenizer, target_labels=np.arange(num_classes)
)


[ ]:
## setting up the active learner
from small_text import (
    BreakingTies,
    PoolBasedActiveLearner,
    TransformerBasedClassificationFactory,
    TransformerModelArguments,
)

# Define our classifier
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device: ", device)

num_epochs = 5  # higher values of around 40 will probably improve performance on small datasets, but the active learning loop will take longer
clf_factory = TransformerBasedClassificationFactory(
    TransformerModelArguments(transformer_model),
    num_classes=num_classes,
    kwargs={"device": device, "num_epochs": num_epochs, "lr": 2e-05, "mini_batch_size": 8,
            "early_stopping_no_improvement": 5}
)


# Define our query strategy
query_strategy = BreakingTies()

# Use the active learner with a pool containing all unlabeled data
active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, dataset_st)
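
The BreakingTies strategy chosen above prefers pool examples where the classifier's two most probable classes are closest together. The following is a minimal sketch of that idea in plain Python. It is independent of small-text's actual implementation, and the probabilities are made up for illustration.

```python
def breaking_ties_margins(probabilities):
    """Margin between the two highest class probabilities of each example."""
    margins = []
    for row in probabilities:
        top_two = sorted(row, reverse=True)[:2]
        margins.append(top_two[0] - top_two[1])
    return margins


def query_smallest_margins(probabilities, num_samples):
    """Indices of the num_samples examples with the smallest margins."""
    margins = breaking_ties_margins(probabilities)
    ranked = sorted(range(len(margins)), key=lambda i: margins[i])
    return ranked[:num_samples]


# Hypothetical predicted class probabilities for four unlabeled examples
toy_probabilities = [
    [0.90, 0.05, 0.05],  # confident prediction, large margin
    [0.40, 0.35, 0.25],  # uncertain
    [0.50, 0.49, 0.01],  # nearly a tie between the top two classes
    [0.70, 0.20, 0.10],
]

print(query_smallest_margins(toy_probabilities, num_samples=2))  # [2, 1]
```

The examples with the smallest margin are the ones an annotator is asked to label next, since a label there is expected to move the decision boundary the most.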

[ ]:
## draw an initial sample for the first annotation round
# https://small-text.readthedocs.io/en/v1.1.1/components/initialization.html
from small_text import random_initialization, random_initialization_stratified, random_initialization_balanced
import numpy as np

# Fix seed for reproducibility
np.random.seed(42)

# Number of samples in our queried batches
NUM_SAMPLES = 10

# Draw an initial subset from the data pool
#initial_indices = random_initialization(dataset_st, NUM_SAMPLES)
#initial_indices = random_initialization_balanced(train_labels, NUM_SAMPLES)
initial_indices = random_initialization_stratified(train_labels, NUM_SAMPLES)
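
Stratified initialization draws the first batch so that its class proportions roughly match those of the full pool. The following is a minimal sketch of that idea in plain Python, using largest-remainder rounding to allocate slots per class. It is an illustration under that assumption and may differ from how small-text allocates samples. `stratified_sample_indices` is a hypothetical helper.

```python
import random
from collections import Counter


def stratified_sample_indices(labels, n_samples, seed=42):
    """Draw n_samples indices with class shares close to those in labels."""
    rng = random.Random(seed)
    counts = Counter(labels)
    total = len(labels)

    # Ideal (fractional) number of samples per class
    quotas = {c: n_samples * k / total for c, k in counts.items()}
    alloc = {c: int(q) for c, q in quotas.items()}

    # Hand out the remaining slots to the largest fractional remainders
    remaining = n_samples - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for c in by_remainder[:remaining]:
        alloc[c] += 1

    chosen = []
    for c, k in alloc.items():
        candidates = [i for i, y in enumerate(labels) if y == c]
        chosen.extend(rng.sample(candidates, k))
    return sorted(chosen)


# Toy label array: 60% class 0, 30% class 1, 10% class 2
toy_labels = [0] * 60 + [1] * 30 + [2] * 10
toy_indices = stratified_sample_indices(toy_labels, n_samples=10)
print(Counter(toy_labels[i] for i in toy_indices))
```

With only 10 samples per batch, a purely random draw can easily miss a rare class, which is the practical reason to prefer a stratified or balanced draw for the first round.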

[ ]:
### log the first data to Argilla
import argilla as rg

# Choose a name for the dataset
DATASET_NAME = f"{dataset_name}-with-active-learning"

# Define labeling schema
labels = dataset_hf["train"].features["coarse_label"].names
settings = rg.TextClassificationSettings(label_schema=labels)

# Create dataset with a label schema
rg.configure_dataset_settings(name=DATASET_NAME, settings=settings)

# Create records from the initial batch
records = [
    rg.TextClassificationRecord(
        text=dataset_hf["train"]["text"][idx],
        metadata={"batch_id": 0},
        id=idx.item(),
    )
    for idx in initial_indices
]

# Log initial records to Argilla
rg.log(records, DATASET_NAME)

[ ]:
### create active learning loop
from argilla.listeners import listener
from sklearn.metrics import accuracy_score

# Define some helper variables
LABEL2INT = dataset_hf["train"].features["coarse_label"].str2int
ACCURACIES = []

# Set up the active learning loop with the listener decorator
@listener(
    dataset=DATASET_NAME,
    query="status:Validated AND metadata.batch_id:{batch_id}",
    condition=lambda search: search.total == NUM_SAMPLES,
    execution_interval_in_seconds=3,
    batch_id=0,
)
def active_learning_loop(records, ctx):
    # 1. Update active learner
    print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
    y = np.array([LABEL2INT(rec.annotation) for rec in records])

    # initial update
    if ctx.query_params["batch_id"] == 0:
        indices = np.array([rec.id for rec in records])
        active_learner.initialize_data(indices, y)
    # update with the prior queried indices
    else:
        active_learner.update(y)
    print("Done!")

    # 2. Query active learner
    print("Querying new data points ...")
    queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
    new_batch = ctx.query_params["batch_id"] + 1
    new_records = [
        rg.TextClassificationRecord(
            text=dataset_hf["train"]["text"][idx],
            metadata={"batch_id": new_batch},
            id=idx.item(),
        )
        for idx in queried_indices
    ]

    # 3. Log the batch to Argilla
    rg.log(new_records, DATASET_NAME)

    # 4. Evaluate current classifier on the test set
    print("Evaluating current classifier ...")
    accuracy = accuracy_score(
        dataset_test.y,
        active_learner.classifier.predict(dataset_test),
    )

    ACCURACIES.append(accuracy)
    ctx.query_params["batch_id"] = new_batch
    print("Done!")

    print("Waiting for annotations ...")



active_learning_loop.start()
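
The listener above follows a polling pattern: every few seconds it runs the query, and only when the condition holds (here, all NUM_SAMPLES records of the current batch are validated) does it call the decorated function. The following is a simplified, self-contained sketch of that pattern without Argilla. `poll_until_ready` and the fake annotator are hypothetical, and unlike the real listener, which keeps polling for subsequent batches, this sketch stops after the first trigger.

```python
import time


def poll_until_ready(count_validated, on_batch_ready, num_samples,
                     interval_seconds=0.01, max_polls=100):
    """Poll count_validated() and call on_batch_ready() once it equals num_samples."""
    for _ in range(max_polls):
        if count_validated() == num_samples:
            on_batch_ready()
            return True
        time.sleep(interval_seconds)
    return False


# Fake annotator: one more record is validated between consecutive polls
validated = {"count": 0}


def fake_count():
    validated["count"] += 1
    return validated["count"]


events = []
triggered = poll_until_ready(
    fake_count,
    lambda: events.append("update learner and query next batch"),
    num_samples=10,
)
print(triggered, events)
```

Because the condition checks for equality with the batch size, the update only fires once an entire batch has been annotated, so partially annotated batches never reach the active learner.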

Extract annotated data for downstream use#

[ ]:
## https://docs.v1.argilla.com.cn/en/latest/getting_started/quickstart.html#Manual-extraction

# load your annotations
dataset_annotated = rg.load(DATASET_NAME)
# convert to Hugging Face format
dataset_annotated = dataset_annotated.prepare_for_training()
# now you can write your annotations to .csv, use them for training etc.
import pandas as pd  # pandas has not been imported earlier in this notebook
df_annotations = pd.DataFrame(dataset_annotated)
df_annotations.head()

Summary#

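In this tutorial, we saw how to embed Argilla in an active learning loop running on a GPU in Google Colab. We relied on small-text to use a Hugging Face transformer in the active learning setup. In the end, we gathered a sample-efficient dataset by annotating only the records that are most informative for the model.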

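Argilla makes it very easy to use a dedicated annotation team or subject matter experts as the oracle for your active learning system. They only interact with the Argilla UI and do not have to worry about training or querying the system. We encourage you to try active learning in your next project and make your life, and that of your annotators, a little easier.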