🤗 Fine-tune a NER model with BERT for beginners#
Are you a beginner? Do you want to learn, but don't know where to start? In this tutorial, you will learn how to fine-tune a pre-trained BERT model for Named Entity Recognition (NER). It will guide you through the following steps:
📌 Load your training dataset into Argilla and explore it using its tools.
⏳ Preprocess the data to generate the extra inputs the model needs, and put them into the format the model expects.
🔄 Download a BERT model and start fine-tuning it.
🧪 Run your own tests!
Introduction#
Our goal is to show you how to fine-tune a small BERT model to recognize NER tags, starting from a training dataset.
To do this, we will first connect to Argilla and log our dataset there, so that we can analyze it in a more visual way.
Next, we will preprocess our dataset and fine-tune the model. Here we will use DistilBERT, so that it is easier to understand and to start playing with the parameters; still, there are many similar models you can explore.
✨ Let's get started!
Running Argilla#
For this tutorial, you will need a running Argilla server. There are two main options for deploying and running Argilla:
Deploy Argilla on Hugging Face Spaces: this is the fastest option if you have an account on Hugging Face, and the recommended choice for connecting to external notebooks (e.g., Google Colab).
Launch Argilla with Argilla's quickstart Docker image: this is the recommended option if you want Argilla running on your local machine. Note that this option only lets you run the tutorial locally, not together with an external notebook service.
For more information on deployment options, please check the deployment section of the documentation.
🤯 Tip
This tutorial is a Jupyter Notebook. There are two ways to run it:
- Use the "Open in Colab" button at the top of this page. This option lets you run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the "View source" link at the top of the page. This option lets you download the notebook and run it on your local machine or on the Jupyter notebook tool of your choice.
Setup#
For this tutorial, you will need to install the Argilla client and a few third-party libraries using pip:
[1]:
%pip install "argilla[server]==1.5.0" -qqq
%pip install datasets
%pip install transformers
%pip install evaluate
%pip install seqeval
%pip install transformers[torch]
%pip install accelerate -U
Let's import the Argilla module for reading and writing data:
[2]:
import argilla as rg
If you are running Argilla with the Docker quickstart image or on Hugging Face Spaces, you need to initialize the Argilla client with the URL and API_KEY:
[3]:
# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
# Replace workspace with the name of your workspace
rg.init(
    api_url="http://localhost:6900",
api_key="owner.apikey",
workspace="admin"
)
If you are running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:
[ ]:
# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"
# # Replace api_url with the url to your HF Spaces URL
# # Replace api_key if you configured a custom API key
# # Replace workspace with the name of your workspace
# rg.init(
# api_url="https://[your-owner-name]-[your_space_name].hf.space",
# api_key="owner.apikey",
# workspace="admin",
# extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )
Finally, let's include the imports we need:
[4]:
import pandas as pd
import random
import evaluate
import transformers
import numpy as np
import torch
import pickle
from datasets import load_dataset, ClassLabel, Sequence
from argilla.metrics.token_classification import top_k_mentions
from argilla.metrics.token_classification.metrics import Annotations
from IPython.display import display, HTML
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification, pipeline
Enable telemetry#
We gain valuable insights from how you interact with our tutorials. To improve ourselves and offer you the most suitable content, using the following lines of code will help us understand whether this tutorial is serving you effectively. Though this is entirely anonymous, you can choose to skip this step if you prefer. For more info, please check out the Telemetry page.
[ ]:
try:
from argilla.utils.telemetry import tutorial_running
tutorial_running()
except ImportError:
print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")
🚀 Explore our dataset#
First, we will load the train split of our dataset from HuggingFace with load_dataset in order to explore it. As we can see, it has 119 entries and two columns: one with the sequence of tokens and the other with the sequence of NER tags.
[5]:
dataset = load_dataset("argilla/spacy_sm_wnut17", split="train")
[6]:
dataset
[6]:
Dataset({
features: ['tokens', 'ner_tags'],
num_rows: 119
})
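The ner_tags column stores integer ids rather than the tag strings themselves. As a quick optional check (a small sketch using the standard datasets features API), you can print the label names behind those ids:
[ ]:
# The integer ner_tags ids map to label names stored in the dataset's features
label_names = dataset.features["ner_tags"].feature.names
print(label_names)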
Next, we will use the following code to convert it, with the help of the Features of the DatasetDict, into the format Argilla needs for logging.
The three elements our data must have for Token Classification are the following:
text: the complete string.
tokens: the sequence of tokens.
annotation: a tuple formed by the label, the start position and the end position.
⚠️ Please note: every execution will upload and add your annotations again; they are not overwritten.
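For reference, a single record in this format might look like the following (an illustrative sketch; the label name, text and offsets are made-up values, not taken from the dataset):
[ ]:
# A hypothetical record with one annotated entity:
# "Madrid" labeled as "location", spanning characters 7-13 of the text
record = rg.TokenClassificationRecord(
    text="I love Madrid",
    tokens=["I", "love", "Madrid"],
    annotation=[("location", 7, 13)],
)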
[79]:
# Create a function to read the sequences
def parse_entities(record):
current_entity = None # to check if current entity in process
current_info = [] # to save the information used in the tuple for the whole sentence
char_position = 0
entities = [] # final list to save the tuples
# Iterate over the tokens and ner tags
for i in range(len(record["ner_tags"])):
token = record["tokens"][i]
ner_tag = dataset.features["ner_tags"].feature.names[record["ner_tags"][i]]
if ner_tag.startswith("B-"):
if current_entity:
current_info.append(current_entity)
current_entity = {"word": token, "start": char_position, "tag": ner_tag[2:]}
char_position += len(token) + 1
elif ner_tag.startswith("I-"):
if current_entity:
current_entity["word"] += " " + token
char_position += len(token) + 1
elif ner_tag == "O":
char_position += len(token) + 1
# Add the last entity if it exists
if current_entity:
current_info.append(current_entity)
# Calculate the end positions for each entity
for entity in current_info:
entity["end"] = entity["start"] + len(entity["word"])
for entity in current_info:
entities.append((entity["tag"], entity["start"], entity["end"]))
return entities
[ ]:
# Write a loop to iterate over each row of your dataset and add the text, tokens, and tuple
records = [
rg.TokenClassificationRecord(
text=" ".join(row["tokens"]),
tokens=row["tokens"],
annotation=parse_entities(row),
)
for row in dataset
]
# Log the records with the name of your choice
rg.log(records, "spacy_sm_wnut17")
Now you will be able to inspect your annotations in a more visual way, and even edit them where necessary.
In addition, Argilla has more options, such as extracting metrics, as shown below.
[ ]:
# Select the dataset from Argilla and visualize the data
top_k_mentions(
name="spacy_sm_wnut17", k=30, threshold=2, compute_for=Annotations
).visualize()
⏳ Preprocess the data#
Next, we will preprocess our data into the format the model needs so that it can work with it. In our case, we will reload it from HuggingFace, since we only loaded the train split into Argilla; however, preparing it from Argilla is also possible.
The following code lets us prepare our data with Argilla, which is especially useful for manual annotations, since it automatically adds B- (beginning) or I- (inside) to our NER tags depending on their position:
dataset = rg.load("dataset_name").prepare_for_training()
dataset = dataset.train_test_split()
🤯 Tip: In our case, we are working with a very small dataset, split into train and test sets. However, you may be working with another dataset that already has a validation split or, even if it is larger, you can create this split yourself with the following code:
dataset['train'], dataset['validation'] = dataset['train'].train_test_split(.1).values()
So, let's continue!
[ ]:
dataset = load_dataset("argilla/spacy_sm_wnut17")
print(dataset)
WARNING:datasets.builder:Found cached dataset parquet (/root/.cache/huggingface/datasets/argilla___parquet/argilla--spacy_sm_wnut17-1babd564207f27f8/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
DatasetDict({
train: Dataset({
features: ['tokens', 'ner_tags'],
num_rows: 119
})
test: Dataset({
features: ['tokens', 'ner_tags'],
num_rows: 30
})
})
Time to tokenize! Even though it may look like this step is already done, each token still needs to be converted into a vector (ID) that the model can read from its pre-trained vocabulary. To do this, we will use AutoTokenizer.from_pretrained with the FastTokenizer distilbert-base-uncased.
[ ]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)
[ ]:
# Example of original tokens
example = dataset["train"][0]
print(example["tokens"])
# Example after executing the AutoTokenizer
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)
['says', 'it', "'s", 'Saturday', '!', 'I', "'m", 'wearing', 'my', 'Weekend', '!', ':)']
['[CLS]', 'says', 'it', "'", 's', 'saturday', '!', 'i', "'", 'm', 'wearing', 'my', 'weekend', '!', ':', ')', '[SEP]']
However, we now face a new problem. Since it tokenizes against the pre-trained vocabulary, new pieces are created out of some words (e.g., "'s" becomes "'" and "s"), and two new special tokens, [CLS] and [SEP], are added. Therefore, we must realign the word IDs with their corresponding NER tags with the help of the word_ids method.
[ ]:
label_all_tokens = True
def tokenize_and_align_labels(examples):
tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
labels = []
for i, label in enumerate(examples["ner_tags"]):
word_ids = tokenized_inputs.word_ids(batch_index=i)
previous_word_idx = None
label_ids = []
for word_idx in word_ids:
# Special tokens have a word id that is None. We set the label to -100 so they are automatically
# ignored in the loss function.
if word_idx is None:
label_ids.append(-100)
# We set the label for the first token of each word.
elif word_idx != previous_word_idx:
label_ids.append(label[word_idx])
# For the other tokens in a word, we set the label to either the current label or -100, depending on
# the label_all_tokens flag.
else:
label_ids.append(label[word_idx] if label_all_tokens else -100)
previous_word_idx = word_idx
labels.append(label_ids)
tokenized_inputs["labels"] = labels
return tokenized_inputs
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/argilla___parquet/argilla--spacy_sm_wnut17-1babd564207f27f8/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7/cache-55b667584ffacf49.arrow
🔄 Fine-tune the model#
Now we should start preparing the parameters of our model, that is to say, it is time to start fine-tuning.
The model#
First, we will download our pre-trained model with AutoModelForTokenClassification, indicating the name of the model we want, the number of labels, and the correspondence between their IDs and names.
In addition, we will set up our DataCollator to form batches using our processed examples as input. In this case, we will use DataCollatorForTokenClassification.
[ ]:
# Create a dictionary with the ids and the relevant label.
label_list = dataset["train"].features["ner_tags"].feature.names
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {v: k for k, v in id2label.items()}
# Download the model.
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list), id2label=id2label, label2id=label2id)
# Set the DataCollator
data_collator = DataCollatorForTokenClassification(tokenizer)
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training arguments#
The TrainingArguments class holds the parameters to customize our training.
💡 Tip: If you are using HuggingFace, it may be easier to save your model directly there. To do so, use the following code and add the parameters below to your TrainingArguments:
from huggingface_hub import notebook_login
notebook_login()
# Add the following parameter
training_args = TrainingArguments(
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=True,
)
🕹️ Let's play! What is the best accuracy you can get?
[ ]:
training_args = TrainingArguments(
output_dir="ner-recognition",
learning_rate=2e-4,
per_device_train_batch_size=32,
per_device_eval_batch_size=32,
num_train_epochs=20,
weight_decay=0.05,
evaluation_strategy="epoch",
optim="adamw_torch",
    logging_steps=50,
)
Metrics#
To know how our training is going, we must, of course, use metrics. For this, we will use Seqeval and a function that computes precision, recall, F1 and accuracy from the true and the predicted labels.
[ ]:
# Load seqeval.
metric = evaluate.load("seqeval")
# Create the list with the tags.
labels = [label_list[i] for i in example["ner_tags"]]
# Function to compute precision, recall, F1 and accuracy.
def compute_metrics(p):
predictions, labels = p
predictions = np.argmax(predictions, axis=2)
true_predictions = [
[label_list[p] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
true_labels = [
[label_list[l] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
results = metric.compute(predictions=true_predictions, references=true_labels)
return {
"precision": results["overall_precision"],
"recall": results["overall_recall"],
"f1": results["overall_f1"],
"accuracy": results["overall_accuracy"],
}
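If you want to see what seqeval actually returns before training, you can run it on a tiny hand-made example (illustrative only; the tag names below are made up, and any B-/I- labeling scheme works):
[ ]:
# Compare a toy prediction against a toy reference with seqeval
toy_references = [["O", "B-person", "I-person", "O", "B-location"]]
toy_predictions = [["O", "B-person", "I-person", "O", "O"]]
print(metric.compute(predictions=toy_predictions, references=toy_references))
# Expect recall below 1.0, since the location entity was missed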
Time to train#
As the name suggests, it is now time to put all the previous elements together and start training with the Trainer.
[ ]:
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["test"],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
# Train.
trainer.train()
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
---|---|---|---|---|---|---|
1 | No log | 1.445835 | 0.000000 | 0.000000 | 0.000000 | 0.720751 |
2 | No log | 1.540381 | 0.000000 | 0.000000 | 0.000000 | 0.720751 |
3 | No log | 1.300941 | 0.000000 | 0.000000 | 0.000000 | 0.720751 |
4 | No log | 1.259119 | 0.000000 | 0.000000 | 0.000000 | 0.720751 |
5 | No log | 1.256542 | 0.444444 | 0.025478 | 0.048193 | 0.720751 |
6 | No log | 1.154050 | 0.202703 | 0.095541 | 0.129870 | 0.736203 |
7 | No log | 1.388463 | 0.254545 | 0.089172 | 0.132075 | 0.718543 |
8 | No log | 1.246235 | 0.275362 | 0.121019 | 0.168142 | 0.737307 |
9 | No log | 1.254787 | 0.202020 | 0.127389 | 0.156250 | 0.731788 |
10 | No log | 1.388549 | 0.272727 | 0.171975 | 0.210938 | 0.735099 |
11 | No log | 1.494627 | 0.297619 | 0.159236 | 0.207469 | 0.740618 |
12 | No log | 1.331303 | 0.232558 | 0.191083 | 0.209790 | 0.746137 |
13 | 0.675300 | 1.473191 | 0.252252 | 0.178344 | 0.208955 | 0.748344 |
14 | 0.675300 | 1.566783 | 0.275510 | 0.171975 | 0.211765 | 0.742826 |
15 | 0.675300 | 1.500171 | 0.252336 | 0.171975 | 0.204545 | 0.739514 |
16 | 0.675300 | 1.541946 | 0.274336 | 0.197452 | 0.229630 | 0.742826 |
17 | 0.675300 | 1.546347 | 0.258333 | 0.197452 | 0.223827 | 0.745033 |
18 | 0.675300 | 1.534100 | 0.271186 | 0.203822 | 0.232727 | 0.743929 |
19 | 0.675300 | 1.535095 | 0.277311 | 0.210191 | 0.239130 | 0.745033 |
20 | 0.675300 | 1.539303 | 0.277311 | 0.210191 | 0.239130 | 0.745033 |
/usr/local/lib/python3.10/dist-packages/seqeval/metrics/v1.py:57: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.10/dist-packages/seqeval/metrics/v1.py:57: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
TrainOutput(global_step=80, training_loss=0.45428856909275056, metrics={'train_runtime': 14.9864, 'train_samples_per_second': 158.811, 'train_steps_per_second': 5.338, 'total_flos': 32769159790410.0, 'train_loss': 0.45428856909275056, 'epoch': 20.0})
The evaluate method lets you evaluate again on the validation set or on another dataset (e.g., if you have train, validation and test sets).
[ ]:
trainer.evaluate()
{'eval_loss': 1.5393034219741821,
'eval_precision': 0.2773109243697479,
'eval_recall': 0.21019108280254778,
'eval_f1': 0.2391304347826087,
'eval_accuracy': 0.7450331125827815,
'eval_runtime': 0.0918,
'eval_samples_per_second': 326.934,
'eval_steps_per_second': 10.898,
'epoch': 20.0}
🔮 Try making predictions#
When you have created your model and are happy with it, test it yourself with your own text:
# Replace this with the directory where it was saved
model_checkpoint = "your-path"
token_classifier = pipeline("token-classification", model=model_checkpoint, aggregation_strategy="simple")
token_classifier("I heard Madrid is wonderful in spring.")
📝 Summary#
In this tutorial, we learned how to upload our training dataset to Argilla in order to visualize the data it contains and the NER tags it uses, and how to fine-tune a BERT model for NER recognition with transformers. This can be very useful to learn the basics of BERT-like models and, from there, to develop your skills further, trying different models that may produce better results.
💪 Cheers!