Summary of the tasks

지닝 2021. 10. 26. 15:41

🔗 Docs >> Using Transformers >> Summary of the tasks

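This page shows the most frequent use cases when using the library. The models in Hugging Face Transformers support many different configurations and use cases. The simplest ones are presented here: tasks such as question answering, sequence classification, and named entity recognition.

These examples leverage auto-models, which are classes that instantiate a model according to a given checkpoint and automatically select the correct model architecture. See the AutoModel documentation for more information. Feel free to modify the code to be more specific and adapt it to your own use case.

For a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These checkpoints are usually pretrained on a large corpus of data and then fine-tuned on a specific task. This means the following:

Not all models were fine-tuned on all tasks. To fine-tune a model on a specific task, you can leverage one of the run_$TASK.py scripts in the examples directory.
Fine-tuned models were fine-tuned on a specific dataset. That dataset may or may not overlap with your use case and domain. As mentioned above, you can either leverage the example scripts to fine-tune your model or write your own training script.

To run inference on a task, the library provides several mechanisms:

Pipelines: very easy-to-use abstractions that require as little as two lines of code.
Direct model use: less abstraction, but more flexibility and power through direct access to a tokenizer (PyTorch/TensorFlow).

Both approaches are showcased here.

💛 Note
All tasks presented here leverage pretrained checkpoints that were fine-tuned on specific tasks. Loading a checkpoint that was not fine-tuned on a specific task loads only the base transformer layers and not the additional head used for the task, so the weights of that head are initialized randomly. This produces random output.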

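Sequence Classification

Sequence classification is the task of classifying sequences according to a given number of classes. An example is the GLUE dataset, which is entirely based on that task. To fine-tune a model on a GLUE sequence classification task, you can leverage the run_glue.py, run_tf_glue.py, run_tf_classification.py, or run_xnli.py scripts.

Here is an example of using a pipeline to do sentiment analysis: identifying whether a sequence is positive or negative. It leverages a model fine-tuned on sst2, which is a GLUE task.

This returns a label (POSITIVE or NEGATIVE) alongside a score, as follows: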

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

result = classifier("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

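Here is an example of doing sequence classification using a model to determine whether two sequences are paraphrases of each other. The process is the following:

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loaded with the weights stored in the checkpoint.
Build a sequence from the two sentences, with the correct model-specific separators, token type IDs, and attention masks (created automatically by the tokenizer).
Pass this sequence through the model so that it is classified into one of the two available classes: 0 (not a paraphrase) and 1 (a paraphrase).
Compute the softmax of the result to get probabilities over the classes.
Print the results.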
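The softmax step in the list above can be sketched without any framework. This is a minimal illustration that uses only the Python standard library; the logits are made-up values, not output from the model used in this post.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["not paraphrase", "is paraphrase"]
# Hypothetical logits for one sentence pair (illustrative values only).
example_logits = [-1.0, 1.0]

probabilities = softmax(example_logits)
for name, probability in zip(classes, probabilities):
    print(f"{name}: {round(probability * 100)}%")
```

The class with the larger logit always receives the larger probability, so softmax changes the scale of the scores but not their ranking.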

# Pytorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer automatically adds the model-specific special tokens (for BERT, [CLS] and [SEP]) to
# the sequence and computes the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
"""
not paraphrase: 10%
is paraphrase: 90%
"""

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
"""
not paraphrase: 94%
is paraphrase: 6%
"""

# Tensorflow
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer automatically adds the model-specific special tokens (for BERT, [CLS] and [SEP]) to
# the sequence and computes the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")

paraphrase_classification_logits = model(paraphrase).logits
not_paraphrase_classification_logits = model(not_paraphrase).logits

paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
"""
not paraphrase: 10%
is paraphrase: 90%
"""

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
"""
not paraphrase: 94%
is paraphrase: 6%
"""

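Extractive Question Answering

Extractive question answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. To fine-tune a model on a SQuAD task, you can leverage the run_qa.py and run_tf_squad.py scripts.

Here is an example of using a pipeline to do question answering: extracting an answer from a text given a question. It leverages a model fine-tuned on SQuAD.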

from transformers import pipeline

question_answerer = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

This returns an answer extracted from the text and a confidence score, alongside 'start' and 'end' values, which are the positions of the extracted answer in the text.

result = question_answerer(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

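Here is an example of question answering using a model and a tokenizer. The process is the following:

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loaded with the weights stored in the checkpoint.
Define a text and a few questions.
Iterate over the questions and build a sequence from the text and the current question, with the correct model-specific separators, token type IDs, and attention masks.
Pass this sequence through the model. This outputs a range of scores across the entire sequence of tokens (question and text), for both the start and end positions.
To get probabilities over the tokens, compute the softmax of the result. The code in this section skips this and takes the argmax of the raw scores, which selects the same positions.
Fetch the tokens from the identified start and end positions and convert them to a string.
Print the results.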
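The span-selection steps above can be sketched on plain Python lists. The tokens and scores below are hypothetical stand-ins for the tokenizer output and the model's start and end logits; the guard for an end position that precedes the start is an addition of this sketch and is not part of the code in this post.

```python
def extract_answer(tokens, start_scores, end_scores):
    """Pick the highest-scoring start and end positions and join the tokens between them."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(len(end_scores)), key=end_scores.__getitem__)
    # Independent argmaxes can disagree; treat an inverted span as "no answer".
    if end < start:
        return ""
    return " ".join(tokens[start:end + 1])


# Hypothetical tokens and scores (illustrative values only).
tokens = ["[CLS]", "which", "frameworks", "?", "[SEP]", "tensorflow", "and", "pytorch", "[SEP]"]
start_scores = [0.1, 0.0, 0.2, 0.0, 0.1, 4.0, 0.5, 1.0, 0.1]
end_scores = [0.1, 0.0, 0.1, 0.0, 0.1, 0.8, 0.3, 5.0, 0.2]

answer = extract_answer(tokens, start_scores, end_scores)
print(answer)  # tensorflow and pytorch
```

The slice ends at end + 1 because the end position is inclusive, which mirrors the "+ 1" applied to answer_end in the code of this section.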

# Pytorch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}")

"""
Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
"""

# Tensorflow
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
    input_ids = inputs["input_ids"].numpy()[0]
    outputs = model(inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    # Get the most likely beginning of answer with the argmax of the score
    answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
    # Get the most likely end of answer with the argmax of the score
    answer_end = tf.argmax(answer_end_scores, axis=1).numpy()[0] + 1
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}")

"""
Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
"""

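Language Modeling

Language modeling is the task of fitting a model to a corpus, which can be domain-specific. All popular transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling and GPT-2 with causal language modeling.

Language modeling can also be useful outside of pretraining, for example to shift the model distribution to be domain-specific: taking a language model trained on a very large corpus and then fine-tuning it on a news dataset or on scientific papers (e.g. LysandreJik/arxiv-nlp).

Masked Language Modeling

Masked language modeling is the task of masking tokens in a sequence with a masking token and prompting the model to fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens to the right of the mask) and the left context (tokens to the left of the mask). Such training creates a strong basis for downstream tasks that require bidirectional context, such as SQuAD (question answering; see Lewis, Lui, Goyal et al., part 4.2). To fine-tune a model on a masked language modeling task, you can leverage the run_mlm.py script.

Here is an example of using a pipeline to replace a mask in a sequence: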

from transformers import pipeline

unmasker = pipeline("fill-mask")

This outputs the sequences with the mask filled, the confidence score, and the token ID in the tokenizer vocabulary:

from pprint import pprint
pprint(unmasker(f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."))
[{'score': 0.1793,
  'sequence': 'HuggingFace is creating a tool that the community uses to solve '
              'NLP tasks.',
  'token': 3944,
  'token_str': ' tool'},
 {'score': 0.1135,
  'sequence': 'HuggingFace is creating a framework that the community uses to '
              'solve NLP tasks.',
  'token': 7208,
  'token_str': ' framework'},
 {'score': 0.0524,
  'sequence': 'HuggingFace is creating a library that the community uses to '
              'solve NLP tasks.',
  'token': 5560,
  'token_str': ' library'},
 {'score': 0.0349,
  'sequence': 'HuggingFace is creating a database that the community uses to '
              'solve NLP tasks.',
  'token': 8503,
  'token_str': ' database'},
 {'score': 0.0286,
  'sequence': 'HuggingFace is creating a prototype that the community uses to '
              'solve NLP tasks.',
  'token': 17715,
  'token_str': ' prototype'}]

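Here is an example of doing masked language modeling using a model and a tokenizer. The process is the following:

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and loaded with the weights stored in the checkpoint.
Define a sequence with a masked token, placing tokenizer.mask_token instead of a word.
Encode that sequence into a list of IDs and find the position of the masked token in that list.
Retrieve the predictions at the index of the mask token. This tensor has the same size as the vocabulary, and its values are the scores attributed to each token. The model gives a higher score to tokens it deems probable in that context.
Retrieve the top 5 tokens using the PyTorch topk or TensorFlow top_k methods.
Replace the mask token with each of these tokens and print the results.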

# Pytorch
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
    f"versions would help {tokenizer.mask_token} our carbon footprint."

inputs = tokenizer(sequence, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
"""
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
"""

# Tensorflow
from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFAutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
    f"versions would help {tokenizer.mask_token} our carbon footprint."

inputs = tokenizer(sequence, return_tensors="tf")
mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]

token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
"""
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
"""

This prints five sequences, each with the mask replaced by one of the top 5 tokens predicted by the model.
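The "take the top 5 scores, then substitute each token for the mask" logic can be sketched in plain Python. The tiny vocabulary, the scores, and the shortened sentence are hypothetical; a real model scores every token in its vocabulary.

```python
def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


MASK = "[MASK]"
# Hypothetical vocabulary and mask-position scores (illustrative values only).
vocab = ["reduce", "increase", "banana", "offset", "the", "improve", "decrease"]
mask_scores = [9.1, 7.4, -3.0, 6.2, 0.5, 5.9, 7.0]
template = f"Using distilled models would help {MASK} our carbon footprint."

# One filled-in sentence per candidate token, most likely first.
filled = [template.replace(MASK, vocab[i]) for i in top_k_indices(mask_scores, 5)]
for sentence in filled:
    print(sentence)
```

Sorting the full score list is fine for a toy vocabulary; the framework topk functions named above do the same selection on tensors.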

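Causal Language Modeling

Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the model only attends to the left context (tokens to the left of the mask). Such training is particularly interesting for generation tasks. To fine-tune a model on a causal language modeling task, you can leverage the run_clm.py script.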
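The difference between left-only and bidirectional context can be pictured as a visibility matrix. This is a conceptual sketch only; the function names are hypothetical and it does not describe how any particular model implements the restriction.

```python
def causal_mask(n):
    """Row i marks with 1 the positions token i may look at: itself and everything to its left."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may look at every position, as in masked language modeling."""
    return [[1] * n for _ in range(n)]


tokens = ["Hugging", "Face", "is", "based"]
for token, row in zip(tokens, causal_mask(len(tokens))):
    print(f"{token:>8}: {row}")
```

Each row of the causal matrix has ones up to and including its own position, so no token can see what comes after it.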

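Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the input sequence.

Here is an example of using the tokenizer and model and leveraging the top_k_top_p_filtering() method to sample the next token following an input sequence of tokens.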

# Pytorch

from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

sequence = f"Hugging Face is based in DUMBO, New York City, and"

inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]

# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]

# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
"""
Hugging Face is based in DUMBO, New York City, and ...
"""

# Tensorflow

from transformers import TFAutoModelForCausalLM, AutoTokenizer, tf_top_k_top_p_filtering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")

sequence = f"Hugging Face is based in DUMBO, New York City, and"

inputs = tokenizer(sequence, return_tensors="tf")
input_ids = inputs["input_ids"]

# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]

# filter
filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)

generated = tf.concat([input_ids, next_token], axis=1)

resulting_string = tokenizer.decode(generated.numpy().tolist()[0])
print(resulting_string)

"""
Hugging Face is based in DUMBO, New York City, and ...
"""

This outputs a (hopefully) coherent next token following the original sequence, which in this example is typically a word such as "is" or "features".
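The filter-then-sample idea can be sketched with the standard library. This is a simplified re-implementation written for illustration under my own assumptions, not the library's top_k_top_p_filtering() code; the candidate tokens and logits are hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn logits into probabilities; a logit of -inf gets probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(logits, top_k=0, top_p=1.0):
    """Set logits outside the top-k set or outside the top-p nucleus to -inf."""
    filtered = list(logits)
    # Indices sorted from the highest logit to the lowest.
    order = sorted(range(len(filtered)), key=lambda i: filtered[i], reverse=True)
    if top_k > 0:
        for i in order[top_k:]:
            filtered[i] = -math.inf
    if top_p < 1.0:
        probs = softmax(filtered)
        cumulative = 0.0
        for i in order:
            # Drop a token once the tokens ranked above it already cover top_p.
            if cumulative >= top_p:
                filtered[i] = -math.inf
            cumulative += probs[i]
    return filtered


def sample_next(logits, rng):
    """Draw one index at random, weighted by the softmax of the logits."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical candidate tokens and logits (illustrative values only).
candidates = [" is", " features", " the", " banana", " was"]
logits = [3.0, 2.5, 1.0, -2.0, 0.5]

filtered = filter_top_k_top_p(logits, top_k=3, top_p=1.0)
next_index = sample_next(filtered, random.Random(0))
print(candidates[next_index])
```

With top_k=3 only the three highest-scoring candidates can ever be drawn, because the filtered-out tokens end up with probability zero.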

The next section shows how generate() can be used to generate multiple tokens up to a specified length instead of one token at a time.

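Text Generation

In text generation (also known as open-ended text generation), the goal is to create a coherent portion of text that continues from the given context. The following example shows how GPT-2 can be used in pipelines to generate text. By default, all models apply Top-K sampling when used in pipelines, as configured in their respective configurations (see the gpt-2 config for example).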

from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))

"""
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a
"free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
"""

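Here, the model generates text with a total maximal length of 50 tokens from the context "As far as I am concerned, I will". Behind the scenes, the pipeline object calls the generate() method to generate text. The default arguments of this method can be overridden in the pipeline, as shown above for the max_length and do_sample arguments.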
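What a generation loop does with max_length and without sampling can be sketched with a toy lookup table in place of a language model. Everything here is hypothetical (the table, the function name, the end-of-sequence marker); it only illustrates repeatedly appending the single best next token until a total length limit is reached.

```python
# Hypothetical next-token scores standing in for a language model.
NEXT_TOKEN_SCORES = {
    "I": {"will": 2.0, "am": 1.5},
    "will": {"be": 3.0, "go": 1.0},
    "be": {"there": 2.0, "<eos>": 0.5},
    "there": {"<eos>": 4.0},
}


def greedy_generate(prompt_tokens, max_length, eos_token="<eos>"):
    """Append the highest-scoring next token until max_length tokens or eos is reached."""
    tokens = list(prompt_tokens)
    # max_length counts the prompt tokens as well as the generated ones.
    while len(tokens) < max_length:
        candidates = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not candidates:
            break
        next_token = max(candidates, key=candidates.get)
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens


print(greedy_generate(["I"], max_length=3))   # ['I', 'will', 'be']
print(greedy_generate(["I"], max_length=50))  # ['I', 'will', 'be', 'there']
```

Because the best token is always chosen, the same prompt always yields the same continuation; sampling replaces the max() step with a random draw.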

Below is an example of text generation using XLNet and its tokenizer, which calls the generate() method directly.

# Pytorch

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Padding text helps XLNet with short prompts - proposed by Aman Rusia in <https://github.com/rusiaaman/XLNet-gen#methodology>
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing.   """

prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]

print(generated)

"""
Today the weather is really nice and I am planning ...
"""

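Text generation is currently possible with GPT-2, OpenAI-GPT, CTRL, XLNet, Transformer-XL, and Reformer in PyTorch, and for most models in TensorFlow as well. As the example above shows, XLNet and Transformer-XL often need to be padded to work well. GPT-2 is usually a good choice for open-ended text generation because it was trained on millions of web pages with a causal language modeling objective.

For more information on how to apply different decoding strategies for text generation, please refer to the text generation blog post.

Named Entity Recognition

Named entity recognition (NER) is the task of classifying tokens according to a class, for example identifying a token as a person, an organisation, or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset. To fine-tune a model on an NER task, you can leverage the run_ner.py script.

Here is an example of using a pipeline to do named entity recognition, specifically trying to identify tokens as belonging to one of 9 classes (BIO representation):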

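O, outside of a named entity
B-MISC, beginning of a miscellaneous entity right after another miscellaneous entity
I-MISC, miscellaneous entity
B-PER, beginning of a person's name right after another person's name
I-PER, person's name
B-ORG, beginning of an organisation right after another organisation
I-ORG, organisation
B-LOC, beginning of a location right after another location
I-LOC, location

It leverages a model fine-tuned on CoNLL-2003; the fine-tuning was done by @stefan-it from dbmdz.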

from transformers import pipeline

ner_pipe = pipeline("ner")

sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""

This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above. Here are the expected results:

for entity in ner_pipe(sequence):
    print(entity)
"""
{'entity': 'I-ORG', 'score': 0.9996, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9910, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9995, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.9863, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9337, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9762, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9915, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
"""

Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City", "DUMBO" and "Manhattan Bridge" have been identified as locations.
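The pipeline reports one entry per token, so "Hugging Face Inc" arrives as four pieces. The sketch below shows one possible way to merge such pieces back into whole entities. The helper name and the merging rule (same entity type and consecutive token index) are assumptions of this sketch; the input reuses the entity, index, and word fields from the output above.

```python
def group_entities(tokens):
    """Merge word pieces and adjacent tokens that share an entity type."""
    groups = []
    for tok in tokens:
        entity_type = tok["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        is_piece = tok["word"].startswith("##")
        last = groups[-1] if groups else None
        if last and last["type"] == entity_type and tok["index"] == last["last_index"] + 1:
            # Word pieces attach directly; whole words are separated by a space.
            last["text"] += tok["word"][2:] if is_piece else " " + tok["word"]
            last["last_index"] = tok["index"]
        else:
            groups.append({"type": entity_type, "text": tok["word"], "last_index": tok["index"]})
    return [(g["type"], g["text"]) for g in groups]


# Fields copied from the pipeline output shown above (scores and offsets omitted).
raw = [
    ("I-ORG", 1, "Hu"), ("I-ORG", 2, "##gging"), ("I-ORG", 3, "Face"), ("I-ORG", 4, "Inc"),
    ("I-LOC", 11, "New"), ("I-LOC", 12, "York"), ("I-LOC", 13, "City"),
    ("I-LOC", 19, "D"), ("I-LOC", 20, "##UM"), ("I-LOC", 21, "##BO"),
    ("I-LOC", 28, "Manhattan"), ("I-LOC", 29, "Bridge"),
]
pipeline_output = [{"entity": e, "index": i, "word": w} for e, i, w in raw]

entities = group_entities(pipeline_output)
for entity_type, text in entities:
    print(entity_type, text)
```

The index check is what keeps "New York City" and "DUMBO" apart: both are locations, but their token indices are not consecutive.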

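Here is an example of doing named entity recognition using a model and a tokenizer. The process is the following:

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loaded with the weights stored in the checkpoint.
Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
Split words into tokens so that they can be mapped to predictions. The code in this section does this with inputs.tokens(), which returns the token strings including the special tokens.
Encode that sequence into IDs (special tokens are added automatically).
Retrieve the predictions by passing the input to the model and getting the first output. This results in a score for each of the 9 possible classes for each token. Take the argmax to retrieve the most likely class for each token.
Zip together each token with its prediction and print it.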
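The last two steps (argmax per token, then pairing tokens with class names) can be sketched on plain lists. The label mapping below follows the order of the class list in this post and is an assumption; with a real model the mapping comes from its configuration. The tokens and scores are made up.

```python
# Assumed mapping, following the order of the 9 classes listed in this post.
id2label = {
    0: "O", 1: "B-MISC", 2: "I-MISC", 3: "B-PER", 4: "I-PER",
    5: "B-ORG", 6: "I-ORG", 7: "B-LOC", 8: "I-LOC",
}


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical tokens with one row of 9 class scores each (illustrative values only).
tokens = ["[CLS]", "New", "York", "is", "[SEP]"]
logits = [
    [5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 6.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 5.5],
    [4.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2],
    [5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]

labeled = [(token, id2label[argmax(row)]) for token, row in zip(tokens, logits)]
# Keeping only non-"O" tokens gives a list similar in spirit to the pipeline output.
entities_only = [(token, label) for token, label in labeled if label != "O"]
print(labeled)
print(entities_only)
```

Dropping class 0 is the only difference between the two printed lists.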

# Pytorch
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \
           "therefore very close to the Manhattan Bridge."

inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()

outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

# Tensorflow
from transformers import TFAutoModelForTokenClassification, AutoTokenizer
import tensorflow as tf

model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \
           "therefore very close to the Manhattan Bridge."

inputs = tokenizer(sequence, return_tensors="tf")
tokens = inputs.tokens()

outputs = model(**inputs)[0]
predictions = tf.argmax(outputs, axis=2)

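This outputs a list of each token mapped to its corresponding prediction. Unlike the pipeline, every token has a prediction here, because the tokens of class 0, which means no particular entity was found for that token, were not removed.

In the example above, the predictions are integers that correspond to the predicted classes. You can use the model.config.id2label property to recover the class name corresponding to each class number, as shown below.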

for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))

"""
('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')
"""

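Summarization

Summarization is the task of summarizing a document or an article into a shorter text. To fine-tune a model on a summarization task, you can leverage the run_summarization.py script.

An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was created for the task of summarization. If you would like to fine-tune a model on a summarization task, various approaches are described in this document.

Here is an example of using a pipeline to do summarization. It leverages a Bart model that was fine-tuned on the CNN / Daily Mail dataset.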

from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney's Office by Immigration and Customs Enforcement and the Department of Homeland Security's
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

Because the summarization pipeline depends on the PreTrainedModel.generate() method, the default arguments of PreTrainedModel.generate() can be overridden directly in the pipeline for max_length and min_length, as shown below. This outputs the following summary:

print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
"""
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]
"""

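Here is an example of doing summarization using a model and a tokenizer. The process is the following:

Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as Bart or T5.
Define the article that should be summarized.
Add the T5-specific prefix "summarize: ".
Use the PreTrainedModel.generate() method to generate the summary.

This example uses Google's T5 model. Even though it was pretrained only on a multi-task mixed dataset (including CNN / Daily Mail), it yields very good results.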
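Since T5 is told which task to perform through a text prefix, preparing its input amounts to string concatenation followed by truncation. The sketch below illustrates that idea; the helper and dictionary names are hypothetical, and splitting on whitespace is only a stand-in for real tokenization, which splits text differently.

```python
# Prefixes as used in this post; the dictionary keys are hypothetical labels.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_to_de": "translate English to German: ",
}


def build_t5_input(task, text, max_tokens=512):
    """Prepend the task prefix, then keep at most max_tokens whitespace-separated pieces."""
    # Assumption: whitespace pieces approximate tokens for illustration only.
    pieces = (TASK_PREFIXES[task] + text).split()
    return " ".join(pieces[:max_tokens])


print(build_t5_input("summarization", "New York (CNN) When Liana Barrientos was 23 years old", max_tokens=6))
print(build_t5_input("translation_en_to_de", "Hugging Face is a technology company"))
```

The prefix counts toward the length limit, so the text itself gets slightly fewer than max_tokens pieces.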

# Pytorch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)

print(tokenizer.decode(outputs[0]))
"""
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>
"""

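Translation

Translation is the task of translating a text from one language to another. To fine-tune a model on a translation task, you can leverage the run_translation.py script.

An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a translation task, various approaches are described in this document.

Here is an example of using a pipeline to do translation. It leverages a T5 model that was pretrained on a multi-task mixed dataset (including WMT) to produce the translation.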

from transformers import pipeline

translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
"""
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
"""

The translation pipeline depends on the PreTrainedModel.generate() method, so you can override the default max_length argument of PreTrainedModel.generate() directly in the pipeline call, as shown above.

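The following example performs translation with a model and a tokenizer. The process is as follows.

Instantiate a tokenizer and a model from the checkpoint. Translation is usually done with an encoder-decoder model such as Bart or T5.
Define the text to be translated.
Add the T5-specific prefix "translate English to German: ".
Use the PreTrainedModel.generate() method to generate the translation.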

# Pytorch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="pt"
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))
"""
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>
"""

# Tensorflow
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="tf"
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))
"""
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.
"""

The translation is printed as in the examples above.

License: CC BY-NC-ND (Attribution, NonCommercial, NoDerivatives)

Under the hood : Pretrained Models

지닝 2021. 10. 19. 13:56


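This post is a Korean translation of the Huggingface Transformers documentation. It contains mistranslations and loose translations, so please read it alongside the original.
English-to-Korean translation by: the author, @threegenie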

🔗 Huggingface Transformers Docs >> Quick Tour - Under the hood : pretrained models

Now let's look at what happens inside the pipeline when you use it.

As the code below shows, the model and tokenizer are created with the from_pretrained method.

# Pytorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tensorflow
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

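Using the tokenizer

The tokenizer is responsible for preprocessing the text. First, it splits a given text into tokens (words, parts of words, punctuation symbols, and so on). Several different rules can govern this process (you can learn more in the tokenizer summary), which is why you need to instantiate the tokenizer with the model name: that guarantees it uses the same rules as when the model was pretrained.

The second step is to convert those tokens into numbers so that a tensor can be built from them and fed to the model. To do this, the tokenizer has a vocab, which is downloaded when you instantiate the tokenizer with the from_pretrained method, because you need to use the same vocab as when the model was pretrained.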
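To make these two steps concrete, here is a toy sketch of splitting a sentence into tokens and looking each one up in a vocabulary. The `toy_vocab` dictionary and the `toy_encode` helper are hypothetical names made up for illustration. The ids are copied from the tokenizer output printed in this post, and the regex split only stands in for the real tokenization rules, so this is not how the library implements it and it only covers this one sentence.

```python
import re

# Hypothetical mini-vocabulary; ids copied from the tokenizer output shown in this post.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102, "[UNK]": 100,
    "we": 2057, "are": 2024, "very": 2200, "happy": 3407, "to": 2000,
    "show": 2265, "you": 2017, "the": 1996, "transformers": 19081,
    "library": 3075, ".": 1012,
}


def toy_encode(text):
    """Step 1: lowercase and split into tokens. Step 2: map each token to an id."""
    tokens = re.findall(r"\w+|[^\w\s]+", text.lower())
    # Tokens missing from the vocabulary (the emoji here) fall back to [UNK].
    ids = [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]
    # Wrap the sequence in the special start and end tokens.
    ids = [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


encoded = toy_encode("We are very happy to show you the 🤗 Transformers library.")
print(encoded)
```

The real tokenizer also splits unknown words into sub-word pieces, which this sketch does not attempt.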

To apply these steps to a given text, just feed it to the tokenizer as shown below.

inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

This returns a dictionary that maps strings to lists of integers. It contains the ids of the tokens, as well as additional arguments the model needs. Here, for example, it also includes an attention mask that the model uses to better understand the sequence.

print(inputs)

"""
{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
"""

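You can pass a list of sentences directly to the tokenizer. If the goal is to send them to the model as a batch, you will want to pad them all to the same length, truncate them to the maximum length the model accepts, and get tensors back. You can specify all of this to the tokenizer.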

# Pytorch
pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

# Tensorflow
tf_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf"
)

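Padding is applied automatically on the side the model expects (in this case, the right), using the padding token the model was pretrained with. The attention mask is also adjusted to account for the padding.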

# Pytorch
for key, value in pt_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

"""
input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
"""

# Tensorflow
for key, value in tf_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

"""
input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
"""

You can learn more about tokenizers here.
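The padding behaviour in the batch output above can be sketched in plain Python. The `pad_batch` helper below is a hypothetical illustration, not the tokenizer's own code. It assumes right padding and a padding id of 0, which is what the printed batch shows.

```python
PAD_ID = 0  # padding id as it appears in the printed batch (assumed, not read from a tokenizer)


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every id list to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
    [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102],
])
for key, value in batch.items():
    print(f"{key}: {value}")
```

The shorter sentence gains three padding ids on the right, and its attention mask gains three zeros in the same positions.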

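Using the model

Once the input data has been preprocessed by the tokenizer, you can send it directly to the model. As mentioned above, it contains all the information the model needs. With a TensorFlow model you can pass the dictionary to the model directly; with a PyTorch model you need to unpack the dictionary by adding **.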

# Pytorch 
pt_outputs = pt_model(**pt_batch) 

# Tensorflow 
tf_outputs = tf_model(tf_batch)

In Hugging Face Transformers, all outputs are objects that contain the model's final activations along with other metadata. These objects are described in more detail here. Let's look at the output.

# Pytorch
print(pt_outputs)

"""
SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833,  4.3364],
       [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
"""


# Tensorflow
print(tf_outputs)
"""
TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0833 ,  4.3364  ],
       [ 0.0818, -0.0418]], dtype=float32)>, hidden_states=None, attentions=None)
"""

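Note the logits field in the printed output. You can use it to access the model's final activations.

💛 Note
All Hugging Face Transformers models (PyTorch or TensorFlow) return the activations of the model before the final activation function (such as softmax), because that final activation function is often fused with the loss.

Let's apply the softmax activation to get predictions.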

# Pytorch
from torch import nn
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)

# Tensorflow
import tensorflow as tf
tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)

We can see that we get the same numbers as before.

# Pytorch
print(pt_predictions)
"""
tensor([[2.2043e-04, 9.9978e-01],
        [5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
"""

# Tensorflow
print(tf_predictions)
"""
tf.Tensor(
[[2.2043e-04 9.9978e-01]
 [5.3086e-01 4.6914e-01]], shape=(2, 2), dtype=float32)
"""

If you pass labels to the model in addition to the input data, the model output object also contains a loss attribute, as follows.
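Before looking at the outputs, it may help to reproduce the loss by hand. The sketch below assumes the loss is the usual cross-entropy over the logits printed earlier in this post, with labels [1, 0]. The `cross_entropy` helper is a hypothetical name and is not the library's implementation.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of the true label (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


logits = [[-4.0833, 4.3364], [0.0818, -0.0418]]  # copied from the printed model output
labels = [1, 0]

per_example = [cross_entropy(row, y) for row, y in zip(logits, labels)]
mean_loss = sum(per_example) / len(per_example)

print(per_example)  # one loss value per sentence
print(mean_loss)    # average over the batch
```

Under this assumption, the per-sentence values line up with the two-element loss in the TensorFlow output, and their mean lines up with the single scalar in the PyTorch output.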

# Pytorch
import torch
pt_outputs = pt_model(**pt_batch, labels = torch.tensor([1, 0]))
print(pt_outputs)
"""
SequenceClassifierOutput(loss=tensor(0.3167, grad_fn=<NllLossBackward>), logits=tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
"""

# Tensorflow
import tensorflow as tf
tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0]))
print(tf_outputs)
"""
TFSequenceClassifierOutput(loss=<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2.2051e-04, 6.3326e-01], dtype=float32)>, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0833 ,  4.3364  ],
       [ 0.0818, -0.0418]], dtype=float32)>, hidden_states=None, attentions=None)
"""

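The model is a standard torch.nn.Module or tf.keras.Model, so you can use it in your usual training loop. Hugging Face Transformers also provides a Trainer class (TFTrainer in TensorFlow) to help you train the model; it takes care of things such as distributed training and mixed precision. See the training tutorial for details.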

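💛 Note
PyTorch model outputs are special dataclasses, so your IDE can autocomplete their attributes. They also behave like a tuple or a dictionary (you can index them with an integer, a slice, or a string), in which case attributes that are not set (those with a None value) are ignored.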
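The behaviour described in this note can be imitated with a small dataclass. `ToyOutput` below is a hypothetical stand-in written only for illustration. It is not the library's output class, and the real one may differ in details.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ToyOutput:
    """Hypothetical stand-in for a model output object (not the library class)."""
    loss: Optional[Any] = None
    logits: Optional[Any] = None
    hidden_states: Optional[Any] = None
    attentions: Optional[Any] = None

    def to_tuple(self):
        # Keep only the attributes that were actually set.
        values = (getattr(self, f.name) for f in fields(self))
        return tuple(v for v in values if v is not None)

    def __getitem__(self, key):
        if isinstance(key, str):  # dictionary-style access
            value = getattr(self, key)
            if value is None:
                raise KeyError(key)
            return value
        return self.to_tuple()[key]  # integer or slice access


out = ToyOutput(logits=[[-4.0833, 4.3364], [0.0818, -0.0418]])
print(out.logits is out["logits"] is out[0])  # attribute, key and index give the same object
print(len(out.to_tuple()))                    # unset attributes are skipped
```

Because loss is unset in this example, index 0 refers to logits rather than loss, which is the "ignored" behaviour the note describes.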

Once the model is fine-tuned, you can save it together with the tokenizer as follows.

tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

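You can then reload this model with the from_pretrained() method by passing the directory name instead of the model name. One nice feature of Hugging Face Transformers is that you can easily switch between PyTorch and TensorFlow: a model saved as above can be reloaded in either PyTorch or TensorFlow. To load a saved PyTorch model into a TensorFlow model, use from_pretrained() like this.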

# Pytorch -> Tensorflow
from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = TFAutoModel.from_pretrained(save_directory, from_pt=True)

To load a saved TensorFlow model into a PyTorch model, use the following code.

# Tensorflow -> Pytorch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModel.from_pretrained(save_directory, from_tf=True)

Finally, you can ask the model to return all of its hidden states and all of its attention weights.

# Pytorch
pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states  = pt_outputs.hidden_states
all_attentions = pt_outputs.attentions

# Tensorflow
tf_outputs = tf_model(tf_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states =  tf_outputs.hidden_states
all_attentions = tf_outputs.attentions

Accessing the code

The AutoModel and AutoTokenizer classes are just shortcuts that automatically work with any pretrained model. Behind the scenes, the library has one model class per combination of architecture and class, so the code is easy to access and adapt as needed.

In the previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it uses the DistilBERT architecture. Because AutoModelForSequenceClassification (or TFAutoModelForSequenceClassification in TensorFlow) was used, the automatically created model is a DistilBertForSequenceClassification. You can look up all the details of that model in its documentation, or browse the source code. Here is how to instantiate the model and tokenizer directly.

# Pytorch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

# Tensorflow
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFDistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Customizing the model

To change how the model itself is built, you can define a custom configuration class. Each architecture comes with its own configuration. For example, DistilBertConfig lets you specify parameters such as the hidden dimension and dropout rate for DistilBERT. If you make a core modification such as changing the hidden size, you can no longer use the pretrained model and have to train from scratch. You then instantiate the model directly from the configuration.

Below, the from_pretrained() method loads a predefined vocabulary for the tokenizer. Unlike the tokenizer, however, we want to initialize the model from scratch, so instead of using from_pretrained() we instantiate the model from the configuration.

# Pytorch
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification(config)

# Tensorflow
from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification(config)

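If you only change the model head (for example, the number of labels), you can still use the pretrained model for the body. For example, let's define a classifier for 10 different labels on top of a pretrained model. Instead of creating a new configuration with all the default values just to change the number of labels, you can pass any argument a configuration accepts to the from_pretrained() method, and it updates the default configuration accordingly.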

# Pytorch
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

# Tensorflow
from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
model_name = "distilbert-base-uncased"
model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)


Getting started on a task with a pipeline

지닝 2021. 10. 18. 14:43


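This post is a Korean translation of the Huggingface Transformers documentation. It contains mistranslations and loose translations, so please read it alongside the original.
English-to-Korean translation by: the author, @threegenie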

🔗 Huggingface Transformers Docs >>Quick tour

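Let's take a quick look at the features of the Huggingface🤗 Transformers library. The library downloads pretrained models for natural language understanding (NLU) tasks, such as analyzing the sentiment of a text, and natural language generation (NLG) tasks, such as producing new text or translating into another language.

💛 Good to know
In the documentation, every code example has a switch on the right: flip it to the left for PyTorch, or the other way for TensorFlow. Where there is no switch, the code works in both frameworks without modification.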

Getting started on a task with a pipeline

🔗 Getting started on a task with a pipeline

📺 Youtube video : The pipeline function

The easiest way to use a pretrained model on a given task is the pipeline() function.

Transformers provides the following tasks.

※ Sentiment analysis: is a text positive or negative?
※ Text generation (in English): you provide a prompt and the model generates what follows
※ Named entity recognition (NER): each word in the input sentence is labeled with the entity it represents (person, place, and so on)
※ Question answering: you give the model a context and a question, and it extracts the answer from the context
※ Filling masked text: given a text with masked words (replaced by [MASK]), the model fills in the blanks
※ Summarization: generates a summary of a long text
※ Translation: translates a text into another language
※ Feature extraction: returns a tensor representation of the text

Let's see how this works for sentiment analysis (the other tasks are covered in the task summary).

from transformers import pipeline
classifier = pipeline('sentiment-analysis')

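The first time you run this code, a pretrained model and its tokenizer are downloaded and cached. We will look at both later, but the tokenizer's job is to preprocess the text for the model, which is then responsible for making predictions. The pipeline groups all of this together and post-processes the predictions to make them readable.

For example: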

classifier('We are very happy to show you the 🤗 Transformers library.') 
# [{'label': 'POSITIVE', 'score': 0.9998}]

Encouraging, isn't it? You can also pass a list of sentences, which are preprocessed and then fed to the model, returning a list of dictionaries.

results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."]) 

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

# label: POSITIVE, with score: 0.9998
# label: NEGATIVE, with score: 0.5309

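To use the library with large datasets, see iterating over a pipeline.

In the example above you can see that the second sentence was classified as negative (it has to be classified as either positive or negative), but its score is fairly neutral, close to 0.5.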
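To see where such a near-neutral score comes from, here is a rough sketch of the post-processing step: softmax over the raw logits, then picking the most likely label. The logits are copied from the "Under the hood : Pretrained Models" post for the same two sentences. The `ID2LABEL` mapping and the `postprocess` helper are assumptions made for illustration and are not the pipeline's actual code.

```python
import math

# Assumed label mapping for this checkpoint: index 0 = NEGATIVE, index 1 = POSITIVE.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def postprocess(logits):
    """Turn raw logits into a {'label', 'score'} dictionary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
    return {"label": ID2LABEL[best], "score": round(probs[best], 4)}


results = [postprocess(row) for row in [[-4.0833, 4.3364], [0.0818, -0.0418]]]
for result in results:
    print(result)
```

The two logits for the second sentence are almost equal, so the softmax probabilities land close to 0.5 even though one label still has to win.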

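The model downloaded by default for this pipeline is distilbert-base-uncased-finetuned-sst-2-english. You can find more information on its model page. It uses the DistilBERT architecture and was fine-tuned on a dataset called SST-2 for the sentiment analysis task.

If you want to use another model (for example, one trained on French data), you can search the model hub, which gathers models pretrained on large amounts of data by research labs as well as community models (versions of those models fine-tuned on a specific dataset). Applying the tags 'French' and 'text-classification' suggests trying the 'nlptown/bert-base-multilingual-uncased-sentiment' model.

Let's see how to use another model.

You can pass the model name directly to the pipeline() function.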

classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

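This classifier can now handle text not only in English and French but also in Dutch, German, Italian, and Spanish! You can also replace that name with a local folder where you have saved a pretrained model (see below). You can also pass a model object and its associated tokenizer.

We need two classes for this.

The first is AutoTokenizer, which is used to download the tokenizer associated with the chosen model and instantiate it.

The second is AutoModelForSequenceClassification (or TFAutoModelForSequenceClassification in TensorFlow), which is used to download the model itself. If you use the library for another task, the class of the model changes.

The task summary tutorial lists which class is used for which task.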

# Pytorch 
from transformers import AutoTokenizer, AutoModelForSequenceClassification 

# Tensorflow 
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

Now, to download the model and tokenizer found earlier, just use the from_pretrained() method (feel free to replace model_name with any other model from the model hub).

# Pytorch 
model_name = "nlptown/bert-base-multilingual-uncased-sentiment" 
model = AutoModelForSequenceClassification.from_pretrained(model_name) 
tokenizer = AutoTokenizer.from_pretrained(model_name) 
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer) 

# Tensorflow 
model_name = "nlptown/bert-base-multilingual-uncased-sentiment" 
# This model only exists in PyTorch, so we set from_pt=True to use it in TensorFlow.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True) 
tokenizer = AutoTokenizer.from_pretrained(model_name) 
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

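If you cannot find a model pretrained on data similar to yours, you will need to fine-tune a pretrained model on your data. Example scripts are provided for this.

Once you have finished fine-tuning, please consider sharing the model on the community hub by following this tutorial.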


Notes on the matplotlib.pyplot.quiver function

지닝 2021. 1. 13. 23:53


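I wanted to plot 3D vectors but couldn't find a good way to do it; after searching around, I came across a function called quiver.

Until now I had only ever used the arrow function to draw 2D vectors, so quiver was unfamiliar to me.

There wasn't as much material on other blogs as I expected, and there was hardly anything on how to draw the kind of plot I wanted. So I'm writing this up for my future self.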

quiver plot ?

It is a plot that draws straight lines with arrowheads on a 2D plane. It looks similar to an arrow plot.

But here I'll use it to plot 3D vectors...

Quick summary of the official docs

quiver([X, Y], U, V, [C], **kw)

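- Main parameters

X , Y : define the arrow locations (optional; if omitted, an integer grid is generated from the shape of U and V)

U , V : define the arrow directions

- Optional parameters

C : sets the colors

**kw : PolyCollection properties such as color, edgecolor, label, linestyle, visible, and so on

- Other parameters

scale_units : { 'width', 'height', 'dots', 'inches', 'x', 'y', 'xy'}

headwidth, headlength, headaxislength, minshaft, minlength, and so on

Official documentation link ↓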

(matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.quiver.html)

Usage example from the official docs

from mpl_toolkits.mplot3d import Axes3D 

import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # fig.gca(projection='3d') is not accepted by recent Matplotlib releases

# Make the grid
x, y, z = np.meshgrid(np.arange(-0.8, 1, 0.2),
                      np.arange(-0.8, 1, 0.2),
                      np.arange(-0.8, 1, 0.8))

# Make the direction data for the arrows
u = np.sin(np.pi * x) * np.cos(np.pi * y) * np.cos(np.pi * z)
v = -np.cos(np.pi * x) * np.sin(np.pi * y) * np.cos(np.pi * z)
w = (np.sqrt(2.0 / 3.0) * np.cos(np.pi * x) * np.cos(np.pi * y) *
     np.sin(np.pi * z))

ax.quiver(x, y, z, u, v, w, length=0.1, normalize=True)

plt.show()

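- Matplotlib axes are two-dimensional by default. On older Matplotlib versions you have to import the Axes3D class from the mplot3d toolkit to enable the 3D projection; since Matplotlib 3.2 the '3d' projection is registered automatically, so the import is optional.

- fig.gca(projection='3d') creates the 3D axes object. Writing fig.add_subplot(111,projection='3d') gives the same result. Recent Matplotlib releases no longer accept keyword arguments in fig.gca(), so add_subplot is the safer choice there.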
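The example also passes length=0.1 and normalize=True. As a rough sketch of the arithmetic this implies (an assumption about the effect, not Matplotlib's actual implementation), each direction vector is rescaled so that every arrow ends up with the same length. The `normalized_arrow` helper is a hypothetical name.

```python
import math


def normalized_arrow(u, v, w, length=0.1):
    """Scale the direction (u, v, w) so that its Euclidean norm equals `length`."""
    norm = math.sqrt(u * u + v * v + w * w)
    if norm == 0.0:
        # A zero vector has no direction, so leave it as a zero-length arrow.
        return (0.0, 0.0, 0.0)
    scale = length / norm
    return (u * scale, v * scale, w * scale)


arrow = normalized_arrow(3.0, 4.0, 0.0)
print(arrow)
```

Without normalization, the arrow lengths would instead follow the magnitudes of u, v, and w.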

My own simple 3D vector plot

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # only needed on older Matplotlib versions

# Declare some arbitrary 3D vectors
a = np.array([2,4,5])
b = np.array([-3,1,4])
c = np.array([0,7,1])

# Set up the 3D axes to draw on
fig = plt.figure()
ax = fig.add_subplot(111,projection='3d') # on older Matplotlib versions, fig.gca(projection='3d') also works

# Draw the vectors: ax.quiver(start x, start y, start z, x component, y component, z component; color is optional)
a_plot = ax.quiver(0,0,0,a[0],a[1],a[2],color='red')
b_plot = ax.quiver(0,0,0,b[0],b[1],b[2],color='blue')
c_plot = ax.quiver(0,0,0,c[0],c[1],c[2],color='green')

# Axis ranges to display (leave a little room)
ax.set_xlim([-5,3])
ax.set_ylim([0,8])
ax.set_zlim([0,5])

# Show the legend, placed at the upper left
ax.legend([a_plot, b_plot, c_plot], ['a', 'b', 'c'], loc='upper left')

# Set the plot title
ax.set_title('3 dimensional vector with quiver')