Summary of the tasks

이 νŽ˜μ΄μ§€μ—λŠ” 라이브러리 μ‚¬μš© μ‹œ κ°€μž₯ 많이 μ μš©λ˜λŠ” 사둀가 μ†Œκ°œλ˜μ–΄ μžˆμŠ΅λ‹ˆλ‹€. ν—ˆκΉ…νŽ˜μ΄μŠ€ 트랜슀포머의 λͺ¨λΈλ“€μ€ λ‹€μ–‘ν•œ ꡬ성과 μ‚¬μš© 사둀λ₯Ό μ§€μ›ν•©λ‹ˆλ‹€. κ°€μž₯ κ°„λ‹¨ν•œ 것은 질문 λ‹΅λ³€(question answering), μ‹œν€€μŠ€ λΆ„λ₯˜(sequence classification), 개체λͺ… 인식(named entity recognition) λ“±κ³Ό 같은 μž‘μ—…μ— λŒ€ν•œ μ‚¬λ‘€λ“€μž…λ‹ˆλ‹€.

 

μ΄λŸ¬ν•œ μ˜ˆμ œμ—μ„œλŠ” μ˜€ν† λͺ¨λΈ(auto-models)을 ν™œμš©ν•©λ‹ˆλ‹€. μ˜€ν† λͺ¨λΈμ€ 주어진 μ²΄ν¬ν¬μΈνŠΈμ— 따라 λͺ¨λΈμ„ μΈμŠ€ν„΄μŠ€ν™”ν•˜κ³  μ˜¬λ°”λ₯Έ λͺ¨λΈ μ•„ν‚€ν…μ²˜λ₯Ό μžλ™μœΌλ‘œ μ„ νƒν•˜λŠ” ν΄λž˜μŠ€μž…λ‹ˆλ‹€. μžμ„Έν•œ λ‚΄μš©μ€ AutoModel λ¬Έμ„œλ₯Ό μ°Έμ‘°ν•˜μ‹­μ‹œμ˜€. λ¬Έμ„œλ₯Ό μ°Έμ‘°ν•˜μ—¬ μ½”λ“œλ₯Ό 더 ꡬ체적으둜 μˆ˜μ •ν•˜κ³ , νŠΉμ • μ‚¬μš© 사둀에 맞게 자유둭게 μ‘°μ •ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

 

λͺ¨λΈμ΄ 잘 μ‹€ν–‰λ˜λ €λ©΄ ν•΄λ‹Ή νƒœμŠ€ν¬μ— ν•΄λ‹Ήν•˜λŠ” μ²΄ν¬ν¬μΈνŠΈμ—μ„œ λ‘œλ“œλ˜μ–΄μ•Ό ν•©λ‹ˆλ‹€. μ΄λŸ¬ν•œ μ²΄ν¬ν¬μΈνŠΈλŠ” 일반적으둜 λŒ€κ·œλͺ¨ 데이터 집합을 μ‚¬μš©ν•˜μ—¬ ν”„λ¦¬νŠΈλ ˆμΈλ˜κ³  νŠΉμ • νƒœμŠ€ν¬μ— λŒ€ν•΄ νŒŒμΈνŠœλ‹ λ©λ‹ˆλ‹€. μ΄λŠ” μ•„λž˜μ™€ κ°™μŠ΅λ‹ˆλ‹€.

  • λͺ¨λ“  λͺ¨λΈμ΄ λͺ¨λ“  νƒœμŠ€ν¬μ— λŒ€ν•΄ νŒŒμΈνŠœλ‹λœ 것은 μ•„λ‹™λ‹ˆλ‹€. νŠΉμ • νƒœμŠ€ν¬μ—μ„œ λͺ¨λΈμ„ νŒŒμΈνŠœλ‹ν•˜λ €λ©΄ 예제 λ””λ ‰ν† λ¦¬μ˜ run_$TASK.py슀크립트λ₯Ό ν™œμš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
  • νŒŒμΈνŠœλ‹λœ λͺ¨λΈμ€ νŠΉμ • 데이터셋을 μ‚¬μš©ν•˜μ—¬ νŒŒμΈνŠœλ‹λ˜μ—ˆμŠ΅λ‹ˆλ‹€. 이 데이터셋은 μ‚¬μš© 예제 및 도메인과 관련이 μžˆμ„ 수 μžˆμ§€λ§Œ, 그렇지 μ•Šμ„ μˆ˜λ„ μžˆμŠ΅λ‹ˆλ‹€. μ•žμ„œ μ–ΈκΈ‰ν–ˆλ“―μ΄ 예제 슀크립트λ₯Ό ν™œμš©ν•˜μ—¬ λͺ¨λΈμ„ νŒŒμΈνŠœλ‹ν•˜κ±°λ‚˜ λͺ¨λΈ ν•™μŠ΅μ— μ‚¬μš©ν•  슀크립트λ₯Ό 직접 μž‘μ„±ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

μΆ”λ‘  νƒœμŠ€ν¬λ₯Ό μœ„ν•΄ λΌμ΄λΈŒλŸ¬λ¦¬μ—μ„œ λͺ‡ 가지 λ©”μ»€λ‹ˆμ¦˜μ„ μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

  • νŒŒμ΄ν”„λΌμΈ : μ‚¬μš©ν•˜κΈ° 맀우 μ‰¬μš΄ λ°©μ‹μœΌλ‘œ, 두 μ€„μ˜ μ½”λ“œλ‘œ μ‚¬μš©μ΄ κ°€λŠ₯ν•©λ‹ˆλ‹€.
  • 직접 λͺ¨λΈ μ‚¬μš©ν•˜κΈ° : 좔상화가 덜 λ˜μ§€λ§Œ, ν† ν¬λ‚˜μ΄μ €(νŒŒμ΄ν† μΉ˜/ν…μ„œν”Œλ‘œμš°)에 직접 μ•‘μ„ΈμŠ€ν•  수 μžˆλ‹€λŠ” μ μ—μ„œ μœ μ—°μ„±κ³Ό μ„±λŠ₯이 ν–₯μƒλ©λ‹ˆλ‹€.

Both approaches are showcased here.

πŸ’› Note
All tasks presented here leverage pretrained checkpoints that were fine-tuned on specific tasks. Loading a checkpoint that was not fine-tuned on a specific task loads only the base transformer layers, without the additional head used for the task, and the weights of that head are randomly initialized. This produces random output.

Sequence Classification

Sequence classification is the task of classifying sequences according to a given number of classes. An example of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune a model on a GLUE sequence classification task, you may leverage the run_glue.py, run_tf_glue.py, run_tf_classification.py or run_xnli.py scripts.

λ‹€μŒμ€ νŒŒμ΄ν”„λΌμΈμ„ μ‚¬μš©ν•˜μ—¬ μ‹œν€€μŠ€κ°€ 긍정인지 뢀정인지λ₯Ό μ‹λ³„ν•˜μ—¬ 감성뢄석을 μˆ˜ν–‰ν•˜λŠ” μ˜ˆμž…λ‹ˆλ‹€. GLUE νƒœμŠ€ν¬μΈ sst2μ—μ„œ νŒŒμΈνŠœλ‹λœ λͺ¨λΈμ„ ν™œμš©ν•©λ‹ˆλ‹€.

μ΄λ ‡κ²Œ ν•˜λ©΄ λ‹€μŒκ³Ό 같이 μŠ€μ½”μ–΄μ™€ ν•¨κ»˜ 라벨(POSITIVE-긍정 or NEGATIVE-λΆ€μ •)이 λ°˜ν™˜λ©λ‹ˆλ‹€.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

result = classifier("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

λ‹€μŒμ€ λͺ¨λΈμ„ μ‚¬μš©ν•˜μ—¬ 두 μ‹œν€€μŠ€κ°€ μ„œλ‘œ 같은 의미의 λ‹€λ₯Έ λ¬Έμž₯μΈμ§€μ˜ μ—¬λΆ€(paraphrase or not)λ₯Ό κ²°μ •ν•˜λŠ” μ‹œν€€μŠ€ λΆ„λ₯˜μ˜ μ˜ˆμž…λ‹ˆλ‹€. ν”„λ‘œμ„ΈμŠ€λŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.

  1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded with the weights stored in the checkpoint.
  2. Build a sequence from the two sentences, with the correct model-specific separators, token type IDs and attention masks (created automatically by the tokenizer).
  3. Pass this sequence through the model so that it is classified into one of the two available classes: 0 (not a paraphrase) and 1 (is a paraphrase).
  4. Compute the softmax of the result to get probabilities over the classes.
  5. Print the results.
# Pytorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
# the sequence, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
"""
not paraphrase: 10%
is paraphrase: 90%
"""

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
"""
not paraphrase: 94%
is paraphrase: 6%
"""
# Tensorflow
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
# the sequence, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")

paraphrase_classification_logits = model(paraphrase).logits
not_paraphrase_classification_logits = model(not_paraphrase).logits

paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
"""
not paraphrase: 10%
is paraphrase: 90%
"""

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
"""
not paraphrase: 94%
is paraphrase: 6%
"""

μΆ”μΆœ μ§ˆμ˜μ‘λ‹΅(Extractive Question Answering)

μΆ”μΆœ μ§ˆμ˜μ‘λ‹΅μ€ 주어진 질문 ν…μŠ€νŠΈμ—μ„œ 닡을 μΆ”μΆœν•˜λŠ” μž‘μ—…μž…λ‹ˆλ‹€. 질문 λ‹΅λ³€ λ°μ΄ν„°μ…‹μ˜ μ˜ˆλ‘œλŠ” ν•΄λ‹Ή μž‘μ—…μ„ 기반으둜 ν•˜λŠ” SQuAD 데이터셋이 μžˆμŠ΅λ‹ˆλ‹€. SQuAD μž‘μ—…μ—μ„œ λͺ¨λΈμ„ νŒŒμΈνŠœλ‹ν•˜λ €λ©΄ run_qa.py 및 run_tf_squad.py 슀크립트λ₯Ό ν™œμš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

λ‹€μŒμ€ νŒŒμ΄ν”„λΌμΈμ„ μ‚¬μš©ν•˜μ—¬ 주어진 질문 ν…μŠ€νŠΈμ—μ„œ 닡변을 μΆ”μΆœν•˜λŠ” μ§ˆμ˜μ‘λ‹΅μ„ μˆ˜ν–‰ν•˜λŠ” μ˜ˆμž…λ‹ˆλ‹€. SQuAD 데이터셋을 톡해 νŒŒμΈνŠœλ‹λœ λͺ¨λΈμ„ ν™œμš©ν•©λ‹ˆλ‹€.

from transformers import pipeline

question_answerer = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

μ΄λ ‡κ²Œ ν•˜λ©΄ ν…μŠ€νŠΈμ—μ„œ μΆ”μΆœλœ λ‹΅λ³€κ³Ό **μ‹ λ’° 점수(confidence score)**κ°€ ν…μŠ€νŠΈμ—μ„œ μΆ”μΆœλœ λ‹΅λ³€μ˜ μœ„μΉ˜μΈ 'μ‹œμž‘' 및 'μ’…λ£Œ' κ°’κ³Ό ν•¨κ»˜ λ°˜ν™˜λ©λ‹ˆλ‹€.

result = question_answerer(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

λͺ¨λΈ 및 ν† ν¬λ‚˜μ΄μ €λ₯Ό μ‚¬μš©ν•˜μ—¬ μ§ˆλ¬Έμ— λŒ€λ‹΅ν•˜λŠ” μ˜ˆμž…λ‹ˆλ‹€. ν”„λ‘œμ„ΈμŠ€λŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.

  1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded with the weights stored in the checkpoint.
  2. Define a text and a few questions.
  3. Iterate over the questions and build a sequence from the text and the current question, with the correct model-specific separators, token type IDs and attention masks.
  4. Pass this sequence through the model. This outputs a range of scores across the entire sequence tokens (question and text), for both the start and end positions.
  5. Compute the softmax of the result to get probabilities over the tokens.
  6. Fetch the tokens from the identified start and end positions and convert those tokens to a string.
  7. Print the results.
# Pytorch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
πŸ€— Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in πŸ€— Transformers?",
    "What does πŸ€— Transformers provide?",
    "πŸ€— Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}")

"""
Question: How many pretrained models are available in πŸ€— Transformers?
Answer: over 32 +
Question: What does πŸ€— Transformers provide?
Answer: general - purpose architectures
Question: πŸ€— Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
"""
# Tensorflow
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
πŸ€— Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in πŸ€— Transformers?",
    "What does πŸ€— Transformers provide?",
    "πŸ€— Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
    input_ids = inputs["input_ids"].numpy()[0]
    outputs = model(inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    # Get the most likely beginning of answer with the argmax of the score
    answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
    # Get the most likely end of answer with the argmax of the score
    answer_end = tf.argmax(answer_end_scores, axis=1).numpy()[0] + 1
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}")

"""
Question: How many pretrained models are available in πŸ€— Transformers?
Answer: over 32 +
Question: What does πŸ€— Transformers provide?
Answer: general - purpose architectures
Question: πŸ€— Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
"""

μ–Έμ–΄ λͺ¨λΈλ§(Language Modeling)

μ–Έμ–΄ λͺ¨λΈλ§μ€ λͺ¨λΈμ„ μ½”νΌμŠ€μ— λ§žμΆ”λŠ” μž‘μ—…μ΄λ©°, νŠΉμ • 도메인에 νŠΉν™”μ‹œν‚¬ 수 μžˆμŠ΅λ‹ˆλ‹€. λͺ¨λ“  트랜슀포머 기반 λͺ¨λΈμ€ μ–Έμ–΄ λͺ¨λΈλ§μ„ λ³€ν˜•(예: 마슀크된 μ–Έμ–΄ λͺ¨λΈλ§μ„ μ‚¬μš©ν•œ BERT, 일상 μ–Έμ–΄ λͺ¨λΈλ§μ„ μ‚¬μš©ν•œ GPT-2)ν•˜μ—¬ ν›ˆλ ¨λ©λ‹ˆλ‹€.

μ–Έμ–΄ λͺ¨λΈλ§μ€ ν”„λ¦¬νŠΈλ ˆμ΄λ‹ 이외에도 λͺ¨λΈ 배포λ₯Ό 각 도메인에 맞게 νŠΉν™”μ‹œν‚€κΈ° μœ„ν•΄ μœ μš©ν•˜κ²Œ μ‚¬μš©λ  수 μžˆμŠ΅λ‹ˆλ‹€. 예λ₯Ό λ“€μ–΄, λŒ€μš©λŸ‰ μ½”νΌμŠ€λ₯Ό 톡해 ν›ˆλ ¨λœ μ–Έμ–΄ λͺ¨λΈμ„ μ‚¬μš©ν•œ λ‹€μŒ λ‰΄μŠ€ 데이터셋 λ˜λŠ” κ³Όν•™ λ…Όλ¬Έ 데이터셋(예 : LysandreJik/arxiv-nlp)으둜 νŒŒμΈνŠœλ‹ν•˜λŠ” κ²ƒμž…λ‹ˆλ‹€.

 

마슀크된 μ–Έμ–΄ λͺ¨λΈλ§(Masked Language Modeling)

마슀크된 μ–Έμ–΄ λͺ¨λΈλ§μ€ λ§ˆμŠ€ν‚Ή 토큰을 μ‚¬μš©ν•˜μ—¬ μˆœμ„œλŒ€λ‘œ 토큰을 λ§ˆμŠ€ν‚Ήν•˜κ³  λͺ¨λΈμ΄ ν•΄λ‹Ή 마슀크λ₯Ό μ μ ˆν•œ ν† ν°μœΌλ‘œ μ±„μš°λ„λ‘ μš”μ²­ν•˜λŠ” μž‘μ—…μž…λ‹ˆλ‹€. λ”°λΌμ„œ λͺ¨λΈμ΄ 였λ₯Έμͺ½ μ»¨ν…μŠ€νŠΈ(마슀크 였λ₯Έμͺ½μ˜ 토큰)와 μ™Όμͺ½ μ»¨ν…μŠ€νŠΈ(마슀크 μ™Όμͺ½μ˜ 토큰)λ₯Ό λͺ¨λ‘ μ‚΄νŽ΄λ³Ό 수 μžˆμŠ΅λ‹ˆλ‹€. μ΄λŸ¬ν•œ ν›ˆλ ¨μ€ SQuAD(μ§ˆμ˜μ‘λ‹΅, Lewis, Lui, Goyal et al, 파트 4.2)와 같은 μ–‘λ°©ν–₯ μ»¨ν…μŠ€νŠΈλ₯Ό ν•„μš”λ‘œ ν•˜λŠ” λ‹€μš΄μŠ€νŠΈλ¦Ό μž‘μ—…μ— λŒ€ν•œ κ°•λ ₯ν•œ 기초 λͺ¨λΈμ„ λ§Œλ“­λ‹ˆλ‹€. λ§ˆμŠ€ν‚Ήλœ μ–Έμ–΄ λͺ¨λΈλ§ μž‘μ—…μ—μ„œ λͺ¨λΈμ„ νŒŒμΈνŠœλ‹ν•˜λ €λ©΄ run_mlm.py 슀크립트λ₯Ό ν™œμš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

λ‹€μŒμ€ νŒŒμ΄ν”„λΌμΈμ„ μ‚¬μš©ν•˜μ—¬ μ‹œν€€μŠ€μ—μ„œ 마슀크λ₯Ό κ΅μ²΄ν•˜λŠ” μ˜ˆμž…λ‹ˆλ‹€.

from transformers import pipeline

unmasker = pipeline("fill-mask")

그러면 λ§ˆμŠ€ν¬κ°€ μ±„μ›Œμ§„ μ‹œν€€μŠ€, μŠ€μ½”μ–΄ 및 토큰IDκ°€ ν† ν¬λ‚˜μ΄μ €λ₯Ό 톡해 좜λ ₯λ©λ‹ˆλ‹€.

from pprint import pprint
pprint(unmasker(f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."))
[{'score': 0.1793,
  'sequence': 'HuggingFace is creating a tool that the community uses to solve '
              'NLP tasks.',
  'token': 3944,
  'token_str': ' tool'},
 {'score': 0.1135,
  'sequence': 'HuggingFace is creating a framework that the community uses to '
              'solve NLP tasks.',
  'token': 7208,
  'token_str': ' framework'},
 {'score': 0.0524,
  'sequence': 'HuggingFace is creating a library that the community uses to '
              'solve NLP tasks.',
  'token': 5560,
  'token_str': ' library'},
 {'score': 0.0349,
  'sequence': 'HuggingFace is creating a database that the community uses to '
              'solve NLP tasks.',
  'token': 8503,
  'token_str': ' database'},
 {'score': 0.0286,
  'sequence': 'HuggingFace is creating a prototype that the community uses to '
              'solve NLP tasks.',
  'token': 17715,
  'token_str': ' prototype'}]

λ‹€μŒμ€ λͺ¨λΈ 및 ν† ν¬λ‚˜μ΄μ €λ₯Ό μ‚¬μš©ν•˜μ—¬ λ§ˆμŠ€ν‚Ήλœ μ–Έμ–΄ λͺ¨λΈλ§μ„ μˆ˜ν–‰ν•˜λŠ” μ˜ˆμž…λ‹ˆλ‹€. ν”„λ‘œμ„ΈμŠ€λŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.

  1. Instantiate a tokenizer and a model from the checkpoint name. Here we use a DistilBERT model, loaded with the weights stored in the checkpoint.
  2. Define a sequence with a masked token, placing tokenizer.mask_token instead of a word.
  3. Encode that sequence into a list of IDs and find the position of the masked token in that list.
  4. Retrieve the predictions at the index of the masked token: this tensor has the same size as the vocabulary, and the values are the scores attributed to each token. The model gives higher scores to tokens it deems probable in that context.
  5. Retrieve the top 5 tokens using the PyTorch topk or TensorFlow top_k methods.
  6. Replace the mask token with the predicted tokens and print the results.
# Pytorch
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
    f"versions would help {tokenizer.mask_token} our carbon footprint."

inputs = tokenizer(sequence, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
"""
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
"""
# Tensorflow
from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFAutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
    f"versions would help {tokenizer.mask_token} our carbon footprint."

inputs = tokenizer(sequence, return_tensors="tf")
mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]

token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
"""
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
"""

λͺ¨λΈμ—μ„œ μ˜ˆμΈ‘ν•œ μƒμœ„ 5개의 ν† ν°λ“€μœΌλ‘œ 이루어진 5개의 μ‹œν€€μŠ€κ°€ ν”„λ¦°νŠΈλ©λ‹ˆλ‹€.

 

인과 μ–Έμ–΄ λͺ¨λΈλ§(Causal Language Modeling)

인과 μ–Έμ–΄ λͺ¨λΈλ§μ€ 토큰 μˆœμ„œμ— 따라 λ‹€μŒ 토큰을 μ˜ˆμΈ‘ν•˜λŠ” μž‘μ—…μž…λ‹ˆλ‹€. 이 κ³Όμ •μ—μ„œλŠ” λͺ¨λΈμ΄ μ™Όμͺ½ μ»¨ν…μŠ€νŠΈ(마슀크 μ™Όμͺ½μ— μžˆλŠ” 토큰)μ—λ§Œ μ§‘μ€‘ν•˜κ²Œ λ©λ‹ˆλ‹€. μ΄λŸ¬ν•œ ν•™μŠ΅ 과정은 λ¬Έμž₯ 생성 μž‘μ—…κ³Ό 특히 연관이 μžˆμŠ΅λ‹ˆλ‹€. 인과 μ–Έμ–΄ λͺ¨λΈλ§ μž‘μ—…μ—μ„œ λͺ¨λΈμ„ νŒŒμΈνŠœλ‹ν•˜λ €λ©΄ run_clm.py 슀크립트λ₯Ό ν™œμš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

일반적으둜 λ‹€μŒ 토큰은 λͺ¨λΈμ΄ μž…λ ₯ μ‹œν€€μŠ€μ—μ„œ μƒμ„±ν•˜λŠ” λ§ˆμ§€λ§‰ νžˆλ“  λ ˆμ΄μ–΄μ˜ logitμ—μ„œ μƒ˜ν”Œλ§λ˜μ–΄ μ˜ˆμΈ‘λ©λ‹ˆλ‹€.

λ‹€μŒμ€ ν† ν¬λ‚˜μ΄μ €μ™€ λͺ¨λΈμ„ μ‚¬μš©ν•˜κ³  top_k_top_p_filtering() λ©”μ†Œλ“œλ₯Ό ν™œμš©ν•˜μ—¬ 인풋 토큰 μ‹œν€€μŠ€μ— 따라 λ‹€μŒ 토큰을 μƒ˜ν”Œλ§ν•˜λŠ” μ˜ˆμž…λ‹ˆλ‹€.

# Pytorch

from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

sequence = f"Hugging Face is based in DUMBO, New York City, and"

inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]

# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]

# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
"""
Hugging Face is based in DUMBO, New York City, and ...
"""
# Tensorflow

from transformers import TFAutoModelForCausalLM, AutoTokenizer, tf_top_k_top_p_filtering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")

sequence = f"Hugging Face is based in DUMBO, New York City, and"

inputs = tokenizer(sequence, return_tensors="tf")
input_ids = inputs["input_ids"]

# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]

# filter
filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)

generated = tf.concat([input_ids, next_token], axis=1)

resulting_string = tokenizer.decode(generated.numpy().tolist()[0])
print(resulting_string)

"""
Hugging Face is based in DUMBO, New York City, and ...
"""

μ΄λ ‡κ²Œ ν•˜λ©΄ μ›λž˜μ˜ μˆœμ„œμ— 따라 일관성 μžˆλŠ” λ‹€μŒ 토큰이 좜λ ₯λ©λ‹ˆλ‹€. 이 토큰은 우리의 경우 단어 λ˜λŠ” νŠΉμ§•μž…λ‹ˆλ‹€.

λ‹€μŒ μ„Ήμ…˜μ—μ„œλŠ” ν•œ λ²ˆμ— ν•˜λ‚˜μ˜ 토큰이 μ•„λ‹ˆλΌ μ§€μ •λœ 길이둜 μ—¬λŸ¬ 토큰을 μƒμ„±ν•˜λŠ” 데 *generate()*λ₯Ό μ‚¬μš©ν•˜λŠ” 방법을 보여 μ€λ‹ˆλ‹€.

ν…μŠ€νŠΈ 생성(Text Generation)

ν…μŠ€νŠΈ 생성(κ°œλ°©ν˜• ν…μŠ€νŠΈ 생성이라고도 함)의 λͺ©ν‘œλŠ” 주어진 Context와 μΌκ΄€λ˜κ²Œ μ΄μ–΄μ§€λŠ” ν…μŠ€νŠΈλ₯Ό λ§Œλ“œλŠ” κ²ƒμž…λ‹ˆλ‹€. λ‹€μŒ μ˜ˆλŠ” νŒŒμ΄ν”„λΌμΈμ—μ„œ GPT-2λ₯Ό μ‚¬μš©ν•˜μ—¬ ν…μŠ€νŠΈλ₯Ό μƒμ„±ν•˜λŠ” 방법을 λ³΄μ—¬μ€λ‹ˆλ‹€. 기본적으둜 λͺ¨λ“  λͺ¨λΈμ€ νŒŒμ΄ν”„λΌμΈμ—μ„œ μ‚¬μš©ν•  λ•Œ 각 Configμ—μ„œ μ„€μ •ν•œ λŒ€λ‘œ Top-K μƒ˜ν”Œλ§μ„ μ μš©ν•©λ‹ˆλ‹€(μ˜ˆμ‹œ : gpt-2 config μ°Έμ‘°).

from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))

"""
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a
"free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
"""

μ—¬κΈ°μ„œ λͺ¨λΈμ€ "As far as I am concerned, I will"λΌλŠ” Contextμ—μ„œ 총 μ΅œλŒ€ 길이 50개의 토큰을 가진 μž„μ˜μ˜ ν…μŠ€νŠΈλ₯Ό μƒμ„±ν•©λ‹ˆλ‹€. λ°±κ·ΈλΌμš΄λ“œμ—μ„œ νŒŒμ΄ν”„λΌμΈ κ°μ²΄λŠ” generate() λ©”μ„œλ“œλ₯Ό ν˜ΈμΆœν•˜μ—¬ ν…μŠ€νŠΈλ₯Ό μƒμ„±ν•©λ‹ˆλ‹€. max_length 및 do_sample μΈμˆ˜μ™€ 같이 이 λ©”μ„œλ“œμ˜ κΈ°λ³Έ μΈμˆ˜λŠ” νŒŒμ΄ν”„λΌμΈμ—μ„œ μž¬μ •μ˜ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

λ‹€μŒμ€ XLNet 및 ν•΄λ‹Ή ν† ν¬λ‚˜μ΄μ €λ₯Ό μ‚¬μš©ν•œ ν…μŠ€νŠΈ 생성 예제이며, generate() λ©”μ„œλ“œλ₯Ό ν¬ν•¨ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€.

# Pytorch

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Padding text helps XLNet with short prompts - proposed by Aman Rusia in <https://github.com/rusiaaman/XLNet-gen#methodology>
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing.   """

prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]

print(generated)

"""
Today the weather is really nice and I am planning ...
"""
# Tensorflow

from transformers import TFAutoModelForCausalLM, AutoTokenizer

model = TFAutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Padding text helps XLNet with short prompts - proposed by Aman Rusia in <https://github.com/rusiaaman/XLNet-gen#methodology>
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing.   """

prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="tf")["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]

print(generated)

"""
Today the weather is really nice and I am planning ...
"""

ν…μŠ€νŠΈ 생성은 ν˜„μž¬ PyTorch의 GPT-2, OpenAi-GPT, CTRL, XLNet, Transpo-XL 및 Reformer와 Tensorflow의 λŒ€λΆ€λΆ„μ˜ λͺ¨λΈμ—μ„œλ„ κ°€λŠ₯ν•©λ‹ˆλ‹€. μœ„μ˜ μ˜ˆμ—μ„œ λ³Ό 수 μžˆλ“―μ΄, XLNet 및 Transpo-XL이 μ œλŒ€λ‘œ μž‘λ™ν•˜λ €λ©΄ νŒ¨λ”©μ΄ ν•„μš”ν•œ κ²½μš°κ°€ λ§ŽμŠ΅λ‹ˆλ‹€. GPT-2λŠ” 인과 μ–Έμ–΄ λͺ¨λΈλ§ λͺ©μ μœΌλ‘œ 수백만 개의 μ›Ή νŽ˜μ΄μ§€λ₯Ό 톡해 ν•™μŠ΅λ˜μ—ˆκΈ° λ•Œλ¬Έμ— 일반적으둜 κ°œλ°©ν˜• ν…μŠ€νŠΈ 생성에 μ ν•©ν•©λ‹ˆλ‹€.

ν…μŠ€νŠΈ 생성을 μœ„ν•΄ λ‹€μ–‘ν•œ λ””μ½”λ”© μ „λž΅μ„ μ μš©ν•˜λŠ” 방법에 λŒ€ν•œ μžμ„Έν•œ λ‚΄μš©μ€ ν…μŠ€νŠΈ 생성 λΈ”λ‘œκ·Έ κ²Œμ‹œλ¬Όμ„ μ°Έμ‘°ν•˜μ‹­μ‹œμ˜€.

Named Entity Recognition

Named entity recognition (NER) is the task of classifying tokens according to a class, for example identifying a token as a person, an organisation or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset. If you would like to fine-tune a model on an NER task, you may leverage the run_ner.py script.

λ‹€μŒμ€ νŒŒμ΄ν”„λΌμΈμ„ μ‚¬μš©ν•˜μ—¬ 개체λͺ… μΈμ‹μœΌλ‘œ 토큰을 9개 클래슀 쀑 ν•˜λ‚˜μ— μ†ν•˜λ„λ‘ μ˜ˆμΈ‘ν•˜λŠ” μ˜ˆμ‹œμž…λ‹ˆλ‹€(BIO ν‘œν˜„).

O, Outside of a named entity
B-MIS, Beginning of a miscellaneous entity
I-MIS, Miscellaneous entity
B-PER, Beginning of a person's name
I-PER, Person's name
B-ORG, Beginning of an organisation
I-ORG, Organisation
B-LOC, Beginning of a location
I-LOC, Location

It leverages a model fine-tuned on CoNLL-2003, fine-tuned by @stefan-it from dbmdz.

from transformers import pipeline

ner_pipe = pipeline("ner")

sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""

μ΄λ ‡κ²Œ ν•˜λ©΄ μœ„μ—μ„œ μ •μ˜ν•œ 9개 클래슀의 μ—”ν‹°ν‹° 쀑 ν•˜λ‚˜λ‘œ μ‹λ³„λœ λͺ¨λ“  단어 λͺ©λ‘μ΄ 좜λ ₯λ©λ‹ˆλ‹€. μ˜ˆμƒλ˜λŠ” κ²°κ³ΌλŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.

for entity in ner_pipe(sequence):
    print(entity)
"""
{'entity': 'I-ORG', 'score': 0.9996, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9910, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9995, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.9863, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9337, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9762, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9915, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
"""

μ–΄λ–»κ²Œ "Huggingface" μ‹œν€€μŠ€μ˜ 토큰이 κΈ°κ΄€λͺ…μœΌλ‘œ μ‹λ³„λ˜κ³  "New York City", "DUMBO" 및 "Manhattan Bridge"κ°€ μž₯μ†Œλͺ…μœΌλ‘œ μ‹λ³„λ˜λŠ”μ§€μ— μ£Όμ˜ν•΄μ„œ λ³΄μ‹­μ‹œμ˜€.

λ‹€μŒμ€ λͺ¨λΈ 및 ν† ν¬λ‚˜μ΄μ €λ₯Ό μ‚¬μš©ν•˜μ—¬ 개체λͺ… 인식을 μˆ˜ν–‰ν•˜λŠ” μ˜ˆμ‹œμž…λ‹ˆλ‹€. ν”„λ‘œμ„ΈμŠ€λŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.

  1. μ²΄ν¬ν¬μΈνŠΈμ—μ„œ ν† ν¬λ‚˜μ΄μ € 및 λͺ¨λΈμ„ μΈμŠ€ν„΄μŠ€ν™”ν•©λ‹ˆλ‹€. BERT λͺ¨λΈμ„ μ‚¬μš©ν•˜κ³ , μ²΄ν¬ν¬μΈνŠΈμ— μ €μž₯된 κ°€μ€‘μΉ˜λ₯Ό λ‘œλ“œν•©λ‹ˆλ‹€.
  2. 각 μ‹œν€€μŠ€μ˜ μ—”ν‹°ν‹°λ₯Ό μ •μ˜ν•©λ‹ˆλ‹€. 예λ₯Ό λ“€μ–΄ "Hugging Face"λ₯Ό κΈ°κ΄€λͺ…μœΌλ‘œ, "New York City"λ₯Ό μž₯μ†Œλͺ…μœΌλ‘œ μ •μ˜ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
  3. 단어λ₯Ό ν† ν°μœΌλ‘œ λΆ„ν• ν•˜μ—¬ μ˜ˆμΈ‘μ— 맀핑할 수 μžˆλ„λ‘ ν•©λ‹ˆλ‹€. μš°λ¦¬λŠ” λ¨Όμ € μ‹œν€€μŠ€λ₯Ό μ™„μ „νžˆ μΈμ½”λ”©ν•˜κ³  λ””μ½”λ”©ν•˜μ—¬ νŠΉλ³„ν•œ 토큰이 ν¬ν•¨λœ λ¬Έμžμ—΄μ„ 남겨두도둝 ν•©λ‹ˆλ‹€.
  4. ν•΄λ‹Ή μ‹œν€€μŠ€λ₯Ό ID둜 μΈμ½”λ”©ν•©λ‹ˆλ‹€(특수 토큰이 μžλ™μœΌλ‘œ 좔가됨).
  5. μž…λ ₯ 토큰을 λͺ¨λΈμ— μ „λ‹¬ν•˜κ³ , 첫 번째 좜λ ₯을 κ°€μ Έμ™€μ„œ μ˜ˆμΈ‘μ„ μˆ˜ν–‰ν•©λ‹ˆλ‹€. 이 κ²°κ³Όλ₯Ό 각 토큰에 λŒ€ν•΄ 맀칭 κ°€λŠ₯ν•œ 9개 ν΄λž˜μŠ€μ™€ λŒ€μ‘°ν•©λ‹ˆλ‹€. 각 토큰에 λŒ€ν•΄ κ°€μž₯ κ°€λŠ₯성이 높은 클래슀λ₯Ό κ²€μƒ‰ν•˜κΈ° μœ„ν•΄ argmax ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.
  6. 각각의 토큰을 예츑 결과와 λ¬Άμ–΄ ν”„λ¦°νŠΈν•©λ‹ˆλ‹€.
# Pytorch
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \\
           "therefore very close to the Manhattan Bridge."

inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()

outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)
# Tensorflow
from transformers import TFAutoModelForTokenClassification, AutoTokenizer
import tensorflow as tf

model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \\
           "therefore very close to the Manhattan Bridge."

inputs = tokenizer(sequence, return_tensors="tf")
tokens = inputs.tokens()

outputs = model(**inputs)[0]
predictions = tf.argmax(outputs, axis=2)

This outputs a list of each token mapped to its corresponding prediction. Differently from the pipeline, every token has a prediction here, since we did not remove the '0'th class, which means that no particular entity was found on that token.

μœ„μ˜ μ˜ˆμ‹œμ—μ„œ 예츑 κ²°κ³ΌλŠ” μ •μˆ˜λ‘œ ν‘œν˜„λ©λ‹ˆλ‹€. μ•„λž˜ κ·Έλ¦Όκ³Ό 같이 μ •μˆ˜ ν˜•νƒœμ˜ 클래슀 번호λ₯Ό 클래슀 μ΄λ¦„μœΌλ‘œ λ°”κΎΈκΈ° μœ„ν•΄ model.config.id2label 속성을 μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))

"""
('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')
"""

 

Summarization

μš”μ•½μ€ λ¬Έμ„œλ‚˜ 기사λ₯Ό 더 짧은 ν…μŠ€νŠΈλ‘œ μ€„μ΄λŠ” μž‘μ—…μž…λ‹ˆλ‹€. μš”μ•½ μž‘μ—…μ—μ„œ λͺ¨λΈμ„ νŒŒμΈνŠœλ‹ν•˜λ €λ©΄ run_summarization.pyλ₯Ό ν™œμš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was created for the task of summarization. If you would like to fine-tune a model on a summarization task, various approaches are described in this document.

λ‹€μŒμ€ νŒŒμ΄ν”„λΌμΈμ„ μ‚¬μš©ν•˜μ—¬ μš”μ•½μ„ μˆ˜ν–‰ν•˜λŠ” μ˜ˆμž…λ‹ˆλ‹€. CNN/Daily Mail λ°μ΄ν„°μ…‹μœΌλ‘œ νŒŒμΈνŠœλ‹λœ Bart λͺ¨λΈμ„ ν™œμš©ν•©λ‹ˆλ‹€.

from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

μš”μ•½ νŒŒμ΄ν”„λΌμΈμ€ PreTrainedModel.generate() λ©”μ„œλ“œμ— μ˜μ‘΄ν•˜λ―€λ‘œ μ•„λž˜μ™€ 같이 νŒŒμ΄ν”„λΌμΈμ—μ„œ max_length 및 min_length에 λŒ€ν•œ *PreTrainedModel.generate()*의 κΈ°λ³Έ 인수λ₯Ό 직접 μž¬μ •μ˜ν•  수 μžˆμŠ΅λ‹ˆλ‹€. μ΄λ ‡κ²Œ ν•˜λ©΄ λ‹€μŒκ³Ό 같은 μš”μ•½ κ²°κ³Όκ°€ 좜λ ₯λ©λ‹ˆλ‹€.

print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
"""
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]
"""

λ‹€μŒμ€ λͺ¨λΈκ³Ό ν† ν¬λ‚˜μ΄μ €λ₯Ό μ‚¬μš©ν•˜μ—¬ μš”μ•½μ„ μˆ˜ν–‰ν•˜λŠ” μ˜ˆμ‹œμž…λ‹ˆλ‹€. ν”„λ‘œμ„ΈμŠ€λŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.

  1. μ²΄ν¬ν¬μΈνŠΈμ—μ„œ ν† ν¬λ‚˜μ΄μ € 및 λͺ¨λΈμ„ μΈμŠ€ν„΄μŠ€ν™”ν•©λ‹ˆλ‹€. 일반적으둜 Bart λ˜λŠ” T5와 같은 인코더-디코더 λͺ¨λΈμ„ μ‚¬μš©ν•˜μ—¬ μˆ˜ν–‰ν•©λ‹ˆλ‹€.
  2. μš”μ•½ν•΄μ•Ό ν•  λ¬Έμ„œλ₯Ό μ •μ˜ν•©λ‹ˆλ‹€.
  3. T5의 νŠΉμˆ˜ν•œ 접두사인 "summarize: "λ₯Ό μΆ”κ°€ν•©λ‹ˆλ‹€.
  4. μš”μ•½λ¬Έ 생성을 μœ„ν•΄ PreTrainedModel.generate() λ©”μ„œλ“œλ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.

이 μ˜ˆμ‹œμ—μ„œλŠ” Google의 T5 λͺ¨λΈμ„ μ‚¬μš©ν•©λ‹ˆλ‹€. 닀쀑 μž‘μ—… ν˜Όν•© 데이터셋(CNN/Daily Mail 포함)μ—μ„œλ§Œ ν”„λ¦¬νŠΈλ ˆμΈμ„ ν–ˆμŒμ—λ„ λΆˆκ΅¬ν•˜κ³  맀우 쒋은 κ²°κ³Όλ₯Ό 얻을 수 μžˆμŠ΅λ‹ˆλ‹€.

# Pytorch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)

print(tokenizer.decode(outputs[0]))
"""
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>
"""

 

Translation

λ²ˆμ—­μ€ ν•œ μ–Έμ–΄μ—μ„œ λ‹€λ₯Έ μ–Έμ–΄λ‘œ ν…μŠ€νŠΈλ₯Ό λ°”κΎΈλŠ” μž‘μ—…μž…λ‹ˆλ‹€. λ²ˆμ—­ μž‘μ—…μ—μ„œ λͺ¨λΈμ„ νŒŒμΈνŠœλ‹ ν•˜λ €λ©΄ run_translation.py 슀크립트λ₯Ό ν™œμš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

λ²ˆμ—­ λ°μ΄ν„°μ…‹μ˜ μ˜ˆλ‘œλŠ” WMT English to German 데이터셋이 μžˆλŠ”λ°, 이 λ°μ΄ν„°μ…‹μ—λŠ” μ˜μ–΄λ‘œ 된 λ¬Έμž₯이 μž…λ ₯ λ°μ΄ν„°λ‘œ, λ…μΌμ–΄λ‘œ 된 λ¬Έμž₯이 νƒ€κ²Ÿ λ°μ΄ν„°λ‘œ ν¬ν•¨λ˜μ–΄ μžˆμŠ΅λ‹ˆλ‹€. λ²ˆμ—­ μž‘μ—…μ—μ„œ λͺ¨λΈμ„ νŒŒμΈνŠœλ‹ν•˜λ €λŠ” κ²½μš°μ— λŒ€ν•΄ 이 λ¬Έμ„œμ—μ„œλŠ” λ‹€μ–‘ν•œ μ ‘κ·Ό 방식을 μ„€λͺ…ν•©λ‹ˆλ‹€.

λ‹€μŒμ€ νŒŒμ΄ν”„λΌμΈμ„ μ‚¬μš©ν•˜μ—¬ λ²ˆμ—­μ„ μˆ˜ν–‰ν•˜λŠ” μ˜ˆμž…λ‹ˆλ‹€. 닀쀑 μž‘μ—… ν˜Όν•© 데이터 μ„ΈνŠΈ(WMT 포함)μ—μ„œ ν”„λ¦¬νŠΈλ ˆμΈλœ T5 λͺ¨λΈμ„ ν™œμš©ν•˜μ—¬ λ²ˆμ—­ κ²°κ³Όλ₯Ό μ œκ³΅ν•©λ‹ˆλ‹€.

from transformers import pipeline

translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
"""
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
"""

λ³€μ—­ νŒŒμ΄ν”„λΌμΈμ€ PreTrainedModel.generate() λ©”μ„œλ“œμ— μ˜μ‘΄ν•˜λ―€λ‘œ μœ„μ™€ 같이 νŒŒμ΄ν”„λΌμΈμ—μ„œ max_length에 λŒ€ν•œ *PreTrainedModel.generate()*의 κΈ°λ³Έ 인수λ₯Ό 직접 μž¬μ •μ˜ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

λ‹€μŒμ€ λͺ¨λΈκ³Ό ν† ν¬λ‚˜μ΄μ €λ₯Ό μ‚¬μš©ν•˜μ—¬ λ²ˆμ—­μ„ μˆ˜ν–‰ν•˜λŠ” μ˜ˆμ‹œμž…λ‹ˆλ‹€. ν”„λ‘œμ„ΈμŠ€λŠ” λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.

  1. μ²΄ν¬ν¬μΈνŠΈμ—μ„œ ν† ν¬λ‚˜μ΄μ € 및 λͺ¨λΈμ„ μΈμŠ€ν„΄μŠ€ν™”ν•©λ‹ˆλ‹€. 일반적으둜 Bart λ˜λŠ” T5와 같은 인코더-디코더 λͺ¨λΈμ„ μ‚¬μš©ν•˜μ—¬ μˆ˜ν–‰ν•©λ‹ˆλ‹€.
  2. λ²ˆμ—­ν•΄μ•Ό ν•  λ¬Έμ„œλ₯Ό μ •μ˜ν•©λ‹ˆλ‹€.
  3. T5의 νŠΉμˆ˜ν•œ 접두사인 "translate English to German:“을 μΆ”κ°€ν•©λ‹ˆλ‹€.
  4. λ²ˆμ—­λ¬Έ 생성을 μœ„ν•΄ PreTrainedModel.generate() λ©”μ„œλ“œλ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.
# Pytorch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="pt"
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))
"""
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>
"""
# Tensorflow
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="tf"
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))
"""
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.
"""

μœ„μ˜ μ˜ˆμ‹œμ™€ 같이 λ²ˆμ—­λ¬Έμ΄ 좜λ ₯λ©λ‹ˆλ‹€.
