Having GPT-2 automatically generate "Words You Must Not Search" (検索してはいけない言葉)

I heard that fine-tuning GPT-2 is easy, so I gave it a try.

1. Execution environment

Google Colab
- Runtime with a GPU (T4)

2.1 Collecting the training data

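First, I scraped every word registered on the 「検索してはいけない言葉アットウィキ」 (the "Words You Must Not Search" @wiki). I read the @wiki terms of use beforehand and confirmed that scraping is not prohibited.
The program below does the scraping and collects all the words, one record per line, in the form <s>危険度1[SEP]言葉</s>\n, where 危険度 is the danger level and 言葉 is the word.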

import time

import requests
from bs4 import BeautifulSoup


# Fetch the HTML source of each wiki page
urls: list[str] = [
    "https://w.atwiki.jp/mustnotsearch/pages/1328.html",
    "https://w.atwiki.jp/mustnotsearch/pages/1329.html",
    "https://w.atwiki.jp/mustnotsearch/pages/1330.html",
    "https://w.atwiki.jp/mustnotsearch/pages/950.html",
    "https://w.atwiki.jp/mustnotsearch/pages/951.html",
    "https://w.atwiki.jp/mustnotsearch/pages/952.html",
    "https://w.atwiki.jp/mustnotsearch/pages/953.html",
    "https://w.atwiki.jp/mustnotsearch/pages/954.html",
    "https://w.atwiki.jp/mustnotsearch/pages/955.html",
    "https://w.atwiki.jp/mustnotsearch/pages/956.html",
    "https://w.atwiki.jp/mustnotsearch/pages/3073.html",
    "https://w.atwiki.jp/mustnotsearch/pages/3074.html",
    "https://w.atwiki.jp/mustnotsearch/pages/3080.html",
    "https://w.atwiki.jp/mustnotsearch/pages/3076.html",
]

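contents: list[str] = []
for url in urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early instead of parsing an error page
    contents.append(response.text)
    time.sleep(2)  # pause between requests to keep the load on the wiki low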


# Helper functions
def flatten(list_of_list: list[list]) -> list:
    # Concatenate the inner lists into one flat list
    return [item for inner in list_of_list for item in inner]

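def _select_something_from_main_table(content: str, select_query: str) -> list[str]:
    soup = BeautifulSoup(content, "html.parser")
    # [1:] skips the first matched cell (the table's header row);
    # full-width spaces (\u3000) are replaced with ordinary spaces
    return [
        line.get_text().replace("\u3000", " ") for line in soup.select(select_query)[1:]
    ]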

def select_words(content: str) -> list[str]:
    return _select_something_from_main_table(
        content, "table:nth-child(10) > tr > td:nth-child(1)"
    )

def select_levels(content: str) -> list[str]:
    return _select_something_from_main_table(
        content, "table:nth-child(10) > tr > td:nth-child(2)"
    )


# Use the functions above to collect every word and its danger level
words: list[str] = flatten([select_words(content) for content in contents])
levels: list[str] = flatten([select_levels(content) for content in contents])


# Format each word and danger level as one record, then save everything as a txt file
words_and_levels: list[str] = [
    f"<s>危険度{level}[SEP]{word}</s>\n" for word, level in zip(words, levels)
]
text_for_train: str = "".join(words_and_levels)
with open("./mns_traindata.txt", mode="w", encoding="utf-8") as f:
    f.write(text_for_train)
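
Because the file is line-based, a table cell whose text contains a line break, or an empty cell, would produce a record that no longer matches the format above. The following is a minimal sketch of a sanity check for that, not part of the original pipeline; `RECORD_PATTERN` and `parse_records` are hypothetical names introduced here for illustration.

```python
import re

# One record per line: <s>危険度{level}[SEP]{word}</s>
RECORD_PATTERN = re.compile(r"^<s>危険度(?P<level>.*?)\[SEP\](?P<word>.*)</s>$")


def parse_records(text: str) -> tuple[list[tuple[str, str]], list[str]]:
    """Return (level, word) pairs plus the lines that do not match the format."""
    records: list[tuple[str, str]] = []
    malformed: list[str] = []
    for line in text.split("\n"):
        if not line:
            continue  # ignore the empty string after the final newline
        match = RECORD_PATTERN.match(line)
        # A record also counts as malformed when its level or word is blank
        if match and match["level"].strip() and match["word"].strip():
            records.append((match["level"], match["word"]))
        else:
            malformed.append(line)
    return records, malformed


# Hypothetical sample: one valid record and one with an empty level and word
sample = "<s>危険度1[SEP]言葉</s>\n<s>危険度[SEP]</s>\n"
records, malformed = parse_records(sample)
print(records)
print(malformed)
```

Running `parse_records` on the contents of mns_traindata.txt and looking at `malformed` would show whether any scraped cell needs cleaning before fine-tuning.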

2.2 Fine-tuning and text generation

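I copied the code wholesale from the article linked below and adapted it to this task where needed. Before running it, the training-data txt file prepared above has to be copied to Google Drive, and Google Drive has to be mounted.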

qiita.com
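
The Drive step mentioned above is not shown in this post, so here is a rough sketch of one way to do it. It assumes the txt file already sits on the Colab machine and that Drive is mounted at /content/drive, which matches the relative drive/MyDrive/... paths used below. `mount_drive_if_on_colab` and `copy_training_file` are hypothetical helper names; `google.colab.drive.mount` is the standard Colab call and is only importable inside Colab.

```python
import shutil
from pathlib import Path


def mount_drive_if_on_colab(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive when running on Colab; return False anywhere else."""
    try:
        from google.colab import drive  # only available inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def copy_training_file(src: str, dest_dir: str) -> Path:
    """Copy the training file into dest_dir, creating the directory if needed."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, dest))


# Assumed usage on Colab (the destination path is taken from the command below):
# if mount_drive_if_on_colab():
#     copy_training_file("mns_traindata.txt",
#                        "/content/drive/MyDrive/Colab Notebooks/mns")
```

Uploading the file through the Drive web UI works just as well; the mount is needed either way so that run_clm.py can read the file.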

# Install the libraries and fetch the example scripts
!pip install git+https://github.com/huggingface/transformers
!pip install sentencepiece
!pip install datasets
!pip install evaluate
!pip install --upgrade accelerate
!git clone https://github.com/huggingface/transformers


# Run the fine-tuning
## train_file, validation_file and output_dir need to be changed to match your environment
!python ./transformers/examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path=rinna/japanese-gpt2-medium \
    --train_file="drive/MyDrive/Colab Notebooks/mns/mns_traindata.txt" \
    --validation_file="drive/MyDrive/Colab Notebooks/mns/mns_traindata.txt" \
    --do_train \
    --do_eval \
    --num_train_epochs=80 \
    --save_steps=10000 \
    --save_total_limit=3 \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --output_dir="drive/MyDrive/Colab Notebooks/mns/output/" \
    --use_fast_tokenizer=False \
    --overwrite_output_dir   


# Load the fine-tuned model and get ready to generate text
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Prepare the model
model = AutoModelForCausalLM.from_pretrained("drive/MyDrive/Colab Notebooks/mns/output/").to(device)
model.eval()

## Prepare the tokenizer
tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-medium")
tokenizer.do_lower_case = True
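
One note on the run_clm.py command above: it passes the same file as both --train_file and --validation_file, so the evaluation loss is measured on text the model has already seen during training. That is fine for a toy project like this one, but if a held-out score is wanted, the lines could be split before uploading. A minimal sketch, where `split_lines`, the 10% ratio and the seed are all assumptions of mine rather than anything used in this post:

```python
import random


def split_lines(
    lines: list[str], valid_ratio: float = 0.1, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Shuffle a copy of the lines and return (train, validation) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_valid = max(1, int(len(shuffled) * valid_ratio))
    return shuffled[n_valid:], shuffled[:n_valid]


# Hypothetical records in the same format as mns_traindata.txt
lines = [f"<s>危険度{i % 3 + 1}[SEP]word{i}</s>\n" for i in range(20)]
train, valid = split_lines(lines)
print(len(train), len(valid))
```

The two lists would then be written to separate txt files and passed to --train_file and --validation_file respectively.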

Text generation was done with the following code.

# Text generation
level = 4 # danger level
first_word = "知らない" # start word
num_return_sequences = 10 # number of texts to generate

input_text = f"<s>危険度{level}[SEP]{first_word} "
input_ids = tokenizer.encode(input_text, return_tensors='pt', add_special_tokens=False).to(device)
out = model.generate(input_ids, do_sample=True, top_p=0.9, top_k=30,
                     num_return_sequences=num_return_sequences, max_length=32,  # maximum length in tokens, prompt included (not characters)
                     bad_words_ids=[[tokenizer.bos_token_id], [tokenizer.sep_token_id], [tokenizer.eos_token_id]], # token ids that must not be generated
                     repetition_penalty=1.2) # penalty on repeated tokens
out_decoded = tokenizer.batch_decode(out, skip_special_tokens=True)

for i in range(num_return_sequences):
    print(out_decoded[i])

3. Generated examples

Danger level 4, start word 「知らない」 ("don't know").

危険度4 知らない じゃなくて 知ってるんだからな
危険度4 知らない ayaso
危険度4 知らない おじさん
危険度4 知らない うしろのしょうめん
危険度4 知らない 見たこともない
危険度4 知らない 見た
危険度4 知らない パンダ
危険度4 知らない モンスーン
危険度4 知らない 見るだけ
危険度4 知らない 見たくなってしまった

Danger level 6, start word 「メキシコ」 ("Mexico").

危険度6 メキシコ ビンセント
危険度6 メキシコ vs イタリア
危険度6 メキシコ vs コロンビア
危険度6 メキシコ ボートショー
危険度6 メキシコ バハ・カリフォルニア州
危険度6 メキシコ 様の更新履歴へ
危険度6 メキシコ 様ですね。お気の毒に...
危険度6 メキシコ 人身事故
危険度6 メキシコ ボート 沈没
危険度6 メキシコ バハ・カリフォルニア

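Danger level 2, start word 「アスファルト」 ("asphalt").
Personally, I like that it came up with body-related words such as 「へその緒」 ("umbilical cord") and negative words such as 「踏まれた」 ("stepped on").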

危険度2 アスファルト ブーツ
危険度2 アスファルト あくどい
危険度2 アスファルト で出来た神社 小銭
危険度2 アスファルト へその緒
危険度2 アスファルト べちゃついた
危険度2 アスファルト ソリティア
危険度2 アスファルト ぶるーり
危険度2 アスファルト 踏まれた
危険度2 アスファルト にかわった世界
危険度2 アスファルト 舗装

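Thinking it might also work with only the danger level specified, I tried an empty start word (「」), but for some reason the only phrase generated was 「ベジタリアン プーケット」 ("vegetarian Phuket"). Changing the danger level gave the same result.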

危険度2 ベジタリアン プーケット
危険度2 ベジタリアン プーケット
危険度2 ベジタリアン プーケット
危険度2 ベジタリアン プーケット
危険度2 ベジタリアン プーケット
危険度2 ベジタリアン プーケット
危険度2 ベジタリアン プーケット
危険度2 ベジタリアン プーケット
危険度2 ベジタリアン プーケット
危険度2 ベジタリアン プーケット
