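A translator is one practical application of RNNs (Recurrent Neural Networks). In this post I use Keras's recurrent layers to implement an English-to-Japanese translator based on a Seq2Seq (Encoder-Decoder) model.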

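Dataset

The Keras example performs character-level English-to-French translation, but since this post translates English to Japanese, I use the Tanaka Corpus¹ as the dataset and translate at the word level.

1. Download the Tanaka Corpus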

$ wget ftp://ftp.monash.edu/pub/nihongo/examples.utf.gz
$ gunzip examples.utf.gz

2. Formatting

The Tanaka Corpus data looks like this:

$ head examples.utf
A: ムーリエルは２０歳になりました。 Muiriel is 20 now.#ID=1282_4707
B: は 二十歳(はたち){２０歳} になる[01]{になりました}
A: すぐに戻ります。 I will be back soon.#ID=1284_4709
B: 直ぐに{すぐに} 戻る{戻ります}
A: すぐに諦めて昼寝をするかも知れない。   I may give up soon and just nap instead.#ID=1300_4727
B: 直ぐに{すぐに} 諦める{諦めて} 昼寝 を 為る(する){する} かも知れない
A: 愛してる。  I love you.#ID=1434_4851
B: 愛する{愛してる}
A: ログアウトするんじゃなかったよ。 I shouldn't have logged off.#ID=1442_4858
B: ログアウト~ 為る(する){する} の{ん} だ{じゃなかった} よ[01]

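From this file, I 1) extract the A lines, 2) strip everything from the # onward, 3) split each line on the tab, and 4) save the Japanese side (target data) and the English side (input data) as separate files.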

norm_corpus.py · GitHub

(For Japanese, a single character is treated as the smallest unit of data, so I inserted a half-width space between characters.)
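The four formatting steps above can be sketched as follows. This is a minimal illustration and not the contents of norm_corpus.py: the function names normalize_line and normalize_corpus are hypothetical, and the sketch assumes each A line has the form "A: <Japanese><TAB><English>#ID=...".

```python
def normalize_line(line):
    """Return (japanese, english) for an 'A:' line, or None for any other line."""
    # 1) Keep only the A lines
    if not line.startswith("A: "):
        return None
    body = line[len("A: "):].rstrip("\n")
    # 2) Drop the trailing "#ID=..." annotation (everything after the last '#')
    body = body.rsplit("#", 1)[0]
    # 3) Japanese and English are separated by a tab
    parts = body.split("\t")
    if len(parts) != 2:
        return None
    japanese, english = parts[0].strip(), parts[1].strip()
    # Treat every Japanese character as one token by inserting spaces
    japanese = " ".join(ch for ch in japanese if not ch.isspace())
    return japanese, english


def normalize_corpus(src_path, ja_path, en_path):
    """4) Write the Japanese and English sides to separate files, one sentence per line."""
    with open(src_path, encoding="utf-8") as src, \
         open(ja_path, "w", encoding="utf-8") as ja_out, \
         open(en_path, "w", encoding="utf-8") as en_out:
        for line in src:
            pair = normalize_line(line)
            if pair is None:
                continue
            ja_out.write(pair[0] + "\n")
            en_out.write(pair[1] + "\n")
```

Calling normalize_corpus('examples.utf', 'tanaka_corpus_j.txt', 'tanaka_corpus_e.txt') would produce the two files that are loaded in the next section.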

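Vectorizing the sentences

Raw sentences cannot be fed to the model as they are. I use the keras.preprocessing.text.Tokenizer class to convert each sentence into a vector. (Tags marking the start and end of the sequence are added to the beginning and end of each sentence.)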

from keras.preprocessing.text import Tokenizer

def load_dataset(file_path):
    # filters="" keeps punctuation and the <s> / </s> tags intact
    tokenizer = Tokenizer(filters="")
    texts = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Mark the start and end of each sequence
            texts.append("<s> " + line.strip() + " </s>")

    tokenizer.fit_on_texts(texts)
    return tokenizer.texts_to_sequences(texts), tokenizer

# English is the model input, Japanese is the target
train_X, tokenizer_e = load_dataset('tanaka_corpus_e.txt')
train_Y, tokenizer_j = load_dataset('tanaka_corpus_j.txt')

This turns each sentence into a sequence of integer IDs:

<s> Muiriel is 20 now. </s> → [1, 16504, 7, 1851, 170, 2]
<s> ムーリエルは２０歳になりました。 </s> → [2, 142, 38, 93, 328, 71, 4, 134, 106, 505, 8, 9, 33, 21, 11, 7, 1, 3]
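The model code further down refers to seqX_len, seqY_len, word_num_e and word_num_j, and slices train_Y as a 2-D array, so the ID sequences presumably need to be zero-padded to a common length first. The following is a minimal pure-Python sketch of that step under that assumption. pad_post and vocab_size are hypothetical helper names, and the full script may do this differently; Keras's pad_sequences with padding='post', which the post uses at prediction time, pads in the same way and returns a NumPy array.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Right-pad every sequence with `value` up to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # this sketch simply cuts off anything past maxlen
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


def vocab_size(word_index):
    """Tokenizer indices start at 1 and 0 is left for padding, so add 1."""
    return len(word_index) + 1


# Toy ID sequences (hypothetical values)
toy_X = [[1, 5, 2], [1, 2]]
padded_X = pad_post(toy_X)
seq_len = len(padded_X[0])  # plays the role of seqX_len
```

With real data, vocab_size(tokenizer_e.word_index) and vocab_size(tokenizer_j.word_index) would play the roles of word_num_e and word_num_j.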

Model

At training time

The encoder takes train_X as input and the decoder takes train_Y as input. The target data is the same train_Y shifted one time step ahead.
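As a small worked example of this one-step shift, the sketch below builds the target for a single padded decoder input. The helper name shift_targets and the token IDs are made up for illustration.

```python
def shift_targets(decoder_input, pad_id=0):
    """Target at step t is the decoder input at step t+1; the last step is padding."""
    return decoder_input[1:] + [pad_id]


# Hypothetical IDs: 2 = <s>, 3 = </s>, 0 = padding
decoder_input = [2, 142, 38, 3, 0]
decoder_target = shift_targets(decoder_input)

for given, expected in zip(decoder_input, decoder_target):
    print(f"decoder sees {given:>3} -> should predict {expected:>3}")
```

So when the decoder is given <s> it should predict the first real token, and when it is given the last real token it should predict </s>.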

from keras.models import Model
from keras.layers import Input, Embedding, Dense, LSTM

# seqX_len / seqY_len (padded sequence lengths) and word_num_e / word_num_j
# (vocabulary sizes) are assumed to be defined beforehand
emb_dim = 256
hid_dim = 256

## Encoder

encoder_inputs = Input(shape=(seqX_len,))
# Embedding layer: https://keras.io/ja/layers/embeddings/
encoder_embedded = Embedding(word_num_e, emb_dim, mask_zero=True)(encoder_inputs)

encoder = LSTM(hid_dim, return_state=True) # return_state=True so that the internal states are returned
_, state_h, state_c = encoder(encoder_embedded) # discard the outputs
encoder_states = [state_h, state_c] # final hidden and cell states, handed to the decoder

## Decoder

decoder_inputs = Input(shape=(seqY_len,))
decoder_embedding = Embedding(word_num_j, emb_dim)
decoder_embedded = decoder_embedding(decoder_inputs)

decoder = LSTM(hid_dim, return_sequences=True, return_state=True) # return_sequences=True to get an output for every input step
# Passing encoder_states as initial_state connects the encoder to the decoder
decoder_outputs, _, _ = decoder(decoder_embedded, initial_state=encoder_states)
decoder_dense = Dense(word_num_j, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

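Training

import numpy as np

## Prepare the target data
# (train_X and train_Y are assumed to be zero-padded 2-D NumPy arrays at this point)
# 1. Drop the first token of each sequence
next_inputs = train_Y[:, 1:]
# 2. Pad the end with 0
decoder_target_data = np.hstack((next_inputs, np.zeros((len(train_Y),1), dtype=np.int32)))
# 3. The output is 3-D (number of sequences, number of time steps, word vector), so add one dimension
decoder_target_data = np.expand_dims(decoder_target_data, -1)

model.fit([train_X, train_Y], decoder_target_data, batch_size=128, epochs=15, verbose=2, validation_split=0.2)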

At prediction time

At prediction time, the decoder's output at step t-1 becomes its input at step t, so the RNN loop has to be run by hand.
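Stripped of the Keras details, the hand-written loop can be sketched as follows. This is only an illustration: greedy_decode and toy_step are hypothetical names, and the step argument stands in for one call to the decoder, taking the previous token and state and returning per-token scores plus the new state.

```python
def greedy_decode(step, initial_state, bos_id, eos_id, max_len=50):
    """Feed each predicted token back in until eos_id appears or max_len is reached.

    step(token_id, state) must return (scores, new_state), where scores is a
    list indexed by token ID.
    """
    output = [bos_id]
    token, state = bos_id, initial_state
    while len(output) < max_len:
        scores, state = step(token, state)
        # Greedy choice: take the highest-scoring token (argmax)
        token = max(range(len(scores)), key=scores.__getitem__)
        output.append(token)
        if token == eos_id:
            break
    return output


def toy_step(token_id, state):
    """Toy decoder: the state is a position counter into a fixed script."""
    script = [3, 4, 2]  # tokens to emit; 2 plays the role of </s>
    scores = [0.0] * 5
    scores[script[state]] = 1.0
    return scores, state + 1


decoded = greedy_decode(toy_step, initial_state=0, bos_id=1, eos_id=2)
```

The max_len cap matters in practice: a decoder that never emits the end tag would otherwise loop forever.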

Model

# Encoder: returns the final states
encoder_model = Model(encoder_inputs, encoder_states)

# shape must be a tuple, hence (hid_dim,)
decoder_state_input_h = Input(shape=(hid_dim,))
decoder_state_input_c = Input(shape=(hid_dim,))
decoder_state_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_inputs = Input(shape=(1,)) # one token per step
decoder_embedded = decoder_embedding(decoder_inputs)
decoder_outputs, state_h, state_c = decoder(
    decoder_embedded, initial_state=decoder_state_inputs
)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
# Decoder
decoder_model = Model(
    [decoder_inputs] + decoder_state_inputs, # receives the output and states from t-1
    [decoder_outputs] + decoder_states
)

Output

import numpy as np

def decode_sequence(input_seq):
    # Get the final states from the encoder
    states_value = encoder_model.predict(input_seq)
    bos_eos = tokenizer_j.texts_to_sequences(["<s>", "</s>"])
    # Feed the start tag <s> as the first input, shaped (batch=1, time=1)
    target_seq = np.array([bos_eos[0]])
    output_seq = list(bos_eos[0]) # copy so that bos_eos itself is not modified

    # Run the loop
    while True:
        # Predict from the previous output and the state vectors
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value
        )
        sampled_token_index = [int(np.argmax(output_tokens[0, -1, :]))]
        output_seq += sampled_token_index

        # Stop at </s> or once the output gets too long
        if (sampled_token_index == bos_eos[1] or len(output_seq) > 1000):
            break

        # Update the input and the states
        target_seq = np.array([sampled_token_index])
        states_value = [h, c]

    return output_seq

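from keras.preprocessing.sequence import pad_sequences

# Invert the word -> index mappings to turn IDs back into words
detokenizer_e = dict(map(reversed, tokenizer_e.word_index.items()))
detokenizer_j = dict(map(reversed, tokenizer_j.word_index.items()))
# test_X / test_Y: held-out ID sequences, assumed to be prepared beforehand
input_seq = pad_sequences([test_X[0]], seqX_len, padding='post')

print(' '.join([detokenizer_e[i] for i in test_X[0]])) # English sentence
print(' '.join([detokenizer_j[i] for i in decode_sequence(input_seq)])) # prediction
print(' '.join([detokenizer_j[i] for i in test_Y[0]])) # reference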
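One thing to watch when turning IDs back into text: index 0 is reserved for padding and has no entry in word_index, so looking it up in the inverted dictionary raises a KeyError if the sequence has been padded. A minimal sketch that skips padding is shown below. build_detokenizer and ids_to_text are hypothetical helper names, and the small word_index is made up for illustration.

```python
def build_detokenizer(word_index):
    """Invert a word -> index mapping into index -> word."""
    return {index: word for word, index in word_index.items()}


def ids_to_text(ids, index_word, pad_id=0):
    """Join the words for `ids`, skipping padding positions."""
    return " ".join(index_word[i] for i in ids if i != pad_id)


# Hypothetical vocabulary
toy_word_index = {"<s>": 1, "</s>": 2, "彼": 3, "は": 4}
toy_index_word = build_detokenizer(toy_word_index)
text = ids_to_text([1, 3, 4, 2, 0, 0], toy_index_word)
```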

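Results

I limited the dataset to 50,000 sentence pairs and trained for 5 epochs. The three lines below are the English input, the model's prediction, and the reference translation.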

<s> he never makes a show of his learning. </s>
<s> 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は 彼 は
<s> 彼 は 決 し て 自 分 の 学 識 を 見 せ び ら か せ な い 。 </s>

Hmm... The decoder just keeps repeating 彼 は and never emits the end tag.

Full code

seq2seq.py · GitHub

Next Step

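Introduce an attention model
- How to Develop an Encoder-Decoder Model with Attention for Sequence-to-Sequence Prediction in Keras - Machine Learning Mastery
Tokenize the Japanese properly
Try feeding the input in reverse order
- Japanese and English differ greatly in word order, so I doubt this will work well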
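As a rough sketch of the last idea, reversing the encoder input could look like the following. The helper name reverse_inputs is hypothetical, and keeping the <s> and </s> tags at the ends is a choice made for this sketch rather than something the post specifies. The reversal is applied before padding so that the padding zeros stay at the end.

```python
def reverse_inputs(sequences):
    """Reverse the tokens of each source sequence, leaving the first and last tag in place."""
    reversed_seqs = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) < 2:
            reversed_seqs.append(seq)
            continue
        # Keep <s> at the front and </s> at the back; reverse only the words between them
        reversed_seqs.append(seq[:1] + seq[1:-1][::-1] + seq[-1:])
    return reversed_seqs


# Hypothetical IDs: 1 = <s>, 2 = </s>
toy_sequences = [[1, 16504, 7, 1851, 170, 2], [1, 9, 2]]
toy_reversed = reverse_inputs(toy_sequences)
```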

Ref

¹ https://detail.chiebukuro.yahoo.co.jp/qa/question_detail/q11119577901
