{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Transformer翻译项目" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n", "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#一.-建立Transformer模型的直观认识\" data-toc-modified-id=\"一.-建立Transformer模型的直观认识-1\">一. 建立Transformer模型的直观认识</a></span></li><li><span><a href=\"#二.-编码器部分(Encoder)\" data-toc-modified-id=\"二.-编码器部分(Encoder)-2\">二. 编码器部分(Encoder)</a></span><ul class=\"toc-item\"><li><span><a href=\"#0.-先准备好输入的数据\" data-toc-modified-id=\"0.-先准备好输入的数据-2.1\">0. 先准备好输入的数据</a></span></li><li><span><a href=\"#1.-positional-encoding(即位置嵌入或位置编码)\" data-toc-modified-id=\"1.-positional-encoding(即位置嵌入或位置编码)-2.2\">1. positional encoding(即位置嵌入或位置编码)</a></span></li><li><span><a href=\"#2.-self-attention(自注意力机制)\" data-toc-modified-id=\"2.-self-attention(自注意力机制)-2.3\">2. self attention(自注意力机制)</a></span></li><li><span><a href=\"#3.-Attention-Mask\" data-toc-modified-id=\"3.-Attention-Mask-2.4\">3. Attention Mask</a></span></li><li><span><a href=\"#4.-Layer-Normalization-和残差连接\" data-toc-modified-id=\"4.-Layer-Normalization-和残差连接-2.5\">4. Layer Normalization 和残差连接</a></span></li><li><span><a href=\"#5.-Transformer-Encoder-整体结构\" data-toc-modified-id=\"5.-Transformer-Encoder-整体结构-2.6\">5. Transformer Encoder 整体结构</a></span></li></ul></li><li><span><a href=\"#三.-解码器部分(Decoder)\" data-toc-modified-id=\"三.-解码器部分(Decoder)-3\">三. 解码器部分(Decoder)</a></span></li><li><span><a href=\"#四.-Transformer模型\" data-toc-modified-id=\"四.-Transformer模型-4\">四. Transformer模型</a></span></li><li><span><a href=\"#五.-模型训练\" data-toc-modified-id=\"五.-模型训练-5\">五. 模型训练</a></span></li><li><span><a href=\"#六.-模型预测\" data-toc-modified-id=\"六.-模型预测-6\">六. 模型预测</a></span></li></ul></div>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在这份notebook当中,我们会(尽可能)实现 $Transformer$ 模型来完成翻译任务。 \n", "(参考论文:$Attention\\; Is\\; All\\; You\\; Need$ https://arxiv.org/pdf/1706.03762.pdf )\n", "\n", "我们的数据集非常小,只有一万多个句子的训练数据,从结果来看训练出来的模型在测试集上的表现其实已经算还可以了。 \n", "如果想得到更好的效果,则需要更大的数据量并进行更多的训练迭代次数,感兴趣(并且有硬件条件)的同学可以进行尝试。 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 一. 建立Transformer模型的直观认识\n", " \n", "首先来说一下**Transformer**和**LSTM**的最大区别,就是LSTM的训练是迭代(自回归)的,是一个接一个字的来,当前这个字过完LSTM单元,才可以进下一个字,而 $Transformer$ 的训练是并行了,就是所有字是全部同时训练的,这样就大大加快了计算效率,$Transformer$ 使用了位置嵌入$(positional \\ encoding)$来理解语言的顺序,使用自注意力机制和全连接层来进行计算,这些后面都会详细讲解。 \n", " \n", "$Transformer$ 模型主要分为**两大部分**,分别是**编码器($Encoder$)**和**解码器($Decoder$)**: \n", "- **编码器($Encoder$)**负责把自然语言序列映射成为**隐藏层**(下图中**第2步**用九宫格比喻的部分),含有自然语言序列的数学表达\n", "- **解码器($Decoder$)**再把隐藏层映射为自然语言序列,从而使我们可以解决各种问题,如情感分类、命名实体识别、语义关系抽取、摘要生成、机器翻译等等。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<img src=\"./imgs/intuition.jpg\" width=650>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 二. 编码器部分(Encoder)\n", " \n", " \n", "我们会**重点介绍编码器的结构**,因为理解了编码器中的结构, 理解解码器就非常简单了。而且我们用编码器就能够完成一些自然语言处理中比较主流的任务, 如情感分类, 语义关系分析, 命名实体识别等。 \n", " \n", "**编码器($Encoder$)**部分, 即把**自然语言序列映射为隐藏层的数学表达的过程**。 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**以下为一个Transformer Encoder Block结构示意图**\n", "\n", "> 注意: 为方便查看, 下面各部分的内容分别对应着图中第1, 2, 3, 4个方框的序号:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<img src=\"./imgs/encoder.jpg\" width=550>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 0. 
先准备好输入的数据" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import math\n", "import copy\n", "import time\n", "import numpy as np\n", "import torch\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "\n", "from nltk import word_tokenize\n", "from collections import Counter\n", "from torch.autograd import Variable" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# 初始化参数设置\n", "UNK = 0 # 未登录词的标识符对应的词典id\n", "PAD = 1 # padding占位符对应的词典id\n", "BATCH_SIZE = 64 # 每批次训练数据数量\n", "EPOCHS = 20 # 训练轮数\n", "LAYERS = 6 # transformer中堆叠的encoder和decoder block层数\n", "H_NUM = 8 # multihead attention hidden个数\n", "D_MODEL = 256 # embedding维数\n", "D_FF = 1024 # feed forward第一个全连接层维数\n", "DROPOUT = 0.1 # dropout比例\n", "MAX_LENGTH = 60 # 最大句子长度\n", "\n", "TRAIN_FILE = 'nmt/en-cn/train.txt' # 训练集数据文件\n", "DEV_FILE = \"nmt/en-cn/dev.txt\" # 验证(开发)集数据文件\n", "SAVE_FILE = 'save/model.pt' # 模型保存路径(注意如当前目录无save文件夹需要自己创建)\n", "DEVICE = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def seq_padding(X, padding=0):\n", " \"\"\"\n", " 对一个batch批次(以单词id表示)的数据进行padding填充对齐长度\n", " \"\"\"\n", " # 计算该批次数据各条数据句子长度\n", " L = [len(x) for x in X]\n", " # 获取该批次数据最大句子长度\n", " ML = max(L)\n", " # 对X中各条数据x进行遍历,如果长度短于该批次数据最大长度ML,则以padding id填充缺失长度ML-len(x)\n", " # (注意这里默认padding id是0,相当于是拿<UNK>来做了padding)\n", " return np.array([\n", " np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X\n", " ])\n", "\n", "\n", "class PrepareData:\n", " def __init__(self, train_file, dev_file):\n", " # 读取数据 并分词\n", " self.train_en, self.train_cn = self.load_data(train_file)\n", " self.dev_en, self.dev_cn = self.load_data(dev_file)\n", "\n", " # 构建单词表\n", " self.en_word_dict, self.en_total_words, self.en_index_dict = self.build_dict(self.train_en)\n", " self.cn_word_dict, self.cn_total_words, self.cn_index_dict = self.build_dict(self.train_cn)\n", "\n", " # id化\n", " self.train_en, self.train_cn = self.wordToID(self.train_en, self.train_cn, self.en_word_dict, self.cn_word_dict)\n", " self.dev_en, self.dev_cn = self.wordToID(self.dev_en, self.dev_cn, self.en_word_dict, self.cn_word_dict)\n", "\n", " # 划分batch + padding + mask\n", " self.train_data = self.splitBatch(self.train_en, self.train_cn, BATCH_SIZE)\n", " self.dev_data = self.splitBatch(self.dev_en, self.dev_cn, BATCH_SIZE)\n", "\n", " def load_data(self, path):\n", " \"\"\"\n", " 读取翻译前(英文)和翻译后(中文)的数据文件\n", " 每条数据都进行分词,然后构建成包含起始符(BOS)和终止符(EOS)的单词(中文为字符)列表\n", " 形式如:en = [['BOS', 'i', 'love', 'you', 'EOS'], ['BOS', 'me', 'too', 'EOS'], ...]\n", " cn = [['BOS', '我', '爱', '你', 'EOS'], ['BOS', '我', '也', '是', 'EOS'], ...]\n", " \"\"\"\n", " en = []\n", " cn = []\n", " with open(path, 'r', encoding='utf-8') as f:\n", " for line in f:\n", " line = line.strip().split('\\t')\n", "\n", " en.append([\"BOS\"] + word_tokenize(line[0].lower()) + [\"EOS\"])\n", " cn.append([\"BOS\"] + word_tokenize(\" \".join([w for w in line[1]])) + [\"EOS\"])\n", "\n", " return en, cn\n", " \n", " def build_dict(self, sentences, max_words=50000):\n", " \"\"\"\n", " 传入load_data构造的分词后的列表数据\n", " 构建词典(key为单词,value为id值)\n", " \"\"\"\n", " # 对数据中所有单词进行计数\n", " word_count = Counter()\n", "\n", " for sentence in sentences:\n", " for s in sentence:\n", " word_count[s] += 1\n", " # 只保留最高频的前max_words数的单词构建词典\n", " # 并添加上UNK和PAD两个单词,对应id已经初始化设置过\n", " ls = 
word_count.most_common(max_words)\n", " # 统计词典的总词数\n", " total_words = len(ls) + 2\n", "\n", " word_dict = {w[0]: index + 2 for index, w in enumerate(ls)}\n", " word_dict['UNK'] = UNK\n", " word_dict['PAD'] = PAD\n", " # 再构建一个反向的词典,供id转单词使用\n", " index_dict = {v: k for k, v in word_dict.items()}\n", "\n", " return word_dict, total_words, index_dict\n", "\n", " def wordToID(self, en, cn, en_dict, cn_dict, sort=True):\n", " \"\"\"\n", " 该方法可以将翻译前(英文)数据和翻译后(中文)数据的单词列表表示的数据\n", " 均转为id列表表示的数据\n", " 如果sort参数设置为True,则会以翻译前(英文)的句子(单词数)长度排序\n", " 以便后续分batch做padding时,同批次各句子需要padding的长度相近减少padding量\n", " \"\"\"\n", " # 计算英文数据条数\n", " length = len(en)\n", " # 将翻译前(英文)数据和翻译后(中文)数据都转换为id表示的形式\n", " out_en_ids = [[en_dict.get(w, 0) for w in sent] for sent in en]\n", " out_cn_ids = [[cn_dict.get(w, 0) for w in sent] for sent in cn]\n", "\n", " # 构建一个按照句子长度排序的函数\n", " def len_argsort(seq):\n", " \"\"\"\n", " 传入一系列句子数据(分好词的列表形式),\n", " 按照句子长度排序后,返回排序后原来各句子在数据中的索引下标\n", " \"\"\"\n", " return sorted(range(len(seq)), key=lambda x: len(seq[x]))\n", "\n", " # 把中文和英文按照同样的顺序排序\n", " if sort:\n", " # 以英文句子长度排序的(句子下标)顺序为基准\n", " sorted_index = len_argsort(out_en_ids)\n", " # 对翻译前(英文)数据和翻译后(中文)数据都按此基准进行排序\n", " out_en_ids = [out_en_ids[i] for i in sorted_index]\n", " out_cn_ids = [out_cn_ids[i] for i in sorted_index]\n", " \n", " return out_en_ids, out_cn_ids\n", "\n", " def splitBatch(self, en, cn, batch_size, shuffle=True):\n", " \"\"\"\n", " 将以单词id列表表示的翻译前(英文)数据和翻译后(中文)数据\n", " 按照指定的batch_size进行划分\n", " 如果shuffle参数为True,则会对这些batch数据顺序进行随机打乱\n", " \"\"\"\n", " # 在按数据长度生成的各条数据下标列表[0, 1, ..., len(en)-1]中\n", " # 每隔指定长度(batch_size)取一个下标作为后续生成batch的起始下标\n", " idx_list = np.arange(0, len(en), batch_size)\n", " # 如果shuffle参数为True,则将这些各batch起始下标打乱\n", " if shuffle:\n", " np.random.shuffle(idx_list)\n", " # 存放各个batch批次的句子数据索引下标\n", " batch_indexs = []\n", " for idx in idx_list:\n", " # 注意,起始下标最大的那个batch可能会超出数据大小\n", " # 因此要限定其终止下标不能超过数据大小\n", " \"\"\"\n", " 形如[array([4, 5, 6, 7]), \n", " array([0, 1, 2, 3]), \n", " array([8, 9, 10, 11]),\n", " ...]\n", " \"\"\"\n", " batch_indexs.append(np.arange(idx, min(idx + batch_size, len(en))))\n", " \n", " # 按各batch批次的句子数据索引下标,构建实际的单词id列表表示的各batch句子数据\n", " batches = []\n", " for batch_index in batch_indexs:\n", " # 按当前batch的各句子下标(数组批量索引)提取对应的单词id列表句子表示数据\n", " batch_en = [en[index] for index in batch_index] \n", " batch_cn = [cn[index] for index in batch_index]\n", " # 对当前batch的各个句子都进行padding对齐长度\n", " # 维度为:batch数量×batch_size×每个batch最大句子长度\n", " batch_cn = seq_padding(batch_cn)\n", " batch_en = seq_padding(batch_en)\n", " # 将当前batch的英文和中文数据添加到存放所有batch数据的列表中\n", " batches.append(Batch(batch_en, batch_cn))\n", "\n", " return batches" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "注意,上述预处理中使用的 $Batch$ 类在后面的 $Encoder$ 内容的 $Attention\\ Mask$ 部分定义" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Embeddings** \n", " \n", "与其他序列传导模型类似,我们使用learned embeddings将输入标记和输出标记转换为维度 $d_{model}$ 的向量。我们还使用通常学习的线性变换和softmax函数将 $Decoder$(解码器)的输出转换为预测的下一个标签的概率。 \n", " \n", "在我们的模型中,我们在两个embedding层和pre-softmax线性变换层之间共享相同的权重矩阵。这么做可以节省参数,也是一种正则化方式。 \n", "在其中的embedding层,我们会将这些权重乘以 $\\sqrt{d_{model}}$ 。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "class Embeddings(nn.Module):\n", " def __init__(self, d_model, vocab):\n", " super(Embeddings, self).__init__()\n", " # Embedding层\n", " self.lut = nn.Embedding(vocab, d_model)\n", " # Embedding维数\n", " self.d_model = d_model\n", "\n", " def forward(self, x):\n", " # 
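将embedding乘以sqrt(d_model)进行缩放,常见的解释是让其数值量级与positional encoding相当\n", "        # 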
返回x对应的embedding矩阵(需要乘以math.sqrt(d_model))\n", " return self.lut(x) * math.sqrt(self.d_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "数据全部处理完成,现在我们开始理解和构建 $Transformer$ 模型" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. positional encoding(即位置嵌入或位置编码)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "由于 $Transformer$ 模型**没有**循环神经网络的迭代操作,所以我们必须提供每个字的位置信息给 $Transformer$,才能识别出语言中的顺序关系。 \n", " \n", "因此,我们定义一个位置嵌入的概念,也就是$positional \\ encoding$,位置嵌入的维度为$[max \\ sequence \\ length,\\ embedding \\ dimension]$,嵌入的维度同词向量的维度,$max \\ sequence \\ length$属于超参数,指的是限定的最大单个句长。 \n", " \n", "注意,我们一般以字为单位训练transformer模型,也就是说我们不用分词了,首先我们要初始化字向量为$[vocab \\ size,\\ embedding \\ dimension]$,$vocab \\ size$为总共的字库数量,$embedding \\ dimension$为字向量的维度,也是每个字的数学表达。 \n", " \n", "在论文 **attention is all you need**( https://arxiv.org/pdf/1706.03762.pdf )中使用了$sine$和$cosine$函数的线性变换来提供给模型位置信息: \n", " \n", "$$PE_{(pos,2i)} = sin(pos / 10000^{2i/d_{\\text{model}}}) \\quad \\quad PE_{(pos,2i+1)} = cos(pos / 10000^{2i/d_{\\text{model}}})\\tag{eq.1}$$ \n", " \n", "上式中$pos$指的是句中字的位置,取值范围是$[0, \\ max \\ sequence \\ length)$,$i$指的是词向量的维度,取值范围是$[0, \\ embedding \\ dimension)$,上面有$sin$和$cos$一组公式,也就是对应着$embedding \\ dimension$维度的一组奇数和偶数的序号的维度,例如$0, 1$一组,$2, 3$一组,分别用上面的$sin$和$cos$函数做处理,从而产生不同的周期性变化,而位置嵌入在$embedding \\ dimension$维度上随着维度序号增大,周期变化会越来越慢,而产生一种包含位置信息的纹理,就像论文原文中第六页讲的,位置嵌入函数的周期从$2 \\pi$到$10000 * 2 \\pi$变化,而每一个位置在$embedding \\ dimension$维度上都会得到不同周期的$sin$和$cos$函数的取值组合,从而产生独一的纹理位置信息,模型从而学到位置之间的依赖关系和自然语言的时序特性。 " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# 导入依赖库\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "class PositionalEncoding(nn.Module):\n", " def __init__(self, d_model, dropout, max_len=5000):\n", " super(PositionalEncoding, self).__init__()\n", " self.dropout = nn.Dropout(p=dropout)\n", " \n", " # 初始化一个size为 max_len(设定的最大长度)×embedding维度 的全零矩阵\n", " # 来存放所有小于这个长度位置对应的porisional embedding\n", " pe = torch.zeros(max_len, d_model, device=DEVICE)\n", " # 生成一个位置下标的tensor矩阵(每一行都是一个位置下标)\n", " \"\"\"\n", " 形式如:\n", " tensor([[0.],\n", " [1.],\n", " [2.],\n", " [3.],\n", " [4.],\n", " ...])\n", " \"\"\"\n", " position = torch.arange(0., max_len, device=DEVICE).unsqueeze(1)\n", " # 这里幂运算太多,我们使用exp和log来转换实现公式中pos下面要除以的分母(由于是分母,要注意带负号)\n", " div_term = torch.exp(torch.arange(0., d_model, 2, device=DEVICE) * -(math.log(10000.0) / d_model))\n", " # 得到各个位置在各embedding维度上的位置纹理值,存放到pe矩阵中\n", " pe[:, 0::2] = torch.sin(position * div_term)\n", " pe[:, 1::2] = torch.cos(position * div_term)\n", " # 加1个维度,使得pe维度变为:1×max_len×embedding维度\n", " # (方便后续与一个batch的句子所有词的embedding批量相加)\n", " pe = pe.unsqueeze(0) \n", " # 将pe矩阵以持久的buffer状态存下(不会作为要训练的参数)\n", " self.register_buffer('pe', pe)\n", "\n", " def forward(self, x):\n", " # 将一个batch的句子所有词的embedding与已构建好的positional embeding相加\n", " # (这里按照该批次数据的最大句子长度来取对应需要的那些positional embedding值)\n", " x = x + Variable(self.pe[:, :x.size(1)], requires_grad=False)\n", " return self.dropout(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "可见,这里首先是按照最大长度max_len生成一个位置,而后根据公式计算出所有的向量,在forward函数中根据长度取用即可,非常方便。 \n", " \n", "> 注意要设置requires_grad=False,因其不参与训练。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下面画一下位置嵌入,可见纵向观察,随着$embedding \\ dimension$增大,位置嵌入函数呈现不同的周期变化。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": false }, "outputs": [ { "ename": 
"RuntimeError", "evalue": "expected device cpu and dtype Float but got device cuda:0 and dtype Float", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mRuntimeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m<ipython-input-7-21b673eab013>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mpe\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mPositionalEncoding\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m16\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m0\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m100\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mpositional_encoding\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpe\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mforward\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mVariable\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtorch\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mzeros\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m100\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m16\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3\u001b[0m \u001b[0mplt\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfigure\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfigsize\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m10\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;36m10\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0msns\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mheatmap\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mpositional_encoding\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msqueeze\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0mplt\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mtitle\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Sinusoidal Function\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m<ipython-input-6-16fb9a312388>\u001b[0m in \u001b[0;36mforward\u001b[1;34m(self, x)\u001b[0m\n\u001b[0;32m 32\u001b[0m \u001b[1;31m# 将一个batch的句子所有词的embedding与已构建好的positional embeding相加\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 33\u001b[0m \u001b[1;31m# (这里按照该批次数据的最大句子长度来取对应需要的那些positional embedding值)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 34\u001b[1;33m \u001b[0mx\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mx\u001b[0m \u001b[1;33m+\u001b[0m \u001b[0mVariable\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mpe\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m:\u001b[0m\u001b[0mx\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msize\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mrequires_grad\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 35\u001b[0m \u001b[1;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdropout\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mx\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 
"\u001b[1;31mRuntimeError\u001b[0m: expected device cpu and dtype Float but got device cuda:0 and dtype Float" ] } ], "source": [ "pe = PositionalEncoding(16, 0, 100)\n", "positional_encoding = pe.forward(Variable(torch.zeros(1, 100, 16)))\n", "plt.figure(figsize=(10,10))\n", "sns.heatmap(positional_encoding.squeeze())\n", "plt.title(\"Sinusoidal Function\")\n", "plt.xlabel(\"hidden dimension\")\n", "plt.ylabel(\"sequence length\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 5))\n", "pe = PositionalEncoding(20, 0)\n", "y = pe.forward(Variable(torch.zeros(1, 100, 20)))\n", "plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())\n", "plt.legend([\"dim %d\"%p for p in [4,5,6,7]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. self attention(自注意力机制)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<img src=\"./imgs/attention_0.jpg\" width=600>\n", "<img src=\"./imgs/attention_1.jpg\" width=600>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**除以$\\sqrt{d_k}$的解释** \n", " \n", "假设 $q$ 和 $k$ 是独立的随机变量,平均值为 0,方差 1,这样他们的点积后形成的注意力矩阵为 $q⋅k=\\sum_{i=1}^{d_k}{q_i k_i}$,均值为 0 但方差放大为 $d_k$ 。为了抵消这种影响,我们用$\\sqrt{d_k}$来缩放点积,可以使得Softmax归一化时结果更稳定(不至于点积后得到注意力矩阵的值差别太大),以便反向传播时获取平衡的梯度" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def attention(query, key, value, mask=None, dropout=None):\n", " # 将query矩阵的最后一个维度值作为d_k\n", " d_k = query.size(-1)\n", " # 将key的最后两个维度互换(转置),才能与query矩阵相乘,乘完了还要除以d_k开根号\n", " scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)\n", " # 如果存在要进行mask的内容,则将那些为0的部分替换成一个很大的负数\n", " if mask is not None:\n", " scores = scores.masked_fill(mask==0, -1e9)\n", " # 将mask后的attention矩阵按照最后一个维度进行softmax\n", " p_attn = F.softmax(scores, dim=-1)\n", " # 如果dropout参数设置为非空,则进行dropout操作\n", " if dropout is not None:\n", " p_attn = dropout(p_attn)\n", " # 最后返回注意力矩阵跟value的乘积,以及注意力矩阵\n", " return torch.matmul(p_attn, value), p_attn\n", "\n", "\n", "class MultiHeadedAttention(nn.Module):\n", " def __init__(self, h, d_model, dropout=0.1):\n", " super(MultiHeadedAttention, self).__init__()\n", " # 保证可以整除\n", " assert d_model % h == 0\n", " # 得到一个head的attention表示维度\n", " self.d_k = d_model // h\n", " # head数量\n", " self.h = h\n", " # 定义4个全连接函数,供后续作为WQ,WK,WV矩阵和最后h个多头注意力矩阵concat之后进行变换的矩阵\n", " self.linears = clones(nn.Linear(d_model, d_model), 4)\n", " self.attn = None\n", " self.dropout = nn.Dropout(p=dropout)\n", "\n", " def forward(self, query, key, value, mask=None):\n", " if mask is not None:\n", " mask = mask.unsqueeze(1)\n", " # query的第一个维度值为batch size\n", " nbatches = query.size(0)\n", " # 将embedding层乘以WQ,WK,WV矩阵(均为全连接)\n", " # 并将结果拆成h块,然后将第二个和第三个维度值互换(具体过程见上述解析)\n", " query, key, value = [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2) \n", " for l, x in zip(self.linears, (query, key, value))]\n", " # 调用上述定义的attention函数计算得到h个注意力矩阵跟value的乘积,以及注意力矩阵\n", " x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)\n", " # 将h个多头注意力矩阵concat起来(注意要先把h变回到第三维的位置)\n", " x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)\n", " # 使用self.linears中构造的最后一个全连接函数来存放变换后的矩阵进行返回\n", " return self.linears[-1](x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "参数里面的 $h$ 和 $d_{model}$ 分别表示注意力头的个数,以及模型的隐层单元数。 \n", " \n", "另外在 $\\_\\_init\\_\\_$ 函数中,我们定义了 $self.linears = clones(nn.Linear(d_{model},\\; d_{model}),\\; 4),\\; clone(x,\\; N)$ 
即为深拷贝N份,这里定义了4个全连接函数,实际上是3+1,其中的3个分别是Q、K和V的变换矩阵,最后一个是用于最后将 $h$ 个多头注意力矩阵concat之后进行变换的矩阵。 \n", " \n", "在 $forward$ 函数中,是首先将 $query$、$key$ 和 $value$ 进行相应的变换,然后需要经过 $attention$ 这个函数的计算,这个函数实际上就是论文中“Scaled Dot-Product Attention”这个模块的计算" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Attention Mask" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<img src=\"./imgs/attention_mask.jpg\" width=750>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "注意, 在上面$self \\ attention$的计算过程中, 我们通常使用$mini \\ batch$来计算, 也就是一次计算多句话, 也就是$X$的维度是$[batch \\ size, \\ sequence \\ length]$, $sequence \\ length$是句长, 而一个$mini \\ batch$是由多个不等长的句子组成的, 我们就需要按照这个$mini \\ batch$中最大的句长对剩余的句子进行补齐长度, 我们一般用$0$来进行填充, 这个过程叫做$padding$. \n", " \n", "但这时在进行$softmax$的时候就会产生问题, 回顾$softmax$函数$\\sigma (\\mathbf {z} )_{i}={\\frac {e^{z_{i}}}{\\sum _{j=1}^{K}e^{z_{j}}}}$, $e^0$是1, 是有值的, 这样的话$softmax$中被$padding$的部分就参与了运算, 就等于是让无效的部分参与了运算, 会产生很大隐患, 这时就需要做一个$mask$让这些无效区域不参与运算, 我们一般给无效区域加一个很大的负数的偏置, 也就是: \n", " \n", "$$z_{illegal} = z_{illegal} + bias_{illegal}$$\n", "$$bias_{illegal} \\to -\\infty$$\n", "$$e^{z_{illegal}} \\to 0 $$ \n", " \n", "经过上式的$masking$我们使无效区域经过$softmax$计算之后几乎为$0$, 这样就避免了无效区域参与计算." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在 $Transformer$ 里面,$Encoder$ 和 $Decoder $的 $Attention$ 计算都需要相应的 $Mask$ 处理,但功能却不同。 \n", " \n", "在 $Encoder$ 中,就如上述介绍的,$Mask$ 就是为了让那些在一个 $batch$ 中长度较短的序列的 $padding$ 部分不参与 $Attention$ 的计算。因此我们定义一个 $Batch$ 批处理对象,它包含用于训练的 $src$(翻译前)和 $trg$(翻译后)句子,以及构造其中的 $Mask$ 掩码。 \n", " \n", "**加了 $Mask$ 的 $Attention$ 原理如图(另附 $Multi\\text{-}Head\\ Attention$ ):** \n", "> 注意:这里的 $Attention\\ Mask$ 是加在 $Scale$ 和 $Softmax$ 之间 \n", " \n", "<img src=\"./imgs/attention_mask2.jpg\" width=550>" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Batch:\n", " \"Object for holding a batch of data with mask during training.\"\n", " def __init__(self, src, trg=None, pad=0):\n", " # 将输入与输出的单词id表示的数据规范成整数类型\n", " src = torch.from_numpy(src).to(DEVICE).long()\n", " trg = torch.from_numpy(trg).to(DEVICE).long()\n", " self.src = src\n", " # 对于当前输入的句子非空部分进行判断成bool序列\n", " # 并在seq length前面增加一维,形成维度为 1×seq length 的矩阵\n", " self.src_mask = (src != pad).unsqueeze(-2)\n", " # 如果输出目标不为空,则需要对decoder要使用到的target句子进行mask\n", " if trg is not None:\n", " # decoder要用到的target输入部分\n", " self.trg = trg[:, :-1]\n", " # decoder训练时应预测输出的target结果\n", " self.trg_y = trg[:, 1:]\n", " # 将target输入部分进行attention mask\n", " self.trg_mask = self.make_std_mask(self.trg, pad)\n", " # 将应输出的target结果中实际的词数进行统计\n", " self.ntokens = (self.trg_y != pad).data.sum()\n", " \n", " # Mask掩码操作\n", " @staticmethod\n", " def make_std_mask(tgt, pad):\n", " \"Create a mask to hide padding and future words.\"\n", " tgt_mask = (tgt != pad).unsqueeze(-2)\n", " tgt_mask = tgt_mask & Variable(subsequent_mask(tgt.size(-1)).type_as(tgt_mask.data))\n", " return tgt_mask" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Layer Normalization 和残差连接" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1). 
**LayerNorm**: \n", " \n", "$Layer Normalization$的作用是把神经网络中隐藏层归一为标准正态分布, 也就是 $i.i.d$(独立同分布), 以起到加快训练速度, 加速收敛的作用:\n", "$$\\mu_{i}=\\frac{1}{m} \\sum^{m}_{i=1}x_{ij}$$ \n", " \n", "上式中以矩阵的行$(row)$为单位求均值; \n", " \n", "$$\\sigma^{2}_{j}=\\frac{1}{m} \\sum^{m}_{i=1}\n", "(x_{ij}-\\mu_{j})^{2}$$ \n", " \n", "上式中以矩阵的行$(row)$为单位求方差; \n", " \n", "$$LayerNorm(x)=\\alpha \\odot \\frac{x_{ij}-\\mu_{i}}\n", "{\\sqrt{\\sigma^{2}_{i}+\\epsilon}} + \\beta \\tag{eq.5}$$ \n", " \n", "然后用**每一行**的**每一个元素**减去**这行的均值**, 再除以**这行的标准差**, 从而得到归一化后的数值, $\\epsilon$是为了防止除$0$; \n", "之后引入两个**可训练参数**$\\alpha, \\ \\beta$来弥补归一化的过程中损失掉的信息, 注意$\\odot$表示元素相乘而不是点积, 我们一般初始化$\\alpha$为全$1$, 而$\\beta$为全$0$. \n", "\n", "> 注:有关Batch Normalization和Layer Normalization的区别可参考如下文章——https://zhuanlan.zhihu.com/p/33173246" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2). **残差连接**: \n", " \n", "我们在上一步得到了经过注意力矩阵加权之后的$V$, 也就是$Attention(Q, \\ K, \\ V)$, 我们对它进行一下转置, 使其和$X_{embedding}$的维度一致, 也就是$[batch \\ size, \\ sequence \\ length, \\ embedding \\ dimension]$, 然后把他们加起来做残差连接, 直接进行元素相加, 因为他们的维度一致: \n", " \n", "$$X_{embedding} + Attention(Q, \\ K, \\ V)$$ \n", " \n", "在之后的运算里, 每经过一个模块的运算, 都要把运算之前的值和运算之后的值相加, 从而得到残差连接, 训练的时候可以使梯度直接走捷径反传到最初始层: \n", " \n", "$$X + SubLayer(X) \\tag{eq. 6}$$ \n", " \n", "> **注意:这里我们对$SubLayer(X)$一般会进行dropout后再与X连接,即 $X + Dropout(SubLayer(X))$**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class LayerNorm(nn.Module):\n", " def __init__(self, features, eps=1e-6):\n", " super(LayerNorm, self).__init__()\n", " # 初始化α为全1, 而β为全0\n", " self.a_2 = nn.Parameter(torch.ones(features))\n", " self.b_2 = nn.Parameter(torch.zeros(features))\n", " # 平滑项\n", " self.eps = eps\n", "\n", " def forward(self, x):\n", " # 按最后一个维度计算均值和方差\n", " mean = x.mean(-1, keepdim=True)\n", " std = x.std(-1, keepdim=True)\n", " # return self.a_2 * (x - mean) / (std + self.eps) + self.b_2\n", " # 返回Layer Norm的结果\n", " return self.a_2 * (x - mean) / torch.sqrt(std ** 2 + self.eps) + self.b_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "以上是 $LayerNormalization$ 的实现,其实PyTorch里面已经集成好了nn.LayerNorm,这里实现出来是为了学习其中的原理。而实际中,为了代码简洁,可以直接使用PyTorch里面实现好的函数。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class SublayerConnection(nn.Module):\n", " \"\"\"\n", " SublayerConnection的作用就是把Multi-Head Attention和Feed Forward层连在一起\n", " 只不过每一层输出之后都要先做Layer Norm再残差连接\n", " \"\"\"\n", " def __init__(self, size, dropout):\n", " super(SublayerConnection, self).__init__()\n", " self.norm = LayerNorm(size)\n", " self.dropout = nn.Dropout(dropout)\n", "\n", " def forward(self, x, sublayer):\n", " # 返回Layer Norm和残差连接后结果\n", " return x + self.dropout(sublayer(self.norm(x)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Transformer Encoder 整体结构" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "经过上面4个步骤, 我们已经基本了解到来$transformer$编码器的主要构成部分, 我们下面用公式把一个$transformer \\ block$的计算过程整理一下: \n", " \n", "1). **字向量与位置编码:** \n", "$$X = EmbeddingLookup(X) + PositionalEncoding \\tag{eq.2}$$\n", "$$X \\in \\mathbb{R}^{batch \\ size \\ * \\ seq. \\ len. \\ * \\ embed. \\ dim.} $$ \n", " \n", "2). **自注意力机制:** \n", "$$Q = Linear(X) = XW_{Q}$$ \n", "$$K = Linear(X) = XW_{K} \\tag{eq.3}$$\n", "$$V = Linear(X) = XW_{V}$$\n", "$$X_{attention} = SelfAttention(Q, \\ K, \\ V) \\tag{eq.4}$$ \n", " \n", "3). **残差连接与$Layer \\ Normalization$**\n", "$$X_{attention} = LayerNorm(X_{attention}) \\tag{eq. 
5}$$\n", "$$X_{attention} = X + X_{attention} \\tag{eq. 6}$$ \n", " \n", "4). **$FeedForward$,其实就是两层线性映射并用激活函数(比如说$ReLU$)激活:** \n", "$$X_{hidden} = Linear(Activate(Linear(X_{attention}))) \\tag{eq. 7}$$ \n", " \n", "5). **重复3).:**\n", "$$X_{hidden} = LayerNorm(X_{hidden})$$\n", "$$X_{hidden} = X_{attention} + X_{hidden}$$\n", "$$X_{hidden} \\in \\mathbb{R}^{batch \\ size \\ * \\ seq. \\ len. \\ * \\ embed. \\ dim.} $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def clones(module, N):\n", " \"\"\"\n", " 克隆模型块,克隆的模型块参数不共享\n", " \"\"\"\n", " return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$Feed Forward$(前馈网络)层其实就是两层线性映射并用激活函数激活" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class PositionwiseFeedForward(nn.Module):\n", " def __init__(self, d_model, d_ff, dropout=0.1):\n", " super(PositionwiseFeedForward, self).__init__()\n", " self.w_1 = nn.Linear(d_model, d_ff)\n", " self.w_2 = nn.Linear(d_ff, d_model)\n", " self.dropout = nn.Dropout(dropout)\n", "\n", " def forward(self, x):\n", " return self.w_2(self.dropout(F.relu(self.w_1(x))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$Encoder$ 由 $N=6$ 个相同的层组成。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Encoder(nn.Module):\n", " # layer = EncoderLayer\n", " # N = 6\n", " def __init__(self, layer, N):\n", " super(Encoder, self).__init__()\n", " # 复制N个encoder layer\n", " self.layers = clones(layer, N)\n", " # Layer Norm\n", " self.norm = LayerNorm(layer.size)\n", "\n", " def forward(self, x, mask):\n", " \"\"\"\n", " 使用循环连续eecode N次(这里为6次)\n", " 这里的Eecoderlayer会接收一个对于输入的attention mask处理\n", " \"\"\"\n", " for layer in self.layers:\n", " x = layer(x, mask)\n", " return self.norm(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "每层 $Encoder\\ Block$ 都有两个子层组成。第一个子层实现了“多头”的 $Self\\text{-}attention$,第二个子层则是一个简单的 $Position\\text{-}wise$ 的全连接前馈网络。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class EncoderLayer(nn.Module):\n", " def __init__(self, size, self_attn, feed_forward, dropout):\n", " super(EncoderLayer, self).__init__()\n", " self.self_attn = self_attn\n", " self.feed_forward = feed_forward\n", " # SublayerConnection的作用就是把multi和ffn连在一起\n", " # 只不过每一层输出之后都要先做Layer Norm再残差连接\n", " self.sublayer = clones(SublayerConnection(size, dropout), 2)\n", " # d_model\n", " self.size = size\n", "\n", " def forward(self, x, mask):\n", " # 将embedding层进行Multi head Attention\n", " x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))\n", " # 注意到attn得到的结果x直接作为了下一层的输入\n", " return self.sublayer[1](x, self.feed_forward)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 三. 
解码器部分(Decoder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<img src=\"./imgs/decoder.jpg\" width=550>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "接着来看 $Decoder$ 部分(右半部分),它同样也是由 $N$ 层(在论文中,仍取 $N=6$ )堆叠起来。 \n", "> 对于其中的每一层,除了与 $Encoder$ 中相同的 $self\\text{-}attention$ 及 $Feed Forward$ 两层之外,还在中间插入了一个传统的 $Encoder\\text{-}Decoder$ 框架中的 $context\\text{-}attention$ 层(上图中的$sub\\text{-}layer\\ 2$),即将 $Decoder$ 的输出作为 $query$ 去查询 $Encoder$ 的输出,同样用的是 $Multi\\text{-}Head\\ Attention$ ,使得在 $Decode$ 的时候能看到 $Encoder$ 的所有输出。 \n", " \n", "这里明确一下 **$Decoder$ 的输入输出和解码过程:**\n", "\n", "- 输入:$Encoder$ 的输出 & 对应 $i-1$ 位置 $Decoder$ 的输出。所以中间的 $Attention$ 不是 $Self\\text{-}Attention$ ,它的 $K$,$V$ 来自 $Encoder$ ,Q来自上一位置 $Decoder$ 的输出\n", "- 输出:对应 $i$ 位置的输出词的概率分布\n", "- 解码:这里要特别注意一下,编码可以并行计算,一次性全部encoding出来,但解码不是一次把所有序列解出来的,而是像rnn一样一个一个解出来的,因为要用上一个位置的输入当作 $Attention$ 的 $query$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里 ** $Encoder$ 和 $Decoder$ 的 $Attention$ 的区别如下图所示** \n", "<img src=\"./imgs/attention.png\" width=550>" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Decoder(nn.Module):\n", " def __init__(self, layer, N):\n", " super(Decoder, self).__init__()\n", " # 复制N个decoder layer\n", " self.layers = clones(layer, N)\n", " # Layer Norm\n", " self.norm = LayerNorm(layer.size)\n", "\n", " def forward(self, x, memory, src_mask, tgt_mask):\n", " \"\"\"\n", " 使用循环连续decode N次(这里为6次)\n", " 这里的Decoderlayer会接收一个对于输入的attention mask处理\n", " 和一个对输出的attention mask + subsequent mask处理\n", " \"\"\"\n", " for layer in self.layers:\n", " x = layer(x, memory, src_mask, tgt_mask)\n", " return self.norm(x)\n", "\n", "\n", "class DecoderLayer(nn.Module):\n", " def __init__(self, size, self_attn, src_attn, feed_forward, dropout):\n", " super(DecoderLayer, self).__init__()\n", " self.size = size\n", " # Self-Attention\n", " self.self_attn = self_attn\n", " # 与Encoder传入的Context进行Attention\n", " self.src_attn = src_attn\n", " self.feed_forward = feed_forward\n", " self.sublayer = clones(SublayerConnection(size, dropout), 3)\n", "\n", " def forward(self, x, memory, src_mask, tgt_mask):\n", " # 用m来存放encoder的最终hidden表示结果\n", " m = memory\n", " # self-attention的q,k和v均为decoder hidden\n", " x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))\n", " # context-attention的q为decoder hidden,而k和v为encoder hidden\n", " x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))\n", " return self.sublayer[2](x, self.feed_forward)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "明确了解码过程之后最上面的图就很好懂了,这里主要的不同就是新加的另外要说一下新加的attention多加了一个 $subsequent\\_mask$ ,因为训练时的output都是ground truth,这样可以确保预测第 $i$ 个位置时不会接触到未来的信息,具体解释如下。\n", "\n", "对于 $Encoder$ 中 $src$ 的 $mask$ 方式就比较简单,直接把 $pad$ 部分给 $mask$ 掉即可。 \n", "但对于 $Decoder$ 中 $trg$ 的 $mask$ 计算略微复杂一些,不仅需要把 $pad$ 部分 $mask$ 掉,还需要进行一个 $subsequent\\_mask$ 的操作。 \n", "即作为 $decoder$,在预测当前步的时候,是不能知道后面的内容的,即 $attention$ 需要加上 $mask$,将当前步之后的分数全部置为$-\\infty$,然后再计算 $softmax$,以防止发生数据泄露。这种 $Masked$ 的 $Attention$ 是考虑到输出 $Embedding$ 会偏移一个位置,确保了生成位置 $i$ 的预测时,仅依赖小于 $i$ 的位置处的已知输出,相当于把后面不该看到的信息屏蔽掉。 " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def subsequent_mask(size):\n", " \"Mask out subsequent positions.\"\n", " # 设定subsequent_mask矩阵的shape\n", " attn_shape = (1, size, size)\n", " # 生成一个右上角(不含主对角线)为全1,左下角(含主对角线)为全0的subsequent_mask矩阵\n", " subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')\n", " # 
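注意:返回的矩阵中为True(对应上面矩阵中为0)的位置表示允许参与attention计算,为False的位置则会在attention计算时被mask掉\n", "    # 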
返回一个右上角(不含主对角线)为全False,左下角(含主对角线)为全True的subsequent_mask矩阵\n", " return torch.from_numpy(subsequent_mask) == 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "我们可视化一下 $subsequent\\_mask$ 矩阵的形式,直观进行理解。 \n", "这里的 $Attention mask$ 图显示了允许每个目标词(行)查看的位置(列)。在训练期间,当前解码位置的词不能 $Attend$ 到后续位置的词。 \n", " \n", "> 这里是给定一个序列长度size,生成一个下三角矩阵,在主对角线右上的都是 False,其示意图如下: \n", "<img src=\"./imgs/subsequent_mask.png\" width=450> \n", "就是在decoder层的self-Attention中,由于生成 $s_i$ 时, $s_{i+1}$ 并没有产生,所以不能有 $s_i$ 和 $s_{i+1}$ 的关联系数,即只有下三角矩阵有系数,即下图中**`黄色部分`**。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "plt.figure(figsize=(5,5))\n", "plt.imshow(subsequent_mask(20)[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 四. Transformer模型\n", " \n", "最后,我们把 $Encoder$ 和 $Decoder$ 组成 $Transformer$ 模型" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Transformer(nn.Module):\n", " def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):\n", " super(Transformer, self).__init__()\n", " self.encoder = encoder\n", " self.decoder = decoder\n", " self.src_embed = src_embed\n", " self.tgt_embed = tgt_embed\n", " self.generator = generator \n", "\n", " def encode(self, src, src_mask):\n", " return self.encoder(self.src_embed(src), src_mask)\n", "\n", " def decode(self, memory, src_mask, tgt, tgt_mask):\n", " return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)\n", "\n", " def forward(self, src, tgt, src_mask, tgt_mask):\n", " # encoder的结果作为decoder的memory参数传入,进行decode\n", " return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Generator(nn.Module):\n", " # vocab: tgt_vocab\n", " def __init__(self, d_model, vocab):\n", " super(Generator, self).__init__()\n", " # decode后的结果,先进入一个全连接层变为词典大小的向量\n", " self.proj = nn.Linear(d_model, vocab)\n", "\n", " def forward(self, x):\n", " # 然后再进行log_softmax操作(在softmax结果上再做多一次log运算)\n", " return F.log_softmax(self.proj(x), dim=-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**定义设置超参并连接完整模型的函数**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h = 8, dropout=0.1):\n", " c = copy.deepcopy\n", " # 实例化Attention对象\n", " attn = MultiHeadedAttention(h, d_model).to(DEVICE)\n", " # 实例化FeedForward对象\n", " ff = PositionwiseFeedForward(d_model, d_ff, dropout).to(DEVICE)\n", " # 实例化PositionalEncoding对象\n", " position = PositionalEncoding(d_model, dropout).to(DEVICE)\n", " # 实例化Transformer模型对象\n", " model = Transformer(\n", " Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout).to(DEVICE), N).to(DEVICE),\n", " Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout).to(DEVICE), N).to(DEVICE),\n", " nn.Sequential(Embeddings(d_model, src_vocab).to(DEVICE), c(position)),\n", " nn.Sequential(Embeddings(d_model, tgt_vocab).to(DEVICE), c(position)),\n", " Generator(d_model, tgt_vocab)).to(DEVICE)\n", " \n", " # This was important from their code. \n", " # Initialize parameters with Glorot / fan_avg.\n", " for p in model.parameters():\n", " if p.dim() > 1:\n", " # 这里初始化采用的是nn.init.xavier_uniform\n", " nn.init.xavier_uniform_(p)\n", " return model.to(DEVICE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 五. 
模型训练" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**标签平滑**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在训练期间,我们采用了值$\\epsilon_{ls}=0.1$的标签平滑(参见: https://arxiv.org/pdf/1512.00567.pdf ),其实还是从$Computer\\; Vision$上搬过来的,具体操作可以看下面的代码实现,**在这里不作为重点**。 \n", " \n", "这种做法提高了困惑度,因为模型变得更加不确定,但提高了准确性和BLEU分数。 \n", ">我们使用 $KL\\; div\\; loss$(KL散度损失)实现标签平滑。 \n", "对于输出的分布,从原始的 $one\\text{-}hot$ 分布转为在groundtruth上使用一个confidence值,而后其他的所有非groudtruth标签上采用 $\\frac{1 - confidence}{odim - 1}$ 作为概率值进行平滑。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class LabelSmoothing(nn.Module):\n", " \"\"\"标签平滑处理\"\"\"\n", " def __init__(self, size, padding_idx, smoothing=0.0):\n", " super(LabelSmoothing, self).__init__()\n", " self.criterion = nn.KLDivLoss(reduction='sum')\n", " self.padding_idx = padding_idx\n", " self.confidence = 1.0 - smoothing\n", " self.smoothing = smoothing\n", " self.size = size\n", " self.true_dist = None\n", " \n", " def forward(self, x, target):\n", " assert x.size(1) == self.size\n", " true_dist = x.data.clone()\n", " true_dist.fill_(self.smoothing / (self.size - 2))\n", " true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)\n", " true_dist[:, self.padding_idx] = 0\n", " mask = torch.nonzero(target.data == self.padding_idx)\n", " if mask.dim() > 0:\n", " true_dist.index_fill_(0, mask.squeeze(), 0.0)\n", " self.true_dist = true_dist\n", " return self.criterion(x, Variable(true_dist, requires_grad=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这里的size是输出词表的大小,smoothing是用于分摊在非groundtruth上面的概率值。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在这里,我们可以看到标签平滑的示例。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Label smoothing的例子\n", "crit = LabelSmoothing(5, 0, 0.4) # 设定一个ϵ=0.4\n", "predict = torch.FloatTensor([[0, 0.2, 0.7, 0.1, 0],\n", " [0, 0.2, 0.7, 0.1, 0], \n", " [0, 0.2, 0.7, 0.1, 0]])\n", "v = crit(Variable(predict.log()), \n", " Variable(torch.LongTensor([2, 1, 0])))\n", "\n", "# Show the target distributions expected by the system.\n", "print(crit.true_dist)\n", "plt.imshow(crit.true_dist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "如果对给定的选择非常有信心,标签平滑实际上会开始惩罚模型。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "crit = LabelSmoothing(5, 0, 0.1)\n", "def loss(x):\n", " d = x + 3 * 1\n", " predict = torch.FloatTensor([[0, x / d, 1 / d, 1 / d, 1 / d]])\n", " #print(predict)\n", " return crit(Variable(predict.log()), Variable(torch.LongTensor([1]))).item()\n", "\n", "plt.plot(np.arange(1, 100), [loss(x) for x in range(1, 100)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**计算损失**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class SimpleLossCompute:\n", " \"\"\"\n", " 简单的计算损失和进行参数反向传播更新训练的函数\n", " \"\"\"\n", " def __init__(self, generator, criterion, opt=None):\n", " self.generator = generator\n", " self.criterion = criterion\n", " self.opt = opt\n", " \n", " def __call__(self, x, y, norm):\n", " x = self.generator(x)\n", " loss = self.criterion(x.contiguous().view(-1, x.size(-1)), \n", " y.contiguous().view(-1)) / norm\n", " loss.backward()\n", " if self.opt is not None:\n", " self.opt.step()\n", " self.opt.optimizer.zero_grad()\n", " return loss.data.item() * norm.float()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**optimizer优化器**" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "论文里面提到了他们用的优化器,是以$\\beta_1=0.9、\\beta_2=0.98$ 和 $\\epsilon = 10^{−9}$ 的 $Adam$ 为基础,而后使用一种warmup的学习率调整方式来进行调节。 \n", "具体公式如下: \n", " \n", "$$ lrate = d^{−0.5}_{model}⋅min(step\\_num^{−0.5},\\; step\\_num⋅warmup\\_steps^{−1.5})$$ \n", "\n", "基本上就是用一个固定的 $warmup\\_steps$ **先进行学习率的线性增长(热身)**,而后到达 $warmup\\_steps$ 之后会随着 $step\\_num$ 的增长,以 $step\\_num$(步数)的反平方根成比例地**逐渐减小它**,他们用的 $warmup\\_steps = 4000$ ,这个可以针对不同的问题自己尝试。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class NoamOpt:\n", " \"Optim wrapper that implements rate.\"\n", " def __init__(self, model_size, factor, warmup, optimizer):\n", " self.optimizer = optimizer\n", " self._step = 0\n", " self.warmup = warmup\n", " self.factor = factor\n", " self.model_size = model_size\n", " self._rate = 0\n", " \n", " def step(self):\n", " \"Update parameters and rate\"\n", " self._step += 1\n", " rate = self.rate()\n", " for p in self.optimizer.param_groups:\n", " p['lr'] = rate\n", " self._rate = rate\n", " self.optimizer.step()\n", " \n", " def rate(self, step = None):\n", " \"Implement `lrate` above\"\n", " if step is None:\n", " step = self._step\n", " return self.factor * (self.model_size ** (-0.5) * min(step ** (-0.5), step * self.warmup ** (-1.5)))\n", " \n", "def get_std_opt(model):\n", " return NoamOpt(model.src_embed[0].d_model, 2, 4000,\n", " torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "主要调节是在 $rate$ 这个函数中,其中\n", "- $model\\_size$ 即为 $d_{model}$\n", "- $warmup$ 即为 $warmup\\_steps$\n", "- $factor$ 可以理解为初始的学习率" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "以下对该优化器在**不同模型大小($model\\_size$)**和**不同超参数($marmup$)值**的情况下的学习率($lrate$)曲线进行示例。 " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Three settings of the lrate hyperparameters.\n", "opts = [NoamOpt(512, 1, 4000, None), \n", " NoamOpt(512, 1, 8000, None),\n", " NoamOpt(256, 1, 4000, None)]\n", "plt.plot(np.arange(1, 20000), [[opt.rate(i) for opt in opts] for i in range(1, 20000)])\n", "plt.legend([\"512:4000\", \"512:8000\", \"256:4000\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**训练迭代**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "接下来,我们创建一个通用的训练和评分功能来跟踪损失。 我们传入一个上面定义的损失计算函数,它也处理参数更新。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def run_epoch(data, model, loss_compute, epoch):\n", " start = time.time()\n", " total_tokens = 0.\n", " total_loss = 0.\n", " tokens = 0.\n", "\n", " for i , batch in enumerate(data):\n", " out = model(batch.src, batch.trg, batch.src_mask, batch.trg_mask)\n", " loss = loss_compute(out, batch.trg_y, batch.ntokens)\n", "\n", " total_loss += loss\n", " total_tokens += batch.ntokens\n", " tokens += batch.ntokens\n", "\n", " if i % 50 == 1:\n", " elapsed = time.time() - start\n", " print(\"Epoch %d Batch: %d Loss: %f Tokens per Sec: %fs\" % (epoch, i - 1, loss / batch.ntokens, (tokens.float() / elapsed / 1000.)))\n", " start = time.time()\n", " tokens = 0\n", "\n", " return total_loss / total_tokens\n", "\n", "\n", "def train(data, model, criterion, optimizer):\n", " \"\"\"\n", " 训练并保存模型\n", " \"\"\"\n", " # 初始化模型在dev集上的最优Loss为一个较大值\n", " best_dev_loss = 1e5\n", " \n", " for epoch in range(EPOCHS):\n", " # 模型训练\n", " model.train()\n", " run_epoch(data.train_data, model, SimpleLossCompute(model.generator, criterion, optimizer), epoch)\n", 
" model.eval()\n", "\n", " # 在dev集上进行loss评估\n", " print('>>>>> Evaluate')\n", " dev_loss = run_epoch(data.dev_data, model, SimpleLossCompute(model.generator, criterion, None), epoch)\n", " print('<<<<< Evaluate loss: %f' % dev_loss)\n", " # 如果当前epoch的模型在dev集上的loss优于之前记录的最优loss则保存当前模型,并更新最优loss值\n", " if dev_loss < best_dev_loss:\n", " torch.save(model.state_dict(), SAVE_FILE)\n", " best_dev_loss = dev_loss\n", " print('****** Save model done... ******')\n", " print()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# 数据预处理\n", "data = PrepareData(TRAIN_FILE, DEV_FILE)\n", "src_vocab = len(data.en_word_dict)\n", "tgt_vocab = len(data.cn_word_dict)\n", "print(\"src_vocab %d\" % src_vocab)\n", "print(\"tgt_vocab %d\" % tgt_vocab)\n", "\n", "# 初始化模型\n", "model = make_model(\n", " src_vocab, \n", " tgt_vocab, \n", " LAYERS, \n", " D_MODEL, \n", " D_FF,\n", " H_NUM,\n", " DROPOUT\n", " )\n", "\n", "# 训练\n", "print(\">>>>>>> start train\")\n", "train_start = time.time()\n", "criterion = LabelSmoothing(tgt_vocab, padding_idx = 0, smoothing= 0.0)\n", "optimizer = NoamOpt(D_MODEL, 1, 2000, torch.optim.Adam(model.parameters(), lr=0, betas=(0.9,0.98), eps=1e-9))\n", "\n", "train(data, model, criterion, optimizer)\n", "print(f\"<<<<<<< finished train, cost {time.time()-train_start:.4f} seconds\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 六. 模型预测" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def greedy_decode(model, src, src_mask, max_len, start_symbol):\n", " \"\"\"\n", " 传入一个训练好的模型,对指定数据进行预测\n", " \"\"\"\n", " # 先用encoder进行encode\n", " memory = model.encode(src, src_mask)\n", " # 初始化预测内容为1×1的tensor,填入开始符('BOS')的id,并将type设置为输入数据类型(LongTensor)\n", " ys = torch.ones(1, 1).fill_(start_symbol).type_as(src.data)\n", " # 遍历输出的长度下标\n", " for i in range(max_len-1):\n", " # decode得到隐层表示\n", " out = model.decode(memory, \n", " src_mask, \n", " Variable(ys), \n", " Variable(subsequent_mask(ys.size(1)).type_as(src.data)))\n", " # 将隐藏表示转为对词典各词的log_softmax概率分布表示\n", " prob = model.generator(out[:, -1])\n", " # 获取当前位置最大概率的预测词id\n", " _, next_word = torch.max(prob, dim = 1)\n", " next_word = next_word.data[0]\n", " # 将当前位置预测的字符id与之前的预测内容拼接起来\n", " ys = torch.cat([ys, \n", " torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=1)\n", " return ys\n", "\n", "\n", "def evaluate(data, model):\n", " \"\"\"\n", " 在data上用训练好的模型进行预测,打印模型翻译结果\n", " \"\"\"\n", " # 梯度清零\n", " with torch.no_grad():\n", " # 在data的英文数据长度上遍历下标\n", " for i in range(len(data.dev_en)):\n", " # 打印待翻译的英文句子\n", " en_sent = \" \".join([data.en_index_dict[w] for w in data.dev_en[i]])\n", " print(\"\\n\" + en_sent)\n", " # 打印对应的中文句子答案\n", " cn_sent = \" \".join([data.cn_index_dict[w] for w in data.dev_cn[i]])\n", " print(\"\".join(cn_sent))\n", " \n", " # 将当前以单词id表示的英文句子数据转为tensor,并放如DEVICE中\n", " src = torch.from_numpy(np.array(data.dev_en[i])).long().to(DEVICE)\n", " # 增加一维\n", " src = src.unsqueeze(0)\n", " # 设置attention mask\n", " src_mask = (src != 0).unsqueeze(-2)\n", " # 用训练好的模型进行decode预测\n", " out = greedy_decode(model, src, src_mask, max_len=MAX_LENGTH, start_symbol=data.cn_word_dict[\"BOS\"])\n", " # 初始化一个用于存放模型翻译结果句子单词的列表\n", " translation = []\n", " # 遍历翻译输出字符的下标(注意:开始符\"BOS\"的索引0不遍历)\n", " for j in range(1, out.size(1)):\n", " # 获取当前下标的输出字符\n", " sym = data.cn_index_dict[out[0, j].item()]\n", " # 如果输出字符不为'EOS'终止符,则添加到当前句子的翻译结果列表\n", " if sym != 'EOS':\n", " translation.append(sym)\n", " # 否则终止遍历\n", " 
else:\n", " break\n", " # 打印模型翻译输出的中文句子结果\n", " print(\"translation: %s\" % \" \".join(translation))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 预测\n", "# 加载模型\n", "model.load_state_dict(torch.load(SAVE_FILE))\n", "# 开始预测\n", "print(\">>>>>>> start evaluate\")\n", "evaluate_start = time.time()\n", "evaluate(data, model) \n", "print(f\"<<<<<<< finished evaluate, cost {time.time()-evaluate_start:.4f} seconds\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "256px" }, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }