If you pass return_state=True on its own, the layer returns two values. The official docs describe the argument as: "Boolean. Whether to return the last state in addition to the output. Default: False." That is, the output and the final hidden state are returned together, and the output is equal to the final state:
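A minimal sketch of this behavior (my own example; the layer size and input shapes are arbitrary):

```python
import tensorflow as tf

inputs = tf.random.normal([32, 10, 8])  # (batch, time_steps, features)

# return_state=True alone returns [output, final_state]; since
# return_sequences is False, the output is the final hidden state itself.
rnn = tf.keras.layers.SimpleRNN(4, return_state=True)
output, final_state = rnn(inputs)
print(output.shape, final_state.shape)       # (32, 4) (32, 4)
print(tf.reduce_all(output == final_state))  # True
```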
Researchers have proposed many gated RNN variants, but LSTM and GRU
are the most widely-used.
Rule of thumb: LSTM is a good default choice (especially if your data
has particularly long dependencies, or you have lots of training data);
switch to GRUs for speed and fewer parameters.
LSTM doesn’t guarantee that there is no vanishing/exploding gradient,
but it does provide an easier way for the model to learn long-distance
dependencies.
The Region-based CNN (R-CNN) approach [13] to bounding-box object
detection is to attend to a manageable number of candidate object
regions [42, 20] and evaluate convolutional networks [25, 24]
independently on each RoI. R-CNN was extended [18, 12] to allow
attending to RoIs on feature maps using RoIPool, leading to fast speed
and better accuracy. Faster R-CNN [36] advanced this stream by learning
the attention mechanism with a Region Proposal Network (RPN). Faster
R-CNN is flexible and robust to many follow-up improvements (e.g., [38,
27, 21]), and is the current leading framework in several
benchmarks.
The Mask R-CNN paper proposes a new RoIAlign layer, mainly to fix the problems with the RoI pooling layer in the Faster R-CNN network. As a refresher, here is how RoI pooling turns RoIs (regions of interest) of different sizes into fixed-size feature maps:
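A rough NumPy sketch of the idea (my own illustration, not code from any detection library): the RoI is snapped to the feature-map grid, divided into a fixed number of bins, and each bin is max-pooled, so any RoI size yields the same output size. The two rounding steps here are exactly the quantization that RoIAlign later removes:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=2):
    """feature_map: (H, W); roi: (x1, y1, x2, y2) in feature-map coordinates."""
    # 1st quantization: snap the RoI boundaries to the feature-map grid.
    x1, y1, x2, y2 = [int(round(c)) for c in roi]
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    # 2nd quantization: split the region into output_size x output_size bins.
    ys = np.linspace(0, h, output_size + 1, dtype=int)
    xs = np.linspace(0, w, output_size + 1, dtype=int)
    out = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            out[i, j] = region[ys[i]:ys[i+1], xs[j]:xs[j+1]].max()
    return out

fmap = np.arange(64, dtype=float).reshape(8, 8)
print(roi_pool(fmap, (0.6, 0.6, 5.4, 6.2)))  # always a 2x2 output, whatever the RoI size
```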
对于"many to
many"类型的网络,有可能输入的长度不等于输出的长度,在机器翻译的任务中很常见。这种网络也叫Sequence
to
Sequence,首先该网络会经由encoder对输入进行编码,然后再有decoder进行sequence的生成。但是这种网络在长句子中表现很差,如果输入句子的长度很长,encoder网络就很难记忆住所有信息,从而在decoder中翻译出准确的词语。由此,需要用到attention
model。从计算角度来说就是encoder每次都会产生一个固定长度的vector,这对于长句子来说fixed
length的向量很难记住很早之前的信息:
To overcome the weakness that a single fixed-length vector struggles to remember earlier information, the attention mechanism was born! Concretely, at every time step of the decoding stage a different context is produced, and producing this context is exactly the attention computation. The main idea is to compute attention weights before producing y: when computing y at some time step, the weights say how much attention we should pay to each word of the input sentence; more attention means a larger weight. So we compute the weights from the initial decoder state (the previous hidden state of the (post-attention) LSTM) and the encoder outputs, using a dense layer. The weights sum to 1.
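As a sketch of that computation (my own minimal Keras-style illustration of additive attention, not the exact course code): concatenate the previous decoder state with every encoder output, push the result through a small dense network to get alignment scores, and softmax them so the weights sum to 1:

```python
import tensorflow as tf

def attention_weights(enc_outputs, prev_dec_state, dense1, dense2):
    """enc_outputs: (batch, Tx, units); prev_dec_state: (batch, units)."""
    # Repeat the decoder state across all Tx encoder steps and concatenate.
    s = tf.repeat(prev_dec_state[:, None, :], tf.shape(enc_outputs)[1], axis=1)
    e = dense2(dense1(tf.concat([s, enc_outputs], axis=-1)))  # (batch, Tx, 1) scores
    alphas = tf.nn.softmax(e, axis=1)                         # weights sum to 1 over Tx
    context = tf.reduce_sum(alphas * enc_outputs, axis=1)     # (batch, units) context
    return alphas, context

dense1 = tf.keras.layers.Dense(10, activation='tanh')
dense2 = tf.keras.layers.Dense(1)
enc = tf.random.normal([2, 5, 8])
s0 = tf.random.normal([2, 8])
alphas, ctx = attention_weights(enc, s0, dense1, dense2)
print(tf.reduce_sum(alphas, axis=1))  # ~[[1.], [1.]]
```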
In the Transformer paper, the authors open by noting that the dominant sequence transduction models are based on complex RNNs or CNNs with an encoder and a decoder, and that the best-performing models also connect the encoder and decoder through an attention mechanism. The paper proposes a new, simple network architecture called the Transformer, based entirely on attention, "dispensing with recurrence and convolutions entirely"!
No recurrence and no convolutions at all. What a remarkable network!
Before reading this paper, you should be familiar with my other blog post, Attention and transformer model, which covers how researchers in the translation field moved from RNNs to CNNs and finally to the current reign of the Transformer. The technology has gone through round after round of iteration; after each foundational architecture is proposed, a stream of papers follows with improvements. There are thousands of papers and no way to read them all, so it is best to study a few classics carefully. Vaswani's paper is a must-read in the NMT field. It is short, only 12 pages including references, but the introduction is so brief that (in my personal opinion) the barrier to entry is quite high. I started with this paper, found I could not get through it, and went looking for other material; many resources helped me a great deal:
Each block in the decoder has one more sub-layer than a block in the encoder. Both the self-attention and the encoder-decoder attention are multi-head attention layers, except that the first multi-head attention layer in the decoder is a masked multi-head attention, to keep future information from leaking into the present ("prevent positions from attending to the future").
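For example, such a look-ahead (causal) mask can be built as a lower-triangular matrix (a small sketch of the idea; the exact masking API differs between implementations):

```python
import tensorflow as tf

def look_ahead_mask(size):
    # Lower-triangular: position i may attend only to positions <= i.
    return tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(look_ahead_mask(4))
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```

Back to the reference implementation: the remainder of multihead_attention merges the heads back together, followed by the position-wise feed-forward network and the encoder layer.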
```python
        # Merge the multi-head back to the original shape
        out = tf.concat(tf.split(out, self.h, axis=0), axis=2)  # [bs, q_size, d_model]

        # The final linear layer and dropout.
        # out = tf.layers.dense(out, self.d_model)
        # out = tf.layers.dropout(out, rate=self.drop_rate, training=self._is_training)

        return out

    def feed_forwad(self, inp, scope='ff'):
        """
        Position-wise fully connected feed-forward network, applied to each
        position separately and identically. It can be implemented as
        (linear + ReLU + linear) or (conv1d + ReLU + conv1d).
        Args:
            inp (tf.tensor): shape [batch, length, d_model]
        """
        out = inp
        with tf.variable_scope(scope):
            # out = tf.layers.dense(out, self.d_ff, activation=tf.nn.relu)
            # out = tf.layers.dropout(out, rate=self.drop_rate, training=self._is_training)
            # out = tf.layers.dense(out, self.d_model, activation=None)

            # by default, use_bias=True
            out = tf.layers.conv1d(out, filters=self.d_ff, kernel_size=1, activation=tf.nn.relu)
            out = tf.layers.conv1d(out, filters=self.d_model, kernel_size=1)
        return out

    def encoder_layer(self, inp, input_mask, scope):
        """
        Args:
            inp: tf.tensor of shape (batch, seq_len, embed_size)
            input_mask: tf.tensor of shape (batch, seq_len, seq_len)
        """
        out = inp
        with tf.variable_scope(scope):
            # One multi-head attention + one feed-forward, each wrapped in
            # a residual connection and layer normalization.
            out = self.layer_norm(out + self.multihead_attention(out, mask=input_mask))
            out = self.layer_norm(out + self.feed_forwad(out))
        return out
```
The encoder starts by processing the input sequence. The output of the
top encoder is then transformed into a set of attention vectors K and V.
These are used by each decoder in its "encoder-decoder attention"
layer, which helps the decoder focus on appropriate places in the input
sequence.
```python
    def decoder_layer(self, target, enc_out, input_mask, target_mask, scope):
        out = target
        with tf.variable_scope(scope):
            out = self.layer_norm(out + self.multihead_attention(
                out, mask=target_mask, scope='self_attn'))
            # Use the encoder output as the attention memory here.
            out = self.layer_norm(out + self.multihead_attention(
                out, memory=enc_out, mask=input_mask))
            out = self.layer_norm(out + self.feed_forwad(out))
        return out

    def decoder(self, target, enc_out, input_mask, target_mask, scope='decoder'):
        out = target
        with tf.variable_scope(scope):
            for i in range(self.num_enc_layers):
                out = self.decoder_layer(out, enc_out, input_mask, target_mask, f'dec_{i}')
        return out
```
The Transformer implementation above still feels a bit complicated to me. After all, TensorFlow 2.0+ ships an official layers.MultiHeadAttention, which should greatly simplify the implementation, especially the def multihead_attention(self, query, memory=None, mask=None, scope='attn'): above. As that implementation shows, apart from a slight difference in the attention computation of the second sub-layer of each decoder block, every attention computation is exactly the same. I looked at quite a few TF 2.0 Transformer implementations on GitHub (of the standard Attention Is All You Need model) and found many of them mediocre; in the end, the tutorial in the official TensorFlow documentation is the best written.
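For reference, the attention classes in that tutorial share a small base class roughly like the following (reproduced from memory, so treat it as a sketch rather than the exact tutorial code); the CausalSelfAttention below then subclasses it:

```python
import tensorflow as tf

class BaseAttention(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__()
        # Every attention variant shares these three pieces:
        # multi-head attention, a residual add, and layer normalization.
        self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
        self.layernorm = tf.keras.layers.LayerNormalization()
        self.add = tf.keras.layers.Add()
```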
```python
class CausalSelfAttention(BaseAttention):
    def call(self, x):
        # The causal mask ensures that each location only has access
        # to the locations that come before it.
        attn_output = self.mha(query=x, value=x, key=x, use_causal_mask=True)
        x = self.add([x, attn_output])
        x = self.layernorm(x)
        return x
```
One thing worth adding here: while moving from encoder-decoder models to the Transformer, my lingering question was, why use a Transformer at all? Why did the method introduced in the paper Effective Approaches to Attention-based Neural Machine Translation gradually fall out of use? I first went to the original Transformer paper, found its introduction very brief, and so went looking for blog posts; I found one explaining why the Transformer is faster than an LSTM.
It says that in the traditional LSTM encoder-decoder architecture for machine translation, introduced in the paper Neural Machine Translation by Jointly Learning to Align and Translate, one problem is that an RNN needs the hidden state of the previous time step to compute the current one. This means training cannot be parallelized: you must finish all the preceding steps before you can compute the current one. The Transformer solves this by discarding the RNN structure entirely; it is basically a fully connected network. The input to the model is a sequence whose words have each been converted into a word vector. A traditional RNN would consume these vectors one at a time, but the Transformer instead projects the embeddings into three vector spaces, namely the Q, K, V we will see later.
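Concretely, "projecting the embedding into three vector spaces" is just three learned linear maps applied to the whole sequence at once, which is why there is no previous time step to wait for (a minimal sketch with made-up dimensions):

```python
import tensorflow as tf

d_model = 512
x = tf.random.normal([2, 7, d_model])  # (batch, seq_len, d_model): the whole sentence at once

# Three independent linear projections produce Q, K, V in parallel;
# nothing here depends on a "previous time step".
wq = tf.keras.layers.Dense(d_model, use_bias=False)
wk = tf.keras.layers.Dense(d_model, use_bias=False)
wv = tf.keras.layers.Dense(d_model, use_bias=False)
q, k, v = wq(x), wk(x), wv(x)
print(q.shape, k.shape, v.shape)  # (2, 7, 512) each
```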
- Non sequential: sentences are processed as a whole rather than word by word.
- Self Attention: this is the newly introduced "unit" used to compute similarity scores between words in a sentence.
- Positional embeddings: another innovation introduced to replace recurrence. The idea is to use fixed or learned weights which encode information related to a specific position of a token in a sentence (see the sketch after this list).
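The fixed variant is the sinusoidal encoding from the Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A short sketch:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]  # (max_len, 1)
    i = np.arange(d_model)[None, :]    # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cosine
    return pe

print(positional_encoding(50, 512).shape)  # (50, 512), added to the token embeddings
```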
In machine translation, CNNs were once also widely used to handle long-sentence dependencies. Beyond that, CNNs share parameters and can be parallelized on a GPU. CNNs learn dependencies with kernels of different widths: width=2 learns the dependency between two words, width=3 between three, and so on. But a long sentence can contain many dependency combinations, which would require many kernels of different widths, and that is impractical. Although CNNs are rarely used for seq2seq problems nowadays, I see them as a bridge between RNN-style models and the Transformer, and they also help in understanding the Attention Is All You Need paper. If you are interested, read Convolutional Sequence to Sequence Learning, 2017.
The Transformer (which will be referred to as
"vanilla Transformer" to distinguish it from other enhanced versions; Vaswani, et al., 2017) model
has an encoder-decoder architecture, as commonly used in many NMT
models. Later, simplified Transformers were shown to achieve great
performance in language modeling tasks, as in the encoder-only BERT
or the decoder-only GPT.
First, we use a language modeling objective on the unlabeled data to
learn the initial parameters of a neural network model. Subsequently, we
adapt these parameters to a target task using the corresponding
supervised objective.
OpenAI GPT uses a Transformer Decoder architecture
as opposed to BERT's
Transformer Encoder architecture. I have already covered the difference
between the Transformer Encoder and Decoder in this
post; in short, it is as follows:
The Transformer Encoder is essentially a
Bidirectional Self-Attentive Model that uses all the tokens in a
sequence to attend to each token in that sequence,
i.e. for a given word, the attention is computed using all the words
in the sentence and not just the words preceding it in one of the
left-to-right or right-to-left traversal orders.
The Transformer Decoder, in contrast, is a Unidirectional
Self-Attentive Model that uses only the tokens preceding a given token
in the sequence to attend to that token,
i.e. for a given word, the attention is computed using only the words
preceding it in that sentence according to the traversal order,
left-to-right or right-to-left.
Thus, GPT gets its auto-regressive nature from this
directionality provided by the Transformer Decoder as it uses
just the previous tokens from the sequence to predict the next
token.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019
This paper came out shortly after OpenAI's GPT model; its feature-based predecessor is ELMo.
There are two ways to apply pre-trained language representations to downstream tasks: feature-based and fine-tuning. The representative of the first is ELMo, which feeds the pre-trained representations into the downstream task as additional features. In the second, after obtaining the pre-trained representations, the model is plugged into the downstream task and all parameters are fine-tuned together.
Improving Language Understanding by Generative Pre-Training, 2018
GPT stands for "generative pre-trained transformer" (from the paper's "generative pre-training").
The paper gives two main challenges in leveraging more than word-level information from unlabeled text data:
First, it is unclear what type of optimization objectives are most effective at learning text representations that are useful for transfer, i.e. we lack a well-established optimization objective.
Second, there is no consensus on the most effective way to transfer these learned representations to the target task, i.e. there is not yet a good method for transferring the pre-trained knowledge to the target task.
In the introduction, the authors emphasize that their model needs only minor tweaks to its architecture at fine-tuning time to adapt to the target task. For the knowledge transfer, they follow the paper Reasoning about entailment with neural attention, which processes structured text input as a single contiguous sequence of tokens.
Related work: LSTMs had been used as the pre-trained network to capture the representations, but they come with many restrictions. This paper uses Transformer networks instead, which allow capturing longer-range linguistic structure. Further, during fine-tuning the model requires only minimal changes to the architecture, without introducing a substantial number of new parameters.
So far, if the downstream task is classification, fine-tuning the model is simple: take the vector of the last token from the decoder output and feed it into a classifier. But for tasks like QA or textual entailment, the input is usually a pair of sentences or a question-answer combination. The authors' solution is to treat them as one contiguous sequence, with randomly initialized special vectors, such as start/end tokens, inserted in between.
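A toy sketch of this packing (the special token strings below are hypothetical placeholders; the paper actually uses randomly initialized start, delimiter, and extract embeddings):

```python
# Hypothetical special tokens standing in for the paper's randomly
# initialized start / delimiter / extract vectors.
START, DELIM, EXTRACT = '<s>', '<$>', '<e>'

def pack_pair(premise_tokens, hypothesis_tokens):
    # e.g. textual entailment: pack both sentences into one contiguous sequence.
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [EXTRACT]

print(pack_pair(['a', 'man', 'sleeps'], ['someone', 'is', 'awake']))
# ['<s>', 'a', 'man', 'sleeps', '<$>', 'someone', 'is', 'awake', '<e>']
```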
The authors also note that earlier papers did a fair amount of research in this direction, but they all "re-introduce a significant amount of task-specific customization and does not use transfer learning for these additional architectural components".
So how exactly are the model inputs lightly restructured for each task? As shown in the figure:
As the number of transferred decoder layers increases, performance gets better and better. The authors conclude: "This indicates that each layer in the pre-trained model contains useful functionality for solving target tasks".
From the slide above you can see that originally there was only the variable q, which evolved from the initial h. To "add more expressivity to the layer", two changes were made: 1. before the input x goes into the FC that produces the alignment scores, another, separate FC is applied to it; 2. a completely different FC layer is also applied to x before the weighted sum with the attention weights. These two FC layers are there to increase the model's expressive power, and their outputs are what we call the key and the value. The rest of the process is then clear: first compute the attention weights from the query and the key, then combine the attention weights with the value to obtain the context.
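Putting that slide into code (my own sketch; the sizes and layer names are illustrative): two extra dense layers turn the input x into keys and values, the query is scored against the keys, and the softmaxed weights average the values into the context:

```python
import tensorflow as tf

d = 64
to_key = tf.keras.layers.Dense(d)    # extra FC #1: x -> keys, used for scoring
to_value = tf.keras.layers.Dense(d)  # extra FC #2: x -> values, used for the weighted sum

x = tf.random.normal([1, 6, d])      # encoder-side inputs
query = tf.random.normal([1, 1, d])  # e.g. derived from the previous decoder state

keys, values = to_key(x), to_value(x)
scores = tf.matmul(query, keys, transpose_b=True)  # (1, 1, 6) alignment scores
weights = tf.nn.softmax(scores, axis=-1)           # attention weights, summing to 1
context = tf.matmul(weights, values)               # (1, 1, d) context vector
print(context.shape)
```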
An attention layer does a fuzzy lookup like this, but it's not just
looking for the best key. It combines the values based on
how well the query matches each key.
How does that work? In an attention layer the query,
key, and value are each vectors. Instead of
doing a hash lookup the attention layer combines the query
and key vectors to determine how well they match, the
"attention score". The layer returns the average across all the
values, weighted by the "attention scores
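In toy NumPy form (my own illustration of the paragraph above, with made-up keys and values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A "fuzzy dictionary": three key vectors, each with an associated value vector.
keys = np.array([[1., 0.], [0., 1.], [1., 1.]])
values = np.array([[10., 0.], [0., 10.], [5., 5.]])
query = np.array([0.9, 0.1])

scores = softmax(keys @ query)  # how well the query matches each key
output = scores @ values        # average of the values, weighted by the scores
print(scores, output)
```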