YAN's Blog

instruction-following language models

Posted on 2023-04-14 Edited on 2023-08-29 In NLP

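Many of the new terms and hot research directions in NLP have no unified definition. When I first meet these concepts I am often confused, and I need to read a lot of literature before I can sort out how a model or technical approach evolved. An instructed LM is obtained by fine-tuning a pre-trained LLM. Before that there was a technique called prompt engineering: a way of giving instructions to a large model, where we adjust the input so that the model returns better output and solves our problem. A better explanation, quoted from a blog: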

Prompt Engineering, also known as In-Context Prompting, refers to methods for how to communicate with LLM to steer its behavior for desired outcomes without updating the model weights. It is an empirical science and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics

Prompt engineering grew out of the two ways of prompting an LLM, zero-shot learning and few-shot learning. It is largely driven by experience.

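A great many papers have appeared on prompt engineering. Like the blog's author, I feel that some of them could explain their method in a few sentences yet still take up many pages. What we need is a common benchmark; what we have now is a scattered collection of methodologies. Prompt engineering is not my focus, because it is constrained by many factors. For example, if you use GPT-3 for your task or to build your application, a long input may exceed the context limit, and since the GPT API is billed per token, it can get expensive.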

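Besides prompt engineering, the other way to get satisfying output from an LLM is to fine-tune it: either adjust its parameters directly on a specific dataset (all of them, or only a subset, as in LoRA or prefix-tuning), or use reinforcement learning with a trained scoring (reward) model, which is also a major branch of fine-tuning.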

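The 2023 survey A Survey of Large Language Models describes adaptation tuning of LLMs in detail in its fifth chapter: given a pretrained LLM, how do we make it generalize better across different tasks? That is when we tune the LLM. The authors present two methods: instruction tuning and alignment tuning. The latter uses reinforcement learning so that the model improves its generated text from human feedback; InstructGPT took this approach. The former is a little more involved, but the principle is simple: create a set of instructions with question-answer pairs and fine-tune the LLM on them with a sequence-to-sequence loss.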

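[My personal spicy take] I think this survey is incomplete here and somewhat misleading. Chapter 5 covers only two kinds of adaptation tuning, but before instruction tuning appeared there were already quite a few techniques that help us “further adapt LLM according to specific goals”. The survey also does not explain well why instruction tuning improves performance across tasks. So I want to write a post on what we can do, once we have a pretrained large model, to put it to work on a specific task. See my other post, “Adaptation Tuning of LLMs”.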

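After I came across the Alpaca model, one question kept bothering me: why does an instruction-finetuned model perform better, or rather, on which tasks does it improve? It puzzled me until I read Google's Finetuned Language Models Are Zero-Shot Learners. Instruction tuning was proposed to improve zero-shot performance on unseen tasks. More concretely, on tasks such as reading comprehension, question answering, and natural language inference, researchers found GPT-3's zero-shot performance to be much worse than its few-shot performance. The authors suggest that one potential reason is that, without context, the model struggles with prompts that look very different from its pretraining data. Put plainly: with no examples to refer to, it cannot solve the problem. Instruction tuning offers a very simple remedy. It fine-tunes the model on many tasks, with each task's data reorganized into the form (instruction, [input], output). The fine-tuned model is then evaluated on unseen tasks, and the researchers found that its zero-shot ability on the same task is much better than that of the original model: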

instruction tuning
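To make the (instruction, [input], output) format concrete, here is a minimal sketch of how one record could be turned into a training prompt and target. The template wording follows the Alpaca-style prompt as I remember it from the public repository, so treat the exact text as an assumption; `build_example` and the sample record are hypothetical names made up for illustration.

```python
# Alpaca-style prompt templates (exact wording is an assumption, see the note above).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_example(record):
    """Turn one (instruction, [input], output) record into (prompt, target)."""
    instruction = record["instruction"]
    context = record.get("input", "")  # the input field is optional
    if context:
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    return prompt, record["output"]


# Hypothetical records, one with and one without the optional input.
with_input = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
without_input = {"instruction": "Name a primary color.", "output": "Red"}

prompt_a, target_a = build_example(with_input)
prompt_b, target_b = build_example(without_input)
print(prompt_a + target_a)
```

During fine-tuning the model sees the prompt and is trained to continue it with the target, which is where the sequence-to-sequence loss comes in.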

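Instruction tuning has two prerequisites: 1. you have a pretrained model; 2. you have many instructions. For the first, check which models have been open-sourced, for example the overview in Section 3.1 of A Survey of Large Language Models. Stanford's 2023 Alpaca model is based on Meta's LLaMA, so many GitHub projects now do instruction tuning with LLaMA as the base LLM.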

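So the first problem is solved: at least there are open-source LLMs we can load locally, thanks to Facebook's open-sourcing. The second problem is how to produce many instructions. Stanford's Alpaca uses the method from the paper below, which saves time and effort and costs less than 600 US dollars. There are other ways to produce instructions; see A Survey of Large Language Models, whose authors describe a series of methods for generating instructions from existing datasets, which should also be cheap and fast.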

Self-instruct: Aligning Language Model with Self Generated Instructions introduces a method of self-generated instructions. In short, the LLM itself generates instructions together with their answers, and this data is then used to fine-tune the LLM. The motivation is twofold: 1. Large “instruction-tuned” language models (finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. 2. Producing instruction data is very time-consuming, since it used to be written by humans. The concrete steps are:

The authors start from 175 hand-written instructions as a seed set, use an LLM to generate more instructions from these 175, and then feed the instructions back into the LLM to obtain many input-output pairs. These pairs are used for instruction tuning. The LLM they use is GPT-3. The result is 52k instructions and 82k input-output pairs.

Instruction generation

In a bootstrapping fashion, starting from the human-written instructions, GPT generates more "new and novel" instructions by itself.
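The bootstrapping loop can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' code: `generate` stands in for a call to an LLM, `make_fake_llm` is a hypothetical stub so the example runs offline, and a simple word-set overlap replaces the ROUGE-L similarity filter that, as far as I recall, the paper applies with a threshold of 0.7.

```python
import random


def token_overlap(a, b):
    """Jaccard overlap of lowercase word sets; a crude stand-in for ROUGE-L."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)


def bootstrap_instructions(seed, generate, target_size, max_sim=0.7,
                           n_demos=2, max_rounds=100, rng=None):
    """Grow an instruction pool from a seed set, keeping only novel candidates."""
    rng = rng or random.Random(0)
    pool = list(seed)
    for _ in range(max_rounds):
        if len(pool) >= target_size:
            break
        # Show a few existing instructions to the generator as demonstrations.
        demos = rng.sample(pool, min(n_demos, len(pool)))
        for candidate in generate(demos):
            # Keep a candidate only if it is not too similar to anything in the pool.
            is_novel = all(token_overlap(candidate, old) < max_sim for old in pool)
            if is_novel and len(pool) < target_size:
                pool.append(candidate)
    return pool


def make_fake_llm():
    """Hypothetical offline stub: returns one duplicate and one new instruction."""
    counter = 0

    def generate(demos):
        nonlocal counter
        counter += 1
        new = f"Describe item{counter} using style{counter} for audience{counter}"
        return [demos[0], new]  # the duplicate should be filtered out

    return generate


seed = ["Summarize the following paragraph.", "Translate the sentence into French."]
pool = bootstrap_instructions(seed, make_fake_llm(), target_size=5)
print(pool)
```

In a real pipeline the generated instructions would then be sent back to the LLM to produce the input-output pairs used for tuning.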

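After Alpaca, some teams in China followed Stanford's approach and built their own LLMs, for example https://github.com/LC1332/Chinese-alpaca-lora. Its instructions are the 52k Stanford instructions translated with GPT, and it is based on alpaca-lora, where LoRA stands for low-rank adaptation. The alpaca-lora authors describe their work as "reproducing the Stanford Alpaca results using low-rank adaptation (LoRA)", and say the text quality of the trained instructed model is comparable to text-davinci-003 (GPT-3.5). I am not very familiar with LoRA; if you are interested, read the original paper: https://arxiv.org/pdf/2106.09685.pdf.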
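For readers who want the gist of LoRA before opening the paper, here is a minimal NumPy sketch of the idea as I understand it: the pretrained weight W0 stays frozen, and only a low-rank update B·A is trained. The sizes, the scaling factor `alpha`, and the function name `lora_forward` are illustrative assumptions, not taken from any of the repositories above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 16, 2, 4  # illustrative sizes; r is the low rank

W0 = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init: update starts at 0


def lora_forward(x, W0, A, B, alpha, r):
    """h = x W0^T + (alpha / r) * x A^T B^T, without ever forming B @ A."""
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(3, d_in))               # a batch of 3 input vectors
h = lora_forward(x, W0, A, B, alpha, r)

full_params = W0.size                        # parameters touched by full fine-tuning
lora_params = A.size + B.size                # parameters trained with LoRA
print(h.shape, full_params, lora_params)
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one, and the number of trainable parameters scales with r rather than with the full weight matrix.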

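Reading the Alpaca blog, I noticed that in the evaluation stage Stanford compares Alpaca's results against OpenAI's text-davinci-003. This made me think about how to measure an LLM's performance. Chapter 7 of the survey mentioned above answers this well, covering a series of basic and advanced evaluation tasks. In Section 7.3 the authors also list some widely used, comprehensive public benchmarks, including MMLU, BIG-bench, and HELM. Each contains many tasks and can evaluate an LLM's performance comprehensively.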

stanford alpaca

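This is a large language model open-sourced by Stanford in 2023, based on Meta's LLaMA. It is named Alpaca and has only 7 billion parameters. It is a reference point for instruction tuning. It uses two fairly new techniques. The first is self-instruct, mentioned above: let GPT, or any available LLM, take our hand-written seed instructions and generate many more instructions, including the input and output that go with each one. Stanford generously open-sourced the instruction data generated with GPT-3.5 (text-davinci-003); see GitHub. They also released the code that generates these instructions, which is really nice and makes it easy to get started.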

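What I care most about is how to fine-tune LLaMA with this instruction data; these are my notes from reproducing and reading Stanford's code. I originally wanted to experiment on Meta's open LLaMA 7B model, but found that you must apply in advance to obtain Meta's weights; see the Hugging Face transformers page for details.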

Stanford's code repository can be found on GitHub.


Reading notes on machine translation papers

Posted on 2023-03-13 Edited on 2023-12-15 In NLP

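Machine translation is usually taught right after language modeling. Its predecessor (before 2010) was statistical machine translation, but since neural machine translation appeared, statistical approaches have become much less common. If you are interested, you can look into the details of statistical machine translation. This post mainly records the key papers and research in NMT. NMT mostly uses an encoder-decoder architecture, which is a typical seq-to-seq model. Its definition: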

Neural Machine Translation (NMT) is a way to do Machine Translation with a single end-to-end neural network

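Its general architecture looks like this:

All NMT models are based on one common mathematical formulation:

Note that this differs from the formulation of statistical machine translation: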

A statistical translation model solves the translation model and the language model separately, which involves a lot of feature engineering and is very complex.
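The two formulations being contrasted here are usually written as below. This is the standard textbook form (as presented in courses such as CS224n), not copied from a specific figure; the notation, with x as the source sentence and y as the target sentence of length T, is my own choice.

```latex
% NMT: model the conditional probability directly, one target token at a time
P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)

% SMT: apply Bayes' rule and model the two factors separately
\hat{y} = \arg\max_{y} P(y \mid x)
        = \arg\max_{y} \underbrace{P(x \mid y)}_{\text{translation model}} \;
                       \underbrace{P(y)}_{\text{language model}}
```

The NMT factorization is learned end to end by a single network, whereas in SMT each factor is a separate component with its own features.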

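In machine translation, the encoder-decoder architecture went through several evolutions before attention was added. For an overview of the architectures, see Neural Machine Translation: A Review and Survey. Chapter 5 of that paper covers encoding the source sentence into a fixed-length vector C. There are two ways to use C: 1. as the decoder's initial state; 2. as a fixed input at every decoder time step, combined with the regular input to compute the hidden state: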

Encoder-decoder architectures with fixed-length sentence encodings

These papers run from Sequence to Sequence Learning with Neural Networks to Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. After that came the transition to the attention era, so the review spends only a short Chapter 5 on them, and Chapter 6 moves on to attentional encoder-decoder networks.

The concept of attention is no longer just a technique to improve sentence lengths in NMT. Since its introduction by Bahdanau et al. (2015) it has become a vital part of various NMT architectures, culminating in the Transformer architecture

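This sentence is the essence of Section 6.1: attention is no longer limited to the uses described above, such as initializing the decoder or duplicating the context at every step. Bahdanau's 2015 paper, the one that introduced attention to NMT, broke this convention for good. In the Transformer architecture the RNN has disappeared entirely, and only the computation of attention weights remains.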

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation 2014

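Before attention was proposed, this was the best-performing RNN model with an encoder-decoder architecture in machine translation. The model is quite simple: the encoder encodes the input sequence into a fixed vector, the context, and based on this vector the decoder produces one word per time step. The computation at each decoder time step is:

\(y_t\) is obtained from \(s_t\).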

This paper is also easy to understand when read alongside code. The code is implemented in PyTorch; the walkthrough starts from Sequence to Sequence Learning with Neural Networks, and the improvement made by Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation is:

As we can see, the advantage of this model is that predicting \(y_t\) also uses the context and \(y_{t-1}\), rather than relying on \(s_t\) alone.
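The difference between the two decoders can be written side by side. This is a sketch in the notation used above (\(s_t\) for the decoder state, \(c\) for the fixed context vector), with \(f\) and \(g\) standing for the recurrent unit and the output layer; the symbols are my own shorthand rather than a copy of either paper's equations.

```latex
% Sutskever et al. (2014): the context c only initialises the decoder
s_0 = c, \qquad s_t = f(s_{t-1},\, y_{t-1}), \qquad
P(y_t \mid y_{<t}, x) = g(s_t)

% Cho et al. (2014): c and y_{t-1} enter every step and the output layer
s_t = f(s_{t-1},\, y_{t-1},\, c), \qquad
P(y_t \mid y_{<t}, x) = g(s_t,\, y_{t-1},\, c)
```

In the second form the decoder no longer has to carry all source information inside \(s_t\), because \(c\) is available again at every step.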

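All the papers above encode the input sentence into a fixed-length vector. Starting with the 2015 paper by Bahdanau below, attention came into use in NMT. It addresses the fixed-length-vector problem: we no longer have to squeeze all the information of the input sentence into one fixed-length vector.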

Neural Machine Translation by Jointly Learning to Align and Translate 2015

From this paper on, the attention mechanism was used in translation.

The most important sentence in the Introduction:

The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation

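That is, unlike earlier encoder-decoder translation models, the proposed model, although still an encoder-decoder, does not encode the input sentence into one fixed-length vector. It encodes the input sentence into a sequence of vectors and adaptively chooses a small subset of them for decoding.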

At the time the paper was published, the best-performing machine translation models were RNNs with LSTM units, which we can call RNN Encoder-Decoders.

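Another finding is that these encoder and decoder blocks are mostly stacked RNNs, that is, several RNN layers on top of each other. This finding can be traced back to a paper whose author found that on NMT tasks high-performing RNNs are usually multi-layer: for the encoder RNN, 2 to 4 layers work best, and for the decoder RNN, 4 layers work best. A 2-layer stacked RNN is usually a lot better than a single-layer one. To handle long dependencies, LSTM cells are necessary but not sufficient; other techniques such as skip connections and dense connections are also needed.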

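It is worth mentioning that although Bahdanau's 2015 paper is very popular, I later found, through CS224n and the TensorFlow documentation (Neural machine translation with attention), that the architecture from Luong's 2015 paper is used more often. Its formulas differ slightly from Bahdanau's, and Luong's paper itself spells out the differences: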

Comparison to (Bahdanau et al., 2015) – While our global attention approach is similar in spirit to the model proposed by Bahdanau et al. (2015), there are several key differences which reflect how we have both simplified and generalized from the original model. First, we simply use hidden states at the top LSTM layers in both the encoder and decoder as illustrated in Figure 2. Bahdanau et al. (2015), on the other hand, use the concatenation of the forward and backward source hidden states in the bi-directional encoder and target hidden states in their non-stacking unidirectional decoder. Second, our computation path is simpler; we go from \(h_t \to a_t \to c_t \to \tilde{h}_t\) then make a prediction as detailed in Eq. (5), Eq. (6), and Figure 2. On the other hand, at any time t, Bahdanau et al. (2015) build from the previous hidden state \(h_{t-1} \to a_t \to c_t \to h_t\), which, in turn, goes through a deep-output and a maxout layer before making predictions. Lastly, Bahdanau et al. (2015) only experimented with one alignment function, the concat product; whereas we show later that the other alternatives are better.

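So for attention-based machine translation models, we only need to remember the computation below, since it is no longer the prevailing approach to machine translation anyway (after all, the Transformer had not appeared in 2015):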
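As a rough illustration of that computation path (\(h_t \to a_t \to c_t \to \tilde{h}_t\)), here is a minimal NumPy sketch of one decoder step of Luong-style global attention with the dot-product score. The shapes, the random weights, and the function name `luong_attention_step` are assumptions for illustration; a trained model would learn `W_c` and add an output projection on top of \(\tilde{h}_t\).

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def luong_attention_step(h_t, enc_states, W_c):
    """One decoder step of global attention with the dot score.

    h_t:        (d,)    current decoder hidden state (the query)
    enc_states: (S, d)  encoder hidden states for the S source positions
    W_c:        (d, 2d) projection applied to the concatenation [c_t; h_t]
    """
    scores = enc_states @ h_t                 # (S,) dot score per source position
    a_t = softmax(scores)                     # (S,) alignment weights
    c_t = a_t @ enc_states                    # (d,) context: weighted sum of encoder states
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))  # (d,) attentional hidden state
    return h_tilde, a_t


rng = np.random.default_rng(0)
S, d = 5, 4                                   # illustrative sizes
enc_states = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_c = rng.normal(size=(d, 2 * d))

h_tilde, a_t = luong_attention_step(h_t, enc_states, W_c)
print(a_t.round(3), h_tilde.shape)
```

The alignment weights `a_t` are what gets plotted in the familiar attention heat maps.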

The model above solves several problems that the standard seq2seq model has on NMT tasks:

improves NMT performance
provides more "human-like" model: replace the fixed length vector with dynamic vector according to the decoder hidden states
solves the bottleneck problem: allows decoder to look directly at source
helps with the vanishing gradient problem
provides some interpretability

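Note that although the attention mechanism was first proposed and applied in NMT, it is not exclusive to seq2seq; you can use attention in many architectures and for different tasks. A more general definition of attention is:

We sometimes say that the query attends to the values. For example, in the seq2seq+attention model, each decoder hidden state is a query that attends to all the encoder hidden states (the values).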

Attention is all you need 2017

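In the Transformer paper, the authors begin by setting the scene: the dominant sequence transduction models are based on complex RNNs or CNNs with an encoder and a decoder, and the best-performing ones connect the encoder and decoder through an attention mechanism. The paper proposes a new, simple network architecture, the Transformer, based solely on attention mechanisms, "dispensing with recurrence and convolutions entirely"! No recurrence and no convolution at all. A remarkable network~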

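Before reading this paper, it helps to know the material in my other post, Attention and transformer model: how researchers in translation moved from RNNs to CNNs and finally to a world dominated by the Transformer. The technology went through round after round of iteration, and after each base architecture was proposed, a stream of papers followed with improvements. There are far too many papers to read them all, so it is enough to read a few classics closely. Vaswani's paper is a must-read in NMT. It is short, only 12 pages with references, and the exposition is very terse, which makes the entry barrier high (in my opinion). I started with the paper itself, could not get through it, and went looking for other material. The resources that helped me most: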

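A very accessible blog, which also has a Chinese translation
Neural Machine Translation: A Review and Survey. The paper is long, 90+ pages, but the first six chapters, about 25 pages, serve well as a reference and are very well written
Slides from the Stanford cs231n course. The course is really great. The 2017 videos are on YouTube, but the 2017 lectures do not cover attention, so for now the slides will have to do. I hope Stanford will share the latest lectures some day, which would be a real contribution
The blog recommended by cs231n, a very thorough overview, strongly recommended. The author also includes their own Transformer implementation; among the GitHub implementations it references, Harvard's PyTorch implementation is also worth studying.
The Annotated Transformer: a walkthrough and code implementation of the Attention Is All You Need paper from Harvard NLP, strongly recommended.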

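The Transformer paper has several main innovations:

Using self-attention, and proposing multi-head attention for the first time

When encoding the current word, self-attention tells us how much attention to pay to the other words in the sentence; put simply, it computes the relationship between the current word and every other word. This is also why CNNs for NMT scan the input matrix with kernels of different widths.

Multi-head means that several different self-attention layers process the input in parallel. Intuitively, each head can learn a different kind of relationship, so the model is more expressive. (In the paper each head works in a reduced dimension d_model/h, so the total cost is similar to that of a single full-size head.)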

Positional embeddings

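The previous innovation handles dependencies, but what about position, that is, where in the sentence a word sits when we encode or decode it? The paper solves this with positional embeddings. The positional embedding has the same shape as the input embedding, so the two can be added directly. The paper offers two encoding schemes: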

1) sinusoidal positional encoding
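For reference, the sinusoidal encoding defined in the Transformer paper is:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right)
```

Even dimensions use the sine and odd dimensions the cosine, with wavelengths that grow geometrically along the embedding dimension.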

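Here pos is the position in the sentence (0, ..., L-1 in the code below, where L is the length of the input sentence) and i indexes a pair of dimensions within one positional encoding, so that 2i and 2i+1 together cover the d_model dimensions. A Python implementation: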

import numpy as np
import tensorflow as tf

def positional_encoding(length, depth):
  # Half of the channels hold sines, the other half cosines.
  depth = depth/2

  positions = np.arange(length)[:, np.newaxis]     # (seq, 1)
  depths = np.arange(depth)[np.newaxis, :]/depth   # (1, depth)

  angle_rates = 1 / (10000**depths)         # (1, depth)
  angle_rads = positions * angle_rates      # (pos, depth)

  # Note: sines and cosines are concatenated here, not interleaved as in the paper's formula.
  pos_encoding = np.concatenate(
      [np.sin(angle_rads), np.cos(angle_rads)],
      axis=-1) 

  return tf.cast(pos_encoding, dtype=tf.float32)

pos_encoding = positional_encoding(length=2048, depth=512)

# Check the shape.
print(pos_encoding.shape) # (2048, 512)

2) learned positional encoding

Taken as a whole, the Transformer proposed in this paper has the following architecture for translation:

The encoder stack contains 6 encoder blocks and the decoder stack contains 6 decoder blocks. Taking one encoder block apart, it has two sub-layers:

A decoder block has one more sub-layer than an encoder block. Both its self-attention and its encoder-decoder attention are multi-head attention layers, except that the first one in the decoder is a masked multi-head attention, which keeps future information from leaking into the present (prevent positions from attending to the future).

The Transformer also uses residual connections, so within each encoder block the data flows as follows:
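The flow through one encoder block (sub-layer, residual add, layer normalization, twice) can be sketched in NumPy as below. This is a simplified sketch under stated assumptions: `layer_norm` omits the learnable gain and bias, and the two sub-layers are hypothetical placeholder functions rather than real attention or feed-forward layers, since only the wiring is being illustrated.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize over the last axis (no learnable gain/bias in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def encoder_block(x, self_attention, feed_forward):
    """x -> LayerNorm(x + SelfAttention(x)) -> LayerNorm(. + FeedForward(.))."""
    x = layer_norm(x + self_attention(x))   # sub-layer 1 with residual connection
    x = layer_norm(x + feed_forward(x))     # sub-layer 2 with residual connection
    return x


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))              # (batch, seq_len, d_model)
W = rng.normal(size=(8, 8))


# Placeholder sub-layers that keep the shape (assumptions, not real layers).
def toy_attention(t):
    return t @ W


def toy_feed_forward(t):
    return np.maximum(t @ W, 0.0)


out = encoder_block(x, toy_attention, toy_feed_forward)
print(out.shape)
```

Because every sub-layer preserves the shape (batch, seq_len, d_model), blocks can be stacked six times without any reshaping in between.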

The computation inside self-attention is:

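All of these operations are matrix multiplications, with no notion of time steps as in an RNN, so they can all be parallelized, which makes training and inference more efficient. What is more, Transformers can capture distant dependencies, which RNNs are not good at: in an RNN, information from early in the sequence has to pass through many time steps before it can influence a much later word, whereas the Transformer can access every word of the input sentence when processing each word (after all, our Z contains the relationship between each word and all the others).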
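The self-attention computation, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, fits in a few lines of NumPy. This is a minimal single-head sketch; the sizes and the random projection matrices `W_q`, `W_k`, `W_v` are illustrative assumptions standing in for learned weights.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return (Z, weights) where Z = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (T, T) pairwise word scores
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, d_model = 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

# Self-attention: queries, keys and values all come from the same input X.
Z, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(Z.shape, weights.sum(axis=-1))
```

Every row of `weights` is computed in the same matrix product, which is exactly why no time-step loop is needed.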

For now, positional encoding can be understood simply as adding a positional encoding, with exactly the same dimensions, on top of the word embedding.

TensorFlow has a MultiHeadAttention layer. To implement the Transformer's self-attention with it, query, key, and value are all computed from the same input vectors.

The theory above may still seem a bit fuzzy. It helps to read the details in the Illustrated Transformer blog alongside a TensorFlow implementation of the Transformer.

Encoder

Each encoder block consists of two sub-layers, each wrapped in a residual (ResNet-style) connection.

def multihead_attention(self, query, memory=None, mask=None, scope='attn'):
        """
        Args:
            query (tf.tensor): of shape (batch, q_size, d_model)
            memory (tf.tensor): of shape (batch, m_size, d_model)
            mask (tf.tensor): shape (batch, q_size, k_size)

        Returns:
            a tensor of shape (bs, q_size, d_model)
        """
        if memory is None:  # with no memory given, this is a plain self-attention layer
            memory = query

        with tf.variable_scope(scope):
            # Linear project to d_model dimension: [batch, q_size/k_size, d_model]
            Q = tf.layers.dense(query, self.d_model, activation=tf.nn.relu)
            K = tf.layers.dense(memory, self.d_model, activation=tf.nn.relu)
            V = tf.layers.dense(memory, self.d_model, activation=tf.nn.relu)

            # Split the matrix to multiple heads and then concatenate to have a larger
            # batch size: [h*batch, q_size/k_size, d_model/num_heads]
            Q_split = tf.concat(tf.split(Q, self.h, axis=2), axis=0)
            K_split = tf.concat(tf.split(K, self.h, axis=2), axis=0)
            V_split = tf.concat(tf.split(V, self.h, axis=2), axis=0)
            mask_split = tf.tile(mask, [self.h, 1, 1])

            # Apply scaled dot product attention
            out = self.scaled_dot_product_attention(Q_split, K_split, V_split, mask=mask_split)

            # Merge the multi-head back to the original shape
            out = tf.concat(tf.split(out, self.h, axis=0), axis=2)  # [bs, q_size, d_model]

            # The final linear layer and dropout.
            # out = tf.layers.dense(out, self.d_model)
            # out = tf.layers.dropout(out, rate=self.drop_rate, training=self._is_training)

        return out
    
def feed_forwad(self, inp, scope='ff'):
        """
        Position-wise fully connected feed-forward network, applied to each position
        separately and identically. It can be implemented as (linear + ReLU + linear) or
        (conv1d + ReLU + conv1d).

        Args:
            inp (tf.tensor): shape [batch, length, d_model]
        """
        out = inp
        with tf.variable_scope(scope):
            # out = tf.layers.dense(out, self.d_ff, activation=tf.nn.relu)
            # out = tf.layers.dropout(out, rate=self.drop_rate, training=self._is_training)
            # out = tf.layers.dense(out, self.d_model, activation=None)

            # by default, use_bias=True
            out = tf.layers.conv1d(out, filters=self.d_ff, kernel_size=1, activation=tf.nn.relu)
            out = tf.layers.conv1d(out, filters=self.d_model, kernel_size=1)

        return out  
    
def encoder_layer(self, inp, input_mask, scope):
        """
        Args:
            inp: tf.tensor of shape (batch, seq_len, embed_size)
            input_mask: tf.tensor of shape (batch, seq_len, seq_len)
        """
        out = inp
        with tf.variable_scope(scope):
            # One multi-head attention + one feed-forward
            out = self.layer_norm(out + self.multihead_attention(out, mask=input_mask))
            out = self.layer_norm(out + self.feed_forwad(out))
        return out

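Decoder

In the decoder, each decoder block has two inputs: the output of the whole encoder stack and the output of the previous decoder block (for the first decoder block, the target word embeddings). The encoder output feeds the second sub-layer of every decoder block. As mentioned above, a decoder block differs from an encoder block by one extra sub-layer: encoder-decoder attention. As for how the encoder and decoder are connected: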

The encoder start by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer which helps the decoder focus on appropriate places in the input sequence

That is, once we have the output of the encoder's top layer (the last encoder layer), we transform it into K and V. In the code, this shows up as memory being enc_out in the call to multihead_attention.

def decoder_layer(self, target, enc_out, input_mask, target_mask, scope):
        out = target
        with tf.variable_scope(scope):
            out = self.layer_norm(out + self.multihead_attention(
                out, mask=target_mask, scope='self_attn'))
            out = self.layer_norm(out + self.multihead_attention(
                out, memory=enc_out, mask=input_mask))  # the encoder output serves as the memory (K and V)
            out = self.layer_norm(out + self.feed_forwad(out))
        return out

def decoder(self, target, enc_out, input_mask, target_mask, scope='decoder'):
        out = target
        with tf.variable_scope(scope):
            for i in range(self.num_enc_layers):
                out = self.decoder_layer(out, enc_out, input_mask, target_mask, f'dec_{i}')
        return out

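I find the implementation above a little complicated. TensorFlow 2.0+ ships an official layers.MultiHeadAttention, which should simplify things considerably, especially the multihead_attention(self, query, memory=None, mask=None, scope='attn') method above. The implementation also shows that, apart from the attention in the second sub-layer of each decoder block, all the attention computations are identical. I looked at quite a few TF2 Transformer implementations on GitHub (the standard one being the Attention Is All You Need model) and found many of them mediocre; in the end the tutorial in the official TensorFlow documentation is the best written.

Let us now go through the official code, following the TensorFlow tutorial and the Transformer computation above.

First define a BaseAttention class, then build the encoder and decoder attention layers on top of it: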

class BaseAttention(tf.keras.layers.Layer):
  def __init__(self, **kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    self.layernorm = tf.keras.layers.LayerNormalization()
    self.add = tf.keras.layers.Add()

How do we handle the cross-attention layer, where the encoder output enters the decoder? Here we call MultiHeadAttention with the target sequence x as the query and the encoder output as the context sequence, that is, as key and value.

class CrossAttention(BaseAttention): # the layer where the encoder output enters the decoder
  def call(self, x, context): # x is the target sequence, context is the encoder output
    attn_output, attn_scores = self.mha(
        query=x,
        key=context,
        value=context,
        return_attention_scores=True)

    # Cache the attention scores for plotting later.
    self.last_attn_scores = attn_scores

    x = self.add([x, attn_output])
    x = self.layernorm(x)

    return x

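Next we define global attention. Global attention involves no special handling (unlike the attention above, which has a separate context). Most attention in the Transformer is self-attention, where the query, key, and value passed to MultiHeadAttention are all the same tensor.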

class GlobalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x

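Finally we define the causal self-attention layer, the first sub-layer (the self-attention layer) of every decoder block. It is almost the same as the global attention layer, with one small difference. Why? In the decoder we predict one word at a time, which implies a causal relationship: when predicting a word we should know the words before it. In an RNN, passing the hidden state on to the next time step carries this causality. The global attention layer we just implemented does not include it. Moreover, if we run ordinary self-attention with the whole target sequence fed into the first decoder block, the current position risks seeing future data. The Transformer paper therefore uses masking to avoid this.

In TensorFlow this is easy: just pass use_causal_mask = True when calling MultiHeadAttention: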

class CausalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x,
        use_causal_mask = True) # The causal mask ensures that each location only has access to the locations that come before it
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x

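This guarantees that earlier positions in the sequence do not depend on later elements. I initially wondered whether this means the causal layer cannot do what a bidirectional RNN does. On reflection that is a different matter: the backward direction of a bidirectional RNN feeds the later words in first, that is, it reads the sequence from back to front, which is how it learns how much the current word depends on the words after it.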
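To see what the causal mask does to the attention weights, here is a small NumPy sketch, independent of TensorFlow. It applies a lower-triangular mask to a matrix of raw scores before the softmax; the function name `causal_attention_weights` and the all-zero scores are illustrative assumptions chosen so the result is easy to read.

```python
import numpy as np


def causal_attention_weights(scores):
    """Softmax over scores where position t may only attend to positions <= t."""
    T = scores.shape[-1]
    mask = np.tril(np.ones((T, T), dtype=bool))      # True on and below the diagonal
    masked = np.where(mask, scores, -np.inf)          # future positions get -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                                # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)


scores = np.zeros((4, 4))                             # equal raw scores everywhere
weights = causal_attention_weights(scores)
print(weights)
# Row t spreads its weight evenly over positions 0..t:
# [1, 0, 0, 0], [0.5, 0.5, 0, 0], [1/3, 1/3, 1/3, 0], [0.25, 0.25, 0.25, 0.25]
```

The zeros above the diagonal are the "future" positions that each token is not allowed to look at.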

Supplementary notes

tf.keras.layers.MultiHeadAttention

doc

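Note that two results are returned (when return_attention_scores=True). The second dimension of attention_output's shape matches the length of the target sequence, and E matches the last dimension of query.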

Attention Family

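This section is compiled from a blog. Its author had earlier written a post introducing attention and then, in January 2023, published two more posts detailing the new Transformer models that have appeared since 2020. I record here, as study notes, the knowledge I still need to fill in.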

The Transformer (which will be referred to as “vanilla Transformer” to distinguish it from other enhanced versions; Vaswani, et al., 2017) model has an encoder-decoder architecture, as commonly used in many NMT models. Later simplified Transformer was shown to achieve great performance in language modeling tasks, like in encoder-only BERT or decoder-only GPT.

Custom layers, models, and training in TF2

Posted on 2023-02-13 Edited on 2023-04-28 In tensorflow

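In the Coursera course on advanced TensorFlow usage, the instructor briefly covered custom layers and custom models, but the coverage turned out to be fairly basic: apart from the two overridable functions __init__ and call, nothing else was introduced. I happened upon a blog post that explains in detail how to build models with subclassing in TensorFlow. It is very well written, so here is the link.

TensorFlow offers three ways to build a model. 1) The Sequential API: create a Sequential instance, then add layers to the model one at a time with add, for example: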

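import tensorflow as tf

# Example values (assumptions for illustration): 32x32 RGB images, 10 classes.
input_dim = (32, 32, 3)
output_dim = 10

# declare input shape 
seq_model = tf.keras.Sequential()
seq_model.add(tf.keras.Input(shape=input_dim))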

# Block 1
seq_model.add(tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"))
seq_model.add(tf.keras.layers.MaxPooling2D(3))
seq_model.add(tf.keras.layers.BatchNormalization())

# Block 2
seq_model.add(tf.keras.layers.Conv2D(64, 3, activation="relu"))
seq_model.add(tf.keras.layers.BatchNormalization())
seq_model.add(tf.keras.layers.Dropout(0.3))

# Now that we apply global max pooling.
seq_model.add(tf.keras.layers.GlobalMaxPooling2D())

# Finally, we add a classification layer.
seq_model.add(tf.keras.layers.Dense(output_dim))

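The Sequential style is not used much by researchers. As models grow more complex, it no longer appears in the official model code in TensorFlow's applications module.

2) The Functional API: as the name suggests, the model is built through function calls: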

# declare input shape 
input = tf.keras.Input(shape=(input_dim))

# Block 1
x = tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu")(input)
x = tf.keras.layers.MaxPooling2D(3)(x)
x = tf.keras.layers.BatchNormalization()(x)

# Block 2
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.3)(x)

# Now that we apply global max pooling.
gap = tf.keras.layers.GlobalMaxPooling2D()(x)

# Finally, we add a classification layer.
output = tf.keras.layers.Dense(output_dim)(gap)

# bind all
func_model = tf.keras.Model(input, output)

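Note: this approach ends by calling tf.keras.Model() to connect the inputs and outputs.

3) The Model sub-classing API: the third way is the most widely used today. At first I did not understand the difference between subclassing Layer and subclassing Model; both seemed to be just a series of operations that take an input and return an output. The key point is that a Layer bundles a computation with its state, the weights we are familiar with (a Model is itself a Layer that additionally provides training, evaluation, and saving). Take the Dense layer: it performs a linear operation plus an activation, and its weights are the weights assigned to each feature. But this is not the only kind of operation we may want, for example: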

import tensorflow as tf
from tensorflow.keras.layers import Layer


class SimpleQuadratic(Layer):

    def __init__(self, units=32, activation=None):
        '''Initializes the class and sets up the internal variables'''
        super(SimpleQuadratic, self).__init__()
        self.units = units
        self.activation = tf.keras.activations.get(activation)
    
    def build(self, input_shape):
        '''Create the state of the layer (weights)'''
        # a and b are initialized with random normal, c (the bias) with zeros.
        # All three are trainable.
        a_init = tf.random_normal_initializer()
        b_init = tf.random_normal_initializer()
        c_init = tf.zeros_initializer()
        
        self.a = tf.Variable(name = "kernel_a", initial_value = a_init(shape= (input_shape[-1], self.units), 
                                                                    dtype= "float32"), trainable = True)
        
        self.b = tf.Variable(name = "kernel_b", initial_value = b_init(shape= (input_shape[-1], self.units), 
                                                                    dtype= "float32"), trainable = True)
        
        self.c = tf.Variable(name = "bias", initial_value = c_init(shape= (self.units,), 
                                                                    dtype= "float32"), trainable = True)
   
    def call(self, inputs): 
        '''Defines the computation from inputs to outputs'''
        # activation(inputs^2 @ a + inputs @ b + c)
        result = tf.matmul(tf.math.square(inputs), self.a) + tf.matmul(inputs, self.b) + self.c
        return self.activation(result)

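The code above squares the inputs and multiplies them by a, adds the product of the inputs and b, then adds c, and returns the sum. No built-in layer in tf.keras.layers does this, which is where customizing a layer comes in handy. Another convenience is that many models are built from modules whose inner layers are very similar. We can wrap those layers into a Layer subclass (a module), and once the modules are defined, wrap them in a Model, which becomes our final model. This shows the practical difference between Model and Layer subclasses: both take an input, run a series of operations, and return a result, but Layer subclasses are the flexible inner building blocks, whereas the Model subclass is usually the outer class we define last, once every module is in place, and then train.

> In general, we use the Layer class to define the inner computation blocks and will use the Model class to define the outer model, practically the object that we will train. (quoted from the blog)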

You can treat any model as if it were a layer by invoking it on an Input or on the output of another layer. By calling a model you aren't just reusing the architecture of the model, you're also reusing its weights

It is also worth noting that a Model can be called like a layer in the Functional API, for example:

from tensorflow import keras
from tensorflow.keras import layers

encoder_input = keras.Input(shape=(28, 28, 1), name="original_img")
x = layers.Conv2D(16, 3, activation="relu")(encoder_input)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.MaxPooling2D(3)(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.Conv2D(16, 3, activation="relu")(x)
encoder_output = layers.GlobalMaxPooling2D()(x)

encoder = keras.Model(encoder_input, encoder_output, name="encoder")
encoder.summary()

decoder_input = keras.Input(shape=(16,), name="encoded_img")
x = layers.Reshape((4, 4, 1))(decoder_input)
x = layers.Conv2DTranspose(16, 3, activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, activation="relu")(x)
x = layers.UpSampling2D(3)(x)
x = layers.Conv2DTranspose(16, 3, activation="relu")(x)
decoder_output = layers.Conv2DTranspose(1, 3, activation="relu")(x)

decoder = keras.Model(decoder_input, decoder_output, name="decoder")
decoder.summary()

autoencoder_input = keras.Input(shape=(28, 28, 1), name="img")
encoded_img = encoder(autoencoder_input)
decoded_img = decoder(encoded_img)
autoencoder = keras.Model(autoencoder_input, decoded_img, name="autoencoder")
autoencoder.summary()

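A model defined by sub-classing cannot call summary() to display its full architecture out of the box. The author offers a workaround: github comments

The fix is to add a build_graph method to the Model subclass: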


def build_graph(self, raw_shape):
    # Wrap call() in a functional Model so that Keras can trace the layer graph.
    x = tf.keras.layers.Input(shape=raw_shape)
    return tf.keras.Model(inputs=[x], outputs=self.call(x))

Now summary() can be called as usual:

model.build_graph(raw_input).summary()
# You can also use tf.keras.utils.plot_model to render the architecture to a PNG
tf.keras.utils.plot_model(
    model.build_graph(raw_input),                      # here is the trick (for now)
    to_file='model.png', dpi=96,              # saving  
    show_shapes=True, show_layer_names=True,  # show shapes and layer name
    expand_nested=False                       # will show nested block
)

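The author also recommends a blog post on the various ways of saving models in TensorFlow: blog link. Highly recommended.

In summary:

For models built with the Functional API, the best way to save and load is: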


model.save('path_to_my_model.h5')
del model
model = keras.models.load_model('path_to_my_model.h5')

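This saves the model architecture, the weights, and the training configuration (that is, what was passed to model.compile()).

For models built by sub-classing, the recommended way is save_weights: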

model.save_weights('path_to_my_weights', save_format='tf')

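To load the weights, you must have the original code that built the sub-classed model. You also need to build the model with that code first, so that it knows the shape and dtype of the input tensor; without this build step the program raises an error.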


new_model = MiniInception()
new_model.build((None, *x_train.shape[1:]))  # or .build(x_train.shape)
new_model.load_weights('path_to_my_weights')

tf.function

When writing a custom training loop we often use the @tf.function decorator:

@tf.function
def train_step(step, x, y):
    '''
    input: x, y <- typically batches 
    input: step <- batch step
    return: loss value
    '''
    # start the scope of gradient 
    with tf.GradientTape() as tape:
        logits = model(x, training=True) # forward pass
        train_loss_value = loss_fn(y, logits) # compute loss 

    # compute gradient 
    grads = tape.gradient(train_loss_value, model.trainable_weights)

    # update weights
    optimizer.apply_gradients(zip(grads, model.trainable_weights))

    # update metrics
    train_acc_metric.update_state(y, logits)

    # write training loss and accuracy to the tensorboard
    with train_writer.as_default():
        tf.summary.scalar('loss', train_loss_value, step=step)
        tf.summary.scalar(
            'accuracy', train_acc_metric.result(), step=step
        ) 
    return train_loss_value

First, see what happens without the decorator:

def f(x):
    print("Traced with", x)

for i in range(5):
    f(2)
    
f(3)

Output:

Traced with 2
Traced with 2
Traced with 2
Traced with 2
Traced with 2
Traced with 3

With the decorator:

@tf.function
def f(x):
    print("Traced with", x)

for i in range(5):
    f(2)
    
f(3)

Output:

Traced with 2
Traced with 3

With the decorator, even though the loop ran 5 times, "Traced with 2" is printed only once.

If we add a tf.print line right after the print in the code above:

@tf.function
def f(x):
    print("Traced with", x)
    # add tf.print
    tf.print("Executed with", x)
for i in range(5):
    f(2)
    
f(3)

the output becomes:

Traced with 2
Executed with 2
Executed with 2
Executed with 2
Executed with 2
Executed with 2
Traced with 3
Executed with 3

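tf.print runs on every iteration of the loop, as expected. One caveat: a function decorated with tf.function should contain only operations. Creating a new tf.Variable() inside it on every call raises an error, so create variables outside the function (or only on the first call).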
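The difference between "traced" and "executed" can be mimicked in plain Python. The decorator below is a toy analogy, not how TensorFlow is implemented: `toy_function` is a hypothetical name, it caches one "trace" per distinct argument tuple, and the deferred lambdas play the role of graph operations such as tf.print. Real tf.function keys its traces on the input signature (tensor shapes and dtypes as well as Python values).

```python
import functools


def toy_function(fn):
    """Toy stand-in for tf.function: 'trace' once per distinct argument tuple."""
    traces = {}

    @functools.wraps(fn)
    def wrapper(*args):
        if args not in traces:
            # Tracing: run the Python body once and keep the ops it recorded.
            recorded_ops = []
            fn(recorded_ops, *args)
            traces[args] = recorded_ops
        for op in traces[args]:
            op()                      # Execution: replay only the recorded ops

    return wrapper


log = []


@toy_function
def f(ops, x):
    log.append(f"Traced with {x}")                        # like print: trace time only
    ops.append(lambda: log.append(f"Executed with {x}"))  # like tf.print: every call


for _ in range(5):
    f(2)
f(3)

print("\n".join(log))
```

The log reproduces the pattern shown above: one "Traced" line per new argument value and one "Executed" line per call.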