Prompt Engineering, also known as In-Context
Prompting, refers to methods for how to communicate with LLM to
steer its behavior for desired outcomes without updating the
model weights. It is an empirical science and the effect of prompt
engineering methods can vary a lot among models, thus requiring heavy
experimentation and heuristics.
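To make "steering without updating the weights" concrete, here is a minimal sketch of assembling a zero-shot versus a few-shot prompt as plain text. The helper name `build_prompt` and the `Input:`/`Output:` layout are my own assumptions for illustration, not a format prescribed by any particular model.

```python
def build_prompt(instruction, query, examples=()):
    """Assemble a zero-shot (no examples) or few-shot prompt as plain text.

    The "Input:/Output:" layout is a hypothetical convention for this sketch.
    """
    parts = [instruction.strip()]
    # Each demonstration is an (input, output) pair shown before the real query.
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # The query is left open-ended so the model completes the output.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


instruction = "Classify the sentiment as positive or negative."
zero_shot = build_prompt(instruction, "The plot was dull.")
few_shot = build_prompt(
    instruction,
    "The plot was dull.",
    examples=[("I loved it.", "positive"), ("Waste of time.", "negative")],
)
print(few_shot)
```

The only difference between the two prompts is the demonstrations placed in the context; the model itself is untouched in both cases.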
The 2023 survey A Survey of
Large Language Models covers adaptation tuning of LLMs in detail in Chapter 5.
The authors describe two approaches: the first is instruction tuning, the second is alignment tuning.
[My personal spicy take]
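Before tuning of this kind appeared, there were already quite a few techniques that could help us “further adapt LLM according to
specific goals”. On top of that, the survey does not do a good job of explaining instruction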
Tuning of LLMs”
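the finetuned model's performance improved, or more precisely, on which kinds of tasks did it improve? This question bothered me for a long time, until I read Google's Finetuned Language Models Are
Zero-Shot Learners. Instruction
tuning was proposed as a finetuning method in order to improve zero-shot performance
on unseen tasks. More specifically, on tasks such as reading comprehension, question
answering, and natural language inference, the researchers found that GPT-3's zero-shot learning performs much worse than few-shot learning.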
instruction tuning
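Instruction tuning has two prerequisites: 1. you have a pretrained model; 2. you
have many instructions. For the first prerequisite, you can check which models have already been open-sourced; see A Survey of Large Language
So the first problem is solved: at least we have an open-source LLM that we can load locally, thanks to Facebook for open-sourcing it. The second problem is how to produce a large number of instructions. Stanford's Alpaca model uses the method introduced in the paper below, which saves time and effort and costs no more than 600 US dollars. There are of course other ways to generate instructions; for details see A Survey of Large Language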
Self-instruct: Aligning
Language Model with Self Generated Instructions
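This paper introduces a method that uses self generated instructions
to fine-tune our LLM again. The premises behind this approach are: 1. Large
“instruction-tuned” language models (finetuned to respond to
instructions) have demonstrated a remarkable ability to generalize
zero-shot to new tasks. 2. Producing instruction
data is very time-consuming, because it has traditionally been written by humans. The specific steps are: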
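As I understand the Self-Instruct paper, one step of its pipeline filters newly generated instructions, adding one to the task pool only if its ROUGE-L similarity to every existing instruction is below 0.7. The sketch below illustrates only that filtering idea. The whitespace tokenization, the function names, and the toy instructions are my own simplifications, and the generation step (prompting the LLM with sampled tasks) is left out entirely.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over lowercase whitespace tokens (an assumption)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_new_instructions(pool, candidates, threshold=0.7):
    """Keep a candidate only if it is dissimilar to the pool and to earlier keeps."""
    kept = []
    for cand in candidates:
        if all(rouge_l_f1(cand, old) < threshold for old in pool + kept):
            kept.append(cand)
    return kept


seed_pool = ["Translate the sentence into French.", "Summarize the following article."]
generated = [
    "Translate the sentence into German.",  # near-duplicate of a seed task
    "Write a haiku about autumn.",
    "Write a haiku about autumn.",  # exact duplicate of the previous candidate
]
new_tasks = filter_new_instructions(seed_pool, generated)
print(new_tasks)
```

In a full loop the kept instructions would be appended to the pool and the generation step repeated, so the pool grows while staying diverse.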
The concept of attention is no longer just a technique to improve
sentence lengths in NMT. Since its introduction by Bahdanau et al.
(2015) it has become a vital part of various NMT architectures,
culminating in the Transformer architecture
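Likewise, this paper is easy to understand when read alongside code. The code is implemented in PyTorch, and the PyTorch walkthrough starts from Sequence to Sequence Learning
with Neural Networks, followed by Learning Phrase Representations
using RNN Encoder-Decoder for Statistical Machine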
Machine Translation by Jointly Learning to Align and Translate 2015
The most important distinguishing feature of this approach from the
basic encoder–decoder is that it does not attempt to encode a whole
input sentence into a single fixed-length vector. Instead, it encodes
the input sentence into a sequence of vectors and chooses a subset of
these vectors adaptively while decoding the translation
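A minimal sketch of this idea in plain Python, assuming the additive scoring form score = v · tanh(W s + U h) between the previous decoder state s and each encoder vector h. The weight matrices here are toy identity matrices rather than learned parameters, and all names are mine.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(prev_state, annotations, W, U, v):
    """Score every encoder vector against the decoder state, then average them."""
    projected_state = matvec(W, prev_state)
    scores = []
    for h in annotations:
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, matvec(U, h))]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)  # soft, adaptive selection over the input positions
    dim = len(annotations[0])
    context = [sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(dim)]
    return weights, context


identity = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for learned W and U
encoder_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, context = additive_attention([1.0, 0.0], encoder_vectors, identity, identity, [1.0, 1.0])
print(weights, context)
```

The context vector is recomputed at every decoding step, which is what replaces the single fixed-length sentence vector of the basic encoder-decoder.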
Comparison to (Bahdanau et al., 2015) – While our global attention
approach is similar in spirit to the model proposed by Bahdanau et al.
(2015), there are several key differences which reflect how we have both
simplified and generalized from the original model. First, we simply use
hidden states at the top LSTM layers in both the encoder and decoder as
illustrated in Figure 2. Bahdanau et al. (2015), on the other hand, use
the concatenation of the forward and backward source hidden states in
the bi-directional encoder and target hidden states in their
non-stacking unidirectional decoder. Second, our computation path is
simpler; we go from h_t → a_t → c_t → h̃_t then make a prediction as
detailed in Eq. (5), Eq. (6), and Figure 2. On the other hand, at any
time t, Bahdanau et al. (2015) build from the previous hidden state h_{t−1}
→ a_t → c_t → h_t, which, in turn, goes through a deep-output and a maxout
layer before making predictions. Lastly, Bahdanau et al. (2015) only
experimented with one alignment function, the concat product; whereas we
show later that the other alternatives are better.
We sometimes say that the query attends to the
values. For example, in a seq2seq + attention model, each decoder hidden
state is a query that attends to all of the encoder hidden states (the values).
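Here is a small sketch of "query attends to the values", assuming plain dot-product scoring and using the encoder hidden states as both keys and values. The function name and the toy vectors are hypothetical.

```python
import math


def attend(query, values):
    """Return attention weights and the weighted sum of `values` for one query."""
    # Dot-product score between the query and every value vector.
    scores = [sum(q * x for q, x in zip(query, value)) for value in values]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * value[k] for w, value in zip(weights, values)) for k in range(dim)]
    return weights, output


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # the values
decoder_state = [2.0, 0.0]  # the query
weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

Encoder states that point in the same direction as the decoder state receive more weight, so the context vector is dominated by them.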
Attention is all you need
with recurrence and convolutions entirely"!
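Before reading this paper, you should first go through my other blog post Attention and
layer. Both self-attention and encoder-decoder attention are multi-head
attention layers; the only difference is that the first multi-head attention
layer in the decoder is a masked multi-head
attention, which stops future information from leaking into the present (prevent positions from
attending to the future).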
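The masking can be sketched in a few lines of plain Python: build a lower-triangular mask, set the disallowed scores to negative infinity, and apply softmax row by row. The score matrix below is made up for illustration, and the function names are mine.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores, allowed):
    """Softmax over one row, giving zero weight to disallowed positions."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    peak = max(masked)  # finite, because the diagonal is always allowed
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [[0.5, 1.0, 2.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]  # toy attention scores
mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
for row in weights:
    print(row)
```

The first position can only look at itself, the second at the first two positions, and so on, so no row places weight on a later position.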
            # Merge the multi-head back to the original shape
            out = tf.concat(tf.split(out, self.h, axis=0), axis=2)  # [bs, q_size, d_model]

            # The final linear layer and dropout.
            # out = tf.layers.dense(out, self.d_model)
            # out = tf.layers.dropout(out, rate=self.drop_rate, training=self._is_training)

        return out

    def feed_forwad(self, inp, scope='ff'):
        """
        Position-wise fully connected feed-forward network, applied to each position
        separately and identically. It can be implemented as (linear + ReLU + linear) or
        (conv1d + ReLU + conv1d).

        Args:
            inp (tf.tensor): shape [batch, length, d_model]
        """
        out = inp
        with tf.variable_scope(scope):
            # out = tf.layers.dense(out, self.d_ff, activation=tf.nn.relu)
            # out = tf.layers.dropout(out, rate=self.drop_rate, training=self._is_training)
            # out = tf.layers.dense(out, self.d_model, activation=None)

            # by default, use_bias=True
            out = tf.layers.conv1d(out, filters=self.d_ff, kernel_size=1, activation=tf.nn.relu)
            out = tf.layers.conv1d(out, filters=self.d_model, kernel_size=1)

        return out

    def encoder_layer(self, inp, input_mask, scope):
        """
        Args:
            inp: tf.tensor of shape (batch, seq_len, embed_size)
            input_mask: tf.tensor of shape (batch, seq_len, seq_len)
        """
        out = inp
        with tf.variable_scope(scope):
            # One multi-head attention + one feed-forward
            out = self.layer_norm(out + self.multihead_attention(out, mask=input_mask))
            out = self.layer_norm(out + self.feed_forwad(out))
        return out
The encoder starts by processing the input sequence. The output of the
top encoder is then transformed into a set of attention vectors K and V.
These are to be used by each decoder in its “encoder-decoder attention”
layer which helps the decoder focus on appropriate places in the input
    def decoder_layer(self, target, enc_out, input_mask, target_mask, scope):
        out = target
        with tf.variable_scope(scope):
            # Masked self-attention over the target sequence
            out = self.layer_norm(out + self.multihead_attention(
                out, mask=target_mask, scope='self_attn'))
            # Encoder-decoder attention: the encoder output is passed in as the memory
            out = self.layer_norm(out + self.multihead_attention(
                out, memory=enc_out, mask=input_mask))
            out = self.layer_norm(out + self.feed_forwad(out))
        return out

    def decoder(self, target, enc_out, input_mask, target_mask, scope='decoder'):
        out = target
        with tf.variable_scope(scope):
            for i in range(self.num_enc_layers):
                out = self.decoder_layer(out, enc_out, input_mask, target_mask, f'dec_{i}')
        return out
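I find the transformer implementation above still a little complicated. After all, TensorFlow 2.x already ships an official layers.MultiHeadAttention, which should simplify our implementation a great deal, especially the def multihead_attention(self, query, memory=None, mask=None, scope='attn'): above. From the implementation we just walked through, we can see that, apart from the attention in the second sublayer of each decoder block, which is computed slightly differently, all the other attention computations are exactly the same. I looked through quite a few TF2.0 transformer implementations on GitHub (the most standard one is also Attention
is all you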
    class CausalSelfAttention(BaseAttention):
        def call(self, x):
            # The causal mask ensures that each location only has access to
            # the locations that come before it
            attn_output = self.mha(
                query=x,
                value=x,
                key=x,
                use_causal_mask=True)
            x = self.add([x, attn_output])
            x = self.layernorm(x)
            return x
The Transformer (which will be referred to as
“vanilla Transformer” to distinguish it from other enhanced versions; Vaswani, et al., 2017) model
has an encoder-decoder architecture, as commonly used in many NMT
models. Later simplified Transformer was shown to achieve great
performance in language modeling tasks, like in encoder-only BERT
or decoder-only GPT.
    # Block 1
    x = tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu")(input)
    x = tf.keras.layers.MaxPooling2D(3)(x)
    x = tf.keras.layers.BatchNormalization()(x)

    # Block 2
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dropout(0.3)(x)

    # Now we apply global max pooling.
    gap = tf.keras.layers.GlobalMaxPooling2D()(x)

    # Finally, we add a classification layer.
    output = tf.keras.layers.Dense(output_dim)(gap)

    # bind all
    func_model = tf.keras.Model(input, output)
Model sub-classing API: this third approach is the one used most often nowadays.
    import tensorflow as tf


    class SimpleQuadratic(tf.keras.layers.Layer):
        def __init__(self, units=32, activation=None):
            '''Initializes the class and sets up the internal variables'''
            super(SimpleQuadratic, self).__init__()
            self.units = units
            self.activation = tf.keras.activations.get(activation)

        def build(self, input_shape):
            '''Create the state of the layer (weights)'''
            # a and b are initialized with random normal, c (the bias) with zeros.
            # All three are trainable.
            a_init = tf.random_normal_initializer()
            b_init = tf.random_normal_initializer()
            c_init = tf.zeros_initializer()
            self.a = tf.Variable(name="kernel",
                                 initial_value=a_init(shape=(input_shape[-1], self.units), dtype="float32"),
                                 trainable=True)
            self.b = tf.Variable(name="kernel",
                                 initial_value=b_init(shape=(input_shape[-1], self.units), dtype="float32"),
                                 trainable=True)
            self.c = tf.Variable(name="bias",
                                 initial_value=c_init(shape=(self.units,), dtype="float32"),
                                 trainable=True)

        def call(self, inputs):
            '''Defines the computation from inputs to outputs'''
            # x^2 * a + x * b + c
            result = tf.matmul(tf.math.square(inputs), self.a) + tf.matmul(inputs, self.b) + self.c
            return self.activation(result)
> In general, we use the Layer class to define the inner computation
blocks and will use the Model class to define the outer model,
practically the object that we will train. (quoted from a blog post)
You can treat any model as if it were a layer by invoking it on an
Input or on the output of another layer. By calling a model
you aren't just reusing the architecture of the model, you're also
reusing its weights
    def build_graph(self, raw_shape):
        x = tf.keras.layers.Input(shape=raw_shape)
        return Model(inputs=[x],
    cm.build_graph(raw_input).summary()

    # On top of that, tf.keras.utils.plot_model can be used to generate a PNG
    tf.keras.utils.plot_model(
        model.build_graph(raw_input),  # here is the trick (for now)
        to_file='model.png', dpi=96,  # saving
        show_shapes=True, show_layer_names=True,  # show shapes and layer name
        expand_nested=False  # will show nested block
    )
    @tf.function
    def train_step(step, x, y):
        '''
        input: x, y <- typically batches
        input: step <- batch step
        return: loss value
        '''
        # start the scope of gradient
        with tf.GradientTape() as tape:
            logits = model(x, training=True)  # forward pass
            train_loss_value = loss_fn(y, logits)  # compute loss

        # compute gradients and update the weights
        # (assumes an `optimizer` defined alongside `model` and `loss_fn`)
        grads = tape.gradient(train_loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # update metrics
        train_acc_metric.update_state(y, logits)

        # write training loss and accuracy to the tensorboard
        with train_writer.as_default():
            tf.summary.scalar('loss', train_loss_value, step=step)
            tf.summary.scalar('accuracy', train_acc_metric.result(), step=step)
        return train_loss_value
    def f(x):
        print("Traced with", x)

    for i in range(5):
        f(2)

    f(3)
    Traced with 2
    Traced with 2
    Traced with 2
    Traced with 2
    Traced with 2
    Traced with 3
    @tf.function
    def f(x):
        print("Traced with", x)

    for i in range(5):
        f(2)

    f(3)
    Traced with 2
    Traced with 3
    @tf.function
    def f(x):
        print("Traced with", x)
        # add tf.print
        tf.print("Executed with", x)

    for i in range(5):
        f(2)

    f(3)