RLHF consists of two steps. The first is training a reward model (RM) that scores the answers generated by the LM; the score is a single numeric value. The second is using that RM to tune the LM so that it outputs answers better aligned with human preferences. Some authors also fold SFT into RLHF as its first stage; for example, Section 5.2.3 of A Survey of Large Language Models divides RLHF into three stages.
I still think SFT is better discussed separately, though.
Reward Modeling
Data
Prompts are easy to prepare, but the scoring has to be done by humans, and absolute human scores carry a fair amount of subjective guesswork. So the task is reframed as judging which of two answers is better. LLAMA2, for example, uses four preference levels: significantly better, better, slightly better, or negligibly better / unsure.
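To make this concrete, here is a minimal sketch of a reward model and its pairwise ranking loss. This is my own toy code, assuming an HF-style causal LM backbone; the class name and hyperparameters are illustrative, not any particular implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    # a causal LM backbone plus a scalar value head
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # any HF-style causal LM
        self.value_head = nn.Linear(hidden_size, 1)   # hidden state -> scalar score

    def forward(self, input_ids, attention_mask=None):
        out = self.backbone(input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        last_hidden = out.hidden_states[-1]           # [batch, seq, hidden]
        # score the whole sequence via the final token's hidden state
        return self.value_head(last_hidden[:, -1]).squeeze(-1)   # [batch]

def pairwise_ranking_loss(r_chosen, r_rejected):
    # Bradley-Terry style objective: the preferred answer should score higher
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

LLAMA2 additionally subtracts a margin inside the sigmoid that grows with the preference level, so "significantly better" pairs get separated more than "negligibly better" ones.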
In designing accelerators, researchers concentrate on network
compression, parallel processing, and optimizing memory transfers for
processing speed-up.
Here is a mind map to summarize:
As an aside, the LLMA paper says the serving-cost reduction method it proposes does not belong to any of the categories above; it argues that for transformer-based generative models, the main inference-time bottleneck is autoregressive decoding. The original text is quoted here for reference:
While there are general methodologies that help reduce the serving
cost of LLMs such as quantization (Dettmers & Zettlemoyer, 2023),
pruning (Frantar & Alistarh, 2023), compression (Xu et al., 2020)
and distillation (Wang et al., 2020), the inference efficiency
bottleneck of these transformer-based generative models (e.g., GPT) is
mainly associated with autoregressive decoding: at test time, output
tokens must be decoded (sequentially) one by one, which poses
significant challenges for the LLMs to be deployed at scale.
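To see why decoding is sequential, here is a minimal greedy decoding loop, sketched with GPT-2. Real serving stacks cache key/value states instead of re-encoding the whole sequence, but the token-by-token dependency remains.

```python
import torch
import transformers

tok = transformers.GPT2Tokenizer.from_pretrained("gpt2")
model = transformers.GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok.encode("Large language models are", return_tensors="pt")
with torch.no_grad():
    for _ in range(20):                          # one full forward pass per new token
        logits = model(ids).logits               # [1, seq, vocab]
        next_id = logits[0, -1].argmax()         # greedy pick at the last position
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)   # append, then repeat
print(tok.decode(ids[0]))
```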
A side note on numeric precision in AI models: you will run into terms like FP32, FP16, int8, and int4 in all sorts of contexts.
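As a quick illustration of what these formats cost per element, here is a small PyTorch check (int4 appears only as a comment, since PyTorch has no native 4-bit tensor dtype):

```python
import torch

x = torch.randn(1024, 1024)                     # FP32 by default
print(x.element_size())                         # 4 bytes per element
print(x.half().element_size())                  # FP16: 2 bytes
print(x.to(torch.bfloat16).element_size())      # BF16: 2 bytes, larger dynamic range
print(x.to(torch.int8).element_size())          # int8: 1 byte
# int4: no native PyTorch dtype; libraries pack two 4-bit values into one byte
```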
GPTQ (Frantar et al., 2022) applies quantization only to weights but
not activations.
GPTQ quantizes only the weights and leaves the activations alone. (Personally, while this is factually true, it reads like a somewhat forced criticism, since quantizing the activations would not buy much extra speedup anyway.)
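For intuition, here is what weight-only quantization means in code. This is a naive round-to-nearest absmax sketch, not GPTQ's actual error-compensating procedure: weights are stored in int8 and dequantized for the matmul, while activations stay in floating point.

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    # per-output-channel absmax scaling into the int8 range [-127, 127]
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.round(w / scale).to(torch.int8)
    return q, scale

def linear_w8_a16(x, q, scale, bias=None):
    # dequantize the weights on the fly; activations x remain floating point
    w = q.to(x.dtype) * scale
    return torch.nn.functional.linear(x, w, bias)

w = torch.randn(64, 128)
q, s = quantize_weight_int8(w)
x = torch.randn(4, 128)
print((linear_w8_a16(x, q, s) - x @ w.T).abs().max())  # small quantization error
```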
LLM.int8() uses mixed int8/fp16 decomposition to address the
activation outliers. However, such implementation leads to large latency
overhead, which can be even slower than FP16 inference.
In other words, LLM.int8() only reduces memory usage; it does not reduce inference latency. Put bluntly, it is slow: runtime does not improve.
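A crude sketch of the mixed decomposition idea (illustrative only, not the bitsandbytes implementation): feature columns whose magnitude exceeds a threshold go through a high-precision matmul, the rest through an int8 matmul, and the two partial results are summed. The extra splitting, quantizing, and dequantizing is also where the latency overhead comes from.

```python
import torch

def mixed_int8_matmul(x, w, threshold=6.0):
    # split feature columns into outliers (kept in float) and regular ones (int8)
    outlier = x.abs().amax(dim=0) > threshold          # [d] bool mask over columns
    y_fp = x[:, outlier] @ w[outlier, :]               # small float matmul for outliers

    x_s = x[:, ~outlier].abs().amax() / 127.0 + 1e-8   # per-tensor absmax scales, for brevity
    w_s = w[~outlier, :].abs().amax() / 127.0 + 1e-8
    xq = torch.round(x[:, ~outlier] / x_s).to(torch.int8)
    wq = torch.round(w[~outlier, :] / w_s).to(torch.int8)
    # int8 values multiplied with int32 accumulation, then dequantized back to float
    y_q = (xq.to(torch.int32) @ wq.to(torch.int32)).to(x.dtype) * x_s * w_s
    return y_fp + y_q

x = torch.randn(4, 512)
w = torch.randn(512, 256)
print((mixed_int8_matmul(x, w) - x @ w).abs().max())   # close, up to quantization error
```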
We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained
model weights and injects trainable rank decomposition matrices into
each layer of the Transformer architecture, greatly reducing the number
of trainable parameters for downstream tasks.
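The idea in code, as a toy sketch of my own (not the official loralib): freeze the pretrained W and learn a low-rank update B @ A, scaled by alpha / r.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # W x plus the low-rank update (B @ A) x, scaled by alpha / r
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 12288 vs ~590k frozen
```

Zero-initializing B means the update contributes nothing at the start, so training begins exactly from the pretrained model's behavior.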
```python
from peft import get_peft_model, prepare_model_for_int8_training

def create_peft_model(model, peft_config):
    # prepare the int-8 model for training
    model = prepare_model_for_int8_training(model)
    # wrap the base model with the adapters described by peft_config
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model, peft_config
```
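The snippet assumes a peft_config defined elsewhere; for a causal LM, a typical LoraConfig might look like this (all hyperparameters here are illustrative):

```python
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,          # decoder-only LM finetuning
    r=8,                                   # rank of the decomposition matrices
    lora_alpha=16,                         # scaling factor (alpha / r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which modules get adapters; depends on the model
)
```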
MMLU (Massive Multitask Language
Understanding) is a new benchmark designed to measure knowledge
acquired during pretraining by evaluating models exclusively in
zero-shot and few-shot settings.
For how to use this benchmark, see the original MMLU implementation. The author uses ChatGPT to generate the answers, with the prompt:

```python
prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(format_subject(subject))
```
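For context, the original implementation then appends few-shot examples, each formatted as the question, lettered choices, and the answer. Roughly like this sketch (format_subject and format_example follow the repo's naming, but the function bodies and the example questions are my paraphrase, not the repo's exact code):

```python
choices = ["A", "B", "C", "D"]

def format_subject(subject):
    # "college_physics" -> " college physics"
    return "".join(" " + w for w in subject.split("_"))

def format_example(question, options, answer=None):
    # question text, lettered options, then "Answer:" (plus the label for few-shot examples)
    prompt = question
    for letter, option in zip(choices, options):
        prompt += "\n{}. {}".format(letter, option)
    prompt += "\nAnswer:"
    if answer is not None:
        prompt += " {}\n\n".format(answer)
    return prompt

subject = "college_physics"
prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(
    format_subject(subject))
# a made-up few-shot example followed by the test question to be answered
prompt += format_example("What is the SI unit of force?",
                         ["Joule", "Newton", "Watt", "Pascal"], answer="B")
prompt += format_example("Which quantity is conserved in an elastic collision?",
                         ["Only momentum", "Only kinetic energy",
                          "Both momentum and kinetic energy", "Neither"])
```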
data = [ "A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN)", "The term MLP is used ambiguously, sometimes loosely to any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons (with threshold activation); see § Terminology", 'Multilayer perceptrons are sometimes colloquially referred to as "vanilla" neural networks, especially when they have a single hidden layer.[1]', "An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function.", ] model = transformers.GPT2LMHeadModel.from_pretrained("gpt2") tok = transformers.GPT2Tokenizer.from_pretrained("gpt2") tgs = [] for dat in data: random.seed(dat) # print(model(tok.encode(dat, return_tensors="pt"))[0][0]) toks = tok.encode(dat, return_tensors="pt") ind = random.randrange(len(toks[0]) - 1) logits = F.log_softmax(model(toks)[0], dim=-1)[:, :-1] # [batch, seq, vocab] res = torch.gather(logits, 2, toks[:, 1:].unsqueeze(-1)).squeeze(-1)[0] tgs.append(float(res[ind:].sum()))