LLM评测/Evaluation

Posted on 2023-08-16 Edited on 2023-08-29 In NLP Symbols count in article: 4.7k Reading time ≈ 4 mins.

最近在关注模型performance评估的问题，打算在这个主题上做一个整理，也是受到很多博客和文章的启发写这篇文章，所以就将所有推荐阅读的文章放在前面，感兴趣的小伙伴可以拓展阅读。

老刘说NLP 公众号中8.10发的一篇文章《如何让自己的大模型榜单评分更高》这篇文章有点借鉴了hugging face的Open LLM 排行榜近况
https://mp.weixin.qq.com/s/WiF3yRU5MZS7ulLTP8FIpw

首先说一下这个LLM榜单, 有四个benchmark，其中上面的博客就是重点讲了为什么同样一个模型比如LLaMA在MMLU上评测的结果会不如llama文章中提的效果，trick就在作者使用MMLU这个benchmark的方式有很大不同，这里来看看MMLU这个benchmark。

MMLU benchmark

首先看一下这个数据集到底是什么数据集，长什么样子，先给出文章中的定义：

MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings.

这个评测集合里包含了57个学科，也就是57个task。原始的数据集长这样，里面的每个问题包含四个可能选项，且每个问题只有一个正确答案。：

可以看到基本上就是question，answer的组织。注意这里看到原始数据的时候我还有点没看明白，作者的readme中也没写，还是对beginner有点不友好，第一列表示question，第二到第四列表示四个选项，最后一列是答案。所以可以看到原作者在evaluation的代码中这样处理的：

choices = ["A", "B", "C", "D"] # 首先定义选项

def eval(args, subject, engine, dev_df, test_df):
    cors = []
    all_probs = []
    answers = choices[:test_df.shape[1]-2] # 对于每一个csv文件读取进来后取answers
...

label = test_df.iloc[i, test_df.shape[1]-1] # label这里其实是取得最后一列，也就是答案

但这个评测数据集在用来评测LLM的过程中衍生出了很多版本，基本是prompt的变化：

同样的问答对，比如上面的选择题，Harness没有指令，并且衍生的两个版本也就是helm和harness版本还加了Question这个前缀，harness在选线之前还加了Choices。就这么一点差距，就导致同一个llm的出来的分数不一样：

关于如何使用这个benchmark，参考MMLU原始实现，作者写的是用chatgpt来产生答案，prompt为：prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(format_subject(subject))

这三种实现方式不仅prompt的形式不同，也就是上面提到的。并且它在计算F1score的时候的机制也不同。

原始实现

在原始实现中的评估的代码是这样写的：

for ans in answers:
    try:
        lprobs.append(c["choices"][0]["logprobs"]["top_logprobs"][-1][" {}".format(ans)]) # c是chatgpt的回答
    except:
        print("Warning: {} not found. Artificially adding log prob of -100.".format(ans))
    lprobs.append(-100)
    pred = {0: "A", 1: "B", 2: "C", 3: "D"}[np.argmax(lprobs)]
    probs = softmax(np.array(lprobs))

    cor = pred == label
    cors.append(cor)
    all_probs.append(probs)

该方法在评估的时候，仅仅比较了模型对四个选项字母的预测概率，哪个选项的概率高就选哪个，即便是在极端情况下四个选项的概率值都很低的情况下也会选择某个选项，但其实模型有时候会回答很多不相关的东西（都是很高的概率的token），所以这种方式有点”放水“，整体评估出来的分数会偏高。

HELM实现

HELM实现是根据模型预测的下一个输出词元的概率来选择输出文本，并将生成的文本与正确答案的文本进行对比。这种方式有效避免了如果模型的答案中出现概率高的token不是选项中的任意一个，那么就会判为错误答案。

看了helm的代码仓库，着实有点丰富。内容很多我都没有找到在哪个文件里做的evaluation的计算，只知道了读取csv的地方。有好心的小伙伴可以私信我告诉我在哪里。

harness实现

这是hugging face的llm榜单所用的实现。它不再是只是统计选项，而是连同选项字母以及后面的答案一起被考虑进来，计算的是整个序列的概率（获取每个词元的概率 (与上面其他实现一样) 并求它们的联合概率），那么很容易一些长文本的联合概率会比短文本的联合概率大，所以作者说可以在联合概率的基础上在做一个归一化，也就是用对数联合概率/ token数。

例如实现如下，基于GPT2计算句子联合概率的一个例子：

data = [
    "A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN)",
    "The term MLP is used ambiguously, sometimes loosely to any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons (with threshold activation); see § Terminology",
    'Multilayer perceptrons are sometimes colloquially referred to as "vanilla" neural networks, especially when they have a single hidden layer.[1]',
    "An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function.",
]
model = transformers.GPT2LMHeadModel.from_pretrained("gpt2")
tok = transformers.GPT2Tokenizer.from_pretrained("gpt2")
tgs = []
for dat in data:
    random.seed(dat)
    # print(model(tok.encode(dat, return_tensors="pt"))[0][0])
    toks = tok.encode(dat, return_tensors="pt")
    ind = random.randrange(len(toks[0]) - 1)
    logits = F.log_softmax(model(toks)[0], dim=-1)[:, :-1]  # [batch, seq, vocab]
    res = torch.gather(logits, 2, toks[:, 1:].unsqueeze(-1)).squeeze(-1)[0]
    tgs.append(float(res[ind:].sum()))

在“老刘说NLP”的博客中也提到了一点，就是上面的方式都是开源模型，所以很容易就能得到每一个token的预测概率，所以返回结果可以拆的这么细致来分析。如果是闭源模型只返回response的话，这时候就需要用正则的方式来抽取回答内容里的选项，比如CEVAL的测试方案：

def extract_cot_answer(self, line, gen_ans):
    m = re.findall(r'所以答案是(.+?)。', gen_ans, re.M)
    if len(m) > 0 and m[-1] in self.choices:
        return m[-1], True
    answer_patterns = [
        r'([ABCD])是正确的',
        r'选项([ABCD])正确',
        r'答案为([ABCD])',
        r'答案是([ABCD])',
        r'答案([ABCD])',
        r'选择([ABCD])',
        r'答案：([ABCD])',
        r'选择答案([ABCD])'
    ]
    # RE extraction
    for answer_pattern in answer_patterns:
        m = re.search(answer_pattern, gen_ans, re.M)
        if m:
            answer = m.group(1)
            return answer, False
        # only containing one choice-character
        m = re.findall(r'[ABCD]', gen_ans, re.M)
        if len(m) == 1:
            answer = m[0]
            return answer, False
        answer_word_counter = 0
        # only containing one choice-context
        for c in self.choices:
            if str(line[f'{c}']) in gen_ans:
                answer = c
                answer_word_counter += 1
                if answer_word_counter == 1:
                    return answer, False
                return '-', False

对CLEVA评测平台感兴趣的可以看原文paper或者参考文章。原文说CLEVA是专门为评估中文语言模型而设计的平台。