A Brief Look at the Stanford Alpaca Code

Stanford's Alpaca opened the door to community fine-tuning of large language models, and much recent work in China follows the Alpaca recipe to fine-tune ChatGLM-6B; see Alpaca's GitHub repo.

A very capable senior student of mine once said that the key to writing better code is to read a lot of other people's excellent code, think about why they wrote it that way, and ask whether you could have done it as efficiently yourself. That is why I wanted to write this post. I only started using Hugging Face's transformers library a few months ago, and I found that many of its APIs look simple on the surface but pack in a lot of functionality that is impossible to master all at once. So my approach is to study how others use it; my main references are Alpaca and the LLM fine-tuning code in the official ChatGLM-6B repo.

Back to the Alpaca code itself: Stanford's implementation is genuinely excellent and worth stepping through in a debugger line by line.

Data preparation

The code is in generate_instruction.py.

The main job of this part is to use seed_tasks.jsonl as templates and have GPT-3.5 generate new instruction/input/output triples from a few sampled seeds. The idea follows self-instruct, which I cover in detail in my other blog post on instruction tuning.

The entry point is generate_instruction_following_data(). It first reads in the seed tasks and then randomly samples num_prompt_instructions of them, three by default:

prompt_instructions = random.sample(seed_instruction_data, num_prompt_instructions)
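
Zooming out, the surrounding logic is roughly the following (a sketch; the real script flattens each seed task's first instance into a flat dict, and the exact variable names here are my assumptions):

import json
import random

# Each line of seed_tasks.jsonl is one JSON object describing a seed task.
seed_tasks = [json.loads(line) for line in open("./seed_tasks.jsonl")]
seed_instruction_data = [
    {
        "instruction": t["instruction"],
        "input": t["instances"][0]["input"],
        "output": t["instances"][0]["output"],
    }
    for t in seed_tasks
]

num_prompt_instructions = 3  # the script's default
prompt_instructions = random.sample(seed_instruction_data, num_prompt_instructions)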

Note that Alpaca sends batched requests to the API, i.e., multiple prompts per call; the script defaults to 5 prompts per request (see the sketch after encode_prompt below). So what does a single prompt look like?

Here is a simple example of a seed task:

{"id": "seed_task_1", "name": "antonym_relation", "instruction": "What is the relation between the given pairs?", "instances": [{"input": "Night : Day :: Right : Left", "output": "The relation between the given pairs is that they are opposites."}], "is_classification": false}

The author concatenates the three sampled seed tasks and prepends a predefined prompt:

import re

def encode_prompt(prompt_instructions):
    """Encode multiple prompt instructions into a single string."""
    prompt = open("./prompt.txt").read() + "\n"

    for idx, task_dict in enumerate(prompt_instructions):
        (instruction, input, output) = task_dict["instruction"], task_dict["input"], task_dict["output"]
        instruction = re.sub(r"\s+", " ", instruction).strip().rstrip(":")
        input = "<noinput>" if input.lower() == "" else input
        prompt += f"###\n"
        prompt += f"{idx + 1}. Instruction: {instruction}\n"
        prompt += f"{idx + 1}. Input:\n{input}\n"
        prompt += f"{idx + 1}. Output:\n{output}\n"
    prompt += f"###\n"
    prompt += f"{idx + 2}. Instruction:"
    return prompt
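
Putting the pieces together, the batch of prompts is assembled roughly like this (a sketch of the loop in generate_instruction.py; request_batch_size defaults to 5):

batch_inputs = []
for _ in range(request_batch_size):
    # Sample a fresh set of seed tasks for each prompt in the batch.
    prompt_instructions = random.sample(seed_instruction_data, num_prompt_instructions)
    batch_inputs.append(encode_prompt(prompt_instructions))
# batch_inputs is then sent to the API in a single batched request.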

The predefined prompt looks like this:

You are asked to come up with a set of 20 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.

Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.
4. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
5. The instructions should be in English.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
7. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 100 words.
8. Not all instructions require input. For example, when a instruction asks about some general information, "what is the highest peak in the world", it is not necssary to provide a specific context. In this case, we simply put "<noinput>" in the input field.
9. The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words.

List of 20 tasks:

To put it plainly, the author appends the three seed tasks after "List of 20 tasks:", giving GPT a worked pattern so it knows to generate in the same format. There is a detail here worth borrowing:

When using a template, give every example a delimiter. The author uses ### here, and additionally numbers each example; both choices make it much easier to post-process the text GPT returns later.
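
Concretely, the assembled prompt sent to the API has roughly this shape (layout only; the actual text comes from prompt.txt and the sampled seeds):

<contents of prompt.txt>
###
1. Instruction: <first seed instruction>
1. Input:
<first seed input, or <noinput>>
1. Output:
<first seed output>
###
2. Instruction: <second seed task>
###
3. Instruction: <third seed task>
###
4. Instruction:

The trailing "4. Instruction:" is the cue for the model to keep generating new numbered tasks in the same format.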

Another part worth studying is in utils.py. When requesting completions from the OpenAI API, you sometimes hit an error because the request is too long. Alpaca catches this error, shrinks the length budget to 80% of its previous value, and resends the request. Thanks to this kind of patience the code stays loosely coupled and very easy to reuse, which self-taught programmers would do well to imitate.
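
The retry logic follows roughly this pattern (a simplified sketch of the completion wrapper in utils.py; the exact error-message check and variable names are my assumptions):

import time
import logging

import openai

while True:
    try:
        completion = openai.Completion.create(prompt=batch_inputs, **decoding_kwargs)
        break
    except openai.error.OpenAIError as e:
        logging.warning(f"OpenAIError: {e}")
        if "Please reduce your prompt" in str(e):
            # Request too long: cut the length budget to 80% and retry.
            decoding_kwargs["max_tokens"] = int(decoding_kwargs["max_tokens"] * 0.8)
        else:
            # Probably a rate limit: back off briefly before retrying.
            time.sleep(2)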


Alpaca also computes similarity scores over the text GPT returns and discards any generated instruction whose similarity to existing ones exceeds a threshold. This part of the code can be reused as-is.
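
A sketch of that filter (Alpaca uses the rouge_score package with a Rouge-L threshold of 0.7; the helper function below is my simplified stand-in for the repo's version):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_too_similar(new_instruction, existing_instructions, threshold=0.7):
    """Drop a generated instruction that overlaps too much with known ones."""
    for old in existing_instructions:
        if scorer.score(old, new_instruction)["rougeL"].fmeasure > threshold:
            return True
    return False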

Data processing in train.py

Alpaca uses a single function, make_supervised_data_module, to produce what the transformers library needs. It returns a dict whose keys are exactly the train_dataset, eval_dataset, and data_collator arguments used to initialize the Trainer. Inside, it first builds the dataset class SupervisedDataset, which inherits from Dataset. Note that PyTorch requires any custom Dataset subclass to override the __len__ and __getitem__ methods. Alpaca's version:

import logging
from typing import Dict

import torch
import transformers
from torch.utils.data import Dataset


class SupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer):
        super(SupervisedDataset, self).__init__()
        logging.warning("Loading data...")
        list_data_dict = utils.jload(data_path)

        logging.warning("Formatting inputs...")
        prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"]
        sources = [
            prompt_input.format_map(example) if example.get("input", "") != "" else prompt_no_input.format_map(example)
            for example in list_data_dict
        ]
        targets = [f"{example['output']}{tokenizer.eos_token}" for example in list_data_dict]

        logging.warning("Tokenizing inputs... This may take some time...")
        data_dict = preprocess(sources, targets, tokenizer)

        self.input_ids = data_dict["input_ids"]
        self.labels = data_dict["labels"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        # The keys must be input_ids and labels: these are the argument
        # names the model's forward pass (and hence the Trainer) expects.
        return dict(input_ids=self.input_ids[i], labels=self.labels[i])
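
The preprocess call above is where the supervised masking happens: the prompt portion of each label sequence is overwritten with IGNORE_INDEX (-100) so the loss is only computed on the response tokens. A simplified sketch of the idea (the repo's version batches the tokenizer calls, but the logic is the same):

import copy

IGNORE_INDEX = -100  # positions with this label are ignored by the loss

def preprocess(sources, targets, tokenizer):
    """Tokenize source+target pairs and mask out the source part of the labels."""
    examples = [s + t for s, t in zip(sources, targets)]
    input_ids = [tokenizer(ex, return_tensors="pt").input_ids[0] for ex in examples]
    source_lens = [tokenizer(s, return_tensors="pt").input_ids.shape[1] for s in sources]
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, source_lens):
        label[:source_len] = IGNORE_INDEX  # only target tokens contribute to the loss
    return dict(input_ids=input_ids, labels=labels)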

Alpaca's DataCollator is custom-made. First, what is a data collator? The official transformers documentation explains:

Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset.

In other words, once the tokenizer has turned all the text into input_ids and labels, the collator assembles them into batches, and it can also do some extra processing along the way. Its input is the dataset we built above, where each sample is a dict with two keys. Alpaca first splits these apart in a single line, making good use of a list comprehension; had I written it myself, it probably would have become two verbose for loops.

input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
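
After the split, the collator pads both lists to the longest sequence in the batch and builds the attention mask; the rest of DataCollatorForSupervisedDataset's __call__ looks roughly like this:

def __call__(self, instances):
    input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
    # Pad to the longest sequence in this batch.
    input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
    )
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
    return dict(
        input_ids=input_ids,
        labels=labels,
        # Padding positions are masked out so attention ignores them.
        attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
    )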

What follows is a very straightforward training call:

trainer.train()
trainer.save_state()
trainer.save_model(output_dir=training_args.output_dir)
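
For completeness, the wiring before trainer.train() is only a few lines (sketched from train.py, with model and tokenizer loading arguments elided):

model = transformers.AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path)

data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
trainer = transformers.Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)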

When I first saw this I found it a bit strange, since my habits were still from the TensorFlow era, where after processing the data you still had to prefetch, batch, and so on. Here transformers seems to have integrated all of that. Take a close look at Alpaca's training launch command:

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
--model_name_or_path "facebook/opt-6.7b" \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir <your_output_dir> \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
--tf32 True

Here per_device_train_batch_size specifies how many samples each device processes per step, i.e., the batch size; with the command above the effective batch size is 4 (per device) × 4 (GPUs) × 8 (gradient accumulation steps) = 128. Loss computation is already defined by the pretrained model, as is the optimizer, so we do not have to worry about them; during training we only specify the process-level parameters such as save steps, learning rate, and number of epochs. This is where fine-tuning large language models differs from earlier deep learning work. The technology moves so fast that other people's code can become hard to follow: somehow training just kicks off in a line or two. Heavy API integration is not all upside.

Recommended reading

  • https://mp.weixin.qq.com/s/ehEM04xmeJyqB4z7rmLKBQ (explains the self-instruct method)