{"id": "seed_task_1", "name": "antonym_relation", "instruction": "What is the relation between the given pairs?", "instances": [{"input": "Night : Day :: Right : Left", "output": "The relation between the given pairs is that they are opposites."}], "is_classification": false}
The author concatenates three seed_tasks together and then prepends a predefined prompt:
def encode_prompt(prompt_instructions):
    """Encode multiple prompt instructions into a single string."""
    prompt = open("./prompt.txt").read() + "\n"
You are asked to come up with a set of 20 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.
Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.
4. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
5. The instructions should be in English.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
7. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 100 words.
8. Not all instructions require input. For example, when a instruction asks about some general information, "what is the highest peak in the world", it is not necssary to provide a specific context. In this case, we simply put "<noinput>" in the input field.
9. The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words.
List of 20 tasks:
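Below the "List of 20 tasks:" line, encode_prompt then appends each of the sampled seed tasks. Here is a rough sketch of that step, assuming each seed task has already been flattened into a dict with instruction / input / output keys (taken from the instances field of the JSON above); the exact numbering and separators are my assumption, not code copied from the Alpaca repo:

```python
import re

def append_seed_tasks(prompt, prompt_instructions):
    """Append each sampled seed task below the 'List of 20 tasks:' line."""
    for idx, task in enumerate(prompt_instructions):
        instruction = re.sub(r"\s+", " ", task["instruction"]).strip().rstrip(":")
        task_input = task["input"] if task.get("input") else "<noinput>"
        prompt += "###\n"
        prompt += f"{idx + 1}. Instruction: {instruction}\n"
        prompt += f"{idx + 1}. Input:\n{task_input}\n"
        prompt += f"{idx + 1}. Output:\n{task['output']}\n"
    # leave the next numbered slot open so GPT continues the pattern with new tasks
    prompt += f"###\n{len(prompt_instructions) + 1}. Instruction:"
    return prompt
```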
To put it plainly, the author appends the three seed tasks right after the "List of 20 tasks:" line, giving GPT a few demonstrations so it knows to generate new tasks in that pattern. Over in the fine-tuning code, one spot worth a closer look is how the training inputs are formatted:
logging.warning("Formatting inputs...") prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"] sources = [ prompt_input.format_map(example) if example.get("input", "") != ""else prompt_no_input.format_map(example) for example in list_data_dict ] targets = [f"{example['output']}{tokenizer.eos_token}"for example in list_data_dict]
logging.warning("Tokenizing inputs... This may take some time...") data_dict = preprocess(sources, targets, tokenizer)
Data collators are objects that will form a batch by using a list of
dataset elements as input. These elements are of the same type as the
elements of train_dataset or eval_dataset.
In other words, once the tokenizer has turned all the text into input_ids and labels, they still have to be organized into batches, and on top of that the collator can do some extra data processing. Its input is the dataset we built above. Note how that dataset is organized: each sample is a dictionary with two keys, so Alpaca first splits them apart. It does this in a single line, making good use of list comprehensions (the [ ... for ... ] construct); if I had written it myself, it would probably have turned into two very verbose for loops.
input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
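Building on that one-liner, a minimal collator sketch would pad input_ids with the pad token, pad labels with -100 so padding never contributes to the loss, and derive the attention mask from the padding. This is an illustration of the idea, not Alpaca's exact collator:

```python
from torch.nn.utils.rnn import pad_sequence

IGNORE_INDEX = -100

class SupervisedCollatorSketch:
    """Turn a list of {'input_ids', 'labels'} dicts into one padded batch."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, instances):
        # split the list of dicts into two lists of tensors, as in the line above
        input_ids, labels = tuple(
            [instance[key] for instance in instances] for key in ("input_ids", "labels")
        )
        input_ids = pad_sequence(input_ids, batch_first=True,
                                 padding_value=self.tokenizer.pad_token_id)
        labels = pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )
```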
A recent piece of work, LIMA: Less Is More for Alignment, fine-tunes LLaMA-65B on just 1,000 instruction examples, and in 43% of cases LIMA's responses match or beat GPT-4's. That is impressive given only 1,000 examples. The authors also reproduced the Stanford recipe, fine-tuning LLaMA-65B on the 52k Alpaca data, and found that LIMA still came out ahead; they attribute this to dataset quality, since the 1,000 examples were carefully curated.
ChatGLM-6B p-tuning
There are a huge number of fine-tuning projects built on ChatGLM-6B; ChatGLM has a natural advantage in Chinese, so many Chinese-language models build on this Tsinghua model. The official ChatGLM-6B GitHub repo includes code for p-tuning v2. The idea of p-tuning v2 is to turn the part of the prompt that would otherwise be hand-written into learnable parameters: the pretrained LLM weights stay frozen, and only these added parameters are updated. See the code ChatGLM-6B itself provides for fine-tuning on ADGEN (an ad-generation dataset): https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning.
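As a conceptual sketch of that idea (simplified; this is not the ChatGLM-6B implementation, and the class name and shapes are assumptions): a small bank of trainable "prefix" embeddings is projected into per-layer key/value states that get prepended to attention, and only these parameters receive gradients while the base model stays frozen.

```python
import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    """Trainable prefix for p-tuning v2: a few 'virtual token' embeddings,
    reshaped downstream into per-layer key/value states."""

    def __init__(self, num_virtual_tokens: int, num_layers: int, hidden_size: int):
        super().__init__()
        # one row per virtual token; 2x for key and value, per transformer layer
        self.embedding = nn.Embedding(num_virtual_tokens, num_layers * 2 * hidden_size)

    def forward(self, batch_size: int) -> torch.Tensor:
        prefix_ids = torch.arange(self.embedding.num_embeddings)
        prefix_ids = prefix_ids.unsqueeze(0).expand(batch_size, -1)
        return self.embedding(prefix_ids)  # [batch, virtual_tokens, layers * 2 * hidden]

# The pretrained model is frozen; only the prefix parameters are trained:
# for p in base_model.parameters():
#     p.requires_grad = False
```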
The data-handling part of this code is quite interesting; the dataset looks like this:
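Roughly, each line pairs a keyword-style product description with the ad copy to be generated; here is a quick sketch of reading it (the content/summary field names follow the ptuning README, while the file path is an assumption):

```python
import json

# Each line of the ADGEN data is a JSON object with a "content" field
# (structured keywords describing a product) and a "summary" field
# (the ad text the model should generate).
with open("AdvertiseGen/train.json", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        prompt, response = sample["content"], sample["summary"]
        print(prompt, "->", response)
        break
```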
When we run this model (Gorilla) for inference, the input is simply a piece of text telling it what kind of API you want, for example "I would like to identify the objects in an image", or something vaguer: "I am going to the zoo, and would like to track animals". In zero-shot mode this instruction is given directly to Gorilla, and the model returns a snippet of API-calling code, like this:
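As a rough illustration of the kind of code that comes back (an illustrative sketch of a Torch Hub call, not verbatim Gorilla output):

```python
import torch

# Load a pretrained DETR object-detection model from Torch Hub
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

# `image` would be a preprocessed tensor of shape [1, 3, H, W]
# predictions = model(image)
```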