{"id": "seed_task_1", "name": "antonym_relation", "instruction": "What is the relation between the given pairs?", "instances": [{"input": "Night : Day :: Right : Left", "output": "The relation between the given pairs is that they are opposites."}], "is_classification": false}
The author concatenates three seed_tasks together and prepends a predefined prompt:
```python
def encode_prompt(prompt_instructions):
    """Encode multiple prompt instructions into a single string."""
    prompt = open("./prompt.txt").read() + "\n"
```
The predefined prompt (prompt.txt) reads:

```
You are asked to come up with a set of 20 diverse task instructions. These task instructions will be given to a GPT model and we will evaluate the GPT model for completing the instructions.

Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you should combine questions with imperative instrucitons.
3. The type of instructions should be diverse. The list should include diverse types of tasks like open-ended generation, classification, editing, etc.
4. A GPT language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
5. The instructions should be in English.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
7. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging but should ideally not exceed 100 words.
8. Not all instructions require input. For example, when a instruction asks about some general information, "what is the highest peak in the world", it is not necssary to provide a specific context. In this case, we simply put "<noinput>" in the input field.
9. The output should be an appropriate response to the instruction and the input. Make sure the output is less than 100 words.

List of 20 tasks:
```
The way to understand this: the author appends the three seed tasks right after "List of 20 tasks:", which is essentially giving GPT a few worked examples so that it knows to generate new tasks in the same pattern.
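For reference, here is a minimal sketch of what the rest of encode_prompt does with those seed tasks, written against the seed-task JSON format shown above; the exact separator and field formatting in the Stanford Alpaca repo may differ, and the first three lines simply restate the opening shown earlier so the sketch is self-contained. Each task is numbered as Instruction / Input / Output, and the next number is left open so GPT continues the list on its own.

```python
def encode_prompt(prompt_instructions):
    """Encode multiple prompt instructions into a single string (sketch)."""
    prompt = open("./prompt.txt").read() + "\n"
    for idx, task in enumerate(prompt_instructions):
        instruction = task["instruction"].strip()
        instance = task["instances"][0]
        # Empty inputs are replaced by the "<noinput>" placeholder mentioned in the prompt.
        input_text = instance["input"].strip() or "<noinput>"
        output_text = instance["output"].strip()
        prompt += "###\n"
        prompt += f"{idx + 1}. Instruction: {instruction}\n"
        prompt += f"{idx + 1}. Input:\n{input_text}\n"
        prompt += f"{idx + 1}. Output:\n{output_text}\n"
    # Leave task number 4 dangling so the model keeps counting from there.
    prompt += f"###\n{len(prompt_instructions) + 1}. Instruction:"
    return prompt
```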
logging.warning("Formatting inputs...") prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"] sources = [ prompt_input.format_map(example) if example.get("input", "") != ""else prompt_no_input.format_map(example) for example in list_data_dict ] targets = [f"{example['output']}{tokenizer.eos_token}"for example in list_data_dict]
logging.warning("Tokenizing inputs... This may take some time...") data_dict = preprocess(sources, targets, tokenizer)
Data collators are objects that will form a batch by using a list of
dataset elements as input. These elements are of the same type as the
elements of train_dataset or eval_dataset.
That is, after you have used the tokenizer to turn all the text into input_ids and labels, you still need to organize them into batches; on top of that, the collator can also do some extra data processing. Its input is the dataset we built earlier. Note how that dataset is organized: each sample is a dictionary with two keys. So Alpaca first splits them apart, and it does this in a single line, making good use of the `[... for ...]` comprehension syntax; if I had written it myself, it would probably have turned into two rather verbose for loops.
```python
input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
```
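For context, the rest of the collator then pads input_ids and labels to a common length and builds the attention mask. Below is a minimal, self-contained sketch in that style; it assumes each sample's values are already tensors and treats IGNORE_INDEX = -100 as the label padding value so the loss skips padded positions (details may differ from the repo's literal code):

```python
from dataclasses import dataclass
from typing import Dict, Sequence

import torch
import transformers

IGNORE_INDEX = -100  # assumption: label value ignored by the cross-entropy loss


@dataclass
class DataCollatorForSupervisedDataset:
    """Collate tokenized samples into padded batch tensors (sketch)."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        # Split the list of per-sample dicts into two lists of tensors.
        input_ids, labels = tuple(
            [instance[key] for instance in instances] for key in ("input_ids", "labels")
        )
        # Pad every sequence in the batch to the length of the longest one.
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.nn.utils.rnn.pad_sequence(
            labels, batch_first=True, padding_value=IGNORE_INDEX
        )
        return dict(
            input_ids=input_ids,
            labels=labels,
            # Padding positions are masked out of attention.
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )
```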