Uses jsonl and binidx datasets instead of the RAM-intensive scripts.
Example.jsonl file contents:
{"text": "This is a sentence"}
{"text": "Instruction: a\n\nInput: b\n\nResponse: c"}
{"text": "Question: a\n\nAnswer: b"}
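Records like the ones above are easiest to produce with a JSON library, which escapes newlines and quotes correctly. The following is a minimal sketch, not part of this notebook: the helper names `to_jsonl_line`, `format_instruction`, `format_qa`, and `write_jsonl` are hypothetical, and the prompt layouts are copied from the example lines.

```python
import json

def to_jsonl_line(text):
    """Serialize one training sample as a single-line JSON object."""
    # json.dumps escapes real newlines as \n, so each record stays on one line.
    return json.dumps({"text": text}, ensure_ascii=False)

def format_instruction(instruction, input_text, response):
    """Hypothetical helper: build the Instruction/Input/Response layout."""
    return f"Instruction: {instruction}\n\nInput: {input_text}\n\nResponse: {response}"

def format_qa(question, answer):
    """Hypothetical helper: build the Question/Answer layout."""
    return f"Question: {question}\n\nAnswer: {answer}"

def write_jsonl(texts, path):
    """Write one record per line to `path` and return the number written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(to_jsonl_line(text) + "\n")
            count += 1
    return count
```

For example, `write_jsonl([format_qa("a", "b")], "dataset/example.jsonl")` would write the third example line shown above, assuming a `dataset` folder already exists.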
The tokenizer combines all jsonl files inside your dataset folder into two files that the training script reads. Read the output of each cell to know what to do next.
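Before running the tokenizer, it can help to check that every jsonl file in the folder parses. The sketch below only illustrates the "combine all jsonl files" idea by streaming them into a single jsonl file; it does not tokenize or produce the binidx output, and it is not the notebook's own code. The function names `iter_texts` and `merge_jsonl` are hypothetical.

```python
import json
from pathlib import Path

def iter_texts(paths):
    """Yield the "text" field of every record in the given jsonl files."""
    for path in paths:
        path = Path(path)
        with path.open(encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)  # raises on malformed JSON
                if "text" not in record:
                    raise ValueError(f"{path.name}:{line_no} has no 'text' field")
                yield record["text"]

def merge_jsonl(dataset_dir, out_path):
    """Stream all *.jsonl files in dataset_dir into out_path; return record count."""
    out_path = Path(out_path)
    # Resolve the source list first and skip the output file itself,
    # in case it lives inside the dataset folder.
    sources = [
        p for p in sorted(Path(dataset_dir).glob("*.jsonl"))
        if p.resolve() != out_path.resolve()
    ]
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for text in iter_texts(sources):
            out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Records are processed one line at a time, so memory use stays flat regardless of dataset size.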
Only tested with 0.4B-World.
Soon™