Our purpose for integrating an LLM into the Pippy activity is quite clear: to build a co-pilot similar to GitHub's, but at a much more elementary level and easy for children to use. The Python code that LLMs generate is aimed mainly at developers, because these models are trained on developers' code available on the internet. There is a real chance that the generated code will be difficult for kids to understand, because kids are often unable to give structured and clear prompts, and this results in hallucinations in our output.
There are many LLMs available, several of them FOSS, such as Gemini, codellama, Mistral, Llama, GPT-2, Gemma, Bloom, Mixtral, Llama 2 and more. Based on their performance and architecture, we need to shortlist one for the code-generation task. While working with Gemini, Bloom and GPT-2, I found they are great for general text-generation tasks but not suitable for code generation. This is because:
- The models hallucinated heavily on complex code that involves a good amount of logic.
- Their computational power consumption is relatively high compared to other models for the same tasks, which adds a lot of maintenance cost in the future.
- They are difficult to fine-tune, which makes it hard for us to cater to kids.
I personally feel this model is very relevant to our project. It was launched by Meta alongside Llama 2 specifically for code-generation and code-correction tasks, which itself speaks for its suitability for our use case.
Click here to view the research paper on codellama.
This is my final model using codellama, which I built after reading the research paper and other resources (model used: codellama; output of the model).
It can be made to run on a CPU instead of GPUs, because GPUs are expensive and not FOSS, which makes codellama a great alternative.
In the code above there are two codellama models, with 7B and 13B parameters. The 7B model is relatively easy to fine-tune and can be trained on our specific dataset using fewer GPU resources, whereas the 13B model requires more computational power because it has billions of additional parameters.
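As a rough sketch of what this looks like, the 7B Instruct variant can be loaded and run on CPU with the Hugging Face transformers library. The model id, prompt and generation settings below are just illustrative assumptions, not our final configuration:

```python
# Minimal sketch: load codellama 7B Instruct and generate on CPU.
# Assumes `transformers` is installed and the weights fit in RAM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-Instruct-hf"   # 13B variant: CodeLlama-13b-Instruct-hf
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # no device_map, so it stays on CPU

prompt = "[INST] Write a simple Python program that prints a times table. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```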
Yes, the output the model generates is pitched at a higher level than kids can understand. However, I have addressed this issue by setting agents, or pre-defined system prompts, in our model, which keeps these hallucinations in check.
Our model is now able to generate correct code for kids, along with examples and explanations to support it. This is working great, as per our expectations!
Here are some of the modifications I made to optimize my model so that it generates code suited to middle-school pupils.
System prompts steer the model so that it generates code in line with the context given in the system prompt. In codellama and many other models I have set static system prompts; the model generates code after evaluating both the system and the user prompt.
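A small sketch of how such a static system prompt can be wrapped around the kid's request, using the `[INST] <<SYS>>` chat format that codellama-Instruct expects (the wording of the prompt is only an example, not our exact production prompt):

```python
# Static system prompt telling the model to write code for middle-school kids.
SYSTEM_PROMPT = (
    "You are a friendly Python teacher for middle-school students. "
    "Write short, simple code with comments and a one-line explanation."
)

def build_prompt(user_request: str) -> str:
    # codellama-Instruct follows the Llama 2 chat format:
    # [INST] <<SYS>> system text <</SYS>> user text [/INST]
    return f"[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{user_request} [/INST]"

print(build_prompt("Make a program that prints the first 10 even numbers."))
```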
This is the most elementary and commonly used method, which I had already used in our Gemini model by assigning roles like software developer, teacher, Python code expert, etc. Here we assign several agents, and the model produces its output by combining the answers of all these agents, which gives more accurate answers. The research paper clearly shows that agents are helpful for logic-based tasks like math and playing chess.
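To illustrate the idea, this hypothetical sketch asks the same question under several roles and collects the answers so they can be compared or merged; `generate_answer` is a stand-in for whichever call actually runs codellama:

```python
# Hypothetical multi-agent sketch: the same question is asked under
# several roles, and the answers are collected for comparison or merging.
ROLES = {
    "teacher": "You are a patient Python teacher for kids.",
    "developer": "You are a careful software developer who writes simple Python.",
    "expert": "You are a Python code expert who double-checks the logic.",
}

def ask_agents(question, generate_answer):
    answers = {}
    for name, role in ROLES.items():
        prompt = f"[INST] <<SYS>>\n{role}\n<</SYS>>\n\n{question} [/INST]"
        answers[name] = generate_answer(prompt)
    return answers  # combine these, or pick the most consistent answer
```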
These are low-rank adaptation techniques used to fine-tune a model on our specific dataset: QLoRA is used for quantised models and LoRA for non-quantised models. I tried applying these techniques to Gemma for better, more specific output. If we want code suited to a kid's age, we can fine-tune our model on a dataset of code written at a kid's level. We can see in these research papers that quantised models usually outperform non-quantised models, but the disadvantage is that fine-tuning the model on the dataset requires very high computational power. These are some of the disadvantages we face with the LoRA and QLoRA techniques.
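For reference, a minimal LoRA setup with the PEFT library could look like the sketch below. The rank, target modules and dataset are assumptions for illustration only; QLoRA would additionally load the base model in 4-bit:

```python
# Minimal LoRA sketch with PEFT: only small adapter matrices are trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

config = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a tiny fraction of weights are trainable
# `model` can now be fine-tuned on a kid-level Python dataset.
```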
I implemented the RAG approach with the Llama model and received quite satisfactory results. We can use this to give kids relevant Python examples, like tuples, dictionaries, string slicing, etc., by feeding the model a Python document containing them. However, the research paper clearly states that RAG is good for chatbots and general text-generation models; it is not good for tasks that involve a lot of logic, like generating code for an arbitrary problem.
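To give an idea of the pipeline, here is a toy RAG sketch that picks the most relevant section of a Python notes file by simple word overlap and prepends it to the prompt. The file name and retrieval method are placeholders; a real setup would use embeddings and a vector store:

```python
# Toy RAG sketch: retrieve a relevant section from a notes file and
# feed it to the model as context. Word-overlap scoring is a stand-in
# for a proper embedding-based retriever.
def load_sections(path="python_notes.txt"):
    with open(path) as f:
        return [s.strip() for s in f.read().split("\n\n") if s.strip()]

def retrieve(question, sections):
    q_words = set(question.lower().split())
    return max(sections, key=lambda s: len(q_words & set(s.lower().split())))

def build_rag_prompt(question, sections):
    context = retrieve(question, sections)
    return (f"[INST] Use these notes to answer for a middle-school student.\n\n"
            f"Notes:\n{context}\n\nQuestion: {question} [/INST]")
```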