Skip to content

dame-cell/MinGru-pytorch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

64 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MinGRU

A MinGRU model with 18 million parameter.

Mingru-lm Mingru
Image 1 Image 1

The MinGRU model is a simplified version of the traditional Gated Recurrent Unit (GRU), designed to reduce complexity and improve efficiency. By removing the hidden state dependencies from its gates, MinGRU allows for parallel training, which is much faster compared to traditional GRUs. Additionally, it eliminates the use of non-linear activations like tanh, further streamlining computations.

Parameters Value Description
--dim 512 Dimension for the model
--num_tokens 256 Maximum tokens for model (max is 256 due to ASCII)
--num_layers 6 Number of layers to train the model

Examples

Here is an example generated by the model after training it: Prompt - "once upon a time

once upon a time, there was a little girl named Lily. She loved to play outside and explore the world around her. One day, she decided to make a cool. Lily wanted to mxing, she saw something on the ground and got stuck in a hole. When she was many corner of the girl with her parents and her tears were fboard for her to be cution of the stairs. She gave her mom, she felt a little boy was, so she started to feel safe.Onstally, Lily went to a visit her could go in her backyard. She said to investigate and Lily

Try the Pre-trained Model

You can try the pre-trained model in this Hugging Face Space app: MinGru

And you can find the pre-trained model here: MinGru-model

Training Details

The model was trained using two NVIDIA T4 GPUs in a distributed data parallel (DDP) setup

Hyperparameter Type Default Value Description
--batch_size int 204 Batch size for training
--lr float 4e-3 Learning rate for training the model
--wd float 1e-2 Weight decay for your optimizer
--epochs int 30 Total number of epochs

Dataset Information

Trained the model on the tiny-stories dataset

Getting started

Before we begin the model was trained on two t4 kaggle gpus so the file train_ddp.py will work there but you can still train the model using the train.py

  • First git clone this repo
git clone https://github.com/dame-cell/MinGru.git
cd MinGru 
pip install -r requirements.txt 
cd mingru 
  • First you will need to prepare the dataset just simply run this code,it will take less than 1 min or maybe
python data.py 
  • Then train the model
python train.py --path_to_train_data (required) --path_to_test_data (required) --batch_size 204 

Plots and observations

We updated the model architecture by adding casual-depth which can be used to capture local temporal dependencies

The updated architecture seems to solve the spike at the end of the training

Metric Best Value
Best Updated Train Loss 0.74
Best Updated Val Loss 0.77
Best Updated Perplexity 2.16

Check the wandb report right here wandb

Citations

@inproceedings{Feng2024WereRA,
    title   = {Were RNNs All We Needed?},
    author  = {Leo Feng and Frederick Tung and Mohamed Osama Ahmed and Yoshua Bengio and Hossein Hajimirsadegh},
    year    = {2024},
    url     = {https://api.semanticscholar.org/CorpusID:273025630}
}

Acknowledgement

Thank to lucidrains for the reference code

@inproceedings{Feng2024WereRA,
    title   = {Were RNNs All We Needed?},
    author  = {Leo Feng and Frederick Tung and Mohamed Osama Ahmed and Yoshua Bengio and Hossein Hajimirsadegh},
    year    = {2024},
    url     = {https://api.semanticscholar.org/CorpusID:273025630}
}

Releases

No releases published

Packages

No packages published

Languages