
GoQU, a Generator of QUestions: approaching question generation using Deep Learning

This repository contains the final project for the Natural Language Processing course of the Master's degree in Artificial Intelligence, University of Bologna.

Table Of Contents

  • Data
  • Project Details
  • Model Architecture
  • Folder structure
  • Technologies and Frameworks
  • Configurations and environments
  • Versioning
  • Future Works
  • Bibliography
  • License

Data

The dataset on which we trained, developed, and tested our Question Generation (QG) network is the Stanford Question Answering Dataset (SQuAD) version 1.1, a collection of question-answer pairs derived from Wikipedia articles. The dataset was preprocessed to better accommodate the needs of our implementation.
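
As a reference for this preprocessing step, the sketch below shows one common way to flatten the official SQuAD v1.1 JSON into (context, question, answer) triples. The file path and the flattening choices are assumptions for illustration, not the exact pipeline implemented in data_loader/data_generator.py.

```python
import json

def load_squad_v11(path="data/train-v1.1.json"):
    """Flatten the official SQuAD v1.1 JSON into (context, question, answer) triples.

    Illustrative sketch only; the project's actual preprocessing may differ.
    """
    with open(path, encoding="utf-8") as f:
        squad = json.load(f)

    triples = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # SQuAD v1.1 guarantees at least one gold answer span per question
                triples.append((context, qa["question"], qa["answers"][0]["text"]))
    return triples
```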

Project Details

This project tackles the Question Generation task using the ideas introduced in the paper by Du et al. [1]. We implement a revisited version of the Seq2Seq model they proposed in 2017, exploiting newer technologies and the TensorFlow framework provided by Google. The purpose of this project is purely educational, and we do not claim any credit for the great work done by Du et al.

Model Architecture
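
A minimal sketch of the kind of encoder-decoder pair suggested by models/layers/ (encoder.py, decoder.py) is shown below, assuming a bidirectional LSTM encoder and an LSTM decoder with additive attention in the spirit of Du et al. [1]. Layer choices, sizes, and names are illustrative assumptions, not the exact contents of this repository.

```python
import tensorflow as tf

class Encoder(tf.keras.layers.Layer):
    """Bidirectional LSTM encoder over the input sentence (illustrative sketch)."""
    def __init__(self, vocab_size, embedding_dim=300, units=256):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True)
        self.bilstm = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units, return_sequences=True, return_state=True))

    def call(self, tokens):
        x = self.embedding(tokens)
        outputs, fh, fc, bh, bc = self.bilstm(x)
        # Concatenate forward/backward states to initialize the decoder
        return outputs, tf.concat([fh, bh], -1), tf.concat([fc, bc], -1)

class Decoder(tf.keras.layers.Layer):
    """LSTM decoder with additive attention over the encoder outputs (illustrative sketch)."""
    def __init__(self, vocab_size, embedding_dim=300, units=512):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True)
        self.lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
        self.attention = tf.keras.layers.AdditiveAttention()
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, tokens, enc_outputs, state):
        x = self.embedding(tokens)
        dec_outputs, h, c = self.lstm(x, initial_state=state)
        context = self.attention([dec_outputs, enc_outputs])   # attend over source tokens
        logits = self.fc(tf.concat([dec_outputs, context], -1))
        return logits, [h, c]
```

During training the decoder would be fed the gold question tokens (teacher forcing); at inference time tokens are generated one step at a time from the previous prediction.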

Folder structure (WIP)


├── goqu.py
├── GoQUreport.pdf
├── requirements.txt
├── configs
│   └── config.py               - configurations for the project
├── data                        - training data and additional files for further operations
├── models
│   ├── eval
│   │   ├── eval_metrics.py     - metrics used for the evaluation
│   │   └── evaluator.py        - class used for evaluation
│   ├── layers
│   │   ├── decoder.py          - decoder layer
│   │   ├── encoder.py          - encoder layer
│   │   └── masking.py          - custom masking layer
│   ├── trainers
│   │   ├── keras_tuner.py      - automatic hyperparameter tuning
│   │   ├── trainer.py          - class used for training
│   │   └── metrics.py          - metrics used to evaluate training
│   ├── weights                 - pre-trained weights from Colab
│   ├── loss.py                 - loss used by the model
│   └── callbacks.py            - classes used as callbacks
├── data_loader
│   └── data_generator.py       - methods for loading and processing the dataset
└── utils                       - utility methods for complementary operations
     ├── dirs.py
     ├── embeddings.py
     └── utils.py

Technologies and Frameworks

Frameworks:

  • TensorFlow / Keras
  • Keras Tuner

Platforms:

  • Google Colab

Configurations and environments

The config.py file contains all the configurations needed by the project. The conda environment can be created by running:

$ conda create --name <env> --file requirements.txt
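
For orientation, a hypothetical shape of configs/config.py is sketched below; every field name and value is an assumption made for illustration, not the project's actual settings.

```python
# configs/config.py -- hypothetical sketch, field names and values are illustrative
CONFIG = {
    "max_source_len": 60,       # longest sentence fed to the encoder
    "max_target_len": 30,       # longest question produced by the decoder
    "embedding_dim": 300,       # dimensionality of the pre-trained word vectors
    "hidden_units": 512,        # recurrent units in encoder/decoder
    "batch_size": 64,
    "epochs": 15,
    "learning_rate": 1e-3,
    "data_dir": "data/",
    "weights_dir": "models/weights/",
}
```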

Versioning

We used Git for versioning.

Future Works

Possible improvements to this project could be:

  • encoding additional information into the word embeddings: each word vector could be concatenated with its NER and POS tags to augment the information given to the network,
  • adding a more sophisticated decoding strategy in the final generation step, such as beam search decoding, instead of the current temperature sampling (see the sketch after this list),
  • using contextual word embeddings,
  • using a different, possibly more sophisticated, model.
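
As a reference for the decoding point above, the snippet below sketches temperature sampling over the logits of a single decoding step; it is a generic illustration, not the exact routine used in this repository.

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.7):
    """Sample the next token id from softmax(logits / temperature).

    Lower temperatures sharpen the distribution (closer to greedy decoding),
    higher temperatures flatten it (more diverse output). Illustrative sketch.
    """
    logits = np.asarray(logits, dtype=np.float64) / temperature
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```

Beam search would instead keep the k most probable partial questions at every step and expand each of them, trading sampling diversity for higher-probability outputs.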

Bibliography

[1] Xinya Du, Junru Shao, and Claire Cardie. "Learning to Ask: Neural Question Generation for Reading Comprehension." Proceedings of ACL 2017.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
