We intend to implement language models. Use appropriate smoothing (built-in smoothing functions may be used) so that the language model outputs a valid probability distribution and assigns non-zero probability even to out-of-vocabulary words.
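One standard option is add-one (Laplace) smoothing, shown below as an illustration (an assumption here; any smoothing scheme that yields a valid distribution works):

$$P(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3) + 1}{C(w_1, w_2) + |V|}$$

where $C(\cdot)$ is a corpus count and $|V|$ is the vocabulary size. Because of the $+1$ in the numerator, the estimate is non-zero even for unseen trigrams.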
A dataset of movie reviews (positive and negative) is made available.
The program generates the following information:
- The number of word tokens in the dataset.
- Vocabulary size (number of unique words) of the dataset.
- Top ten bigrams and trigrams from the positive and negative review sets, along with their frequencies (see the counting sketch after this list).
- A function that, given a sequence of three words (w1, w2, w3), computes the probability of the third word under the trigram language model, p(w3|w1,w2). If you're using log-probabilities, compute logs in base 2 (a sketch follows this list).
- Five test cases (sequences of three words) showing output from your trigram language model.
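As an illustration of the first three items, here is a minimal counting sketch; `tokens_pos` and `tokens_neg` (the preprocessed token lists for the positive and negative review sets) are assumed inputs, not names from the actual script:

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide an n-wide window over the token list to form n-gram tuples.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens_all = tokens_pos + tokens_neg
print("Word tokens:", len(tokens_all))            # total token count
print("Vocabulary size:", len(set(tokens_all)))   # number of unique words

for name, tokens in [("positive", tokens_pos), ("negative", tokens_neg)]:
    print(f"Top 10 {name} bigrams:", Counter(ngrams(tokens, 2)).most_common(10))
    print(f"Top 10 {name} trigrams:", Counter(ngrams(tokens, 3)).most_common(10))
```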
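Continuing that sketch, the required p(w3|w1,w2) function could look like the following, assuming the add-one smoothing shown above (the counters and `tokens_all` come from the counting sketch; the five test triples are illustrative placeholders, not prescribed inputs):

```python
import math
from collections import Counter

trigram_counts = Counter(ngrams(tokens_all, 3))
bigram_counts = Counter(ngrams(tokens_all, 2))
vocab_size = len(set(tokens_all))

def log2_trigram_prob(w1, w2, w3):
    # Add-one smoothing keeps the probability non-zero for unseen
    # trigrams and out-of-vocabulary words.
    p = (trigram_counts[(w1, w2, w3)] + 1) / (bigram_counts[(w1, w2)] + vocab_size)
    return math.log2(p)  # base-2 log, as the spec requires

# Five illustrative test cases:
for triple in [("this", "movie", "was"), ("one", "of", "the"),
               ("waste", "of", "time"), ("acting", "was", "terrible"),
               ("a", "great", "film")]:
    print(triple, "->", log2_trigram_prob(*triple))
```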
Replaced HTML line-break tags (`<br>`, `<br />`, and `<br/>`) with `\n`.
We used a regex to tokenize sentences into words. Regex used: `[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w-]+`
Removed stop words.
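A minimal sketch of these three preprocessing steps, using the regex above; the actual stop-word list is not specified here, so `stop_words` is an assumed parameter:

```python
import re

BR_RE = re.compile(r"<br\s*/?>")  # matches <br>, <br />, and <br/>
TOKEN_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w-]+")

def preprocess(text, stop_words):
    text = BR_RE.sub("\n", text)     # replace HTML line breaks with newlines
    tokens = TOKEN_RE.findall(text)  # tokenize with the regex above
    return [t for t in tokens if t.lower() not in stop_words]

# Example call with a toy stop-word set (the real list may differ):
tokens = preprocess("Great plot<br />and acting", {"and", "the", "a"})
```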
Usage: `python vocabulary-compute.py <path-to-directory-containing-data>`
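For illustration, the entry point might consume the directory argument as sketched below; the file layout (`*.txt` files under the given directory) is an assumption, not necessarily how `vocabulary-compute.py` is organized:

```python
import sys
from pathlib import Path

if __name__ == "__main__":
    data_dir = Path(sys.argv[1])  # <path-to-directory-containing-data>
    reviews = [p.read_text(encoding="utf-8") for p in data_dir.rglob("*.txt")]
    print(f"Loaded {len(reviews)} review files from {data_dir}")
```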