WordFigures



WordFigures symbolizes the transformation of simple text into structured representations and advanced large language models (LLMs), reflecting the art of shaping language as it evolves from basic to complex forms.

This repository takes you on a structured journey through Natural Language Processing (NLP) and Large Language Models (LLMs). It starts with fundamental text preprocessing techniques and gradually builds up to advanced deep learning models.

The project is organized into several notebooks, each dedicated to a specific topic or technique, making it easy to follow and learn. Whether you're a beginner or looking to deepen your understanding, WordFigures provides a step-by-step guide to mastering NLP and LLMs.

Contents

Part 1: Text Processing and Word Representations

  • Text Preprocessing: Cleaning, tokenization, stemming, lemmatization, stop word removal, spell checking, etc.
  • Text Representation: Bag-of-words, TF-IDF, word embeddings (Word2Vec, GloVe), contextual embeddings (BERT, ELMo), etc.
  • Syntax and Grammar: Parsing, part-of-speech tagging, syntactic analysis, dependency parsing, constituency parsing, etc.
  • Semantics: Named entity recognition, semantic role labeling, word sense disambiguation, semantic similarity, etc.
  Notebook: Text Preprocessing
  • Description: Demonstrates basic NLP techniques to preprocess and clean text data.
  • Techniques Covered:
    1. Tokenization
    2. Stemming
    3. Lemmatization
    4. Removing Stopwords
    5. Bag of Words
    6. TF-IDF
  • Purpose: Lays the foundation for text data preprocessing, transforming raw text into structured input for further analysis (a minimal sketch follows this list).
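
Below is a minimal, hedged sketch of the pipeline above using NLTK and scikit-learn; the sample sentences are placeholders, not data from the notebook.

```python
# Minimal preprocessing sketch using NLTK and scikit-learn (sample sentences are illustrative).
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nltk.download("punkt")
nltk.download("wordnet")
nltk.download("stopwords")

corpus = ["Cats are chasing the mice.", "A mouse was chased by the cats."]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

cleaned = []
for doc in corpus:
    tokens = word_tokenize(doc.lower())                     # 1. tokenization
    tokens = [t for t in tokens if t.isalpha()]             # drop punctuation
    tokens = [t for t in tokens if t not in stop_words]     # 4. stopword removal
    print([stemmer.stem(t) for t in tokens])                # 2. stemming
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]      # 3. lemmatization
    cleaned.append(" ".join(lemmas))

bow = CountVectorizer().fit_transform(cleaned)              # 5. Bag of Words
tfidf = TfidfVectorizer().fit_transform(cleaned)            # 6. TF-IDF
print(bow.toarray())
print(tfidf.toarray())
```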
  Notebook: Word2Vec Embeddings
  • Description: Demonstrates the Word2Vec model for learning word embeddings.
  • Key Features:
    • Skip-gram and CBOW architectures
    • Training a Word2Vec model on sample text
    • Visualizing word relationships using t-SNE
  • Purpose: Explains how Word2Vec captures contextual relationships in text data (a minimal sketch follows).
  • Reference: Bengio et al. (2003), "A Neural Probabilistic Language Model" (neural-probabilistic-lang-model-bengi003a.pdf)
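
A minimal sketch of Word2Vec training with gensim; the toy corpus and hyperparameters below are illustrative and not taken from the notebook.

```python
# Minimal Word2Vec sketch with gensim (toy corpus; hyperparameters are illustrative).
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["dogs", "and", "cats", "are", "pets"],
]

# sg=1 selects the skip-gram architecture; sg=0 (the default) selects CBOW.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=200)

print(model.wv["king"][:5])                    # first 5 dimensions of the embedding
print(model.wv.most_similar("king", topn=3))   # nearest neighbours in embedding space

# For t-SNE visualization, project model.wv.vectors with sklearn.manifold.TSNE
# and scatter-plot the 2-D coordinates labelled by model.wv.index_to_key.
```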

Part 2: NLP Applications

  • Text Classification: Sentiment analysis, topic modeling, document categorization, spam detection, intent recognition, etc.
  • Information Extraction: Named entity extraction, relation extraction, event extraction, entity linking, etc.
  • Machine Translation: Neural machine translation, statistical machine translation, alignment models, sequence-to-sequence models, etc.
  • Question Answering: Document-based QA, knowledge-based QA, open-domain QA, reading comprehension, etc.
  • Text Generation: Language modeling, text summarization, dialogue systems, chatbots, text completion, etc.
  • Text Mining and Analytics: Topic extraction, sentiment analysis, trend detection, text clustering, opinion mining, etc.
  Notebook: Transformer-Based Sentiment Analysis
  • Description: Implements sentiment analysis using Transformer-based models.
  • Techniques Covered:
    • TFDistilBertForSequenceClassification
    • Fine-tuning with TFTrainer and TFTrainingArguments
  • Purpose: Demonstrates how to classify text data into sentiment categories using state-of-the-art Transformer architectures (a hedged sketch follows).
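
A hedged sketch of the setup: the notebook relies on TFTrainer and TFTrainingArguments (available in older transformers releases), while the equivalent Keras compile/fit loop below works with current releases. The texts and labels are placeholders.

```python
# Hedged sketch: DistilBERT sentiment classification fine-tuned with Keras (placeholder data).
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = TFDistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

texts = ["I loved this movie!", "Terrible, a complete waste of time."]  # placeholder data
labels = [1, 0]                                                         # 1 = positive, 0 = negative

enc = tokenizer(texts, truncation=True, padding=True, return_tensors="tf")
train_ds = tf.data.Dataset.from_tensor_slices((dict(enc), labels)).batch(2)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(train_ds, epochs=1)

preds = tf.argmax(model(dict(enc)).logits, axis=-1)  # predicted sentiment class per text
print(preds.numpy())
```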
  Notebook: Fake News Classifier (LSTM)
  • Description: Builds a fake news classifier using an LSTM-based deep learning model.
  • Key Features:
    • Text preprocessing for fake news detection
    • LSTM architecture for sequence modeling
  • Purpose: Highlights the application of recurrent neural networks for sequence classification tasks (a minimal sketch follows).
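
A minimal sketch of such an LSTM classifier in Keras; the headlines, labels, and hyperparameters are placeholders, and TextVectorization stands in for whatever tokenization the notebook actually uses.

```python
# Minimal LSTM sequence-classifier sketch (placeholder data and hyperparameters).
import numpy as np
import tensorflow as tf

headlines = ["scientists confirm water found on mars",
             "celebrity secretly an alien says source"]    # placeholder headlines
labels = np.array([0, 1])                                   # 0 = real, 1 = fake

vocab_size, max_len = 5000, 20
vectorize = tf.keras.layers.TextVectorization(max_tokens=vocab_size,
                                               output_sequence_length=max_len)
vectorize.adapt(headlines)
seqs = vectorize(headlines)                                  # padded integer sequences

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),               # word embeddings
    tf.keras.layers.LSTM(100),                                # sequence model
    tf.keras.layers.Dense(1, activation="sigmoid"),           # binary real/fake output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(seqs, labels, epochs=2, verbose=0)

print(model.predict(seqs).ravel())                            # probability of "fake" per headline
```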

Part 3: Transformer Models and Fine-Tuning

  • The Transformer model, introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017), is a groundbreaking neural network architecture that has become the foundation of modern Natural Language Processing (NLP) and of many advances in Artificial Intelligence (AI).
  • Model Fine-tuning: If the performance of the model is not satisfactory, consider refining the model architecture, adjusting hyperparameters, or applying techniques such as regularization to improve performance.

Transformers documentation: huggingface.co/docs/transformers/


  • Step 1: Install the transformers library.
  • Step 2: Load the pretrained model.
  • Step 3: Load the tokenizer that matches the pretrained model and encode the text (e.g., for a seq2seq task such as translation).
  • Step 4: Convert the encodings into Dataset objects (TensorFlow and PyTorch each use their own tensor and dataset formats).
  • Step 5: Translate and decode the elements in batches (a hedged sketch follows this list).
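
A hedged sketch of these steps; the checkpoint Helsinki-NLP/opus-mt-en-fr and the sentences are example choices, not necessarily those used in the notebook.

```python
# Hedged sketch of steps 2-5: batch translation with a pretrained seq2seq checkpoint.
# Step 1: pip install transformers sentencepiece torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-fr"                      # example English-to-French model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)      # step 2: pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_name)          # step 3: matching tokenizer

batch = ["Hello, how are you?", "The weather is nice today."]
inputs = tokenizer(batch, return_tensors="pt", padding=True)   # steps 3-4: PyTorch tensors
                                                               # (use return_tensors="tf" / tf.data for TensorFlow)
outputs = model.generate(**inputs)                             # step 5: translate the batch
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```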
  Notebook: BERT Fine-Tuning
  • Description: Demonstrates fine-tuning BERT for specific NLP applications.
  • Techniques Covered:
    • Customizing BERT for downstream tasks
    • Training and evaluating the fine-tuned model
  • Purpose: Showcases the versatility of BERT for domain-specific applications through transfer learning (a hedged sketch follows).
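
A hedged sketch of a BERT fine-tuning loop using the Hugging Face Trainer API; the toy dataset, label set, and training arguments are placeholders, and the notebook's actual task may differ.

```python
# Hedged sketch: fine-tuning BERT for sequence classification with the Trainer API (placeholder data).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Placeholder dataset; replace with the downstream task's real training data.
data = Dataset.from_dict({"text": ["great product", "awful service"], "label": [1, 0]})
data = data.map(lambda x: tokenizer(x["text"], truncation=True,
                                    padding="max_length", max_length=32))

args = TrainingArguments(output_dir="bert-finetuned",
                         num_train_epochs=1,
                         per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=data, eval_dataset=data)
trainer.train()                 # customize BERT for the downstream task
print(trainer.evaluate())       # evaluate the fine-tuned model
```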

Part 4: LLM, GPT

  1. Large Language Models (LLMs)
  2. Frameworks for building LLM applications: LangChain, LlamaIndex (see the sketch after this list)
  3. Fine-tuning LLM models on larger datasets
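
As an illustration of point 2, a hedged sketch of a minimal LangChain prompt chain; LangChain's API changes between releases, so the classic PromptTemplate/LLMChain pattern below (which assumes an OPENAI_API_KEY in the environment) may need adjusting for newer versions.

```python
# Hedged sketch: a single prompt -> LLM chain with LangChain (classic API).
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI        # assumes OPENAI_API_KEY is set in the environment
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in two sentences for a beginner.",
)
llm = OpenAI(temperature=0.2)            # any LangChain-supported LLM backend works here
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(topic="word embeddings"))
```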


