Meduzzen & U2D AI Internship

Environment

To work on these challenges, I will use the VS Code Jupyter extension, and Google Colab whenever I need to build more sophisticated solutions that require more computational resources.

Projects Overview

I'd like to try my hand at all three challenges (NLP, email classification/extraction, and object detection), but as a computational linguistics student I am particularly interested in NLP.

🗣️ NLP

Below is the list of Python NLP modules I use in my work as a computational linguist. However, not all of them fit our requirements for this internship, mainly because some of them support only a limited number of languages.

Useful libraries for solving NLP problems:

| library | description | fits our needs |
| --- | --- | --- |
| stanza | Collection of tools for linguistic analysis created by the Stanford NLP team. Supports multiple languages. For example, for the NER task, stanza has pre-trained models for 34 languages. | |
| SpaCy | Powerful Python NLP library with support for 70+ languages. It has built-in word vectors and tools for tokenization, NER, POS-tagging, dependency parsing, text classification, lemmatization, morphological analysis, etc. | |
| pymorphy2 | Morphological analyser written in Python. It is used for fetching information about the grammatical properties of a particular word (POS, case, gender, number). Supports only two languages: Ukrainian and Russian. | |
| gensim | Python library for topic modelling, document indexing and similarity retrieval with large corpora. | |
| fasttext | Useful Python library for working with word embeddings. Contains pre-trained word vectors for 157 (!) languages. | |
| langdetect | Niche Python library designed exclusively for the language detection task. Able to detect 55 languages. | |
| Polyglot | Supports various multilingual applications and offers a wide range of analyses. Applications: language detection, tokenization, NER, POS-tagging, sentiment analysis. | |
| nltk | Suite of libraries and programs, written in Python, for symbolic and statistical NLP for English. | |

I personally prefer spaCy and, to a lesser extent, stanza for their versatility and overall accuracy across different tasks. When dealing with word vectors, I use fasttext. Whenever I need language detection, I use langdetect. For example, I recently worked on a hate speech project and had to filter for posts written in Ukrainian only.
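
The language filtering step mentioned above could look roughly like the sketch below. It is only an illustration, not the code from the hate speech project: the helper names `make_langdetect_detector` and `filter_by_language` are made up for this example, and the detector is passed in as a function so that the filtering logic can be tried out even without langdetect installed.

```python
from typing import Callable, Iterable, List


def make_langdetect_detector(seed: int = 0) -> Callable[[str], str]:
    """Return a function that maps a text to a language code using langdetect.

    The import is lazy, so the rest of this sketch works without the package.
    """
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic by default; a fixed seed makes runs repeatable.
    DetectorFactory.seed = seed

    def detect_or_unknown(text: str) -> str:
        try:
            return detect(text)
        except LangDetectException:
            # Raised when the text has no usable features (e.g. empty or digits only).
            return "unknown"

    return detect_or_unknown


def filter_by_language(
    posts: Iterable[str], target_lang: str, detect_fn: Callable[[str], str]
) -> List[str]:
    """Keep only the posts whose detected language code equals target_lang."""
    return [post for post in posts if detect_fn(post) == target_lang]


# Possible usage (requires `pip install langdetect`):
# detector = make_langdetect_detector()
# ukrainian_posts = filter_by_language(all_posts, "uk", detector)
```

Short posts are the hard case for any statistical language detector, so in practice it may be worth reviewing a sample of the filtered output by hand.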

💌 Email classification

Tasks related to email classification and extraction deal primarily with text data, so the libraries listed in the NLP section will come in handy. For email classification we can use the sklearn and tensorflow/keras libraries.

| library | description | fits our needs |
| --- | --- | --- |
| Scikit-learn | Open-source Python library which includes implementations of many traditional ML algorithms. | |
| TensorFlow | Open-source framework for prototyping and assessing machine learning models, primarily neural networks. | |

TensorFlow and Scikit-learn can be used for object detection and NLP as well. For instance, TensorFlow CNNs come in handy when working with images/video, while for NLP problems RNNs and LSTMs are often used.
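
As a rough illustration of the sklearn route for email classification, here is a minimal baseline sketch: TF-IDF features fed into a Naive Bayes classifier. The training emails, the `spam`/`ham` labels and the helper name `train_email_classifier` are all invented for this example; a real solution would need a proper labelled dataset and a held-out evaluation set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, made-up training data for illustration only.
train_emails = [
    "Win a free prize now, claim your money",
    "Limited offer: free money, click now",
    "Meeting moved to noon, agenda attached",
    "Please review the project report before the meeting",
]
train_labels = ["spam", "spam", "ham", "ham"]


def train_email_classifier(texts, labels):
    """Fit a TF-IDF + multinomial Naive Bayes pipeline on raw email texts."""
    model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
    model.fit(texts, labels)
    return model


classifier = train_email_classifier(train_emails, train_labels)

# Classify two unseen emails; the result is an array with one label per input.
predictions = classifier.predict(["Claim your free prize", "Project meeting agenda"])
print(list(predictions))
```

A linear baseline like this is cheap to train and gives a reference point before moving on to tensorflow/keras models.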

📹 Object detection (CV)

Useful Python packages for image & video data processing:

| library | description | fits our needs |
| --- | --- | --- |
| OpenCV | CV library focused on real-time applications. The library has a modular structure and includes several hundred computer vision algorithms. | |
| Scikit-Image | Image processing toolbox for SciPy; includes a collection of algorithms for image processing. | |
| matplotlib | Library for creating static, animated and interactive visualisations. | |
| Pillow | Contains all the basic image processing functionality; intuitive and easy to use. | |
| numpy | While not specifically a CV library, numpy provides powerful data structures and algorithms for easy image data manipulation. | |
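
To make the numpy entry more concrete, the sketch below treats an image as a plain `(height, width, 3)` array and applies a few basic operations to it. The image is synthetic, and the helper names `to_grayscale` and `center_crop` are made up for this example; with a real file, the array could be obtained with Pillow via `np.asarray(Image.open(path))`.

```python
import numpy as np


def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 RGB array to (H, W) uint8 grayscale.

    Uses the common ITU-R BT.601 luma weights.
    """
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb.astype(np.float64) @ weights
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


def center_crop(image: np.ndarray, height: int, width: int) -> np.ndarray:
    """Cut a (height, width) window out of the middle of the image."""
    top = (image.shape[0] - height) // 2
    left = (image.shape[1] - width) // 2
    return image[top:top + height, left:left + width]


# Synthetic 4x6 RGB image: left half pure red, right half pure white.
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[:, :3] = (255, 0, 0)
image[:, 3:] = (255, 255, 255)

gray = to_grayscale(image)          # shape (4, 6)
flipped = image[:, ::-1]            # horizontal mirror, done with slicing only
cropped = center_crop(gray, 2, 2)   # 2x2 window spanning the red/white border
```

Because images are just arrays here, the same slicing and arithmetic carry over to frames read with OpenCV or scikit-image.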

👨‍💻 Author

Kyrylo Klychliiev
Kyiv, Ukraine