TatjanaChernenko/README.md

Unraveling mysteries hidden within datasets, a relentless data detective, transforming chaos into knowledge.

Introduction

  • 👋 Hi, I’m @TatjanaChernenko
  • 👀 I’m interested in Data Science, ML/DL and NLP.
  • 📫 How to reach me: tatjana.chernenko.work@gmail.com
  • 📁 New Public Repository: This new public GitHub profile contains both my older projects (from approx. 2015 onwards) and my recent ones, uploaded now after years of working in a private capacity due to the privacy policies of my employers.
  • 📁 Project Uploads: All projects uploaded here are from my personal endeavors and university research. Due to privacy policies at SAP SE, where I am employed, I am unable to share work-related projects publicly. These repositories exclusively feature my private projects and are newly uploaded to this fresh GitHub profile. Thank you for your understanding.

Table of Contents

My Projects

  • Research Repositories

  1. CHERTOY: Word Sense Induction for Web Search Result Clustering

  2. Data-to-text Generation

  3. Text Summarization with LexRank

Predictive Maintenance (RUL, failure prediction, maintenance)

LSTM for predictive maintenance of aircraft machines

Anomaly Detection for Time Series with IBM API (SVR), K-Means clustering, statsmodels decomposition and Fourier analysis

Game AI

Reinforcement Learning Agent for Bomberman

Speech Recognition

Speech-to-text with Transfer Learning

Data Augmentation

Data Augmentation Techniques for Classification

My Playground (smaller projects / samples)

Inspiration

Industrial research

My Projects

Research Repositories

NLP / ML

Whitepaper - link

Key words: word sense induction, web search results clustering, ML, NLP, word2vec, sent2vec, data science, data processing.

  • 2018 Data-to-text: Natural Language Generation from structured inputs - This project investigates generating image descriptions that focus on the spatial relationships between objects and on sufficient attributes for those objects. Using an encoder-decoder architecture with LSTM cells (based on Dong et al. (2017)), the system transforms normalized vector representations of attributes into fixed-length vectors. These vectors serve as initial states for a decoder generating target sentences from sequences in description sentences.

Whitepaper - link

Key words: natural language generation, encoder-decoder, ML, NLP, data science, feed-forward neural network, LSTMs.

  • 2018 Text Summarization research: Optimizing LexRank system with ECNU features - enhancing the LexRank-based text summarization system by incorporating semantic similarity measures from the ECNU system. The LexRank-based text summarization system employs a stochastic graph-based method to compute the relative importance of textual units for extractive multi-document text summarization. This implementation initially utilizes cosine similarity between sentences as a key metric. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences. The objective is to explore the impact of replacing cosine similarity with a combination of features from the ECNU system, known for its semantic similarity measure. This modification aims to improve the summarization effectiveness of the LexRank approach.

Whitepaper - link

Key words: natural language processing, text summarization, ML, NLP, data science, LexRank, ECNU, semantic similarity metrics, multi-document text summarization, cosine similarity, connectivity matrix, optimization.
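The graph-based ranking described above can be sketched in a few lines. This is a minimal illustration, not the code from the repository: it assumes whitespace tokenization, a binary connectivity matrix with a similarity threshold, and a PageRank-style power iteration. The function names, the `threshold` and `damping` values, and the pluggable `similarity` argument (the place where an ECNU-style semantic similarity could be swapped in) are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank(sentences, similarity=cosine_similarity,
            threshold=0.1, damping=0.85, iterations=100):
    """Score sentences by centrality in a sentence-similarity graph.

    `similarity` is a hypothetical hook: any function taking two Counters
    and returning a float can replace plain cosine similarity.
    """
    vectors = [Counter(s.lower().split()) for s in sentences]
    n = len(vectors)
    # Connectivity matrix: 1.0 where two sentences are similar enough.
    adj = [[1.0 if similarity(vectors[i], vectors[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    # Row-normalize so each row is a probability distribution.
    for row in adj:
        total = sum(row)
        for j in range(n):
            row[j] = row[j] / total if total else 1.0 / n
    # Power iteration with damping, starting from a uniform distribution.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[i] * adj[i][j] for i in range(n))
                  for j in range(n)]
    return scores


if __name__ == "__main__":
    docs = ["cats chase mice", "cats sleep often",
            "mice eat cheese", "stocks fell today"]
    for sentence, score in zip(docs, lexrank(docs)):
        print(f"{score:.3f}  {sentence}")
```

In an extractive summarizer, the highest-scoring sentences would then be selected for the summary. Keeping the similarity function as a parameter makes it straightforward to compare cosine similarity against another measure on the same graph construction.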

  • 2019, Reinforcement Learning agent for Bomberman game - Training an RL agent for the multi-player game Bomberman using reinforcement learning: deep Q-learning with a dueling network architecture, separate decision and target networks, and prioritized experience replay.

Whitepaper - link

Key words: reinforcement learning, q-learning.
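The three ingredients named above (dueling architecture, a separate target network, prioritized experience replay) each reduce to a small formula. The sketch below shows only those formulas in plain Python, without any neural network; it is an illustration under standard textbook definitions, not the agent's actual code. All function names and the default values for `gamma`, `alpha`, and `eps` are assumptions.

```python
def dueling_q_values(state_value, advantages):
    """Combine a state value V(s) and per-action advantages A(s, a) into
    Q-values, subtracting the mean advantage as in the dueling architecture."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


def td_target(reward, next_q_target, gamma=0.99, done=False):
    """Bootstrapped target computed from the *target* network's Q-values
    for the next state, so the decision network chases a slowly moving goal."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def replay_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities for prioritized experience replay:
    transitions with larger absolute TD error are replayed more often."""
    priorities = [(abs(err) + eps) ** alpha for err in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


if __name__ == "__main__":
    print(dueling_q_values(1.0, [0.5, -0.5, 0.0]))
    print(td_target(1.0, [0.2, 0.7], gamma=0.9))
    print(replay_probabilities([1.0, 3.0], alpha=1.0, eps=0.0))
```

In a full agent, `state_value` and `advantages` would be the two output heads of the network, and the probabilities would drive which stored transitions are sampled for each training batch.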

Report - link

Key words: transfer learning, automated speech translation

  • 2018, Data Augmentation techniques for binary- and multi-label classification - Exploring data augmentation techniques (Thesaurus and Backtranslation, a winning Kaggle technique) to expand existing datasets, evaluated on binary and multi-label classification tasks (spam/not spam and news article classification). Data augmentation matters when training data is limited, especially in Machine Learning (ML) or Deep Learning (DL) applications. The core idea is to alter text while retaining its meaning, which increases the dataset's diversity.

Key words: data augmentation, data science, ML, DL, binary and multi-class classification
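The thesaurus idea can be illustrated with a toy example. This is a minimal sketch, assuming a hand-written synonym dictionary and whitespace tokenization; `TOY_THESAURUS`, `thesaurus_augment`, and `replace_prob` are hypothetical names, and a real setup would draw synonyms from a proper lexical resource. Backtranslation is not shown because it requires a translation model.

```python
import random

# Hypothetical toy thesaurus; a real one would come from a lexical resource.
TOY_THESAURUS = {
    "cheap": ["inexpensive", "affordable"],
    "buy": ["purchase"],
    "now": ["today"],
}


def thesaurus_augment(text, thesaurus, replace_prob=0.5, rng=None):
    """Return a variant of `text` in which known words are replaced by a
    randomly chosen synonym with probability `replace_prob`."""
    rng = rng or random.Random()
    augmented = []
    for token in text.split():
        synonyms = thesaurus.get(token.lower())
        if synonyms and rng.random() < replace_prob:
            augmented.append(rng.choice(synonyms))
        else:
            augmented.append(token)
    return " ".join(augmented)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    original = "buy cheap watches now"
    for _ in range(3):
        print(thesaurus_augment(original, TOY_THESAURUS, rng=rng))
```

Each generated variant keeps the label of the original example (for instance, spam stays spam), which is what lets the augmented texts be added to the training set.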

LSTM for predictive maintenance of aircraft machines

Anomaly Detection for Time Series with IBM API (SVR), K-Means clustering, statsmodels decomposition and Fourier analysis

Text Categorisation Task with ML (Reuters)

(coming soon)

Playground

EDA (Exploratory Data Analysis)

(further projects coming soon)

Basic NLP Examples

  • NLP examples - Jupyter Notebook with data preprocessing, top words, word cloud, frequencies, AgglomerativeClustering, PCA, Sentiment analysis, Topic Detection
  • REGEX examples - simple summary of regex examples in Jupyter Notebook.

Databases, SQL, NoSQL, web scraping, email notifications

Various ML tasks

(coming soon)

Apps with ChatGPT and OpenAI

  • OpenAI basic app - an update of the basic OpenAI sample app for generating pet names, adapted to the OpenAI code changes (January 2024)
  • fork: GPT Chatbot - customizable (https://github.com/TatjanaChernenko/customizable-gpt-chatbot) - A dynamic, scalable AI chatbot built with Django REST framework, supporting custom training from PDFs, documents, websites, and YouTube videos. Leveraging OpenAI's GPT-3.5, Pinecone, FAISS, and Celery for seamless integration and performance.

(coming soon)

Dialogue Systems

Recommendation Systems

Own projects:

(to be uploaded soon)

Forks:

Sentiment Analysis

(to be uploaded soon)

Forks:

  • Tweet Analysis - Analyzing ChatGPT-related tweets to observe technology interest trends over time

Voice technologies (speech-to-text, speech-to-speech, text-to-speech)

Own projects: (to be uploaded soon)

Forks:

  • Whisper OpenAI - Robust Speech Recognition via Large-Scale Weak Supervision
  • WhisperX Timestamps (& Diarization) - Automatic Speech Recognition with Word-level Timestamps (& Diarization)
  • Whisper real-time - real-time speech-to-text conversion with Whisper
  • SpeechGPT - detects microphone input and converts it to text using Google's Speech Recognition API. It then opens ChatGPT and inputs the recognized text using Selenium. It can be used with a wake word, and it can also use text-to-speech to repeat ChatGPT's answer to the query.
  • Speaker Diarization Whisper - Whisper with Speaker Diarization, based on OpenAI Whisper
  • Speech-to-Text-WaveNet: End-to-end sentence level English speech recognition based on DeepMind's WaveNet and tensorflow (forked from buriburisuri)
  • Speech-to-text via Whisper and GPT-4 - transcribing dictations to text using Whisper, then fixing the resulting transcriptions into usable text using GPT-4 (forked from MNoichl)
  • TensorFlow Speech Recognition - audio processing and speech classification with TensorFlow - convolutional neural networks (forked from harshel)
  • Watson_STT_CustomModel - a custom speech model using IBM Watson Speech to Text; an old one (approx. 2018)
  • Simple Speech Recognition with Python - very simple setup using SpeechRecognition Python module
  • CTTS - Controllable Text-to-speech system, based on Microsoft's FastSpeech2
  • Google Sheets to Speech - Excel-to-speech, forked from Renoncio: A Python script for generating audio from a list of sentences in Google Sheets.
  • StreamlitTTS - a Streamlit app that converts text to audio files using Microsoft Edge's online text-to-speech service.
  • Dolla Llama: Real-Time Co-Pilot for Closing the Deal - forked from smellslikeml; power a real-time speech-to-text agent with retrieval augmented generation based on webscraped customer use-cases, implements speech-to-text (STT) and retrieval-augmented generation (RAG) to assist live sales calls.
  • Text-to-Speech on AWS - forked from codets1989; converting text to speech with AWS Lambda and Polly and creating an automated pipeline
  • Whisper speech-to-text Telegram bot - forked from loyal-pelmen; Speech-to-Text Telegram bot
  • DeepSpeech on devices - embedded (offline, on-device) speech-to-text engine which can run in real time ranging from a Raspberry Pi 4 to high power GPU servers
  • Bash Whisper - using a Digital Voice Recorder (DVR) - Bash function to ease the transcription of audio files with OpenAI's whisper.
  • Awesome Whisper - model variants and playgrounds
  • TikTok Analyzer - Video Scraping and Content Analysis Tool. Search and download TikTok videos by username and/or video tag, and analyze video contents. Transcribe video speech to text and perform NLP analysis tasks (e.g., keyword and topic discovery; emotion/sentiment analysis). Isolate the audio signal and perform signal processing analysis tasks (e.g., pitch, prosody and sentiment analysis). Isolate the visual stream and perform image tasks (e.g., object detection; face detection).
  • SpeechBrain - an open-source PyTorch toolkit that accelerates Conversational AI development; spans speech recognition, speaker recognition, speech enhancement, speech separation, language modeling, dialogue, and beyond. Over 200 competitive training recipes on more than 40 datasets supporting 20 speech and text processing tasks. Supports both training from scratch and fine-tuning pretrained models such as Whisper, Wav2Vec2, WavLM, Hubert, GPT2, Llama2, and beyond. The models on HuggingFace can be easily plugged in and fine-tuned.
  • Speech Synthesis Markup - SSML - XML-based markup language that you can use to fine-tune your text to speech output attributes (tutorial from Microsoft).

(further projects coming soon)

NMT

(coming soon)

Computer Vision

Own projects: (to be uploaded soon)

Forks/Inspiration:

Other


My Projects

| Category | Project Title | GitHub |
| --- | --- | --- |
| Research Repositories | CHERTOY: Word Sense Induction for better web search result clustering | CHERTOY System |
| Research Repositories | Data-to-text: Natural Language Generation from structured inputs | Data-to-text Generation |
| Research Repositories | Text Summarization research: Optimizing LexRank system with ECNU features | Text Summarization with LexRank |
| Research Repositories | Reinforcement Learning agent for Bomberman game | RL Agent for Bomberman |
| Research Repositories | Speech-to-text: Transfer Learning for Automatic Speech Translation (playground) | Speech-to-text with Transfer Learning |
| Research Repositories | Data Augmentation techniques for binary- and multi-label classification | Data Augmentation Techniques |
| Predictive Maintenance | LSTM for predictive maintenance of aircraft machines: failure and RUL (remaining useful life) prediction | Predictive Maintenance with LSTM |
| Anomaly Detection | Anomaly Detection for Time Series with IBM API (SVR), K-Means clustering, statsmodels decomposition and Fourier analysis | IBM API for anomaly detection, univariate data |
| Text Categorisation | Text Categorisation Task with ML (Reuters) | Categorization task with ML Algorithms for Reuters text categorization benchmark dataset |
| Playground | Exploratory Data Analysis of Airbnb rental prices in New York, 2019 | EDA of Airbnb Prices in New York |
| Playground | Basic NLP Examples | NLP examples |
| Databases, SQL, NoSQL, web scraping, email notifications | LinkedIn web scraping, saving data to local MongoDB and CSV, filtering and updating the user via email | LinkedIn Web Scraping and Email Notifications |
| Various ML tasks | Regression Task: Predicting Airbnb rental prices in New York | Regression Task with Airbnb Data |
| Dialogue Systems | Question answering with DistilBERT | DistilBERT Question Answering |
| Dialogue Systems | Document Question Answering with LayoutLM | LayoutLM Document QA |
| Recommendation Systems | Recommendation System with TensorFlow | TensorFlow Recommenders |
| Sentiment Analysis | Sentiment Analysis | (to be uploaded soon) |
| Voice Technologies | Speech-to-Text-WaveNet | Speech-to-Text-WaveNet |
| Voice Technologies | Speech-to-text via Whisper and GPT-4 | Speech-to-text with Whisper to GPT |
| Voice Technologies | TensorFlow Speech Recognition | TensorFlow Speech Recognition |
| Voice Technologies | Watson_STT_CustomModel | Watson STT Custom Model |
| Voice Technologies | Simple Speech Recognition with Python | Simple Speech Recognition |
| Voice Technologies | CTTS | CTTS |
| Voice Technologies | Google Sheets to Speech | Google Sheets to Speech |
| Voice Technologies | StreamlitTTS | StreamlitTTS |
| Voice Technologies | Dolla Llama: Real-Time Co-Pilot for Closing the Deal | Dolla Llama |
| Voice Technologies | Text-to-Speech on AWS | Text-to-Speech on AWS |
| Voice Technologies | Whisper speech-to-text Telegram bot | Whisper Speech-to-Text Telegram Bot |
| NMT | NMT (Neural Machine Translation) | (coming soon) |

Inspiration

Different

Prediction, Time Series, Anomaly Detection

Data Science Resources

NLP Resources

Evaluation Tasks

  • Evaluate from Huggingface - Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized. Implementations of dozens of popular metrics: the existing metrics cover a variety of tasks spanning from NLP to Computer Vision
  • NMT Evaluation framework - A useful framework for evaluating and comparing different Machine Translation engines against each other on a variety of datasets.
  • FastChat - LLM chatbots evaluation platform - FastChat is an open platform for training, serving, and evaluating large language model based chatbots.
  • ParlAI - a framework for training and evaluating AI models on a variety of openly available dialogue datasets.
  • AutoGluon - useful if you prefer more control over the forecasting model exploration, training, and evaluation processes.
  • tune from Huggingface - A benchmark for comparing Transformer-based models.

Image / Video Technologies

  • Activity detection - Real-Time Spatio-Temporally Localized Activity Detection by Tracking Body Keypoints
  • Dance transfer - acquire pose estimates from a participant, train a pix2pix model, transfer source dance video, and generate a dance gif; Motion transfer booth for a 1 hour everybody dance now video generation using EdgeTPU and Tensorflow 2.0
  • Video embeddings and similarity - Training CNN model to generate image embeddings
  • Deep Fakes Detection - (2019) Repository to detect deepfakes, an open-source project as part of the AI Geeks effort.
  • Diffusers from Huggingface - Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch

Voice Technologies

  • Speech Cognitive Service - A Jupyter Notebook that details how to use Azure's Speech Cognitive Service to Translate speech
  • Audio-Speech Tutorial, 2022 - an introduction on the topic of audio and speech processing - from basics to applications (approx. 2022)
  • espnet - End-to-End Speech Processing Toolkit
  • TTS - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
  • Speech-to-text benchmark - speech-to-text benchmarking framework
  • Speech-to-text - with Whisper and Python, March 2023
  • Multilingual Text-to-Speech - Tomáš Nekvinda and Ondřej Dušek, One Model, Many Languages: Meta-Learning for Multilingual Text-to-Speech, 2020, Proc. Interspeech 2020
  • Unified Speech Tokenizer for Speech Language Models - SpeechTokenizer; SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models, Xin Zhang and Dong Zhang and Shimin Li and Yaqian Zhou and Xipeng Qiu, 2023
  • FunASR - a Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models; hopes to build a bridge between academic research and industrial applications on speech recognition. By supporting the training & finetuning of the industrial-grade speech recognition model, researchers and developers can conduct research and production of speech recognition models more conveniently, and promote the development of speech recognition ecology
  • Whisper model - OpenAI Whisper
  • Wenet - Production First and Production Ready End-to-End Speech Recognition Toolkit
  • Distilled variant of Whisper - Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
  • Fine-tune Whisper - Fine-Tune Whisper For Multilingual ASR with Transformers

Different ML Resources

Industrial research

  • OptiGuide - Large Language Models for Supply Chain Optimization
  • Generative AI lessons - 12 Lessons, Get Started Building with Generative AI
  • LLMOps Workshop - Learn how to build solutions with Large Language Models.
  • Data Science Lessons
  • AI Lessons
  • unilm - Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities.
  • Old Photo Restoration via Deep Latent Space Translation - Bringing Old Photo Back to Life (CVPR 2020 oral)
  • NNI - An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
  • From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
  • Seamless: Speech-to-speech translation (S2ST), Speech-to-text translation (S2TT), Text-to-speech translation (T2ST), Text-to-text translation (T2TT), Automatic speech recognition (ASR)
  • Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.
  • Faiss is a library for efficient similarity search and clustering of dense vectors.
  • PyTorch-BigGraph (PBG) is a distributed system for learning graph embeddings for large graphs, particularly big web interaction graphs with up to billions of entities and trillions of edges.
  • Llama 2 Fine-tuning - examples to quickly get started with fine-tuning for domain adaptation and how to run inference for the fine-tuned models. For ease of use, the examples use Hugging Face converted versions of the models.
  • Pearl - A Production-ready Reinforcement Learning AI Agent Library
  • TorchRecipes - Recipes are a standard, well supported set of blueprints for machine learning engineers to rapidly train models using the latest research techniques without significant engineering overhead.
  • fastText is a library for efficient learning of word representations and sentence classification.
  • ParlAI - a framework for training and evaluating AI models on a variety of openly available dialogue datasets.
  • Deep Learning Examples - State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.
  • NeMo: a toolkit for conversational AI
