This repository contains my submission for the Orfium AI Research Engineer Assessment. The assessment consists of two tasks: Text Normalization and Cover Song Similarity. I opted to implement the Text Normalization task in full, providing a notebook with its accompanying Text Normalization Report, and to address the Cover Song Similarity task with a Cover Song Similarity Report.
- `Text_Normalization/`:
  - `Text_Normalization.ipynb`: A Jupyter notebook demonstrating my full implementation of the text normalization task.
  - `Text_Normalization_Report.pdf`: A report detailing my approach, methodology, and findings.
  - `normalization_assesment_dataset_10k.csv`: Input dataset provided for text normalization.
  - `results.csv` & `results_extended.csv`: Output results generated by my implementation.
  - `Fine_Tuning/`:
    - `Custom_DL.ipynb`: A Jupyter notebook demonstrating the implementation of a custom PyTorch model with a transformer-like architecture for text normalization.
    - `Finetuning_with_Pretrained_Model.ipynb`: A Jupyter notebook showcasing the fine-tuning of pretrained models (BART, mT5, T5-small) for text normalization.
    - `Fine_Tuning_Report.pdf`: A report describing the fine-tuning experiments, including methodology, results, and insights.
- `Cover_Song_Similarity/`:
  - `Da_Tacos.ipynb`: A notebook exploring the metadata of the DA-TACOS dataset.
  - `da-tacos_metadata/`: Metadata files from the DA-TACOS dataset.
  - `Cover_Song_Similarity_Report.pdf`: A detailed report on my proposed methodology for the cover song similarity task.
- `Hiring_Assignment.pdf`: The original assessment brief and guidelines.
- `README.md`: Documentation for the repository.
This project addresses the challenge of text normalization for composition writer data, utilizing a hybrid approach combining rule-based models with Large Language Models (LLMs). The goal was to process ambiguous, inconsistent, and non-English text, ensuring cleaner and more reliable data for prediction tasks. This hybrid approach improves text normalization quality while handling edge cases and diverse languages more effectively.
Text normalization for composition writer data involves cleaning raw information that contains unnecessary details, inconsistencies, and ambiguity. The challenge was to remove these extraneous elements and standardize the data, focusing primarily on preserving writer names while ensuring high accuracy for predictions in the context of diverse languages and scripts.
- Investigated the missing values (approximately 1,400 entries in the Ground Truth column), which required handling for consistent data processing.
- Analyzed text length distribution to understand variability in text size and ensure the model handles both short and long texts effectively.
- Identified a mix of Latin and non-Latin text, which influenced preprocessing strategies for handling diverse languages.
- Examined N-gram patterns (bigram, trigram, and 4-gram analysis) to capture contextual relationships in text and improve feature engineering.
- Implemented a function to identify and flag specific symbols (e.g., `<Unknown>`) for special handling during text processing (see the sketch below).
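A minimal sketch of the kind of exploratory checks described above, assuming pandas and scikit-learn; the column names `raw_text` and `ground_truth` are placeholders rather than the actual dataset schema:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("normalization_assesment_dataset_10k.csv")

# Placeholder column names; adjust to the actual dataset schema.
RAW_COL, GT_COL = "raw_text", "ground_truth"

# Missing ground-truth values (roughly 1,400 in the provided dataset).
print("Missing ground truth:", df[GT_COL].isna().sum())

# Text length distribution of the raw column.
print(df[RAW_COL].str.len().describe())

# Rows containing non-ASCII characters (a rough proxy for non-Latin scripts).
non_latin = df[RAW_COL].str.contains(r"[^\x00-\x7F]", regex=True, na=False)
print("Rows with non-Latin characters:", int(non_latin.sum()))

# Rows containing placeholder symbols such as <Unknown>, flagged for special handling.
has_unknown = df[RAW_COL].str.contains(r"<\s*unknown\s*>", case=False, regex=True, na=False)
print("Rows with <Unknown> markers:", int(has_unknown.sum()))

# Top bigrams/trigrams for a quick look at contextual patterns in the raw text.
vec = CountVectorizer(ngram_range=(2, 3))
counts = vec.fit_transform(df[RAW_COL].dropna())
top = counts.sum(axis=0).A1.argsort()[::-1][:10]
print([vec.get_feature_names_out()[i] for i in top])
```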
- Baseline Rules: Applied predefined rules to remove common stopwords and simplify text. This approach struggles with ambiguous and complex cases, especially non-English characters.
- Enhanced Rules: Built upon baseline rules, adding richer logic to handle non-standard formatting and edge cases, such as inconsistent capitalization and special characters.
- LLM-Assisted Approach: Utilized LLMs for contextual understanding, enhancing the model's ability to handle complex text with ambiguity or irregularities, especially non-English text.
- Voting Hybrid: Combined outputs from Baseline, Enhanced, and LLM models using a voting mechanism to select the final prediction, leveraging the strengths of each approach.
- RefineBoost Hybrid: Integrated rule-based models with boosting techniques to refine predictions, using LLMs to correct errors from initial predictions.
- Conditional Hybrid: Selected the best model based on the characteristics of the text, using LLMs for non-English or longer texts and simpler models for the rest, optimizing accuracy and efficiency (see the routing sketch below).
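As a rough illustration of the Conditional Hybrid routing idea (the length threshold and the helper callables below are placeholders, not the exact logic used in the notebook):

```python
def has_non_latin(text: str) -> bool:
    """Rough check: any non-ASCII character is treated as non-Latin."""
    return any(ord(ch) > 127 for ch in text)

def conditional_hybrid(text: str, enhanced_rules, llm_normalize, length_threshold: int = 80) -> str:
    """Route each input to the cheapest model expected to handle it well.

    `enhanced_rules` and `llm_normalize` are callables implementing the
    rule-based and LLM-assisted normalizers; the 80-character threshold
    is illustrative only.
    """
    if has_non_latin(text) or len(text) > length_threshold:
        return llm_normalize(text)   # ambiguous / non-English / long -> LLM
    return enhanced_rules(text)      # simple cases -> fast rule-based pass
```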
- Exact Match Percentage: Measures how well the predicted text matches the ground truth.
- Jaro-Winkler Similarity: Evaluates similarity between predicted and true values, focusing on minor variations.
- Word Error Rate (WER): Measures word-level errors, including insertions, deletions, and substitutions.
- Character Error Rate (CER): Assesses character-level discrepancies, providing finer granularity than WER.
- Execution Time: Measures the time taken by each approach, highlighting the trade-off between computational efficiency and accuracy. (A small metric-computation sketch follows this list.)
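A minimal sketch of how these metrics can be computed for a batch of predictions, assuming the `jiwer` and `jellyfish` packages; the notebook may use different implementations:

```python
import jellyfish  # pip install jellyfish
import jiwer      # pip install jiwer

def evaluate(predictions: list[str], references: list[str]) -> dict:
    """Score a batch of normalized outputs against the ground truth.

    Execution time is measured separately by timing each approach's run.
    """
    pairs = list(zip(predictions, references))
    exact = sum(p == r for p, r in pairs) / len(pairs)
    jaro = sum(jellyfish.jaro_winkler_similarity(p, r) for p, r in pairs) / len(pairs)
    return {
        "exact_match_pct": 100 * exact,
        "jaro_winkler": jaro,
        "wer": jiwer.wer(references, predictions),  # word-level insertions/deletions/substitutions
        "cer": jiwer.cer(references, predictions),  # character-level discrepancies
    }
```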
- Hybrid Models: The Conditional Hybrid approach provided the best balance of accuracy and efficiency, achieving the highest Exact Match Percentage and low WER and CER, but with higher execution time.
- LLM Models: LLM-assisted models significantly reduced errors (WER and CER) but had higher computational costs, making them suitable for tasks requiring higher accuracy, despite the longer processing time.
- Rule-Based Models: The Enhanced Rules approach offered a solid compromise between speed and accuracy, but Baseline Rules underperformed, especially with more complex text.
- Performance Trade-offs: While more advanced models like LLM-assisted and Conditional Hybrid improved accuracy, they introduced a clear trade-off in execution time. For faster applications, simpler models like Voting Hybrid or Enhanced Rules are more efficient, though they offer slightly lower accuracy.
Experimentation with a 100-row subset allowed for fast iterations and comparisons. Here are the results for each approach:
| Approach | Exact Match (%) | Jaro-Winkler | WER | CER | Time (mins) |
|---|---|---|---|---|---|
| Baseline Original | 63.00 | 0.85 | 0.39 | 0.10 | 0.00 |
| Baseline Rules | 65.00 | 0.86 | 0.38 | 0.07 | 0.01 |
| Enhanced Rules | 69.00 | 0.86 | 0.20 | 0.06 | 0.00 |
| LLM-Assisted | 75.00 | 0.93 | 0.25 | 0.07 | 11.86 |
| Voting Hybrid | 70.00 | 0.87 | 0.25 | 0.08 | 0.00 |
| RefineBoost Hybrid | 68.00 | 0.86 | 0.21 | 0.07 | 2.13 |
| Conditional Hybrid | 75.00 | 0.93 | 0.24 | 0.07 | 11.94 |
This project tackles the challenge of cover song similarity, aiming to measure how closely songs resemble each other while accounting for variations in harmonic, rhythmic, tonal, and lyrical aspects. Leveraging the DA-TACOS dataset, multiple approaches were developed to build robust similarity metrics and detection systems.
Cover song detection is a complex task, as cover versions often feature changes in musical attributes or lyrical content. This requires sophisticated methods capable of capturing both subtle and significant variations.
- Investigated metadata (e.g., title, artist, release year) and pre-extracted features (e.g., Chroma, HPCP, MFCCs).
- Enhanced the dataset by integrating external lyric data to incorporate a textual perspective into similarity evaluation.
- Combined multiple dimensions:
  - Harmonic
  - Rhythmic
  - Tonal
  - Lyrical
  - Metadata
- Computed a holistic similarity score and evaluated it using metrics like precision, recall, F1-score, and AUC to ensure robust classification (a weighted-combination sketch is shown below).
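As a rough sketch of the unified score idea (the weights and the per-dimension similarity values are placeholders; the report describes the actual formulation):

```python
# Placeholder weights per dimension; the report discusses how these are chosen.
WEIGHTS = {"harmonic": 0.3, "rhythmic": 0.2, "tonal": 0.2, "lyrical": 0.2, "metadata": 0.1}

def unified_similarity(sims: dict[str, float]) -> float:
    """Combine per-dimension similarities (each in [0, 1]) into one holistic score."""
    total = sum(WEIGHTS[k] * sims[k] for k in WEIGHTS)
    return total / sum(WEIGHTS.values())

# Example: a pair scoring high on harmonic/tonal content but low on lyrics.
score = unified_similarity(
    {"harmonic": 0.9, "rhythmic": 0.7, "tonal": 0.85, "lyrical": 0.3, "metadata": 0.5}
)
print(round(score, 3))
```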
- K-Means Clustering: Grouped songs based on broad patterns, suitable for exploratory analysis but limited in capturing nuances.
- Gradient Boosting: Classified song pairs using labeled data, balancing performance and interpretability, with precautions against overfitting.
- Siamese Networks: Learned pairwise relationships in a shared embedding space, excelling at identifying fine-grained similarities, particularly for complex cases (a minimal sketch follows this list).
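A minimal PyTorch sketch of the Siamese idea; the feature dimensions, layers, and margin are illustrative, and the report discusses the actual architecture and input features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Maps a per-song feature vector (e.g., pooled HPCP/MFCC statistics) to an embedding."""
    def __init__(self, in_dim: int = 128, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings

def contrastive_loss(z1, z2, label, margin: float = 0.5):
    """label = 1 for cover pairs, 0 for non-covers; pulls covers together, pushes others apart."""
    dist = F.pairwise_distance(z1, z2)
    return (label * dist.pow(2) + (1 - label) * F.relu(margin - dist).pow(2)).mean()

# Usage sketch: encode both songs with the *same* encoder and compare embedding distances.
encoder = SiameseEncoder()
a, b = torch.randn(8, 128), torch.randn(8, 128)   # dummy feature batches
labels = torch.randint(0, 2, (8,)).float()        # dummy cover / non-cover labels
loss = contrastive_loss(encoder(a), encoder(b), labels)
```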
- Integrating audio-based and textual features enhances similarity evaluation significantly.
- The unified similarity metric offers a scalable and accurate framework for cover song comparison.
- Siamese Networks demonstrated superior performance for nuanced tasks, while Gradient Boosting provided a reliable, interpretable alternative.
- Time:
  - Each assessment took 2 days to complete.
  - In total, I spent 4 days on both assessments before delivery, well within the 1-week deadline.
- Cost:
  - Using OpenAI's models for text normalization cost just $2, highlighting the cost-effectiveness of these tools for extensive experimentation.