This repository contains my submission for the Orfium AI Research Engineer Assessment. The assessment consists of two tasks: Text Normalization and Cover Song Similarity. I opted to implement the Text Normalization task in full, providing a notebook with its accompanying Text Normalization Report, and to address the Cover Song Similarity task with a Cover Song Similarity Report.
- `Text_Normalization/`:
  - `Text_Normalization.ipynb`: A Jupyter notebook demonstrating my full implementation of the text normalization task.
  - `Text_Normalization_Report.pdf`: A report detailing my approach, methodology, and findings.
  - `normalization_assesment_dataset_10k.csv`: Input dataset provided for text normalization.
  - `results.csv` & `results_extended.csv`: Output results generated by my implementation.
  - `Fine_Tuning/`:
    - `Custom_DL.ipynb`: A Jupyter notebook demonstrating the implementation of a custom PyTorch model with a transformer-like architecture for text normalization.
    - `Finetuning_with_Pretrained_Model.ipynb`: A Jupyter notebook showcasing the fine-tuning of pretrained models (BART, mT5, T5-small) for text normalization.
    - `Fine_Tuning_Report.pdf`: A report describing the fine-tuning experiments, including methodology, results, and insights.
- `Cover_Song_Similarity/`:
  - `Da_Tacos.ipynb`: A notebook exploring the metadata of the DA-TACOS dataset.
  - `da-tacos_metadata/`: Metadata files from the DA-TACOS dataset.
  - `Cover_Song_Similarity_Report.pdf`: A detailed report on my proposed methodology for the cover song similarity task.
- `Hiring_Assignment.pdf`: The original assessment brief and guidelines.
- `README.md`: Documentation for the repository.
This project addresses the challenge of text normalization for composition writer data, utilizing a hybrid approach combining rule-based models with Large Language Models (LLMs). The goal was to process ambiguous, inconsistent, and non-English text, ensuring cleaner and more reliable data for prediction tasks. This hybrid approach improves text normalization quality while handling edge cases and diverse languages more effectively.
Text normalization for composition writer data involves cleaning raw information that contains unnecessary details, inconsistencies, and ambiguity. The challenge was to remove these extraneous elements and standardize the data, focusing primarily on preserving writer names while ensuring high accuracy for predictions in the context of diverse languages and scripts.
- Investigated the missing values (approximately 1,400 entries in the Ground Truth column), which required handling for consistent data processing.
- Analyzed text length distribution to understand variability in text size and ensure the model handles both short and long texts effectively.
- Identified a mix of Latin and non-Latin text, which influenced preprocessing strategies for handling diverse languages.
- Examined N-gram patterns (bigram, trigram, and 4-gram analysis) to capture contextual relationships in text and improve feature engineering.
- Implemented a function to identify and flag specific symbols (e.g., `<Unknown>`) for special handling during text processing (see the sketch below).
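A minimal sketch of the kind of exploratory checks described above, assuming pandas and scikit-learn; the column names `raw_text` and `ground_truth` are placeholders rather than the actual dataset schema:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("normalization_assesment_dataset_10k.csv")

# Placeholder column names; adjust to the actual dataset schema.
RAW_COL, GT_COL = "raw_text", "ground_truth"

# Missing ground-truth values (roughly 1,400 in the provided dataset).
print("Missing ground truth:", df[GT_COL].isna().sum())

# Text length distribution of the raw column.
print(df[RAW_COL].str.len().describe())

# Rows containing non-ASCII characters (a rough proxy for non-Latin scripts).
non_latin = df[RAW_COL].str.contains(r"[^\x00-\x7F]", regex=True, na=False)
print("Rows with non-Latin characters:", int(non_latin.sum()))

# Rows containing placeholder symbols such as <Unknown>, flagged for special handling.
has_unknown = df[RAW_COL].str.contains(r"<\s*unknown\s*>", case=False, regex=True, na=False)
print("Rows with <Unknown> markers:", int(has_unknown.sum()))

# Top bigrams/trigrams for a quick look at contextual patterns in the raw text.
vec = CountVectorizer(ngram_range=(2, 3))
counts = vec.fit_transform(df[RAW_COL].dropna())
top = counts.sum(axis=0).A1.argsort()[::-1][:10]
print([vec.get_feature_names_out()[i] for i in top])
```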
- Baseline Rules: Applied predefined rules to remove common stopwords and simplify text. This approach struggles with ambiguous and complex cases, especially non-English characters.
- Enhanced Rules: Built upon baseline rules, adding richer logic to handle non-standard formatting and edge cases, such as inconsistent capitalization and special characters.
- LLM-Assisted Approach: Utilized LLMs for contextual understanding, enhancing the model's ability to handle complex text with ambiguity or irregularities, especially non-English text.
- Voting Hybrid: Combined outputs from Baseline, Enhanced, and LLM models using a voting mechanism to select the final prediction, leveraging the strengths of each approach.
- RefineBoost Hybrid: Integrated rule-based models with boosting techniques to refine predictions, using LLMs to correct errors from initial predictions.
- Conditional Hybrid: Selected the best model based on the characteristics of the text, using LLMs for non-English or longer texts and simpler models for the rest, optimizing accuracy and efficiency (see the routing sketch below).
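As a rough illustration of the Conditional Hybrid routing idea (the length threshold and the helper callables below are placeholders, not the exact logic used in the notebook):

```python
def has_non_latin(text: str) -> bool:
    """Rough check: any non-ASCII character is treated as non-Latin."""
    return any(ord(ch) > 127 for ch in text)

def conditional_hybrid(text: str, enhanced_rules, llm_normalize, length_threshold: int = 80) -> str:
    """Route each input to the cheapest model expected to handle it well.

    `enhanced_rules` and `llm_normalize` are callables implementing the
    rule-based and LLM-assisted normalizers; the 80-character threshold
    is illustrative only.
    """
    if has_non_latin(text) or len(text) > length_threshold:
        return llm_normalize(text)   # ambiguous / non-English / long -> LLM
    return enhanced_rules(text)      # simple cases -> fast rule-based pass
```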
- Exact Match Percentage: Measures how well the predicted text matches the ground truth.
- Jaro-Winkler Similarity: Evaluates similarity between predicted and true values, focusing on minor variations.
- Word Error Rate (WER): Measures word-level errors, including insertions, deletions, and substitutions.
- Character Error Rate (CER): Assesses character-level discrepancies, providing finer granularity than WER.
- Execution Time: Measures the time taken by each approach, highlighting the trade-off between computational efficiency and accuracy. (A small metric-computation sketch follows this list.)
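A minimal sketch of how these metrics can be computed for a batch of predictions, assuming the `jiwer` and `jellyfish` packages; the notebook may use different implementations:

```python
import jellyfish  # pip install jellyfish
import jiwer      # pip install jiwer

def evaluate(predictions: list[str], references: list[str]) -> dict:
    """Score a batch of normalized outputs against the ground truth.

    Execution time is measured separately by timing each approach's run.
    """
    pairs = list(zip(predictions, references))
    exact = sum(p == r for p, r in pairs) / len(pairs)
    jaro = sum(jellyfish.jaro_winkler_similarity(p, r) for p, r in pairs) / len(pairs)
    return {
        "exact_match_pct": 100 * exact,
        "jaro_winkler": jaro,
        "wer": jiwer.wer(references, predictions),  # word-level insertions/deletions/substitutions
        "cer": jiwer.cer(references, predictions),  # character-level discrepancies
    }
```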
- Hybrid Models: The Conditional Hybrid approach provided the best balance of accuracy and efficiency, achieving the highest Exact Match Percentage and low WER and CER, but with higher execution time.
- LLM Models: LLM-assisted models significantly reduced errors (WER and CER) but had higher computational costs, making them suitable for tasks requiring higher accuracy, despite the longer processing time.
- Rule-Based Models: The Enhanced Rules approach offered a solid compromise between speed and accuracy, but Baseline Rules underperformed, especially with more complex text.
- Performance Trade-offs: While more advanced models like LLM-assisted and Conditional Hybrid improved accuracy, they introduced a clear trade-off in execution time. For faster applications, simpler models like Voting Hybrid or Enhanced Rules are more efficient, though they offer slightly lower accuracy.
Experimentation with a 100-row subset allowed for fast iterations and comparisons. Here are the results for each approach:
| Approach | Exact Match (%) | Jaro-Winkler | WER | CER | Time (mins) |
|---|---|---|---|---|---|
| Baseline Original | 63.00 | 0.85 | 0.39 | 0.10 | 0.00 |
| Baseline Rules | 65.00 | 0.86 | 0.38 | 0.07 | 0.01 |
| Enhanced Rules | 69.00 | 0.86 | 0.20 | 0.06 | 0.00 |
| LLM-Assisted | 75.00 | 0.93 | 0.25 | 0.07 | 11.86 |
| Voting Hybrid | 70.00 | 0.87 | 0.25 | 0.08 | 0.00 |
| RefineBoost Hybrid | 68.00 | 0.86 | 0.21 | 0.07 | 2.13 |
| Conditional Hybrid | 75.00 | 0.93 | 0.24 | 0.07 | 11.94 |
This project tackles the challenge of cover song similarity, aiming to measure how closely songs resemble each other while accounting for variations in harmonic, rhythmic, tonal, and lyrical aspects. Leveraging the DA-TACOS dataset, multiple approaches were developed to build robust similarity metrics and detection systems.
Cover song detection is a complex task, as cover versions often feature changes in musical attributes or lyrical content. This requires sophisticated methods capable of capturing both subtle and significant variations.
- Investigated metadata (e.g., title, artist, release year) and pre-extracted features (e.g., Chroma, HPCP, MFCCs).
- Enhanced the dataset by integrating external lyric data to incorporate a textual perspective into similarity evaluation.
- Combined multiple dimensions:
  - Harmonic
  - Rhythmic
  - Tonal
  - Lyrical
  - Metadata
- Computed a holistic similarity score and evaluated it using metrics like precision, recall, F1-score, and AUC to ensure robust classification (a weighted-combination sketch is shown below).
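As a rough sketch of the unified score idea (the weights and the per-dimension similarity values are placeholders; the report describes the actual formulation):

```python
# Placeholder weights per dimension; the report discusses how these are chosen.
WEIGHTS = {"harmonic": 0.3, "rhythmic": 0.2, "tonal": 0.2, "lyrical": 0.2, "metadata": 0.1}

def unified_similarity(sims: dict[str, float]) -> float:
    """Combine per-dimension similarities (each in [0, 1]) into one holistic score."""
    total = sum(WEIGHTS[k] * sims[k] for k in WEIGHTS)
    return total / sum(WEIGHTS.values())

# Example: a pair scoring high on harmonic/tonal content but low on lyrics.
score = unified_similarity(
    {"harmonic": 0.9, "rhythmic": 0.7, "tonal": 0.85, "lyrical": 0.3, "metadata": 0.5}
)
print(round(score, 3))
```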
- K-Means Clustering: Grouped songs based on broad patterns, suitable for exploratory analysis but limited in capturing nuances.
- Gradient Boosting: Classified song pairs using labeled data, balancing performance and interpretability, with precautions against overfitting.
- Siamese Networks: Learned pairwise relationships in a shared embedding space, excelling at identifying fine-grained similarities, particularly for complex cases (a minimal sketch follows this list).
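A minimal PyTorch sketch of the Siamese idea; the feature dimensions, layers, and margin are illustrative, and the report discusses the actual architecture and input features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Maps a per-song feature vector (e.g., pooled HPCP/MFCC statistics) to an embedding."""
    def __init__(self, in_dim: int = 128, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings

def contrastive_loss(z1, z2, label, margin: float = 0.5):
    """label = 1 for cover pairs, 0 for non-covers; pulls covers together, pushes others apart."""
    dist = F.pairwise_distance(z1, z2)
    return (label * dist.pow(2) + (1 - label) * F.relu(margin - dist).pow(2)).mean()

# Usage sketch: encode both songs with the *same* encoder and compare embedding distances.
encoder = SiameseEncoder()
a, b = torch.randn(8, 128), torch.randn(8, 128)   # dummy feature batches
labels = torch.randint(0, 2, (8,)).float()        # dummy cover / non-cover labels
loss = contrastive_loss(encoder(a), encoder(b), labels)
```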
- Integrating audio-based and textual features enhances similarity evaluation significantly.
- The unified similarity metric offers a scalable and accurate framework for cover song comparison.
- Siamese Networks demonstrated superior performance for nuanced tasks, while Gradient Boosting provided a reliable, interpretable alternative.
- Time:
  - Each assessment took 2 days to complete.
  - In total, I spent 4 days on both assessments before delivery, well within the 1-week deadline.
- Cost:
  - Using OpenAI's models for text normalization cost just $2, highlighting the cost-effectiveness of these tools for extensive experimentation.