
Music Lyrics Ontology and Knowledge Graph (ML)

This repository contains the content of our final project for Knowledge Engineering, tackling the integration of lyrics data into the Polifonia music ontologies. Here you will find the lyrics ontology module we created, the data we used (with explanations of how we obtained and standardized it), and the SPARQL queries used to create our Knowledge Graph.

Important

We advise you to read this whole file and the annotated notebooks referenced along the way before trying to use its content on your own. Don't forget to move to the main folder and run pip install -r Requirements.txt to be able to run the scripts.

File hierarchy

Design of the Ontology

First of all, we designed an ontology that our knowledge graph has to respect. We made sure to link it to existing Polifonia modules (mostly Core and Music Meta) whenever the information we wanted to represent was already present in them.

🔗 Ontology URI https://w3id.org/polifonia/ontology/music-lyrics/

Competency Questions

The structure of our ontology was developed based on the list of competency questions provided in the table below. These questions can be run against the Knowledge Graph through this notebook; a minimal query sketch follows the table.

| ID | Competency question | Reference |
| --- | --- | --- |
| 1 | In what language is a song written? | Core |
| 2 | Which different languages do the lyrics in our Knowledge Graph use? | Core |
| 3 | Who is the primary artist involved in the creative process of a song? | Music Meta |
| 4 | Who are all the different artists involved in writing the lyrics? | Music Meta |
| 5 | What are the different sections of a song? | Music Lyrics |
| 6 | What are the lyrics associated with a specific section of a song? | Music Lyrics |
| 7 | What are the starting and ending timestamps of a specific lyrics line? | Music Lyrics |
| 8 | What is the duration of a lyrics line? | Music Lyrics |
| 9 | What is the duration of a song section? | Music Lyrics |
| 10 | Which singers are in charge of singing a specific song section? | Music Lyrics |
| 11 | From which dataset were the annotations taken? | Music Lyrics |
| 12 | From which dataset was each JSON data file extracted? | Music Lyrics |
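For illustration, here is a minimal sketch of asking one of these questions in SPARQL with rdflib. The class and property names below (ml:Song, core:hasLanguage) are placeholders for illustration, not necessarily the module's actual terms; the notebook linked above runs the real queries.

```python
# Minimal sketch: a competency-question-style SPARQL query with rdflib.
# ml:Song and core:hasLanguage are assumed names; check the ontology module.
from rdflib import Graph

g = Graph()
g.parse("triples/example.ttl", format="turtle")  # hypothetical triples file

# CQ1: "In what language is a song written?"
CQ1 = """
PREFIX ml:   <https://w3id.org/polifonia/ontology/music-lyrics/>
PREFIX core: <https://w3id.org/polifonia/ontology/core/>

SELECT ?song ?language WHERE {
  ?song a ml:Song ;
        core:hasLanguage ?language .
}
"""

for song, language in g.query(CQ1):
    print(song, language)
```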

Reused Ontology Design Patterns

Music Time Interval ODP

In order to model the time dimension of the different annotations, we have reused the Music Time Interval ODP.

Creative Tasks ODP

To distinguish the artists in charge of singing the song (instrumentation) from those in charge of writing its lyrics, we reused the Creative Tasks ODP. In this way, each artist is involved in a different Creative Task.

Corpus ODP

We reused the Corpus ODP from the JAMS Ontology to represent the source the data was taken from. We added some extra information, such as the identifier of the source file and the name of the dataset the file belongs to.

Aligned Ontologies

Imported ontologies

Direct Imports

Indirect Imports

Data preparation

Choice of the datasets

To create a knowledge graph, we first needed some data.
Since we wanted to work with audio-aligned lyrics, we decided after some investigation to use the following two datasets:

  • DALI;
  • DAMP.

These datasets provide lyrics split into fragments (paragraphs, lines, words, or even notes), along with start and end time indexes for each fragment.
We then had to enrich these data with more details, among them the paragraph information. To do so, we used both of these APIs (a minimal fetching sketch follows below):

  • Genius;
  • Spotify.

Here is one of the tutorials that helped us scrape the lyrics from the web using the API: Tutorial on lyrics web scraping
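As a hedged example of the Genius side, the lyricsgenius package wraps the Genius API; the project may instead scrape the web following the linked tutorial. "GENIUS_TOKEN" stands for your own API token.

```python
# Minimal sketch of fetching lyrics through the Genius API with lyricsgenius.
import lyricsgenius

genius = lyricsgenius.Genius("GENIUS_TOKEN")
song = genius.search_song("Bohemian Rhapsody", "Queen")  # title, artist
if song is not None:
    # Genius lyrics include section headers such as [Verse 1] or [Chorus],
    # which is exactly the paragraph information missing from DALI and DAMP.
    print(song.lyrics)
```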

Choice of the output format

Then, we had to converge on the output format. We chose JSON because it is a simple, structured format, and because the datasets we were using were easy to convert to it (the lyrics files were already JSON in DAMP, and DALI provides built-in functions to convert its files to JSON).

Choice of the output content

We designed our output content's structure and fields so that it provides the information we need to create our knowledge graph according to our ontology. It was at this step that we decided not to go below line-level granularity for lyrics: DAMP unpredictably contains either notes or lines, while DALI always contains both, in addition to words and paragraphs!
The thing is, if a song's audio alignment is represented line by line in DAMP, we could perhaps work out how to split the words of each line into syllables, but we would not have the corresponding time tags for each syllable. We therefore thought it better to focus on a slightly larger scale, hence line by line.

You can look at the structure of a converted file here.
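To give a rough idea, a converted file might look like the sketch below; every field name here is a guess for the sake of the example, so refer to the linked file for the actual schema.

```python
# Illustrative shape of a converted file (field names are assumptions).
converted_song = {
    "id": "0001",
    "title": "Song Title",
    "artist": "Artist Name",
    "language": "english",
    "dataset": "DALI",            # or "DAMP"
    "paragraphs": [
        {
            "label": "Verse 1",   # header recovered from Genius
            "lines": [
                {"text": "First line of the verse", "start": 12.3, "end": 15.8},
                {"text": "Second line of the verse", "start": 16.0, "end": 19.2},
            ],
        },
    ],
}
```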

Cleaning of the data

Since we had to match each dataset file's content with the corresponding Genius content to be able to form the paragraphs, many kinds of mistakes could lead to a mismatch. We incrementally improved our system by implementing cleaning functions.
Here is a summary of the different obstacles we had to overcome:

  • Matching lyrics when some are in lower case and others in capital letters.
  • Encoding of characters between different alphabets:
    We had to put aside the Polish (75), Estonian (1), and Croatian (4) songs from the DALI dataset because the encoding was wrong in the source files.
    We also had to put aside the Asian songs from the DALI dataset because they were written in Latin characters while the Genius versions were in Asian characters (so no successful matching was possible).
  • Translation of paragraph headers:
    We made English the standard, which required us to create a translation dictionary.
  • No convention across lyrics platforms for the spelling of interjections or ad-libs:
    We had to use functions matching the longest substring (cf. utils.py, sketched just below) instead of absolute comparisons.
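Here is a minimal sketch of such longest-common-substring matching using the standard-library difflib, in the spirit of the matching functions in utils.py (whose exact logic may differ):

```python
# Fuzzy matching via the longest shared substring, ignoring case.
from difflib import SequenceMatcher

def longest_common_substring(a: str, b: str) -> str:
    """Return the longest substring shared by a and b, ignoring case."""
    a, b = a.lower(), b.lower()
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

# Interjections spelled differently across platforms still overlap enough
# to be matched, where strict string equality would fail:
print(longest_common_substring("Oh-oh-oh, yeah!", "ooh oh oh, yeah"))  # -> "oh, yeah"
```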

Data JSONification

Once we had planned and implemented everything, we were ready to carry out the whole conversion. You can follow the preparatory steps we took in the test_conversion notebook.
The successful results of this preliminary step then led us to use the bulk functions defined in main.py and to perform some post-conversion steps so as to obtain our final dataset folder. You can follow these steps in the real_conversion notebook.

During the conversion, lyrics are carefully encoded and decoded using UTF-8, then cleaned of extra data written by the users and aligned with the lyrics extracted from Genius.
When converting a dataset JSON file to the final output file enriched with Genius's info, we have to be careful when assigning an ID to it: the ID should be unique and prevent duplicates when a song is present in both datasets. We took care of this with an ID text file that is updated and consulted along the way.
We also detect songs that have problems and skip them, listing them in the avoided_file folder along with a skipping reason (a toy sketch of both bookkeeping steps follows the table below). Due to unexpected overwriting problems while converting DAMP, we unfortunately did not get its avoided-songs list, but we did get DALI's.
Theoretically, here are the possible reasons to skip a song (depending on the source dataset):

| DALI | DAMP |
| --- | --- |
| no_paragraphs [on Genius] | no_paragraphs [on Genius] |
| no_language_information [on Genius] | no_language_information [on Genius] |
| not_found_on_Genius | not_found_on_Genius |
| wrongly_encoded_asian_song | notes_encoding_instead_of_lines |
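As promised above, here is a hypothetical sketch of the two bookkeeping steps: keeping IDs unique across DALI and DAMP through a registry text file, and logging skipped songs with their reason. All file names here are illustrative, not the project's actual ones.

```python
# Toy bookkeeping: an ID registry for deduplication, and a skip log.
import os

def register_song(song_key: str, registry_path: str = "ids.txt") -> bool:
    """Record song_key; return False if it was already converted."""
    seen = set()
    if os.path.exists(registry_path):
        with open(registry_path, encoding="utf-8") as f:
            seen = {line.strip() for line in f}
    if song_key in seen:
        return False  # duplicate: already converted from the other dataset
    with open(registry_path, "a", encoding="utf-8") as f:
        f.write(song_key + "\n")
    return True

def skip_song(song_key: str, reason: str, folder: str = "avoided_file") -> None:
    """Append the song and its skipping reason to the avoided-songs list."""
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "avoided_songs.txt"), "a", encoding="utf-8") as f:
        f.write(f"{song_key}\t{reason}\n")

# Example: skip_song("artist-title", "not_found_on_Genius")
```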

Our Dataset

We advise you to take a look at the DALI and DAMP statistics presented in the second part of the real_conversion notebook. Here is a small summary:

| Dataset | Usable converted songs | Avoided songs | Usability |
| --- | --- | --- | --- |
| DALI | 3051 | 1885* | 61.81%*** |
| DAMP | 1115 | 3834* | 29.08%*** |

Knowledge Graph Construction

Once all of our data was standardized into proper JSON files, we developed a SPARQL CONSTRUCT query which allowed us to create the triples following the previously shown ontology. The whole model was built into this SPARQL query, and the variables of each JSON data file were then bound in order to triplify our data. The SPARQL query file can be seen here.

Generated triples can be checked here.
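The sketch below is not the project's actual pipeline (which runs the SPARQL CONSTRUCT query linked above, binding the values of each JSON file), but an equivalent-in-spirit conversion with rdflib; all ml: class and property names, file paths, and field names are assumptions.

```python
# Equivalent-in-spirit triplification of one converted JSON file with rdflib.
import json
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

ML = Namespace("https://w3id.org/polifonia/ontology/music-lyrics/")

g = Graph()
g.bind("ml", ML)

with open("dataset/0001.json", encoding="utf-8") as f:  # hypothetical path
    data = json.load(f)

song = URIRef(ML + f"song/{data['id']}")
g.add((song, RDF.type, ML.Song))

for p, paragraph in enumerate(data["paragraphs"]):  # assumed schema, cf. above
    section = URIRef(ML + f"song/{data['id']}/section/{p}")
    g.add((section, RDF.type, ML.Section))
    g.add((song, ML.hasSection, section))
    for l, line in enumerate(paragraph["lines"]):
        line_uri = URIRef(f"{section}/line/{l}")
        g.add((line_uri, RDF.type, ML.LyricsLine))
        g.add((section, ML.hasLine, line_uri))
        g.add((line_uri, ML.text, Literal(line["text"])))
        g.add((line_uri, ML.startTime, Literal(line["start"], datatype=XSD.decimal)))
        g.add((line_uri, ML.endTime, Literal(line["end"], datatype=XSD.decimal)))

g.serialize("0001.ttl", format="turtle")
```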

SPARQL endpoint

The SPARQL endpoint can be built locally through a Fuseki server. To do so, follow the steps below:

  1. First, create a new dataset by clicking on new dataset. It will take some time, so we suggest storing it in a persistent format (≈ 15 GB).
  2. Then, click on add data.
  3. Select the .ttl files in the triples folder and upload them all.

Once the process is done, you can query the Knowledge Graph at http://localhost:3030/#/dataset/<dataset_name>/query.
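That URL is Fuseki's web UI; programs can instead hit the dataset's raw query service at http://localhost:3030/<dataset_name>/query, for example from Python with the SPARQLWrapper package (replace <dataset_name> with the dataset you created above):

```python
# Counting the triples in the local Fuseki dataset via its query service.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:3030/<dataset_name>/query")
sparql.setQuery("SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
print(results["results"]["bindings"][0]["triples"]["value"])
```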

Keep in mind that you need Java 17 or later to run the version of Fuseki included in this repository. There shouldn't be any problems if you download an older version of the Fuseki server, but we haven't tested them.

Conclusion

We are very happy to have been able to concretely apply what we studied in class to such an ambitious project, despite the short time we had to complete it.
The aim of this project is to provide an entry point for extending the Polifonia ontology with audio-aligned data, so there is still much to be done.
Here are some axes for improvement:

  • Improve the cleaning in order to get more songs from the DAMP dataset;
  • Perform the entity linking on the Polifonia knowledge graph;
  • Enhance the granularity to syllable level.
