end-of-day commig of dezrann.py

collab with Louis Couturier
johentsch · Feb 22, 2023 · 667c2a1 · 667c2a1
1 parent 6c8f6dc
commit 667c2a1
Showing 1 changed file with 243 additions and 0 deletions.
diff --git a/src/ms3/dezrann.py b/src/ms3/dezrann.py
@@ -0,0 +1,243 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+""" 
+DCML to Dezrann
+===============
+
+Script to convert contiguous score annotations from a tabular format (one line per label) into
+the JSON-LD format .dez, used by the Dezrann annotation tool developed at the Algomus group.
+
+# Intro
+
+The script presents a first application of what is to become a formal standard of a "measure map";
+see first discussion points at 
+
+* https://gitlab.com/algomus.fr/dezrann/dezrann/-/issues/1030#note_1122509147)
+* https://github.com/MarkGotham/bar-measure/
+
+As an early proxy of a measure map, the current version uses the measure tables that each
+DCML corpus provides in its `measures` folder. This is beneficial in the current context because:
+
+1. The files are required for correct, actionable quarter-note positions without having to re-parse 
+  the entire score.
+2. The files play an essential role for validating the conversion output.
+3. They help avoiding the confusion that necessarily arises when several addressing schemes are 
+  at play.
+
+In detail:
+
+## 1. Quarterbeats
+
+From a technical perspective, offsets in the sense of "distance from the origin" represent the
+primary mechanism of referencing positions in a text (character counts being the default in NLP).
+Music scores are typically aligned with a time line of "musical time", an alignment which is 
+frequently expressed as float values representing an event's distance from the score's beginning, 
+measured in quarter notes, here referred to as quarterbeats. The fundamental problem, however, is
+ensuring that quarterbeat positions refer to the same time line. The commonplace
+score encoding formats do not indicate quarterbeat positions. Instead, they structure
+musical time in a sequence of containers, generally called "measures", each of which represents 
+a time line starting from 0. Counting measure units (of some kind) therefore represents the second 
+prevalent way of indicating positions in a score, together with an event onset indicating an
+event's distance from the container's beginning. To avoid terminological confusion, we call 
+the distance from the beginning of a measure container "onset".
+
+Looking at a single score, there is an unambiguous mapping between the two types of positions:
+`event_offset = measure_offset + event_onset`. Problems arise, however when information from one
+score is to be set into relation with timed information from another source. This is a wide-spread
+problem in the context of music research and musical corpus studies where data from different
+sources with different ways of expressing timestamps frequently needs to be aligned, often in 
+absence of the original score that one of the source is aligned to. Currently, there is no
+standardized way of storing such alignments for later re-use. Hence the idea of a central
+mapping file for storing alignments between positions given as quarterbeats, measure+onset,
+recording timestamps in seconds, IDs, and other data relevant for score addressability.
+
+**Different types of quarterbeats**
+
+All TSV files issued by the DCML come with the column `quarterbeats` indicating every event's
+offset from the score's beginning (position 0). With the caveat that, in the case of first/second endings 
+("voltas"), the indicated values do not take into account any but the second ending, with the
+rationale that they should represent the temporal proportion of a single playthrough without any
+repetitions. For correct conversion, therefore, using a strict, measuring-stick-based variant
+of `quarterbeats` will probably be useful. This means that the default `quarterbeats` should be 
+ignored (unless first endings are to be categorically excluded) in favour of a
+`quarterbeats_all_endings` column. Since the DCML measure maps already come with columns of both 
+names, the simple formula mentioned above `quarterbeats = quarterbeats(measure) + event_onset` 
+has its analogue `quarterbeats_all_measures = quarterbeats_all_measures(measure) + event_onset`.
+
+Input: DataFrame containing DCML harmony labels as output via the command `ms3 extract -X` 
+(X for 'expanded'), stored by default in a folder called 'harmonies'. Using these TSV files
+ensures using only valid DCML labels but in principle this script can be used for converting
+labels of all kinds as long as they come in the specified tabular format.
+
+## 2. Validating the output
+
+Going from a `DcmlLabel` dictionary to a `DezrannLabel` dictionary is straightforward because
+they exchange positions as quarterbeats. Validation, on the other hand, requires relating
+the output .dez format with the converted score which it is layed over in Dezrann. In the 
+interface, positions are shown to the user in terms of `measure_count + event_onset`. Extracting
+this information and comparing it to the one in the original TSVs will 
+
+Columns:
+
+* `mc`: measure count (XML measures, always starting from 1)
+* 
+
+
+
+
+Output:
+JSON Dezrann file (.dez) containing all the harmony labels, aligned with the score. 
+Here is an example of Dezrann file structure:
+'''
+{
+  "labels": [
+    {"type": "Harmony", "start": 0, "duration": 4, "line": "top.3", "tag": "I{"},
+    {"type": "Harmony", "start": 4, "duration": 4, "line": "top.3", "tag": "V(64)"},
+    {"type": "Harmony", "start": 8, "duration": 4, "line": "top.3", "tag": "V}"},
+    ...
+}
+'''
+"""
+
+import json
+import os
+from typing import Dict, List, TypedDict, Any, Union
+
+from fractions import Fraction
+import pandas as pd
+
+
+
+def safe_frac(s: str) -> Union[Fraction, str]:
+    try:
+        return Fraction(s)
+    except Exception:
+        return s
+
+class DezrannLabel(TypedDict):
+    type: str #= "Harmony" # Default value ?
+    start: float
+    duration: float
+    line: str #= "top.3" #Literal?
+    tag: str
+
+class DezrannDict(TypedDict):
+    labels: List[DezrannLabel]
+    meta: Dict
+
+class DcmlLabel(TypedDict):
+    quarterbeats: float
+    duration: float
+    label: str
+
+
+def transform_df(labels: pd.DataFrame, 
+                measures: pd.DataFrame, 
+                label_column: str = 'label') -> List[DcmlLabel]:
+    """
+
+    Parameters
+    ----------
+    labels:
+        Dataframe as found in the 'harmonies'  folder of a DCML corpus. Needs to have columns with
+        the correct dtypes {'mc': int, 'mc_onset': fractions.Fraction} and no missing values.
+    measures:
+        Dataframe as found in the 'measures' folder of a DCML corpus. Requires the columns
+        {'mc': int, 'quarterbeats_all_endings': fractions.Fraction}
+    label_column: str, optional
+        The column that is to be used as label string. Defaults to 'label'.
+
+    Returns
+    -------
+        List of dictionaries where each represents one row of the input labels.
+    """
+    offset_dict = measures.set_index("mc")["quarterbeats_all_endings"]
+    quarterbeats = labels['mc'].map(offset_dict)
+    quarterbeats = quarterbeats.astype('float') + (labels.mc_onset * 4.0)
+    transformed_df = pd.concat([quarterbeats.rename('quarterbeats'), labels.duration_qb.rename('duration'), labels[label_column].rename('label')], axis=1)
+    return transformed_df.to_dict(orient='records')
+
+def make_dezrann_label(quarterbeats: float, duration: float, label: str) -> DezrannLabel:
+    return DezrannLabel(type="Harmony", start=quarterbeats, duration=duration, line="top.3", tag=label)
+
+def convert_dcml_list_to_dezrann_list(values_dict: List[DcmlLabel]) -> List[DezrannDict]:
+    label_list = []
+    for e in values_dict:
+        label_list.append(
+            make_dezrann_label(
+                quarterbeats=e["quarterbeats"],
+                duration=e["duration"],
+                label=e["label"]
+            )
+        )
+    return DezrannDict(labels=label_list, meta={"layout": []})
+
+
+def generate_dez(path_measures, path_labels, output_path="labels.dez"): # need paths for harmony.TSV + paths for measures.TSV
+    """
+    path_measures : :obj:`str`
+        Path to a TSV file as output by format_data().
+    path_labels : :obj:`str`
+        Path to a TSV file as output by format_data().
+    output_labels : :obj:`str`
+        Path to a TSV file as output by format_data().
+    """
+    harmonies = pd.read_csv(
+        path_labels, sep='\t',
+        usecols=['mc', 'mc_onset', 'duration_qb', 'label'], #'chord'
+        converters={'mc_onset': safe_frac}
+    )
+    measures = pd.read_csv(
+        path_measures, sep='\t',
+        usecols=['mc', 'quarterbeats_all_endings'],
+        converters={'quarterbeats_all_endings': safe_frac}
+    )
+    dcml_labels = transform_df(labels=harmonies, measures=measures)
+    dezrann_content = convert_dcml_list_to_dezrann_list(dcml_labels)
+
+    # Manual post-processing  #TODO: improve these cases
+    # 1) Avoid NaN values in "duration" (happens in second endings)
+    # optional : in the transform_df : transformed_df = transformed_df.replace('NaN', 0) ?
+    for label in dezrann_content['labels']:
+        if pd.isnull(label['duration']):
+            print(f"WARNING: NaN duration detected in label {label}.")
+            label['duration'] = 0
+    # 2) Remove "start" value in the first label ?
+    if dezrann_content['labels'][0]['start'] == 0.:
+        del dezrann_content['labels'][0]['start']
+
+    with open(output_path, 'w', encoding='utf-8') as f:
+        json.dump(dezrann_content, f, indent=2)
+
+
+# Test
+MOZART_SONATAS = [
+    'K279-1', 'K279-2', 'K279-3',
+    'K280-1', 'K280-2', 'K280-3',
+    'K283-1', 'K283-2', 'K283-3',
+]
+MEASURE_DIR = os.path.join("src", "ms3") #to be updated
+HARMONY_DIR = os.path.join("src", "ms3") #to be updated
+MEASURE_PATHS = [
+    os.path.join(MEASURE_DIR, f"{movement}_measures.tsv")
+    for movement in MOZART_SONATAS
+]
+HARMONY_PATHS = [
+    os.path.join(HARMONY_DIR, f"{movement}_harmonies.tsv")
+    for movement in MOZART_SONATAS
+]
+
+OUTPUT_DIR = "." #to be updated
+def generate_all_dez(output_dir=OUTPUT_DIR):
+    for i_piece, piece in enumerate(MOZART_SONATAS):
+        generate_dez(MEASURE_PATHS[i_piece], HARMONY_PATHS[i_piece])
+
+
+if __name__ == "__main__":
+    #measures = ms3.load_tsv('src/ms3/K283-2_measures.tsv')
+    #harmonies = ms3.load_tsv('src/ms3/K283-2_harmonies.tsv')
+    #transformed = transform_df(labels=harmonies, measures=measures)
+    #print(transformed)
+
+    dez = generate_dez('src/ms3/K283-2_measures.tsv', 'src/ms3/K283-2_harmonies.tsv')
+    #generate_all_dez()