Skip to content

3. Data Model Issues

rebmizrahi edited this page Apr 8, 2024 · 1 revision

Data Model Issues

General Issues

  • Placing some information in relationships instead of entities
    • i.e. Rage date composed in the COMPOSED relationship as opposed to in the Musical Work entity
    • That is the "best" way to do it in a relational and graph database but it makes it harder to export to RDF
      • Makes us have to reify a relationship
  • Should a work be able to have multiple genres?
  • Two ways of encoding geographical location
  • Hornbostel–Sachs Number for instruments
  • Name of the relationship for the CONTAINS relationship between instrument and symbolic music file
  • Some of the attributes for files can be expressed in a relationship between the file and its encoder, but then we run into the same issue as above
  • Can sources be in more than one language?
  • Where would lyrics go?
  • Editorial notes in a source, collection of sources or both?
  • Representing workflows
  • How to represent and store provenance "chains"?
  • Keeping track of multiple titles and the language of the title
  • Keeping track of different spellings of a person's name
  • Adding lyrics to works
  • Multiple roles for creators
  • Every field must have provenance

The meaning of NULL

Since our data is heterogeneous, we need to have a convention on how to distinguish values that could exist but are missing (i.e. user enters a musical work without a date composed) vs. values that are not applicable (i.e. if we had a key signature field, it would not a apply to a medieval work, or a specific work does not belong to any opus but we still have a field for opus)

This seems to be a bit of an open problem, and even the father of relational databases thinks that "the ISO SQL implementation of Null is the subject of criticism, debate and calls for change. In The Relational Model for Database Management: Version 2, Codd suggested that the SQL implementation of Null was flawed and should be replaced by two distinct Null-type markers. The markers he proposed were to stand for "Missing but Applicable" and "Missing but Inapplicable", known as A-values and I-values, respectively. Codd's recommendation, if accepted, would have required the implementation of a four-valued logic in SQL."

The best solution for me right now would be to have conventions on what NULL is. We could decide that NULL means that the information is applicable but missing, and then have conventions to indicate that the data is not applicable (for instance a blank string for textual fields or a NaN for numeric fields). Or we could do it the other way around and decide the NULL means not applicable and have conventions for values that are applicable but missing. I know that creating our own conventions is not ideal but the distinction is important, and as far as I know there is no standard way of doing it.

Another (possibly horrible) solution would be to have a set of attributes that the entity "should" have, and then a column of JSON or hstore type that we can use to keep key-value pairs for any "extra" data regarding the entity. Since we can pretty much do whatever we want with key-value pairs, such a column can and will look different for different rows. This is messy but it gets around some of the complexity of defining what NULL is. In this solution if we add to the key-value field only the keys and values to represent what we need. For example for we can add religiosity : secular and mode : Dorian to a certain work. However, this is lacks structure and since our data is heterogeneous the set of attributes that every entity "should" have would be relatively small, and we lose some of the control over the schema.

Where did you get this information?

Since we want to keep track of where the information inserted in the database comes from, almost every attribute needs a "meta-attribute" that specifies the origin of this information. A possible solution would be to make those attributes of a composite type, so that we can have a field with two values, the data we need (title, date, etc) and where it came from.

Another possible solution that Andrew suggested is that for every table we have a "sister" provenance table with the same number of columns, but instead of having data in the column we have the provenance information, and those tables are mapped one-to-one (i.e. every row in a data table maps to exactly one row in its sister provenance table and vice-versa)