-
Notifications
You must be signed in to change notification settings - Fork 405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New DublinCore-based metadata implementation #1266
New DublinCore-based metadata implementation #1266
Conversation
I have deleted this very obsolete description of my approach. |
This PR now has proposed modifications to xmlToM21.py and m21ToXml.py to use the new Metadata APIs. Anything unrecognized goes into the miscellaneous tag, with the key containing both namespace and key, so if a writer understands one of those namespaces, it can interpret that data with no loss. |
Another deleted, updated-but-still-obsolete description of my approach. |
No more class hierarchy, it was not working for me the way I wanted. class Metadata has all the old APIs, which are re-implemented, but fully backward compatible, and all the new APIs, which should be used instead of the old ones. The MusicXML parser and writer can run using either (chosen at run-time). I'll remove the old code when I have other good backward compatibility tests written. |
Next step: get the current set of tests (all backward compatibility-based) succeeding. Then write a bunch of new tests for the new APIs, and some new tests for the old APIs as well. |
This is ready for review (I am working on writing a bunch of new tests to increase coverage significantly). |
9d50355
to
f660326
Compare
This is now ready for review. I have attached a text file containing a description of the nature of the changes here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! I just did a once over and it looks really powerful and a good step in the right direction. My main concern is that I'd like the standard API to look more like Python and less like Java with getItem/setItem etc. Is there a way forward with that?
Some smaller changes asking for more diverse examples and attention to docs generation also.
This isn't a full review -- just what I could do on a long plane flight.
894ec21
to
881a587
Compare
That latest big push was just a rebase to latest master, and a bunch of "t." additions to my new code to match the new "import typing as t" thing. |
a8b2132
to
46177eb
Compare
thanks for the amazing work on this -- I'm liking what I'm seeing and I'm getting more and more confident this will be a major feature in v.8 |
Looking a little more, I thought we were leaning more towards I'm still digging through the roots, but why can't we use |
The two most Pythonic ways to add or remove an attribute are |
I switched away from get -> t.Any and getAll -> t.Tuple today (yesterday?), changing them to getFirst -> t.Any and get -> t.Tuple, in order to make it clear that clients should generally be prepared to deal with multiple values for any given item. getFirst should only be called if you are in a situation where only one can be used (e.g. exporting to musicxml where the “first” title goes in as the “one true title", and the rest go in miscellaneous a bit later; take a look at my version of m21ToXml.py to see how I do this).
set() and add() have been modified very recently as well, to take either a single item or a list of items (setAll and especially addAll sounded weird to me).
I’m attracted to the idea of using setattr/getattr and __getitem__ (or whatever it is) to make it possible to do things like:
item = md.librettist
items= md[‘librettist']
items = md[‘marcrel:LBT’] # that’s ‘librettist’ as well, in ’namespace:name’ form
md.librettist = item
md[‘librettist’] = items
md[‘marcrel:LBT’] = items
But it’s a little weird differentiating plural vs singular getters that way. Dict-style keying doesn’t feel obviously more plural. And how do you do an add?
I could, on the other hand, have customized plural forms for each unique name (but not for the namespace:names, which could always return a tuple, I guess). I would just add a pluralUniqueName field to each PropertyDescription.
And the customized plurals can get weird. ‘librettists' is pretty straightforward, but what’s the plural of ‘rights’? ‘rightses’? Maybe ‘rightsList’ or ‘multipleRights’, or something, as a fallback for difficult terms. You will notice that musicxml actually supports multiple rightses, and I have changed the musicxml converter to take advantage of that.
And we still have to have an add() call, so the properties don’t feel great as a complete solution.
As an aside, there are also compatibility issues with adding all the new metadata terms as properties. md.librettist (being an existing property, not a new one) currently returns str. And it would need to change to return Text (to match most of the new properties), which breaks backward compatibility. Is that OK? If so, I can remove a lot of weird code that tries very hard to keep all those old properties behaving exactly as before.
Further thoughts? I’ve been mulling this issue for a while, and ended up just leaving the old properties as is, and not trying to add new ones (or plural new ones). It just got too complicated, so I’m currently stuck with “you really need to be using get/set/add, and occasionally getFirst”. I was even considering deprecating the old singular properties, since singular isn’t really what anyone should usually be doing. (But md.librettist sure does sound like it returns only one librettist.)
For the record, I like the feel of get/set/add; there is other code in music21 that does this (e.g. for XML elements), and dict itself has a get API.
Also for the record, I am motivated by a desire to make it obvious in the API that asking for a single librettist (or alternativeTitle, or whatever) is not generally a good idea.
Greg
… On May 25, 2022, at 1:47 PM, Michael Scott Asato Cuthbert ***@***.***> wrote:
The two most Pythonic ways to add or remove an attribute are md.librettist = ... or md['librettist'] = .... Maybe we can use both, the .librettist form for getting a string and the ['librettist'] form for getting/setting a tuple of objects?
—
Reply to this email directly, view it on GitHub <#1266 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AR6X47F44WMYLBDSSE3D7GDVL2GVRANCNFSM5S2Y34GA>.
You are receiving this because you authored the thread.
|
It's not about plural vs. singular, it's about returning a single Python primitive (generally a string) with We could also do a little Django-style magic and let What I would very much like is that most people can continue using metadata, even expanded metadata, as something like: In [3]: corpus.parse('mozart').metadata.composer but they might get: In [3]: corpus.parse('mozart').metadata['composer'] in case they want to work with something deeper. If there is more than one composer, I would be okay with something like: In [3]: corpus.parse('mozart').metadata.composer to signify to look further? |
This is appealing. Let me think a bit on it. |
I did not understand this at first, and I do now. I like this approach a lot.
Yeah, I agree that's overkill.
Yes.
Yes.
Yes. I think we still need further discussion on similar shortcuts for set(). I think add() is the only way to add. |
Oh! Are you proposing that |
Latest push is new implementation of |
As of the latest push, |
Oh, and all the parsers have been updated to use add() as much as makes sense for that file format, and the writers are all dealing with multiples as much as makes sense for that file format. |
I actually meant that |
OK, that big push was the rebase to latest master. On to uniqueName keys in md._contents. |
The test failure is unrelated to my changes: |
ack -- the push of all previous commits has wiped out Github's memory of what has already been reviewed, what has changed since last review, etc. -- can you point me to the place you've made the most changes since yesterday so I can look at the diffs manually. I was already planning on doing a Squash Merge when it's ready, but definitely will now. :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Figured out that all changes were in the last commit. Whew!
Looking good. I'm still trying to grok when custom attributes are included and when they're excluded in various routines; I think I've found a few places where they're not included. It'd be great to have custom attributes set in more of the tests to confirm that they're being searched/returned by various .all()
etc. routines.
Besides the namespace name differences (TLAs), I'm confused about the difference between .all() and .getAllNamedValues() -- when should a user use the first and when the latter? If the namespace:TLA is the main difference, then the method name doesn't do a good job of distinguishing them (and couldn't that be an attribute on .all()?) THANKS!
@@ -1488,7 +1488,7 @@ def testDateSelection(self): | |||
|
|||
|
|||
ValueType = t.Union[DateSingle, DateRelative, DateBetween, DateSelection, | |||
Text, Contributor, Copyright] | |||
Text, Contributor, Copyright, int] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is "int" included in this? It doesn't seem to make sense as part of this definition. In the few places where an int can be returned would it be better to type that as t.Union[ValueType, int]
-- otherwise you won't be able to do x: ValueType = function(); print(x._data)
because ints don't have a ._data attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ValueType is "the types of the values that might be found in self._contents". I didn't realize it also needed to mean "types with _data". I found ValueType really useful (instead of using t.Any), and I'm not sure using t.Union[ValueType, int] everywhere will actually solve your problem (if function returns t.Union[ValueType, int], you still can't print(x._data). Perhaps we shouldn't use int in _contents at all, and have a new type with an int x._data field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and int is in there because of fileNumber.
music21/metadata/properties.py
Outdated
UNIQUE_NAME_TO_MUSIC21_WORK_ID: t.Dict[str, str] = { | ||
x.uniqueName if x.uniqueName | ||
else x.name: | ||
x.oldMusic21WorkId if x.oldMusic21WorkId | ||
else '' | ||
for x in STANDARD_PROPERTY_DESCRIPTIONS | ||
if x.oldMusic21WorkId | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please rewrite as a standard loop. I can't figure out what if
goes to what. Only singly-nested generator functions appear (or should appear) in music21 code and this is triply nested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aw man, I was so proud of writing that crazy thing. I can't read it either, though, so yeah I'll redo those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was indeed remarkably impressive. the part that made me say, "nah, as amazing as this is, it's gotta go" was seeing else x.name:
with a line break after it, and I was like, "wait a second, else can't take an expression after it, he's thinking of elif..." before I realized that the :
was the separator of the key:value
of the dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
# Metadata.all to here, since it is clearly MusicXML-specific, and I don't | ||
# want to corrupt the actual metadata in other code paths/converters. Perhaps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:-) +1!
f'Analyst: {self.analyst}', | ||
f'Proofreader: {self.proofreader}', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you!
if not self._isStandardUniqueName(uniqueName): | ||
# custom metadata, don't search it | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this be found earlier in the self. _getPluralAttribute(field)
line? We would like to be able to search customMetadata if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, custom metadata is currently not searchable. If you want to search custom metadata there's a bunch of work needed in the search algorithm, because in some (most? all?) cases it will only report the first matching piece of metadata. So a custom piece of metadata that has "title" somewhere in its custom name will mess your expected results up badly. I'm not sure I want to bite that off just now. It would be a pretty big behavior change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. That can be done later, I suppose, since it's not a feature in the current system apparently (I thought it was, but just checked). I'll put that as a followup PR. Your code is so clear that I'm pretty sure I know how to do it pretty easily.
music21/metadata/__init__.py
Outdated
@@ -1809,23 +1768,6 @@ def _getPluralAttribute(self, attributeName: str) -> t.Tuple[str, ...]: | |||
properties.MUSIC21_ABBREVIATION_TO_NAMESPACE_NAME[attributeName] | |||
) | |||
|
|||
# The following are in searchAttributes, and getattr will find them because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a comment about the above lines (from prior commits). Couldn't this, faster and more legibly be:
if attributeName in self._contents:
...do something with self._contents[attributeName]
try:
attributeName = ...standardize attribute name
... do something with self._contents[attributeName]
except (KeyError, AttributeError) as ae:
raise AttributeError(f'invalid attributeName: {attributeName}')
that way we get customElement searching in here as well, and short circuit the expensive standardization process when the correctly spelled name exists and the metadata exists.
BTW -- What is the return result for a correctly spelled standard attributeName but without information/entry in md._contents? ('',)
, None
, or AttributeError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's None for singular attributes, and an empty tuple for plural ones.
It's interesting to expose custom names as attributes, although it would only work for those custom names that are valid Python attribute names. For example, custom names might be things like 'myNamespace:blah', and that would never end up in an attribute name.
Also, I think the side-effect of adding (some) custom element searching is not something we want just yet (from a different discussion).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AHH!!!! This is for attributes not for getItem -- never mind. You're right, customElements shouldn't get a .customName
interface.
music21/metadata/__init__.py
Outdated
>>> rmd.all(skipContributors=True) | ||
(('ambitus', | ||
AmbitusShort(semitones=48, diatonic='P1', pitchLowest='C2', pitchHighest='C6')), | ||
('copyright', '© 2014, Creative Commons License (CC-BY)'), | ||
('dateCreated', '1689/--/-- or earlier'), | ||
('fileFormat', 'musicxml'), | ||
('filePath', '...corpus/corelli/opus3no1/1grave.xml'), | ||
('keySignatureFirst', -1), | ||
('keySignatures', [-1]), | ||
('localeOfComposition', 'Rome'), | ||
('movementName', | ||
'Sonata da Chiesa, No. I (opus 3, no. 1)'), | ||
('noteCount', 259), | ||
('numberOfParts', 3), | ||
('pitchHighest', 'C6'), | ||
('pitchLowest', 'C2'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all this doesn't need to be repeated just to show one thing has changed. use ...
to skip identical lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
…t namespaceName (or custom name).
I'm going to pause to let you finish reviewing everything including my (just pushed) change to make self._contents keyed by uniqueName instead of namespaceName. That simplified things quite a bit. |
Done -- it's AWESOME! Is the method to translate a uniqueName to namespaceName (and vice-versa) public or private? I suspect that people will want to do this, depending on their needs. (and if we eventually get in an attribute to |
failing test is my fault -- will fix on master and then rerun. |
Metadata.uniqueNameToNamespaceName and Metadata.namespaceNameToUniqueName are both public. |
Greg -- after the force push, Github seems to be treating pr/1266 and gregchapman-dev:gregc/metadataDublinCore as two different branches, so the tests are failing right now against master because of a boneheaded direct push I did yesterday. When you make your next edits, can you make sure that you can merge master into this first so that we can see if the tests all run. Thanks! |
Until (very very) recently, |
I take that back about So I think we just need a new name for |
though would it be hard to make If you go this route, we'll probably need an |
Typing is hard here, because RichMetadata overrides this, and returns some randomly typed stuff. And you can't have the RichMetadata version typed differently than the Metadata version. |
I like the @overload thing though; I didn't know about that. I'll see what I can do. |
Yeah, I think that I'm going to make RichMetadata also return primitives of some sort in a future PR, hopefully before v8, so that everything in metadata is some sort of primitive. I'll probably also make them all inherit from a base class like MetadataPrimitive or something like that. When music21 first started, additional class hierarchies really slowed down Python -- deciding to put in GeneralNote and NotRest between Music21Object and Note was really agonizing because of the speed difference. Now it's basically negligible:
less negligible if there's an init and super() call though, but still in the ns range:
|
1. Delete some duplicative test output. 2. Use explicit loops to initialize dictionaries in properties.py. 3. Remove getAllNamedValues() and getAllContributorNamedValues(), replacing them with new options on all(): skipNonContributors, returnPrimitives, and returnSorted. 4. Tweak titleSummary return code in bestTitle. 5. (not from review) Remove dead code left over from a bad merge, in hopes it will improve coverage enough.
I think this last push addresses everything. |
If you want to change |
Oh, and I left the difficult type-hinting on |
A thought about namespace names in MusicXML
That way we are documenting what the namespaces mean, which makes the TLAs less horrible. You can look them up. And it's both human-helpful and other-people's-software-helpful. @ahankinson what do you think? |
The "correct" XML way to do it is to use the
Any child element of an element with an There are only two issues with this:
Here are a few references: https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms754539(v=vs.85) |
@ahankinson I believe what you are saying is that I would put three If that's what you meant, I think I should make the MusicXML writer do it. (But I'll try it by hand first, to make sure the MusicXML schema is OK with it.) @mscuthbert do you agree? |
yeah, I think I'll flip it to 'onlyContributors' but after this PR is merged. |
When I've tried to put other namespaces into MusicXML files in the past, I ran into major problems with external readers and with Python parsing as well. But it was a long time ago, maybe things have gotten better. In any case, that can be a following project/PR. I'm just running some speed tests and if they pass, will merge. |
Speed tests:
Actual performance might be even better, since I ran the prior tests plugged in but I'm on battery now. Looks good to merge. Thank you for the huge work! |
I'll open up an issue where we can continue to have conversations about some of the important things that Andrew brings up, and look at smoothing out some rough edges before we need to commit to all aspects of the public API |
A first cut so you can see what I'm up to, and I can get some idea from you folks if I'm headed in a good direction.
Fixes #1262.