Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New DublinCore-based metadata implementation #1266

Merged

Conversation

gregchapman-dev
Copy link
Contributor

@gregchapman-dev gregchapman-dev commented Apr 7, 2022

A first cut so you can see what I'm up to, and I can get some idea from you folks if I'm headed in a good direction.
Fixes #1262.

@gregchapman-dev gregchapman-dev marked this pull request as draft April 7, 2022 22:30
@gregchapman-dev
Copy link
Contributor Author

gregchapman-dev commented Apr 7, 2022

I have deleted this very obsolete description of my approach.

@gregchapman-dev gregchapman-dev mentioned this pull request Apr 7, 2022
@gregchapman-dev
Copy link
Contributor Author

This PR now has proposed modifications to xmlToM21.py and m21ToXml.py to use the new Metadata APIs. Anything unrecognized goes into the miscellaneous tag, with the key containing both namespace and key, so if a writer understands one of those namespaces, it can interpret that data with no loss.

@gregchapman-dev
Copy link
Contributor Author

gregchapman-dev commented Apr 11, 2022

Another deleted, updated-but-still-obsolete description of my approach.

@gregchapman-dev
Copy link
Contributor Author

No more class hierarchy, it was not working for me the way I wanted. class Metadata has all the old APIs, which are re-implemented, but fully backward compatible, and all the new APIs, which should be used instead of the old ones. The MusicXML parser and writer can run using either (chosen at run-time). I'll remove the old code when I have other good backward compatibility tests written.

@gregchapman-dev
Copy link
Contributor Author

Next step: get the current set of tests (all backward compatibility-based) succeeding. Then write a bunch of new tests for the new APIs, and some new tests for the old APIs as well.

@coveralls
Copy link

coveralls commented Apr 19, 2022

Coverage Status

Coverage increased (+0.007%) to 93.09% when pulling 3f1d5ea on gregchapman-dev:gregc/metadataDublinCore into 66cba42 on cuthbertLab:master.

@gregchapman-dev
Copy link
Contributor Author

This is ready for review (I am working on writing a bunch of new tests to increase coverage significantly).

@gregchapman-dev gregchapman-dev marked this pull request as ready for review April 19, 2022 22:54
@gregchapman-dev gregchapman-dev force-pushed the gregc/metadataDublinCore branch from 9d50355 to f660326 Compare April 20, 2022 20:14
@gregchapman-dev
Copy link
Contributor Author

This is now ready for review. I have attached a text file containing a description of the nature of the changes here
description.txt
.

Copy link
Member

@mscuthbert mscuthbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! I just did a once over and it looks really powerful and a good step in the right direction. My main concern is that I'd like the standard API to look more like Python and less like Java with getItem/setItem etc. Is there a way forward with that?

Some smaller changes asking for more diverse examples and attention to docs generation also.

This isn't a full review -- just what I could do on a long plane flight.

music21/metadata/__init__.py Outdated Show resolved Hide resolved
music21/metadata/__init__.py Outdated Show resolved Hide resolved
music21/metadata/__init__.py Outdated Show resolved Hide resolved
music21/metadata/__init__.py Outdated Show resolved Hide resolved
music21/metadata/__init__.py Outdated Show resolved Hide resolved
music21/metadata/__init__.py Outdated Show resolved Hide resolved
music21/metadata/__init__.py Outdated Show resolved Hide resolved
music21/metadata/__init__.py Outdated Show resolved Hide resolved
music21/metadata/testMetadata.py Show resolved Hide resolved
music21/metadata/testMetadata.py Outdated Show resolved Hide resolved
@gregchapman-dev gregchapman-dev force-pushed the gregc/metadataDublinCore branch from 894ec21 to 881a587 Compare May 14, 2022 20:29
@gregchapman-dev
Copy link
Contributor Author

That latest big push was just a rebase to latest master, and a bunch of "t." additions to my new code to match the new "import typing as t" thing.

@gregchapman-dev gregchapman-dev force-pushed the gregc/metadataDublinCore branch from a8b2132 to 46177eb Compare May 23, 2022 00:10
@mscuthbert
Copy link
Member

thanks for the amazing work on this -- I'm liking what I'm seeing and I'm getting more and more confident this will be a major feature in v.8

@mscuthbert
Copy link
Member

Looking a little more, I thought we were leaning more towards .get('composer') returning a single item (like .getFirst does) and .getAll() returning a tuple.

I'm still digging through the roots, but why can't we use __setattr__ and __getattr__ to allow what .get() and .set() do, and suggest using those only for items that aren't valid Python attributes such as name-spaced attributes. (and raise an AttributeError if trying to set something custom, with an error on set: AttributeError: use .setCustom() to add a custom attribute or .set() to add a namespace?

@mscuthbert
Copy link
Member

The two most Pythonic ways to add or remove an attribute are md.librettist = ... or md['librettist'] = .... Maybe we can use both, the .librettist form for getting a string and the ['librettist'] form for getting/setting a tuple of objects?

@gregchapman-dev
Copy link
Contributor Author

gregchapman-dev commented May 25, 2022 via email

@mscuthbert
Copy link
Member

I’m attracted to the idea of using setattr/getattr and getitem (or whatever it is) to make it possible to do things like: item = md.librettist items= md[‘librettist'] items = md[‘marcrel:LBT’] # that’s ‘librettist’ as well, in ’namespace:name’ form md.librettist = item md[‘librettist’] = items md[‘marcrel:LBT’] = items But it’s a little weird differentiating plural vs singular getters that way. Dict-style keying doesn’t feel obviously more plural.

It's not about plural vs. singular, it's about returning a single Python primitive (generally a string) with __getattr__, and returning the underlying representation (generally a tuple of music21.metadata.primatives.Text/Date, etc. objects) with __getitem__.

We could also do a little Django-style magic and let print(md.marcel__LBT) work, where __ substitutes for :. But that might be overkill.

What I would very much like is that most people can continue using metadata, even expanded metadata, as something like:

In [3]: corpus.parse('mozart').metadata.composer
Out[3]: 'W.A. Mozart'

but they might get:

In [3]: corpus.parse('mozart').metadata['composer']
Out[3]: (<music21.metadata.primitives.Text 'W.A. Mozart'>,)

in case they want to work with something deeper.

If there is more than one composer, I would be okay with something like:

In [3]: corpus.parse('mozart').metadata.composer
Out[3]: 'MULTIPLE'

to signify to look further?

@gregchapman-dev
Copy link
Contributor Author

This is appealing. Let me think a bit on it.

@gregchapman-dev
Copy link
Contributor Author

It's not about plural vs. singular, it's about returning a single Python primitive (generally a string) with getattr, and returning the underlying representation (generally a tuple of music21.metadata.primatives.Text/Date, etc. objects) with getitem.

I did not understand this at first, and I do now. I like this approach a lot.

We could also do a little Django-style magic and let print(md.marcel__LBT) work, where __ substitutes for :. But that might be overkill.

Yeah, I agree that's overkill.

In [3]: corpus.parse('mozart').metadata.composer
Out[3]: 'W.A. Mozart'

Yes.

In [3]: corpus.parse('mozart').metadata['composer']
Out[3]: (<music21.metadata.primitives.Text 'W.A. Mozart'>,)

Yes.

In [3]: corpus.parse('mozart').metadata.composer
Out[3]: 'MULTIPLE'
to signify to look further?

Yes.

I think we still need further discussion on similar shortcuts for set(). I think add() is the only way to add.

@gregchapman-dev
Copy link
Contributor Author

Oh! Are you proposing that __getattr__ and __getitem__ would replace get and getFirst? I would actually be OK with that.

@gregchapman-dev
Copy link
Contributor Author

Latest push is new implementation of __getattr__ and __getitem__.

@gregchapman-dev
Copy link
Contributor Author

gregchapman-dev commented May 29, 2022

As of the latest push, get/getFirst/set are replaced by __getattr__, __getitem__, __setattr__, and __setitem__. This also means that you can get/set single items as strings with md.uniqueName (plus the already existing workId and workId abbreviation attributes. add() remains, as the only way to add more items of a particular name. Let me know what you think.

@gregchapman-dev
Copy link
Contributor Author

Oh, and all the parsers have been updated to use add() as much as makes sense for that file format, and the writers are all dealing with multiples as much as makes sense for that file format.

@mscuthbert
Copy link
Member

mscuthbert commented May 30, 2022

Oh! Are you proposing that __getattr__ and __getitem__ would replace get and getFirst? I would actually be OK with that.

I actually meant that __getattr__ and __getitem__ would replace getFirst and get (=getAll) so the dot-form gets the most simple, ready to go version, and the bracket-form gets a representation that someone needs to know about music21 metadata to work with. I'll look at the code soon to see if that's what was implemented.

@gregchapman-dev
Copy link
Contributor Author

OK, that big push was the rebase to latest master. On to uniqueName keys in md._contents.

@gregchapman-dev
Copy link
Contributor Author

The test failure is unrelated to my changes:
FAIL: _rawQuery (configure.SelectFromList)
Doctest: configure.SelectFromList._rawQuery

@mscuthbert
Copy link
Member

ack -- the push of all previous commits has wiped out Github's memory of what has already been reviewed, what has changed since last review, etc. -- can you point me to the place you've made the most changes since yesterday so I can look at the diffs manually. I was already planning on doing a Squash Merge when it's ready, but definitely will now. :-)

Copy link
Member

@mscuthbert mscuthbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figured out that all changes were in the last commit. Whew!

Looking good. I'm still trying to grok when custom attributes are included and when they're excluded in various routines; I think I've found a few places where they're not included. It'd be great to have custom attributes set in more of the tests to confirm that they're being searched/returned by various .all() etc. routines.

Besides the namespace name differences (TLAs), I'm confused about the difference between .all() and .getAllNamedValues() -- when should a user use the first and when the latter? If the namespace:TLA is the main difference, then the method name doesn't do a good job of distinguishing them (and couldn't that be an attribute on .all()?) THANKS!

@@ -1488,7 +1488,7 @@ def testDateSelection(self):


ValueType = t.Union[DateSingle, DateRelative, DateBetween, DateSelection,
Text, Contributor, Copyright]
Text, Contributor, Copyright, int]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is "int" included in this? It doesn't seem to make sense as part of this definition. In the few places where an int can be returned would it be better to type that as t.Union[ValueType, int] -- otherwise you won't be able to do x: ValueType = function(); print(x._data) because ints don't have a ._data attribute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueType is "the types of the values that might be found in self._contents". I didn't realize it also needed to mean "types with _data". I found ValueType really useful (instead of using t.Any), and I'm not sure using t.Union[ValueType, int] everywhere will actually solve your problem (if function returns t.Union[ValueType, int], you still can't print(x._data). Perhaps we shouldn't use int in _contents at all, and have a new type with an int x._data field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and int is in there because of fileNumber.

Comment on lines 1218 to 1225
UNIQUE_NAME_TO_MUSIC21_WORK_ID: t.Dict[str, str] = {
x.uniqueName if x.uniqueName
else x.name:
x.oldMusic21WorkId if x.oldMusic21WorkId
else ''
for x in STANDARD_PROPERTY_DESCRIPTIONS
if x.oldMusic21WorkId
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rewrite as a standard loop. I can't figure out what if goes to what. Only singly-nested generator functions appear (or should appear) in music21 code and this is triply nested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aw man, I was so proud of writing that crazy thing. I can't read it either, though, so yeah I'll redo those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was indeed remarkably impressive. the part that made me say, "nah, as amazing as this is, it's gotta go" was seeing else x.name: with a line break after it, and I was like, "wait a second, else can't take an expression after it, he's thinking of elif..." before I realized that the : was the separator of the key:value of the dictionary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +1303 to +1304
# Metadata.all to here, since it is clearly MusicXML-specific, and I don't
# want to corrupt the actual metadata in other code paths/converters. Perhaps
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:-) +1!

Comment on lines +171 to +172
f'Analyst: {self.analyst}',
f'Proofreader: {self.proofreader}',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you!

Comment on lines +1158 to +1160
if not self._isStandardUniqueName(uniqueName):
# custom metadata, don't search it
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this be found earlier in the self. _getPluralAttribute(field) line? We would like to be able to search customMetadata if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, custom metadata is currently not searchable. If you want to search custom metadata there's a bunch of work needed in the search algorithm, because in some (most? all?) cases it will only report the first matching piece of metadata. So a custom piece of metadata that has "title" somewhere in its custom name will mess your expected results up badly. I'm not sure I want to bite that off just now. It would be a pretty big behavior change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. That can be done later, I suppose, since it's not a feature in the current system apparently (I thought it was, but just checked). I'll put that as a followup PR. Your code is so clear that I'm pretty sure I know how to do it pretty easily.

@@ -1809,23 +1768,6 @@ def _getPluralAttribute(self, attributeName: str) -> t.Tuple[str, ...]:
properties.MUSIC21_ABBREVIATION_TO_NAMESPACE_NAME[attributeName]
)

# The following are in searchAttributes, and getattr will find them because
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a comment about the above lines (from prior commits). Couldn't this, faster and more legibly be:

if attributeName in self._contents:
    ...do something with self._contents[attributeName]

try:
    attributeName = ...standardize attribute name
   ... do something with self._contents[attributeName]
except (KeyError, AttributeError) as ae:
    raise AttributeError(f'invalid attributeName: {attributeName}')

that way we get customElement searching in here as well, and short circuit the expensive standardization process when the correctly spelled name exists and the metadata exists.

BTW -- What is the return result for a correctly spelled standard attributeName but without information/entry in md._contents? ('',), None, or AttributeError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's None for singular attributes, and an empty tuple for plural ones.

It's interesting to expose custom names as attributes, although it would only work for those custom names that are valid Python attribute names. For example, custom names might be things like 'myNamespace:blah', and that would never end up in an attribute name.

Also, I think the side-effect of adding (some) custom element searching is not something we want just yet (from a different discussion).

Copy link
Member

@mscuthbert mscuthbert Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AHH!!!! This is for attributes not for getItem -- never mind. You're right, customElements shouldn't get a .customName interface.

music21/metadata/__init__.py Outdated Show resolved Hide resolved
Comment on lines 2592 to 2607
>>> rmd.all(skipContributors=True)
(('ambitus',
AmbitusShort(semitones=48, diatonic='P1', pitchLowest='C2', pitchHighest='C6')),
('copyright', '© 2014, Creative Commons License (CC-BY)'),
('dateCreated', '1689/--/-- or earlier'),
('fileFormat', 'musicxml'),
('filePath', '...corpus/corelli/opus3no1/1grave.xml'),
('keySignatureFirst', -1),
('keySignatures', [-1]),
('localeOfComposition', 'Rome'),
('movementName',
'Sonata da Chiesa, No. I (opus 3, no. 1)'),
('noteCount', 259),
('numberOfParts', 3),
('pitchHighest', 'C6'),
('pitchLowest', 'C2'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all this doesn't need to be repeated just to show one thing has changed. use ... to skip identical lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

music21/metadata/__init__.py Outdated Show resolved Hide resolved
music21/metadata/__init__.py Outdated Show resolved Hide resolved
@gregchapman-dev
Copy link
Contributor Author

I'm going to pause to let you finish reviewing everything including my (just pushed) change to make self._contents keyed by uniqueName instead of namespaceName. That simplified things quite a bit.

@mscuthbert
Copy link
Member

I'm going to pause to let you finish reviewing everything including my (just pushed) change to make self._contents keyed by uniqueName instead of namespaceName. That simplified things quite a bit.

Done -- it's AWESOME!

Is the method to translate a uniqueName to namespaceName (and vice-versa) public or private? I suspect that people will want to do this, depending on their needs. (and if we eventually get in an attribute to .write(namespaceMetadata=True), we'll want it to be public)

@mscuthbert
Copy link
Member

failing test is my fault -- will fix on master and then rerun.

@gregchapman-dev
Copy link
Contributor Author

Metadata.uniqueNameToNamespaceName and Metadata.namespaceNameToUniqueName are both public.

@mscuthbert
Copy link
Member

Greg -- after the force push, Github seems to be treating pr/1266 and gregchapman-dev:gregc/metadataDublinCore as two different branches, so the tests are failing right now against master because of a boneheaded direct push I did yesterday. When you make your next edits, can you make sure that you can merge master into this first so that we can see if the tests all run. Thanks!

@gregchapman-dev
Copy link
Contributor Author

Until (very very) recently, all returned uniqueNames and getAllNamedValues returned namespaceNames. That is no longer the case, they both return uniqueNames. There may be other slight differences (I was trying to keep all somewhat backward-compatible). I'll take a look to see if I can just have one (probably called all) and make it do one thing.

@gregchapman-dev
Copy link
Contributor Author

I take that back about all. There is still a big remaining difference. all returns all the values as str (except for the RichMetadata stuff, which can be t.Any). getAllNamedValues returns the values as the full ValueType (i.e. Text, Contributor, etc).

So I think we just need a new name for getAllNamedValues. Any suggestions?

@mscuthbert
Copy link
Member

mscuthbert commented Aug 5, 2022

I take that back about all. There is still a big remaining difference. all returns all the values as str (except for the RichMetadata stuff, which can be t.Any). getAllNamedValues returns the values as the full ValueType (i.e. Text, Contributor, etc).

So I think we just need a new name for getAllNamedValues. Any suggestions?

allPrimitives() ?

though would it be hard to make returnPrimitives=True be a keyword option to all()? Otherwise we'll eventually need 4 methods if we also have namespaceNames=True as options.

If you go this route, we'll probably need an @overload operator on the typing (see, for instance, Chord.closedPosition() which gives a different return value based on inPlace=True or False) -- but for now the typing can be t.Dict[str, t.Union[str, ValueType]]

@gregchapman-dev
Copy link
Contributor Author

Typing is hard here, because RichMetadata overrides this, and returns some randomly typed stuff. And you can't have the RichMetadata version typed differently than the Metadata version.

@gregchapman-dev
Copy link
Contributor Author

I like the @overload thing though; I didn't know about that. I'll see what I can do.

@mscuthbert
Copy link
Member

mscuthbert commented Aug 5, 2022

Typing is hard here, because RichMetadata overrides this, and returns some randomly typed stuff. And you can't have the RichMetadata version typed differently than the Metadata version.

Yeah, I think that I'm going to make RichMetadata also return primitives of some sort in a future PR, hopefully before v8, so that everything in metadata is some sort of primitive.

I'll probably also make them all inherit from a base class like MetadataPrimitive or something like that. When music21 first started, additional class hierarchies really slowed down Python -- deciding to put in GeneralNote and NotRest between Music21Object and Note was really agonizing because of the speed difference. Now it's basically negligible:

In [1]: class A:
   ...:     pass

In [2]: class B(A):
   ...:     pass

In [3]: %timeit A()
75.9 ns ± 2.9 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [4]: %timeit B()
78.3 ns ± 2.59 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

less negligible if there's an init and super() call though, but still in the ns range:

In [1]: class A:
   ...:     def __init__(self):
   ...:         pass
 
In [2]: class B(A):
   ...:     def __init__(self):
   ...:         super().__init__()
 
In [3]: %timeit A()
144 ns ± 2.96 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [4]: %timeit B()
415 ns ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

1. Delete some duplicative test output.
2. Use explicit loops to initialize dictionaries in properties.py.
3. Remove getAllNamedValues() and getAllContributorNamedValues(), replacing
   them with new options on all(): skipNonContributors, returnPrimitives, and
   returnSorted.
4. Tweak titleSummary return code in bestTitle.
5. (not from review) Remove dead code left over from a bad merge, in hopes
   it will improve coverage enough.
@gregchapman-dev
Copy link
Contributor Author

I think this last push addresses everything.

@gregchapman-dev
Copy link
Contributor Author

gregchapman-dev commented Aug 5, 2022

If you want to change skipNonContributors to onlyContributors, I'll do it. I couldn't pick between the two, so I flipped a coin. skipNonContributors makes sense next to skipContributors when you're reading the docstring, but when you call all() for contributors, saying skipNonContributors=True makes you scratch your head every time.

@gregchapman-dev
Copy link
Contributor Author

gregchapman-dev commented Aug 5, 2022

Oh, and I left the difficult type-hinting on all() to you. I couldn't figure out how to do it. For now the values returned are t.Any instead of t.Union[str, whatever RichMetadata returns] (if returnPrimitives is False) or t.Union[ValueType, whatever RichMetadata returns] (if returnPrimitives is True).

@gregchapman-dev
Copy link
Contributor Author

gregchapman-dev commented Aug 5, 2022

A thought about namespace names in MusicXML <miscellaneous>: Perhaps in the MusicXML writer I should add a single <miscellaneous-field> for each namespace that we used in the section, as follows:

<miscellaneous-field name=dcterms>https://purl.org/dc/terms/<\miscellaneous-field>
<miscellaneous-field name=marcrel>https://id.loc.gov/vocabulary/relators/<\miscellaneous-field>
<miscellaneous-field name=humdrum>https://www.humdrum.org/reference-records/#<\miscellaneous-field>

That way we are documenting what the namespaces mean, which makes the TLAs less horrible. You can look them up. And it's both human-helpful and other-people's-software-helpful.

@ahankinson what do you think?

@ahankinson
Copy link
Contributor

The "correct" XML way to do it is to use the xmlns attribute:

<root-element xmlns:marcrel="https://id.loc.gov/vocabulary/relators/">
    ...
    <miscellaneous-field>... </miscellaneous-field>
</root-element>

Any child element of an element with an xmlns declaration will inherit the declaration.

There are only two issues with this:

  1. I don't know if all elements within MusicXML accepts the xmlns:... attribute. You probably want to hand-write one in, and then verify it with the MusicXML schema to see whether it passes. Typically you might declare all used namespaces in the document on the document root element.
  2. XML Namespaces are technically only for elements and attributes; they don't really apply for attribute values. So you don't get the automatic namespace expansion facilities in XML processors. But it probably doesn't hurt anything either.

Here are a few references:

https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms754539(v=vs.85)

https://www.w3.org/TR/xml-names/

@gregchapman-dev
Copy link
Contributor Author

@ahankinson I believe what you are saying is that I would put three xmlns declarations in the <miscellaneous> tag (which is the root element for all the <miscellaneous-field> tags). You're saying that this is the standard way to allow elements and attributes (but not attribute values) to use those three prefixes, and various XML tools will know about that. It will be obvious to humans (but the tools will not really know) what it means when we use those prefixes in attribute values (e.g. name=marcrel:XXX). And software that parsed the MusicXML could know about it, too, but by interpreting that standard info in a slightly non-standard way (applying the xmlns info to the attribute values).

If that's what you meant, I think I should make the MusicXML writer do it. (But I'll try it by hand first, to make sure the MusicXML schema is OK with it.)

@mscuthbert do you agree?

@mscuthbert
Copy link
Member

If you want to change skipNonContributors to onlyContributors, I'll do it. I couldn't pick between the two, so I flipped a coin. skipNonContributors makes sense next to skipContributors when you're reading the docstring, but when you call all() for contributors, saying skipNonContributors=True makes you scratch your head every time.

yeah, I think I'll flip it to 'onlyContributors' but after this PR is merged.

@mscuthbert
Copy link
Member

@ahankinson I believe what you are saying is that I would put three xmlns declarations in the <miscellaneous> tag (which is the root element for all the <miscellaneous-field> tags). You're saying that this is the standard way to allow elements and attributes (but not attribute values) to use those three prefixes, and various XML tools will know about that. It will be obvious to humans (but the tools will not really know) what it means when we use those prefixes in attribute values (e.g. name=marcrel:XXX). And software that parsed the MusicXML could know about it, too, but by interpreting that standard info in a slightly non-standard way (applying the xmlns info to the attribute values).

If that's what you meant, I think I should make the MusicXML writer do it. (But I'll try it by hand first, to make sure the MusicXML schema is OK with it.)

@mscuthbert do you agree?

When I've tried to put other namespaces into MusicXML files in the past, I ran into major problems with external readers and with Python parsing as well. But it was a long time ago, maybe things have gotten better. In any case, that can be a following project/PR.

I'm just running some speed tests and if they pass, will merge.

@mscuthbert
Copy link
Member

Speed tests:

>>> %timeit corpus.manager._metadataBundles.clear(); corpus.manager.cacheMetadataBundleFromDisk(corpus.corpora.CoreC
   ...: orpus())
1.33 s ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# master:  1.89s; previous push: 1.62s ✅

>>> md = corpus.parse('luca').metadata
>>> %timeit md.search('credo')
16.1 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# master: 48µs; previous push: 307µs ✅

>>> %timeit corpus.corpora.CoreCorpus().search('458')
571 ms ± 6.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# master: 953ms; prev-push: 7600ms  ✅

Actual performance might be even better, since I ran the prior tests plugged in but I'm on battery now.

Looks good to merge.

Thank you for the huge work!

@mscuthbert mscuthbert merged commit 4d48584 into cuthbertLab:master Aug 5, 2022
@mscuthbert
Copy link
Member

I'll open up an issue where we can continue to have conversations about some of the important things that Andrew brings up, and look at smoothing out some rough edges before we need to commit to all aspects of the public API

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metadata rework
4 participants