Create IDs for speeches #13
Comments
I think wrapping speeches in divs is short-sighted for a "living" resource -- we don't know how many things like this will be tagged and whether all potential things to tag will allow the hierarchical structure required by XML divs. Metadata blocks would allow tagging whatever features independently of other tagged features without creating a bottomless pit of divs. |
Yes. This is a really good point, it's a more future-proof approach. |
I just realized we could use the n attributes that are available for all elements. From the documentation,
Then we would just include the ID in all u elements that belong to the speech. E.g., for the following speech with the ID i-AzXa4EUmTu6mz8YQsCpizb:
<note type="speaker" xml:id="i-G36fJpDJFVqwFFQjbknRq2">
Herr ERIKSSON i Bäeckmora (cp):
</note>
<u xml:id="i-3KxGSd288AdTa9bfy9BtMv" xml:n="i-AzXa4EUmTu6mz8YQsCpizb" next="i-QzTu4nNrn4q8kU1N1u4xZC" who="i-7CXHDen9y2qKcYDisT3zjQ">
<seg xml:id="i-EV3wMeu3xQ8QuzNwmvWbjM">
Herr talman! I det som statsrådet Palme sade nu fanns väl egentligen
[...]
som en stor del av svenska folket bestämt önskar få ändring i.
</seg>
</u>
<u xml:id="i-QzTu4nNrn4q8kU1N1u4xZC" xml:n="i-AzXa4EUmTu6mz8YQsCpizb" prev="i-3KxGSd288AdTa9bfy9BtMv" who="i-7CXHDen9y2qKcYDisT3zjQ" next="i-CUrwEDJ9XoTNrw9wfqWrYb">
<seg xml:id="i-4mZ33Z1km8JtDLieMQPm5Q">
Jag anser att statsrådet Palme på denna punkt också skulle uppta
allvar-
</seg>
</u> |
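As a rough illustration of how the proposal above could be consumed, here is a minimal sketch that groups `u` elements by the speech ID stored in their `n` attribute. It reads the attribute as spelled in the example (`xml:n`); the sample document, its short IDs, and the function name `group_utterances_by_speech` are hypothetical and not taken from the corpus.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The "xml" prefix is predefined and always maps to this namespace.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical, shortened sample that follows the shape of the example above.
SAMPLE = """
<div>
  <u xml:id="u1" xml:n="speech-A" next="u2"><seg xml:id="s1">First part.</seg></u>
  <u xml:id="u2" xml:n="speech-A" prev="u1"><seg xml:id="s2">Second part.</seg></u>
  <u xml:id="u3" xml:n="speech-B"><seg xml:id="s3">Another speaker.</seg></u>
</div>
"""


def group_utterances_by_speech(xml_text):
    """Map each speech ID (the n attribute) to the IDs of its u elements, in document order."""
    root = ET.fromstring(xml_text)
    speeches = defaultdict(list)
    for u in root.iter("u"):
        speech_id = u.get(f"{{{XML_NS}}}n")
        if speech_id is not None:
            speeches[speech_id].append(u.get(f"{{{XML_NS}}}id"))
    return dict(speeches)


speeches = group_utterances_by_speech(SAMPLE)
print(speeches)
```

Each `u` carries exactly one speech ID in this sketch, so it has no obvious way to mark the same utterance as belonging to a second, overlapping unit.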
What happens when this same fragment gets tagged as multiple things? Do we have multiple n attribs, or multiple IDs in the n? |
Which fragment are you referring to? |
We should follow the TEI guidelines. n should be used for page number. |
Where does it say that? Here it says
It might be a page number for pb elements, but for u elements I find no such description. |
@ninpnin I refer to the fragment you posted as an example. It's tagged as a speech with ID, but down the line it may be tagged with other things... an interpellation debate, or some other type of sectioning that may or may not coincide exactly with the speech itself. So how does the approach you describe handle multiple possible xml:n values? |
@BobBorges Debates are more suited for div-wrapping, along with any non-overlapping sectioning. But if we have other possibly overlapping things, I unfortunately have no solution for that. |
It seems like putting these things as element lists in the TEI header would be most flexible, and cleanest in the case when a human has to look at the XML. |
does the schema allow for that? |
I agree with Bob that, for now, the solution "List speeches in the metadata block" sounds like the best one. Would that work with the TEI schema? I also added a third option: make each speech a single block, with paragraph breaks within each utterance. It is semantically closer to the TEI schema than how we solve it now (and the id of the u block would be the speech ID). But still, I think the first solution is best. |
There are a couple of options (parlaclarin): From the examples parlaclarin gives, it looks like standOff is the closest to what we want, but we could also consider something like listGrp with a type attribute, an ID, and sub-elements that contain a referring ID for the segs we want to label. |
Follow at clarin-eric/parla-clarin#25 |
It seems like the parlaclarin people aren't eager to get involved in this. I propose a workaround (or workwithin?) here: Under This strategy would allow several nice improvements (1) speech IDs (in the example, it's the id attribute of notes with type "speech") (2) labeling and categorizing the text without cluttering up the |
Looks good. I would maybe only add the paragraph IDs, so that it doesn't get too cluttered, but that's a minor consideration and a matter of taste. @BobBorges Should I suggest this to the parlaclarin people? |
It passes verification -- both TEI and parlaclarin. So if we're happy with it, I think we don't need to do anything else... or maybe show them the example in case some other people want a speech ID. |
Now we have the notation. Another consideration is, what do we define as a speech? Do we regenerate indices every time a paragraph is added or deleted? Or perhaps every time the first or the last paragraph is changed? Or just the first one? @BobBorges @MansMeg what do you think? |
Oh. I'm not sure I understand fully how this will work. It seems from your example that you also cluster the segments into debates? I think it is better to keep this clear and only have a speech ID and then ptr to all the segments that belong to the same speech? I think the best approach is to record this conclusion as a decision with an example, etc., i.e. file a PR for this to the riksdagen-records repo? |
In this particular example, the speeches are part of an interpellation debate, which is also labeled, but the speeches themselves are not contained in the debate label. This is a way to identify sectioning of the protocols without relying on nested structure we would get by 'div'ing everything off while also providing a way to store debate IDs in a way that fits within the parlaclarin schema.
We have previously discussed labeling debate types and in that context talked about non-hierarchical text structures, potentially overlapping sectioning, and discontiguous sectioning that XML doesn't do well. I agree with you that we want a "speech" with an ID and pointers to the speech elements, which is not dependent on or a child of any other element -- this is exactly how it is here. The debate label is another element that will let us flexibly and iteratively annotate the structures within the text.
If down the line we want to implement some other sectioning that overlaps with the interpellation debate -- let's say it contains speech I wanted to get a feel for the reactions to this, but I'll move it over to a decision proposal now. |
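The pointer-based idea described above (a speech with its own ID that points at its segments, and a debate label as a separate, independent element) can be sketched as a small data structure. This is only an illustration under stated assumptions: the `Annotation` class, the `annotations_for` helper, and all IDs and type names below are hypothetical and do not reflect the actual schema or corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A standoff label: its own ID, a type, and pointers to the segments it covers."""
    ann_id: str
    ann_type: str
    targets: tuple


def annotations_for(segment_id, annotations):
    """Return the IDs of all annotations that point at the given segment."""
    return [a.ann_id for a in annotations if segment_id in a.targets]


# Hypothetical labels: two speeches, plus a debate that only partly overlaps them.
ANNOTATIONS = [
    Annotation("speech-1", "speech", ("seg-a", "seg-b")),
    Annotation("speech-2", "speech", ("seg-c",)),
    Annotation("debate-1", "interpellation-debate", ("seg-b", "seg-c")),
]

labels = annotations_for("seg-b", ANNOTATIONS)
print(labels)
```

Because every label lives outside the text and only refers to segment IDs, the debate here can start in the middle of one speech and end in another without any nesting of elements.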
ok, that sounds good! the important thing for now is to be able to identify and annotate individual speeches. but it is good that the structure also allows for annotating speeches to specific debates. |
yes! I think the goal down the line is to ID all "types" of debates and addresses within the protocols as we have started doing with interpellation debates. If we implement a strategy like this, all the action happens in the metadata and we don't have to worry about messing up the text, how to limit recursion in the xml, or any similar problems. |
... I will move this to a decision proposal sometime today |
Current options