Create IDs for speeches #13

Open
ninpnin opened this issue Apr 12, 2024 · 25 comments

@ninpnin
Contributor

ninpnin commented Apr 12, 2024

Current options

  • List speeches in the metadata block
  • Wrap speeches in div elements
  • Make each utterance block one per speech
@BobBorges
Contributor

I think wrapping speeches in divs is short-sighted for a "living" resource: we don't know how many things like this will be tagged, or whether everything we might want to tag will fit the hierarchical structure that XML divs require. Metadata blocks would let us tag any feature independently of other tagged features, without creating a bottomless pit of divs.

@MansMeg
Contributor

MansMeg commented Apr 17, 2024

Yes. This is a really good point; it's a more future-proof approach.

@ninpnin
Contributor Author

ninpnin commented Apr 17, 2024

I just realized we could use the n attributes that are available for all elements. From the documentation,

n (number) gives a number (or other label) for an element, which is not necessarily unique within the document.

Then we would just include the ID in all u elements that belong to the speech. E.g., for the following speech with the ID i-AzXa4EUmTu6mz8YQsCpizb:

<note type="speaker" xml:id="i-G36fJpDJFVqwFFQjbknRq2">
  Herr ERIKSSON i Bäeckmora (cp):
</note>
<u xml:id="i-3KxGSd288AdTa9bfy9BtMv" xml:n="i-AzXa4EUmTu6mz8YQsCpizb" next="i-QzTu4nNrn4q8kU1N1u4xZC" who="i-7CXHDen9y2qKcYDisT3zjQ">
  <seg xml:id="i-EV3wMeu3xQ8QuzNwmvWbjM">
    Herr talman! I det som statsrådet Palme sade nu fanns väl egentligen
    [...]
    som en stor del av svenska folket bestämt önskar få ändring i.
  </seg>
</u>
<u xml:id="i-QzTu4nNrn4q8kU1N1u4xZC" xml:n="i-AzXa4EUmTu6mz8YQsCpizb" prev="i-3KxGSd288AdTa9bfy9BtMv" who="i-7CXHDen9y2qKcYDisT3zjQ" next="i-CUrwEDJ9XoTNrw9wfqWrYb">
  <seg xml:id="i-4mZ33Z1km8JtDLieMQPm5Q">
    Jag anser att statsrådet Palme på denna punkt också skulle uppta
    allvar-
  </seg>
</u>
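
To make the proposal concrete, here is a minimal Python sketch of how a consumer could rebuild speeches from the shared label. It is only an illustration: the `<div>` wrapper, the shortened text, and the function name `group_by_speech` are assumptions, and the fragment has no TEI default namespace, unlike a real protocol file.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_NS = "http://www.w3.org/XML/1998/namespace"

# Trimmed copy of the fragment above. The <div> wrapper and the shortened
# text are assumptions, added so that the snippet parses on its own.
FRAGMENT = """
<div>
  <u xml:id="i-3KxGSd288AdTa9bfy9BtMv" xml:n="i-AzXa4EUmTu6mz8YQsCpizb" next="i-QzTu4nNrn4q8kU1N1u4xZC" who="i-7CXHDen9y2qKcYDisT3zjQ">
    <seg xml:id="i-EV3wMeu3xQ8QuzNwmvWbjM">Herr talman!</seg>
  </u>
  <u xml:id="i-QzTu4nNrn4q8kU1N1u4xZC" xml:n="i-AzXa4EUmTu6mz8YQsCpizb" prev="i-3KxGSd288AdTa9bfy9BtMv" who="i-7CXHDen9y2qKcYDisT3zjQ">
    <seg xml:id="i-4mZ33Z1km8JtDLieMQPm5Q">Jag anser att</seg>
  </u>
</div>
"""


def group_by_speech(root):
    """Map each speech label to the xml:id values of its u elements, in document order."""
    speeches = defaultdict(list)
    for u in root.iter("u"):
        speech_id = u.get(f"{{{XML_NS}}}n")
        if speech_id is not None:
            speeches[speech_id].append(u.get(f"{{{XML_NS}}}id"))
    return dict(speeches)


speeches = group_by_speech(ET.fromstring(FRAGMENT))
```

With this layout, finding a speech means scanning every u element in the protocol, since the label lives on the utterances themselves.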

@BobBorges
Contributor

What happens when this same fragment gets tagged as multiple things? Do we have multiple n attribs, or multiple IDs in the n?
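
As background to the first alternative: XML does not allow the same attribute name twice on one element, so "multiple n attribs" would not parse. The short Python check below illustrates this with invented label values (`speech-1`, `debate-1`); packing several labels into one value does parse, but what it means would be a convention that every consumer has to agree on.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The label values are hypothetical and only serve the illustration.
one_label = parses('<u n="speech-1"/>')
two_attributes = parses('<u n="speech-1" n="debate-1"/>')  # repeated attribute name
two_values = parses('<u n="speech-1 debate-1"/>')          # one attribute, two tokens
```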

@ninpnin
Contributor Author

ninpnin commented Apr 18, 2024

Which fragment are you referring to?

@MansMeg
Contributor

MansMeg commented Apr 18, 2024

We should follow the TEI guidelines. n should be used for page number.
https://www.tei-c.org/release/doc/tei-p5-doc/en/html/TS.html#TSBAUT
https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-u.html

@ninpnin
Contributor Author

ninpnin commented Apr 18, 2024

Where does it say that? Here it says

(number) gives a number (or other label) for an element, which is not necessarily unique within the document.

It might be a page number for pb elements, but for u elements I find no such description.

@BobBorges
Contributor

@ninpnin I refer to the fragment you posted as an example. It's tagged as a speech with ID, but down the line it may be tagged with other things... an interpellation debate, or some other type of sectioning that may or may not coincide exactly with the speech itself. So how does the approach you describe handle multiple possible xml:n values?

@ninpnin
Contributor Author

ninpnin commented Apr 18, 2024

@BobBorges Debates are more suited for div-wrapping, along with any non-overlapping sectioning. But if we have other possibly overlapping things, I unfortunately have no solution for that.

@BobBorges
Contributor

It seems like putting these things as element lists in the TEI header would be the most flexible option, and the cleanest when a human has to look at the XML.

@ninpnin
Contributor Author

ninpnin commented Apr 18, 2024

does the schema allow for that?

@MansMeg
Contributor

MansMeg commented Apr 18, 2024

I agree with Bob, that for now the solution "List speeches in the metadata block" sounds like the best one. Would that work with the TEI schema?

I also added a third option: make each speech a single u block, with paragraph breaks within each utterance. It is semantically closer to the TEI schema than our current solution (and the ID of the u block would be the speech ID). Still, I think the first solution is best.
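
For comparison, the third option could be sketched roughly as follows in Python. This is a sketch under assumptions, not a worked-out migration: the short IDs are invented, the fragment has no TEI namespace, and it assumes that chained u elements share a parent and that prev holds a bare ID, as in the example earlier in the thread.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment; the IDs are invented for illustration.
FRAGMENT = """
<div>
  <u xml:id="u1" who="p1" next="u2"><seg xml:id="s1">First paragraph.</seg></u>
  <u xml:id="u2" who="p1" prev="u1"><seg xml:id="s2">Second paragraph.</seg></u>
  <u xml:id="u3" who="p2"><seg xml:id="s3">Another speaker.</seg></u>
</div>
"""


def merge_chained_utterances(parent):
    """Fold every u that has a prev attribute into the first u of its chain."""
    by_id = {u.get(XML_ID): u for u in parent.findall("u")}
    for u in list(parent.findall("u")):
        prev_id = u.get("prev")
        if prev_id is None:
            continue
        head = by_id[prev_id]
        head.extend(list(u))           # move the seg children into the head
        head.attrib.pop("next", None)  # the chain is no longer needed
        parent.remove(u)
        by_id[u.get(XML_ID)] = head    # later links resolve to the head
    return parent


merged = merge_chained_utterances(ET.fromstring(FRAGMENT))
```

After merging, the xml:id of the remaining u doubles as the speech ID, and each seg acts as a paragraph within it. Elements that sit between two chained u elements would end up after the merged speech, which is one reason this needs more thought than the sketch suggests.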

@BobBorges
Contributor

BobBorges commented Apr 18, 2024

There are a couple of options (parlaclarin):

  • <TEI><standOff> contains all kinds of stuff that could be useful here
  • <teiHeader><profileDesc><textDesc> has domain, interaction, purpose
  • <teiHeader> has a <xenoData> elem which takes any kind of metadata in whatever format (dangerously flexible :) )

From the examples given by parlaclarin, standOff looks closest to what we want, but we could also consider something like a listGrp with a type attribute, an ID, and sub-elements that contain a referring ID for the segs we want to label.

@MansMeg
Contributor

MansMeg commented May 8, 2024

  • Open up an issue on how to store speech IDs in ParlaClarin (@ninpnin)
  • Write down a decision on how to do it
  • Add it to the corpus

@ninpnin
Contributor Author

ninpnin commented Aug 9, 2024

Follow at clarin-eric/parla-clarin#25

@BobBorges
Contributor

It seems like the parlaclarin people aren't eager to get involved in this. I propose a workaround (or workwithin?) here:

Under teiHeader/profileDesc/textDesc/constitution, which "describes the internal composition of a text", we can add note elements that contain desc (description), linkGrp (link group), and ptr (pointer) elements. I'm attaching a minimal working example here that passes parlaclarin and TEI validation.

Image

This strategy would allow several nice improvements: (1) speech IDs (in the example, the id attribute of notes with type "speech"); (2) labeling and categorizing the text without cluttering up the body descendants or getting out of hand with recursion depth; (3) labeling and classifying non-hierarchical and discontinuous features of the text (e.g. a debate that was stopped to attend to some other business and taken up later); (4) easier search and extraction of particular types of text (if I'm only interested in speeches from interpellation debates, I parse the header, get the u element IDs from the IP speeches, and .find() them instead of iterating over the entire protocol).

_tei_heder_speech_id.xml.txt
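
Since the attached example is not reproduced in this thread, the following Python sketch shows what the header lookup described in point (4) might look like. Everything in the embedded document is a hypothetical reconstruction from the description above: the IDs, the linkGrp type value, and the exact nesting are assumptions, the required sibling elements of textDesc are omitted, and the snippet has not been validated against TEI or parlaclarin.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document: nesting follows the description in the comment,
# but IDs, type values, and text are invented, and no namespace is declared.
DOC = """
<TEI>
  <teiHeader>
    <profileDesc>
      <textDesc>
        <constitution>
          <note type="speech" xml:id="i-speech1">
            <desc>Speech in an interpellation debate</desc>
            <linkGrp type="speechElements">
              <ptr target="#i-u1"/>
              <ptr target="#i-u2"/>
            </linkGrp>
          </note>
          <note type="speech" xml:id="i-speech2">
            <linkGrp type="speechElements">
              <ptr target="#i-u3"/>
            </linkGrp>
          </note>
        </constitution>
      </textDesc>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <u xml:id="i-u1" who="p1"><seg>First part.</seg></u>
      <u xml:id="i-u2" who="p1"><seg>Second part.</seg></u>
      <u xml:id="i-u3" who="p2"><seg>A reply.</seg></u>
    </body>
  </text>
</TEI>
"""


def speech_pointers(root):
    """Map each speech note ID to the element IDs that its ptr elements target."""
    constitution = root.find("teiHeader/profileDesc/textDesc/constitution")
    speeches = {}
    for note in constitution.findall("note"):
        if note.get("type") == "speech":
            targets = [p.get("target").lstrip("#") for p in note.iter("ptr")]
            speeches[note.get(XML_ID)] = targets
    return speeches


def speech_text(root, speech_id):
    """Join the seg text of every element that the given speech points to."""
    index = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    parts = []
    for target in speech_pointers(root)[speech_id]:
        parts.extend(seg.text.strip() for seg in index[target].iter("seg"))
    return " ".join(parts)


root = ET.fromstring(DOC)
pointers = speech_pointers(root)
text = speech_text(root, "i-speech1")
```

The point of the design is visible in `speech_pointers`: only the header is read to find out which elements make up a speech, and the body is touched only to fetch the elements that were asked for.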

@fredrik1984
Contributor

Ok, I think it is excellent that we get IDs for speeches. @MansMeg and @ninpnin, do you have any thoughts on the technical aspects here?

@ninpnin
Contributor Author

ninpnin commented Oct 14, 2024

Looks good. I would maybe only add the paragraph IDs, so that it doesn't get too cluttered, but that's a minor consideration and a matter of taste.

@BobBorges Should I suggest this to the parlaclarin people?

@BobBorges
Contributor

It passes verification -- both TEI and parlaclarin. So if we're happy with it, I think we don't need to do anything else... or maybe show them the example in case some other people want a speech ID.

@ninpnin
Contributor Author

ninpnin commented Oct 14, 2024

Now we have the notation. Another consideration is, what do we define as a speech?

Do we regenerate indices every time a paragraph is added or deleted? Or perhaps every time the first or the last paragraph is changed? Or just the first one? @BobBorges @MansMeg what do you think?
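
To compare the three options, here is a small Python sketch that derives a speech ID from paragraph IDs under each policy. It is purely hypothetical: nothing in this thread says that speech IDs are or should be hashes of paragraph IDs, and the function name, policy names, and ID format are invented. It only shows which edits would force a new ID under which rule.

```python
import hashlib


def derived_speech_id(paragraph_ids, policy):
    """Derive a speech ID from paragraph IDs under a hypothetical policy.

    paragraph_ids must be non-empty. Policies (all invented for illustration):
    "all" uses every paragraph, "first_and_last" the two ends, "first" the first.
    """
    if policy == "all":
        basis = list(paragraph_ids)
    elif policy == "first_and_last":
        basis = [paragraph_ids[0], paragraph_ids[-1]]
    elif policy == "first":
        basis = [paragraph_ids[0]]
    else:
        raise ValueError(f"unknown policy: {policy}")
    digest = hashlib.sha256("|".join(basis).encode("utf-8")).hexdigest()
    return "i-" + digest[:22]


original = ["p1", "p2", "p3"]
middle_added = ["p1", "p2", "p2b", "p3"]  # a paragraph inserted in the middle
last_removed = ["p1", "p2"]               # the final paragraph dropped

# True where the edit leaves the speech ID unchanged under that policy.
stable = {
    policy: (
        derived_speech_id(original, policy) == derived_speech_id(middle_added, policy),
        derived_speech_id(original, policy) == derived_speech_id(last_removed, policy),
    )
    for policy in ("all", "first_and_last", "first")
}
```

Under "all" any edit changes the ID, under "first_and_last" only edits at the ends do, and under "first" the ID survives both edits shown here.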

@MansMeg
Contributor

MansMeg commented Oct 14, 2024

Oh. I'm not sure I fully understand how this will work. It seems from your example that you also cluster the segments into debates? I think it is better to keep this clear and only have a speech id and then ptr to all the segments that belong to the same speech?

I think the best approach is to record this conclusion as a decision with an example, i.e. file a PR for this to the riksdagen-records repo?

@BobBorges
Contributor

In this particular example, the speeches are part of an interpellation debate, which is also labeled, but the speeches themselves are not contained in the debate label. This is a way to identify sectioning of the protocols without relying on nested structure we would get by 'div'ing everything off while also providing a way to store debate IDs in a way that fits within the parlaclarin schema.

I think it is better to keep this clear and only have a speech id and then ptr to all the segments that belong to the same speech?

We have previously discussed labeling debate types, and in that context talked about non-hierarchical text structures, potentially overlapping sectioning, and discontiguous sectioning that XML doesn't handle well. I agree with you that we want a "speech" with an ID and pointers to the speech elements, which is not dependent on or a child of any other element -- this is exactly how it is here. The debate label is another element that will let us flexibly and iteratively annotate the structures within the text.

  • "speech" elements contain pointers to u and seg elements, only those that belong to the speech – how we define a speech may not be as trivial as the assumption I've been working under; see @ninpnin's comment.
  • other sectioning, debates, addresses, or whatever contains pointers to elements that contain speeches, e.g. in the posted example, the debate note references the debateSection div (pointer to the div), and the speeches (pointer to the note element containing pointers to u and seg elems) that have been identified as belonging to that debate.

If down the line we want to implement some other sectioning that overlaps with the interpellation debate -- let's say it contains speech i-abc123, not i-xyz789, and some other speech in a different debate section div -- nothing prevents us from doing that in this way, and it doesn't require us to make any edits to the XML under <text>.
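
The pointer structure described above can be modelled in a few lines of Python. This is a simplified sketch that works on plain dictionaries rather than on the XML: the speech IDs i-abc123 and i-xyz789 come from the paragraph above, while every other ID (`u1`, `div-1`, `debate-ip-1`, `theme-1`, `i-other1`) is invented.

```python
# Each label maps to the IDs it points at. A target may be a text element
# (u, seg, div) or another label, such as a speech note. IDs other than
# i-abc123 and i-xyz789 are hypothetical.
LABELS = {
    "i-abc123": ["u1", "u2"],                          # speech
    "i-xyz789": ["u3"],                                # speech
    "i-other1": ["u9"],                                # speech in another debate section div
    "debate-ip-1": ["div-1", "i-abc123", "i-xyz789"],  # interpellation debate
    "theme-1": ["i-abc123", "i-other1"],               # overlapping sectioning
}


def resolve(label_id, labels):
    """Expand a label into the text element IDs it ultimately points to.

    Assumes the labels do not reference each other in a cycle.
    """
    leaves = []
    for target in labels[label_id]:
        if target in labels:
            leaves.extend(resolve(target, labels))  # pointer to another label
        else:
            leaves.append(target)                   # pointer into the text
    return leaves


debate = resolve("debate-ip-1", LABELS)
theme = resolve("theme-1", LABELS)
shared = sorted(set(debate) & set(theme))
```

Adding `theme-1` required no change to the speeches or to the debate label, which is the property the comment above argues for.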

I wanted to get a feel for the reactions to this, but I'll move it over to a decision proposal now.

@fredrik1984
Contributor

ok, that sounds good! the important thing for now is to be able to identify and annotate individual speeches. but it is good that the structure also allows for annotating speeches to specific debates.

@BobBorges
Contributor

yes! I think the goal down the line is to ID all "types" of debates and addresses within the protocols as we have started doing with interpellation debates. If we implement a strategy like this, all the action happens in the metadata and we don't have to worry about messing up the text, how to limit recursion in the xml, or any similar problems.

@BobBorges
Contributor

... I will move this to a decision proposal sometime today

4 participants