Create IDs for speeches #13

ninpnin · 2024-04-12T12:44:18Z

Current options

List speeches in the metadata block
Wrap speeches in div elements
Make each utterance block correspond to exactly one speech

BobBorges · 2024-04-17T12:02:49Z

I think wrapping speeches in divs is short-sighted for a "living" resource -- we don't know how many things like this will be tagged, or whether everything we might want to tag will fit the hierarchical structure that XML divs require. Metadata blocks would let us tag any feature independently of other tagged features, without creating a bottomless pit of divs.

MansMeg · 2024-04-17T13:25:40Z

Yes. This is a really good point, it's a more future-proof approach.

ninpnin · 2024-04-17T13:47:34Z

I just realized we could use the n attributes that are available for all elements. From the documentation,

n (number) gives a number (or other label) for an element, which is not necessarily unique within the document.

Then, we would just include the ID in all u elements that belong to the speech. E.g., for the following speech with the ID i-AzXa4EUmTu6mz8YQsCpizb:

<note type="speaker" xml:id="i-G36fJpDJFVqwFFQjbknRq2">
  Herr ERIKSSON i Bäeckmora (cp):
</note>
<u xml:id="i-3KxGSd288AdTa9bfy9BtMv" xml:n="i-AzXa4EUmTu6mz8YQsCpizb" next="i-QzTu4nNrn4q8kU1N1u4xZC" who="i-7CXHDen9y2qKcYDisT3zjQ">
  <seg xml:id="i-EV3wMeu3xQ8QuzNwmvWbjM">
    Herr talman! I det som statsrådet Palme sade nu fanns väl egentligen
    [...]
    som en stor del av svenska folket bestämt önskar få ändring i.
  </seg>
</u>
<u xml:id="i-QzTu4nNrn4q8kU1N1u4xZC" xml:n="i-AzXa4EUmTu6mz8YQsCpizb" prev="i-3KxGSd288AdTa9bfy9BtMv" who="i-7CXHDen9y2qKcYDisT3zjQ" next="i-CUrwEDJ9XoTNrw9wfqWrYb">
  <seg xml:id="i-4mZ33Z1km8JtDLieMQPm5Q">
    Jag anser att statsrådet Palme på denna punkt också skulle uppta
    allvar-
  </seg>
</u>
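
To make the idea concrete, here is a minimal sketch of how a consumer could collect the u elements of a speech if the shared-label approach were adopted. It assumes the label is written as xml:n exactly as in the fragment above; the `<body>` wrapper, the third u element and its IDs, and the function name `group_by_speech` are hypothetical and only there so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The xml: prefix is bound to this namespace by the XML specification.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Trimmed-down version of the fragment above. The <body> wrapper and the
# third <u> (with made-up IDs) are assumptions added for illustration.
SAMPLE = """
<body>
  <u xml:id="i-3KxGSd288AdTa9bfy9BtMv" xml:n="i-AzXa4EUmTu6mz8YQsCpizb"/>
  <u xml:id="i-QzTu4nNrn4q8kU1N1u4xZC" xml:n="i-AzXa4EUmTu6mz8YQsCpizb"/>
  <u xml:id="i-hypotheticalU" xml:n="i-hypotheticalSpeech"/>
</body>
"""


def group_by_speech(root):
    """Map each speech ID (the shared label) to the IDs of its u elements."""
    speeches = defaultdict(list)
    for u in root.iter("u"):
        speech_id = u.get(f"{{{XML_NS}}}n")
        if speech_id is not None:
            speeches[speech_id].append(u.get(f"{{{XML_NS}}}id"))
    return dict(speeches)


speeches = group_by_speech(ET.fromstring(SAMPLE))
for speech_id, u_ids in speeches.items():
    print(speech_id, u_ids)
```

Note that this only works while each u carries a single label, which is the limitation raised in the next comment.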

BobBorges · 2024-04-17T13:56:45Z

What happens when this same fragment gets tagged as multiple things? Do we have multiple n attribs, or multiple IDs in the n?

ninpnin · 2024-04-18T08:38:48Z

Which fragment are you referring to?

MansMeg · 2024-04-18T08:41:03Z

We should follow the TEI guidelines. n should be used for page number.
https://www.tei-c.org/release/doc/tei-p5-doc/en/html/TS.html#TSBAUT
https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-u.html

ninpnin · 2024-04-18T08:47:09Z

Where does it say that? Here it says

(number) gives a number (or other label) for an element, which is not necessarily unique within the document.

It might be a page number for pb elements, but for u elements I find no such description.

BobBorges · 2024-04-18T08:49:04Z

@ninpnin I refer to the fragment you posted as an example. It's tagged as a speech with ID, but down the line it may be tagged with other things... an interpellation debate, or some other type of sectioning that may or may not coincide exactly with the speech itself. So how does the approach you describe handle multiple possible xml:n values?

ninpnin · 2024-04-18T08:52:59Z

@BobBorges Debates are more suited for div-wrapping, along with any non-overlapping sectioning. But if we have other possibly overlapping things, I unfortunately have no solution for that.

BobBorges · 2024-04-18T09:05:08Z

It seems like putting these things as element lists in the TEI header would be the most flexible option, and the cleanest when a human has to look at the XML.

ninpnin · 2024-04-18T14:08:03Z

does the schema allow for that?

MansMeg · 2024-04-18T14:11:42Z

I agree with Bob that for now the solution "List speeches in the metadata block" sounds like the best one. Would that work with the TEI schema?

I also added a third option: make each speech a single block, with paragraph breaks inside each utterance. It is semantically closer to the TEI schema than how we solve it now (and the ID of the u block would be the speech ID). But still, I think the first solution is best.

BobBorges · 2024-04-18T14:40:51Z

there are a couple of options (parlaclarin):
• <TEI><standOff> contains all kinds of stuff that could be useful here
• <teiHeader><profileDesc><textDesc> has domain, interaction, purpose
• <teiHeader> has a <xenoData> elem which takes any kind of metadata in whatever format (dangerously flexible :) )

From the examples parlaclarin gives, it looks like standOff is the closest to what we want, but we could also consider something like listGrp with a type attrib, an id, and sub elems that contain a referring ID for the segs we want to label.

MansMeg · 2024-05-08T12:50:40Z

Open up an issue on how to store speech ids in ParlaClarin (@ninpnin )
Write down a decision on how to do it
Add it to the corpus

ninpnin · 2024-08-09T08:06:51Z

Follow at clarin-eric/parla-clarin#25

BobBorges · 2024-10-11T10:55:18Z

It seems like the parlaclarin people aren't eager to get involved in this. I propose a workaround (or workwithin?) here:

Under teiHeader/profileDesc/textDesc/constitution, which "describes the internal composition of a text", we can add note elements that contain desc (description), linkGrp (link group), and ptr (pointer) elements. I'm attaching a minimal working example here that passes parlaclarin and TEI validation.

This strategy would allow several nice improvements: (1) speech IDs (in the example, the id attribute of notes with type "speech"); (2) labeling and categorizing the text without cluttering up the body descendants or getting out of hand with recursion depth; (3) labeling and classifying non-hierarchical and discontinuous features of the text (e.g. a debate that was stopped to attend to some other business and taken up later); (4) easier search and extraction of particular types of text (if I'm only interested in speeches from interpellation debates, I parse the header, get the u elem IDs from the IP speeches and .find() them instead of iterating over the entire protocol).
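
As a rough illustration of point (4), the sketch below reads speech notes from the header and then fetches only the referenced u elements. The attached example file is not reproduced in this thread, so the markup here is an assumption built from the description above (note elements of type "speech" holding a linkGrp of ptr elements); the `target="#..."` pointer convention, the IDs `i-abc123`, `i-xyz789`, `u-1` to `u-3`, and the helper names are all hypothetical. Real protocol files also declare the TEI namespace, which is left out to keep the snippet short.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, namespace-free miniature of a protocol file.
SAMPLE = """
<TEI>
  <teiHeader><profileDesc><textDesc><constitution>
    <note type="speech" xml:id="i-abc123">
      <linkGrp type="speechElements">
        <ptr target="#u-1"/>
        <ptr target="#u-2"/>
      </linkGrp>
    </note>
    <note type="speech" xml:id="i-xyz789">
      <linkGrp type="speechElements"><ptr target="#u-3"/></linkGrp>
    </note>
  </constitution></textDesc></profileDesc></teiHeader>
  <text><body>
    <u xml:id="u-1">First part.</u>
    <u xml:id="u-2">Second part.</u>
    <u xml:id="u-3">Another speech.</u>
  </body></text>
</TEI>
"""


def speech_targets(root, speech_id):
    """Return the element IDs that the given speech note points to."""
    for note in root.iter("note"):
        if note.get("type") == "speech" and note.get(XML_ID) == speech_id:
            return [ptr.get("target").lstrip("#") for ptr in note.iter("ptr")]
    return []


def utterances(root, speech_id):
    """Resolve a speech ID to its u elements via an xml:id index."""
    # Build the index once; later lookups do not walk the body again.
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    return [by_id[t] for t in speech_targets(root, speech_id) if t in by_id]


root = ET.fromstring(SAMPLE)
texts = [u.text for u in utterances(root, "i-abc123")]
print(texts)
```

The body under `<text>` carries no speech markup at all here; everything needed to reassemble a speech lives in the header.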

_tei_heder_speech_id.xml.txt

fredrik1984 · 2024-10-11T11:26:11Z

Ok, I think it is excellent that we get IDs for speeches. @MansMeg and @ninpnin, do you have any thoughts on the technical aspects here?

ninpnin · 2024-10-14T08:35:47Z

Looks good. I would maybe only add the paragraph IDs, so that it doesn't get too cluttered, but that's a minor consideration and a matter of taste.

@BobBorges Should I suggest this to the parlaclarin people?

BobBorges · 2024-10-14T09:07:09Z

It passes verification -- both TEI and parlaclarin. So if we're happy with it, I think we don't need to do anything else... or maybe show them the example in case some other people want a speech ID.

ninpnin · 2024-10-14T10:15:15Z

Now we have the notation. Another consideration is, what do we define as a speech?

Do we regenerate indices every time a paragraph is added or deleted? Or perhaps every time the first or the last paragraph is changed? Or just the first one? @BobBorges @MansMeg what do you think?

MansMeg · 2024-10-14T15:25:02Z

Oh. I'm not sure I fully understand how this will work. It seems from your example that you also cluster the segments into debates? I think it is better to keep this clear and only have a speech id and then ptr to all the segments that belong to the same speech?

I think the best approach is to make this conclusion as a decision with an example, etc. I.e. file a PR for this to the riksdagen-records repo?

BobBorges · 2024-10-15T08:11:55Z

In this particular example, the speeches are part of an interpellation debate, which is also labeled, but the speeches themselves are not contained in the debate label. This is a way to identify sectioning of the protocols without relying on nested structure we would get by 'div'ing everything off while also providing a way to store debate IDs in a way that fits within the parlaclarin schema.

"I think it is better to keep this clear and only have a speech id and then ptr to all the segments that belong to the same speech?"

We have previously discussed labeling debate types and in that context talked about non-hierarchical text structures, potentially overlapping sectioning, and discontiguous sectioning that XML doesn't do well. I agree with you that we want a "speech" with an ID and pointers to the speech elements, which is not dependent on, or a child of, any other element -- this is exactly how it is here. The debate label is another element that will let us flexibly and iteratively annotate the structures within the text.

• "speech" elements contain pointers to u and seg elements, and only to those that belong to the speech. How we define what a speech is may not be as trivial as the assumption I've been working under, re @ninpnin's comment.
• Other sectioning (debates, addresses, or whatever) contains pointers to elements that contain speeches. E.g. in the posted example, the debate note references the debateSection div (pointer to the div) and the speeches that have been identified as belonging to that debate (pointers to the note elements, which in turn contain pointers to u and seg elems).

If down the line we want to implement some other sectioning that overlaps with the interpellation debate -- let's say it contains speech i-abc123, not i-xyz789, and some other speech in a different debate section div -- nothing prevents us from doing that in this way, and it doesn't require us to make any edits to the XML under <text>.
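
The overlapping case can be sketched as follows. This is an assumption-laden miniature rather than the attached example: the note types "debate" and "otherSection", the IDs `i-def456`, `i-debate1`, `i-section1`, `u-1`, `u-2`, `u-3`, `u-9`, and the function `resolve` are made up for illustration. Only the speech IDs i-abc123 and i-xyz789 come from the scenario above.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical header fragment: two sections that share one speech.
HEADER = """
<constitution>
  <note type="speech" xml:id="i-abc123">
    <linkGrp><ptr target="#u-1"/><ptr target="#u-2"/></linkGrp>
  </note>
  <note type="speech" xml:id="i-xyz789">
    <linkGrp><ptr target="#u-3"/></linkGrp>
  </note>
  <note type="speech" xml:id="i-def456">
    <linkGrp><ptr target="#u-9"/></linkGrp>
  </note>
  <note type="debate" xml:id="i-debate1">
    <linkGrp><ptr target="#i-abc123"/><ptr target="#i-xyz789"/></linkGrp>
  </note>
  <note type="otherSection" xml:id="i-section1">
    <linkGrp><ptr target="#i-abc123"/><ptr target="#i-def456"/></linkGrp>
  </note>
</constitution>
"""


def resolve(notes, note_id):
    """Follow pointers through header notes down to body element IDs."""
    leaves = []
    for ptr in notes[note_id].iter("ptr"):
        target = ptr.get("target").lstrip("#")
        if target in notes:
            # The pointer names another header note (e.g. a speech): recurse.
            leaves.extend(resolve(notes, target))
        else:
            # Anything else is taken to be an element under <text>.
            leaves.append(target)
    return leaves


root = ET.fromstring(HEADER)
notes = {n.get(XML_ID): n for n in root.iter("note")}

debate_elems = resolve(notes, "i-debate1")
section_elems = resolve(notes, "i-section1")
shared = sorted(set(debate_elems) & set(section_elems))
print(debate_elems, section_elems, shared)
```

Both sections include speech i-abc123 without either one nesting inside the other, which is the property a div-based structure cannot express.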

I wanted to get a feel for the reactions to this, but I'll move it over to a decision proposal now.

fredrik1984 · 2024-10-15T08:16:46Z

ok, that sounds good! the important thing for now is to be able to identify and annotate individual speeches. but it is good that the structure also allows for annotating speeches to specific debates.

BobBorges · 2024-10-15T08:19:50Z

yes! I think the goal down the line is to ID all "types" of debates and addresses within the protocols as we have started doing with interpellation debates. If we implement a strategy like this, all the action happens in the metadata and we don't have to worry about messing up the text, how to limit recursion in the xml, or any similar problems.

BobBorges · 2024-10-15T08:22:24Z

... I will move this to a decision proposal sometime today

MansMeg assigned ninpnin Sep 27, 2024

BobBorges mentioned this issue Oct 15, 2024

Decision 7: Speech ID handling swerik-project/the-swedish-parliament-corpus#28

Merged
