Script for creating speech stats prot-speeches.csv #4

Lauler · 2024-04-30T08:06:27Z

Is the script that creates the stats for prot-speeches.csv available somewhere?

https://github.com/swerik-project/the-swedish-parliament-corpus/blob/main/stats/prot-speeches/prot-speeches.csv

I'm interested in comparing my parsing of speeches from the XML with the way it's done to create the stats for swerik-project. Currently I rely on the next attribute being present or absent to extract speeches.

BobBorges · 2024-04-30T09:43:35Z

BobBorges
Apr 30, 2024
Maintainer

It's in readme/src/generate-markdown.py, in the function count_pages_speeches_words(), which starts on line 95. We rely on the assumption that a speech has an introduction, and essentially count the introductions as a proxy for speeches.

Counting next attribs will miss speeches contained in a single element (no idea how many that might be) and will include text that is not a speech but was classified as one (also not sure of the scale, but we know it happens).
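
A minimal sketch of the "count introductions as a proxy for speeches" idea follows. It is not the code in generate-markdown.py. The toy XML, the assumption that introductions are marked as note elements with type="speaker", and the helper names local_name and count_introductions are all illustrative.

```python
import xml.etree.ElementTree as ET

# Toy record. The markup is an assumption for illustration:
# introductions as <note type="speaker">, speech text in <u> elements.
SAMPLE = """<body>
  <note type="speaker">Anf. 1 Herr A:</note>
  <u who="a" next="u2"><seg>First part.</seg></u>
  <u who="a" prev="u1"><seg>Second part.</seg></u>
  <note>(Applause)</note>
  <note type="speaker">Anf. 2 Fru B:</note>
  <u who="b"><seg>A short reply.</seg></u>
</body>"""


def local_name(tag):
    """Drop a '{namespace}' prefix so the check also works on namespaced files."""
    return tag.rsplit("}", 1)[-1]


def count_introductions(root):
    """Count speaker introductions, used here as a proxy for speeches."""
    return sum(
        1
        for el in root.iter()
        if local_name(el.tag) == "note" and el.get("type") == "speaker"
    )


print(count_introductions(ET.fromstring(SAMPLE)))
```

With this approach the plain (Applause) note is ignored, and each introduction counts once regardless of how many elements the speech spans.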

4 replies

Lauler May 1, 2024
Author

All methods (including mine) make some assumptions about the data. The next attribute check works by yielding a speech whenever a <u> element does not contain a next attribute. This only occurs:

- when we encounter the last <u> element in a series of <u> elements that make up a speech;
- when a speech is composed of only a single <u> element (there will be no next in this situation).
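
The two cases above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the function linked in the thread: the toy XML and the names local_name and speeches_by_next are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy record (assumed markup): a two-part speech followed by a one-part speech.
SAMPLE = """<body>
  <u who="a" next="u2"><seg>First part.</seg></u>
  <u who="a" prev="u1"><seg>Second part.</seg></u>
  <u who="b"><seg>A short reply.</seg></u>
</body>"""


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def speeches_by_next(root):
    """Yield one string per speech, closing a speech at each <u> without 'next'."""
    parts = []
    for el in root.iter():
        if local_name(el.tag) != "u":
            continue
        # Collect the text of the element and normalise whitespace.
        parts.append(" ".join("".join(el.itertext()).split()))
        if el.get("next") is None:
            yield " ".join(parts)
            parts = []
    # Any trailing parts whose last <u> still had 'next' are silently dropped.


for speech in speeches_by_next(ET.fromstring(SAMPLE)):
    print(speech)
```

Note that this sketch never looks at introductions, which is why its counts can differ from an introduction-based count.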

After checking differences in speeches between count_pages_speeches_words() and my function, I found the results were fairly similar for 1966-2002: 385927 speeches for count_pages_speeches_words() vs 380549 for the function I linked in the original post.

Upon closer inspection I think you may be double counting some speeches, as the assumption that a given speech has a single introduction <note> does not always hold.

For example see here:

https://github.com/swerik-project/riksdagen-records/blob/9f87d27fd49958c7be3f8e884da66c1374061127/data/197879/prot-197879--042.xml#L3265-L3271

https://github.com/swerik-project/riksdagen-records/blob/9f87d27fd49958c7be3f8e884da66c1374061127/data/198889/prot-198889--116.xml#L5554-L5560

Since the quality of the introduction identification is good and reliable, I think a useful automatic test for identifying incorrect <note> tags that should in reality be <u> is to check whether all tags after an introduction are <note> until the next speaker is encountered. See the first link for an example of that.
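
A rough sketch of such a test is shown below. It flags every introduction that is not followed by any <u> before the next introduction (or the end of the record). The toy XML, the type="speaker" marker for introductions, and the function names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Toy record (assumed markup): the first introduction is followed only by
# <note> elements, which suggests its speech text was tagged as <note>.
SAMPLE = """<body>
  <note type="speaker">Anf. 1 Herr A:</note>
  <note>Text that looks like a speech.</note>
  <note type="speaker">Anf. 2 Fru B:</note>
  <u who="b"><seg>A short reply.</seg></u>
</body>"""


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def introductions_without_utterance(root):
    """Return introductions with no <u> before the next introduction."""
    flagged = []
    pending = None  # introduction still waiting for its first <u>
    for el in root.iter():
        name = local_name(el.tag)
        if name == "note" and el.get("type") == "speaker":
            if pending is not None:
                flagged.append(pending)
            pending = el
        elif name == "u":
            pending = None
    if pending is not None:
        flagged.append(pending)
    return flagged


for intro in introductions_without_utterance(ET.fromstring(SAMPLE)):
    print(intro.text)
```

Flagged introductions are only candidates for manual review, since an introduction can legitimately be followed by no speech.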

MansMeg May 1, 2024
Maintainer

This is excellent! Would it be possible for you to share this function, @Lauler? I think it could be used to look further at the quality of speeches in the corpus.

Lauler May 1, 2024
Author

It can be found here.

Another useful sanity check for finding likely incorrect <u> tags is to reset speech_id to None after yielding a speech. Whenever you encounter a <u>, or a series of <u> elements, that is not preceded by an introduction, you will get "speeches" where speech_id is None.

Some of them might technically be utterances as they are in quotes. But I think many of them might be <note>. Examples from screenshot: 1, 2, 3, 4, 5

Removing these utterances without a preceding speaker introduction reduces the number of speeches in the period 1966-2002 to 374689.
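
The speech_id reset described above might look roughly like this. It is a sketch under assumptions, not the shared function: the toy XML, the id attribute on the introduction, and the name speeches_with_ids are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy record (assumed markup): the last <u> has no introduction before it.
SAMPLE = """<body>
  <note type="speaker" id="i1">Anf. 1 Herr A:</note>
  <u who="a"><seg>A real speech.</seg></u>
  <u who="unknown"><seg>"A quoted passage."</seg></u>
</body>"""


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def speeches_with_ids(root):
    """Yield (speech_id, text); speech_id is None when no introduction preceded."""
    speech_id = None
    parts = []
    for el in root.iter():
        name = local_name(el.tag)
        if name == "note" and el.get("type") == "speaker":
            speech_id = el.get("id")
        elif name == "u":
            parts.append(" ".join("".join(el.itertext()).split()))
            if el.get("next") is None:
                yield speech_id, " ".join(parts)
                parts = []
                speech_id = None  # reset so orphaned <u> elements stand out


speeches = list(speeches_with_ids(ET.fromstring(SAMPLE)))
orphans = [text for sid, text in speeches if sid is None]
print(len(speeches), len(orphans))
```

Filtering out the entries with a None id corresponds to the removal step that produced the lower count mentioned above.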

BobBorges May 1, 2024
Maintainer

Hopefully the not-seg tagging will be improved quite soon. I do think we over-tag speeches; it's possible, for instance, to submit questions in writing, but they are often (or usually) labeled as utterances.

MansMeg · 2024-04-30T09:45:17Z

MansMeg
Apr 30, 2024
Maintainer

Yes. We also feel that the quality of the introduction identification is quite good (based on the master's thesis by Jesper Mortensen Blomqvist in 2022).

0 replies
