NER on whole paragraphs, instead of just keywords #12359

goonhoon · 2023-03-02T22:35:52Z

Hi,

I am currently annotating about 510 contracts. In most cases I tag whole paragraphs, rather than just words or phrases, with entities that describe the paragraph (e.g. 'termination clause'). I have now realised that NER may not work for recognising whole paragraphs. All I want is for the model to guess what a paragraph within the contract is about. On the examples I have run in displacy so far it seems to do well, but I wonder whether there is a better way to work with paragraphs.

Thanks :)
GH

Answered by adrianeboyd

adrianeboyd · 2023-03-03T10:40:00Z

The NER model isn't really intended for longer texts like this, and this sounds like a good use case for spancat, which you could use to directly label all paragraphs in a text.

You'd need a solid working definition of "paragraph" so that you can annotate the same units as paragraphs in your training data and also suggest those exact same units for new texts with a custom suggester function for your spancat component.

I'm surprised that I didn't easily find this kind of suggester somewhere in our projects, but here's a third-party example that suggests sentences, which should be similar in practice to suggesting paragraphs:

https://github.com/thiippal/MoodCat/blob/867438444fd3c0d1cae3e68073b7706f3a8d2496/scripts/sent_suggester.py#L22-L64

(Related discussion: #10657)
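
As a rough illustration of what a working definition of "paragraph" could look like, here is a minimal sketch that treats any whitespace token containing two or more newlines as a paragraph break and returns token offsets for each paragraph. The function name `paragraph_offsets` and the blank-line rule are assumptions made for this example, not part of spaCy or the linked script. It also assumes the tokenizer emits blank lines as separate whitespace tokens such as `"\n\n"`. A real spancat suggester would take `Doc` objects, wrap offsets like these in a `Ragged` array, and be registered with `@spacy.registry.misc`, as spaCy's built-in ngram suggester is.

```python
from typing import List, Tuple


def paragraph_offsets(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets (end exclusive) for each paragraph.

    Assumption: a paragraph break is a whitespace-only token that contains
    at least two newline characters. Single newlines stay inside a paragraph.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index of the first non-whitespace token of the current paragraph
    last = None   # index of the most recent non-whitespace token
    for i, tok in enumerate(tokens):
        if tok.strip():
            # Regular token: open a paragraph if none is open, extend otherwise.
            if start is None:
                start = i
            last = i
        elif tok.count("\n") >= 2 and start is not None:
            # Blank line: close the current paragraph, dropping trailing whitespace.
            spans.append((start, last + 1))
            start = None
    if start is not None:
        # Close the final paragraph if the text does not end with a blank line.
        spans.append((start, last + 1))
    return spans


# Hypothetical token texts, e.g. [t.text for t in doc]
tokens = ["Term", "ends", ".", "\n\n", "Either", "party", "\n", "may", "terminate", ".", "\n"]
print(paragraph_offsets(tokens))  # [(0, 3), (4, 10)]
```

Whatever rule is chosen, the same function should be used both to create the training annotation and to suggest spans at prediction time, so that the two always line up.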

4 replies

goonhoon Mar 3, 2023
Author

@adrianeboyd thanks a lot, this looks quite helpful, so I will have a look at home. As for spancat and span labeling, what annotation and data format does the model require? Could I work with the same structure of annotated entities as I would with NER (i.e., .json converted to the .spacy format)?

adrianeboyd Mar 3, 2023

Ah, here's where we have implemented the extra suggesters, I just forgot which repo it was in:

https://github.com/explosion/spacy-experimental#misc

Install spacy-experimental to use these suggesters. The code is here: https://github.com/explosion/spacy-experimental/tree/745f17f1cbb06e2045add082c89fd58b426ecfca/spacy_experimental/span_suggesters

Here's a script that shows how to copy the annotation from doc.ents to spans for spancat and save it back out as .spacy:

https://github.com/explosion/projects/blob/1ff84ca66a1c36c64e4ba2f2e7779c1b2df82fe8/experimental/ner_spancat/scripts/add_ents_to_spans_dict.py

goonhoon Mar 3, 2023
Author

Thanks a lot. I also see that displacy now supports spancat, which is helpful too.

What I do not understand, however, is why I need a solid working definition of a 'paragraph'. As in the image below (Kira software), I may sometimes work with words or sentences (e.g. the contract title or date), sometimes with shorter paragraphs, and sometimes with large, multi-page paragraphs (e.g. the termination clause in the image below).

Overlapping is rarely an issue, and contracts are usually quite clear about separating issues into separate units (sentences or paragraphs). Therefore, I do not necessarily care about spancat allowing for overlapping tags, although I can see it being extremely useful.

The image below is what I'd eventually like to achieve, or at least get as close to as possible with my fairly limited dataset (500-1,000 contracts).

Edit: I think I will give textcat a shot for this purpose, but any further recommendations are welcome.

adrianeboyd Mar 6, 2023

If you're not consistently tagging a unit like paragraphs and you have lots of longer spans, then I'd suggest trying out the span_finder from spacy-experimental. Here's an example project:

https://github.com/explosion/spacy-experimental/tree/v0.6.2/projects/span_finder

textcat is also an option if you can split your documents up into smaller units in advance, but that's basically the same problem as having a working definition of paragraph.

In general, if you don't have a pretty good idea of the units you'd like to classify, it will be hard to annotate and train consistently. With spancat, if you use an ngram suggester and have a huge range of possible ngram lengths it explodes computationally, so it's better to switch to a different kind of suggester like a sentence suggester.
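
To put rough numbers on the ngram point, here is a small back-of-the-envelope sketch. It counts the candidate spans an ngram-style suggester would propose (every contiguous run of tokens for each allowed size) and compares that with one candidate per sentence. The document length, sentence count and size range are made-up figures for illustration, and `ngram_candidate_count` is a hypothetical helper, not a spaCy function.

```python
from typing import Iterable


def ngram_candidate_count(n_tokens: int, sizes: Iterable[int]) -> int:
    """Count contiguous spans of the given sizes in a document of n_tokens tokens.

    A document of n tokens contains n - size + 1 spans of a given size,
    and none if the size exceeds the document length.
    """
    return sum(n_tokens - size + 1 for size in sizes if size <= n_tokens)


# Hypothetical contract: 2,000 tokens, 80 sentences, clauses of up to 500 tokens.
n_tokens = 2000
n_sentences = 80

ngram_candidates = ngram_candidate_count(n_tokens, range(1, 501))
sentence_candidates = n_sentences  # one candidate span per sentence

print(ngram_candidates)     # 875250
print(sentence_candidates)  # 80
```

Under these assumptions the ngram approach has to score several orders of magnitude more candidates per document than a sentence-level suggester, almost all of which are not real spans.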

Looking at the example data above and how some spans end in the middle of a paragraph but not in the middle of a sentence, you could also consider classifying individual sentences with spancat rather than longer spans?
