
Updating offsets when a text resource is altered #26

Open
ngawangtrinley opened this issue Jul 11, 2024 · 4 comments

@ngawangtrinley

One of the main challenges our project faces is that we have multiple copies of the same text resource with varying degrees of cleanliness and annotation. For instance, we may have 50 instances of the Heart Sutra where the cleanest one has no TOC annotations while a very dirty version has great NER tags. In some cases we might also only have a bad-quality text resource that is being proofread and annotated over the course of a year.

Our goal is to be able to combine the best aspects of all resources and annotations at any given time.

In other words, we see STAM as the pivot format that will link Buddhist data in archives like BDRC, SuttaCentral or CBETA and websites like 84000 and pecha.org, which means that we will have to update, split and merge text resources and annotations on a regular basis.

We are also putting together training datasets for the project monlam.ai, which also requires annotation transfer. For instance, our MT model currently suffers from a lot of typos in our dataset of 2 million aligned sentences, and we need to transfer the segment annotations to the cleaner versions of the texts we are currently producing.

A couple of years ago, our team came up with an "annotation transfer" or "base text update" mechanism combining our CCTV algorithm with Google's Diff Match Patch package.

What would be your approach to tackle this challenge with STAM?
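
To give an idea of the kind of offset remapping involved, here is a minimal Python sketch (not our actual CCTV implementation; the text and annotation span are made up for illustration) of pushing an annotation's boundaries through a Diff Match Patch diff between the old and new base text:

```python
# Minimal sketch: remap annotation offsets from an old base text to a
# corrected one using Google's diff-match-patch (pip install diff-match-patch).
from diff_match_patch import diff_match_patch

old_text = "the sutra was transalted from Sanskrit"
new_text = "the sutra was translated from Sanskrit"

dmp = diff_match_patch()
diffs = dmp.diff_main(old_text, new_text)
dmp.diff_cleanupSemantic(diffs)

# A hypothetical annotation covering the misspelled word in the old text.
start = old_text.index("transalted")
end = start + len("transalted")

# diff_xIndex maps a position in old_text to the equivalent position in new_text.
new_start = dmp.diff_xIndex(diffs, start)
new_end = dmp.diff_xIndex(diffs, end)

print(old_text[start:end], "->", new_text[new_start:new_end])
# prints: transalted -> translated
```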

ngawangtrinley changed the title from "Updating offsets when a text resource is updated" to "Updating offsets when a text resource is altered" on Jul 11, 2024
@proycon
Collaborator

proycon commented Jul 11, 2024

I see the challenge. This is indeed an important question. If an annotation is made and the text resource then changes, that would of course likely invalidate the annotation. First of all: there is a simple STAM Text Validation Extension that validates annotations against a text and ensures the text hasn't changed.
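
Conceptually it boils down to something like the following (just a sketch of the idea, not the extension's actual data model): keep a checksum or copy of the text an annotation covers, and recheck it later.

```python
# Conceptual sketch only: detect that the text under an annotation changed
# by storing a checksum of the covered slice and re-verifying it later.
import hashlib

def checksum(text: str, start: int, end: int) -> str:
    """Checksum of the text slice an annotation points at."""
    return hashlib.sha1(text[start:end].encode("utf-8")).hexdigest()

resource = "Form is emptiness, emptiness is form."
annotation = {"start": 8, "end": 17}
annotation["checksum"] = checksum(resource, annotation["start"], annotation["end"])

# Later, after the resource may have been edited:
edited = "Form is empty, emptiness is form."
valid = checksum(edited, annotation["start"], annotation["end"]) == annotation["checksum"]
print("annotation still valid:", valid)  # False: the offsets no longer cover the same text
```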

If a text changes you'd indeed have to transfer annotations from the old text to the new one. This too is something that I have focused on, as we had a similar use case. A mechanism for relating identical text selections across multiple resources is defined in the STAM Transpose extension. Using this mechanism, you can 'transpose' any annotations on resource A to resource B via a so-called transposition (which is just a specific type of annotation defined by that extension). This is already implemented.
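
Roughly speaking (a toy illustration, not the actual stam library API), a transposition pairs identical spans in two resources, and transposing an annotation means re-expressing its offsets through one of those pairs:

```python
# Toy illustration of the transposition idea: pairs of identical text spans
# in resource A and resource B, used to shift an annotation's offsets.

# Hypothetical aligned pairs: (start_a, end_a, start_b, end_b)
transposition = [
    (0, 40, 10, 50),     # first identical stretch of text
    (55, 120, 60, 125),  # second identical stretch
]

def transpose(start_a: int, end_a: int):
    """Map an annotation's span on resource A onto resource B, if possible."""
    for a0, a1, b0, b1 in transposition:
        if a0 <= start_a and end_a <= a1:
            shift = b0 - a0
            return start_a + shift, end_a + shift
    return None  # the span is not covered by any identical stretch

print(transpose(60, 70))  # -> (65, 75)
print(transpose(45, 50))  # -> None (falls outside the aligned spans)
```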

A mechanism for finding such identical text selections (i.e. computing transpositions) is implemented in the stam align tool (https://github.com/annotation/stam-tools). This uses the Smith-Waterman or Needleman-Wunsch algorithms from bioinformatics to find identical sequences in different texts and link them.
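
As a rough stand-in for what that alignment step produces (using Python's standard-library difflib here, not the Smith-Waterman/Needleman-Wunsch implementation in stam align), you can think of it as finding the identical stretches two texts share, which a transposition then records:

```python
# Stand-in for the alignment step: find identical stretches shared by two
# versions of a text with difflib's SequenceMatcher.
from difflib import SequenceMatcher

text_a = "gate gate paragate parasamgate bodhi svaha"
text_b = "gate gate pāragate pārasaṃgate bodhi svāhā"

matcher = SequenceMatcher(None, text_a, text_b)
for block in matcher.get_matching_blocks():
    if block.size:  # the final block is a zero-length sentinel
        print(f"A[{block.a}:{block.a + block.size}] == "
              f"B[{block.b}:{block.b + block.size}]: "
              f"{text_a[block.a:block.a + block.size]!r}")
```

stam align computes this kind of mapping with proper alignment algorithms and stores the result as transpositions.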

Currently, however, this is still too limited for what you want, as it only works for text selections that are really identical. Say you annotated a misspelled word like 'accomodate' in one resource and want to transfer it to the corrected form 'accommodate' in the updated resource: you would want the annotation to fully cover the new word, whereas my current implementation would preserve the old annotation exactly and still leave the new 'm' out.

The good news is that I'm already looking into ways to achieve this, since we also need it ourselves in a project. So it's something that's definitely on my radar to be implemented soon (a very rough estimate is 2-3 months, also because of the holiday/summer period).

> A couple of years ago, our team came up with an "annotation transfer" or "base text update" mechanism combining our CCTV algorithm with Google's Diff Match Patch package.

That is very interesting. It's good to hear you've done some research into this. I'll take a look at how you guys did this; perhaps the core algorithm there is even something that could be ported to STAM. I see it is primarily based on Myers' diff algorithm.

proycon self-assigned this on Jul 11, 2024
proycon added the "question" (Further information is requested) label on Jul 11, 2024
@ngawangtrinley
Author

ngawangtrinley commented Jul 11, 2024

Good to hear you're working on this! Having a solution directly integrated into stam would be best.

Our solution starts with https://github.com/OpenPecha/Toolkit/blob/master/openpecha/blupdate.py and uses DMP as a backup.

We didn't get as far as you with the format architecture, but the fundamentals are very similar: we called text resources "base layers" and annotation datastores "annotation layers" (think of layers in Photoshop), and we save annotations in YAML files rather than JSON. This makes it quite a natural transition for us.
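
To give an idea (a hypothetical example, not our actual schema), an annotation layer is essentially a small YAML document whose offsets point into a separate base layer file:

```python
# Hypothetical illustration only: an "annotation layer" serialized as YAML,
# with offsets pointing into a separate "base layer" text file.
# Requires PyYAML (pip install pyyaml).
import yaml

annotation_layer = {
    "base_layer": "heart_sutra_base.txt",  # the text resource being annotated
    "annotations": [
        {"id": "seg-0001", "type": "segment", "start": 0, "end": 42},
        {"id": "ner-0001", "type": "person", "start": 10, "end": 21},
    ],
}
print(yaml.safe_dump(annotation_layer, sort_keys=False))
```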

Our colleague @eroux from BDRC came up with the CCTV approach in blupdate.py, so I'm sure he'll be interested to hear what you think and if you have a better solution.

@eroux

eroux commented Jul 11, 2024

Ah thanks, that's good to hear! In our tests, solutions based on the Smith-Waterman or Needleman-Wunsch algorithms don't perform very well, while Myers' diff does, even on pretty large files. There seems to be a Rust rewrite of the library we're using in Python/C (https://crates.io/crates/diffmatchpatch); perhaps that could be a new option in stam align?

@proycon
Collaborator

proycon commented Jul 12, 2024

Thanks for the link, good to see it has even already been ported to Rust. That might indeed make a very good option to implement in stam align.
