
Updating offsets when a text resource is altered #26

Open
ngawangtrinley opened this issue Jul 11, 2024 · 4 comments

@ngawangtrinley

One of the main challenges our project faces is that we have multiple copies of the same text resource with varying degrees of cleanliness and annotation. For instance, we may have 50 instances of the Heart Sutra where the cleanest one has no TOC annotations while a very dirty version has great NER tags. In some cases we might also only have a bad-quality text resource that is being proofread and annotated over the course of a year.

Our goal is to be able to combine the best aspects of all resources and annotations at any given time.

In other words, we see STAM as the pivot format that will link Buddhist data in archives like BDRC, SuttaCentral or CBETA and websites like 84000 and pecha.org, which means that we will have to update, split and merge text resources and annotations on a regular basis.

We are also putting together training datasets for the project monlam.ai, which also requires annotation transfer. For instance, our MT model currently suffers from a lot of typos in our dataset of 2 million aligned sentences, and we need to transfer the segment annotations to the cleaner versions of the texts we are currently producing.

A couple of years ago, our team came up with an "annotation transfer" or "base text update" mechanism combining our CCTV algorithm with Google's Diff Match Patch package.

What would be your approach to tackle this challenge with STAM?
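
To give an idea of the kind of offset remapping involved, here is a minimal Python sketch (not our actual CCTV implementation; the text and annotation span are made up for illustration) of pushing an annotation's boundaries through a Diff Match Patch diff between the old and new base text:

```python
# Minimal sketch: remap annotation offsets from an old base text to a
# corrected one using Google's diff-match-patch (pip install diff-match-patch).
from diff_match_patch import diff_match_patch

old_text = "the sutra was transalted from Sanskrit"
new_text = "the sutra was translated from Sanskrit"

dmp = diff_match_patch()
diffs = dmp.diff_main(old_text, new_text)
dmp.diff_cleanupSemantic(diffs)

# A hypothetical annotation covering the misspelled word in the old text.
start = old_text.index("transalted")
end = start + len("transalted")

# diff_xIndex maps a position in old_text to the equivalent position in new_text.
new_start = dmp.diff_xIndex(diffs, start)
new_end = dmp.diff_xIndex(diffs, end)

print(old_text[start:end], "->", new_text[new_start:new_end])
# prints: transalted -> translated
```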

ngawangtrinley changed the title from "Updating offsets when a text resource is updated" to "Updating offsets when a text resource is altered" on Jul 11, 2024
@proycon
Collaborator

proycon commented Jul 11, 2024

I see the challenge. This is indeed an important question. If an annotation is made and the text resource then changes, that would of course likely invalidate the annotation. First of all: there is a simple STAM Text Validation Extension that validates annotations against a text and ensures the text hasn't changed.
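
Conceptually it boils down to something like the following (just a sketch of the idea, not the extension's actual data model): keep a checksum or copy of the text an annotation covers, and recheck it later.

```python
# Conceptual sketch only: detect that the text under an annotation changed
# by storing a checksum of the covered slice and re-verifying it later.
import hashlib

def checksum(text: str, start: int, end: int) -> str:
    """Checksum of the text slice an annotation points at."""
    return hashlib.sha1(text[start:end].encode("utf-8")).hexdigest()

resource = "Form is emptiness, emptiness is form."
annotation = {"start": 8, "end": 17}
annotation["checksum"] = checksum(resource, annotation["start"], annotation["end"])

# Later, after the resource may have been edited:
edited = "Form is empty, emptiness is form."
valid = checksum(edited, annotation["start"], annotation["end"]) == annotation["checksum"]
print("annotation still valid:", valid)  # False: the offsets no longer cover the same text
```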

If a text changes you'd indeed have to transfer annotations from the old text to the new one. This too is something that I have focused on, as we had a similar use case. A mechanism for relating identical text selections across multiple resources is defined in the STAM Transpose extension. Using this mechanism, you can 'transpose' any annotations on resource A to resource B via a so-called transposition (which is just a specific type of annotation defined by that extension). This is already implemented.
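
Roughly speaking (a toy illustration, not the actual stam library API), a transposition pairs identical spans in two resources, and transposing an annotation means re-expressing its offsets through one of those pairs:

```python
# Toy illustration of the transposition idea: pairs of identical text spans
# in resource A and resource B, used to shift an annotation's offsets.

# Hypothetical aligned pairs: (start_a, end_a, start_b, end_b)
transposition = [
    (0, 40, 10, 50),     # first identical stretch of text
    (55, 120, 60, 125),  # second identical stretch
]

def transpose(start_a: int, end_a: int):
    """Map an annotation's span on resource A onto resource B, if possible."""
    for a0, a1, b0, b1 in transposition:
        if a0 <= start_a and end_a <= a1:
            shift = b0 - a0
            return start_a + shift, end_a + shift
    return None  # the span is not covered by any identical stretch

print(transpose(60, 70))  # -> (65, 75)
print(transpose(45, 50))  # -> None (falls outside the aligned spans)
```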

A mechanism for finding such identical text selections (i.e. computing transpositions) is implemented in the stam align tool (https://github.com/annotation/stam-tools). This uses the Smith-Waterman or Needleman-Wunsch algorithms from bioinformatics to find identical sequences in different texts and link them.
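
As a rough stand-in for what that alignment step produces (using Python's standard-library difflib here, not the Smith-Waterman/Needleman-Wunsch implementation in stam align), you can think of it as finding the identical stretches two texts share, which a transposition then records:

```python
# Stand-in for the alignment step: find identical stretches shared by two
# versions of a text with difflib's SequenceMatcher.
from difflib import SequenceMatcher

text_a = "gate gate paragate parasamgate bodhi svaha"
text_b = "gate gate pāragate pārasaṃgate bodhi svāhā"

matcher = SequenceMatcher(None, text_a, text_b)
for block in matcher.get_matching_blocks():
    if block.size:  # the final block is a zero-length sentinel
        print(f"A[{block.a}:{block.a + block.size}] == "
              f"B[{block.b}:{block.b + block.size}]: "
              f"{text_a[block.a:block.a + block.size]!r}")
```

stam align computes this kind of mapping with proper alignment algorithms and stores the result as transpositions.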

Currently, however, this is still too limited for what you want, as it only works for text selections that are really identical. Say you annotated a misspelled word like 'accomodate' in one resource and want to transfer it to the corrected form 'accommodate' in the updated resource: you would want the annotation to fully cover the new word, whereas my current implementation would preserve the old annotation exactly and still leave the new 'm' out.

The good news is that I'm already looking into ways to achieve this, since we also need it ourselves in a project. So it's something that's definitely on my radar to be implemented soon (a very rough estimate is 2-3 months, also because of the holiday/summer period).

> A couple of years ago, our team came up with an "annotation transfer" or "base text update" mechanism combining our CCTV algorithm with Google's Diff Match Patch package.

That is very interesting. It's good to hear you've done some research into this. I'll take a look at how you guys did this; perhaps the core algorithm there is even something that could be ported to STAM. I see it is primarily based on Myers' diff algorithm.

proycon self-assigned this on Jul 11, 2024
proycon added the "question" (Further information is requested) label on Jul 11, 2024
@ngawangtrinley
Author

ngawangtrinley commented Jul 11, 2024

Good to hear you're working on this! Having a solution directly integrated into stam would be best.

Our solution starts with https://github.com/OpenPecha/Toolkit/blob/master/openpecha/blupdate.py and uses DMP as a backup.

We didn't get as far as you with the format architecture, but the fundamentals are very similar: we called text resources "base layers" and annotation datastores "annotation layers" (think of layers in Photoshop), and we save annotations in YAML files rather than JSON. This makes it quite a natural transition for us.
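
To give an idea (a hypothetical example, not our actual schema), an annotation layer is essentially a small YAML document whose offsets point into a separate base layer file:

```python
# Hypothetical illustration only: an "annotation layer" serialized as YAML,
# with offsets pointing into a separate "base layer" text file.
# Requires PyYAML (pip install pyyaml).
import yaml

annotation_layer = {
    "base_layer": "heart_sutra_base.txt",  # the text resource being annotated
    "annotations": [
        {"id": "seg-0001", "type": "segment", "start": 0, "end": 42},
        {"id": "ner-0001", "type": "person", "start": 10, "end": 21},
    ],
}
print(yaml.safe_dump(annotation_layer, sort_keys=False))
```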

Our colleague @eroux from BDRC came up with the CCTV approach in blupdate.py, so I'm sure he'll be interested to hear what you think and if you have a better solution.

@eroux

eroux commented Jul 11, 2024

Ah thanks, that's good to hear! In our tests, solutions based on the Smith-Waterman or Needleman-Wunsch algorithms don't perform very well, while Myers' diff does, even on pretty large files. There seems to be a Rust rewrite of the library we're using in Python/C (https://crates.io/crates/diffmatchpatch); perhaps that could be a new option in stam align?

@proycon
Collaborator

proycon commented Jul 12, 2024

Thanks for the link, good to see it has even already been ported to Rust. That might indeed make a very good option to implement in stam align.
