Project website
Language | Type | Text | Para count | Folder | Comment |
---|---|---|---|---|---|
English | tagged text | √ | √ | Rosetta_data (Amel) | |
Afrikaans | | | | Tom and Huck(Noah) | |
Arabic | | | | Rosetta_data (Amel) | |
Basque | tagged text | √ | √ | Rosetta_data (Amel) | |
Bengali | | | | Rosetta_data (Amel) | many illustrations; "12+" |
Bulgarian | tagged text | √ | √ | Rosetta_data (Amel) | |
Chinese | | √ | | Rosetta_data (Amel) | |
Croatian | | √ | | Rosetta_data (Amel) | |
Czech | | √ | | Rosetta_data (Amel) | |
Dozivljaji | | | | Tom and Huck(Noah) | |
Dutch | tagged text | √ | √ | Rosetta_data (Amel) | 2 versions |
Estonian | | | | Tom and Huck(Noah) | |
Finnish | raw text | √ | √ | Rosetta_data (Amel) | |
Finnish | | | | Tom and Huck(Noah) | |
German | tagged text | √ | √ | Rosetta_data (Amel) | |
Hebrew | | | | Tom and Huck(Noah) | scan quality is not good |
Hungarian | tagged text | √ | √ | Rosetta_data (Amel) | 2 versions |
Italian | | √ | | Rosetta_data (Amel) | |
Latvian | | | | Tom and Huck(Noah) | |
Polish | tagged text | √ | √ | Rosetta_data (Amel) | |
Portuguese | raw text; word | √ | √ | Rosetta_data (Amel) | each line is NOT one paragraph |
Romanian | | √ | | Rosetta_data (Amel) | |
Russian | tagged text | √ | √ | Rosetta_data (Amel) | |
Russian | | | | Tom and Huck(Noah) | |
Spanish | url | | | Rosetta_data (Amel) | |
Turkish | epub -> txt | √ | | Rosetta_data (Amel) | |
Ukrainian | tagged text | √ | √ | Rosetta_data (Amel) | |
Vietnamese | | √ | | Rosetta_data (Amel) | |
Vietnamese | | | | Tom and Huck(Noah) | |
Yiddish | | | | Tom and Huck(Noah) | poor scan quality; as I don't understand Yiddish, I'm not sure this is Huckleberry Finn |
- Check out team work boards in Trello
- Jump into the hackers team: contact me!
- Read this PhD thesis Page 83 (P98 in pdf) - Page 89 (P104 in pdf)
- We have three main structures for producing word alignment, as shown in the diagram below.
- During evaluation, we also devised a "cleaning" pre-processing procedure consisting of three changes: removing punctuation marks, lower-casing tokens, and applying different tokenizers.
- After four accuracy evaluations, we obtained the results below.
Model | Iterations | Words evaluated | Structure | Cleaning | Accuracy |
---|---|---|---|---|---|
IBM Model 1 | 20 | 50 | 1 | No | 40.00% |
IBM Model 1 | 20 | 50 | 1 | Yes | 32.00% |
IBM Model 1 | 20 | 20 | 2 | Yes | 25.00% |
IBM Model 2 | 20 | 50 | 1 | No | 32.00% |
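The cleaning-then-training pipeline described above can be sketched in pure Python. The toy English–German bitext and the `clean` helper below are illustrative assumptions, not our actual data or code; the EM loop is the standard IBM Model 1 estimation of translation probabilities t(f|e):

```python
from collections import defaultdict

PUNCT = ".,;:!?\"'()«»"

def clean(sentence):
    """The 'cleaning' step: strip punctuation marks and lower-case tokens."""
    tokens = (t.strip(PUNCT).lower() for t in sentence.split())
    return [t for t in tokens if t]

def ibm_model1(bitext, iterations=20):
    """EM training of IBM Model 1 translation probabilities t(f|e).

    bitext is a list of (source_tokens, target_tokens) pairs.
    """
    src_vocab = {e for es, _ in bitext for e in es}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # normalisers per source word e
        for es, fs in bitext:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normalise over source words
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():  # M-step: re-normalise
            t[(f, e)] = c / total[e]
    return t

bitext = [(clean("the house."), clean("das Haus.")),
          (clean("the book,"), clean("das Buch,")),
          (clean("a book"), clean("ein Buch"))]
t = ibm_model1(bitext, iterations=20)
# after training, "haus" is explained by "house" rather than by "the"
```

The same `t` table is what an alignment step would consult: for each target word, pick the source word maximising t(f|e).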
- Papers to read
- PhD thesis Page 87 (P102 in pdf) - Page 88 (P103 in pdf), Evaluation metrics and Evaluation corpora
- The BAF: A Corpus of English-French Bitext
- Revisiting sentence alignment algorithms for alignment visualization and evaluation
- Exploring Translation Corpora with MkAlign
- Website to check
- FarkasTranslations.com: multi-parallel corpora of publicly available books.
- The online view of The Adventures of Tom Sawyer can serve as a baseline for our tool. (We are going to build something more than that, for sure.)
- Two-step crowd sourcing plan
- Why do we need paragraph alignments of literary texts?
- They can be used as a gold alignment for evaluating sentence/paragraph alignment algorithms on literary texts. Manual alignments constitute the most reliable references, but they are quite rare.
- They simplify the generation of evaluation corpora for literary alignment.
- Previous gold references only specify alignment links; they cannot be used to evaluate the similarity between aligned paragraphs, which is important for literary translation studies.
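As a concrete illustration of using a manual gold alignment for evaluation, predicted alignment links can be scored with precision, recall, and F1 against the reference. The link pairs below are made up for the example; real links would come from an automatic aligner and our annotated gold data:

```python
def alignment_scores(predicted, gold):
    """Precision, recall and F1 of predicted alignment links against a
    gold reference. Links are (source_para_index, target_para_index) pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # links the aligner got right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 0), (1, 1), (2, 2), (3, 4)]  # manual reference links
pred = [(0, 0), (1, 1), (2, 3), (3, 4)]  # hypothetical aligner output
p, r, f = alignment_scores(pred, gold)   # 0.75, 0.75, 0.75: 3 of 4 links match
```

Because gold links are rare for literary texts (the point above), building them via crowdsourcing is what makes this kind of evaluation feasible at scale.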
- How?
- Check our two-step crowd sourcing design draft.
- 0/1/2 score standards in the second step: 2 means aligned (insofar as possible; sometimes a paragraph is truncated, but the missing material is not in a different paragraph), 1 means partially aligned (including cases where an English paragraph is split across multiple paragraphs), and 0 means the paragraph was not translated.
- A manually annotated example is available here.
- Paragraph summarization
- Possible submission
- Aggregating and analysing crowdsourced annotations for NLP AnnoNLP workshop@EMNLP-IJCNLP 2019
- Submission deadline: Sep 2, 2019
- Starting date: 17 April
- Project: Translation dashboard prototype development
- Design draft
- Online demo
- Check translation-dashboard folder for more information about the development
- Starting date: 22 April
- Languages:
- Basque
- Finno-Ugric: Finnish & Hungarian
- Digging into alignment algorithms
- Results analysis by domain experts
- Crowdsourcing test using the Translation dashboard
- Special Issue "Computational Linguistics for Low-Resource Languages"
- Deadline for manuscript submissions: 15 July 2019
- Tentative Title: Mapping the Circulation of Literary Writings through Aligned Translations: The example of Slavic and Finno-Ugric Translations of Adventures of Huckleberry Finn.
- Abstract: Because translated texts have been regarded as unreliable due to suspicions of bias and untrustworthiness, they have so far been an overlooked resource in the field of NLP. But localizing, digitizing, and aligning translations of well-travelled famous novels can provide a fruitful basis for developing digitized linguistic material in under-resourced languages. In this paper we focus on translations of Mark Twain's Adventures of Huckleberry Finn into a set of Slavic and Finno-Ugric languages, in order to map the circulation of ideas and writings and to build digitized linguistic material with a view to helping preserve the diversity of languages and cultures.