zzcoolj/rosetta

Rosetta Project

General Information

Project website

Resources

| Language | Type | Text Para count | Folder | Comment |
|---|---|---|---|---|
| English | tagged text | | Rosetta_data (Amel) | |
| Afrikaans | pdf | | Tom and Huck (Noah) | |
| Arabic | pdf | | Rosetta_data (Amel) | |
| Basque | tagged text | | Rosetta_data (Amel) | |
| Bengali | pdf | | Rosetta_data (Amel) | many illustrations; "12+" |
| Bulgarian | tagged text | | Rosetta_data (Amel) | |
| Chinese | pdf | | Rosetta_data (Amel) | |
| Croatian | pdf | | Rosetta_data (Amel) | |
| Czech | pdf | | Rosetta_data (Amel) | |
| Dozivljaji | pdf | | Tom and Huck (Noah) | |
| Dutch | tagged text | | Rosetta_data (Amel) | 2 versions |
| Estonian | pdf | | Tom and Huck (Noah) | |
| Finnish | raw text | | Rosetta_data (Amel) | |
| Finnish | pdf | | Tom and Huck (Noah) | |
| German | tagged text | | Rosetta_data (Amel) | |
| Hebrew | pdf | | Tom and Huck (Noah) | poor scan quality |
| Hungarian | tagged text | | Rosetta_data (Amel) | 2 versions |
| Italian | pdf | | Rosetta_data (Amel) | |
| Latvian | pdf | | Tom and Huck (Noah) | |
| Polish | tagged text | | Rosetta_data (Amel) | |
| Portuguese | raw text; word | | Rosetta_data (Amel) | each line is NOT one paragraph |
| Romanian | pdf | | Rosetta_data (Amel) | |
| Russian | tagged text | | Rosetta_data (Amel) | |
| Russian | pdf | | Tom and Huck (Noah) | |
| Spanish | url | | Rosetta_data (Amel) | |
| Turkish | epub -> txt | | Rosetta_data (Amel) | |
| Ukrainian | tagged text | | Rosetta_data (Amel) | |
| Vietnamese | pdf | | Rosetta_data (Amel) | |
| Vietnamese | pdf | | Tom and Huck (Noah) | |
| Yiddish | pdf | | Tom and Huck (Noah) | poor scan quality; as I don't read Yiddish, I'm not sure this is Huckleberry Finn |

Development/Programming

Alignment Algorithms Research

  • Read this PhD thesis, pages 83-89 (pages 98-104 of the PDF)

Alignment Evaluations

  • We use three main structures for producing word alignments, as shown in the diagram below.
    [Figure: 3-structure-diagram]
  • During evaluation, we also added a "cleaning" step to our pre-processing, consisting of three changes: removing punctuation marks, lower-casing tokens, and applying different tokenizers.
  • Four accuracy evaluations produced the results below.

| IBM Model | Iterations | Words evaluated | Structure | "Cleaning" | Accuracy |
|---|---|---|---|---|---|
| IBM Model 1 | 20 | 50 | Structure 1 | No | 40.00% |
| IBM Model 1 | 20 | 50 | Structure 1 | Yes | 32.00% |
| IBM Model 1 | 20 | 20 | Structure 2 | Yes | 25.00% |
| IBM Model 2 | 20 | 50 | Structure 1 | No | 32.00% |
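
To make the evaluation setup concrete, here is a minimal sketch of a cleaning step, IBM Model 1 training with 20 EM iterations, and an accuracy score. It is not the project's actual code. The function names (`clean`, `train_ibm1`, `align`, `accuracy`), the toy German-English corpus, and the definition of accuracy as correct links divided by evaluated words are all assumptions made for illustration. The three alignment structures are not modelled, and whitespace splitting stands in for "applying different tokenizers".

```python
import string
from collections import defaultdict


def clean(sentence):
    """Lower-case, strip ASCII punctuation, and split on whitespace."""
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower().split()


def train_ibm1(pairs, iterations=20):
    """Estimate t(f | e) by EM from (foreign_tokens, english_tokens) pairs."""
    f_vocab = {f for f_tokens, _ in pairs for f in f_tokens}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (f, e) link
        total = defaultdict(float)  # expected count of each e
        for f_tokens, e_tokens in pairs:
            candidates = [None] + e_tokens  # None stands in for the NULL word
            for f in f_tokens:
                norm = sum(t[(f, e)] for e in candidates)
                for e in candidates:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise expected counts into probabilities
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def align(f_tokens, e_tokens, t):
    """Link each foreign token to its most probable English token (or None)."""
    candidates = [None] + e_tokens
    return [(f, max(candidates, key=lambda e: t[(f, e)])) for f in f_tokens]


def accuracy(predicted, gold):
    """Share of evaluated words whose predicted link matches the gold link."""
    correct = sum(1 for f, e in gold.items() if predicted.get(f) == e)
    return correct / len(gold)


# Hypothetical toy corpus, for illustration only (not project data).
raw_pairs = [
    ("Das Haus.", "The house."),
    ("Das Buch.", "The book."),
    ("Ein Buch.", "A book."),
]
pairs = [(clean(f), clean(e)) for f, e in raw_pairs]
t = train_ibm1(pairs, iterations=20)

predicted = dict(align(pairs[0][0], pairs[0][1], t))
gold = {"das": "the", "haus": "house"}  # hand-made gold links for the demo
print(accuracy(predicted, gold))
```

Under the accuracy definition assumed here, 40.00% over 50 evaluated words would correspond to 20 correct links. Note that `string.punctuation` covers ASCII punctuation only, so other scripts and typographic quotes would need a broader rule.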

Experts analysis and crowd sourcing

Relevant workshops and seminars

Timetable

[submission] Rosetta4Slavic

Rosetta4Endangered

  • Starting date: 22 April
  • Languages:
    • Basque
    • Finno-Ugric: Finnish & Hungarian
  • Digging into alignment algorithms
  • Results analysis by domain experts
  • Crowdsourcing test using the Translation dashboard

[submission] MDPI information

  • Special Issue "Computational Linguistics for Low-Resource Languages"
  • Deadline for manuscript submissions: 15 July 2019
  • Tentative Title: Mapping the Circulation of Literary Writings through Aligned Translations: The example of Slavic and Finno-Ugric Translations of Adventures of Huckleberry Finn.
  • Abstract: Because translated texts have been regarded as unreliable due to suspicions of bias and untrustworthiness, they have so far been an overlooked resource in the field of NLP. But localizing, digitizing, and aligning translated texts of well-travelled famous novels can provide a fruitful basis for developing digitized linguistic material for under-resourced languages.
    In this paper we focus on translations of Mark Twain’s Adventures of Huckleberry Finn into a set of Slavic and Finno-Ugric languages in order to map the circulation of ideas and writings and to build up digitized linguistic material with a view to helping preserve the diversity of languages and cultures.