Project website
Language | Type | Text | Para count | Folder | Comment |
---|---|---|---|---|---|
English | tagged text | √ | √ | Rosetta_data (Amel) | |
Afrikaans | | | | Tom and Huck(Noah) | |
Arabic | | | | Rosetta_data (Amel) | |
Basque | tagged text | √ | √ | Rosetta_data (Amel) | |
Bengali | | | | Rosetta_data (Amel) | many illustrations; "12+" |
Bulgarian | tagged text | √ | √ | Rosetta_data (Amel) | |
Chinese | | √ | | Rosetta_data (Amel) | |
Croatian | | √ | | Rosetta_data (Amel) | |
Czech | | √ | | Rosetta_data (Amel) | |
Dozivljaji | | | | Tom and Huck(Noah) | |
Dutch | tagged text | √ | √ | Rosetta_data (Amel) | 2 versions |
Estonian | | | | Tom and Huck(Noah) | |
Finnish | raw text | √ | √ | Rosetta_data (Amel) | |
Finnish | | | | Tom and Huck(Noah) | |
German | tagged text | √ | √ | Rosetta_data (Amel) | |
Hebrew | | | | Tom and Huck(Noah) | scan quality is not good |
Hungarian | tagged text | √ | √ | Rosetta_data (Amel) | 2 versions |
Italian | | √ | | Rosetta_data (Amel) | |
Latvian | | | | Tom and Huck(Noah) | |
Polish | tagged text | √ | √ | Rosetta_data (Amel) | |
Portuguese | raw text; word | √ | √ | Rosetta_data (Amel) | each line is NOT one paragraph |
Romanian | | √ | | Rosetta_data (Amel) | |
Russian | tagged text | √ | √ | Rosetta_data (Amel) | |
Russian | | | | Tom and Huck(Noah) | |
Spanish | url | | | Rosetta_data (Amel) | |
Turkish | epub -> txt | √ | | Rosetta_data (Amel) | |
Ukrainian | tagged text | √ | √ | Rosetta_data (Amel) | |
Vietnamese | | √ | | Rosetta_data (Amel) | |
Vietnamese | | | | Tom and Huck(Noah) | |
Yiddish | | | | Tom and Huck(Noah) | poor scan quality; as I don't understand Yiddish, I'm not sure this is Huckleberry Finn |
- Check out team work boards in Trello
- Jump into the hackers team: contact me!
- Read this PhD thesis Page 83 (P98 in pdf) - Page 89 (P104 in pdf)
- We have three main structures for producing word alignment, as shown in the diagram below.
- During evaluation, we also devised a "cleaning" pre-processing procedure consisting of three changes: removing punctuation marks, lower-casing tokens, and applying different tokenizers.
- After four accuracy evaluations, we obtained the results below.
Model | Iterations | Words evaluated | Structure | Cleaning | Accuracy |
---|---|---|---|---|---|
IBM Model 1 | 20 | 50 | 1 | No | 40.00% |
IBM Model 1 | 20 | 50 | 1 | Yes | 32.00% |
IBM Model 1 | 20 | 20 | 2 | Yes | 25.00% |
IBM Model 2 | 20 | 50 | 1 | No | 32.00% |
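The cleaning-then-training pipeline described above can be sketched in pure Python. The toy English–German bitext and the `clean` helper below are illustrative assumptions, not our actual data or code; the EM loop is the standard IBM Model 1 estimation of translation probabilities t(f|e):

```python
from collections import defaultdict

PUNCT = ".,;:!?\"'()«»"

def clean(sentence):
    """The 'cleaning' step: strip punctuation marks and lower-case tokens."""
    tokens = (t.strip(PUNCT).lower() for t in sentence.split())
    return [t for t in tokens if t]

def ibm_model1(bitext, iterations=20):
    """EM training of IBM Model 1 translation probabilities t(f|e).

    bitext is a list of (source_tokens, target_tokens) pairs.
    """
    src_vocab = {e for es, _ in bitext for e in es}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # normalisers per source word e
        for es, fs in bitext:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normalise over source words
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():  # M-step: re-normalise
            t[(f, e)] = c / total[e]
    return t

bitext = [(clean("the house."), clean("das Haus.")),
          (clean("the book,"), clean("das Buch,")),
          (clean("a book"), clean("ein Buch"))]
t = ibm_model1(bitext, iterations=20)
# after training, "haus" is explained by "house" rather than by "the"
```

The same `t` table is what an alignment step would consult: for each target word, pick the source word maximising t(f|e).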
- Papers to read
- PhD thesis Page 87 (P102 in pdf) - Page 88 (P103 in pdf), Evaluation metrics and Evaluation corpora
- The BAF: A Corpus of English-French Bitext
- Revisiting sentence alignment algorithms for alignment visualization and evaluation
- Exploring Translation Corpora with MkAlign
- Website to check
- FarkasTranslations.com: multi-parallel corpora of publicly available books.
- The online view of The Adventures of Tom Sawyer can serve as a baseline for our tool. (We are going to build something more than that, for sure.)
- Two-step crowd sourcing plan
- Why do we need paragraph alignments of literary texts?
- They can be used as a gold alignment for evaluating sentence/paragraph alignment algorithms on literary texts. Manual alignments constitute the most reliable references, but they are quite rare.
- They simplify the generation of evaluation corpora for literary alignment.
- Previous gold references only specify alignment links; they cannot be used to evaluate the similarity between aligned paragraphs, which is important for literary translation studies.
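As a concrete illustration of using a manual gold alignment for evaluation, predicted alignment links can be scored with precision, recall, and F1 against the reference. The link pairs below are made up for the example; real links would come from an automatic aligner and our annotated gold data:

```python
def alignment_scores(predicted, gold):
    """Precision, recall and F1 of predicted alignment links against a
    gold reference. Links are (source_para_index, target_para_index) pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # links the aligner got right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 0), (1, 1), (2, 2), (3, 4)]  # manual reference links
pred = [(0, 0), (1, 1), (2, 3), (3, 4)]  # hypothetical aligner output
p, r, f = alignment_scores(pred, gold)   # 0.75, 0.75, 0.75: 3 of 4 links match
```

Because gold links are rare for literary texts (the point above), building them via crowdsourcing is what makes this kind of evaluation feasible at scale.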
- How?
- Check our two-step crowd sourcing design draft.
- 0/1/2 score standards in the second step: 2 means aligned (insofar as possible; sometimes a paragraph is truncated, but the missing material is not in a different paragraph), 1 means partially aligned (including cases where an English paragraph is split across multiple paragraphs), and 0 means the paragraph was not translated.
- A manually annotated example is available here.
- Paragraph summarization
- Possible submission
- Aggregating and analysing crowdsourced annotations for NLP AnnoNLP workshop@EMNLP-IJCNLP 2019
- Submission deadline: Sep 2, 2019
- Starting date: 17 April
- Project: Translation dashboard prototype development
- Design draft
- Online demo
- Check translation-dashboard folder for more information about the development
- Starting date: 22 April
- Languages:
- Basque
- Finno-Ugric: Finnish & Hungarian
- Digging into alignment algorithms
- Results analysis by domain experts
- Crowdsourcing test using the Translation dashboard
- Special Issue "Computational Linguistics for Low-Resource Languages"
- Deadline for manuscript submissions: 15 July 2019
- Tentative Title: Mapping the Circulation of Literary Writings through Aligned Translations: The example of Slavic and Finno-Ugric Translations of Adventures of Huckleberry Finn.
- Abstract: Because translated texts have been regarded as unreliable due to suspicions of bias and untrustworthiness, they have so far been an overlooked resource in the field of NLP. But localizing, digitizing, and aligning translations of well-travelled famous novels can provide a fruitful basis for developing digitized linguistic material in under-resourced languages. In this paper we focus on translations of Mark Twain's Adventures of Huckleberry Finn into a set of Slavic and Finno-Ugric languages, in order to map the circulation of ideas and writings and to build digitized linguistic material with a view to helping preserve the diversity of languages and cultures.