From a41e83f31edac18791aabb1396200d161eda1709 Mon Sep 17 00:00:00 2001
From: Michael Kohler
Date: Sat, 17 Sep 2022 12:31:52 +0200
Subject: [PATCH] chore: add documentation for overall flow in README (fixes #636) (#637)

* chore: add documentation for overall flow in README (fixes #636)
* chore: add small note about how to edit diagram
* chore: adjust documentation based on suggestions
---
 README.md     | 27 +++++++++++++++++++++++++--
 docs/flow.svg |  4 ++++
 2 files changed, 29 insertions(+), 2 deletions(-)
 create mode 100644 docs/flow.svg

diff --git a/README.md b/README.md
index a3b8fabb..d7206c3f 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,29 @@
 # Common Voice Sentence Collector
 
-The [Sentence Collector](https://commonvoice.mozilla.org/sentence-collector/) is part of the [Common Voice](https://commonvoice.mozilla.org/) project. Its purpose is to provide a tool for contributors to upload public domain sentences, which then can get reviewed and are exported to the Common Voice database. Once imported they will show up for contributors on Common Voice to read out aloud.
+The [Sentence Collector](https://commonvoice.mozilla.org/sentence-collector/) is part of the [Common Voice](https://commonvoice.mozilla.org/) project. Its purpose is to provide a tool for contributors to upload public domain sentences, which then can get reviewed and are exported to the Common Voice database. Once imported into the Common Voice website, they will show up for contributors to read out aloud.
+
+For uploads of thousands of sentences, Sentence Collector is not the best tool. Check out the [Bulk Submission](https://github.com/common-voice/common-voice/blob/main/docs/SENTENCES.md#bulk-submission) guidelines for this use case. Another tool is the [Sentence Extractor](https://github.com/Common-Voice/cv-sentence-extractor) which allows automatic extraction of data sources such as Wikipedia.
+
+## Detailed Flow
+
+This explanation only focuses on the Sentence Collector.
+
+![Diagram](docs/flow.svg)
+
+*To edit this diagram, load the `flow.svg` in the docs of the repository into [diagrams.net](https://app.diagrams.net/) and then save the updated version back into the repository like any other file changes you'd make.*
+
+In the diagram above, light blue squares represent Sentence Collector processes. The grey squares are processes outside of the Sentence Collector tooling. The grey processes are the same for other sentence sources, such as bulk submissions and Sentence Extractor. Instead of an automatic export, these use Pull Requests that directly add text files to the [`server/data` folder of the Common Voice website repository](https://github.com/common-voice/common-voice/tree/main/server/data).
+
+1) Contributors gather sentences from public domain sources and (optionally) pre-process and pre-review them. These sentences can be from public domain books, or even self-written. The source does not matter, as long as the sentences are in the public domain. Contributors then upload these sentences through the [Sentence Collector "Add" form](https://commonvoice.mozilla.org/sentence-collector/#/add).
+2) The Sentence Collector validates these sentences based on [rules per language](server/lib/validation/VALIDATION.md) (or the English rule file as default). Any sentence that does not match the validation rules is not processed further and is shown as an error in the Sentence Collector user interface for correction. For example, sentences are not allowed to have numbers in them, such as `2022`.
+3) Any sentence that passes the validation gets written to the Sentence Collector database.
+4) These sentences then get shown on the [Sentence Collector "Review" page](https://commonvoice.mozilla.org/sentence-collector/#/review) for other contributors to review.
+5) Contributors' reviews are saved in the Sentence Collector database. Sentences can be approved or rejected.
If at least 2 out of 3 reviews are positive, the sentence will eventually be exported for Common Voice (see the steps below).
+6) Once a week an automatic process is triggered (GitHub action) to export all approved sentences to the Common Voice repository.
+7) During this export, the [cleanup](https://github.com/common-voice/sentence-collector/blob/main/server/lib/cleanup/CLEANUP.md) scripts are run for each sentence, if configured for a language. This can be used to apply transformations for consistency, such as converting "..." into "…".
+8) The resulting `sentence-collector.txt` file is written to the [language specific folder](https://github.com/common-voice/common-voice/tree/main/server/data) in the Common Voice repository. Note that any change to that file within the Common Voice repository will be overwritten by the next export, as the only source is the Sentence Collector database.
+9) Sentences added to the Common Voice `server/data` folder do not instantly get imported into Common Voice. This means that they are not instantly available for recording on the Common Voice website. The import of new sentences only happens when a new version of the Common Voice website is released. You can find the past releases [here](https://github.com/common-voice/common-voice/releases).
+10) If a certain language is enabled for contribution, the imported sentences will then be shown to contributors to record.
 
 ## Get involved
 
@@ -21,7 +44,7 @@ The [Sentence Collector](https://commonvoice.mozilla.org/sentence-collector/) is
 
 ![Diagram](docs/architecture.svg)
 
-To edit this diagram, load the `architecture.svg` in the docs of the repository into [diagrams.net](https://app.diagrams.net/) and then save the updated version back into the repository like any other file changes you'd make.
+*To edit this diagram, load the `architecture.svg` in the docs of the repository into [diagrams.net](https://app.diagrams.net/) and then save the updated version back into the repository like any other file changes you'd make.*
 
 ## Local Development
 
diff --git a/docs/flow.svg b/docs/flow.svg
new file mode 100644
index 00000000..6c0cf6df
--- /dev/null
+++ b/docs/flow.svg
@@ -0,0 +1,4 @@
+
+
+
+
[docs/flow.svg markup not shown. Diagram labels: 1) Sentences; Contributor uploads through Sentence Collector's "Add" form; 2) SC automatic validation; 3) Write to database; 4) Sentences are shown for review; Contributors review sentences for correctness on Sentence Collector's "Review" form; 5) Reviews are saved in database; 6) Automatic export of approved sentences (weekly GitHub action); 7) Cleanup script runs for each sentence (if configured); Sentences saved in sentence-collector.txt file; 8) Common Voice repository (server/data folder); 9) Import into Common Voice website when new version of website is released; 10) If language is enabled on Common Voice, sentences get shown to contributor to record]
\ No newline at end of file
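
As a rough illustration of steps 2, 5, 6 and 7 of the flow documented in the README hunk above, the following Python sketch models the digit rule, the "at least 2 out of 3 positive reviews" threshold, and the "..." to "…" cleanup. Every function name here is hypothetical and not taken from the Sentence Collector code base. The real validation and cleanup rules are configured per language and cover more than these single examples, and the sketch assumes each review is stored as a simple boolean.

```python
from __future__ import annotations

import re

# All names below are hypothetical; they are not taken from the
# Sentence Collector code base.


def passes_validation(sentence: str) -> bool:
    """Step 2, one example rule only: sentences must not contain digits."""
    return re.search(r"\d", sentence) is None


def is_approved(reviews: list[bool]) -> bool:
    """Step 5: approved when at least 2 of up to 3 reviews are positive."""
    return sum(reviews) >= 2


def cleanup(sentence: str) -> str:
    """Step 7, one example transformation: three dots become an ellipsis."""
    return sentence.replace("...", "…")


def export_approved(reviewed: dict[str, list[bool]]) -> list[str]:
    """Steps 6 and 7: collect approved sentences and clean each one up."""
    return [cleanup(s) for s, reviews in reviewed.items() if is_approved(reviews)]


if __name__ == "__main__":
    uploads = ["Wait for me...", "It happened in 2022."]
    # Only sentences that pass validation reach the (here simulated) database.
    stored = [s for s in uploads if passes_validation(s)]
    reviewed = {s: [True, True, False] for s in stored}
    print(export_approved(reviewed))
```

Run as a script, this drops the sentence containing `2022` at validation time and exports the remaining sentence with its trailing dots replaced by a single ellipsis character.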