Some bugs in de, es and fr #228

Oktai15 · 2024-09-10T07:43:37Z

Hi!

I use the latest NeMo release: 1.1.0. I found the following bugs.

Bugs

German (de):

1. text: Here is brettspielversand.de.
   norm_text: Here is b r e t t s p i e l v e r s a n d punkt de.
   expected output: Here is brettspielversand punkt de.
2. text: Sinnesbereichen.in allen Sinnen.
   norm_text: S i n n e s b e r e i c h e n punkt in allen Sinnen.
   expected output: Sinnesbereichen punkt in allen Sinnen.
3. text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
   norm_text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
   expected output: Hier zoome ich auf die Läsion. Wir befinden uns also auf der Zwei-D-Mammographie. (not sure)

For German normalization, I use the following code:

from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
  input_case="cased",
  lang="de",
  deterministic=True,
)

norm_text = normalizer.normalize(text, punct_post_process=True)

Spanish (es):

text: El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.
norm_text: El texto de quincuagésimo primero Qin en este libro ahora está disponible en forma de libro electrónico.
expected output: El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico. (not sure)

For Spanish normalization, I use the following code:

from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
  input_case="cased",
  lang="es",
  deterministic=True,
)

norm_text = normalizer.normalize(text, punct_post_process=True)

French (fr):

text: Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.
norm_text: Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.
expected output: Les Tech Clippings seront diffusés en exclusivité sur la chaîne YouTube DIGITIMES tous les vendredis à 20 heures. (not sure)

For French normalization, I use the following code:

from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
  input_case="cased",
  lang="fr",
  deterministic=True,
)

norm_text = normalizer.normalize(text, punct_post_process=True)
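
The three snippets above differ only in `lang`, so the whole report can be reproduced from one loop. A minimal sketch, assuming the same `Normalizer` arguments as above; the names `CASES`, `make_normalizer`, and `run_cases` are made up for this example and are not part of NeMo:

```python
# Inputs quoted from this report, grouped by language code.
CASES = {
    "de": [
        "Here is brettspielversand.de.",
        "Sinnesbereichen.in allen Sinnen.",
        "Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.",
    ],
    "es": [
        "El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.",
    ],
    "fr": [
        "Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.",
    ],
}


def make_normalizer(lang):
    """Build a normalizer with the settings used in this report."""
    # Imported lazily so the case table can be inspected without NeMo installed.
    from nemo_text_processing.text_normalization.normalize import Normalizer

    return Normalizer(input_case="cased", lang=lang, deterministic=True)


def run_cases(cases=CASES, factory=make_normalizer):
    """Yield (lang, text, norm_text) for every reported input."""
    for lang, texts in cases.items():
        normalizer = factory(lang)  # one normalizer per language
        for text in texts:
            yield lang, text, normalizer.normalize(text, punct_post_process=True)
```

With NeMo installed, iterating over `run_cases()` and printing each tuple should give the `norm_text` values listed above for release 1.1.0. The `factory` argument exists only so the loop can be exercised with a stand-in object.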


Oktai15 · 2024-09-10T07:44:05Z

@ekmb

zoobereq · 2024-09-10T19:43:21Z

German:

1. Will address.
2. The model expects canonical punctuation, which here means whitespace after a sentence-final period. Without it, the string will likely be transduced as a URL (hence the spacing between individual characters in the output above). Since the input contains non-standard punctuation, the output is expected behavior.
3. This is a known issue. Will address.
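
To illustrate point 2, here is a rough sketch of why both inputs look like URLs to a pattern-based matcher. It is not NeMo's grammar: the regex, the `SAMPLE_TLDS` set, and the `url_like_tokens` helper are assumptions made up for this example.

```python
import re

# Hypothetical, tiny TLD sample for illustration only.
SAMPLE_TLDS = {"de", "in", "com", "org"}

# "<label>.<suffix>" with no whitespace on either side of the dot.
URL_LIKE = re.compile(r"\b([A-Za-zÄÖÜäöüß0-9-]+)\.([A-Za-z]{2,})\b")


def url_like_tokens(text, tlds=SAMPLE_TLDS):
    """Return (label, tld) pairs that this naive matcher would treat as URLs."""
    return [
        (m.group(1), m.group(2))
        for m in URL_LIKE.finditer(text)
        if m.group(2).lower() in tlds
    ]


print(url_like_tokens("Here is brettspielversand.de."))
# [('brettspielversand', 'de')]
print(url_like_tokens("Sinnesbereichen.in allen Sinnen."))
# [('Sinnesbereichen', 'in')]
print(url_like_tokens("Sinnesbereichen. In allen Sinnen."))
# []
```

Because "in" happens to be a valid TLD, the second input is indistinguishable from a domain name at this level; only the whitespace after the period separates the two readings.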

Spanish:

This was addressed with PR #224, which didn't make it to the current release.

French:

The MEASURE semiotic class is not implemented for French TN (it is present in ITN). Will address.

* Implements the fix
* Expands the list to TLDs with over 1000 registrations as per Google's registry 06/2020
* Updates the TLD mappings and tests
* Updates the cache
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
Co-authored-by: Simon Zuberek <szuberek@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

zoobereq · 2024-10-17T21:46:28Z

A fix for German (1.) and (2.) has been implemented.

Oktai15 added the bug label Sep 10, 2024

zoobereq mentioned this issue Oct 4, 2024: Fixes issue 228 #234 (Open)

zoobereq mentioned this issue Oct 14, 2024: DE TN Fix for Issue #228 #237 (Merged)
