Some bugs in de, es and fr #228

Open
Oktai15 opened this issue Sep 10, 2024 · 3 comments
Labels
bug Something isn't working

Oktai15 commented Sep 10, 2024

Hi!

I'm using the latest NeMo release, 1.1.0, and found the following bugs.

Bugs

German (de):

  1. text: Here is brettspielversand.de.
    norm_text: Here is b r e t t s p i e l v e r s a n d punkt de.
    expected output: Here is brettspielversand punkt de.

  2. text: Sinnesbereichen.in allen Sinnen.
    norm_text: S i n n e s b e r e i c h e n punkt in allen Sinnen.
    expected output: Sinnesbereichen punkt in allen Sinnen.

  3. text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
    norm_text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
    expected output: Hier zoome ich auf die Läsion. Wir befinden uns also auf der Zwei-D-Mammographie. (not sure)

For German normalization, I use the following code:

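from nemo_text_processing.text_normalization.normalize import Normalizer

# Deterministic mode: a single normalized output per input.
normalizer = Normalizer(
    input_case="cased",
    lang="de",
    deterministic=True,
)

# Example input taken from German case 1 above.
text = "Here is brettspielversand.de."
norm_text = normalizer.normalize(text, punct_post_process=True)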

Spanish (es):

  1. text: El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.
    norm_text: El texto de quincuagésimo primero Qin en este libro ahora está disponible en forma de libro electrónico.
    expected output: El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico. (not sure)

For Spanish normalization, I use the following code:

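from nemo_text_processing.text_normalization.normalize import Normalizer

# Deterministic mode: a single normalized output per input.
normalizer = Normalizer(
    input_case="cased",
    lang="es",
    deterministic=True,
)

# Example input taken from Spanish case 1 above.
text = "El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico."
norm_text = normalizer.normalize(text, punct_post_process=True)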

French (fr):

  1. text: Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.
    norm_text: Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.
    expected output: Les Tech Clippings seront diffusés en exclusivité sur la chaîne YouTube DIGITIMES tous les vendredis à 20 heures. (not sure)

For French normalization, I use the following code:

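from nemo_text_processing.text_normalization.normalize import Normalizer

# Deterministic mode: a single normalized output per input.
normalizer = Normalizer(
    input_case="cased",
    lang="fr",
    deterministic=True,
)

# Example input taken from French case 1 above.
text = "Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h."
norm_text = normalizer.normalize(text, punct_post_process=True)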
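
The cases reported above could be collected into a small regression check along these lines. This is only a sketch: `Case`, `CASES`, and `failing_cases` are hypothetical names, `normalize_fn` stands in for whatever wraps the per-language `Normalizer`, and the expected strings marked "(not sure)" above are the reporter's guesses rather than confirmed targets.

```python
from typing import Callable, List, NamedTuple


class Case(NamedTuple):
    lang: str
    text: str
    expected: str


# Inputs and expected outputs copied from the report above.
# Some expected values were marked "(not sure)" by the reporter.
CASES = [
    Case("de", "Here is brettspielversand.de.",
         "Here is brettspielversand punkt de."),
    Case("de", "Sinnesbereichen.in allen Sinnen.",
         "Sinnesbereichen punkt in allen Sinnen."),
    Case("de",
         "Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.",
         "Hier zoome ich auf die Läsion. Wir befinden uns also auf der Zwei-D-Mammographie."),
    Case("es",
         "El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.",
         "El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico."),
    Case("fr",
         "Les Tech Clippings seront diffusés en exclusivité sur la chaîne Youtube DIGITIMES tous les vendredis à 20h.",
         "Les Tech Clippings seront diffusés en exclusivité sur la chaîne YouTube DIGITIMES tous les vendredis à 20 heures."),
]


def failing_cases(
    normalize_fn: Callable[[str, str], str],
    cases: List[Case] = CASES,
) -> List[Case]:
    """Return the cases whose normalized text differs from the expected text."""
    return [c for c in cases if normalize_fn(c.lang, c.text) != c.expected]


if __name__ == "__main__":
    # Stand-in that returns the input unchanged, so the sketch runs without NeMo.
    # To test the real library, replace it with a function that looks up a
    # Normalizer for `lang` and calls normalize(text, punct_post_process=True).
    def identity(lang: str, text: str) -> str:
        return text

    for case in failing_cases(identity):
        print(f"[{case.lang}] {case.text!r} -> expected {case.expected!r}")
```

Passing the language alongside the text lets one `normalize_fn` dispatch to a separate normalizer per language.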

Oktai15 added the bug label Sep 10, 2024

Oktai15 (Author) commented Sep 10, 2024

@ekmb

zoobereq (Collaborator) commented Sep 10, 2024

German:

  1. Will address.
  2. The model expects canonical punctuation, which here means a sentence-final period must be followed by whitespace. Without that whitespace, the string is likely to be transduced as a URL, which is why the individual characters are spaced out in the output above. Because the input contains non-standard punctuation, the output is expected behavior.
  3. This is a known issue. Will address.
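
As a rough illustration of German point 2, input could be screened before normalization for tokens that contain a period with no whitespace after it, since those are the ones likely to be read as URLs. This is a sketch only; `dotted_tokens` is a hypothetical helper, not part of nemo_text_processing, and it cannot tell a real domain from a missing space.

```python
import re
from typing import List

# A run of word characters, then one or more ".word" groups with no whitespace.
# This also matches things like decimal numbers, so treat hits as candidates
# for review rather than confirmed errors.
_DOTTED_TOKEN = re.compile(r"\w+(?:\.\w+)+")


def dotted_tokens(text: str) -> List[str]:
    """Return tokens where a period is immediately followed by a word character."""
    return _DOTTED_TOKEN.findall(text)


if __name__ == "__main__":
    for sample in (
        "Here is brettspielversand.de.",
        "Sinnesbereichen.in allen Sinnen.",
    ):
        print(sample, "->", dotted_tokens(sample))
```

Both German examples from the report are flagged, which shows why the two are ambiguous at the string level: a genuine domain and a missing space look the same.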

Spanish:

  1. This was addressed in PR #224, which didn't make it into the current release.

French:

  1. The MEASURE semiotic class is not implemented for French TN (it is present in ITN). Will address.
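
Until MEASURE is available for French TN, a pre-processing step along these lines could expand the hour abbreviation before the text reaches the normalizer. This is a hypothetical workaround, not the planned fix; `expand_hours` is an assumed name, and the pattern only covers a bare one- or two-digit number followed by `h`.

```python
import re

# One or two digits, an optional space, then a standalone "h".
# Forms such as "20h30" are deliberately left untouched.
_HOUR = re.compile(r"\b(\d{1,2})\s?h\b")


def expand_hours(text: str) -> str:
    """Rewrite '20h' as '20 heures' (singular 'heure' for 0 and 1)."""
    def _repl(match: "re.Match[str]") -> str:
        digits = match.group(1)
        unit = "heure" if int(digits) <= 1 else "heures"
        return f"{digits} {unit}"

    return _HOUR.sub(_repl, text)


if __name__ == "__main__":
    print(expand_hours("tous les vendredis à 20h."))
```

The digits are left in place so that the normalizer can still verbalize the number itself.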

zoobereq mentioned this issue Oct 4, 2024 (14 tasks)
zoobereq mentioned this issue Oct 14, 2024 (14 tasks)
tbartley94 pushed a commit that referenced this issue Oct 17, 2024
* Implements the fix

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* Expands the list to TLDs with over 1000 registrations as per Google's registry 06/2020

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* Updates the TLD mappings and tests

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* Updates the cache

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
Co-authored-by: Simon Zuberek <szuberek@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

zoobereq (Collaborator) commented

A fix for German (1.) and (2.) has been implemented.


2 participants