Extra spaces added to raw property #3

danrubins · 2015-07-03T06:30:29Z

Hey Xav,
In the object that gets returned, the raw text seems to have some extra spaces between tokens. For example, I would expect that running compendium.analyse('My name is Dr. Jekyll.'); would return the original text as the 'raw' property, as follows:

[ { time: 9,                        // Time of processing, in ms
    length: 6,                      // Count of tokens
    raw: 'My name is Dr. Jekyll.', // Raw string
    stats: ...

However, it actually returns the following:

[ { time: 9,                        // Time of processing, in ms
    length: 6,                      // Count of tokens
    raw: 'My name is Dr. Jekyll .', // Raw string
    stats: ...

It's a bit more pronounced with extra punctuation, since those are tokenized separately:

compendium.analyse('Today is 4/2/2015, or 2/4/2015- depending on where in the world you live!');

[ { time: 6,                        // Time of processing, in ms
    length: 23,                     // Count of tokens
    raw: 'Today is 4 / 2 / 2015 , or 2 / 4 / 2015- depending on where in the world you live !', // Raw string
    stats: ...

It's a minor issue, but it can cause some presentation weirdness.

A quick look at the code makes me think it's happening on line 41 of detector.s.1.entities.js, but that's only from a first glance.

BTW, great work on this package!

Ulflander · 2015-07-06T17:43:12Z

Hi Dan!

Thanks for the report, and sorry for the late reply.

The issue was caused by an incomplete implementation: raw was in fact an inaccurate reconstruction of the original string. I plan to implement a proper reconstruction later on, since the tokens we get can sometimes differ from those in the original string (e.g. 2day to today).

In the meantime, release v0.0.20 solves the issue by providing the real original string of the sentence in the raw field.
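
For readers following along, here is a minimal sketch of why a string rebuilt from tokens does not round-trip, and why keeping the original string sidesteps the problem. The tokenize and analyseSketch functions are hypothetical and exist only for illustration; they are not compendium's lexer or API, and the regex is an assumption that handles just this one example.

```javascript
// Hypothetical tokenizer for illustration only (not compendium's lexer).
// It keeps "Dr." as one token and splits other punctuation from words.
function tokenize(text) {
  return text.match(/Dr\.|\w+|[^\w\s]/g) || [];
}

// Hypothetical analysis result showing both ways of producing the string.
function analyseSketch(text) {
  const tokens = tokenize(text);
  return {
    length: tokens.length,
    // Joining tokens with a space loses the original spacing:
    // punctuation that touched a word is now separated from it.
    rebuilt: tokens.join(' '),
    // Keeping the input untouched preserves it exactly.
    raw: text
  };
}

const result = analyseSketch('My name is Dr. Jekyll.');
console.log(result.rebuilt); // 'My name is Dr. Jekyll .'
console.log(result.raw);     // 'My name is Dr. Jekyll.'
```

The space-joined string cannot recover where whitespace originally was, which is why storing the input string is the simpler fix until tokens carry their own spacing information.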

Since it looks like you're really using the library, I wanted to let you know that I'm working on supporting other languages; French will be the second one after English. This release scaffolds some code for that, so I hope you won't run into any surprises. All tests are passing, but who knows?

Let me know if you have any more feedback or suggestions!

Cheers!

danrubins · 2015-07-06T18:02:00Z

Thanks! I'll certainly keep an eye out for any issues and PR when I can.

Ulflander closed this as completed in 393f520 Jul 6, 2015
