Extra spaces added to raw property #3

Closed
danrubins opened this issue Jul 3, 2015 · 2 comments

@danrubins

Hey Xav,
In the object that gets returned, the raw text seems to have some extra spaces between tokens. For example, I would expect that running compendium.analyse('My name is Dr. Jekyll.'); would return the original text as the 'raw' property, as follows:

[ { time: 9,                        // Time of processing, in ms
    length: 6,                      // Count of tokens
    raw: 'My name is Dr. Jekyll.', // Raw string
    stats: ...

However, it actually returns the following:

[ { time: 9,                        // Time of processing, in ms
    length: 6,                      // Count of tokens
    raw: 'My name is Dr. Jekyll .', // Raw string
    stats: ...

It's a bit more pronounced with extra punctuation, since those are tokenized separately:

compendium.analyse('Today is 4/2/2015, or 2/4/2015- depending on where in the world you live!');

[ { time: 6,                        // Time of processing, in ms
    length: 23,                     // Count of tokens
    raw: 'Today is 4 / 2 / 2015 , or 2 / 4 / 2015- depending on where in the world you live !', // Raw string
    stats: ...

It's a minor issue, but it can cause some presentation weirdness.

A quick look at the code makes me think it's happening on line 41 of detector.s.1.entities.js but that's only a first glance.

BTW, great work on this package!

@Ulflander
Owner

Hi Dan!

Thanks for the report, and sorry for the late reply.

The issue was caused by an incomplete implementation: raw was in fact an inaccurate reconstruction of the original string, rebuilt from the tokens. I plan to implement a proper reconstruction later on, since the tokens can sometimes differ from the original string (e.g. 2day is normalized to today).

In the meantime, release v0.0.20 solves the issue by providing the real original string of the sentence in the raw field.
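
To illustrate the difference, here is a minimal sketch of the two behaviours. It is not compendium's actual code: the `tokenize`, `analyseNaive` and `analyseKeepOriginal` functions are hypothetical, and the regex tokenizer is a simplifying assumption (for instance, it does not keep abbreviations such as "Dr." together the way the library appears to).

```javascript
// Hypothetical tokenizer, NOT compendium's real one: each run of word
// characters is a token, and each punctuation mark is its own token.
function tokenize(text) {
  return text.match(/\w+|[^\w\s]/g) || [];
}

// Rebuilding `raw` by joining tokens with spaces loses the original
// spacing, which is the kind of output reported in this issue.
function analyseNaive(text) {
  const tokens = tokenize(text);
  return { length: tokens.length, raw: tokens.join(' ') };
}

// Keeping the untouched input string as `raw` avoids the problem.
function analyseKeepOriginal(text) {
  const tokens = tokenize(text);
  return { length: tokens.length, raw: text };
}

const input = 'Hello, world!';
console.log(analyseNaive(input).raw);        // 'Hello , world !'
console.log(analyseKeepOriginal(input).raw); // 'Hello, world!'
```

Both variants report the same token count; only the `raw` field differs.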

Since it looks like you're actively using the library, I wanted to let you know that I'm working on support for other languages; the second one after English will be French. This release scaffolds some code for that, so I hope you won't run into any surprises. All tests are passing, but who knows?

Let me know if you have any more feedback or suggestions!

Cheers!

@danrubins
Author

Thanks! I'll certainly keep an eye out for any issues and PR when I can.
