Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode sentence boundaries #24

Merged
merged 4 commits into from
May 15, 2019
Merged

Conversation

tomcumming
Copy link
Contributor

@tomcumming tomcumming commented May 6, 2017

This is an implementation of the sentence breaks specification, including changes to the python files to grab the sentence break test data.

I welcome any advice for improving.

@tomcumming tomcumming changed the title Code review please Code review please (Unicode sentence boundary partial implementation) May 6, 2017
@tomcumming tomcumming changed the title Code review please (Unicode sentence boundary partial implementation) Unicode sentence boundaries May 16, 2017
@Manishearth
Copy link
Member

Sorry for letting this stagnate! I'm rather busy to review this right now, but will try to get to it soon!

Sentence boundaries is something I've wanted implemented here for a while 😄

@Manishearth
Copy link
Member

(still no time to look at this, apologies. Really hope to get to it soon)

@rth rth mentioned this pull request May 3, 2019
@rth
Copy link
Contributor

rth commented May 6, 2019

This PR looks quite good and it would be great to have this functionality. Could I help in any way ?

@tomcumming
Copy link
Contributor Author

@rth I can split this out into another crate if required?

@rth
Copy link
Contributor

rth commented May 8, 2019

@tomcumming I would really like to use this implementation (and compare it with other sentences splitting approaches) in rth/vtext#51 . Having this implementation in the unicode-segmentation crate would be ideal, but if it is unlikely to be reviewed in the near future, maybe putting it in some other crate could be a workaround.

Any chance @Manishearth that you would have some review bandwidth for this, or could suggest someone who could review it?

@Manishearth
Copy link
Member

@rth mind doing a review yourself as well? I can also try and review, but I don't think I'd be able to give this a proper thorough review and would feel more comfortable if more people have gone through it.

@rth
Copy link
Contributor

rth commented May 9, 2019

Sure, I'll try to review it in the next few days.

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks correct! Mostly want more documentation.

src/lib.rs Show resolved Hide resolved
src/sentence.rs Show resolved Hide resolved
src/sentence.rs Show resolved Hide resolved
src/sentence.rs Show resolved Hide resolved
src/sentence.rs Show resolved Hide resolved
src/sentence.rs Outdated Show resolved Hide resolved
src/sentence.rs Show resolved Hide resolved
src/sentence.rs Show resolved Hide resolved
@tomcumming
Copy link
Contributor Author

@Manishearth @rth I have updated the PR including requested changes

@Manishearth
Copy link
Member

Looks good! @rth want to do a second review?

Copy link
Contributor

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the review @Manishearth !

I went through the code in more detail, I find it quite readable and I don't really have anything to add. (Though I am fairly new to rust and don't know that much about Unicode segmentation specs).

I can confirm that src/tables.rs and src/testdata.rs in this PR can be re-generated in their current state with the included python scripts, but they require setting,
scripts/unicode.py

-        os.system("curl -O http://www.unicode.org/Public/UNIDATA/%s"
+        os.system("curl -O http://www.unicode.org/Public/9.0.0/ucd/%s"

as otherwise data for latest Unicode 12.0 is downloaded.

src/lib.rs Show resolved Hide resolved
src/sentence.rs Show resolved Hide resolved
@tomcumming
Copy link
Contributor Author

Fixing the URL for test data should probably be another PR

@rth
Copy link
Contributor

rth commented May 15, 2019

Fixing the URL for test data should probably be another PR

Yes, I'll do it.

Thanks @tomcumming I don't have any other comments.

@Manishearth Manishearth merged commit c7a6b6f into unicode-rs:master May 15, 2019
@Manishearth
Copy link
Member

Thank you! I'll push a release soonish

@Manishearth
Copy link
Member

Published 1.3.0. Thanks for the work on this, and sorry for the delay in reviewing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants