Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement CharRange type #91

Closed
CAD97 opened this issue Aug 9, 2017 · 14 comments
Closed

Implement CharRange type #91

CAD97 opened this issue Aug 9, 2017 · 14 comments
Assignees
Labels
C: utils Utilities enhancement Enhancements to existing features

Comments

@CAD97
Copy link
Collaborator

CAD97 commented Aug 9, 2017

[assign: @CAD97]

Replace usages of (char, char) as an inclusive range with a custom Range struct adapted for use on characters.

@behnam behnam changed the title Enhancement: CharRange Implement CharRange type Aug 9, 2017
@behnam behnam added enhancement Enhancements to existing features C: utils Utilities labels Aug 9, 2017
@behnam
Copy link
Member

behnam commented Aug 9, 2017

Filed #93 as a follow up to this, for CharProperty API accepting CharRange as input.

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 10, 2017

I found char-iter on crates.io after a little digging, but that's only an iterator and not an actual range object. A cursory search found only better-range that actually provided a range over characters, and that crate doesn't have documentation or source links on crates.io.

(Also, both crates are two years old, but then again character iteration isn't going to change.)

In conclusion, adding a CharRange type will not duplicate work already done in the ecosystem.

Because a character range is such a self-contained interest (and I'm currently working on a project that would also enjoy having one), I'm considering making it its own crate. What do you think @behnam?

@behnam
Copy link
Member

behnam commented Aug 10, 2017

Thanks for the research, @CAD97. Very helpful.

Yeah, now thinking about it again, looks like things like char_property and char_range can be standalone crates, and not just parts of unic-utils, so they can be easily everywhere.

That said, I like the idea of keeping all these things under one repository, because it allows faster development and one-step reviews, instead of going back and forth between repos. The main problem in this area is the fact that we don't have CI/devops tools to test cross-repo PRs before merging. Hopefully this will change in a near future, but for now, it's too much cost to maintain small repos. This is why there are many multi-package repositories (like https://github.com/BurntSushi/ripgrep/), which host many small packages internally, so they can be developed with all the cross-package tests.

I think eventually we can split out these kinds of packages into their own repo. But for the current stage, which is fast development and not maintaining a stable API yet, I think we better keep them all in.

About organization in UNIC, we can actually put these components under /unic/char/range/ and /unic/char/property/, and name them unic-char-range and unic-char-property. This will keep names short, but still show which repository they come from, and keep them easy to find.

What do you think?

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 10, 2017

That seems like a good path forward to me!

@behnam
Copy link
Member

behnam commented Aug 12, 2017

I think we can generally improve Range<char> in rustc, so we can drop this class later. But, for now, we need the functionality and this model is the best abstraction, and easier to replace with built-in libs later.

Some ideas for now...

We want CharRange to implement std::ops::Range, and we would need a CharRangeInclusive to implement std::ops::RangeInclusive.

Both will be implementing std::iter::ExactSizeIterator and DoubleEndedIterator.

Also, char doesn't satisfy trait std::iter::Step, which makes it harder to support for loops over CharRange directly. We may be able to fix this upstream. I'll file an issue.

What do you think?

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

ops::Range is a struct, so for now we need to just replace it. Hopefully, as time goes on, we can fold this into ops::Range<char>. For now I've made two structs: CharRange and CharIter. The range object is effectively a fake ops::Range<char> except instead of implementing Iterator it implements IntoIterator<CharIter>. CharIter is effectively a slightly modified version of the iterator I originally proposed in #85.

I'm just putting some polish onto it right now and then I'll send the PR.

@behnam
Copy link
Member

behnam commented Aug 12, 2017

Right, right, not traits to implement, but types to stay similar to.

How about CharRangeInclusive? Don't we need them for those data table records starting/ending at the edges of char boundaries?

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

I've just done an inclusive range. The reasoning is that that's how character ranges are traditionally described, such as the regex set [a-z], which includes both a and z in the set.

If we really want an exclusive range it can be an alternate constructor.

@behnam
Copy link
Member

behnam commented Aug 12, 2017

Well, I don't really care much about the exclusive range, but I want us to stay as close to rustc as possible in semantics of the names. If we only have an inclusive range, I say we should call it CharRangeInclusive.

For easy of use of the shorter syntax, we can have a chars!(start, end) macro, that returns the inclusive range. This is short, easy to ready, and not breaking consistency with rustc namings.

What do you think?

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

I've submitted the PR so you can see the code; it checks out but I'm currently expanding docs and writing tests. I wouldn't be unhappy actually making CharRange and CharRangeInclusive. The chars! macro idea is a good one; we can even enable for c in chars!('a'..='z') with the macro!

@behnam
Copy link
Member

behnam commented Aug 12, 2017

Now that we want to be fancy, we can even support the U+xxxx syntax in the macro, so we can have this:

for c in chars!(U+0061..=U+FFFF)

😄

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

I'm pretty sure that if we want to do that, we'll need a procedural macro. There's no way within macro_rules! hygiene to convert U+0061 to '\u{0061}'.

@behnam
Copy link
Member

behnam commented Aug 12, 2017

Yeah, that was mostly a joke. But, if we ever end up writing a procedural macro for building transformation pipelines, we can consider supporting the U+ format.

bors bot added a commit that referenced this issue Aug 13, 2017
112: [char/range] Add CharRange and CharIter r=behnam

The first half of adressing #91.
Closes #111, this manner of attack is better than it.

This PR only has one type, `CharRange`. It is effectively `std::ops::RangeInclusive` translated to characters. The matter of construction is handled by both half-open and closed constructors offered and a macro to allow for `'a'..='z'` syntax.
@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 13, 2017

Closed by #112

@CAD97 CAD97 closed this as completed Aug 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: utils Utilities enhancement Enhancements to existing features
Projects
None yet
Development

No branches or pull requests

2 participants