-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[char/range] Add CharRange and CharIter #112
Conversation
I have no idea why I am so slow.
@behnam The actual implementation is ready for review. It's just the supporting chaff that I need to copy over -- the (half-)open constructors and the macro, along with test suite. |
(Oof, reverse takes three times as long on my machine (wheras the FilterMap is actually faster! probably branch prediction). I'll have to see if I missed an #[inline] or if it's something else.)
chars!(..=) is const.
Precomputing SURROGATE_RANGE.len() proved to make no difference to the 2ns (+/- 0ns) execution time of this method.
Table generation should probably use the Reverse iteration is currently between 3 and 4 times slower than forward iteration. It must be slightly too complicated to inline in LLVM's eyes, since I did mark it |
bors: try |
Build succeeded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks great, @CAD97!
One design question though: I guess there are good reasons behind the std lib not having Range*
object be an iterator itself. I think some of those reasons would apply here, as well. Let's take a moment and think about these.
One thing that comes to my mind is the consumption in a for
loop. From the looks of this, I think this would become a problem:
let chars = chars!('a' ..= 'z');
for ch in chars {
println!("{}", ch);
}
println!("'A'? {}", chars.contain('A'));
because here chars
gets consumed by the for loop.
Also, that code doesn't work because chars
would need to be mut
.
So, because of this, and possibly other reasons, I think we should have CharIter
, which actually is just a CharRange
with some *Iterator*
implementations, and keep the Range
not being an iterator itself.
As you noticed in #111, there was a lot of duplication between the two classes. I think it's great to only have one implementation of those, like stepping forward/backward, but I think it's also important to keep range and iterator separate concepts, and just reuse the code by the newtype pattern and #inline
methods.
What do you think?
unic/char/range/src/range.rs
Outdated
} | ||
|
||
/// Construct a range of characters from bounds. | ||
pub fn bound(mut start: Bound<char>, mut stop: Bound<char>) -> CharRange { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very good way to keep a generic API here. ❤️
But, why do you expect the input to be mut
? I think what you need is to take a readonly bound, and construct a value inside:
let start = if start == Bound::Unbounded { start = Bound::Included('\0') } else { start };
I guess this is the kind of issues we usually catch with unit tests. Want to add one for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of note: you can pass an imutable owned value here, because it's taken by (copied) value.
But this formulation makes more sense than what I have. 👍
But on your argument towards separating #[derive(Copy, Clone, Debug, Eq, PartialEq)]
struct R {
low: usize,
high: usize,
}
impl R {
fn contains(&self, item: usize) -> bool { self.low <= item && item <= self.high }
}
impl Iterator for R {
type Item = usize;
fn next(&mut self) -> Option<usize> {
if self.low > self.high { None } else {
let ret = self.low;
self.low += 1;
Some(ret)
}
}
}
fn main() {
let range = R { low: 0, high: 20 };
println!("{}", range.contains(10));
for int in range { println!("{}", int); }
println!("{}", range.contains(10));
} This works. This works because You can make an argument that the actual range object and the actual iteration are separate concerns, and should be handled by different structures. If I were designing a stdlib from scratch, I think I'd go the way of having a separate |
I did a bit of research on why rust-lang/rust#18045 rust-lang/rust#27186 rust-lang/rust#21846 It looks like rust-lang/rust#27186 (comment)
rust-lang/rust#18045 (comment)
This is a footgun, and the reason the stdlib chose to make Note, having the range be directly iterable enables For our implementation, this also has the benefit of allowing us to fully expose the internals of the This research into why |
reverse iteration is fast now? I guess I'm no longer passing the inline depth limit.
This for some reason sped up reverse iteration and slowed down forward. I have an idea of why reverse was sped up (one less fn deep), and some things to try to get forward back below 1ms. |
This makes it take ~0.98ms (on my machine) to iterate all forwards as opposed to ~1.2ms (on my machine). Forward and Reverse iteration are now within margin of error of each other, which means (hopefully) neither are getting an optimization which is being denied to the other. I still have no concrete idea why the reverse->filter_map is ~20% faster than just the filter_map. My only guess is branch prediction, which means it would be very CPU dependent.
(See 7c0c5dc for a bit more on this last speedup.) With that we're now consistently down below 1ms to iterate forward or backward accross the entire unicode range (on my machine) (with margin of error reaching slightly above). This is actually slightly better than the Forward and reverse iteration are within margin of error of each other in performance. I think this is a good indicator that neither path is favored over the other. (Small note; if you just want to iterate over all characters it seems I'm happy with this. I'm ready to merge when you are @behnam. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, my bad about the std lib assumption. I should have double-checked before referring.
I think I agree with your reasoning here. Also, there's the fact that we can always expand CharRange
to impl *Iterator
, so, we're not missing anything by being cautious.
Personally, I like the fact the range objects won't be getting manipulated all the time just because we want to save one .iter()
now and then.
So, the design looks good, and the code is good. Just one corner case needs attention, I think.
unic/char/range/src/range.rs
Outdated
/// | ||
/// ``` | ||
/// # #[macro_use] extern crate unic_char_range; | ||
/// # use unic_char_range::*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to not hide the extern crate
and use
lines from the doc code blocks. (One empty line between them, plus an empty line before the actual code.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I'll include the lines.
unic/char/range/src/range.rs
Outdated
/// # fn main() { | ||
/// assert_eq!(chars!('a'..='z'), CharRange::closed('a', 'z')); | ||
/// assert_eq!(chars!('a'..'z'), CharRange::open_right('a', 'z')); | ||
/// assert_eq!(chars!(..), CharRange::all()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason ❤️ doesn't render on my machine so this looks like an empty comment 🙄
Since we have CharRange::bound
I just realized we could also add chars!(..'z')
and chars!('a'..)
. I don't think those would be of much use though.
unic/char/range/src/range.rs
Outdated
/// assert_eq!(chars!(..), CharRange::all()); | ||
/// # } | ||
/// ``` | ||
#[derive(Copy, Clone, Debug, Eq, PartialEq)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think derived PartialEq
doesn't work for us here, because any case with high < low
is considered empty for us, but eq()
would return false depending on the numbers.
For rustc RangeInclusive
, there's been a talk about having an enum
for the type with an Empty
variant. I don't think we want that here, but need to well-define emptiness.
Also a reminder that CharRange
also needs an is_empty()
, to not enforce .len() == 0
to call-sites.
Another matter with empty case is the open question of if either of low
and high
matter in equality. IMHO, it doesn't matter, and we can have this:
fn eq(&self, other: Self) -> bool {
if self.is_empty() {
other.is_empty()
} else {
self.low == other.low && self.high == other.high
}
}
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually did a bit of reading on RangeInclusive
while making this and I think the current design that's been landed on is having just two fields--a start and a stop--and creating Step
fn to make a known-ended state.
And yes, it makes sense that two empty ranges no matter their internal state should compare the same.
impl CharRange { | ||
/// Construct a closed range of characters. | ||
/// | ||
/// If `stop` is ordered before `start`, the resulting range will be empty. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this line can be moved to the type doc, with addition of eq()
behavior. (And we can drop it from constructors.) (From d11n perspective, since the fields are pub
, there shouldn't be a need to read the constructor docs to understand the interpretation of the fields.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Thanks, @CAD97!
bors: r+ |
112: [char/range] Add CharRange and CharIter r=behnam The first half of adressing #91. Closes #111, this manner of attack is better than it. This PR only has one type, `CharRange`. It is effectively `std::ops::RangeInclusive` translated to characters. The matter of construction is handled by both half-open and closed constructors offered and a macro to allow for `'a'..='z'` syntax.
Build succeeded |
The first half of adressing #91.
Closes #111, this manner of attack is better than it.
This PR only has one type,
CharRange
. It is effectivelystd::ops::RangeInclusive
translated to characters. The matter of construction is handled by both half-open and closed constructors offered and a macro to allow for'a'..='z'
syntax.