Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[char/range] Add CharRange and CharIter #112

Merged
merged 21 commits into from
Aug 13, 2017
Merged

[char/range] Add CharRange and CharIter #112

merged 21 commits into from
Aug 13, 2017

Conversation

CAD97
Copy link
Collaborator

@CAD97 CAD97 commented Aug 12, 2017

The first half of adressing #91.
Closes #111, this manner of attack is better than it.

This PR only has one type, CharRange. It is effectively std::ops::RangeInclusive translated to characters. The matter of construction is handled by both half-open and closed constructors offered and a macro to allow for 'a'..='z' syntax.

@CAD97 CAD97 requested a review from behnam August 12, 2017 08:19
@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

@behnam The actual implementation is ready for review. It's just the supporting chaff that I need to copy over -- the (half-)open constructors and the macro, along with test suite.

CAD97 added 3 commits August 12, 2017 04:39
(Oof, reverse takes three times as long on my machine
(wheras the FilterMap is actually faster! probably branch prediction).
I'll have to see if I missed an #[inline] or if it's something else.)
@CAD97 CAD97 mentioned this pull request Aug 12, 2017
@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

Table generation should probably use the chars!( '\u{}' ..= '\u{}' ) syntax. It is a constant time expression, unlike the normal constructors, so can be used in the tables.

Reverse iteration is currently between 3 and 4 times slower than forward iteration. It must be slightly too complicated to inline in LLVM's eyes, since I did mark it #[inline] but it doesn't seem to have benefited in the same manner as forward iteration did. (In any case, forward iteration is the normal situation anyway.)

@CAD97 CAD97 self-assigned this Aug 12, 2017
@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

bors: try

bors bot added a commit that referenced this pull request Aug 12, 2017
@bors
Copy link
Contributor

bors bot commented Aug 12, 2017

Build succeeded

Copy link
Member

@behnam behnam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks great, @CAD97!

One design question though: I guess there are good reasons behind the std lib not having Range* object be an iterator itself. I think some of those reasons would apply here, as well. Let's take a moment and think about these.

One thing that comes to my mind is the consumption in a for loop. From the looks of this, I think this would become a problem:

let chars = chars!('a' ..= 'z');
for ch in chars {
  println!("{}", ch);
}
println!("'A'? {}", chars.contain('A'));

because here chars gets consumed by the for loop.

Also, that code doesn't work because chars would need to be mut.

So, because of this, and possibly other reasons, I think we should have CharIter, which actually is just a CharRange with some *Iterator* implementations, and keep the Range not being an iterator itself.

As you noticed in #111, there was a lot of duplication between the two classes. I think it's great to only have one implementation of those, like stepping forward/backward, but I think it's also important to keep range and iterator separate concepts, and just reuse the code by the newtype pattern and #inline methods.

What do you think?

}

/// Construct a range of characters from bounds.
pub fn bound(mut start: Bound<char>, mut stop: Bound<char>) -> CharRange {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very good way to keep a generic API here. ❤️

But, why do you expect the input to be mut? I think what you need is to take a readonly bound, and construct a value inside:

let start = if start == Bound::Unbounded { start = Bound::Included('\0') } else { start };

I guess this is the kind of issues we usually catch with unit tests. Want to add one for this?

Copy link
Collaborator Author

@CAD97 CAD97 Aug 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of note: you can pass an imutable owned value here, because it's taken by (copied) value.

But this formulation makes more sense than what I have. 👍

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

std::ops::Range is an iterator. That is why I decided to in the "simple" version to make CharRange an iterator, and because it closer parallels the stdlib Range.

But on your argument towards separating Range/RangeIter doesn't work that well. Consider this minimal example (playground link):

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
struct R {
    low: usize,
    high: usize,
}

impl R {
    fn contains(&self, item: usize) -> bool { self.low <= item && item <= self.high }
}

impl Iterator for R {
    type Item = usize;
    fn next(&mut self) -> Option<usize> {
        if self.low > self.high { None } else {
            let ret = self.low;
            self.low += 1;
            Some(ret)
        }
    }
}

fn main() {
    let range = R { low: 0, high: 20 };
    
    println!("{}", range.contains(10));
    for int in range { println!("{}", int); }
    println!("{}", range.contains(10));
}

This works. This works because impl<I: Iterator> IntoIterator for I does a move (or copy). You can move a normal non-mut range into a for loop and it will work. You cannot use it aftwerwards, though, because it is not Copy. Our CharRange is Copy, however, so moving it into the for loop makes a copy.

You can make an argument that the actual range object and the actual iteration are separate concerns, and should be handled by different structures. If I were designing a stdlib from scratch, I think I'd go the way of having a separate Range and RangeIter structures. However, for Rust, the ship has sailed with the design of std::ops::Range, and I think it's probably better to mimic the stdlib here.

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

I did a bit of research on why Range is not Copy when its index is.

rust-lang/rust#18045 rust-lang/rust#27186 rust-lang/rust#21846
https://botbot.me/mozilla/rust-libs/2015-05-19/?msg=39609682&page=1

It looks like Range used to be but is no longer Copy because it being having a copy iterator can be confusing:

rust-lang/rust#27186 (comment)

These don't have it because they're iterators.
The choice of removing Copy impls instead of adjusted for loop desugaring or linting was made to prevent this problematic case:

let mut iter = 0..n;
for i in iter { if i > 2 { break; } }
iter.collect()

Here iter is actually not mutated, but copied [if Range is Copy]. for i in &mut iter is required to mutate the iterator.

rust-lang/rust#18045 (comment)

let stream = "Hello, world!".chars().cycle();
for _ in range(0u, 10) {
    let chunk: String = stream.take(3).collect();
    println!("{}", chunk);
}

This is a footgun, and the reason the stdlib chose to make Range not Copy, because Range is an Iterator.

Note, having the range be directly iterable enables (0..n).adapter().chain(). But (0..n).iter().adapter().chain() is no worse in my opinion, and allows the Range object to be copy, and because we have IntoIterator, still be used in a for loop.

For our implementation, this also has the benefit of allowing us to fully expose the internals of the Range type without worrying about safety. CharRange can be a fully safe type with only CharIter having to (understandably) call unsafe code. (The unsafe iteration of low past char::MAX is an optimization, which you can see has a drastic effect when compared to reverse iteration. It's one less branch that has to be done on every step.)

This research into why Range is not Copy has moved me firmly onto the side of splitting up the range and iterator concepts properly.

reverse iteration is fast now? I guess I'm no longer passing the inline depth limit.
@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

This for some reason sped up reverse iteration and slowed down forward. I have an idea of why reverse was sped up (one less fn deep), and some things to try to get forward back below 1ms.

CAD97 added 2 commits August 12, 2017 21:10
This makes it take ~0.98ms (on my machine) to iterate all forwards
as opposed to ~1.2ms (on my machine).

Forward and Reverse iteration are now within margin of error
of each other, which means (hopefully) neither are getting an
optimization which is being denied to the other.

I still have no concrete idea why the reverse->filter_map
is ~20% faster than just the filter_map. My only guess is branch
prediction, which means it would be very CPU dependent.
@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 13, 2017

(See 7c0c5dc for a bit more on this last speedup.)

With that we're now consistently down below 1ms to iterate forward or backward accross the entire unicode range (on my machine) (with margin of error reaching slightly above). This is actually slightly better than the char-iter crate which is consistently just above 1ms (though within the reported margin of error of being below).

Forward and reverse iteration are within margin of error of each other in performance. I think this is a good indicator that neither path is favored over the other.

(Small note; if you just want to iterate over all characters it seems (0..(char::MAX as u32+1)).rev().filter_map(char::from_u32) is the fastest by a decent amount more than reported margin of error. 🤷‍♂️)

I'm happy with this. I'm ready to merge when you are @behnam.

Copy link
Member

@behnam behnam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, my bad about the std lib assumption. I should have double-checked before referring.

I think I agree with your reasoning here. Also, there's the fact that we can always expand CharRange to impl *Iterator, so, we're not missing anything by being cautious.

Personally, I like the fact the range objects won't be getting manipulated all the time just because we want to save one .iter() now and then.

So, the design looks good, and the code is good. Just one corner case needs attention, I think.

///
/// ```
/// # #[macro_use] extern crate unic_char_range;
/// # use unic_char_range::*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to not hide the extern crate and use lines from the doc code blocks. (One empty line between them, plus an empty line before the actual code.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'll include the lines.

/// # fn main() {
/// assert_eq!(chars!('a'..='z'), CharRange::closed('a', 'z'));
/// assert_eq!(chars!('a'..'z'), CharRange::open_right('a', 'z'));
/// assert_eq!(chars!(..), CharRange::all());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason ❤️ doesn't render on my machine so this looks like an empty comment 🙄

Since we have CharRange::bound I just realized we could also add chars!(..'z') and chars!('a'..). I don't think those would be of much use though.

/// assert_eq!(chars!(..), CharRange::all());
/// # }
/// ```
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think derived PartialEq doesn't work for us here, because any case with high < low is considered empty for us, but eq() would return false depending on the numbers.

For rustc RangeInclusive, there's been a talk about having an enum for the type with an Empty variant. I don't think we want that here, but need to well-define emptiness.

Also a reminder that CharRange also needs an is_empty(), to not enforce .len() == 0 to call-sites.

Another matter with empty case is the open question of if either of low and high matter in equality. IMHO, it doesn't matter, and we can have this:

fn eq(&self, other: Self) -> bool {
  if self.is_empty() {
    other.is_empty()
  } else {
    self.low == other.low && self.high == other.high
  }
}

What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually did a bit of reading on RangeInclusive while making this and I think the current design that's been landed on is having just two fields--a start and a stop--and creating Step fn to make a known-ended state.

And yes, it makes sense that two empty ranges no matter their internal state should compare the same.

impl CharRange {
/// Construct a closed range of characters.
///
/// If `stop` is ordered before `start`, the resulting range will be empty.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line can be moved to the type doc, with addition of eq() behavior. (And we can drop it from constructors.) (From d11n perspective, since the fields are pub, there shouldn't be a need to read the constructor docs to understand the interpretation of the fields.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@behnam behnam changed the title CharRange [char/range] Add CharRange and CharIter Aug 13, 2017
Copy link
Member

@behnam behnam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks, @CAD97!

@behnam
Copy link
Member

behnam commented Aug 13, 2017

bors: r+

bors bot added a commit that referenced this pull request Aug 13, 2017
112: [char/range] Add CharRange and CharIter r=behnam

The first half of adressing #91.
Closes #111, this manner of attack is better than it.

This PR only has one type, `CharRange`. It is effectively `std::ops::RangeInclusive` translated to characters. The matter of construction is handled by both half-open and closed constructors offered and a macro to allow for `'a'..='z'` syntax.
@bors
Copy link
Contributor

bors bot commented Aug 13, 2017

Build succeeded

@bors bors bot merged commit cf50802 into open-i18n:master Aug 13, 2017
@CAD97 CAD97 mentioned this pull request Aug 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants