[char/range] Add CharRange and CharIter #112

CAD97 · 2017-08-12T08:18:42Z

The first half of adressing #91.
Closes #111, this manner of attack is better than it.

This PR only has one type, CharRange. It is effectively std::ops::RangeInclusive translated to characters. The matter of construction is handled by both half-open and closed constructors offered and a macro to allow for 'a'..='z' syntax.

I have no idea why I am so slow.

CAD97 · 2017-08-12T08:20:19Z

@behnam The actual implementation is ready for review. It's just the supporting chaff that I need to copy over -- the (half-)open constructors and the macro, along with test suite.

(Oof, reverse takes three times as long on my machine (wheras the FilterMap is actually faster! probably branch prediction). I'll have to see if I missed an #[inline] or if it's something else.)

chars!(..=) is const.

<https://doc.rust-lang.org/std/collections/enum.Bound.html>

Precomputing SURROGATE_RANGE.len() proved to make no difference to the 2ns (+/- 0ns) execution time of this method.

CAD97 · 2017-08-12T10:04:44Z

Table generation should probably use the chars!( '\u{}' ..= '\u{}' ) syntax. It is a constant time expression, unlike the normal constructors, so can be used in the tables.

Reverse iteration is currently between 3 and 4 times slower than forward iteration. It must be slightly too complicated to inline in LLVM's eyes, since I did mark it #[inline] but it doesn't seem to have benefited in the same manner as forward iteration did. (In any case, forward iteration is the normal situation anyway.)

CAD97 · 2017-08-12T10:34:30Z

bors: try

bors · 2017-08-12T10:46:53Z

Build succeeded

continuous-integration/travis-ci/push

behnam

The code looks great, @CAD97!

One design question though: I guess there are good reasons behind the std lib not having Range* object be an iterator itself. I think some of those reasons would apply here, as well. Let's take a moment and think about these.

One thing that comes to my mind is the consumption in a for loop. From the looks of this, I think this would become a problem:

let chars = chars!('a' ..= 'z');
for ch in chars {
  println!("{}", ch);
}
println!("'A'? {}", chars.contain('A'));

because here chars gets consumed by the for loop.

Also, that code doesn't work because chars would need to be mut.

So, because of this, and possibly other reasons, I think we should have CharIter, which actually is just a CharRange with some *Iterator* implementations, and keep the Range not being an iterator itself.

As you noticed in #111, there was a lot of duplication between the two classes. I think it's great to only have one implementation of those, like stepping forward/backward, but I think it's also important to keep range and iterator separate concepts, and just reuse the code by the newtype pattern and #inline methods.

What do you think?

behnam · 2017-08-12T18:44:08Z

unic/char/range/src/range.rs

+    }
+
+    /// Construct a range of characters from bounds.
+    pub fn bound(mut start: Bound<char>, mut stop: Bound<char>) -> CharRange {


This is a very good way to keep a generic API here. ❤️

But, why do you expect the input to be mut? I think what you need is to take a readonly bound, and construct a value inside:

let start = if start == Bound::Unbounded { start = Bound::Included('\0') } else { start };

I guess this is the kind of issues we usually catch with unit tests. Want to add one for this?

Of note: you can pass an imutable owned value here, because it's taken by (copied) value.

But this formulation makes more sense than what I have. 👍

CAD97 · 2017-08-12T20:27:33Z

std::ops::Range is an iterator. That is why I decided to in the "simple" version to make CharRange an iterator, and because it closer parallels the stdlib Range.

But on your argument towards separating Range/RangeIter doesn't work that well. Consider this minimal example (playground link):

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
struct R {
    low: usize,
    high: usize,
}

impl R {
    fn contains(&self, item: usize) -> bool { self.low <= item && item <= self.high }
}

impl Iterator for R {
    type Item = usize;
    fn next(&mut self) -> Option<usize> {
        if self.low > self.high { None } else {
            let ret = self.low;
            self.low += 1;
            Some(ret)
        }
    }
}

fn main() {
    let range = R { low: 0, high: 20 };
    
    println!("{}", range.contains(10));
    for int in range { println!("{}", int); }
    println!("{}", range.contains(10));
}

This works. This works because impl<I: Iterator> IntoIterator for I does a move (or copy). You can move a normal non-mut range into a for loop and it will work. You cannot use it aftwerwards, though, because it is not Copy. Our CharRange is Copy, however, so moving it into the for loop makes a copy.

You can make an argument that the actual range object and the actual iteration are separate concerns, and should be handled by different structures. If I were designing a stdlib from scratch, I think I'd go the way of having a separate Range and RangeIter structures. However, for Rust, the ship has sailed with the design of std::ops::Range, and I think it's probably better to mimic the stdlib here.

CAD97 · 2017-08-12T21:46:04Z

I did a bit of research on why Range is not Copy when its index is.

rust-lang/rust#18045 rust-lang/rust#27186 rust-lang/rust#21846
https://botbot.me/mozilla/rust-libs/2015-05-19/?msg=39609682&page=1

It looks like Range used to be but is no longer Copy because it being having a copy iterator can be confusing:

rust-lang/rust#27186 (comment)

These don't have it because they're iterators.
The choice of removing Copy impls instead of adjusted for loop desugaring or linting was made to prevent this problematic case:
let mut iter = 0..n;
for i in iter { if i > 2 { break; } }
iter.collect()
Here iter is actually not mutated, but copied [if Range is Copy]. for i in &mut iter is required to mutate the iterator.

rust-lang/rust#18045 (comment)

let stream = "Hello, world!".chars().cycle();
for _ in range(0u, 10) {
    let chunk: String = stream.take(3).collect();
    println!("{}", chunk);
}

This is a footgun, and the reason the stdlib chose to make Range not Copy, because Range is an Iterator.

Note, having the range be directly iterable enables (0..n).adapter().chain(). But (0..n).iter().adapter().chain() is no worse in my opinion, and allows the Range object to be copy, and because we have IntoIterator, still be used in a for loop.

For our implementation, this also has the benefit of allowing us to fully expose the internals of the Range type without worrying about safety. CharRange can be a fully safe type with only CharIter having to (understandably) call unsafe code. (The unsafe iteration of low past char::MAX is an optimization, which you can see has a drastic effect when compared to reverse iteration. It's one less branch that has to be done on every step.)

This research into why Range is not Copy has moved me firmly onto the side of splitting up the range and iterator concepts properly.

reverse iteration is fast now? I guess I'm no longer passing the inline depth limit.

CAD97 · 2017-08-12T22:44:18Z

This for some reason sped up reverse iteration and slowed down forward. I have an idea of why reverse was sped up (one less fn deep), and some things to try to get forward back below 1ms.

This makes it take ~0.98ms (on my machine) to iterate all forwards as opposed to ~1.2ms (on my machine). Forward and Reverse iteration are now within margin of error of each other, which means (hopefully) neither are getting an optimization which is being denied to the other. I still have no concrete idea why the reverse->filter_map is ~20% faster than just the filter_map. My only guess is branch prediction, which means it would be very CPU dependent.

CAD97 · 2017-08-13T02:13:16Z

(See 7c0c5dc for a bit more on this last speedup.)

With that we're now consistently down below 1ms to iterate forward or backward accross the entire unicode range (on my machine) (with margin of error reaching slightly above). This is actually slightly better than the char-iter crate which is consistently just above 1ms (though within the reported margin of error of being below).

Forward and reverse iteration are within margin of error of each other in performance. I think this is a good indicator that neither path is favored over the other.

(Small note; if you just want to iterate over all characters it seems (0..(char::MAX as u32+1)).rev().filter_map(char::from_u32) is the fastest by a decent amount more than reported margin of error. 🤷‍♂️)

I'm happy with this. I'm ready to merge when you are @behnam.

behnam

Oh, my bad about the std lib assumption. I should have double-checked before referring.

I think I agree with your reasoning here. Also, there's the fact that we can always expand CharRange to impl *Iterator, so, we're not missing anything by being cautious.

Personally, I like the fact the range objects won't be getting manipulated all the time just because we want to save one .iter() now and then.

So, the design looks good, and the code is good. Just one corner case needs attention, I think.

behnam · 2017-08-13T03:43:01Z

unic/char/range/src/range.rs

+///
+/// ```
+/// # #[macro_use] extern crate unic_char_range;
+/// # use unic_char_range::*;


It's better to not hide the extern crate and use lines from the doc code blocks. (One empty line between them, plus an empty line before the actual code.)

👍 I'll include the lines.

behnam · 2017-08-13T03:43:13Z

unic/char/range/src/range.rs

+/// # fn main() {
+/// assert_eq!(chars!('a'..='z'), CharRange::closed('a', 'z'));
+/// assert_eq!(chars!('a'..'z'), CharRange::open_right('a', 'z'));
+/// assert_eq!(chars!(..), CharRange::all());


For some reason ❤️ doesn't render on my machine so this looks like an empty comment 🙄

Since we have CharRange::bound I just realized we could also add chars!(..'z') and chars!('a'..). I don't think those would be of much use though.

behnam · 2017-08-13T03:56:45Z

unic/char/range/src/range.rs

+/// assert_eq!(chars!(..), CharRange::all());
+/// # }
+/// ```
+#[derive(Copy, Clone, Debug, Eq, PartialEq)]


I think derived PartialEq doesn't work for us here, because any case with high < low is considered empty for us, but eq() would return false depending on the numbers.

For rustc RangeInclusive, there's been a talk about having an enum for the type with an Empty variant. I don't think we want that here, but need to well-define emptiness.

Also a reminder that CharRange also needs an is_empty(), to not enforce .len() == 0 to call-sites.

Another matter with empty case is the open question of if either of low and high matter in equality. IMHO, it doesn't matter, and we can have this:

fn eq(&self, other: Self) -> bool { if self.is_empty() { other.is_empty() } else { self.low == other.low && self.high == other.high } }

What do you think?

I actually did a bit of reading on RangeInclusive while making this and I think the current design that's been landed on is having just two fields--a start and a stop--and creating Step fn to make a known-ended state.

And yes, it makes sense that two empty ranges no matter their internal state should compare the same.

behnam · 2017-08-13T03:59:18Z

unic/char/range/src/range.rs

+impl CharRange {
+    /// Construct a closed range of characters.
+    ///
+    /// If `stop` is ordered before `start`, the resulting range will be empty.


I think this line can be moved to the type doc, with addition of eq() behavior. (And we can drop it from constructors.) (From d11n perspective, since the fields are pub, there shouldn't be a need to read the constructor docs to understand the interpretation of the fields.)

behnam

Awesome! Thanks, @CAD97!

behnam · 2017-08-13T08:49:42Z

bors: r+

112: [char/range] Add CharRange and CharIter r=behnam The first half of adressing #91. Closes #111, this manner of attack is better than it. This PR only has one type, `CharRange`. It is effectively `std::ops::RangeInclusive` translated to characters. The matter of construction is handled by both half-open and closed constructors offered and a macro to allow for `'a'..='z'` syntax.

bors · 2017-08-13T09:04:48Z

Build succeeded

continuous-integration/travis-ci/push

CAD97 added 3 commits August 12, 2017 03:04

This is as simple as I can make it.

165cc3d

I have no idea why I am so slow.

Cut execution time in half with one #[inline] :sigh:

72b2000

Polish

ea8e6d9

CAD97 requested a review from behnam August 12, 2017 08:19

CAD97 added 3 commits August 12, 2017 04:39

Silly bug what are you doing there

56b3249

Constructors

aee0479

Benchmark reversed iteration as well.

8f0bd86

(Oof, reverse takes three times as long on my machine (wheras the FilterMap is actually faster! probably branch prediction). I'll have to see if I missed an #[inline] or if it's something else.)

CAD97 mentioned this pull request Aug 12, 2017

CharRange and CharIter #111

Closed

CAD97 added 8 commits August 12, 2017 05:06

Review notes from #111

d1a28f6

Iterator tests

641916f

macro chars!

d04365f

chars!(..=) is const.

std::collections::Bound

f5520de

<https://doc.rust-lang.org/std/collections/enum.Bound.html>

Unstable features

0e0ba6b

Unbounded macro invocation

ad6ce50

Document features

8153d22

Benchmark CharRange::len()

34c01de

Precomputing SURROGATE_RANGE.len() proved to make no difference to the 2ns (+/- 0ns) execution time of this method.

CAD97 added 2 commits August 12, 2017 06:11

Provide more efficient than default impls of some Iterator fn

61a831c

CharRange::contains is important to our use case

eb22f35

CAD97 self-assigned this Aug 12, 2017

bors bot added a commit that referenced this pull request Aug 12, 2017

Try #112:

ac3205b

behnam reviewed Aug 12, 2017

View reviewed changes

Improve CharRange::bound per review

bf5fabf

Split Range and Iter

6c2d967

reverse iteration is fast now? I guess I'm no longer passing the inline depth limit.

CAD97 added 2 commits August 12, 2017 21:10

Clarifications in style

200e721

behnam reviewed Aug 13, 2017

View reviewed changes

behnam changed the title ~~CharRange~~ [char/range] Add CharRange and CharIter Aug 13, 2017

impl review feedback

cf50802

behnam approved these changes Aug 13, 2017

View reviewed changes

bors bot merged commit cf50802 into open-i18n:master Aug 13, 2017

CAD97 mentioned this pull request Aug 13, 2017

Implement CharRange type #91

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[char/range] Add CharRange and CharIter #112

[char/range] Add CharRange and CharIter #112

CAD97 commented Aug 12, 2017 •

edited

Loading

CAD97 commented Aug 12, 2017

CAD97 commented Aug 12, 2017

CAD97 commented Aug 12, 2017

bors bot commented Aug 12, 2017

behnam left a comment •

edited

Loading

behnam Aug 12, 2017

CAD97 Aug 12, 2017 •

edited

Loading

CAD97 commented Aug 12, 2017

CAD97 commented Aug 12, 2017 •

edited

Loading

CAD97 commented Aug 12, 2017 •

edited

Loading

CAD97 commented Aug 13, 2017

behnam left a comment

behnam Aug 13, 2017

CAD97 Aug 13, 2017

behnam Aug 13, 2017

CAD97 Aug 13, 2017

behnam Aug 13, 2017

CAD97 Aug 13, 2017

behnam Aug 13, 2017

CAD97 Aug 13, 2017

behnam left a comment

behnam commented Aug 13, 2017

bors bot commented Aug 13, 2017

[char/range] Add CharRange and CharIter #112

[char/range] Add CharRange and CharIter #112

Conversation

CAD97 commented Aug 12, 2017 • edited Loading

CAD97 commented Aug 12, 2017

CAD97 commented Aug 12, 2017

CAD97 commented Aug 12, 2017

bors bot commented Aug 12, 2017

Build succeeded

behnam left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CAD97 Aug 12, 2017 • edited Loading

Choose a reason for hiding this comment

CAD97 commented Aug 12, 2017

CAD97 commented Aug 12, 2017 • edited Loading

CAD97 commented Aug 12, 2017 • edited Loading

CAD97 commented Aug 13, 2017

behnam left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

behnam left a comment

Choose a reason for hiding this comment

behnam commented Aug 13, 2017

bors bot commented Aug 13, 2017

Build succeeded

CAD97 commented Aug 12, 2017 •

edited

Loading

behnam left a comment •

edited

Loading

CAD97 Aug 12, 2017 •

edited

Loading

CAD97 commented Aug 12, 2017 •

edited

Loading

CAD97 commented Aug 12, 2017 •

edited

Loading