Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a regular expressions library to the distribution #3591

Closed
brson opened this issue Sep 25, 2012 · 74 comments · Fixed by #13700
Closed

Add a regular expressions library to the distribution #3591

brson opened this issue Sep 25, 2012 · 74 comments · Fixed by #13700
Labels
P-medium Medium priority

Comments

@brson
Copy link
Contributor

brson commented Sep 25, 2012

The exact engine we use needs further discussion. Probably we don't want to use yarr because it requires the nitro jit, which is a big dependency and could be undesirable for various reasons. Ideally, we would have a nice syntax extension that precompiles the regexes (in addition to runtime compilation options).

@eholk
Copy link
Contributor

eholk commented Sep 25, 2012

What about writing a regular expression engine in Rust?

@brson
Copy link
Contributor Author

brson commented Sep 25, 2012

That could be fine.

@erickt
Copy link
Contributor

erickt commented Sep 28, 2012

For anyone who stumbles across this issue looking for a regex library, we do have a pcre binding in cargo.

@killerswan
Copy link
Contributor

@erickt is that related to @uasi's https://github.com/uasi/rust-pcre ?

@erickt
Copy link
Contributor

erickt commented Oct 5, 2012

@killerswan neat! that seems to be an independent implementation. I'll ping @uasi to merge our code.

@graydon
Copy link
Contributor

graydon commented Apr 25, 2013

nominating for feature-complete

@graydon
Copy link
Contributor

graydon commented Apr 25, 2013

(also, as I mention elsewhere but apparently not in this bug, I prefer re2 for this)

@catamorphism
Copy link
Contributor

(bug triage) Milestone is correct; could be a good medium-sized starter project for somebody.

@nickdesaulniers
Copy link

I'm still interested in this. Been working on an cre2 binding. Not 100% sure how to handle functions that expect c enums.

@ktt3ja
Copy link
Contributor

ktt3ja commented Nov 1, 2013

I would like to claim this issue to do as a project for my university class.

@nickdesaulniers
Copy link

@ktt3ja , I haven't been working on this.

@glennsl
Copy link

glennsl commented Nov 3, 2013

I've been working on this, but it's pretty slow going. See https://github.com/glennsl/rust-re

It has a good set of features already, but is currently pretty naively implemented, horribly slow and not very nice to look at. I'm trying to learn rust as I go, and am currently trying to figure out how to optimize it without wandering into unsafe territory.

Feel free to fork it or implement something completely separate. A different perspective would just be good as far as I'm concerned.

@huonw
Copy link
Member

huonw commented Nov 3, 2013

There is a wiki page summarising some research about this: https://github.com/mozilla/rust/wiki/Lib-re

@ktt3ja
Copy link
Contributor

ktt3ja commented Nov 7, 2013

Nvm, I'll be working on something else for my project.

@ferristseng
Copy link

I've been working on this for my class. The project is here https://github.com/ferristseng/regex-rust. I've been trying to add the features outlined in the ECMA-262 standard. If anyone has any suggestions or would like to help, let me know (this is my first time working on a regular expression engine).

@huonw
Copy link
Member

huonw commented Dec 7, 2013

cc @ssbr

@ssbr
Copy link
Contributor

ssbr commented Dec 7, 2013

Evidently there's some duplication of work going on, even if it is early work for everyone. @ferristseng want to join up? I can contribute what I suspect is a more rigorous attempt at a parser (I basically copied the ECMA spec to the letter, with one deviation, so it should be correct -- I did skip a few bits though, and replaced them with empty failing functions), and a large set of tests (150MB worth, I am trying to shrink it down and make them runnable) I've taken out of Firefox. Unfortunately my free time is more limited than I'd like, so I have yet to get around to a good set of runtime implementations. That's the exciting bit I've been trying to get to once all the other pieces are ready.

OTOH I suspect that since this is for school you can't have any outside contributors, which would be disappointing.

@ferristseng
Copy link

Yeah I would definitely like to join up. The class is finished now, so any outside contributions would be welcome. I'm not sure of the quality of the code though, but hopefully there is some useful stuff in there.

@glennsl
Copy link

glennsl commented Dec 9, 2013

Very cool, @ferristseng ! As a start, you might be interested in the testsuite I've adapted from python (see https://github.com/glennsl/rust-re/blob/master/tests.rs). Should be pretty easy to adapt to your own code.

I've been quite busy this past month, but I'm at a point now where I believe I have a fairly good understanding of the issues involved, regarding both performance and functionality. And I'm looking forward to having a bit more time during the holidays to work on this. I would really like to discuss our different approaches and ideas first, though, perhaps via e-mail? My e-mail is firstname at lastname dot net, and my full name is Glenn Slotte.

There is one issue that needs to be addressed in a larger context, however: Unicode. The readme on my repository roughly outlines what's needed for the different levels of Unicode TR18 compliance (there's also a third level of compliance, but that's special interest territory). Normalization and case folding, I think, are the two most important points that need to be addressed and implemented separately (#9084 addresses the case folding).

@Kimundi
Copy link
Member

Kimundi commented Dec 9, 2013

@glennsl Note that we already have nfd and nfkd normalization in the stdlib.

@ssbr
Copy link
Contributor

ssbr commented Dec 9, 2013

I've been quite busy this past month, but I'm at a point now where I believe I have a fairly good understanding of the issues involved, regarding both performance and functionality. And I'm looking forward to having a bit more time during the holidays to work on this. I would really like to discuss our different approaches and ideas first, though, perhaps via e-mail? My e-mail is firstname at lastname dot net, and my full name is Glenn Slotte.

Could this discussion be held publicly? I am interested in it.

@glennsl
Copy link

glennsl commented Dec 9, 2013

@Kimundi Ah, I've missed that, but that should do. Thanks.

@ssbr Yeah I figured you'd want to be involved, and was thinking you, me and @ferristseng could sort out the implementation details and coordinate without bothering everyone else. Sorry for not being very clear on that.

@thestinger
Copy link
Contributor

I don't really like the focus on replicating ECMA-262 because JavaScript has awful Unicode support, especially in regular expressions. It doesn't even have basic Unicode character class support - it's entirely anglo-centric.

@eddyb
Copy link
Member

eddyb commented Dec 10, 2013

@thestinger the Unicode support is fixed in ES6 (with the u flag), maybe that should be the baseline.

@lambda-fairy
Copy link
Contributor

Are you guys still working on this? I've been hacking on an implementation of my own (https://github.com/lfairy/rose). At this stage, it's mostly a cleaned-up version of @glennsl's library.

Pike VM is good. It isn't terribly fast, but it's simple and predictable, which is what we want in the stdlib.

On Unicode support: we really don't need to handle normalization -- no other popular engine implements that. Input can be normalized outside the library anyway.

In addition, neither Python nor grep perform full case folding (Unicode Level 2). AFAIK, only RE2 and PCRE have this feature, the latter very recently (http://bugs.exim.org/show_bug.cgi?id=1208).

Taking this into account, I think we should aim for Level 1 compliance only.

@ferristseng
Copy link

I've been working on cleaning up my code, and adding more features. The code is probably more confusing than it should be.

Something that arose in our discussion was whether or not we should support features like backreferences and assertions. It might be a good idea to support them whether we have a fall back method, where we default to using a recursive backtracking algorithm, or find some other way. The downside to this is depending on what we choose, it will certainly add some complexity and overhead to the code.

@lambda-fairy
Copy link
Contributor

@ferristseng I had a quick look over your code -- I agree that it could do with some cleaning up ;)

As a first step, I suggest making CharClass immutable, and moving the optimization logic from negate into the constructor. The API might look like this:

impl CharClass {
    pub fn new(ranges: ~[(char, char)]);
    pub fn negate(&self) -> self;
    // ...
}

Backreferences obviously require backtracking, but I think you can do forward (lookahead) assertions using Pike VM. Split a separate thread for every assertion; the overall expression matches iff any non-assertion thread matches and all assertion threads match. You can implement this in the VM by adding an AssertionSplit(Position, Position) instruction and keeping separate vectors for assertion and non-assertion threads.

By the way, my email is lambda dot fairy at gmail dot com. I'd love to join in the discussion.

@BurntSushi
Copy link
Member

I'm currently working on this: https://github.com/BurntSushi/re2

It's already at or near feature parity with RE2 (including Unicode classes), but I haven't written tests or benchmarks yet. (And I still need to make the public API.) I'm hoping to submit a PR within the next few days.

@BurntSushi
Copy link
Member

OK, so I'm pretty close to getting to a point where I'm happy with my regexp implementation. It includes an re! macro for compile time safe regexps. There's also over 400 tests (all passing) and lots of benchmarks, including regex-dna for the shootout.

Given that there is interest in having a regexp library included in the Rust distribution, I'd be happy to submit mine for consideration.

Is this something that will have to go through the RFC process? Or can I just submit a PR? AFAIK, my implementation is a pretty faithful port of RE2 with a standard interface. I'm working on doco here: http://burntsushi.net/rustdoc/regexp/

@alexcrichton
Copy link
Member

That looks really impressive, very nice!

In the interest of alerting everyone to a pending regular expression library implementation, it may be nice to have an RFC, but I don't think that the RFC needs to be too long, and I imagine that it will not take too long to get accepted.

@BurntSushi
Copy link
Member

Fantastic, thank you! I'll finish up my doco and write up an RFC as soon as I can.

@Valloric
Copy link
Contributor

@BurntSushi That's some impressive work!

@BurntSushi
Copy link
Member

RFC submitted: rust-lang/rfcs#42

@mdinger
Copy link
Contributor

mdinger commented Apr 17, 2014

Apologies if this is in the wrong place. Perhaps it is irrelevant because this is almost implemented...

I thought a new regex implementation probably wasn't complete without a mention of Perl 6 and I couldn't find it anywhere in any Rust discussions. Anyway, I was under the impression that Perl 6 is attempting to fix many issues with current regex and thus should be paid attention to when implementing a new language which includes regex. I just wanted to point it out in case it was an accidental omission.

The following documents goes over more details of it if you are interested. I hope they are useful.

http://perlcabal.org/syn/S05.html
http://en.wikipedia.org/wiki/Perl_6_rules

@BurntSushi
Copy link
Member

@mdinger I hadn't seen what Perl 6 was doing in this space. I just spent some time looking through your links. I'll say it definitely looks cool---the idea of specifying a regexp with a grammar does seem more readable for more complex regexps. But I think it's a separate issue from providing your average run-of-the-mill regexps that everyone already knows and uses. It looks like a prime case for trying out an EDSL in Rust, actually.

I also say this because the regexps being offered up right now are strictly less expressive than Perl's. They are a pretty faithful port of RE2, which means no backreferences or generalized zero-width assertions (i.e., lookbehind and lookahead). So the need to simplify is probably quite a bit less than with a regexp engine as complex and powerful as Perl's.

@mdinger
Copy link
Contributor

mdinger commented Apr 18, 2014

Very good. As long as it isn't missed because of ignorance. Perl 6 has been in development for a long time without ever getting traction so this might not have been well known or used; it's still perfect for someone else to take advantage of their work.

If this is worth pursuing in any fashion, where should it be noted? In the wiki/Libs linked below? On the wiki in another category? Posted to a mailing list?
https://github.com/mozilla/rust/wiki/Libs

I'm not developing this, I just wanted the correct people to be aware in case they want to pursue it. Sounds useful to me.

BurntSushi added a commit to BurntSushi/rust that referenced this issue Apr 25, 2014
Also adds a regex_macros crate, which provides natively compiled
regular expressions with a syntax extension.

Closes rust-lang#3591.

RFC: 0007-regexps
bors added a commit that referenced this issue Apr 25, 2014
Implements [RFC 7](https://github.com/rust-lang/rfcs/blob/master/active/0007-regexps.md) and will hopefully resolve #3591. The crate is marked as experimental. It includes a syntax extension for compiling regexps to native Rust code.

Embeds and passes the `basic`, `nullsubexpr` and `repetition` tests from [Glenn Fowler's (slightly modified by Russ Cox for leftmost-first semantics) testregex test suite](http://www2.research.att.com/~astopen/testregex/testregex.html). I've also hand written a plethora of other tests that exercise Unicode support, the parser, public API, etc. Also includes a `regex-dna` benchmark for the shootout.

I know the addition looks huge at first, but consider these things:

1. More than half the number of lines is dedicated to Unicode character classes.
2. Of the ~4,500 lines remaining, 1,225 of them are comments.
3. Another ~800 are tests.
4. That leaves 2500 lines for the meat. The parser is ~850 of them. The public API, compiler, dynamic VM and code generator (for `regexp!`) make up the rest.
RalfJung pushed a commit to RalfJung/rust that referenced this issue May 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P-medium Medium priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.