Skip to content

Commit

Permalink
Add support for additional text encodings.
Browse files Browse the repository at this point in the history
This includes, but is not limited to, UTF-16, latin-1, GBK, EUC-JP and
Shift_JIS. (Courtesy of the `encoding_rs` crate.)

Specifically, this feature enables ripgrep to search files that are
encoded in an encoding other than UTF-8. The list of available encodings
is tied directly to what the `encoding_rs` crate supports, which is in
turn tied to the Encoding Standard. The full list of available encodings
can be found here: https://encoding.spec.whatwg.org/#concept-encoding-get

This pull request also introduces the notion that text encodings can be
automatically detected on a best effort basis. Currently, the only
support for this is checking for a UTF-16 bom. In all other cases, a
text encoding of `auto` (the default) implies a UTF-8 or ASCII
compatible source encoding. When a text encoding is otherwise specified,
it is unconditionally used for all files searched.

Since ripgrep's regex engine is fundamentally built on top of UTF-8,
this feature works by transcoding the files to be searched from their
source encoding to UTF-8. This transcoding only happens when:

1. `auto` is specified and a non-UTF-8 encoding is detected.
2. A specific encoding is given by end users (including UTF-8).

When transcoding occurs, errors are handled by automatically inserting
the Unicode replacement character. In this case, ripgrep's output is
guaranteed to be valid UTF-8 (excluding non-UTF-8 file paths, if they
are printed).

In all other cases, the source text is searched directly, which implies
an assumption that it is at least ASCII compatible, but where UTF-8 is
most useful. In this scenario, encoding errors are not detected. In this
case, ripgrep's output will match the input exactly, byte-for-byte.

This design may not be optimal in all cases, but it has some advantages:

1. In the happy path ("UTF-8 everywhere") remains happy. I have not been
   able to witness any performance regressions.
2. In the non-UTF-8 path, implementation complexity is kept relatively
   low. The cost here is transcoding itself. A potentially superior
   implementation might build decoding of any encoding into the regex
   engine itself. In particular, the fundamental problem with
   transcoding everything first is that literal optimizations are nearly
   negated.

Future work should entail improving the user experience. For example, we
might want to auto-detect more text encodings. A more elaborate UX
experience might permit end users to specify multiple text encodings,
although this seems hard to pull off in an ergonomic way.

Fixes #1
  • Loading branch information
BurntSushi committed Mar 12, 2017
1 parent 80e91a1 commit 92f1422
Show file tree
Hide file tree
Showing 10 changed files with 607 additions and 13 deletions.
16 changes: 16 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ path = "tests/tests.rs"
atty = "0.2.2"
bytecount = "0.1.4"
clap = "2.20.5"
encoding_rs = "0.5.0"
env_logger = { version = "0.4", default-features = false }
grep = { version = "0.1.5", path = "grep" }
ignore = { version = "0.1.7", path = "ignore" }
Expand Down
14 changes: 6 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,10 @@ increases the times to `3.081s` for ripgrep and `11.403s` for GNU grep.
of search results, searching multiple patterns, highlighting matches with
color and full Unicode support. Unlike GNU grep, `ripgrep` stays fast while
supporting Unicode (which is always on).
* `ripgrep` supports searching files in text encodings other than UTF-8, such
as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for
automatically detecting UTF-16 is provided. Other text encodings must be
specifically specified with the `-E/--encoding` flag.)

In other words, use `ripgrep` if you like speed, filtering by default, fewer
bugs and Unicode support.
Expand All @@ -101,18 +105,12 @@ give you a glimpse at some important downsides or missing features of
support for Unicode categories (e.g., `\p{Sc}` to match currency symbols or
`\p{Lu}` to match any uppercase letter). (Fancier regexes will never be
supported.)
* If you need to search files with text encodings other than UTF-8 (like
UTF-16), then `ripgrep` won't work. `ripgrep` will still work on ASCII
compatible encodings like latin1 or otherwise partially valid UTF-8.
`ripgrep` *can* search for arbitrary bytes though, which might work in
a pinch. (Likely to be supported in the future.)
* `ripgrep` doesn't yet support searching compressed files. (Likely to be
supported in the future.)
* `ripgrep` doesn't have multiline search. (Unlikely to ever be supported.)

In other words, if you like fancy regexes, non-UTF-8 character encodings,
searching compressed files or multiline search, then `ripgrep` may not quite
meet your needs (yet).
In other words, if you like fancy regexes, searching compressed files or
multiline search, then `ripgrep` may not quite meet your needs (yet).

### Is it really faster than everything else?

Expand Down
7 changes: 7 additions & 0 deletions doc/rg.1.md
Original file line number Diff line number Diff line change
Expand Up @@ -136,6 +136,13 @@ Project home page: https://github.com/BurntSushi/ripgrep
--debug
: Show debug messages.

-E, --encoding *ENCODING*
: Specify the text encoding that ripgrep will use on all files
searched. The default value is 'auto', which will cause ripgrep to do
a best effort automatic detection of encoding on a per-file basis.
Other supported values can be found in the list of labels here:
https://encoding.spec.whatwg.org/#concept-encoding-get

-f, --file FILE ...
: Search for patterns from the given file, with one pattern per line. When this
flag is used or multiple times or in combination with the -e/--regexp flag,
Expand Down
14 changes: 12 additions & 2 deletions src/app.rs
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,8 @@ fn app<F>(next_line_help: bool, doc: F) -> App<'static, 'static>
.possible_values(&["never", "auto", "always", "ansi"]))
.arg(flag("colors").value_name("SPEC")
.takes_value(true).multiple(true).number_of_values(1))
.arg(flag("encoding").short("E").value_name("ENCODING")
.takes_value(true).number_of_values(1))
.arg(flag("fixed-strings").short("F"))
.arg(flag("glob").short("g")
.takes_value(true).multiple(true).number_of_values(1)
Expand Down Expand Up @@ -251,6 +253,14 @@ lazy_static! {
change the match color to magenta and the background color for \
line numbers to yellow:\n\n\
rg --colors 'match:fg:magenta' --colors 'line:bg:yellow' foo.");
doc!(h, "encoding",
"Specify the text encoding of files to search.",
"Specify the text encoding that ripgrep will use on all files \
searched. The default value is 'auto', which will cause ripgrep \
to do a best effort automatic detection of encoding on a \
per-file basis. Other supported values can be found in the list \
of labels here: \
https://encoding.spec.whatwg.org/#concept-encoding-get");
doc!(h, "fixed-strings",
"Treat the pattern as a literal string.",
"Treat the pattern as a literal string instead of a regular \
Expand Down Expand Up @@ -335,9 +345,9 @@ lazy_static! {
provided are searched. Empty pattern lines will match all input \
lines, and the newline is not counted as part of the pattern.");
doc!(h, "files-with-matches",
"Only show the path of each file with at least one match.");
"Only show the paths with at least one match.");
doc!(h, "files-without-match",
"Only show the path of each file that contains zero matches.");
"Only show the paths that contains zero matches.");
doc!(h, "with-filename",
"Show file name for each match.",
"Prefix each match with the file name that contains it. This is \
Expand Down
32 changes: 32 additions & 0 deletions src/args.rs
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

use clap;
use encoding_rs::Encoding;
use env_logger;
use grep::{Grep, GrepBuilder};
use log;
Expand Down Expand Up @@ -41,6 +42,7 @@ pub struct Args {
column: bool,
context_separator: Vec<u8>,
count: bool,
encoding: Option<&'static Encoding>,
files_with_matches: bool,
files_without_matches: bool,
eol: u8,
Expand Down Expand Up @@ -224,6 +226,7 @@ impl Args {
.after_context(self.after_context)
.before_context(self.before_context)
.count(self.count)
.encoding(self.encoding)
.files_with_matches(self.files_with_matches)
.files_without_matches(self.files_without_matches)
.eol(self.eol)
Expand Down Expand Up @@ -330,6 +333,7 @@ impl<'a> ArgMatches<'a> {
column: self.column(),
context_separator: self.context_separator(),
count: self.is_present("count"),
encoding: try!(self.encoding()),
files_with_matches: self.is_present("files-with-matches"),
files_without_matches: self.is_present("files-without-match"),
eol: b'\n',
Expand Down Expand Up @@ -569,13 +573,18 @@ impl<'a> ArgMatches<'a> {
/// will need to search.
fn mmap(&self, paths: &[PathBuf]) -> Result<bool> {
let (before, after) = try!(self.contexts());
let enc = try!(self.encoding());
Ok(if before > 0 || after > 0 || self.is_present("no-mmap") {
false
} else if self.is_present("mmap") {
true
} else if cfg!(target_os = "macos") {
// On Mac, memory maps appear to suck. Neat.
false
} else if enc.is_some() {
// There's no practical way to transcode a memory map that isn't
// isomorphic to searching over io::Read.
false
} else {
// If we're only searching a few paths and all of them are
// files, then memory maps are probably faster.
Expand Down Expand Up @@ -721,6 +730,29 @@ impl<'a> ArgMatches<'a> {
Ok(ColorSpecs::new(&specs))
}

/// Return the text encoding specified.
///
/// If the label given by the caller doesn't correspond to a valid
/// supported encoding (and isn't `auto`), then return an error.
///
/// A `None` encoding implies that the encoding should be automatically
/// detected on a per-file basis.
fn encoding(&self) -> Result<Option<&'static Encoding>> {
match self.0.value_of_lossy("encoding") {
None => Ok(None),
Some(label) => {
if label == "auto" {
return Ok(None);
}
match Encoding::for_label(label.as_bytes()) {
Some(enc) => Ok(Some(enc)),
None => Err(From::from(
format!("unsupported encoding: {}", label))),
}
}
}
}

/// Returns the approximate number of threads that ripgrep should use.
fn threads(&self) -> Result<usize> {
if self.is_present("sort-files") {
Expand Down
Loading

0 comments on commit 92f1422

Please sign in to comment.