add support for other text encodings #1
Comments
TBH I'd just ignore (and/or die violently on) anything that doesn't map to UTF-8, which is the de facto standard in the Rust world. |
@nabijaczleweli Anyway, it's totally possible that this may never happen. But it's also a somewhat feasible thing to support without sacrificing performance for the UTF-8 case (at the cost of some code complexity). It is honestly somewhat low on my list of things I'd like to do with |
(To be clear, the limit to what I'd be willing to support are text encodings that can be feasibly transcoded to UTF-8.) |
How do you plan to detect the encoding of the text that is being ripgrep-ed in order to correctly decode it (or transcode it to utf8)? |
@alexandernst That is precisely one of the problems described in this issue. ;-) |
Just yesterday, I was given a dump of log files from a Windows server, and took a while to figure out why I couldn't find the string that someone else reported as being in one of the files, when I was using

For a tool that does recursive grepping, a lot of times what you have is some big directory tree, and you don't really know what's in it, and so you're looking for some needle in that big haystack. Not knowing what's in it can include not knowing the encoding.

Right now, what I think you do want something that applies a heuristic to determine the encoding, though I'm not sure you want that by default. I think the way I would go about supporting this is by reporting files that could be text but are not valid UTF-8; so, those files which it's searching as invalid UTF-8, and those which it's detecting as binary but which could be UTF-16, UTF-16BE, or UTF-16LE, or possibly other encodings. If you don't at least report something, you won't know that you're missing something due to the encoding issue, but I'm not sure you'd want to do any kind of automatic encoding detection by default.

Then you could add a flag that will do heuristic encoding detection, based on BOM for UTF-16, or even/odd nulls hack for UTF-16 BE or LE, or maybe even character frequency based detection for other encodings. |
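To make the two cheapest heuristics from the comment above concrete, here is a minimal sketch of BOM sniffing followed by the even/odd NUL check. It is not ripgrep code: the `Guess` type, the `sniff` function, and the "more than half of the code units" threshold are all assumptions chosen for illustration, and the function only looks at whatever prefix of the file the caller passes in.

```rust
/// Encodings this sketch can guess. (Hypothetical type, not part of ripgrep.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Guess {
    Utf16Le,
    Utf16Be,
    /// No evidence of UTF-16; the caller would fall back to UTF-8/ASCII.
    Unknown,
}

/// Guess an encoding from the first bytes of a file.
pub fn sniff(prefix: &[u8]) -> Guess {
    // 1. A byte order mark is the strongest signal.
    //    (FF FE 00 00 is also the UTF-32LE BOM; this sketch ignores UTF-32.)
    if prefix.starts_with(&[0xFF, 0xFE]) {
        return Guess::Utf16Le;
    }
    if prefix.starts_with(&[0xFE, 0xFF]) {
        return Guess::Utf16Be;
    }

    // 2. Even/odd NUL heuristic: mostly-ASCII text stored as UTF-16 has a
    //    NUL in every other byte. Count NULs at even and odd offsets.
    let (mut even_nuls, mut odd_nuls) = (0usize, 0usize);
    for (i, &b) in prefix.iter().enumerate() {
        if b == 0 {
            if i % 2 == 0 {
                even_nuls += 1;
            } else {
                odd_nuls += 1;
            }
        }
    }

    // Assumed threshold: more than half of the 2-byte units carry a NUL on
    // one side, and none carry a NUL on the other side.
    let units = prefix.len() / 2;
    if units == 0 {
        Guess::Unknown
    } else if odd_nuls * 2 > units && even_nuls == 0 {
        // ASCII in little-endian order: low byte first, then 0x00.
        Guess::Utf16Le
    } else if even_nuls * 2 > units && odd_nuls == 0 {
        // ASCII in big-endian order: 0x00 first, then the low byte.
        Guess::Utf16Be
    } else {
        Guess::Unknown
    }
}
```

Calling `sniff` on the first few hundred bytes of each file would keep the cost bounded. The NUL heuristic only fires for text dominated by ASCII-range characters, so it will miss UTF-16 files that are mostly CJK, and it is the kind of check that probably belongs behind a flag rather than in the default path.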
@lambda I agree we need to do something, but we are definitely limited. Things like, "report if a file isn't valid UTF-8" are probably off the table, because that would be way too expensive. We're probably limited to scanning the first few hundred/thousand bytes or so. Right now, it's a simple |
To me the inability to Also grepping for anything in Japanese companies, be it source code or text documents, almost always means you deal with Shift-JIS. TBH I wouldn't expect |
@rr- You can do that, it's just not documented on
If you did try this:
Then each hex escape is treated as a Unicode codepoint with its corresponding UTF-8 encoding. But in the former case, we disable Unicode support and get raw byte support. Unicode can actually be selectively re-enabled, so if you wanted a Unicode aware
|
But yes, I think everyone understands that only supporting UTF-8 (errm, ASCII compatible text encodings, technically) might be good enough for the vast majority of users, but is definitely going to be a deal breaker for many as well. This issue really doesn't need more motivation, it needs feasible ideas that can be implemented, and those need to take into account UX, performance and actual correctness. (Thanks @lambda for getting that started.) |
OK how about this:
There are obviously going to be problems around UTF-8 ability to display ö as |
Well the huge problem with that is things like Certainly, we can throw our hands in the air, require specific instruction of encoding by the end user and call it a day. But it doesn't really help getting at the unknown-unknowns that @lambda mentioned. Making unknown-unknowns discoverable is an important feature IMO. But perhaps it is orthogonal to text encoding support. |
There's some similar discussion of UTF-16 for Ag at ggreer/the_silver_searcher#914. Using byte order marks to detect the other UTF encodings may work. At some point, if there's no metadata to indicate the encoding, a tool can only be so magical. |
As far as specifying encoding: .gitattributes supports
to control its supporting tools, and you're already parsing git-style patterns for ignore. So that seems like a good way of supplying the encoding in complex cases... That would make --encoding=name cmdline switch equivalent to appending a "* encoding=name" line (though you might also want a --default-encoding that "prepends" instead). I'm not sure how best to combine this with recognizing BYTE ORDER MARK, though, and I agree that would also be very nice to have. |
I'm not sure whether it would be useful or not, but I do maintain a Rust wrapper around uchardet, which detects the most common character encodings. It was extracted from Mozilla. In my experience it works quite well for identifying UTF-8 versus UTF-16 (either endianness) and some other common character encodings, but it gets a little dodgier as you go further afield. Any such character detection library is obviously statistical in nature, and it will fail on some files. My Rust wrapper includes a copy of the uchardet source code that hasn't been synced with upstream in a while, but I'd be happy to do so (it's just a git submodule resolved at build time), and to make any necessary feature improvements. |
@emk Sorry for the late response. That's a neat idea! But I'd like to discourage using C dependencies beyond what we get with (My hope is to move most of the search code out into a separate crate maintained in this repo. Once that's done, we can start talking about APIs.) |
Another vote for this: PowerShell on Windows unfortunately writes UTF-16 when piping to a file ( "hi" > test.txt
rg hi Matches nothing. [1] In a classic case of inconsistency, it writes ASCII using |
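A small sketch of why the search comes up empty, assuming the redirect produces UTF-16LE with a leading BOM as described above. Both helper names are hypothetical, and `contains_bytes` is only a stand-in for a searcher that compares raw bytes against a UTF-8 pattern.

```rust
/// Encode `s` as UTF-16LE with a leading byte order mark, which is what the
/// comment above reports for a PowerShell `>` redirect (an assumption here).
pub fn utf16le_with_bom(s: &str) -> Vec<u8> {
    let mut out = vec![0xFF, 0xFE];
    for unit in s.encode_utf16() {
        out.extend_from_slice(&unit.to_le_bytes());
    }
    out
}

/// Naive byte-level substring search, standing in for a UTF-8-only searcher.
pub fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    // `windows(0)` panics, so an empty needle is rejected up front.
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}
```

`utf16le_with_bom("hi")` yields the bytes `FF FE 68 00 69 00`. The UTF-8 pattern `hi` is `68 69`, and those two bytes are never adjacent in the file, so `contains_bytes(&file, b"hi")` is false even though the text is there.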
The UTF-16LE encoding used on Windows (or rather, UCS-2LE, if that's still a thing) is an especially important use-case, and one of the easiest to detect. The typical steps can be seen in this uchardet source file:
@Bobo1239 is currently working to update my Rust |
@BurntSushi mentioned this issue elsewhere (#7) and wrote:
Is there any reason you couldn't just use Rust's standard output facilities? For a first version, detect at least UTF-16LE/BE, transcode to UTF-8 for searching, and allow Rust to output it the same way as existing strings? You pretty much need to choose a single output encoding in any case, and it might as well be whatever Rust generates for you. |
@emk Rust's standard output facilities are encoding agnostic, but that's orthogonal. Here's my concern more explicitly stated:
|
@BurntSushi Good questions, and tricky ones. :-)

As far as I can see, the fundamental problem with output in the original encoding is figuring out what to do if you find matches in files that use several different encodings. In that case, you pretty much need to choose a single encoding for output. This seems like a major force pushing towards keeping all output in UTF-8. I suppose the counter argument is that if the input from

I originally made Rust bindings for uchardet for use with substudy, which has to deal with quite a few encodings. In my experience, it's possible to detect and transcode the most common encodings automatically, and it works fairly well.

Now that I'm thinking about substudy, that provides an interesting data point: I have foreign-language subtitle files in multiple legacy encodings. In a perfectly ideal world, I would love ripgrep to correctly identify ISO Latin-1 or Windows-1252, but transcode everything to UTF-8 for matching and output, because my terminal is configured to display UTF-8, not Windows-1252. Similarly, if I write

Overall, it seems like UTF-16LE is the most important case to handle (for Windows support), and it's one of the easiest to implement (at least the basics). |
I'd also like to evaluate I don't disagree with anything you've said. Although I'd like to continue to emphasize that we should attack this problem without any C dependencies first. For example, one could imagine a solution to this problem that has no automatic detection at all. That seems like a decent place to start.
So long as |
Ah, I hadn't seen that yet! Thank you.
If it helps, I'm happy to provide a small Rust crate which provides a translation of the essential bits of the main UTF-8 / UTF-16LE / UTF-16BE detector here. This will at least handle most modern text files on Linux/Windows/Mac, which is probably 98% of what people actually want. The lack of UTF-16LE searching, in particular, is a big limitation on Windows. |
@emk That sounds like a great next step after we have the ability to search transcoded data. :-) |
Fair enough. :-) Yes, I agree it would be good to implement the transcoding machinery before the detection code! If there's no auto-detection, what should the UI for selecting an input encoding look like?

As for automatic detection, looking at the source to the encoding detectors for the first time in a couple of years, the logic to detect the major encodings is surprisingly straightforward:
This will generally produce decent results for ASCII and the various UTF encodings. Given your "pure Rust" requirement, it's probably wisest not to try for anything more right away. All the necessary language models, etc., are available in uchardet, but there's a fair amount of fiddly code to handle all the weird encodings out there. |
I don't know what the UI should look like, but we need one. I would On Oct 28, 2016 9:56 PM, "Eric Kidd" notifications@github.com wrote:
|
This includes, but is not limited to, UTF-16, latin-1, GBK, EUC-JP and Shift_JIS. (Courtesy of the `encoding_rs` crate.)

Specifically, this feature enables ripgrep to search files that are encoded in an encoding other than UTF-8. The list of available encodings is tied directly to what the `encoding_rs` crate supports, which is in turn tied to the Encoding Standard. The full list of available encodings can be found here: https://encoding.spec.whatwg.org/#concept-encoding-get

This pull request also introduces the notion that text encodings can be automatically detected on a best effort basis. Currently, the only support for this is checking for a UTF-16 BOM. In all other cases, a text encoding of `auto` (the default) implies a UTF-8 or ASCII compatible source encoding. When a text encoding is otherwise specified, it is unconditionally used for all files searched.

Since ripgrep's regex engine is fundamentally built on top of UTF-8, this feature works by transcoding the files to be searched from their source encoding to UTF-8. This transcoding only happens when:

1. `auto` is specified and a non-UTF-8 encoding is detected.
2. A specific encoding is given by end users (including UTF-8).

When transcoding occurs, errors are handled by automatically inserting the Unicode replacement character. In this case, ripgrep's output is guaranteed to be valid UTF-8 (excluding non-UTF-8 file paths, if they are printed).

In all other cases, the source text is searched directly, which implies an assumption that it is at least ASCII compatible, but where UTF-8 is most useful. In this scenario, encoding errors are not detected. In this case, ripgrep's output will match the input exactly, byte-for-byte.

This design may not be optimal in all cases, but it has some advantages:

1. The happy path ("UTF-8 everywhere") remains happy. I have not been able to witness any performance regressions.
2. In the non-UTF-8 path, implementation complexity is kept relatively low.

The cost here is transcoding itself. A potentially superior implementation might build decoding of any encoding into the regex engine itself. In particular, the fundamental problem with transcoding everything first is that literal optimizations are nearly negated.

Future work should entail improving the user experience. For example, we might want to auto-detect more text encodings. A more elaborate UX might permit end users to specify multiple text encodings, although this seems hard to pull off in an ergonomic way.

Fixes #1
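The `auto` behaviour described in this commit message can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actual implementation: the commit uses the `encoding_rs` crate, whereas this sketch handles only UTF-16 via `std::char::decode_utf16`, and the function name `prepare_for_search` is hypothetical. It shows the two paths: transcode with U+FFFD on errors when a UTF-16 BOM is found, otherwise hand the bytes through untouched.

```rust
use std::borrow::Cow;

/// Return the bytes a UTF-8-based searcher should see for `raw`.
/// (Hypothetical helper; covers only the UTF-16 BOM case of `auto`.)
pub fn prepare_for_search(raw: &[u8]) -> Cow<'_, [u8]> {
    // Detect a UTF-16 BOM; anything else is searched as-is, byte-for-byte,
    // so the common UTF-8 path does no extra work and no validation.
    let big_endian = if raw.starts_with(&[0xFF, 0xFE]) {
        false
    } else if raw.starts_with(&[0xFE, 0xFF]) {
        true
    } else {
        return Cow::Borrowed(raw);
    };

    // Turn the bytes after the BOM into 16-bit code units.
    let units = raw[2..].chunks(2).map(|pair| match pair {
        [a, b] if big_endian => u16::from_be_bytes([*a, *b]),
        [a, b] => u16::from_le_bytes([*a, *b]),
        // A trailing odd byte cannot form a code unit. Substitute a lone
        // surrogate so the decoder below reports it as an error.
        _ => 0xD800,
    });

    // Decode to UTF-8, inserting U+FFFD wherever the input is malformed.
    let text: String = std::char::decode_utf16(units)
        .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect();
    Cow::Owned(text.into_bytes())
}
```

Returning a `Cow` keeps the distinction from the commit message visible in the type: `Cow::Borrowed` means the output will match the input exactly, while `Cow::Owned` means the searcher sees transcoded text that is guaranteed to be valid UTF-8. Transcoding the whole input before searching is also where the cost mentioned above comes from.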
The Whirlwind tour section of the README still needs to be updated:
|
@ssokolow Fixed! Thanks. |
@BurntSushi You might want to edit your anti-pitch on your blog post.
|
@batisteo Hmm right thanks, I always forget to update the blog. |
This increases the initial buffer size from 8KB to 64KB. This actually leads to a reasonably noticeable improvement in at least one work-load, and is unlikely to regress in any other case. Also, since Rust programs (at least on Linux) seem to always use a minimum of 6-8MB of memory, adding an extra 56KB is negligible.

Before:

    $ hyperfine -i "rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap"
    Benchmark #1: rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap
      Time (mean ± σ):      2.109 s ±  0.012 s    [User: 565.5 ms, System: 1541.6 ms]
      Range (min … max):    2.094 s …  2.128 s    10 runs

After:

    $ hyperfine -i "rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap"
    Benchmark #1: rg 'zqzqzqzq' OpenSubtitles2018.raw.en --no-mmap
      Time (mean ± σ):      1.802 s ±  0.006 s    [User: 462.3 ms, System: 1337.9 ms]
      Range (min … max):    1.795 s …  1.814 s    10 runs
Right now, ripgrep only supports reading UTF-8 encoded text (even if some of it is invalid). In my estimation, this is probably good enough for the vast majority of use cases.
However, it may be useful to search other encodings. I don't think I'd be willing to, say, modify the regex engine itself to support other encodings, but if it were easy to do transcoding on the fly, then I think it wouldn't add too much complexity. The encoding_rs project in particular appears to support this type of text decoding.

Some thoughts/concerns: