-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
find a way to surface binary file detection better #306
Comments
ripgrep will skip a lot more than just binary files. By default, ripgrep will completely skip:
So I think skipping binary files is perfectly consistent with this philosophy. Note the documentation of the
Could you elaborate a bit more on why you think binary files should be search by default? If you want to search binary files, then it seems like passing the |
I think I'm open to that. We already have a similar error message that's emitted if ripgrep doesn't search any files at all. |
I'm talking specifically about the case where you supply only one or a few files on the command line. I very much like that ripgrep skips a lot of files when searching recursively, but it's unintuitive when it skips the only file that I told it to search in, without telling me why.
The case where this came up for me was with a text file that had some garbage in it, which made ripgrep classify it as a binary file. So that flag is what I wanted, but I didn't know that at first. It wasn't until I ran the same search with GNU grep that I noticed why ripgrep didn't return any results.
Really? I didn't get that message, and it sounds like I should have. |
I just updated ripgrep via If this is a bug and some error message should actually be printed when using ripgrep like this, feel free to change the issue title accordingly. |
Sorry, I mean that if ripgrep knows it doesn't search any file, then it will emit an error message. Unfortunately, it doesn't take whether a file is binary or not into account, since that state is determined deep inside the search code. It's an unfortunate artifact of the current search internals. Thankfully, the search internals are due for a refactor. I'm preparing to work on that. This is a good issue, because I'll be sure to incorporate this into the design. So yes, I'd consider it a bug. :-) |
This would be incredibly useful. I've recently run into an issue wherein I think ripgrep is assuming a large (in my most recent case, 21G) log file is binary, but without saying so. With no message, it was frustrating to not know why ripgrep was finding no results (grep doesn't seem to think it's binary, as it returns results with no flags required). (after finding this issue, I found that ripgrep can search the file just fine if I run it with |
@NJAldwin The behavior of ripgrep and grep should be the same with respect to binary files (with ripgrep having potentially more false negatives), so I think we should figure out why you're observing a difference first. The |
@BurntSushi happy to open a new issue to investigate further. Is there a way I can verify that ripgrep is actually identifying the file as binary first? (maybe there's some other issue that |
@NJAldwin Unfortunately. That information gets dropped. It's probably best to try to get a reproducible example that we can both see. (You might consider masking your log file somehow and then trimming it down?) |
@BurntSushi aw, unfortunate. Well, I'll see if I can put together a sanitized file that exhibits the same behavior. |
My problem as related to binary values in a file that is most of the time considered to contain only standard text log errors. The log file (70MB size) contained 0x0 somewhere in the middle. So from the beginning ripgrep searched ok and returned results. But later when ripgrep arrived to 0x0, it silently exited. This caused some incorrect assumptions - I thought that the interesting lines are only at the beginning of the file and not in the rest. That assumption was false. I would add +1 to idea to write at least some warning so I know that there is "something wrong" with the file. And then I can use |
Both #1117 and #1129 are recent bugs filed that are basically the same as this one. Namely, the issue here is that ripgrep will silently stop searching a file if it believes it is binary. The reason why ripgrep has this behavior is that the intent is to avoid searching binary files at all. That is, we'd like to ignore those files for the same reason that we ignore hidden files or files in your
The overall emergent behavior here ends up being pretty unintuitive. In particular, we combine the fact that searches can silently stop with the fact that binary detection differs subtly based on the search strategy, which is entirely opaque to the end user. GNU grep does much better here because it never starts from the premise of "smart" filtering like ripgrep does. In particular, once GNU grep detects that a file is binary, by default, it doesn't immediately stop. GNU grep will continue searching until either no match is found in the entire file (in which there is no output) or until the next match is found, at which point, the match is not printed but instead a GNU grep also supports a little known flag, In both GNU grep and ripgrep, the So the question here is: how do we fix this? I originally intended to straighten this out as part of libripgrep, but couldn't come to a decision on what behavior ripgrep should have. On the one hand, I think it's important to continue filtering binary files. So one possible improvement here that I think would be easy to do is this:
Of course, this doesn't solve the issue for the memory map searcher, which is trickier. The problem is that the regex engine is the primary thing that advances through the contents of the file, and it could advance arbitrarily far into the file. That means doing another search for a |
I assume that happens if the content is read from STDIN.
Indeed. I guess that also explains why there are so many bug reports about this which in the end are duplicates of this issue. :-)
Can confirm. But it does more than just that. IMHO the biggest issue as of now (0.10.0) is the missing warning (and proper exit code) if a (lately detected) binary file with matches is not reported as such.
While this is definitely an improvement, it IMHO still causes uncertainty if one would find all matches, especially with (log) files of unknown binary-cleanliness. Maybe there should be an (optional) level between that and
Then again, I'm not sure if I'd have found such an option in the case of #1227. I've found |
Including another flag for specifically enabling the searching of binary files in a way that is similar to how grep operates (again, treating grep as the gold standard here since it doesn't silently hide anything) is something I could get on board with. We already have flags like
That's true, but that's the value proposition of ripgrep. It does filtering "smartly" by default, so there should generally never be an assumption that silent output from ripgrep means it searched everything. This is a key fundamental difference from grep. The problem as I see it is that ripgrep's behavior isn't actually this nice, because it silently gives up on searching part of the way even after printing some matches, which is generally not something one should expect.
No, memory maps aren't used when reading from stdin. stdin uses the standard stream buffer which has the "most correct" binary detection. The memory map searcher will only detect binary files if a NUL byte occurs in the first few KBs of the file. So if your NUL byte in your log file occurs way after that, then searching with the memory map searcher won't (or shouldn't, anyway) exhibit this bug. |
Definitely better than my more bulky proposition.
Fair enough. Another issue here is probably that if grep at some point in the future adds another short flag, ripgrep might want to adopt that option.
From my point of the view the primary value of of ripgrep is its speed. When grepping a 5 GB log file But yes,
Guess I fell for that assumption initially. (And many others before me, too, as it seems…)
*nod* |
Right. My response to this over time is to be as up front about this as possible. The "smart" filtering should be mentioned in the first few sentences of the README, the People use tools for a lot of different reasons, of course, and certainly ripgrep's performance is one of them. But the entire design of ripgrep's command line interface is geared towards smart filtering. With that said, folks should generally be able to use ripgrep as something approximating a grep, but it will require using additional flags. For example, |
Correct. And I must admit, I either missed the last few words of the first man page paragraph or I totally didn't get the significance of "and binary files" when I skimmed over the man page initially and hence forgot about it again later. Probably the latter. What's IMHO less clear and might need an according update, is Debian's ripgrep package description. (Which I read before I read the man page, i.e. which was my "first contact" with ripgrep.) At least neither the skipping of binary files nor the "smart" filtering are explicitly mentioned. ( |
@xtaran Ah good point about Debian. To be fair to them, they probably took an older description that I used. The README was, for example, somewhat recently updated (although I think the man page and help output have had the additional verbiage for longer). |
Indeed, I am lazy :) |
This commit attempts to surface binary filtering in a slightly more user friendly way. Namely, before, ripgrep would silently stop searching a file if it detected a NUL byte, even if it had previously printed a match. This can lead to the user quite reasonably assuming that there are no more matches, since a partial search is fairly unintuitive. (ripgrep has this behavior by default because it really wants to NOT search binary files at all, just like it doesn't search gitignored or hidden files.) With this commit, if a match has already been printed and ripgrep detects a NUL byte, then it will print a warning message indicating that the search stopped prematurely. Moreover, this commit adds a new flag, --binary, which causes ripgrep to stop filtering binary files, but in a way that still avoids dumping binary data into terminals. That is, the --binary flag makes ripgrep behave more like grep's default behavior. For files explicitly specified in a search, e.g., `rg foo some-file`, then no binary filtering is applied (just like no gitignore and no hidden file filtering is applied). Instead, ripgrep behaves as if you gave the --binary flag for all explicitly given files. This was a fairly invasive change, and potentially increases the UX complexity of ripgrep around binary files. (Before, there were two binary modes, where as now there are three.) However, ripgrep is now a bit louder with warning messages when binary file detection might otherwise be hiding potential matches, so hopefully this is a net improvement. Finally, the `-uuu` convenience now maps to `--no-ignore --hidden --binary`, since this is closer to the actualy intent of the `--unrestricted` flag, i.e., to reduce ripgrep's smart filtering. As a consequence, `rg -uuu foo` should now search roughly the same number of bytes as `grep -r foo`, and `rg -uuua foo` should search roughly the same number of bytes as `grep -ra foo`. (The "roughly" weasel word is used because grep's and ripgrep's binary file detection might differ somewhat---perhaps based on buffer sizes---which can impact exactly what is and isn't searched.) See the numerous tests in tests/binary.rs for intended behavior. Fixes #306, Fixes #855
@BurntSushi You mentioned GNU Grep's option Any chance Rigrep will have a similar one? The message says Ripgrep's behavior is the same by default, but it isn't (at least not with Ripgrep 12 I currently have installed). By default Ripgrep prints the "binary file ... matches" message, whereas Grep, with that option, does not. |
You didn't include a repro, so I guess I have to guess at what you mean? I'm guessing you were searching a specific file, in which case, yes ripgrep will search it by default in the same way as GNU grep. But in recursive search, ripgrep uses without-match. I'm on mobile, but maybe in your case, |
Almost: I'm piping a list of files (obtained through a specific means, configurable by the user). I'm calling all this from Emacs, so this is an automation issue. I'm writing because of the changed format for "binary file matches" in Ripgrep 13 (I'm a late upgrader), which now differs from GNU Grep, so I either need to use a more complex regexp, or find a way to drop the binary matches from the output.
It didn't help, alas. Oh well, a more complex regexp it is, then. |
I believe it was GNU grep that actually changed its output. :) |
Oh. Indeed, it had, in 3.5. Thanks for the correction. I guess it's high time I upgrade to the latest LTS distro. ;-( There seems to be another difference, though. Grep's NEWS says:
So not only the format changed, the messages goes to stderr and can be silenced with |
Yes, I haven't decided whether to follow suit or not. |
Right now when I try to search through a file that contains non-printable characters, rg will just exit as if there were no matches, because it skips binary files. GNU grep reports
I don't know if it is planned for ripgrep to search binary files as well at some point, for cases where you just want to know whether there is a match, not where; but either way the current behaviour seems a little bit confusing. In case rg is not meant to ever support binary files, maybe it could print a message like
if there's only one file to be searched and it's recognized as a binary file (or maybe even if there's multiple file arguments but all seem to be binary).
The text was updated successfully, but these errors were encountered: