Output format with parseable match indices? #244
For integration in other tools (editors...) it would be great to have a parseable output format that includes the start and end index for each match in a line. You can do it with colors (but see #242) but it's not really what colors are meant for.

ag has --ackmate, which outputs something like this:
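Roughly (a sketch of the ackmate style from memory, not an exact transcript of ag's output): matches are grouped under a `:path` header, and each match line is `line;start length[,start length]:text`:

```
:src/foo.c
18;2 5,19 5:a match here and a match there
```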
It seems to be what ack clones have. Convenience is probably not that important here, although I like that it avoids text volume because file names are not repeated. There is no specification that I know of.
They are pretty close, including the reduction in volume by avoiding repeated file paths.
@birkenfeld Could you say more about why you specifically need this? I've never done an editor integration, so I don't really know. However, I do know that lots of folks have already integrated ripgrep into emacs and vim, and this is the first I'm hearing that the existing output is insufficient.
The "orthogonal" way to do this I guess is to add a
I think I'd prefer this over hacking |
The goal is highlighting matched parts of a line in an editor-specific way (e.g. faces in Emacs). The ripgrep Emacs package once had this feature, but relied on parsing color escapes, which changed in 0.3. A vimgrep-like format including match length is possible, but harder to handle, because usually you'd want to display a matching line only once but highlight multiple matches in it where appropriate. With an ackmate-like format, which specifies all matches for a line in a single record (see initial post), much less postprocessing of the output is required. cc @nlamirault (the ripgrep.el author)
Wow. I didn't even consider that as a breaking change. And probably still don't. Relying on color escapes sounds like really bad juju! But I understand why someone might do it if there's no other way!
OK, this is a key thing I was missing. I guess in vim, the opposite is true.
Blech. I guess I begrudgingly accept that we should do this. I will attempt to put this into the current printer, but it's not going to be nice, and hopefully I can rethink it for libripgrep. The key problem with adding stuff to the printer is that you need to consider its interaction with every other output flag.
The printer isn't that big; it's just subtle. You can skim it and get an idea of the type of case analysis involved here: https://github.com/BurntSushi/ripgrep/blob/master/src/printer.rs It is used directly from the searcher.
Another thing to consider with respect to JSON is that I was thinking about a "JSON lines" format. But I see the proposal from @garygreen above is not. The benefit of a JSON lines approach is that each match can be emitted as it appears. If we instead emit one big JSON blob, then all of the matches must be buffered into memory before we get any output at all.
I was having the same thoughts. I think a "JSON lines" format is an excellent idea and would allow you to stream the results in a consistent format. Maybe this could be like:

Start of query:

```json
{
  "type": "begin_search",
  "token": "<some unique token string>",
  "query": "ipaddress",
  "paths": ["path1", "path2"]
}
```

A line match:

```json
{
  "type": "line_match",
  "token": "<token>",
  "file": "E:\\Sites\\cc-new\\app\\Member\\BannedMember.php",
  "line": 34,
  "content": "'ipaddress' => Request::getClientIp(),",
  "ranges": [
    [1, 9]
  ]
}
```

End of query:

```json
{
  "type": "end_search",
  "token": "<token>",
  "stats": {
    "total_matches": 2,
    "files_searched": 102
  }
}
```
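For illustration, a consumer of such a stream can act on each match the moment it arrives. A minimal Python sketch, assuming the hypothetical message shape proposed above arrives one JSON object per line on stdin (none of these field names are a shipped format):

```python
import json
import sys

# Read a (hypothetical) JSON-lines stream of search results from stdin.
# Because each line is a complete JSON document, every match can be
# handled as soon as it is emitted -- nothing has to be buffered.
for raw in sys.stdin:
    msg = json.loads(raw)
    if msg["type"] == "line_match":
        print(f'{msg["file"]}:{msg["line"]}: {msg["content"]}')
```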
Interesting wiki on JSON streaming: https://en.wikipedia.org/wiki/JSON_streaming
JSON output would be very useful indeed for editor integration. e.g. in Emacs one could use the built-in JSON-parsing functions, which are written in C, rather than doing regexp-parsing of ripgrep's output. Does JSON output deserve a separate issue, or should this one be commandeered for it? :)
For the people who desire JSON output, how would you handle the fact that ripgrep's output is not necessarily guaranteed to be valid UTF-8? This is a problem for JSON because JSON requires its strings to be valid UTF-8 (or UTF-16 or UTF-32, but we can ignore those for the purposes of ripgrep and this problem). At a high level, I think there are three approaches ripgrep could take:

1. Silently drop any matches containing invalid UTF-8.
2. Lossily encode any invalid UTF-8 using the Unicode replacement codepoint.
3. Come up with a tagged union representation that represents matches that are valid UTF-8 as standard JSON strings and matches that are invalid UTF-8 as base64 encoded JSON strings.
(1) kind of stinks, for obvious reasons.

(2) sounds nice on the surface, but I also envision the JSON output containing match offsets for individual matches found within the line. It seems doable to update those offsets such that they are correct for the lossily encoded match string. (e.g., A single `\xFF` byte would be replaced by `\xEF\xBF\xBD`, which is the UTF-8 encoding of U+FFFD.) This seems annoying to me. This also seems less than ideal, since I could imagine that consumers might want to use those match offsets to go find something in the original file, which wouldn't work unless we yielded two sets of match offsets: one for the string provided in the JSON and another for the original string found in the file. Again, that's annoying.

(3) feels like it might be the best solution. It preserves, byte-for-byte, the original match and also makes it easy for callers to choose their own behavior. e.g., They could ignore matches that aren't valid UTF-8. Here's a straw man JSON representation for matches that are valid UTF-8:

```json
{
  "text": "the match",
  "bytes": null
}
```

and now for matches that are not valid UTF-8:

```json
{
  "text": null,
  "bytes": "dGhl/21hdGNo"
}
```

where `dGhl/21hdGNo` is base64 for `the\xFFmatch`.
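To see why the offsets in (2) get hairy, here is a small Python demonstration (not ripgrep code) of how lossy decoding shifts byte offsets:

```python
raw = b"the\xffmatch"                    # original line bytes
lossy = raw.decode("utf-8", "replace")   # "the\ufffdmatch"

# "match" starts at byte 4 of the original line...
assert raw.index(b"match") == 4
# ...but at byte 6 once the lossy string is re-encoded, because the
# single \xff byte became the three-byte sequence \xef\xbf\xbd (U+FFFD).
assert lossy.encode("utf-8").index(b"match") == 6
```

So a single set of offsets cannot be correct for both the original file and the lossily encoded string at the same time.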
I like (2) and (3). Consumers writing a simple script may want to display _something_ useful in all cases (this is my typical use case). More sophisticated consumers may want something completely correct and accurate (e.g. IDE integration or editing the file).

How about providing both UTF-8 encoded and raw bytes, one for each consumer? Or a command line flag for one or the other.
I doubt I'll do both. Doing both would require computing match indices for both. In general, "just do everything" isn't a tenable strategy moving forward because of the maintenance costs it implies. A benefit of option (3) is that it is very simple to implement, very simple to get correct, and very hard to use incorrectly. I'm inclined to say that "writing a simple script that handles all cases" ceases to be simple, and that you should just base64 decode the raw bytes and do your own lossy decode if that's what you want.
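That consumer-side decode stays small. A Python sketch using the straw-man field names from the union above (these field names are from the straw man, not a shipped format):

```python
import base64

def match_text(m):
    # Straw-man union: exactly one of "text" (valid UTF-8) or
    # "bytes" (base64 of the original bytes) is non-null.
    if m["text"] is not None:
        return m["text"]
    raw = base64.b64decode(m["bytes"])            # exact original bytes
    return raw.decode("utf-8", errors="replace")  # lossy, display-only

print(match_text({"text": "the match", "bytes": None}))     # the match
print(match_text({"text": None, "bytes": "dGhl/21hdGNo"}))  # the\ufffdmatch
```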
Fair points. Then (3) is easy to implement and imposes only a small cost for the simple script consumer; sounds like a great option.
For folks interested in extracting machine-readable descriptions of search results from ripgrep, would you want to take a quick look at my proposed JSON wire format and see whether it looks reasonable to you? cc @roblourens I'm especially interested in your opinion, since I think you're my biggest machine consumer, and I suspect you'd be happy to stop parsing ANSI escape sequences. :-)
That is some awesome work Andrew. Thank you so much for all your time and effort on this; it's much appreciated. I'm also interested in seeing what Rob makes of it, as I'm sure vscode would love to consume the JSON results rather than doing some manual parsing. Anyway, I've had a look over the proposed spec, if you're interested in hearing my thoughts, and it looks perfect, though I have made a few observations as below.

```json
"submatches": [
  {"match": {"text": "Watson"}, "start": 15, "end": 21}
]
```

Would this be simpler?

```json
"submatches": [
  {"text": "Watson", "start": 15, "end": 21}
]
```
For example:

```json
{
  "type": "begin",
  "token": "nBeqhQgeljaMnkq2J1h6",
  "data": {
    "path": {"text": "/home/andrew/sherlock"}
  }
}
{
  "type": "context",
  "token": "nBeqhQgeljaMnkq2J1h6",
  "data": {
    "lines": {"text": "can extract a clew from a wisp of straw or a flake of cigar ash;\n"},
    "line_number": 4,
    "absolute_offset": 193,
    "submatches": []
  }
}
```

This would ensure that you are consuming results for the right search - for example, if you're running searches in parallel on the same terminal but for different queries, simply saying "path" won't allow you to group the results easily. With a unique token approach (which can just be a random string of characters, to make collisions extremely unlikely), you can group those results and know exactly what search they are for. I'm sure this would also be useful for vscode and other consumers as well.
@garygreen Awesome, thanks for taking a look and giving feedback!
Yes.
Hmm, good question. It seems reasonable to guarantee an ordering here. I will add that to the docs.
I think I at least want to have a path in every message, which should be much more convenient for consumers. If you don't include a path, then it makes it necessary for even simple use cases like "do something with a match and its file path" to track state. I've thought about your identifier idea. I am not opposed to it. But I would like to see how far we can get without it and just use file paths, which are something I think should be in every message anyway. If you're worried about file paths adding too much redundancy to the output, then it is plausible that we could make ripgrep's output leaner in that respect later.
I'm not sure I'm convinced of the prevalence of this use case. Generally speaking, you don't exploit parallelism by running multiple ripgrep processes. Instead, you should let ripgrep handle parallelism for you. Also, how is this token generated? Is the onus on ripgrep to ensure that every token it generates is unique? Probably a uuidv4 is good enough for that.
I think this looks great, and I am very excited to delete the current parsing code 😁 In developing vscode's search provider API, I've thought about matches in very long lines, or very long matches. The current version of the API has result objects that include a "preview" of the match, not necessarily the whole line. Then the client will be able to request a preview of a certain size in some TBD format, such as number of characters before and after the match, or just total number of characters. I wonder whether something like that would be useful for ripgrep too. If you have a short match in a very long line, the consumer doesn't necessarily need the full text of the line. On the other hand, if you have several matches in a long line, you won't be duplicating the full line text, which isn't true in vscode's model. If the full text of the line is present, then technically you don't need to include the match text in 'submatches', since it can be easily derived from the line text and the start/end offsets. I like the approach for non-utf8 encodings, I think that's a good idea.
@roblourens That's great feedback, thank you!
Ah this is interesting! My current thinking is to just let consumers such as yourself sort this out, but if the volume of data turns out to be a performance bottleneck, then we can certainly add some backwards compatible knobs for this going forward.
That's correct, yeah. The text in each submatch is strictly superfluous. It's mostly there as a convenience. We can expose knobs for this too if performance is a problem. @roblourens @garygreen If you'll allow me to summarize, to make sure we're on the same page.
Interesting observation about matching against a huge line. I can see it being a concern for people who match against minified files (god knows why though). I'm unfamiliar with how ripgrep currently handles those matches; I would assume it just outputs a massive line showing the match. It's not really a problem specifically related to the JSON formatter, though. If it does become an issue, then I'm sure we could consider adding some kind of "--match-bytes-around=100" type option which would give you 100 bytes around the matches. Again, not sure how ripgrep handles that currently, so it might already be available.
Yup, this is done.
Definitely going to punt on this until a specific use case for it can be evaluated. I'm trying to focus on shipping the (nearly) minimal subset of useful things here that will let folks migrate off of parsing ANSI escapes without losing or regressing on any features.
You generally know it's done because the process exits and the stdout pipe closes, at which point the consumer reads EOF. But yes, the wire format presented thus far is the library-ized format for a single search. ripgrep will add at least one message, probably called summary, emitted once at the end of the entire run.
Oh yes, absolutely. Matches are printed in the order in which they appear. To do otherwise would be crazytown. :-)
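For reference, the JSON output that ultimately shipped in ripgrep (`rg --json`) is a JSON Lines stream shaped roughly like this, with begin/match/end messages per file and a final summary for the whole run (stats objects abbreviated here; real output carries more fields):

```json
{"type":"begin","data":{"path":{"text":"/home/andrew/sherlock"}}}
{"type":"match","data":{"path":{"text":"/home/andrew/sherlock"},"lines":{"text":"For the Doctor Watsons of this world, as opposed to the Sherlock\n"},"line_number":1,"absolute_offset":0,"submatches":[{"match":{"text":"Watson"},"start":15,"end":21}]}}
{"type":"end","data":{"path":{"text":"/home/andrew/sherlock"},"binary_offset":null,"stats":{"matched_lines":1,"matches":1}}}
{"type":"summary","data":{"elapsed_total":{"secs":0,"nanos":74210,"human":"0.0001s"},"stats":{"matched_lines":1,"matches":1}}}
```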
Sounds good. And I agree with what you say about large lines - in vscode's case at least, it's the same as what we're doing already, and getting large lines from ripgrep definitely isn't the bottleneck.
I'm just starting to integrate this project into my own editor, and this is the first problem I ran into. Great project, thanks for making it. I was 2 days into reinventing the wheel badly when I came across it.
@jessegrosjean Could you elaborate more on the problem? I'm having trouble understanding it. You're saying that you ran into performance problems? Is it possible to reproduce them outside the editor? What you're asking for really sounds like a new "preview" type feature, which is similar to, but different from, what has been discussed above. It might be smart to create a new issue for this. I doubt it will wind up in the initial support for JSON.
@BurntSushi I'm blown away at the speed of ripgrep in the terminal, and I'm trying not to lose that speed when running it from my app. I think the JSON format (as I understand it) makes large searches expensive: including a full line of context with each match is just too much bandwidth for some cases. My test case is pathological. With the default output: wow, fast! But with JSON output my computer starts to die; I kill the process after 20 seconds as it starts to run out of memory. I "think" the problem is that, unlike the default output, the JSON output includes the full text of every matched line.
I was asking for this, but I think you are correct, it's too complicated, and would still require too much overlapping bandwidth in some cases. Better, I think, is to just provide some options for what data is included in "match" (from the JSON API). What about an option to just omit the "match.lines" value? My app could then generate the initial list of results quickly (by omitting the "lines" values), and then lazily (as matches are scrolled through the view) load and highlight the actual matched text.
@jessegrosjean Like I said, I think you should open a new ticket, since we're getting way off the topic of JSON. I did this for you in #999 and gave you a response. TL;DR - I'm not convinced any specific action should be taken on the JSON API at this time. Let's tackle performance problems---if they exist---as they arise. |
The output looks great, I am very excited for this. |
This commit updates the CHANGELOG to reflect all the work done to make libripgrep a reality.

* Closes #162 (libripgrep)
* Closes #176 (multiline search)
* Closes #188 (opt-in PCRE2 support)
* Closes #244 (JSON output)
* Closes #416 (Windows CRLF support)
* Closes #917 (trim prefix whitespace)
* Closes #993 (add --null-data flag)
* Closes #997 (--passthru works with --replace)
* Fixes #2 (memory maps and context handling work)
* Fixes #200 (ripgrep stops when pipe is closed)
* Fixes #389 (more intuitive `-w/--word-regexp`)
* Fixes #643 (detection of stdin on Windows is better)
* Fixes #441, Fixes #690, Fixes #980 (empty matching lines are weird)
* Fixes #764 (coalesce color escapes)
* Fixes #922 (memory maps failing is no big deal)
* Fixes #937 (color escapes no longer used for empty matches)
* Fixes #940 (--passthru does not impact exit status)
* Fixes #1013 (show runtime CPU features in --version output)