(regex) Segmentation fault: 11 #922
Comments
This is due to incorrectly decoding the width of 3-byte UTF-8 characters: https://github.com/stedolan/jq/blob/370833d55573a223b60ea51b4cea7b6c0326e030/jv_unicode.c#L62
Ouch. Proposed fix:
Actually, this probably needs to deal with invalid sequences (not alias them to 4 bytes).
Hmm, actually, we don't need to deal with invalid sequences, since these should be validated strings.
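For readers following along, the width decoding under discussion can be sketched as below. This is a minimal Python illustration of the standard UTF-8 lead-byte rule, not the code in jv_unicode.c and not the proposed fix; the function name `utf8_width` is made up for this example. Like the comments above, it assumes the input has already been validated, so it only distinguishes the four lead-byte patterns and rejects anything else.

```python
def utf8_width(lead: int) -> int:
    """Return the length in bytes of the UTF-8 sequence starting with `lead`.

    Hypothetical helper for illustration; assumes `lead` is the first byte
    of a sequence taken from an already validated UTF-8 string.
    """
    if lead & 0x80 == 0x00:   # 0xxxxxxx: ASCII, 1 byte
        return 1
    if lead & 0xE0 == 0xC0:   # 110xxxxx: 2-byte sequence
        return 2
    if lead & 0xF0 == 0xE0:   # 1110xxxx: 3-byte sequence
        return 3
    if lead & 0xF8 == 0xF0:   # 11110xxx: 4-byte sequence
        return 4
    # Continuation bytes (10xxxxxx) and 0xF8..0xFF are never lead bytes.
    raise ValueError(f"not a UTF-8 lead byte: {lead:#04x}")


if __name__ == "__main__":
    for ch in ["a", "\u00b2", "\u2026", "\U0001F600"]:
        encoded = ch.encode("utf-8")
        print(repr(ch), utf8_width(encoded[0]), len(encoded))
```

Each mask checks one more high bit than the pattern it matches, so a 3-byte lead such as 0xE2 cannot be mistaken for a 2-byte or 4-byte lead.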
Feel free to push. I gtg.
Thanks for the report @pkoppstein.
Input: this is OK when this is not. I have no idea if I am facing the same Unicode bug. I am using jqplay.org.
jqplay.org must not be using a version of jq that contains the fix.
The fix for this didn't make jq-1.5. The milestone is 1.5.1 and the commit log shows it's not in jq-1.5. We should probably prep a 1.5.1 or a 1.6 release.
#922 is a really serious bug that makes much of jq's functionality (e.g. nearly all regex-related functionality) effectively unusable, and hence makes jq effectively unusable if such functionality is needed.

#922 also makes silent (non-failure) errors leading to data corruption very probable. E.g. any transformations or queries of non-ASCII data that use sub, match, or capture are highly prone to silent data corruption. E.g. gsub replacements can cause catastrophically wrong transformations of the input, without any error or warning, and only for input containing codepoints of certain byte lengths AND at specific positions relative to the actual locations of the regex matches (making it that much harder for a user to notice the bug during testing). I.e. it is very easy to encounter such errors, but also very easy to miss them unless you test with the right combinations and sequences/positions of Unicode chars and regex patterns.

I suffered some serious data corruption (caused by jq #922) until the 'right' combination of data and jq filters led to some corruption that was catastrophic and non-silent. Until then, the silent corruption was very hard to detect. And even after, it was difficult to diagnose and predict the precise cause and effects.

RE: Workarounds:
It also seems impossible to work around the bug using only a jq library/module, since it appears to be impossible to use a jq def to override/hide a builtin that is implemented in 'jq.c'. Hence, even if all jq def builtins affected by #922 were overridden, any jq filters that use the jq.c builtins directly would still be affected. Hence, the only way to reliably prevent #922 from causing data loss seems to be to fix the bug in the C code.

STEPS TO REPRODUCE:
Note: In the following series, each step attempts to sub 1 ASCII char with "#" (also ASCII). However, the actual effects of the sub function differ dramatically (due to #922), depending on where in the string the first char is located/matched (i.e. where it is relative to non-ASCII, multi-byte chars). In particular, note the 2 cases where substantial portions of the string (38% & 62% in these examples) are simply dropped completely, representing massive data loss. Also, note the data corruption in many cases, where the '#' is incorrectly substituted for multiple chars (in multiple locations) instead of just 1 char.

jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"' |cc
jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[b]"; "#"; "ig")' |cc
jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[e]"; "#"; "ig")' |cc
jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[h]"; "#"; "ig")' |cc
jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[k]"; "#"; "ig")' |cc
jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[n]"; "#"; "ig")' |cc
jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[q]"; "#"; "ig")' |cc
jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[t]"; "#"; "ig")' |cc
jq -n '"abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno \u201c pqr \u00b6 stu \u00b3 vxy"|sub("[x]"; "#"; "ig")' |cc
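To see why the position of the match relative to the multi-byte chars matters, it can help to compare codepoint offsets with UTF-8 byte offsets for the test string used in the steps above. The following is a minimal Python sketch, not jq code; the helper names `offsets` and `byte_to_codepoint` are made up for illustration, and it assumes (as the diagnosis earlier in this thread suggests) that match positions start out as byte offsets and must be converted to codepoint offsets by stepping over each character's width.

```python
# Test string from the reproduction steps (45 codepoints, 54 UTF-8 bytes).
S = ("abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno "
     "\u201c pqr \u00b6 stu \u00b3 vxy")


def offsets(text: str, ch: str) -> tuple[int, int]:
    """Return (codepoint offset, UTF-8 byte offset) of the first `ch`."""
    cp = text.index(ch)
    byte = len(text[:cp].encode("utf-8"))
    return cp, byte


def byte_to_codepoint(data: bytes, byte_off: int) -> int:
    """Convert a byte offset into valid UTF-8 `data` to a codepoint offset."""
    return len(data[:byte_off].decode("utf-8"))


if __name__ == "__main__":
    data = S.encode("utf-8")
    for letter in "behknqtx":
        cp, byte = offsets(S, letter)
        # A correct conversion must map the byte offset back to `cp`.
        assert byte_to_codepoint(data, byte) == cp
        print(letter, "codepoint offset:", cp, "byte offset:", byte)
```

For `b` the two offsets coincide, because only ASCII precedes it. For `k` the codepoint offset is 19 while the byte offset is 23, since two 2-byte chars and one 3-byte char come first. Any miscounted width therefore shifts every later position, which is consistent with the damage depending on where the match falls.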
This issue (#922) was CLOSED once a fix had been installed in the "master" version of jq. In any case, using the current version of "master", I have verified that all the test cases in your post pass. Thank you for providing them. If your point is that the latest official numbered release (currently jq 1.5) does not include this fix, then it might help to make that explicit.
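What "pass" should mean for those test cases can be stated independently of jq: each substitution changes exactly one character and leaves the string length unchanged. The sketch below uses Python's `re` module as a stand-in reference, on the assumption that it agrees with jq for these simple single-letter character classes; the helper names `reference_sub` and `changed_positions` are hypothetical.

```python
import re

# Test string from the reproduction steps earlier in this thread.
S = ("abc \u00b2 def \u00a9 ghi \u2026 jkl \u00ae mno "
     "\u201c pqr \u00b6 stu \u00b3 vxy")


def reference_sub(text: str, letter: str) -> str:
    """Replace every case-insensitive match of [letter] with '#'."""
    return re.sub("[" + letter + "]", "#", text, flags=re.IGNORECASE)


def changed_positions(before: str, after: str) -> list[int]:
    """Return the codepoint indices at which two equal-length strings differ."""
    if len(before) != len(after):
        raise ValueError("lengths differ: data was dropped or added")
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]


if __name__ == "__main__":
    for letter in "behknqtx":
        out = reference_sub(S, letter)
        # Exactly one position changes, and it is where the letter was.
        assert changed_positions(S, out) == [S.index(letter)]
        print(out)
```

Output from a jq build can be compared against these reference strings: a result of a different length, or one with more than one changed position, would match the data loss and corruption described in the report.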
Can we please get an update and ETA on when an official release containing this fix will be released? I couldn't find any info about plans or dates for any releases past 1.5 (aside from the unexplained cancellation of v1.5.1). Given that #922 has now been fixed for 2 years, and was already scheduled for a previous release (1.5.1), it seems like it shouldn't be that much work to do at least a 1.5.1 release. And not releasing the fix is a showstopper for many. FYI: The use of a custom build (e.g. from the master branch) is unfeasible or forbidden in many organizations (e.g. due to policy restrictions related to security, legal, and/or technical reasons). And as mentioned above, it appears impossible to implement any reliable workaround for #922 without either a new release or an (unfeasible) custom build. (Sorry for not being explicit enough above. You pre-empted my planned follow-up by 3 minutes. :) )
This seems pretty ridiculous: