GNU grep can't match foreign language characters and outputs everything #3010

Closed
vovcacik opened this issue Nov 2, 2018 · 24 comments · Fixed by #3060
Labels
bug report Something is not working properly help wanted Help is wanted in order to solve the issue

Comments

@vovcacik

vovcacik commented Nov 2, 2018

Hi, I noticed that GNU grep has trouble matching Czech characters, so it outputs more lines than it should.

Reproducible example:

### Setup

# upgrade grep from busybox to gnu
pkg install grep

# I also installed coreutils, but not sure if it is relevant
pkg install coreutils

# libandroid-support is supposed to extend locale support in Bionic, but it had no effect on my use case
# I tried with and without it
pkg install libandroid-support

# restart bash session; I also rebooted


### Test grep
LANG=cs_CZ.UTF-8
LC_ALL=cs_CZ.UTF-8
# I also tried "CODESET=cs_CZ.UTF-8" because I saw it in another issue; probably not relevant

echo bar > test.txt
echo hezky česky >> test.txt
echo foo >> test.txt

# busybox grep returns one line as it should
cat test.txt | /data/data/com.termux/files/usr/bin/applets/grep česky

# gnu grep returns all the lines
cat test.txt | grep česky

Shortened output

-bash-4.4$ LANG=cs_CZ.UTF-8
-bash-4.4$ LC_ALL=cs_CZ.UTF-8
-bash: warning: setlocale: LC_ALL: cannot change locale (cs_CZ.UTF-8): No such file or directory

-bash-4.4$ echo bar > test.txt
-bash-4.4$ echo hezky česky >> test.txt
-bash-4.4$ echo foo >> test.txt

-bash-4.4$ cat test.txt | /data/data/com.termux/files/usr/bin/applets/grep česky
hezky česky

-bash-4.4$ cat test.txt | grep česky
bar
hezky česky
foo

I think the LANG variable alone should make this work, but I set LC_ALL along with it anyway. Unfortunately, setting LC_ALL fails (see the setlocale warning above), but that is a problem I reported separately in #3009.


tl;dr

| grep variant | android 7.1 @ arm | fornwall's device | android 8 @ aarch64 |
|---|---|---|---|
| (reported) | 2x | 1x | 1x |
| busybox grep | | | |
| busybox grep -E; egrep | | | |
| busybox grep -F | | | |
| gnu grep; grep -G | | | |
| gnu grep -E; egrep | | | |
| gnu grep -F, fgrep | | | |
| gnu grep -P | | | |
| freebsd /system/bin/grep; grep -G | | | |
| freebsd /system/bin/grep -E; egrep | | | |
| freebsd /system/bin/grep -F; fgrep | | | |
| freebsd /system/bin/grep -P (not supported) | | | |

@ghost

ghost commented Nov 2, 2018

LANG=cs_CZ.UTF-8
LC_ALL=cs_CZ.UTF-8

@vovcacik We don't have locale support.

@vovcacik
Author

vovcacik commented Nov 3, 2018

I don't want to make this issue about the LC_ALL (there is #3009 for that), but rather about the LANG variable and about the gnu grep.

It seems that grep sees the UTF-8 bytes of č (0xC4 0x8D) correctly:

grep --color='auto' -P -n "[\x80-\xFF]" test.txt
grep --color='auto' -P -n "[\x80-\xFF]+" test.txt
grep --color='auto' -P -n "[\xC4]" test.txt
grep --color='auto' -P -n "[\x8D]" test.txt

[screenshot: 2018-11-03_154149_ivymrcr]
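
For reference, the byte values can also be checked outside grep. This is a minimal sketch, assuming a POSIX shell with `od` and `mktemp` available; the scratch directory and the `bytes` variable are illustrative names that do not come from the report.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
tmp=$(mktemp -d)

# \304\215 is č written as its two UTF-8 bytes in octal (0xC4 0x8D).
printf 'hezky \304\215esky\n' > "$tmp/test.txt"

# Dump the line as hex bytes and squeeze od's column padding to single spaces.
bytes=$(od -An -tx1 "$tmp/test.txt" | tr -s ' ')
echo "$bytes"

rm -r "$tmp"
```

The dump should contain the pair `c4 8d` between the bytes for the space and the following `e`.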

@ghost

ghost commented Nov 3, 2018

UTF-8 support and locales are different things.
libandroid-support doesn't install locale files for grep and other packages.

@vovcacik
Author

vovcacik commented Nov 4, 2018

No doubt about that.

I am afraid that because of

made you think this is also a locale problem. I am not saying it is not, but notice that grep č test.txt does not itself require a locale. It is basically just character matching, and everything is in UTF-8, so I don't see what problem grep is having.

An example of a regex operation that does require a locale would be grep "[a-k]" test.txt, since a range expression depends on the locale's collating sequence to determine which characters fall between "a" and "k".
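
The first half of that argument can be illustrated with a small sketch. It assumes a POSIX shell and a grep that supports `-F` and `-c`; the č is spelled as its two UTF-8 bytes so the script does not depend on the terminal's encoding, and `LC_ALL=C` selects byte-wise handling so that no locale data is involved. The variable names are illustrative.

```shell
# Recreate the test file in a scratch directory; \304\215 is č in UTF-8.
tmp=$(mktemp -d)
printf 'bar\nhezky \304\215esky\nfoo\n' > "$tmp/test.txt"

# The pattern is the same two bytes, so a fixed-string search is a plain byte comparison.
pattern=$(printf '\304\215')

# -c prints only the number of matching lines.
matches=$(LC_ALL=C grep -F -c "$pattern" "$tmp/test.txt")
echo "matching lines: $matches"

rm -r "$tmp"
```

With a correctly working grep this should report a single matching line, because only the second line contains those two bytes.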

@vovcacik
Author

vovcacik commented Nov 4, 2018

More findings:

-bash-4.4$ /data/data/com.termux/files/usr/bin/applets/grep č test.txt
hezky česky
-bash-4.4$ /data/data/com.termux/files/usr/bin/applets/grep -F č test.txt
hezky česky
-bash-4.4$ /data/data/com.termux/files/usr/bin/applets/grep -E č test.txt
hezky česky
-bash-4.4$ grep č test.txt
bar
hezky česky
foo
-bash-4.4$ grep -G č test.txt
bar
hezky česky
foo
-bash-4.4$ grep -E č test.txt
bar
hezky česky
foo
-bash-4.4$ grep -P č test.txt
hezky česky
-bash-4.4$ grep -F č test.txt
hezky česky

I guess the success of grep -F is not that surprising after my previous comment; the success of grep -P definitely is.

So I've got my workaround, feel free to close if you don't consider this a bug.
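
The comparison above can be repeated in one go with a small loop. This is a sketch that assumes GNU grep (for `-G`) and a POSIX shell; `compare_matchers` is a hypothetical helper name, and `-P` is left out because not every grep build supports it.

```shell
# Hypothetical helper: print how many lines each matcher selects.
# $1 is the pattern, $2 is the file to search.
compare_matchers() {
    for mode in -G -E -F; do
        # -c prints the number of matching lines instead of the lines themselves.
        printf 'grep %s: %s\n' "$mode" "$(grep "$mode" -c "$1" "$2")"
    done
}

tmp=$(mktemp -d)
printf 'bar\nhezky \304\215esky\nfoo\n' > "$tmp/test.txt"

# \304\215 is č written as its UTF-8 bytes.
compare_matchers "$(printf '\304\215')" "$tmp/test.txt"

rm -r "$tmp"
```

On an unaffected system every matcher should report one line; on the affected device the transcript suggests `-G` and `-E` would report three.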

@vovcacik vovcacik changed the title GNU grep is can't match foreign language characters and outputs all GNU grep can't match foreign language characters and outputs everything Nov 4, 2018
@fornwall
Member

@vovcacik Thanks a lot for reporting! I'm unable to reproduce it on a device I tested with just now. The below transcript indicates that I cannot reproduce your problem, right?

localhost$ echo bar > test.txt
localhost$ echo hezky česky >> test.txt
localhost$ echo foo >> test.txt
localhost$ cat test.txt | grep česky
hezky česky
localhost$ cat test.txt | busybox grep česky
hezky česky
localhost$ which grep
/data/data/com.termux/files/usr/bin/grep

As seen, both busybox grep and GNU grep correctly find only the matching line. This is regardless of whether I set LANG=cs_CZ.UTF-8 LC_ALL=cs_CZ.UTF-8 or not.

Some things to try:

  1. Update to latest packages with pkg up if you haven't already done so.
  2. Try running grep without environment variables set and see if that makes a difference (cat test.txt | env -i grep česky).

Does that make a change? If not, could you paste the output from running termux-info here, as it may be specific to arch/android version/device?

@fornwall fornwall added bug report Something is not working properly help wanted Help is wanted in order to solve the issue labels Nov 11, 2018
@vovcacik
Author

Yes, it appears alright on your device. You could maybe double check that you are running gnu grep from /data/data/com.termux/files/usr/bin/grep, but I don't see why you wouldn't.

I'll try the suggestions as soon as possible and get back to you.

@vovcacik
Author

vovcacik commented Nov 11, 2018

I followed the suggestion to run pkg up, restarted the ssh session and tried env -i. It didn't help with GNU grep: with an empty environment it is the FreeBSD /system/bin/grep that gets executed instead.

  • setup
$ echo foo > test.txt
$ echo hezky česky >> test.txt
$ echo bar >> test.txt
  • performing grep
$ cat test.txt | grep česky
foo
hezky česky
bar
$ cat test.txt | busybox grep česky
hezky česky
$ cat test.txt | env -i grep česky
hezky česky
  • identifying grep
$ which grep
/data/data/com.termux/files/usr/bin/grep
$ sha1sum `which grep`
48e865431d5ceffc4cc414885560dd5c4b831f2a  /data/data/com.termux/files/usr/bin/grep
$ env -i which grep
$ env -i /data/data/com.termux/files/usr/bin/applets/which grep
/system/bin/grep
  • identifying grep again
$ grep --version
grep (GNU grep) 3.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
$ env -i grep --version
grep (BSD grep) 2.5.1-FreeBSD
$ busybox grep --version
busybox: unrecognized option `--version'
BusyBox v1.29.3 (2018-09-10 22:47:04 UTC) multi-call binary.
  • more info for you
$ termux-info
Updatable packages:
All packages up to date
System information:
Linux localhost 3.4.0-eXcaliBur+ #1 SMP PREEMPT Thu Jul 20 00:11:26 EDT 2017 armv7l Android
Termux-packages arch:
arm
Android version:
7.1.2
Device manufacturer:
OnePlus
Device model:
One
$ env
LD_LIBRARY_PATH=/data/data/com.termux/files/usr/lib
SSH_CONNECTION=192.168.1.5 60264 192.168.1.10 8022
LANG=en_US.UTF-8
PREFIX=/data/data/com.termux/files/usr
USER=u0_a83
PWD=/data/data/com.termux/files/home/__test
HOME=/data/data/com.termux/files/home
SSH_CLIENT=192.168.1.5 60264 8022
TMPDIR=/data/data/com.termux/files/usr/tmp
SSH_TTY=/dev/pts/3
SHELL=/data/data/com.termux/files/usr/bin/bash
TERM=xterm
SHLVL=1
ANDROID_ROOT=/system
ANDROID_DATA=/data
LOGNAME=u0_a83
EXTERNAL_STORAGE=/sdcard
PATH=/data/data/com.termux/files/usr/bin:/data/data/com.termux/files/usr/bin/applets
LD_PRELOAD=/data/data/com.termux/files/usr/lib/libtermux-exec.so
OLDPWD=/data/data/com.termux/files/home
_=/data/data/com.termux/files/usr/bin/env
  • grepping through $PATH dirs
$ ls -la /data/data/com.termux/files/usr/bin | grep grep
-rwx------ 1 u0_a83 u0_a83      59 Jul 10  2017 egrep
-rwx------ 1 u0_a83 u0_a83      59 Jul 10  2017 fgrep
-rwx------ 1 u0_a83 u0_a83  129180 Jul 10  2017 grep
$ ls -la /data/data/com.termux/files/usr/bin/applets | grep grep
lrwxrwxrwx 1 u0_a83 u0_a83    10 Nov 11 17:47 egrep -> ../busybox
lrwxrwxrwx 1 u0_a83 u0_a83    10 Nov 11 17:47 grep -> ../busybox
lrwxrwxrwx 1 u0_a83 u0_a83    10 Nov 11 17:47 pgrep -> ../busybox
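
The difference between `grep` and `env -i grep` in the transcript above comes down to command lookup: `env -i` starts the command with an empty environment, so PATH is unset and the lookup falls back to a system default, which on this device apparently does not include the Termux directories. The following is a minimal sketch for checking which binary is picked in each case, assuming a POSIX shell; the results depend on the device, so no output is shown.

```shell
# Which grep the current shell would run, given the current PATH.
command -v grep

# env -i clears every variable, PATH included, so this should print nothing.
env -i env

# With PATH unset, sh falls back to its built-in default search path,
# which may resolve grep to a different binary than the line above.
env -i sh -c 'command -v grep'
```

Running `grep --version` with the full path printed by each lookup then identifies the implementation, as done in the transcript.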

@Grimler91
Member

I can confirm the problem on arm and android 7.1, but busybox grep works as intended

@ghost

ghost commented Nov 11, 2018

[screenshot: screenshot_20181111-200705_termux]

On AArch64 and Android 8 no problem with gnu grep.

@tomty89
Contributor

tomty89 commented Nov 17, 2018

@vovcacik Just being curious, what do

grep -n česky test.txt
grep -no . test.txt

give you?

@vovcacik
Author

@tomty89 Interesting. grep switches to binary mode and stops printing the rest of the line when it hits č:

$ echo foo > test.txt
$ echo hezky česky >> test.txt
$ echo bar >> test.txt
$ cat test.txt
foo
hezky česky
bar
$ cat test.txt | grep č
foo
hezky česky
bar
$ grep -n česky test.txt
1:foo
2:hezky česky
3:bar
$ grep -no . test.txt
1:f
1:o
1:o
2:h
2:e
2:z
2:k
2:y
2:
3:b
3:a
3:r
Binary file test.txt matches

But I can't say whether this is expected or not.
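
On the "Binary file test.txt matches" line: according to the GNU grep manual, grep prints that message when it decides the input is not text, and bytes that are improperly encoded for the current locale count as non-text. The `-a` option (`--binary-files=text`) forces text handling. This is a minimal sketch assuming a grep that supports `-a`, `-n`, `-o` and `-c`; whether `-a` would have changed the output on the affected device is not something this thread shows.

```shell
tmp=$(mktemp -d)
# \304\215 is č written as its UTF-8 bytes.
printf 'foo\nhezky \304\215esky\nbar\n' > "$tmp/test.txt"

# -a treats the file as text even if some bytes look invalid to grep.
# LC_ALL=C selects byte-wise matching, so no UTF-8 decoding is involved.
LC_ALL=C grep -a -no . "$tmp/test.txt"

# Count the lines that contain at least one character.
lines=$(LC_ALL=C grep -a -c . "$tmp/test.txt")
echo "lines with a match: $lines"

rm -r "$tmp"
```

All three lines should be counted, and no "Binary file" notice should appear.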

@tomty89
Copy link
Contributor

tomty89 commented Nov 17, 2018

Hmm, looks like it's even more messed up than I thought. (I had assumed the newlines were being ignored for some reason, for example because multiple characters were treated as a single character.)

Now I wonder if grep -no česky test.txt gives the same output as grep -no . test.txt does...

@ghost

ghost commented Nov 17, 2018

> looks like it's even more messed up than I thought

Not sure grep itself is at fault. More likely this is a libandroid-support (or libc-specific) problem, as I can reproduce the issue on Android 5.1 (x86_64 AVD) but not on Android 9 (x86_64 AVD). It also never happens on my AArch64 device.

> Now I wonder if grep -no česky test.txt gives the same output as grep -no . test.txt does...

Output is different. See output of these 2 commands on Android 5.1 (x86_64):

[screenshot: a51]

On Android 9 (x86_64) it seems okay, though:
[screenshot: a90]


Busybox's grep -no . test.txt:
[screenshot: a90_bb]

@ghost

ghost commented Nov 17, 2018

@vovcacik @tomty89 I guess this PR will fix that: #3060. At least it worked for me. I can provide *.deb files so you can test it yourself.

@tomty89
Contributor

tomty89 commented Nov 17, 2018

I know. Most likely it's some old bionic bug.

> Output is different.

That actually makes the problem look even more irrational. Seems like grep ignores newlines, but only in a peculiar manner? (Partially ignoring them when doing the final output but not when matching?)

Not sure if it's relevant, but I can't make grep in Termux do what's in your second post. In Arch (proot) I can make that happen by unsetting LANG or setting it to C. It seems Termux is always UTF-8.

@tomty89
Contributor

tomty89 commented Nov 17, 2018

@xeffyr If depending on libandroid-support fixes it, I wonder whether this is a duplicate of #3047.

@ghost

ghost commented Nov 17, 2018

When the libandroid-support dependency is set, the ./build-package.sh script appends its include directory to CPPFLAGS and links against the library:

    if [ "$TERMUX_PKG_DEPENDS" != "${TERMUX_PKG_DEPENDS/libandroid-support/}" ]; then
        # If using the android support library, link to it and include its headers as system headers:
        CPPFLAGS+=" -isystem $TERMUX_PREFIX/include/libandroid-support"
        LDFLAGS+=" -landroid-support"
    fi
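
The condition in that snippet is a substring test: `${TERMUX_PKG_DEPENDS/libandroid-support/}` is the dependency string with the first occurrence of `libandroid-support` removed, so it differs from the original only when the package lists that dependency. Below is a minimal sketch of the same test written with `case`, which also works in shells that lack that bash expansion. The function name `support_flags` and the sample dependency strings are hypothetical, and the two flag variables are merged into one output line for brevity.

```shell
# Stand-in value for the prefix variable used by the build script.
TERMUX_PREFIX=/data/data/com.termux/files/usr

# Hypothetical helper: print the extra flags a package would get
# for the dependency string passed as $1.
support_flags() {
    case "$1" in
        *libandroid-support*)
            # Include path (CPPFLAGS part) followed by the link flag (LDFLAGS part).
            echo "-isystem $TERMUX_PREFIX/include/libandroid-support -landroid-support"
            ;;
        *)
            echo ""
            ;;
    esac
}

support_flags "pcre, libandroid-support"   # prints the include and link flags
support_flags "pcre"                       # prints an empty line
```

A package that does not list the dependency gets neither flag, which matches the behaviour described in the thread before the fix.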

@ghost

ghost commented Nov 17, 2018

grep packages for testing:

@tomty89
Contributor

tomty89 commented Nov 17, 2018

I know, which is silly. I don't see any reason why we should have a symlink for one of the headers but not the other (and explicitly depend on libandroid-support package by package when we notice a problem). In fact, I'm not sure there's a good reason for not putting them directly under include/.

@fornwall
Member

The updated 3.1-1 version of the grep package, now available for installation, should fix this.

@tomty89 Agreed, this whack-a-mole of adding libandroid-support when a problem pops up is a bit silly.

@vovcacik
Author

It's fixed, thank you!

@Ferdi265

Ferdi265 commented Jun 7, 2021

It seems this is either still not fully fixed or it regressed:

GNU grep fails on ö while busybox grep succeeds.

[screenshot: Screenshot_20210607-033414__01]

@peternowee

See #5171.

@ghost ghost locked and limited conversation to collaborators Oct 9, 2021