UnicodeEncodeError using twarc utilities on Windows #297

Closed
eam5 opened this issue Sep 17, 2019 · 9 comments

Comments

@eam5

eam5 commented Sep 17, 2019

I'm getting encoding errors when running some twarc utilities on Windows. So far I get similar error messages when running users.py, emojis.py, json2csv.py, tags.py, and wall.py. That is everything I've tested so far; wordcloud.py seems to work fine.

Python 3.7.4

[Screenshot: encode-error]

I've tested the .jsonl file on my Mac, and all of the above utilities run successfully there.

@eam5
Author

eam5 commented Nov 4, 2019

An update to this: the solution I've found involves setting the system locale to UTF-8 (a beta feature in Windows 10). See here: https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window/57134096#57134096

This works when using the Windows Command Prompt to run the twarc utilities.

@edsu
Member

edsu commented Dec 5, 2019

Many of the utilities and the twarc command itself use the fileinput module to open the input files. This is nice because it means the program can easily read from one or more files, as well as stdin, as a single source of input. With the openhook=fileinput.hook_compressed parameter you can also optionally allow the source file to be gzip compressed, which is handy when you want the utility to read from compressed data.

Unfortunately on Windows (currently) the default character encoding is a legacy code page (cp1252, for example) rather than utf-8 (code page 65001, which is what chcp 65001 selects). This means twarc utilities (and twarc itself) will not be able to decode JSON properly unless the file is opened explicitly using the utf8 encoding. You can use an openhook to force the encoding. But fileinput accepts only one openhook, so you can't combine one that forces utf8 with one that checks for gzip.
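To make the failure mode concrete, here is a small sketch that assumes cp1252 as the legacy Windows code page (an assumption; the actual code page depends on the system locale). It shows why text containing an emoji cannot be written with such an encoding, and why UTF-8 bytes read with it come back garbled instead of raising an error.

```python
# Assumption: cp1252 stands in for the legacy Windows default encoding.
LEGACY = "cp1252"


def can_encode(text, encoding=LEGACY):
    """Return True if text can be encoded with the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


tweet = "caf\u00e9 \U0001F600"  # "café" followed by an emoji

print(can_encode(tweet))           # the emoji has no cp1252 representation
print(can_encode(tweet, "utf-8"))  # UTF-8 can encode any Unicode text

# Decoding UTF-8 bytes with the wrong code page yields mojibake, not an error.
garbled = "caf\u00e9".encode("utf-8").decode(LEGACY)
print(garbled)
```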

I think the solution here is to create our own openhook that looks to see if the file is gzip encoded, and also forces utf8. This is pretty easy with python3, but the gzip + utf8 decoding is a bit tricky with python2.
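A minimal sketch of such an openhook for Python 3 might look like the following. The name `utf8_or_gzip_hook` is hypothetical and this is not twarc's actual code; it only assumes that gzip files can be recognized by a `.gz` extension, the same way `fileinput.hook_compressed` does.

```python
import fileinput
import gzip
import os


def utf8_or_gzip_hook(filename, mode):
    """Hypothetical openhook: read .gz files transparently, always as UTF-8.

    fileinput calls the hook with (filename, mode). The mode is ignored
    here because this sketch always opens the file as text.
    """
    if os.path.splitext(filename)[1] == ".gz":
        # "rt" makes gzip decode the decompressed bytes into str.
        return gzip.open(filename, "rt", encoding="utf-8")
    return open(filename, "r", encoding="utf-8")


def read_lines(paths):
    """Yield lines from plain or gzip-compressed UTF-8 files."""
    with fileinput.input(files=paths, openhook=utf8_or_gzip_hook) as lines:
        for line in lines:
            yield line
```

Note that fileinput does not call the openhook for stdin, so input piped in on stdin would still be decoded with the platform default.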

So I think resolving this issue is a good reason to finally switch to Python3 only support and let Python2 go. Which means that #211 would get resolved too!

@melaniewalsh
Contributor

I'm teaching twarc in my class this week, and I just wanted to note that students with Windows computers are also getting utf8-related errors when using twarc utilities, specifically json2csv.py. I'm going to suggest that they set the system locale to utf-8, but I just wanted to ping the issue here.

@edsu
Member

edsu commented Nov 23, 2020

While doing some experiments with Alejandra Josiowicz I noticed that twarc writes JSON data using UTF-16 on Windows. The twarc utilities mostly assume utf-8 as input. I think one way of solving this long-standing bug would be to ensure that twarc outputs utf-8 on all platforms?
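As a rough illustration of writing utf-8 regardless of platform, the sketch below opens the output file with an explicit encoding instead of relying on the platform default. The function name `write_jsonl` is hypothetical and this is not how twarc itself is implemented.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line as UTF-8, on any platform."""
    # An explicit encoding avoids the platform default; newline="\n"
    # keeps line endings identical on Windows and elsewhere.
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```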

@igorbrigadir
Contributor

I just realized this is the same error: https://github.com/DocNow/twarc/wiki/twarc2-on-Windows-10#output-format-errors

For anyone else looking, the issue is with > redirection: redirecting output like that will write UTF-16 on Windows and put a byte order mark (BOM) into the file.
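For a file that has already been written this way, a small sketch like the one below could detect the byte order mark and rewrite the data as UTF-8. The function names are hypothetical, and the sketch assumes the file is either UTF-16 with a BOM, UTF-8 with a BOM, or plain UTF-8; it does not handle UTF-32.

```python
import codecs


def sniff_encoding(path):
    """Guess the text encoding of a file from its byte order mark."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"     # picks the byte order from the BOM and drops it
    return "utf-8"          # assumption: no BOM means plain UTF-8


def convert_to_utf8(src, dst):
    """Copy src to dst, re-encoding it as UTF-8 without a BOM."""
    encoding = sniff_encoding(src)
    # newline="" passes line endings through unchanged.
    with open(src, "r", encoding=encoding, newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
```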

@edsu
Member

edsu commented Mar 22, 2021

Thanks for making this connection @igorbrigadir! I really like being able to use twarc in pipelines so it is a bummer to have to update the docs to not use >. But I guess we could have a prominent section in the documentation about twarc on Windows that explains this, and make sure the commands have a way to write to a file?

@igorbrigadir
Contributor

oh! I had another look, and I think this is the solution: https://jpsoft.com/blogs/2020/04/redirection-piping-and-unicode-at-the-windows-command-prompt/

so this should work:

twarc ... >:u8 output.json

(i haven't checked this on Windows yet)

@edsu
Member

edsu commented Mar 22, 2021

I was just looking at that. I'd love to know if it works!

@igorbrigadir
Contributor

igorbrigadir commented Sep 28, 2021

Going through and closing some old issues: the best solution for Windows is to never use the > output redirection and instead specify an output file name, like

twarc2 search "..." output.jsonl
