UnicodeEncodeError using twarc utilities on Windows #297
Comments
An update to this: the solution I've found is to set the system locale to UTF-8 (a beta feature in Windows 10). See here: https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window/57134096#57134096 This works when running the twarc utilities from the Windows Command Prompt.
Many of the utilities and the twarc command itself use the fileinput module to open the input files. This is nice because it means the program can easily read from one or more files, as well as stdin, as a single source of input. With the

Unfortunately on Windows (currently) the default character encoding is

I think the solution here is to create our own openhook that checks whether the file is gzip encoded, and also forces UTF-8. This is pretty easy with Python 3, but the gzip + UTF-8 decoding is a bit tricky with Python 2. So I think resolving this issue is a good reason to finally switch to Python 3 only support and let Python 2 go. Which means that #211 would get resolved too!
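For anyone following along, here is a rough sketch of what such an openhook could look like on Python 3. It is not twarc's actual code: `utf8_openhook` and `read_lines` are hypothetical names, and detecting gzip by its two magic bytes (rather than by file extension) is an assumption made for the example.

```python
import fileinput
import gzip

# The first two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def utf8_openhook(filename, mode="r"):
    """Hypothetical openhook: open plain or gzipped files as UTF-8 text.

    fileinput calls the hook with (filename, mode). The mode is ignored
    here because this sketch only supports reading text.
    """
    with open(filename, "rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC
    if is_gzip:
        # "rt" makes gzip decode the decompressed bytes to str.
        return gzip.open(filename, "rt", encoding="utf-8")
    return open(filename, "r", encoding="utf-8")


def read_lines(paths):
    """Yield lines from all paths as one stream, always decoded as UTF-8."""
    with fileinput.input(files=paths, openhook=utf8_openhook) as stream:
        for line in stream:
            yield line
```

Note that fileinput does not call the openhook for stdin, so piped input would still be decoded with the platform default unless that is handled separately.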
I'm teaching twarc in my class this week, and I just wanted to note that students with Windows computers are also getting UTF-8-related errors when using twarc utilities, specifically json2csv.py. I'm going to suggest that they set the system locale to UTF-8, but I just wanted to ping the issue here.
While doing some experiments with Alejandra Josiowicz I noticed that twarc writes JSON data using UTF-16 on Windows. The twarc utilities mostly assume UTF-8 as input. I think one way of solving this long-standing bug would be to ensure that twarc outputs UTF-8 on all platforms?
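Until that is sorted out, a file that already ended up as UTF-16 can be re-encoded before handing it to the utilities. This is only a sketch under assumptions: `detect_encoding` and `convert_to_utf8` are hypothetical helpers (not part of twarc), and it assumes the UTF-16 file starts with a byte order mark.

```python
import codecs


def detect_encoding(path):
    """Guess the text encoding from a leading byte order mark (BOM).

    Falls back to UTF-8 when there is no BOM. UTF-32 is not handled.
    """
    with open(path, "rb") as raw:
        head = raw.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return "utf-16"
    return "utf-8"


def convert_to_utf8(src, dst):
    """Copy src to dst line by line, re-encoding it as BOM-less UTF-8."""
    encoding = detect_encoding(src)
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding=encoding, newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
```

Something like `convert_to_utf8("tweets.jsonl", "tweets-utf8.jsonl")` (file names made up) would then give a file the utilities can read.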
I just realized this is the same error: https://github.com/DocNow/twarc/wiki/twarc2-on-Windows-10#output-format-errors For anyone else looking, the issue is with
Thanks for making this connection @igorbrigadir! I really like being able to use twarc in pipelines, so it is a bummer to have to update the docs to not use >. But I guess we could have a prominent section in the documentation about twarc on Windows that explains this, and make sure the commands have a way to write to a file?
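To illustrate the "write to a file" idea: if the program opens the output file itself, it can pin the encoding instead of inheriting whatever the shell does with redirected output. A minimal sketch, where `write_jsonl` is a hypothetical helper and not an existing twarc function:

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line, always encoded as UTF-8.

    newline="\n" stops Windows from translating line endings to CRLF.
    """
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for record in records:
            # ensure_ascii=False keeps emoji and accents as real characters.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because the encoding is given explicitly, the result should not depend on the system locale or the console code page.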
oh! I had another look, and I think this is the solution: https://jpsoft.com/blogs/2020/04/redirection-piping-and-unicode-at-the-windows-command-prompt/ so this should work:
(I haven't checked this on Windows yet.)
I was just looking at that. I'd love to know if it works!
Going through closing some old issues: the best solution for Windows is to never use the
I'm getting encoding errors when running some twarc utilities on Windows. So far I'm getting similar error messages when running users.py, emojis.py, json2csv.py, tags.py, and wall.py. These are the ones I've tested so far; wordcloud.py seems to work fine.
Python 3.7.4
I've tested the same .jsonl file on my Mac, and all of the above utilities ran successfully.