Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add file encoding (and some read modes) at open file step #2219

Merged
merged 34 commits into from
Oct 9, 2023

Conversation

valeriupredoi
Copy link
Contributor

@valeriupredoi valeriupredoi commented Oct 5, 2023

Description

As discussed in #1585 we should specify the type of encoding (bogstandard UTF-8 here) when opening file objects; binary open does NOT require that (in fact enforces it to be binary only), so I added encodings to all the YAML files we open, and some read mode 'r' when needed.

Closes #1585


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number pull requests:

@valeriupredoi valeriupredoi added the enhancement New feature or request label Oct 5, 2023
@codecov
Copy link

codecov bot commented Oct 5, 2023

Codecov Report

Merging #2219 (7ab6c1e) into main (469fd09) will increase coverage by 0.00%.
The diff coverage is 93.33%.

@@           Coverage Diff           @@
##             main    #2219   +/-   ##
=======================================
  Coverage   93.27%   93.27%           
=======================================
  Files         238      238           
  Lines       12818    12819    +1     
=======================================
+ Hits        11956    11957    +1     
  Misses        862      862           
Files Coverage Δ
esmvalcore/_citation.py 80.99% <100.00%> (ø)
esmvalcore/_main.py 90.94% <100.00%> (ø)
esmvalcore/_task.py 72.42% <100.00%> (ø)
esmvalcore/cmor/table.py 94.72% <100.00%> (ø)
esmvalcore/config/_config.py 100.00% <100.00%> (ø)
esmvalcore/config/_config_object.py 94.95% <100.00%> (ø)
esmvalcore/config/_esgf_pyclient.py 100.00% <100.00%> (ø)
esmvalcore/config/_logging.py 97.67% <100.00%> (ø)
esmvalcore/esgf/_download.py 100.00% <100.00%> (ø)
esmvalcore/experimental/recipe.py 90.32% <100.00%> (+0.15%) ⬆️
... and 4 more

@valeriupredoi
Copy link
Contributor Author

valeriupredoi commented Oct 5, 2023

what's Codecov's deal? It's very confused
EDIT - nevermind, it's not confused, but plain old silly

@valeriupredoi valeriupredoi added this to the v2.10.0 milestone Oct 5, 2023
@bouweandela
Copy link
Member

bouweandela commented Oct 5, 2023

Great to see some action on this topic! grep -RIn 'open(' $(find -name '*.py') still finds quite a few file opens where the encoding is missing even with the changes here. Could you have another look?

@valeriupredoi
Copy link
Contributor Author

@bouweandela cheers! yes, but:

  • whatever is open for writing there is no need to specify the encoding since the Python writer knows exactly what to use
  • JSON and image opening I'd rather stay away from forcing the encoding and let Python choose it
  • bytes mode opening should not have encoding specified

This leaves a couple open file calls which I'll look into now 👍

@bouweandela
Copy link
Member

We could check if the problem is solved by doing export LC_ALL=C and running the tests, right?

@valeriupredoi
Copy link
Contributor Author

you do that, I am not sure if I can revert the locale vars on my ancient OS 🤣 🤣 JK, can you try pls - while I am looking at the other calls that need encoding

@valeriupredoi
Copy link
Contributor Author

eugh I can't change my locale:

(esmvaltool) valeriu@valeriu-PORTEGE-Z30-C:~/ESMValCore$ sudo update-locale LC_CTYPE="C"
(esmvaltool) valeriu@valeriu-PORTEGE-Z30-C:~/ESMValCore$ locale
LANG=en_GB.UTF-8
LANGUAGE=en_GB:en
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

very possible since OS is a little old 😁

@bouweandela
Copy link
Member

I think we should be OK - I got GA to do that, locale to C.UTF-8 and things look good test-wise

Yes, but the problem occurred if the text encoding was not utf-8.

whatever is open for writing there is no need to specify the encoding since the Python writer knows exactly what to use

It would be safer to always write text files in utf-8 encoding, so they work regardless of what computer the user is opening them on.

JSON and image opening I'd rather stay away from forcing the encoding and let Python choose it

JSON files are text files, so it would be best to specify the encoding there too. Image files should be opened in binary mode mode='b').

Pylint still gives me a list of warnings (not all relevant, but most seem relevant):

$ pylint --disable=all -e W1514 $(find -name '*.py')
************* Module setup
setup.py:186:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
setup.py:197:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
setup.py:206:21: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module gensidebar
doc/gensidebar.py:13:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
doc/gensidebar.py:19:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module tests.integration.cmor._fixes.icon.test_icon
tests/integration/cmor/_fixes/icon/test_icon.py:2069:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module tests.integration.recipe.test_recipe
tests/integration/recipe/test_recipe.py:1254:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module tests.integration.test_task
tests/integration/test_task.py:219:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
tests/integration/test_task.py:276:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore.preprocessor._io
esmvalcore/preprocessor/_io.py:376:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore.esgf._download
esmvalcore/esgf/_download.py:97:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore.cmor.table
esmvalcore/cmor/table.py:418:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
esmvalcore/cmor/table.py:468:17: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
esmvalcore/cmor/table.py:480:17: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore._task
esmvalcore/_task.py:122:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
esmvalcore/_task.py:222:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore.experimental.recipe
esmvalcore/experimental/recipe.py:73:40: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore.experimental.recipe_output
esmvalcore/experimental/recipe_output.py:203:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore._citation
esmvalcore/_citation.py:104:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
esmvalcore/_citation.py:129:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)

I tried testing this on my own computer, but it's a bit of a hassle because I don't have any other character set than utf-8 installed.

@@ -59,7 +60,10 @@ jobs:
- run: python -V 2>&1 | tee test_linux_artifacts_python_${{ matrix.python-version }}/python_version.txt
- run: pip install -e .[develop] 2>&1 | tee test_linux_artifacts_python_${{ matrix.python-version }}/install.txt
- run: flake8
- run: pytest -n 2 -m "not installation" 2>&1 | tee test_linux_artifacts_python_${{ matrix.python-version }}/test_report.txt
- run: |
sudo update-locale LC_ALL=C
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this update-locale thing? You shouldn't need to try and change the default locale for the whole machine. Just do an export LC_ALL=C?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh. Gonna add encoding to all open file calls anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing we want to test is that it works fine with a default character set other than utf-8, as that is what we keep running into (see linked issues).

@valeriupredoi
Copy link
Contributor Author

Aye, am gonna add encoding to everything then - just to be on the safe side

@zklaus
Copy link

zklaus commented Oct 6, 2023

Aye, am gonna add encoding to everything then - just to be on the safe side

Hm. I know this is something that @bouweandela also brought up, but I feel a bit more cautious about it. For Yaml, the case is clear cut; the Yaml specs dictate UTF(-8). For all other text files, it is perfectly reasonable to use different encodings, isn't it?

Also, a side question: Why are we ever opening Yaml files in binary mode?

@valeriupredoi
Copy link
Contributor Author

valeriupredoi commented Oct 6, 2023

@zklaus indeed that was me concern too in #2219 (comment) - but me not being in the know of file encodings I assumed @bouweandela knows a tad more than me, so I was thinking of following his advice - how about I do that and test with export LC_ALL=C on GAs?

@valeriupredoi
Copy link
Contributor Author

OK @bouweandela this should now cover all instances of open file, except shapefiles, bytes, and images - I really don't want to faff around those (wb and rb strictly don't accept encoding anyway) 🍺

@bouweandela
Copy link
Member

Hm. I know this is something that @bouweandela also brought up, but I feel a bit more cautious about it. For Yaml, the case is clear cut; the Yaml specs dictate UTF(-8). For all other text files, it is perfectly reasonable to use different encodings, isn't it?

The problem is that Python uses the default encoding setting of the machine that it is running on instead of the encoding of the file, as there is no way to specify that, and therefore this causes problems when the system encoding is set to something ancient. To avoid this issue, we need to specify the encoding in all text files we are reading and writing so they work across platforms and users.

Also, a side question: Why are we ever opening Yaml files in binary mode?

It looks like yaml has some code to automatically try to figure out the correct encoding in case you do this, so that should work fine: https://github.com/yaml/pyyaml/blob/155ec463f6a854ac14ccd5e2dda8017ce42a508a/lib/yaml/reader.py#L122

@valeriupredoi
Copy link
Contributor Author

Cheers for chipping in @bouweandela 🍺 Thing is, don't know how to change my UNIX default encoding without having to use a Klingon translator to manage commands in the terminal 😁

@zklaus zklaus merged commit adeb1e2 into main Oct 9, 2023
@zklaus zklaus deleted the add_file_encoding branch October 9, 2023 10:38
@bouweandela
Copy link
Member

bouweandela commented Oct 9, 2023

Thing is, don't know how to change my UNIX default encoding

@valeriupredoi I think you could do this (haven't tried it myself): https://askubuntu.com/a/120068. So basically you 1) make sure that the locale has been generated with the requested character set and then 2) export the LANG and LC_ALL environmental variables to use it. You can check if it was successful by running locale and python -c 'import sys; print(sys.getdefaultencoding())'. After closing the terminal the changes should be gone because your environmental variables have been restored to normal, so no need to worry about messing up your system.

@valeriupredoi
Copy link
Contributor Author

thanks @bouweandela but you assume I can still run dpkg-* which is a false assumption on my abandonware OS 🤣

@valeriupredoi
Copy link
Contributor Author

many thanks for approving and merging, gents - let's wait see if some people with bizarre encodings complain, then we can get back to it, by then I should be running a modern OS too, so I can test 😁

@bouweandela
Copy link
Member

I just checked and the command is available in Ubuntu 14.04. I believe that is your favorite OS, isn't it? 🤣

@valeriupredoi
Copy link
Contributor Author

no, I "evolved" - I am now a big fan of 16.04 🤣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Specify encoding when reading YAML and other text files
3 participants