Add file encoding (and some read modes) at open file step #2219

valeriupredoi · 2023-10-05T12:38:55Z

Description

As discussed in #1585, we should specify the encoding (bog-standard UTF-8 here) when opening file objects. Opening in binary mode does NOT take an encoding (Python actually rejects the argument there), so I added encodings to all the YAML files we open, and an explicit read mode 'r' where needed.
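A minimal sketch of the pattern described above, using only the standard library. The helper names and the file name `config-user.yml` are hypothetical and chosen for illustration; they are not taken from the ESMValCore code base. The sketch reads a text file with an explicit encoding and checks that binary mode refuses an `encoding` argument.

```python
import tempfile
from pathlib import Path


def read_text_file(path):
    """Read a text file as UTF-8, independent of the system locale."""
    with open(path, "r", encoding="utf-8") as file:
        return file.read()


def binary_mode_rejects_encoding(path):
    """Return True if open() refuses an encoding in binary mode."""
    try:
        with open(path, "rb", encoding="utf-8"):
            return False
    except ValueError:
        return True


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this illustration.
    path = Path(tmp_dir) / "config-user.yml"
    path.write_bytes("title: Météo-France\n".encode("utf-8"))
    text = read_text_file(path)
    rejected = binary_mode_rejects_encoding(path)
```

Passing the mode and the encoding explicitly keeps the call independent of whatever the machine's locale happens to be.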

Closes #1585

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🛠 All checks below this pull request were successful

To help with the number of pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository

codecov · 2023-10-05T12:47:16Z

Codecov Report

Merging #2219 (7ab6c1e) into main (469fd09) will increase coverage by 0.00%.
The diff coverage is 93.33%.

@@           Coverage Diff           @@
##             main    #2219   +/-   ##
=======================================
  Coverage   93.27%   93.27%           
=======================================
  Files         238      238           
  Lines       12818    12819    +1     
=======================================
+ Hits        11956    11957    +1     
  Misses        862      862

Files	Coverage Δ
esmvalcore/_citation.py	`80.99% <100.00%> (ø)`
esmvalcore/_main.py	`90.94% <100.00%> (ø)`
esmvalcore/_task.py	`72.42% <100.00%> (ø)`
esmvalcore/cmor/table.py	`94.72% <100.00%> (ø)`
esmvalcore/config/_config.py	`100.00% <100.00%> (ø)`
esmvalcore/config/_config_object.py	`94.95% <100.00%> (ø)`
esmvalcore/config/_esgf_pyclient.py	`100.00% <100.00%> (ø)`
esmvalcore/config/_logging.py	`97.67% <100.00%> (ø)`
esmvalcore/esgf/_download.py	`100.00% <100.00%> (ø)`
esmvalcore/experimental/recipe.py	`90.32% <100.00%> (+0.15%)`	⬆️
... and 4 more

valeriupredoi · 2023-10-05T13:06:41Z

what's Codecov's deal? It's very confused
EDIT - nevermind, it's not confused, but plain old silly

bouweandela · 2023-10-05T14:05:40Z

Great to see some action on this topic! grep -RIn 'open(' $(find -name '*.py') still finds quite a few file opens where the encoding is missing even with the changes here. Could you have another look?

valeriupredoi · 2023-10-05T14:12:02Z

@bouweandela cheers! yes, but:

whatever is open for writing there is no need to specify the encoding since the Python writer knows exactly what to use
JSON and image opening I'd rather stay away from forcing the encoding and let Python choose it
bytes mode opening should not have encoding specified

This leaves a couple open file calls which I'll look into now 👍

bouweandela · 2023-10-05T14:15:15Z

We could check if the problem is solved by doing export LC_ALL=C and running the tests, right?

valeriupredoi · 2023-10-05T14:18:38Z

you do that, I am not sure if I can revert the locale vars on my ancient OS 🤣 🤣 JK, can you try pls - while I am looking at the other calls that need encoding

valeriupredoi · 2023-10-05T14:52:38Z

eugh I can't change my locale:

(esmvaltool) valeriu@valeriu-PORTEGE-Z30-C:~/ESMValCore$ sudo update-locale LC_CTYPE="C"
(esmvaltool) valeriu@valeriu-PORTEGE-Z30-C:~/ESMValCore$ locale
LANG=en_GB.UTF-8
LANGUAGE=en_GB:en
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

very possible since OS is a little old 😁

bouweandela · 2023-10-06T08:04:51Z

> I think we should be OK - I got GA to do that, locale to C.UTF-8 and things look good test-wise

Yes, but the problem occurred if the text encoding was not utf-8.

> whatever is open for writing there is no need to specify the encoding since the Python writer knows exactly what to use

It would be safer to always write text files in utf-8 encoding, so they work regardless of what computer the user is opening them on.

> JSON and image opening I'd rather stay away from forcing the encoding and let Python choose it

JSON files are text files, so it would be best to specify the encoding there too. Image files should be opened in binary mode (e.g. mode='rb').
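The points about writing text files and JSON can be sketched as follows. This is an illustration under stated assumptions, not code from the pull request: the file names and the `record` contents are made up. It writes the same JSON document once as UTF-8 and once as Latin-1 to show that the encoding chosen at write time changes the bytes on disk.

```python
import json
import tempfile
from pathlib import Path

# Made-up example data containing non-ASCII characters.
record = {"author": "Zoë", "units": "°C"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "metadata.json"  # hypothetical file name

    # Write: the explicit encoding fixes the bytes that end up on disk.
    with open(path, "w", encoding="utf-8") as file:
        json.dump(record, file, ensure_ascii=False)

    # Read back as text with the same explicit encoding.
    with open(path, "r", encoding="utf-8") as file:
        loaded = json.load(file)

    # Binary mode takes no encoding argument and returns raw bytes.
    raw = path.read_bytes()

    # The same text written as Latin-1 produces different bytes.
    latin1_path = Path(tmp_dir) / "metadata-latin1.json"
    with open(latin1_path, "w", encoding="latin-1") as file:
        json.dump(record, file, ensure_ascii=False)
    differs = latin1_path.read_bytes() != raw
```

Without an explicit encoding, which of the two byte sequences gets written would depend on the machine running the code.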

Pylint still gives me a list of warnings (not all relevant, but most seem relevant):

$ pylint --disable=all -e W1514 $(find -name '*.py')
************* Module setup
setup.py:186:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
setup.py:197:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
setup.py:206:21: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module gensidebar
doc/gensidebar.py:13:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
doc/gensidebar.py:19:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module tests.integration.cmor._fixes.icon.test_icon
tests/integration/cmor/_fixes/icon/test_icon.py:2069:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module tests.integration.recipe.test_recipe
tests/integration/recipe/test_recipe.py:1254:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module tests.integration.test_task
tests/integration/test_task.py:219:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
tests/integration/test_task.py:276:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore.preprocessor._io
esmvalcore/preprocessor/_io.py:376:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore.esgf._download
esmvalcore/esgf/_download.py:97:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore.cmor.table
esmvalcore/cmor/table.py:418:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
esmvalcore/cmor/table.py:468:17: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
esmvalcore/cmor/table.py:480:17: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore._task
esmvalcore/_task.py:122:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
esmvalcore/_task.py:222:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore.experimental.recipe
esmvalcore/experimental/recipe.py:73:40: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore.experimental.recipe_output
esmvalcore/experimental/recipe_output.py:203:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
************* Module esmvalcore._citation
esmvalcore/_citation.py:104:9: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)
esmvalcore/_citation.py:129:13: W1514: Using open without explicitly specifying an encoding (unspecified-encoding)

I tried testing this on my own computer, but it's a bit of a hassle because I don't have any other character set than utf-8 installed.

zklaus · 2023-10-06T08:33:31Z

.github/workflows/run-tests.yml

@@ -59,7 +60,10 @@ jobs:
      - run: python -V 2>&1 | tee test_linux_artifacts_python_${{ matrix.python-version }}/python_version.txt
      - run: pip install -e .[develop] 2>&1 | tee test_linux_artifacts_python_${{ matrix.python-version }}/install.txt
      - run: flake8
-      - run: pytest -n 2 -m "not installation" 2>&1 | tee test_linux_artifacts_python_${{ matrix.python-version }}/test_report.txt
+      - run: |
+          sudo update-locale LC_ALL=C


What is this update-locale thing? You shouldn't need to try and change the default locale for the whole machine. Just do an export LC_ALL=C?

valeriupredoi · 2023-10-06

Meh. Gonna add encoding to all open file calls anyway

bouweandela · 2023-10-09

The thing we want to test is that it works fine with a default character set other than utf-8, as that is what we keep running into (see linked issues).

valeriupredoi · 2023-10-06T10:33:23Z

Aye, am gonna add encoding to everything then - just to be on the safe side

zklaus · 2023-10-06T11:00:43Z

> Aye, am gonna add encoding to everything then - just to be on the safe side

Hm. I know this is something that @bouweandela also brought up, but I feel a bit more cautious about it. For Yaml, the case is clear cut; the Yaml specs dictate UTF(-8). For all other text files, it is perfectly reasonable to use different encodings, isn't it?

Also, a side question: Why are we ever opening Yaml files in binary mode?

valeriupredoi · 2023-10-06T11:25:23Z

@zklaus indeed that was me concern too in #2219 (comment) - but me not being in the know of file encodings I assumed @bouweandela knows a tad more than me, so I was thinking of following his advice - how about I do that and test with export LC_ALL=C on GAs?

valeriupredoi · 2023-10-06T13:50:43Z

OK @bouweandela this should now cover all instances of open file, except shapefiles, bytes, and images - I really don't want to faff around those (wb and rb strictly don't accept encoding anyway) 🍺

bouweandela · 2023-10-09T09:57:24Z

> Hm. I know this is something that @bouweandela also brought up, but I feel a bit more cautious about it. For Yaml, the case is clear cut; the Yaml specs dictate UTF(-8). For all other text files, it is perfectly reasonable to use different encodings, isn't it?

The problem is that Python uses the default encoding setting of the machine that it is running on instead of the encoding of the file, as there is no way to specify that, and therefore this causes problems when the system encoding is set to something ancient. To avoid this issue, we need to specify the encoding in all text files we are reading and writing so they work across platforms and users.
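A small sketch of the failure mode described above, standard library only. The file name `recipe.yml` and its contents are made up, and passing `encoding="ascii"` is only a stand-in for a machine whose default encoding is not UTF-8; on such a machine the same error would occur without any `encoding` argument at all.

```python
import locale
import tempfile
from pathlib import Path

# Encoding that open() falls back to when none is given; machine dependent.
default_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "recipe.yml"  # hypothetical file name
    path.write_bytes("description: 2 °C warming\n".encode("utf-8"))

    # Stand-in for a system whose default encoding is ASCII.
    try:
        with open(path, "r", encoding="ascii") as file:
            file.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # An explicit UTF-8 read works regardless of the system setting.
    with open(path, "r", encoding="utf-8") as file:
        utf8_text = file.read()
```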

> Also, a side question: Why are we ever opening Yaml files in binary mode?

It looks like yaml has some code to automatically try to figure out the correct encoding in case you do this, so that should work fine: https://github.com/yaml/pyyaml/blob/155ec463f6a854ac14ccd5e2dda8017ce42a508a/lib/yaml/reader.py#L122

valeriupredoi · 2023-10-09T10:37:25Z

Cheers for chipping in @bouweandela 🍺 Thing is, don't know how to change my UNIX default encoding without having to use a Klingon translator to manage commands in the terminal 😁

bouweandela · 2023-10-09T10:42:31Z

> Thing is, don't know how to change my UNIX default encoding

@valeriupredoi I think you could do this (haven't tried it myself): https://askubuntu.com/a/120068. So basically you 1) make sure that the locale has been generated with the requested character set and then 2) export the LANG and LC_ALL environment variables to use it. You can check if it was successful by running locale and python -c 'import locale; print(locale.getpreferredencoding(False))' (sys.getdefaultencoding() always reports utf-8 on Python 3, so it does not reflect the locale). After closing the terminal the changes should be gone because your environment variables have been restored to normal, so no need to worry about messing up your system.
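The check can also be run without touching the current shell, by starting a fresh interpreter with a modified environment. This is a hedged sketch, not something tried in this thread: the helper name `preferred_encoding_with` is hypothetical, and setting `PYTHONUTF8=0` is an assumption based on the understanding that recent Python versions switch to UTF-8 mode on their own when the locale is C or POSIX, which would otherwise hide the effect. The printed value depends on the platform, so no expected output is given.

```python
import os
import subprocess
import sys

# Reports the encoding that open() uses when none is specified.
CHECK = "import locale; print(locale.getpreferredencoding(False))"


def preferred_encoding_with(**env_overrides):
    """Run a fresh interpreter with extra environment variables set."""
    env = dict(os.environ, **env_overrides)
    result = subprocess.run(
        [sys.executable, "-c", CHECK],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Assumption: PYTHONUTF8=0 opts out of the automatic UTF-8 mode.
c_locale_encoding = preferred_encoding_with(LC_ALL="C", PYTHONUTF8="0")
print(c_locale_encoding)
```

Because the environment is changed only for the child process, nothing needs to be restored afterwards.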

valeriupredoi · 2023-10-09T11:17:35Z

thanks @bouweandela but you assume I can still run dpkg-* which is a false assumption on my abandonware OS 🤣

valeriupredoi · 2023-10-09T11:18:46Z

many thanks for approving and merging, gents - let's wait and see if some people with bizarre encodings complain, then we can get back to it, by then I should be running a modern OS too, so I can test 😁

bouweandela · 2023-10-09T12:11:53Z

I just checked and the command is available in Ubuntu 14.04. I believe that is your favorite OS, isn't it? 🤣

valeriupredoi · 2023-10-09T12:13:59Z

no, I "evolved" - I am now a big fan of 16.04 🤣

valeriupredoi added 8 commits October 5, 2023 13:28

add encoding

bab35e2

add encoding

249c02d

add encoding

ef8bb03

add encoding

fd3e838

add encoding

e4e2889

add encoding

1597b01

add encoding

36ab853

add encoding

c5b917b

valeriupredoi added the enhancement (New feature or request) label Oct 5, 2023

valeriupredoi requested review from zklaus and bouweandela October 5, 2023 12:38

add encoding

e46cd4f

valeriupredoi added 4 commits October 5, 2023 13:48

add encoding

2fd4b44

add encoding

123e40b

add encoding

0b11336

oopsies

68fb6fc

valeriupredoi added this to the v2.10.0 milestone Oct 5, 2023

valeriupredoi added 5 commits October 5, 2023 15:31

add encoding

aa71473

add encoding

45b9b36

add encoding

e6869c0

add encoding

f3791a8

add encoding

8745a2d

set locale to C on GA and test

6edd9a5

zklaus reviewed Oct 6, 2023

valeriupredoi added 11 commits October 6, 2023 13:55

more encodings

924f9a6

and more

ef6a8d1

set C locale

a57260e

cant seem to bidge the locale on the GA machine so sod it

bcc5a48

more encodings

684818b

more encodings

a1c519a

more encodings

4be7acf

encoding

99496c1

encoding

6706300

encoding

4c13913

missed one

ead901a

valeriupredoi added 2 commits October 6, 2023 15:23

minor refactor

3382289

Merge branch 'main' into add_file_encoding

280d6ac

fix more encodings

7ab6c1e

bouweandela approved these changes Oct 9, 2023

zklaus merged commit adeb1e2 into main Oct 9, 2023

zklaus deleted the add_file_encoding branch October 9, 2023 10:38
