Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix error loading LUT files with Turkish locale #984

Merged
merged 3 commits into from
Apr 22, 2020
Merged

Fix error loading LUT files with Turkish locale #984

merged 3 commits into from
Apr 22, 2020

Conversation

brechtvl
Copy link
Contributor

@brechtvl brechtvl commented Apr 9, 2020

In file parsing there are tests like Lower(x) == "spilut". However this was
using the locale dependent std::tolower, which for Turkish does not change
I to i.

Instead this implements a custom Lower() function which is not affected by
locale and so can be used for file parsing.

For more background on this problem see for example:
https://docs.microsoft.com/en-us/previous-versions/dotnet/articles/ms973919(v=msdn.10)?redirectedfrom=MSDN#the-motivation-the-turkish-i-problem

Signed-off-by: Brecht Van Lommel brecht@blender.org

In file parsing there are tests like Lower(x) == "spilut". However this was
using the locale dependent std::tolower, which for Turkish does not change
I to i.

Instead this implements a custom Lower() function which is not affected by
locale and so can be used for file parsing.

Signed-off-by: Brecht Van Lommel <brecht@blender.org>
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Apr 9, 2020

CLA Check
The committers are authorized under a signed CLA.

@hodoulp
Copy link
Member

hodoulp commented Apr 9, 2020

Thanks @brechtvl for contributing to the project.

  • The localization is a challenging topic as the software languages originate from US english speaking people so the add of other languages always faces huge challenges.
  • Having a clear way to understand which strings are to be localized and which are not is also something complex to put in place and maintain on long term. In OCIO, the distinction does not exist. The StringUtils::Lower & StringUtils::Upper are used everywhere from description field to file path.

Because of its locale support the method std::tolower seems 'good enough' in both use cases (even if no perfect). So, totally disabling the locale support (by implementing custom lower & upper methods based on en_US centric letters only) does not help the localization and only partially help the 'not localized' strings. For example a file path could be categorized as 'not localized' strings. However, the Windows filesystem accepts any letters such as "C:\Économie.ÇaVÂ" (french letters where lower must be "c:\économie.çavâ") so std::tolower should work if locale is fr_CA (and fails if locale is en_US which is fine). But it always fails with the proposed customized methods. Note: I noticed that the macos filesystem also accepts french letters.

Not being an expert I would suggest to first check for the locale and only customize the methods if the Turkish locale is enabled. But I still do not see how to correctly support the two use cases in OCIO.

That's an interesting subject that OCIO does not address so any locales (letters, numerics, etc) are badly supported. It would be interesting to have your feedback on the topic.

@brechtvl
Copy link
Contributor Author

brechtvl commented Apr 9, 2020

StringUtils appears to only be designed to handle "byte strings" (in Python terminology). All of the functions assume a character is stored in one byte and do not handle Unicode.

The existing implementation can not correctly lowercase C:\Économie.ÇaVÂ, nor does it matter since it does that for testing the file extensions which have no Unicode characters.

For parsing file headers/tokens I think using byte strings is the correct solution. Even if you did support Unicode (e.g. for colorspace names) you don't want the interpretation of an OCIO config file or LUT to ever depend on the locale. User interface display of text might be affected by locale, but the color space transforms should not be.

I did the initial implementation of Unicode string support in OSL, OIIO and Arnold. In my experience it's best to use all UTF-8 strings in the API and internally since that's what Linux and macOS will already give you. Then for Windows you need to add code for every Windows API function that deals with filepaths, program arguments to convert between the native encoding and UTF-8.

I think it would be good to improve OpenColorIO to do the same. But in my opinion that's actually solving a different problem and this change by itself is safe.

@hodoulp
Copy link
Member

hodoulp commented Apr 9, 2020

Thanks for the information. Let me read a little bit...

@hodoulp
Copy link
Member

hodoulp commented Apr 9, 2020

 StringUtils appears to only be designed to handle "byte strings" (in Python terminology). All of the functions assume a character is stored in one byte and do not handle Unicode.
The existing implementation can not correctly lowercase C:\Économie.ÇaVÂ, nor does it matter since it does that for testing the file extensions which have no Unicode characters.

OCIO supports UTF-8 encoding. Some existing unit tests validate the point:

  • tests/cpu/Config_tests.cpp starting at line 2948
  • tests/cpu/FileRules_tests.cpp starting at line 254

@brechtvl
Copy link
Contributor Author

brechtvl commented Apr 9, 2020

Yes, I think OCIO supports UTF-8 encoding to the extent that is automatically provided by UTF-8 backwards compatibility with ASCII. Which is enough when using ASCII keywords and delimiters and passing along or comparing any other strings unmodified.

But for operations like lowercasing É to é it's not enough, the existing and proposed code only works for ASCII characters. I would argue that OpenColorIO should not be doing any such locale or encoding dependent string operations and stay clear from that minefield.

@doug-walker
Copy link
Collaborator

Hey Brecht, thanks for your work on this. There are a number of locale related issues logged here on GitHub, so it's great to have someone with your expertise participating.

The main concern I have with your proposal is that Lower is used in a lot of places and your change may have unanticipated consequences for OCIO behavior. For example it is used by parseColorSpaceFromString, so it will affect how ColorSpace names are located within file paths. It will also impact the new FileRules. It is also used in various Role and ColorSpace name comparisons.

For situations where we are comparing to known tokens in OCIO (such as supported file extensions), these are all ASCII so your change will work. For cases where we are working with ColorSpace names, I think the behavior needs to be validated a bit wider within the community that may not be actively tracking these issues.

I would summarize your proposal as follows: when comparing strings such as ColorSpace names and paths, comparisons on ASCII characters will be non-case-sensitive but other characters will be case-sensitive. Is the community OK with that?

My own take is that I agree with you that we do not want locale differences to change the way a config works, so I'm happy with your fix. However, I'm also not in the practice of using non-ASCII characters, hence I think the proposal needs to be validated. I will post on Slack to try to get some more people to take a look.

If others could weigh in here, I'm interested more broadly in opinions on whether we even want to offer the courtesy of non-case-sensitive behavior. It seems like the potential bugs it introduces may outweigh the convenience of not requiring people to be precise with case. (And I'm referring here specifically to things like ColorSpace names rather than things like file extensions which are already ASCII and which will remain non-case-sensitive.) There are a number of v2 features that currently use Lower() that we may want to change based on the community opinion.

@brechtvl
Copy link
Contributor Author

brechtvl commented Apr 10, 2020

I would summarize your proposal as follows: when comparing strings such as ColorSpace names and paths, comparisons on ASCII characters will be non-case-sensitive but other characters will be case-sensitive.

It's not so much that I'm proposing this, it's what OpenColorIO is already doing. What I'm proposing is a conservative fix for the Turkish locale case, that as far as I can tell changes nothing for C or English locales. Non-ASCII characters are already and would continue to be unchanged by Lower().

Removing case-insensitive behavior from file parsing in OpenColorIO seems like a good solution too. I guess that would be done selectively in some places and Lower() would still need to be fixed.

@doug-walker
Copy link
Collaborator

Yes, agreed your fix is no change if the locale is already English but the use-case I had in mind was people in non-English countries that would have a non-English locale and might be using non-ASCII characters in their configs. It might be a change in behavior for them.

Again, I'm not opposed to your fix, I just wanted to encourage others to take a look at it since it is potentially a change in behavior beyond the immediate issue of file extensions.

@hodoulp
Copy link
Member

hodoulp commented Apr 14, 2020

@brechtvl I tried hard to reproduce the problem on my macos machine by changing the locale (to tr_TR.UTF-8 on the command line, in the main or in a unit test). But I only succeed to have the . vs. , problematic and not the Turkish I vs. i one. Do you have a way to reproduce the problem?

Note: To test I've added OCIO_CHECK_ASSERT(FormatNameFoundByExtension("SPI1d", "spi1d")); here hoping a test failure if the locale is Turkish.

On the other hand, looking at the file extension detection code in the library I wanted to narrow the code down to the file extension check only (to avoid changing a code impacting too many features) and I found that only one method could be updated (i.e. FormatRegistry::getFileFormatForExtension() here) to help the locale agnostic comparison for the file extensions.

Below is the code change:

void FormatRegistry::getFileFormatForExtension(const std::string & extension,
                                               FileFormatVector & possibleFormats) const
{
    const std::locale loc("en_US.UTF-8");

    std::string lowerStr = extension;
    std::transform(lowerStr.begin(), lowerStr.end(), lowerStr.begin(),
                    [loc](char ch) -> char
                        {
                            return std::tolower(ch, loc);
                        }
                    );

    FileFormatVectorMap::const_iterator iter = m_formatsByExtension.find(lowerStr);
    if(iter != m_formatsByExtension.end())
        possibleFormats = iter->second;
} 

With a way to reproduce the problem, I should be able to test the change or you can test it in your environment.

@doug-walker
Copy link
Collaborator

We discussed this PR during our TSC meeting today. (Also, I had previously posted on Slack to try and get others to take a look at the PR.) There was support in the TSC for the notion that we want to reduce the possibilities for locale differences to cause unexpected changes of behavior in OCIO, even if that means effectively imposing an English locale. Based on that, I'm approving the PR.

@brechtvl , if you get any more spare time, there is also a localization issue around number parsing (e.g. #297, and #379) that could use some help! ;)

@hodoulp hodoulp merged commit 3aab90d into AcademySoftwareFoundation:master Apr 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants