Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes Unicode paths on Windows #1363

Merged

Conversation

itsmattkc
Copy link
Contributor

@itsmattkc itsmattkc commented Apr 9, 2021

Fixes #406.

Unlike most other platforms, Windows' Unicode is standardized around UTF-16, an encoding not compatible with char * arrays common in C/C++. As such, to support Unicode correctly when using Win32 APIs, strings must be converted to and from UTF-16 and the Unicode versions of the APIs must be used over the ANSI versions.

This commit introduces the following:

  • Utility functions for converting between UTF-8 and UTF-16LE on all platforms:

    • Platform::Utf8ToUtf16
    • Platform::Utf16ToUtf8
    • NOTE: These functions use std::codecvt_utf8_utf16 on macOS and Linux but MultiByteToWideChar and WideCharToMultiByte on Windows because MSVC gave deprecation warnings recommending the Win32 functions with the former.
  • Test for these conversion functions to ensure the conversion to and from UTF-8 and UTF-16LE is correct.

  • Utility wrappers for "std::ifstream" that automatically convert to and from UTF-16 on Windows so that filenames requiring Unicode encoding function correctly:

    • Platform::CreateInputFileStream
    • Platform::OpenInputFileStream
  • Moves the file default compute hash function to the Platform class (Platform::CreateFileContentHash) and switches to the Win32 UTF-16 variant on Windows.

  • Adds the UNICODE macro before including "Windows.h" on Windows which ensures all functions called are the Unicode variants instead of the default ANSI variants.

    • Implicitly changes functions such as GetEnvironmentVariable and SetEnvironmentVariable to their Unicode variants.
  • Changes the following environment variable related functions to their Win32 Unicode variants on Windows:

    • environ -> _wenviron
    • _putenv_s -> _wputenv_s
  • Updates tests using SetEnvironmentVariable to use wide string literals since that function has been switched to the Unicode variant.

In my testing, this appears to function correctly, loading configs, LUTs, and looks with non-ASCII paths, however I have not tested exhaustively. There may be more file-related functions I haven't found/wrapped.

Happy to take any suggestions or make any changes to the implementation upon request.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Apr 9, 2021

CLA Signed

The committers are authorized under a signed CLA.

Copy link
Member

@hodoulp hodoulp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @itsmattkc for this pull request to tackle the challenging file path character encoding difference between Windows and Linux platforms.

Even If I made several comments/suggestions, I think that you are on the right track.

Any changes in the library must now provide a unit test (a new one or updating an existing one) to demonstrate that the change works on all the supported platforms. Even if it could end-up to be a no-op for Linux platform, a unit test must validate it also.

Note: if the method DefaultComputeHash can be moved in platform then the methods Utf16ToUtf8() and Utf8ToUtf16 could then be completely private i.e. no more define in an header file.

src/OpenColorIO/Config.cpp Outdated Show resolved Hide resolved
src/OpenColorIO/PathUtils.cpp Outdated Show resolved Hide resolved
src/OpenColorIO/Platform.h Outdated Show resolved Hide resolved
src/OpenColorIO/Platform.h Outdated Show resolved Hide resolved
src/OpenColorIO/Platform.h Outdated Show resolved Hide resolved
src/OpenColorIO/Platform.cpp Show resolved Hide resolved
@itsmattkc itsmattkc force-pushed the win32-unicode-fix branch 3 times, most recently from 433ac99 to 1658bbf Compare April 12, 2021 06:30
@hodoulp
Copy link
Member

hodoulp commented Apr 16, 2021

Thanks @itsmattkc for all the efforts to fix the file path on the Windows platform.

I do not understand the rational for the latest commit. Could you explain it?

@itsmattkc
Copy link
Contributor Author

Hey, yes I've made a few more changes the PR overall and am still in the process of finalizing it. I implemented all of your suggestions but then thought it might be good to support Unicode environment variables too (such as the OCIO variable) which has led to having to change some other things too (notably the Windows-based environment variable tests).

As for the latest commit, I assume you're referring to enforcing UTF-16LE over UTF-16BE? That's just because Windows specifically uses UTF-16LE as its Unicode encoding. When I implemented a test of the UTF-8/UTF-16 conversion functions (using pre-existing literals as "control variables"), I found it would fail in some environments because some platforms (notably the CentOS 7 CI set up here) defaulted to UTF-16BE which would lead to the strings failing to compare. Since these functions are specifically intended for Windows anyway, it seemed appropriate to just enforce UTF-16LE in the conversion.

@itsmattkc itsmattkc force-pushed the win32-unicode-fix branch 2 times, most recently from 0da7c40 to 0d480fb Compare April 18, 2021 07:10
@itsmattkc
Copy link
Contributor Author

Okay, sorry for the delay, this PR is now ready to be re-reviewed. I've squashed my commits for neatness and updated the OP to better reflect the changes made.

@itsmattkc itsmattkc requested a review from hodoulp April 18, 2021 09:08
@Simran-B
Copy link
Contributor

  • The PR covers std::ifstream, but std::ofstream is also used in a few places (not in the core code however as far as I see). I suppose output file streams should also be handled?
    • apps (ociobakelut, ociocheck, ...)
    • bindings (Python)
    • tests (CPU, GPU)
    • vendor (Photoshop)
  • I wonder if there is any Big Endian Windows where UTF-16BE might be used, such as on ARM, but it appears to be Little Endian even on AArch: https://reverseengineering.stackexchange.com/questions/17922/determining-endianness-of-pe-files-windows-on-arm
  • Not sure if it's worth mentioning in the commit message, but Microsoft calls UTF-8 "multi-byte" and UTF-16 "wide-character" in their APIs. Some functions have a suffix "A" for ANSI and "W" for wide-character. This isn't really visible in the PR because the UNICODE macro takes care of using the right functions.

@itsmattkc
Copy link
Contributor Author

  • Ah yeah I missed ofstream, I can add support for that too for completeness.
  • Yeah, I believe for consistency, Windows is always UTF-16LE.
  • I kind of allude in the commit message to the UNICODE macro switching the functions to "Unicode variants", but I can clarify a little further that functions like SetEnvironmentVariable, which by default resolves to the ANSI SetEnvironmentVariableA, now resolves to the Unicode/wide char SetEnvironmentVariableW thanks to the UNICODE macro.

Copy link
Member

@hodoulp hodoulp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work.
I only made comments to improve the readability of the code.

vendor/aftereffects/OpenColorIO_AE_Context.cpp Outdated Show resolved Hide resolved
src/OpenColorIO/Platform.h Outdated Show resolved Hide resolved
src/OpenColorIO/Platform.h Outdated Show resolved Hide resolved
src/OpenColorIO/Platform.cpp Outdated Show resolved Hide resolved
tests/cpu/Platform_tests.cpp Show resolved Hide resolved
tests/cpu/Platform_tests.cpp Outdated Show resolved Hide resolved
@itsmattkc itsmattkc force-pushed the win32-unicode-fix branch 3 times, most recently from f91e774 to cfbc71f Compare April 22, 2021 00:37
Unlike most other platforms, Windows' Unicode is standardized around
UTF-16, an encoding not compatible with "char *" arrays common in
C/C++. As such, to support Unicode correctly when using Win32 APIs,
strings must be converted to and from UTF-16 and the Unicode
versions of the APIs must be used over the ANSI versions.

This commit introduces the following:

- Utility funcions for converting between UTF-8 and UTF-16LE on all
  platforms:
  - Platform::Utf8ToUtf16
  - Platform::Utf16ToUtf8

- Adds test for these conversion functions to ensure the conversion
  to and from UTF-8 and UTF-16LE is correct.

- Utility wrappers for "std::ifstream" that automatically convert
  to and from UTF-16 so that filenames requiring Unicode encoding
  function correctly:
  - Platform::CreateInputFileStream
  - Platform::OpenInputFileStream

- Moves the file default compute hash function to the Platform
  class (Platform::CreateFileContentHash) and switches to the
  Win32 UTF-16 variant on Windows.

- Adds the "UNICODE" macro before including "Windows.h" which
  ensures all functions called are the Unicode variants instead of
  the default ANSI variants.
  - Implicitly changes functions such as GetEnvironmentVariable and
    SetEnvironmentVariable to their Unicode variants.

- Changes the following environment variable related functions to
  their Win32 Unicode variants on Windows:
  - environ -> _wenviron
  - _putenv_s -> _wputenv_s

- Updates tests using SetEnvironmentVariable to use wide string
  literals since that function has been switched to the Unicode
  variant.

Signed-off-by: itsmattkc <itsmattkc@gmail.com>
@itsmattkc
Copy link
Contributor Author

Thanks for the review, I believe I have resolved all of your suggestions. Let me know if anything else should be changed.

@itsmattkc itsmattkc requested a review from hodoulp April 22, 2021 02:19
@fnordware
Copy link
Contributor

Windows projects usually define UNICODE in the build process, usually a Visual Studio project. By defining it in a header like we have here, we may force UNICODE on a Windows project that doesn't want it.

As a library, I think we should respond to UNICODE, not set it. If we have something that explicitly needs the Unicode version, just call it directly, i.e. use SetWindowTextW() instead of SetWindowText().

UNICODE might need to be added to the CMake stuff somewhere. For Windows it might be on by default, but there should be a way to switch it off and everything should still work.

@hodoulp
Copy link
Member

hodoulp commented May 4, 2021

Good catch @fnordware the UNICODE define is in a private header so it cannot propagate to any other libraries using OpenColorIO. Did you face a problem?

@fnordware
Copy link
Contributor

fnordware commented May 5, 2021

I still think defining UNICODE in our own header is a mistake. It should be defined by the build system. It would be like defining __GNUC__ in a header.

@itsmattkc
Copy link
Contributor Author

itsmattkc commented May 5, 2021

Hmm you might be right, I can take a look and see if it can be made agnostic without much difficulty

EDIT: Never mind, I understand what you're saying. So it would make sense to define it in CMake rather than a header?

@hodoulp
Copy link
Member

hodoulp commented May 5, 2021

I googled on that UNICODE topic and almost all Windows UNICODE examples define UNICODE in the build system (i.e. not in the source code).

Signed-off-by: itsmattkc <itsmattkc@gmail.com>
Signed-off-by: itsmattkc <itsmattkc@gmail.com>
Signed-off-by: itsmattkc <itsmattkc@gmail.com>
Signed-off-by: itsmattkc <itsmattkc@gmail.com>
Signed-off-by: itsmattkc <itsmattkc@gmail.com>
@hodoulp
Copy link
Member

hodoulp commented Oct 19, 2021

@itsmattkc There is a missing ifdef somewhere for Linux and macOS platforms:

2: [ 902/1000] [Platform / create_temp_filename                             ] - PASSED
2: 
2: FAILED: Only supported by the Windows platform..
2: [ 903/1000] [Platform / utf8_utf16_convert                               ] - FAILED

@hodoulp
Copy link
Member

hodoulp commented Oct 19, 2021

Concerning the Windows cmake break, it could be the same fixes than the ones in the pull request #1516.

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>
Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>
@itsmattkc
Copy link
Contributor Author

Looks like I made a mistake in the CMake files, it should compile now

Copy link
Member

@hodoulp hodoulp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last questions I think by approving.

CMakeLists.txt Outdated Show resolved Hide resolved
src/OpenColorIO/Platform.h Show resolved Hide resolved
Signed-off-by: itsmattkc <itsmattkc@gmail.com>
hodoulp
hodoulp previously approved these changes Oct 20, 2021
Copy link
Member

@hodoulp hodoulp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great Job.

Copy link
Collaborator

@doug-walker doug-walker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these improvements! I made one minor suggestion, but approving now.

CMakeLists.txt Outdated Show resolved Hide resolved
Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>
src/OpenColorIO/CMakeLists.txt Outdated Show resolved Hide resolved
CMakeLists.txt Show resolved Hide resolved
src/OpenColorIO/CMakeLists.txt Outdated Show resolved Hide resolved
src/OpenColorIO/Platform.cpp Outdated Show resolved Hide resolved
@hodoulp hodoulp dismissed their stale review October 26, 2021 17:58

Some issues were found

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>
tests/cpu/CMakeLists.txt Outdated Show resolved Hide resolved
Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>
Copy link
Member

@hodoulp hodoulp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work.

@hodoulp
Copy link
Member

hodoulp commented Oct 29, 2021

@itsmattkc please, update your pull request so I can merge it.

@hodoulp hodoulp merged commit d074757 into AcademySoftwareFoundation:master Oct 29, 2021
hodoulp pushed a commit that referenced this pull request Oct 29, 2021
* Fixes Unicode paths on Windows

Unlike most other platforms, Windows' Unicode is standardized around
UTF-16, an encoding not compatible with "char *" arrays common in
C/C++. As such, to support Unicode correctly when using Win32 APIs,
strings must be converted to and from UTF-16 and the Unicode
versions of the APIs must be used over the ANSI versions.

This commit introduces the following:

- Utility funcions for converting between UTF-8 and UTF-16LE on all
  platforms:
  - Platform::Utf8ToUtf16
  - Platform::Utf16ToUtf8

- Adds test for these conversion functions to ensure the conversion
  to and from UTF-8 and UTF-16LE is correct.

- Utility wrappers for "std::ifstream" that automatically convert
  to and from UTF-16 so that filenames requiring Unicode encoding
  function correctly:
  - Platform::CreateInputFileStream
  - Platform::OpenInputFileStream

- Moves the file default compute hash function to the Platform
  class (Platform::CreateFileContentHash) and switches to the
  Win32 UTF-16 variant on Windows.

- Adds the "UNICODE" macro before including "Windows.h" which
  ensures all functions called are the Unicode variants instead of
  the default ANSI variants.
  - Implicitly changes functions such as GetEnvironmentVariable and
    SetEnvironmentVariable to their Unicode variants.

- Changes the following environment variable related functions to
  their Win32 Unicode variants on Windows:
  - environ -> _wenviron
  - _putenv_s -> _wputenv_s

- Updates tests using SetEnvironmentVariable to use wide string
  literals since that function has been switched to the Unicode
  variant.

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Moved UNICODE and _UNICODE definitions to CMake

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* handle empty strings and use const references

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* only use wenviron when unicode is enabled

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* add ANSI Win32 version of Getenv

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Add CMake option for compiling with Win32 Unicode support

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Throw exception if UTF functions are called on non-Windows platforms

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Throw OCIO Exception

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Don't run UTF-8/16 conversion test on non-Windows

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* Fix CMake error

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* minor CMake adjustment

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Clarified CMake option for Win32 unicode

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* minor improvements

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* fix macro in cpu test cmake

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>
michdolan pushed a commit that referenced this pull request Nov 1, 2021
* Fixes Unicode paths on Windows

Unlike most other platforms, Windows' Unicode is standardized around
UTF-16, an encoding not compatible with "char *" arrays common in
C/C++. As such, to support Unicode correctly when using Win32 APIs,
strings must be converted to and from UTF-16 and the Unicode
versions of the APIs must be used over the ANSI versions.

This commit introduces the following:

- Utility funcions for converting between UTF-8 and UTF-16LE on all
  platforms:
  - Platform::Utf8ToUtf16
  - Platform::Utf16ToUtf8

- Adds test for these conversion functions to ensure the conversion
  to and from UTF-8 and UTF-16LE is correct.

- Utility wrappers for "std::ifstream" that automatically convert
  to and from UTF-16 so that filenames requiring Unicode encoding
  function correctly:
  - Platform::CreateInputFileStream
  - Platform::OpenInputFileStream

- Moves the file default compute hash function to the Platform
  class (Platform::CreateFileContentHash) and switches to the
  Win32 UTF-16 variant on Windows.

- Adds the "UNICODE" macro before including "Windows.h" which
  ensures all functions called are the Unicode variants instead of
  the default ANSI variants.
  - Implicitly changes functions such as GetEnvironmentVariable and
    SetEnvironmentVariable to their Unicode variants.

- Changes the following environment variable related functions to
  their Win32 Unicode variants on Windows:
  - environ -> _wenviron
  - _putenv_s -> _wputenv_s

- Updates tests using SetEnvironmentVariable to use wide string
  literals since that function has been switched to the Unicode
  variant.

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Moved UNICODE and _UNICODE definitions to CMake

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* handle empty strings and use const references

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* only use wenviron when unicode is enabled

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* add ANSI Win32 version of Getenv

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Add CMake option for compiling with Win32 Unicode support

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Throw exception if UTF functions are called on non-Windows platforms

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Throw OCIO Exception

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Don't run UTF-8/16 conversion test on non-Windows

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* Fix CMake error

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* minor CMake adjustment

Signed-off-by: itsmattkc <itsmattkc@gmail.com>

* Clarified CMake option for Win32 unicode

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* minor improvements

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

* fix macro in cpu test cmake

Signed-off-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>

Co-authored-by: itsmattkc <34096995+itsmattkc@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issues with non-ascii folders on Windows
6 participants