Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Certain invalid UTF-8 sequences can cause the output to fail #4086

Closed
j4james opened this issue Dec 30, 2019 · 3 comments · Fixed by #4422
Closed

Certain invalid UTF-8 sequences can cause the output to fail #4086

j4james opened this issue Dec 30, 2019 · 3 comments · Fixed by #4422
Labels
Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Priority-2 A description (P2) Product-Conhost For issues in the Console codebase Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release.
Milestone

Comments

@j4james
Copy link
Collaborator

j4james commented Dec 30, 2019

Environment

Windows build number: Version 10.0.18362.418
Windows Terminal version (if applicable): 0.7.3451.0

Steps to reproduce

  1. Open a bash shell.
  2. Execute the command printf "\xA3"

Expected behavior

I'd expect to see something like the U+FFFD error character, or worst case nothing at all.

Actual behavior

The output is aborted with an error:

-bash: printf: write error: Input/output error

This looks similar to issue #3320, but PR #3380 doesn't fix it, and a recent build from master still produces the problem.

I did a bit of experimenting, and if you replace the MB_ERR_INVALID_CHARS with 0 in the _ParseFullRange method, that seems to help, although I don't know whether that's the right solution.

I know there was a follow up task created which may end up fixing this (#3378), but I thought it best to have the issue filed as an actual bug. If you don't think that's necessary, though, you can always close this as a dup.

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Dec 30, 2019
@german-one
Copy link
Contributor

#4093 could be a first step. I'm quite certain the functions and classes could be used to update Utf8ToWideCharParser in a proper way.

@j4james
Copy link
Collaborator Author

j4james commented Jan 2, 2020

@german-one I'm a bit confused. Is #4093 still a work in progress, or is it not intended to replace the conhost utf8 parser yet? I was hoping I could just merge your PR and it would fix the problems I was seeing, but that doesn't seem to be the case.

@german-one
Copy link
Contributor

german-one commented Jan 2, 2020

It's intended to be the first step. My implementation is able to supersede Utf8ToWideCharParser entirely. And indeed that's the next item on my list. But first I want to get the hand-rolled conversions reviewed by the core contributors. Unifying the UTF-8 parsers is too much for only one PR I guess.

@DHowett-MSFT DHowett-MSFT added Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Product-Conhost For issues in the Console codebase labels Jan 6, 2020
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Jan 6, 2020
@DHowett-MSFT DHowett-MSFT removed the Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting label Jan 6, 2020
@DHowett-MSFT DHowett-MSFT added this to the 21H1 milestone Jan 6, 2020
@DHowett-MSFT DHowett-MSFT added the Priority-2 A description (P2) label Jan 6, 2020
@ghost ghost added the In-PR This issue has a related PR label Jan 31, 2020
@ghost ghost added the Needs-Tag-Fix Doesn't match tag requirements label Feb 4, 2020
DHowett-MSFT pushed a commit that referenced this issue Feb 4, 2020
Replace `utf8Parser` with `til::u8u16` in order to have the same
conversion algorithms used in terminal and conhost.

This PR addresses item 2 in this list:
1. ✉ Implement `til::u8u16` and `til::u16u8` (done in PR #4093)
2. ✔ **Unify UTF-8 handling using `til::u8u16` (this PR)**
    2.1. ✔ **Update VtInputThread::_HandleRunInput()**
    2.2. ✔ **Update ApiRoutines::WriteConsoleAImpl()**
    2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use
3. ❌ Enable BOM discarding (follow up)
    3.1. ❌ extend `til::u8u16` and `til::u16u8` with a 3rd parameter to enable discarding the BOM
    3.2. ❌ Make use of the 3rd parameter to discard the BOM in all current function callers, or (optional / ask the core team) make it the default for  `til::u8u16` and `til::u16u8` 
4. ❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too (follow up)

Closes #4086
Closes #3378
@ghost ghost added Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release. and removed In-PR This issue has a related PR labels Feb 4, 2020
ghost pushed a commit that referenced this issue Feb 21, 2020
… single lead byte only (GH#4673) (#4685)

## Summary of the Pull Request
Fixes a flaw that happened if `til::u8u16` received a single lead byte.

## PR Checklist
* [x] Closes #4673 
* [x] Tests added/passed

## Detailed Description of the Pull Request / Additional comments
The loop for caching partials didn't run and thus, the lead byte was
converted to U+FFFD. That's because the loop starts with `sequenceLen`
initialized with 1. And if the string has a length of 1 the initial
condition is `1<1` which is evaluated to `false` and the body of the
loop was never executed.

## Validation Steps Performed
1) updated the code of the state class and tested manually that `printf
   "\xE2"; printf "\x98\xBA\n"` prints a U+263A character
2) updated the unit tests to make sure that still up to 3 partials are
   cached
3) updated the unit tests to make sure caching also works if the string
   consists of a lead byte only
4) tested manually that #4086 is still resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Priority-2 A description (P2) Product-Conhost For issues in the Console codebase Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants