dev/core#2127 - Don't accidentally trim à characters when importing files #19241
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Overview
https://lab.civicrm.org/dev/core/-/issues/2127
When importing data with columns ending in
à
theà
disappears.This is similar to @sluc23's PR #18780 but also handles latin1 encoding and shores up the tests.
Before
Letters like
à
get trimmed.After
Letters like
à
get to stay.Technical Details
The byte pattern for
à
is 0xc3 0xa0, which is very close the byte pattern for a non-breaking space 0xc2 0xa0. Thetrim()
function operates on bytes and the parameter is a list of bytes to trim, so when you give it 0xc2 0xa0 it will trim either byte from the string. When there's anà
, this means it corrupts the string so that it just ends in 0xc3 and gets truncated.Further, note that in latin1 encoding, a non-breaking space is just 0xa0. Further further, php seems unable to detect the encoding when you have that, so it fails when trying to apply the regex.
I also updated the original test to check the actual value instead of just the length which is now more varied with the extra tests and also it just seemed better to check the full value. I considered using bin2hex which might be a more robust way of showing what's happening but it seemed more readable with json which should be just as good given the environment civi runs under, just note that json represents
à
using a unicode representation, as opposed to a utf8 byte sequence.Comments
Shored up the tests so it also checks a file with a BOM, and also pulled the trim function out in order to throw some more tests at it.