Reject FHIR resources with the null char \u0000 or similar control chars #2605
Comments
Team decision: the default should be to reject resources that include invalid Unicode code points below 32. Validation is to be done during object construction. We should also introduce a new config parameter on ModelConfig and expose it from fhir-server-config (default to strict, and keep it out of the default/example configs).
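A sketch of how the new parameter might appear in fhir-server-config.json — the property path fhirServer/core/checkControlCharacters is taken from the test notes later in this thread, but the exact shape of the fragment is an assumption:

```json
{
    "fhirServer": {
        "core": {
            "checkControlCharacters": true
        }
    }
}
```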
Test cases

JSON: I played with the Resource string as an InputChannel and injected invalid characters below 32 (other than 9, 10, 13), and the JSON parser from Eclipse Parsson throws an exception for the invalid input. I have test cases to support JSON, and I don't think any further action is needed to protect the JSON payload and Unicode character set.

XML: I played with the XML Resource string and injected the same forbidden characters. The parser throws an XMLStreamException.

Next steps:
Signed-off-by: Paul Bastide <pbastide@us.ibm.com>
I ran tests with JMH and a custom benchmark with the control characters.
1. With Integer and Byte Optimization (Current)

Code Pattern 1: Last Byte Code

The logic order is important: these characters are less than 32, so we need to short-circuit very quickly. The range check must go first.

Code Pattern 2: Character or Integer Range

The logic order is important: these characters are less than 32, so we need to short-circuit very quickly. The range check must go first.
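As a rough illustration of the range-check idea (class and method names are hypothetical, not the actual implementation), Pattern 2 might look like:

```java
// Hypothetical sketch of "Code Pattern 2": a character/integer range check.
// The range test (c < 0x20) goes first so that ordinary characters, which
// are all >= 0x20, short-circuit out of the check immediately. Tab (9),
// LF (10), and CR (13) are exempted, per the test notes above.
public class ControlCharCheck {
    public static boolean hasControlChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                return true;
            }
        }
        return false;
    }
}
```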
Benchmark Results
2. Set Lookup (Old)

Code Pattern 3: Set Contains
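A sketch of what the Set-based lookup could look like (names are hypothetical; per the benchmark results this approach lost to the range check):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of "Code Pattern 3": Set.contains lookup of the
// forbidden control characters (code points below 0x20, except 9, 10, 13).
public class SetLookupCheck {
    private static final Set<Character> FORBIDDEN = new HashSet<>();
    static {
        for (char c = 0; c < 0x20; c++) {
            if (c != '\t' && c != '\n' && c != '\r') {
                FORBIDDEN.add(c);
            }
        }
    }

    public static boolean hasControlChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (FORBIDDEN.contains(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```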
Benchmark Results
3. Original Code with No Unicode Checks (Orig)

Benchmark Results
Overall, #2 is the fastest performing of the checks, and is approximately 30% lower in performance than the original (no-check) code. Note that slight tweaks were made to the base code: String.format was converted to a StringBuilder, as operations per second jumped when using StringBuilder. Net, I've opted for #2.
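The String.format-to-StringBuilder change mentioned above might look like the following (the message text and method name are illustrative, not the actual code):

```java
public class Messages {
    // Before (slower per the benchmark note above):
    //   return String.format("Invalid control character 0x%x at index %d", cp, idx);
    // After: build the same message with StringBuilder.
    static String controlCharMessage(int codePoint, int index) {
        return new StringBuilder(64)
                .append("Invalid control character 0x")
                .append(Integer.toHexString(codePoint))
                .append(" at index ")
                .append(index)
                .toString();
    }
}
```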
Reject FHIR resources with control chars #2605
Confirmed the functionality is working as expected. Request that the FHIR config parameter and new error text be modified to indicate it is a check for "control characters", which covers not only Unicode but also ASCII control characters.
Retested after the requested changes were made. Confirmed the behavior of the new config parameter fhirServer/core/checkControlCharacters and verified the user's guide updates. Tested responses for \u0000 through \u001F control character inclusion in a searchable string parameter. Six of the control characters (\u000B, \u000C, \u001C, \u001D, \u001E, \u001F) throw an error regardless of the checkControlCharacters setting: whether "checkControlCharacters": true, defaulted, or "checkControlCharacters": false.
Is your feature request related to a problem? Please describe.
During a database reindex, the server began returning 500 Server Errors for a specific resource instance.
It turns out that one of the resources contained a string that included \u0000, but PostgreSQL does not allow the null char in its strings. The original resource content was stored successfully because we gzip the resource contents before saving them to the DB.
However, now the element that contains this character is matching on a new search parameter and so that is what causes the problem.
The spec says this:
Describe the solution you'd like
Strictly enforce this guidance as part of constructing model types that extend String or Uri. Instead of allowing it in resource content but not in extracted fields, just reject resources with it up front.
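A minimal sketch of what construction-time rejection could look like — FhirString here is a stand-in class, not the server's actual model type:

```java
// Hypothetical stand-in for a model type that validates its value at
// construction, rejecting forbidden control characters up front rather
// than allowing them into stored resource content.
public class FhirString {
    private final String value;

    public FhirString(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                throw new IllegalArgumentException(
                    "value contains a forbidden control character at index " + i);
            }
        }
        this.value = value;
    }

    public String getValue() {
        return value;
    }
}
```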
This is the preferred approach because it avoids data quality issues up-front. For example, the NULL char can be used for certain exploits as described at https://owasp.org/www-community/attacks/Embedding_Null_Code
As part of #2424 we are no longer running parse validation when we read resources from the database, so it should be safe to add without introducing an "escape hatch" that disables the added validation.
A consequence of this is that certain actions like Patch or Reindex could fail (potentially in a confusing way) until the previously-ingested data is fixed to avoid these problem characters.
Describe alternatives you've considered
The main alternative is to allow these characters in both resource content and extracted/indexed fields.
In this case, we should still probably add a WARNING to the output so that users are still informed that they are not following this SHOULD clause in the spec.
To work around the specific issues with the \u0000 char in postgresql, we'd want to replace this character with something like https://www.fileformat.info/info/unicode/char/fffd/index.htm.
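That workaround could be as simple as the following sketch (class and method names are illustrative):

```java
public class NullCharSanitizer {
    // Replace the null char with U+FFFD (the Unicode replacement character),
    // which PostgreSQL does accept in its strings.
    public static String sanitize(String s) {
        return s.replace('\u0000', '\uFFFD');
    }
}
```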
Acceptance Criteria
GIVEN a resource contains the \u0000 char
WHEN it's posted to the server
THEN it is rejected with a clear error message
GIVEN a resource has already been ingested with the \u0000 char
WHEN the resource is reindexed
THEN it fails with a clear error message
Additional context
#2424 contains some related discussion, but I don't think that we ever really addressed the root issue.