Reject FHIR resources with the null char \u0000 or similar control chars #2605
Comments
Team decision: the default should be to reject resources that include invalid Unicode code points below 32. Validation is to be done during object construction. We should also introduce a new config parameter on ModelConfig and expose it from fhir-server-config (default to strict, and keep it out of the default/example configs).
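A sketch of how the new parameter might appear in fhir-server-config.json — the property path fhirServer/core/checkControlCharacters is taken from the test notes later in this thread, but the exact shape of the fragment is an assumption:

```json
{
    "fhirServer": {
        "core": {
            "checkControlCharacters": true
        }
    }
}
```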
Test cases

JSON: I played with the Resource string as an InputChannel and injected invalid characters below 32 (other than 9, 10, 13), and the JSON parser from Eclipse Parsson throws an exception for the invalid input. I have test cases to support JSON, and I don't think any further action is needed to protect the JSON payload and Unicode character set.

XML: I played with the XML Resource string and injected the same forbidden characters. The parser throws an XMLStreamException.

Next steps:
Signed-off-by: Paul Bastide <pbastide@us.ibm.com>
I ran tests with JMH and a custom benchmark with the control characters.
1. With Integer and Byte Optimization (Current)

Code Pattern 1: Last Byte Code

The logic order is important: these characters are less than 32, so we need to short-circuit very quickly. The range check must go first.

Code Pattern 2: Character or Integer Range

The logic order is important: these characters are less than 32, so we need to short-circuit very quickly. The range check must go first.
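As a rough illustration of the range-check idea (class and method names are hypothetical, not the actual implementation), Pattern 2 might look like:

```java
// Hypothetical sketch of "Code Pattern 2": a character/integer range check.
// The range test (c < 0x20) goes first so that ordinary characters, which
// are all >= 0x20, short-circuit out of the check immediately. Tab (9),
// LF (10), and CR (13) are exempted, per the test notes above.
public class ControlCharCheck {
    public static boolean hasControlChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                return true;
            }
        }
        return false;
    }
}
```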
Benchmark Results
2. Set Lookup (Old)

Code Pattern 3: Set Contains
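A sketch of what the Set-based lookup could look like (names are hypothetical; per the benchmark results this approach lost to the range check):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of "Code Pattern 3": Set.contains lookup of the
// forbidden control characters (code points below 0x20, except 9, 10, 13).
public class SetLookupCheck {
    private static final Set<Character> FORBIDDEN = new HashSet<>();
    static {
        for (char c = 0; c < 0x20; c++) {
            if (c != '\t' && c != '\n' && c != '\r') {
                FORBIDDEN.add(c);
            }
        }
    }

    public static boolean hasControlChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (FORBIDDEN.contains(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```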
Benchmark Results
3. Original Code with No Unicode Checks (Orig)

Benchmark Results
Overall, #2 is the fastest performing of the checks, and is approximately 30% lower in performance than the original (no-check) code. Note that slight tweaks were made to the base code: String.format was converted to a StringBuilder, as operations per second jumped when using StringBuilder. Net, I've opted for #2.
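The String.format-to-StringBuilder change mentioned above might look like the following (the message text and method name are illustrative, not the actual code):

```java
public class Messages {
    // Before (slower per the benchmark note above):
    //   return String.format("Invalid control character 0x%x at index %d", cp, idx);
    // After: build the same message with StringBuilder.
    static String controlCharMessage(int codePoint, int index) {
        return new StringBuilder(64)
                .append("Invalid control character 0x")
                .append(Integer.toHexString(codePoint))
                .append(" at index ")
                .append(index)
                .toString();
    }
}
```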
Reject FHIR resources with control chars #2605
Confirmed the functionality is working as expected. Request that the FHIR config parameter and new error text be modified to indicate it is a check for "control characters", which covers not only Unicode but also ASCII control characters.
Retested after the requested changes were made. Confirmed the behavior of the new config parameter fhirServer/core/checkControlCharacters and verified the user's guide updates. Tested responses for \u0000 through \u001F control character inclusion in a searchable string parameter. Six of the control characters (\u000B, \u000C, \u001C, \u001D, \u001E, \u001F) throw an error regardless of the checkControlCharacters setting: whether "checkControlCharacters": true, defaulted, or "checkControlCharacters": false.
Is your feature request related to a problem? Please describe.
During a database reindex, the server began returning 500 Server Errors for a specific resource instance.
It turns out that one of the resources contained a string that included \u0000, but PostgreSQL does not allow the null char in its strings. The original resource content was stored successfully because we gzip the resource contents before saving them to the DB.
However, now the element that contains this character is matching on a new search parameter and so that is what causes the problem.
The spec says this:
Describe the solution you'd like
Strictly enforce this guidance as part of constructing model types that extend String or Uri. Instead of allowing it in resource content but not in extracted fields, just reject resources with it up front.
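A minimal sketch of what construction-time rejection could look like — FhirString here is a stand-in class, not the server's actual model type:

```java
// Hypothetical stand-in for a model type that validates its value at
// construction, rejecting forbidden control characters up front rather
// than allowing them into stored resource content.
public class FhirString {
    private final String value;

    public FhirString(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                throw new IllegalArgumentException(
                    "value contains a forbidden control character at index " + i);
            }
        }
        this.value = value;
    }

    public String getValue() {
        return value;
    }
}
```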
This is the preferred approach because it avoids data quality issues up-front. For example, the NULL char can be used for certain exploits as described at https://owasp.org/www-community/attacks/Embedding_Null_Code
As part of #2424 we are no longer running parse validation when we read resources from the database, so it should be safe to add without introducing an "escape hatch" that disables the added validation.
A consequence of this is that certain actions like Patch or Reindex could fail (potentially in a confusing way) until the previously-ingested data is fixed to avoid these problem characters.
Describe alternatives you've considered
The main alternative is to allow these characters in both resource content and extracted/indexed fields.
In this case, we should still probably add a WARNING to the output so that users are still informed that they are not following this SHOULD clause in the spec.
To work around the specific issues with the \u0000 char in postgresql, we'd want to replace this character with something like https://www.fileformat.info/info/unicode/char/fffd/index.htm.
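That workaround could be as simple as the following sketch (class and method names are illustrative):

```java
public class NullCharSanitizer {
    // Replace the null char with U+FFFD (the Unicode replacement character),
    // which PostgreSQL does accept in its strings.
    public static String sanitize(String s) {
        return s.replace('\u0000', '\uFFFD');
    }
}
```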
Acceptance Criteria
GIVEN a resource contains the \u0000 char
WHEN it's posted to the server
THEN it is rejected with a clear error message
GIVEN a resource has already been ingested with the \u0000 char
WHEN the resource is reindexed
THEN it fails with a clear error message
Additional context
#2424 contains some related discussion, but I don't think that we ever really addressed the root issue.