Enable reading non-utf-8 encodings for java pom.xml files #2047

Merged: wagoodman merged 3 commits into main from fix-pom-xml-decoding on Aug 23, 2023

@wagoodman wagoodman commented Aug 21, 2023

Currently, pom.xml decoding returns an error when the document contains non-UTF-8 characters but has no encoding="..." attribute in its XML declaration. This PR adds encoding detection before the XML decoder runs, so that the reader can be wrapped with an adapter that transforms the input to UTF-8 during Read() calls.

Fixes: #2044

CC: @westonsteimel

@wagoodman wagoodman requested a review from a team August 21, 2023 20:12
@wagoodman wagoodman self-assigned this Aug 21, 2023
github-actions bot commented Aug 21, 2023

Benchmark Test Results

Benchmark results from the latest changes vs base branch
goos: linux
goarch: amd64
pkg: github.com/anchore/syft/test/integration
cpu: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
results file: ./.tmp/benchmark-101466e.txt

Benchmark                                                        sec/op           B/op             allocs/op
ImagePackageCatalogers/alpmdb-cataloger-2                        12.54m ±  3%     5.138Mi ± 0%     88.07k ± 0%
ImagePackageCatalogers/apkdb-cataloger-2                         713.1µ ± 22%     184.3Ki ± 0%     4.034k ± 0%
ImagePackageCatalogers/binary-cataloger-2                        206.2µ ±  2%     30.47Ki ± 0%     848.0 ± 0%
ImagePackageCatalogers/dpkgdb-cataloger-2                        613.4µ ±  2%     141.5Ki ± 0%     2.911k ± 0%
ImagePackageCatalogers/dotnet-portable-executable-cataloger-2    22.61µ ±  2%     3.695Ki ± 0%     132.0 ± 0%
ImagePackageCatalogers/go-module-binary-cataloger-2              98.90µ ±  2%     9.906Ki ± 0%     281.0 ± 0%
ImagePackageCatalogers/java-cataloger-2                          18.46m ±  2%     3.064Mi ± 0%     40.61k ± 0%
ImagePackageCatalogers/graalvm-native-image-cataloger-2          96.26µ ±  1%     8.594Ki ± 0%     228.0 ± 0%
ImagePackageCatalogers/javascript-package-cataloger-2            386.9µ ±  1%     83.80Ki ± 0%     1.264k ± 0%
ImagePackageCatalogers/nix-store-cataloger-2                     280.7µ ±  1%     38.94Ki ± 0%     820.0 ± 0%
ImagePackageCatalogers/php-composer-installed-cataloger-2        806.4µ ±  2%     155.2Ki ± 0%     3.846k ± 0%
ImagePackageCatalogers/portage-cataloger-2                       489.7µ ±  1%     109.8Ki ± 0%     2.194k ± 0%
ImagePackageCatalogers/python-package-cataloger-2                3.381m ±  3%     986.1Ki ± 0%     16.13k ± 0%
ImagePackageCatalogers/r-package-cataloger-2                     204.3µ ±  1%     42.91Ki ± 0%     851.0 ± 0%
ImagePackageCatalogers/rpm-db-cataloger-2                        563.7µ ±  3%     170.9Ki ± 0%     3.914k ± 0%
ImagePackageCatalogers/ruby-gemspec-cataloger-2                  927.5µ ±  0%     123.3Ki ± 0%     2.291k ± 0%
ImagePackageCatalogers/sbom-cataloger-2                          121.4µ ±  6%     14.20Ki ± 0%     394.0 ± 0%
geomean                                                          503.0µ           92.99Ki          1.997k

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
@wagoodman wagoodman force-pushed the fix-pom-xml-decoding branch from 1ac86cf to 7566b29 Compare August 22, 2023 00:27
Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
…test unknown encoding

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
@wagoodman wagoodman merged commit 17d4203 into main Aug 23, 2023
@wagoodman wagoodman deleted the fix-pom-xml-decoding branch August 23, 2023 14:06
GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
* fix reading non utf8 encodings

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* in cases where we cant tell the encoding use the UTF8 replacement char

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* decompose the xml decoding func to get a valid utf8 reader first and test unknown encoding

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

---------

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
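
The second commit message above mentions falling back to the UTF-8 replacement character when the encoding cannot be determined. One way to sketch that fallback with the standard library alone is shown below. The helper name `sanitizeUTF8` is hypothetical, and this is an illustration of the technique rather than the implementation merged here.

```go
package main

import "bytes"

// sanitizeUTF8 returns a copy of raw in which every run of bytes that is not
// valid UTF-8 is replaced by a single U+FFFD replacement character.
// Hypothetical helper: a fallback for input whose encoding is unknown.
func sanitizeUTF8(raw []byte) []byte {
	return bytes.ToValidUTF8(raw, []byte("\uFFFD"))
}
```

The result is always valid UTF-8, so an XML decoder can read it without an encoding error, at the cost of losing the original characters. Note that `bytes.ToValidUTF8` collapses a run of consecutive invalid bytes into one replacement.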
Development

Successfully merging this pull request may close these issues.

Syft seems unable to parse non UTF-8 pom.xml files