I've added a workaround in c3ea1be to skip retrieving samples for these cases, but it would be best to attempt other encodings. The XML file clearly declares UTF-8 as its encoding. This also appears to be an issue with older documents, as I've been testing the binary search for pre-2006 records.
Example stacktrace:
2024-12-09 16:25:02,527 ERROR 3707466 [main] u.g.c.l.e.g.s.GeoBrowserImpl.fillDetails(601) | Error while processing MINiML for GSE1889, sample details will not be obtained
java.io.IOException: Document at https://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1889/miniml/GSE1889_family.xml.tgz appears to have invalid UTF-8 characters.
at ubic.gemma.core.loader.expression.geo.service.GeoBrowserImpl.lambda$fetchDetailedGeoSeriesDocument$7(GeoBrowserImpl.java:791) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.util.SimpleRetry.execute(SimpleRetry.java:48) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.loader.expression.geo.service.GeoBrowserImpl.execute(GeoBrowserImpl.java:882) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.loader.expression.geo.service.GeoBrowserImpl.fetchDetailedGeoSeriesDocument(GeoBrowserImpl.java:776) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.loader.expression.geo.service.GeoBrowserImpl.fillDetails(GeoBrowserImpl.java:599) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.loader.expression.geo.service.GeoBrowserImpl.retrieveGeoRecords(GeoBrowserImpl.java:471) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.apps.GeoGrabberCli.lambda$getGeoRecords$1(GeoGrabberCli.java:516) ~[gemma-cli-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.util.SimpleRetry.execute(SimpleRetry.java:48) [gemma-core-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.apps.GeoGrabberCli.getGeoRecords(GeoGrabberCli.java:516) [gemma-cli-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.apps.GeoGrabberCli.browseDatasets(GeoGrabberCli.java:357) [gemma-cli-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.apps.GeoGrabberCli.doWork(GeoGrabberCli.java:288) [gemma-cli-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.util.AbstractCLI.executeCommand(AbstractCLI.java:209) [gemma-cli-1.31.13-SNAPSHOT.jar:?]
at ubic.gemma.core.apps.GemmaCLI.main(GemmaCLI.java:288) [gemma-cli-1.31.13-SNAPSHOT.jar:?]
arteymix changed the title from "Some series files from GEO FTP have invalid UTF-8 characters" to "Some series files from GEO FTP server have invalid UTF-8 characters" on Dec 10, 2024
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1889/miniml/GSE1889_family.xml.tgz
An alternative would be to report these to GEO so they can get their files fixed for good.
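For the "attempt other encodings" idea, here is a minimal sketch of a strict-then-fallback decode. It is not Gemma's code: `FallbackDecoder`, `CANDIDATES` and `decode` are hypothetical names, and the choice of ISO-8859-1 as the fallback is only a guess at what the older GEO submissions might contain. It assumes the MINiML bytes have already been extracted from the `.tgz`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Gemma.
final class FallbackDecoder {

    // Encodings tried in order. ISO-8859-1 maps every byte to a character,
    // so it never fails and acts as the last resort (assumed fallback).
    static final List<Charset> CANDIDATES = Arrays.asList(
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);

    // Decode strictly with each candidate and return the first clean result.
    static String decode(byte[] bytes) {
        CharacterCodingException last = null;
        for (Charset charset : CANDIDATES) {
            try {
                return charset.newDecoder()
                        // REPORT makes invalid sequences throw instead of
                        // being silently replaced with U+FFFD.
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes))
                        .toString();
            } catch (CharacterCodingException e) {
                last = e; // try the next candidate
            }
        }
        throw new IllegalArgumentException("No candidate charset could decode the input", last);
    }
}
```

Since the XML declaration still says UTF-8, the decoded string would probably have to be handed to the parser through a `Reader` rather than as a raw byte stream, so that the parser does not try to decode the bytes again on its own. A fallback result is also only a best guess: the bytes decode without error, but the characters may not be what the submitter intended.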