Some series files from GEO FTP server have invalid UTF-8 characters #1303

Open
arteymix opened this issue Dec 10, 2024 · 1 comment

arteymix commented Dec 10, 2024

I've added a workaround in c3ea1be to skip retrieving samples for these cases, but it would be better to attempt other encodings, even though the XML file clearly declares UTF-8. This also appears to be an issue with older documents: I ran into it while testing the binary search for pre-2006 records.
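A minimal sketch of what "attempt other encodings" could look like: decode strictly as UTF-8 first, and only if that fails retry with a list of fallback charsets. The class and method names (FallbackDecoder, decode) are hypothetical and not part of Gemma, and the choice of ISO-8859-1 as a fallback is an assumption; I have not checked which encoding the affected GEO files actually use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing Gemma class.
final class FallbackDecoder {

    /**
     * Decode the bytes strictly as UTF-8. If they are malformed, retry with each
     * fallback charset in order and return the first successful result.
     * If every charset fails, the original UTF-8 error is rethrown.
     */
    static String decode(byte[] bytes, Charset... fallbacks) throws CharacterCodingException {
        try {
            return strictDecode(bytes, StandardCharsets.UTF_8);
        } catch (CharacterCodingException utf8Error) {
            for (Charset fallback : fallbacks) {
                try {
                    return strictDecode(bytes, fallback);
                } catch (CharacterCodingException ignored) {
                    // this fallback does not fit either, try the next one
                }
            }
            throw utf8Error;
        }
    }

    /** Decode with a decoder that reports errors instead of silently substituting U+FFFD. */
    private static String strictDecode(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        // 0xE9 is "é" in ISO-8859-1 but an incomplete sequence in UTF-8
        byte[] latin1Bytes = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println(decode(latin1Bytes, StandardCharsets.ISO_8859_1));
    }
}
```

ISO-8859-1 maps every byte to a character, so it never fails and only makes sense as the last fallback. The decoded text could then be handed to the XML parser as a character stream, in which case the encoding named in the XML declaration should no longer drive the decoding.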

Example stacktrace:

2024-12-09 16:25:02,527 ERROR 3707466 [main] u.g.c.l.e.g.s.GeoBrowserImpl.fillDetails(601) | Error while processing MINiML for GSE1889, sample details will not be obtained
java.io.IOException: Document at https://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1889/miniml/GSE1889_family.xml.tgz appears to have invalid UTF-8 characters.
        at ubic.gemma.core.loader.expression.geo.service.GeoBrowserImpl.lambda$fetchDetailedGeoSeriesDocument$7(GeoBrowserImpl.java:791) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.util.SimpleRetry.execute(SimpleRetry.java:48) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.loader.expression.geo.service.GeoBrowserImpl.execute(GeoBrowserImpl.java:882) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.loader.expression.geo.service.GeoBrowserImpl.fetchDetailedGeoSeriesDocument(GeoBrowserImpl.java:776) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.loader.expression.geo.service.GeoBrowserImpl.fillDetails(GeoBrowserImpl.java:599) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.loader.expression.geo.service.GeoBrowserImpl.retrieveGeoRecords(GeoBrowserImpl.java:471) ~[gemma-core-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.apps.GeoGrabberCli.lambda$getGeoRecords$1(GeoGrabberCli.java:516) ~[gemma-cli-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.util.SimpleRetry.execute(SimpleRetry.java:48) [gemma-core-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.apps.GeoGrabberCli.getGeoRecords(GeoGrabberCli.java:516) [gemma-cli-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.apps.GeoGrabberCli.browseDatasets(GeoGrabberCli.java:357) [gemma-cli-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.apps.GeoGrabberCli.doWork(GeoGrabberCli.java:288) [gemma-cli-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.util.AbstractCLI.executeCommand(AbstractCLI.java:209) [gemma-cli-1.31.13-SNAPSHOT.jar:?]
        at ubic.gemma.core.apps.GemmaCLI.main(GemmaCLI.java:288) [gemma-cli-1.31.13-SNAPSHOT.jar:?]

https://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1889/miniml/GSE1889_family.xml.tgz

An alternative would be to report these to GEO so they can get their files fixed for good.

@arteymix arteymix changed the title Some series files from GEO FTP have invalid UTF-8 characters Some series files from GEO FTP server have invalid UTF-8 characters Dec 10, 2024
@arteymix

It looks like we already have a workaround in EutilsFetch that consists of stripping the <?xml header; we might be able to reuse that.
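For illustration only, stripping the XML declaration from an already decoded document could be sketched as below. I have not checked how EutilsFetch actually does it, so the class name, method name and regular expression here are all assumptions rather than the existing Gemma code.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the EutilsFetch implementation.
final class XmlDeclarationStripper {

    // An XML declaration at the very start of the document, allowing leading whitespace.
    // Requiring whitespace after "xml" keeps processing instructions such as
    // <?xml-stylesheet ...?> untouched.
    private static final Pattern XML_DECLARATION =
            Pattern.compile("\\A\\s*<\\?xml\\s[^>]*\\?>\\s*");

    /** Return the document without its leading XML declaration, if it has one. */
    static String strip(String xml) {
        return XML_DECLARATION.matcher(xml).replaceFirst("");
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<MINiML/>";
        System.out.println(strip(doc));
    }
}
```

This only removes the declaration; it does not repair invalid bytes, so it would have to be combined with a decoding step that tolerates or replaces them.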
