eQTL catalog IDs are not URL-safe #3619

Open
d0choa opened this issue Nov 8, 2024 · 3 comments · May be fixed by opentargets/gentropy#971
Labels
Data Relates to Open Targets data team Genetics Relates to Open Targets genetics team

Comments


d0choa commented Nov 8, 2024

As reported by @carcruz, some study identifiers derived from the eQTL catalogue are not URL-safe.

For example:

https://ot-platform-partner.netlify.app/study/Sun_2018_aptamer_plasma_TNFRSF1A.2654.19.1..1

Currently, this causes some processes to crash on localhost, but it is not a problem when deployed. It is therefore not a critical priority, but making these IDs URL-safe would help avoid future surprises.

@carcruz can you confirm that the "." characters are the offending ones?

@d0choa d0choa added the Data Relates to Open Targets data team label Nov 8, 2024

carcruz commented Dec 2, 2024

At the moment, this is not a bug in the PPP preview. The error happens only in our local development environment: we cannot navigate directly to these pages because the dev server responds with the error "bad URL provided".

As far as I can tell from the dev error and some quick research, the cause is two consecutive dots ("..") in the ID.


DSuveges commented Jan 13, 2025

When using the standard urllib method in Python, the URL-safe IDs look like this:

-RECORD 0-----------------------------------------------------------------------------------------------
 studyId  | GTEx_leafcutter_fibroblast_14:29656499:29663699:clu_59352_-                                 
 url_safe | GTEx_leafcutter_fibroblast_14%3A29656499%3A29663699%3Aclu_59352_-                           
-RECORD 1-----------------------------------------------------------------------------------------------
 studyId  | FUSION_leafcutter_adipose_naive_1:22027534:22030280:clu_10161_-                             
 url_safe | FUSION_leafcutter_adipose_naive_1%3A22027534%3A22030280%3Aclu_10161_-                       
-RECORD 2-----------------------------------------------------------------------------------------------
 studyId  | GTEx_leafcutter_skin_not_sun_exposed_14:92117642:92118849:clu_62062_-                       
 url_safe | GTEx_leafcutter_skin_not_sun_exposed_14%3A92117642%3A92118849%3Aclu_62062_-                 
-RECORD 3-----------------------------------------------------------------------------------------------
 studyId  | GTEx_leafcutter_colon_sigmoid_19:9752440:9758059:clu_14145_-                                
 url_safe | GTEx_leafcutter_colon_sigmoid_19%3A9752440%3A9758059%3Aclu_14145_-                          
-RECORD 4-----------------------------------------------------------------------------------------------
 studyId  | GTEx_leafcutter_adrenal_gland_1:25304821:25318063:clu_9963_-                                
 url_safe | GTEx_leafcutter_adrenal_gland_1%3A25304821%3A25318063%3Aclu_9963_-                          
-RECORD 5-----------------------------------------------------------------------------------------------
 studyId  | GTEx_leafcutter_brain_nucleus_accumbens_6:32518666:32520164:clu_6141_-                      
 url_safe | GTEx_leafcutter_brain_nucleus_accumbens_6%3A32518666%3A32520164%3Aclu_6141_-                
-RECORD 6-----------------------------------------------------------------------------------------------
 studyId  | Nathan_2022_ge_CD4+_Th2_ENSG00000129521                                                     
 url_safe | Nathan_2022_ge_CD4%2B_Th2_ENSG00000129521                                                   
-RECORD 7-----------------------------------------------------------------------------------------------
 studyId  | Alasoo_2018_leafcutter_macrophage_IFNg+Salmonella_6:32580856:32581557:clu_23659_+           
 url_safe | Alasoo_2018_leafcutter_macrophage_IFNg%2BSalmonella_6%3A32580856%3A32581557%3Aclu_23659_%2B 
-RECORD 8-----------------------------------------------------------------------------------------------
 studyId  | BrainSeq_leafcutter_brain_7:158922538:158923571:clu_47112_+                                 
 url_safe | BrainSeq_leafcutter_brain_7%3A158922538%3A158923571%3Aclu_47112_%2B                         
-RECORD 9-----------------------------------------------------------------------------------------------
 studyId  | GTEx_leafcutter_skin_sun_exposed_5:57494225:57496825:clu_37602_+                            
 url_safe | GTEx_leafcutter_skin_sun_exposed_5%3A57494225%3A57496825%3Aclu_37602_%2B                    
-RECORD 10----------------------------------------------------------------------------------------------
 studyId  | GTEx_leafcutter_pancreas_7:6577498:6578566:clu_51411_+                                      
 url_safe | GTEx_leafcutter_pancreas_7%3A6577498%3A6578566%3Aclu_51411_%2B                              
-RECORD 11----------------------------------------------------------------------------------------------
 studyId  | ROSMAP_leafcutter_brain_naive_22:32368567:32377261:clu_53975_+                              
 url_safe | ROSMAP_leafcutter_brain_naive_22%3A32368567%3A32377261%3Aclu_53975_%2B                      
-RECORD 12----------------------------------------------------------------------------------------------
 studyId  | GTEx_leafcutter_artery_coronary_11:70153128:70156947:clu_58600_+                            
 url_safe | GTEx_leafcutter_artery_coronary_11%3A70153128%3A70156947%3Aclu_58600_%2B                    
only showing top 13 rows

Problematic characters are converted to their URL-encoded form, e.g. : -> %3A or + -> %2B. Alternatively, we can just remove these characters from the studyIds.
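
For reference, the encoding shown above can be reproduced with a minimal sketch like the one below. It assumes the "standard urllib method" is `urllib.parse.quote` with no extra safe characters; the helper name `to_url_safe` is hypothetical and not taken from the codebase.

```python
from urllib.parse import quote, unquote


def to_url_safe(study_id: str) -> str:
    """Percent-encode a study ID (hypothetical helper, not the pipeline's code).

    safe="" means no reserved character is left unescaped.
    Letters, digits and "_", ".", "-", "~" are never escaped by quote().
    """
    return quote(study_id, safe="")


if __name__ == "__main__":
    for study_id in (
        "Nathan_2022_ge_CD4+_Th2_ENSG00000129521",
        "GTEx_leafcutter_fibroblast_14:29656499:29663699:clu_59352_-",
        "Sun_2018_aptamer_plasma_TNFRSF1A.2654.19.1..1",
    ):
        encoded = to_url_safe(study_id)
        # Percent-encoding is reversible, so the original ID can be recovered.
        assert unquote(encoded) == study_id
        print(study_id, "->", encoded)
```

Note that `quote` leaves "." untouched, so under this assumption the dotted ID from the original report would come out unchanged.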

@DSuveges

Update: After reviewing the dataset produced by Python's built-in URL sanitiser, it was clear that this approach would cause more pain in the long run, so we decided to replace all non-alphanumeric characters (and -) with an underscore instead. This method is now integrated into the QTL study ingestion. As the problem is quite specific to QTL studies, the sanitiser method was not added to the studyIndex dataclass.
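
A rough sketch of what such a sanitiser could look like is given below. This is not the code that was merged: the function name `sanitise_study_id` is hypothetical, and the sketch assumes that every character outside A-Z, a-z and 0-9 (including "-", "+", ":" and ".") is replaced by a single underscore.

```python
import re

# Assumption: anything that is not an ASCII letter or digit gets replaced.
_NON_ALPHANUMERIC = re.compile(r"[^0-9A-Za-z]")


def sanitise_study_id(study_id: str) -> str:
    """Replace each non-alphanumeric character with "_" (hypothetical helper)."""
    return _NON_ALPHANUMERIC.sub("_", study_id)


if __name__ == "__main__":
    for study_id in (
        "Sun_2018_aptamer_plasma_TNFRSF1A.2654.19.1..1",
        "Nathan_2022_ge_CD4+_Th2_ENSG00000129521",
    ):
        print(study_id, "->", sanitise_study_id(study_id))
```

Unlike percent-encoding, this mapping is not reversible. Under the assumption made here, two IDs that differ only in a trailing + or - would map to the same string, so the actual implementation may treat some characters differently.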

@prashantuniyal02 prashantuniyal02 added the Genetics Relates to Open Targets genetics team label Jan 20, 2025