Question about .feather output of create_cistarget_motif_databases.py #550

Open
sknaack opened this issue Feb 27, 2025 · 2 comments

sknaack commented Feb 27, 2025

Dear SCENIC+ folks, I have a brief question about the .feather outputs of create_cistarget_motif_databases.py. In principle, these files contain information that maps genomic loci to a certain set of motifs/TFs. I anticipated metadata reflecting this, nominally in the row names. However, when I looked at the example hg38 files at https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/ I did not find row-wise metadata relating to the specific motifs or TFs. Is that expected? Am I looking in the wrong place in these .feather files, or should I be looking in other output files? I'd like to trace specific genomic loci to motifs and TFs at this (human-readable) level for sanity checks, and it didn't seem trivially possible at this step. Thanks in advance, Sara

Member

ghuls commented Feb 28, 2025

Motif names are in the last column of the dataframe ("motifs"). Every other column is named after a region and contains the scores or rankings for that region, one value per motif.

import polars as pl

# Get schema of Feather file.
feather_schema = pl.read_ipc_schema("/databases/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.scores.feather")

In [14]: list(feather_schema)[0:10]
Out[14]: 
['chr10:100000176-100000504',
 'chr10:100001759-100001930',
 'chr10:100004841-100005148',
 'chr10:100005876-100006219',
 'chr10:100006302-100006644',
 'chr10:100007267-100007608',
 'chr10:100007891-100008108',
 'chr10:100008250-100008544',
 'chr10:100008734-100009017',
 'chr10:100009359-100009597']

In [15]: list(feather_schema)[-10:]
Out[15]: 
['chrY:9623241-9623568',
 'chrY:9805269-9805610',
 'chrY:9812117-9812357',
 'chrY:9853800-9854136',
 'chrY:9920102-9920438',
 'chrY:9924284-9924623',
 'chrY:9954840-9955041',
 'chrY:9959132-9959359',
 'chrY:9997981-9998328',
 'motifs']

In [16]: motifs_df = pl.read_ipc("/databases/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.scores.feather", columns=["motifs"])
Could not memory_map compressed IPC file, defaulting to normal read. Toggle off 'memory_map' to silence this warning.

In [17]: motifs_df
Out[17]: 
shape: (5_876, 1)
┌──────────────────────────┐
│ motifs                   │
│ ---                      │
│ str                      │
╞══════════════════════════╡
│ bergman__Su_H_           │
│ bergman__croc            │
│ bergman__pho             │
│ bergman__tll             │
│ c2h2_zfs__M0369          │
│ …                        │
│ yetfasco__TBP-TFIIA_1328 │
│ yetfasco__TBP-TFIIB_1329 │
│ yetfasco__YFL044C_1166   │
│ yetfasco__YGL192W_1000   │
│ yetfasco__YPR086W_1327   │
└──────────────────────────┘

Author

sknaack commented Feb 28, 2025

Thank you so much! This is very helpful to confirm. I am indeed new to this feather format and I wondered if I might be missing something. This should resolve my questions. I imagine it can't hurt to have this documented here in an issue, either; someone will probably find it useful. =-) Thanks again! Have a nice day! Sara
