---
title: Data Resources
datatable: true
---

{% assign back = "tuberculosis" %} {% assign next = "index" %} {% include_relative navigation.md %}

There is, however, a large amount of unlinked data available on the web. Here we have investigated two examples, sfaira and TCGA. Both are candidates for linking to other resources to make a single, queryable platform.

Contents:

  1. sfaira
  2. TCGA
    1. All breast cancer records (NEW)
    2. Mutations and slides for TNBC cases



sfaira

sfaira ties together single-cell data from over 170 datasets covering numerous tissue types in both human and mouse. A web portal (https://theislab.github.io/sfaira-portal/) allows filtering the datasets, and Python code is available for downloading and caching the various binary resources that make up a dataset. Downloading and loading the datasets can take significant time. Here, metadata from the {{ site.data.sfaira.datasets | size }} human brain datasets has been parsed using the sfaira Python library:

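    import sfaira

    # datadir, metadir and cachedir are paths to local directories (defined elsewhere).
    ds = sfaira.data.Universe(data_path=datadir, meta_path=metadir, cache_path=cachedir)
    # Restrict the universe to human brain datasets.
    ds.subset(key="organism", values=["Homo sapiens"])
    ds.subset(key="organ", values=["brain"])
    # Fetch and load the raw data; both steps can take significant time.
    ds.download()
    ds.load(verbose=1)
    # Streamline gene features and metadata across the selected datasets.
    ds.streamline_features(match_to_release="104", subset_genes_to_type="protein_coding")
    ds.streamline_metadata(schema="sfaira")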

The table shows how many cells of each type (rows) were found in each dataset (columns). This is just one example of the type of queryable feature that one might want to extract from the datasets. For readability, the dataset columns are labelled with the following numbers:

{% for rec in site.data.sfaira["datasets"] %}
  1. {{ rec.dataset }}
{% endfor %}

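<table id="sfaira_table">
<thead>
<tr><th>cell type</th>{% for rec in site.data.sfaira["datasets"] %}<th>{{ forloop.index }}</th>{% endfor %}</tr>
</thead>
<tbody>
{% for rec in site.data.sfaira["cell_types"] %}
<tr><td>{{ rec.cell_type }}</td>{% for dataset in rec.datasets %}<td>{{ dataset }}</td>{% endfor %}</tr>
{% endfor %}
</tbody>
</table>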
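
The data file behind this table could be assembled along the following lines. This is only a minimal sketch, assuming the cell-type labels of each dataset have already been extracted into plain Python lists. The function `build_table_data`, the dataset names, and the labels are hypothetical; only the output keys (`datasets`, `dataset`, `cell_types`, `cell_type`) are taken from the Liquid loops above.

```python
from collections import Counter


def build_table_data(labels_by_dataset):
    """Pivot {dataset id: [cell type label, ...]} into the layout the template reads."""
    dataset_ids = list(labels_by_dataset)
    # Count how often each cell type occurs in each dataset.
    counts = {ds: Counter(labels) for ds, labels in labels_by_dataset.items()}
    # One table row per cell type seen in any dataset.
    cell_types = sorted({ct for c in counts.values() for ct in c})
    return {
        "datasets": [{"dataset": ds} for ds in dataset_ids],
        "cell_types": [
            # A Counter returns 0 for cell types absent from a dataset.
            {"cell_type": ct, "datasets": [counts[ds][ct] for ds in dataset_ids]}
            for ct in cell_types
        ],
    }


# Made-up labels for two hypothetical datasets, for illustration only.
example = build_table_data({
    "dataset_a": ["neuron", "neuron", "astrocyte"],
    "dataset_b": ["neuron", "microglial cell"],
})
```

Each entry of `cell_types` carries one count per dataset, in the same order as `datasets`, which is what lets the template number the columns by position.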

Contents ↑




TCGA

The Cancer Genome Atlas Program (TCGA) data available from the Genomic Data Commons (GDC) Data Portal poses a similar problem. Though there is a GraphQL API, not all metadata is accessible through it. For example, the table below is critical for distinguishing triple-negative breast cancer (TNBC) cases from other forms of breast cancer.

All breast cancer records {#brca}

The single TSV file nationwidechildrens.org_clinical_patient_brca.txt is attached to each of the more than 1000 breast cancer cases in the GDC Portal, but it must be downloaded separately to interpret the data properly. (Perhaps even more importantly, the interpretations here are those of the authors and may vary from those of domain experts!)

<style scoped>
.dataframe tbody tr th:only-of-type {
    vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
<table id="nb1_table" class="dataframe">
<thead>
<tr>{% for field in site.data.brca.schema.fields %}<th>{{ field.name }}</th>{% endfor %}</tr>
</thead>
<tbody>
{% for rec in site.data.brca.data %}
<tr>{% for field in site.data.brca.schema.fields %}<td>{{ rec[field.name] }}</td>{% endfor %}</tr>
{% endfor %}
</tbody>
</table>

Interpretation:

  • bcr_patient_uuid is the GDC unique identifier for this case and is likely best suited for constructing a unique identifier for this patient.

  • bcr_patient_barcode is the TCGA-submitted identifier that is frequently used in file names, etc. associated with this patient.

  • The 'er_status_by_ihc', 'pr_status_by_ihc', and 'her2_status_by_ihc' columns (or 'breast_carcinoma_estrogen_receptor_status', 'breast_carcinoma_progesterone_receptor_status', and 'lab_proc_her2_neu_immunohistochemistry_receptor_status' respectively) take the values Equivocal, Indeterminate, Negative, Positive, [Not Available], and [Not Evaluated]. Below, entries with Negative in all three columns are taken to be TNBC. (See https://www.biostars.org/p/279048/ for details.)

  • surgical_procedure_first with values like "Lumpectomy" and "Modified Radical Mastectomy" can be mapped to SNOMED 392021009 and 172043006.

  • method_initial_path_dx values, however, can only partially be mapped to SNOMED. 'Core needle biopsy' is 9911007 and 'Fine needle aspiration biopsy' is 48635004 but others are less clear:

    • 'Cytology (e.g. Peritoneal or pleural fluid)'
    • 'Excisional Biopsy'
    • 'Incisional Biopsy'
    • 'Other method, specify:'
    • 'Tumor resection'
    • '[Discrepancy]'
    • '[Not Available]'
  • method_initial_path_dx_other values have similar issues, plus misspellings ("Patey's Suregery" vs. "Patey's Surgery") and differences in capitalization ('SKIN BIOPSY' vs. 'Skin biopsy').
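
The interpretation above can be sketched in Python. This is a rough illustration, not the authors' actual processing: it assumes the clinical file has already been reduced to a single header row that uses the short column names, and the helper names (`is_tnbc`, `snomed_code`, `tnbc_cases`) and the sample rows are hypothetical. The SNOMED codes are the ones quoted in the list above.

```python
import csv
import io

# Short receptor-status column names, as listed in the interpretation above.
RECEPTOR_COLUMNS = ("er_status_by_ihc", "pr_status_by_ihc", "her2_status_by_ihc")

# SNOMED codes quoted above, keyed by lower-cased value.
SNOMED_CODES = {
    "lumpectomy": "392021009",
    "modified radical mastectomy": "172043006",
    "core needle biopsy": "9911007",
    "fine needle aspiration biopsy": "48635004",
}


def is_tnbc(row):
    """A case counts as TNBC only if all three receptor statuses are Negative."""
    return all(row[col] == "Negative" for col in RECEPTOR_COLUMNS)


def snomed_code(value):
    """Look up a code ignoring case and surrounding spaces; None if unmapped."""
    return SNOMED_CODES.get(value.strip().lower())


def tnbc_cases(tsv_file):
    """Return the rows of a tab-separated clinical table that qualify as TNBC."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [row for row in reader if is_tnbc(row)]


# Made-up rows with placeholder identifiers, for illustration only.
SAMPLE = "\n".join([
    "\t".join(["bcr_patient_barcode", *RECEPTOR_COLUMNS, "surgical_procedure_first"]),
    "\t".join(["CASE-1", "Negative", "Negative", "Negative", "Lumpectomy"]),
    "\t".join(["CASE-2", "Positive", "Negative", "Negative", "Modified Radical Mastectomy"]),
    "\t".join(["CASE-3", "Negative", "Negative", "[Not Evaluated]", "Lumpectomy"]),
])

cases = tnbc_cases(io.StringIO(SAMPLE))
```

Comparing lower-cased values handles the capitalization differences, but misspellings and the free-text categories would still need a manually curated mapping.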

Mutations and slides for TNBC cases {#tnbc}

Once the above table has been used to identify the {{ site.data.tcga | size }} TNBC cases, the GraphQL API can be used to check for other data features, e.g., whether a high-impact somatic mutation was associated with the case or whether images of tissue slides are available:

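<table id="tcga_table">
<thead>
<tr><th>TCGA Case</th><th>High-impact somatic mutations</th><th>Images available</th></tr>
</thead>
<tbody>
{% for rec in site.data.tcga %}
<tr>
<td>{{rec.case}}</td>
<td>{% for gene in rec.genes%} {{gene["symbol"]}} {% unless forloop.last %},{% endunless %} {% endfor %}</td>
<td>{% if rec.slides %} yes {% else %} no {%endif %}</td>
</tr>
{% endfor %}
</tbody>
</table>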

Contents ↑

{% include_relative navigation.md %}

<script>
$(document).ready( function () { $('#tcga_table').DataTable(); } );
$(document).ready( function () { $('#sfaira_table').DataTable(); } );
$(document).ready( function () { $('#nb1_table').DataTable( { "scrollX": true }); } );
</script>