Parsing NCBI Gene XML exports #30

Open
smoe opened this issue Dec 12, 2024 · 3 comments

@smoe
Contributor

smoe commented Dec 12, 2024

Hello,
Thank you very much for what you have already done. My personal interest is in parsing the XML exports of gene descriptions, such as https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=7161 (Send To File -> XML format).
I would happily contribute whatever is missing for the parsing. Is there an easy way to get me started, or shall I just come up with something that I hope is compatible with your plans?
Many thanks!

This is how it looks:

<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NLM//DTD NCBI-Entrezgene, 21st January 2005//EN" "https://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.dtd">
<Entrezgene-Set>

<Entrezgene>
  <Entrezgene_track-info>
    <Gene-track>
      <Gene-track_geneid>7161</Gene-track_geneid>
      <Gene-track_status value="live">0</Gene-track_status>
      <Gene-track_create-date>
        <Date>
          <Date_std>
            <Date-std>
              <Date-std_year>1998</Date-std_year>
              <Date-std_month>8</Date-std_month>
              <Date-std_day>13</Date-std_day>
            </Date-std>
          </Date_std>
        </Date>
      </Gene-track_create-date>
      <Gene-track_update-date>
        <Date>
          <Date_std>
            <Date-std>
              <Date-std_year>2024</Date-std_year>
              <Date-std_month>12</Date-std_month>
              <Date-std_day>10</Date-std_day>
              <Date-std_hour>8</Date-std_hour>
              <Date-std_minute>46</Date-std_minute>
              <Date-std_second>0</Date-std_second>
            </Date-std>
          </Date_std>
        </Date>
      </Gene-track_update-date>
    </Gene-track>
  </Entrezgene_track-info>
  <Entrezgene_type value="protein-coding">6</Entrezgene_type>
  <Entrezgene_source>
    <BioSource>
      <BioSource_genome value="genomic">1</BioSource_genome>
      <BioSource_origin value="natural">1</BioSource_origin>
      <BioSource_org>
        <Org-ref>
          <Org-ref_taxname>Homo sapiens</Org-ref_taxname>
          <Org-ref_common>human</Org-ref_common>
          <Org-ref_db>
            <Dbtag>
              <Dbtag_db>taxon</Dbtag_db>
              <Dbtag_tag>
                <Object-id>
                  <Object-id_id>9606</Object-id_id>
                </Object-id>
              </Dbtag_tag>
            </Dbtag>
          </Org-ref_db>
          <Org-ref_orgname>
            <OrgName>
              <OrgName_name>
                <OrgName_name_binomial>
                  <BinomialOrgName>
                    <BinomialOrgName_genus>Homo</BinomialOrgName_genus>
                    <BinomialOrgName_species>sapiens</BinomialOrgName_species>
                  </BinomialOrgName>
                </OrgName_name_binomial>
              </OrgName_name>
              <OrgName_attrib>specified</OrgName_attrib>
              <OrgName_lineage>Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo</OrgName_lineage>
              <OrgName_gcode>1</OrgName_gcode>
              <OrgName_mgcode>2</OrgName_mgcode>
              <OrgName_div>PRI</OrgName_div>
            </OrgName>
          </Org-ref_orgname>
        </Org-ref>
      </BioSource_org>
      <BioSource_subtype>
        <SubSource>
          <SubSource_subtype value="chromosome">1</SubSource_subtype>
          <SubSource_name>1</SubSource_name>
        </SubSource>
      </BioSource_subtype>
    </BioSource>
  </Entrezgene_source>
  <Entrezgene_gene>
    <Gene-ref>
      <Gene-ref_locus>TP73</Gene-ref_locus>
      <Gene-ref_desc>tumor protein p73</Gene-ref_desc>
      <Gene-ref_maploc>1p36.32</Gene-ref_maploc>
      <Gene-ref_db>
        <Dbtag>
          <Dbtag_db>HGNC</Dbtag_db>
          <Dbtag_tag>
            <Object-id>
              <Object-id_str>HGNC:12003</Object-id_str>
            </Object-id>
          </Dbtag_tag>
        </Dbtag>
        <Dbtag>
          <Dbtag_db>Ensembl</Dbtag_db>
          <Dbtag_tag>
            <Object-id>
              <Object-id_str>ENSG00000078900</Object-id_str>
            </Object-id>
          </Dbtag_tag>
        </Dbtag>
        <Dbtag>
          <Dbtag_db>MIM</Dbtag_db>
          <Dbtag_tag>
            <Object-id>
              <Object-id_id>601990</Object-id_id>
            </Object-id>
          </Dbtag_tag>
        </Dbtag>
        <Dbtag>
          <Dbtag_db>AllianceGenome</Dbtag_db>
          <Dbtag_tag>
            <Object-id>
              <Object-id_str>HGNC:12003</Object-id_str>
            </Object-id>
          </Dbtag_tag>
        </Dbtag>
      </Gene-ref_db>
      <Gene-ref_syn>
        <Gene-ref_syn_E>P73</Gene-ref_syn_E>
        <Gene-ref_syn_E>CILD47</Gene-ref_syn_E>
      </Gene-ref_syn>
      <Gene-ref_formal-name>
        <Gene-nomenclature>
          <Gene-nomenclature_status value="official"/>
          <Gene-nomenclature_symbol>TP73</Gene-nomenclature_symbol>
          <Gene-nomenclature_name>tumor protein p73</Gene-nomenclature_name>
          <Gene-nomenclature_source>
            <Dbtag>
              <Dbtag_db>HGNC</Dbtag_db>
              <Dbtag_tag>
                <Object-id>
                  <Object-id_str>HGNC:12003</Object-id_str>
                </Object-id>
              </Dbtag_tag>
            </Dbtag>
          </Gene-nomenclature_source>
        </Gene-nomenclature>
      </Gene-ref_formal-name>
    </Gene-ref>
  </Entrezgene_gene>
  <Entrezgene_prot>
    <Prot-ref>
      <Prot-ref_name>
        <Prot-ref_name_E>p53-like transcription factor</Prot-ref_name_E>
        <Prot-ref_name_E>p53-related protein</Prot-ref_name_E>
      </Prot-ref_name>
      <Prot-ref_desc>tumor protein p73</Prot-ref_desc>
    </Prot-ref>
  </Entrezgene_prot>
  <Entrezgene_summary>This gene encodes a member of the p53 family of transcription factors involved in cellular responses to stress and development. It maps to a region on chromosome 1p36 that is frequently deleted in neuroblastoma and other tumors, and thought to contain multiple tumor suppressor genes. The demonstration that this gene is monoallelically expressed (likely from the maternal allele), supports the notion that it is a candidate gene for neuroblastoma. Many transcript variants resulting from alternative splicing and/or use of alternate promoters have been found for this gene, but the biological validity and the full-length nature of some variants have not been determined. [provided by RefSeq, Feb 2011]</Entrezgene_summary>
  <Entrezgene_location>
    <Maps>
      <Maps_display-str>1p36.32</Maps_display-str>
      <Maps_method>
        <Maps_method_map-type value="cyto"/>
      </Maps_method>
    </Maps>
  </Entrezgene_location>
  <Entrezgene_gene-source>
    <Gene-source>
      <Gene-source_src>LocusLink</Gene-source_src>
      <Gene-source_src-int>7161</Gene-source_src-int>
      <Gene-source_src-str2>7161</Gene-source_src-str2>
    </Gene-source>
  </Entrezgene_gene-source>
  <Entrezgene_locus>
    <Gene-commentary>
      <Gene-commentary_type value="genomic">1</Gene-commentary_type>
      <Gene-commentary_heading>Reference GRCh38.p14 Primary Assembly</Gene-commentary_heading>
      <Gene-commentary_label>Chromosome 1 Reference GRCh38.p14 Primary Assembly</Gene-commentary_label>
      <Gene-commentary_accession>NC_000001</Gene-commentary_accession>
      <Gene-commentary_version>11</Gene-commentary_version>
      <Gene-commentary_seqs>
        <Seq-loc>
          <Seq-loc_int>
            <Seq-interval>
              <Seq-interval_from>3652515</Seq-interval_from>
              <Seq-interval_to>3736200</Seq-interval_to>
              <Seq-interval_strand>
                <Na-strand value="plus"/>
              </Seq-interval_strand>
              <Seq-interval_id>
                <Seq-id>
                  <Seq-id_gi>568815597</Seq-id_gi>
                </Seq-id>
              </Seq-interval_id>
            </Seq-interval>
          </Seq-loc_int>
        </Seq-loc>
      </Gene-commentary_seqs>
      <Gene-commentary_products>
        <Gene-commentary>
          <Gene-commentary_type value="mRNA">3</Gene-commentary_type>
          <Gene-commentary_heading>Reference</Gene-commentary_heading>
          <Gene-commentary_label>transcript variant 1</Gene-commentary_label>
          <Gene-commentary_accession>NM_005427</Gene-commentary_accession>
          <Gene-commentary_version>4</Gene-commentary_version>
          <Gene-commentary_genomic-coords>
            <Seq-loc>
              <Seq-loc_mix>
                <Seq-loc-mix>
                  <Seq-loc>
                    <Seq-loc_int>
                      <Seq-interval>
                        <Seq-interval_from>3652515</Seq-interval_from>
                        <Seq-interval_to>3652640</Seq-interval_to>
                        <Seq-interval_strand>
                          <Na-strand value="plus"/>
                        </Seq-interval_strand>
                        <Seq-interval_id>
                          <Seq-id>
                            <Seq-id_gi>568815597</Seq-id_gi>
                          </Seq-id>
                        </Seq-interval_id>
                      </Seq-interval>
                    </Seq-loc_int>
                  </Seq-loc>
...
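
The two Gene-track dates in the sample carry different fields: the create date stops at the day, while the update date also has hour, minute and second. Below is a minimal Rust sketch of how that could be modeled with optional fields. The type name `DateStd`, its field names and the two `sample_*` helpers are hypothetical and not taken from the crate, and only the fields visible in the sample are modeled.

```rust
/// Sketch of a `Date-std` element, limited to the fields seen in the sample.
/// All names here are illustrative assumptions, not the crate's API.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct DateStd {
    pub year: u16,
    pub month: Option<u8>,
    pub day: Option<u8>,
    pub hour: Option<u8>,
    pub minute: Option<u8>,
    pub second: Option<u8>,
}

impl DateStd {
    /// Render the fields that are present as a compact string.
    /// A missing second is printed as 00 when hour and minute exist.
    pub fn to_display(&self) -> String {
        let mut out = format!("{:04}", self.year);
        if let Some(month) = self.month {
            out.push_str(&format!("-{:02}", month));
            if let Some(day) = self.day {
                out.push_str(&format!("-{:02}", day));
            }
        }
        if let (Some(hour), Some(minute)) = (self.hour, self.minute) {
            let second = self.second.unwrap_or(0);
            out.push_str(&format!(" {:02}:{:02}:{:02}", hour, minute, second));
        }
        out
    }
}

/// Values copied from `Gene-track_create-date` in the sample.
pub fn sample_create_date() -> DateStd {
    DateStd {
        year: 1998,
        month: Some(8),
        day: Some(13),
        ..Default::default()
    }
}

/// Values copied from `Gene-track_update-date` in the sample.
pub fn sample_update_date() -> DateStd {
    DateStd {
        year: 2024,
        month: Some(12),
        day: Some(10),
        hour: Some(8),
        minute: Some(46),
        second: Some(0),
    }
}
```

Using `Option` for everything after the year lets both dates share one type without inventing placeholder values for the absent time fields.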
@PoorRican
Owner

@smoe I welcome any changes you see fit. I tried to adapt the original ASN files as closely as possible. Many of the Entrezgene tags are not included in the Bioseq, but the general.rs and the sequence tags will apply.

Just in case, here are the files you'll need: https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objects/entrezgene/

I'll be happy to answer any questions you might have.
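
To illustrate the kind of shared type the general tags cover, the `Dbtag` elements in the sample wrap an `Object-id` that holds either an integer (`Object-id_id`, e.g. taxon 9606) or a string (`Object-id_str`, e.g. HGNC:12003). A rough Rust sketch of that choice follows. The names `ObjectId`, `Dbtag` and `from_parts` are hypothetical here and may not match what general.rs actually defines, and no XML reader is shown: the caller is assumed to have already extracted the child element name and its text.

```rust
/// Either an integer or a string identifier, as seen under `<Object-id>`.
/// Illustrative only; not necessarily the crate's definition.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ObjectId {
    Id(i64),
    Str(String),
}

/// A database name paired with an identifier in that database.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Dbtag {
    pub db: String,
    pub tag: ObjectId,
}

impl Dbtag {
    /// Build a `Dbtag` from the `Dbtag_db` text, the name of the child
    /// element found inside `<Object-id>`, and that child's text content.
    pub fn from_parts(db: &str, child: &str, text: &str) -> Result<Dbtag, String> {
        let tag = match child {
            "Object-id_id" => {
                // Integer variant: reject text that is not a valid integer.
                let id = text
                    .trim()
                    .parse::<i64>()
                    .map_err(|e| format!("bad Object-id_id {:?}: {}", text, e))?;
                ObjectId::Id(id)
            }
            "Object-id_str" => ObjectId::Str(text.trim().to_string()),
            other => return Err(format!("unexpected Object-id child: {}", other)),
        };
        Ok(Dbtag {
            db: db.to_string(),
            tag,
        })
    }
}
```

Keeping the two variants in one enum means the MIM and taxon entries (integers) and the HGNC and Ensembl entries (strings) can live in the same list of `Dbtag` values.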

@smoe
Contributor Author

smoe commented Dec 13, 2024

@PoorRican Thank you so much. I gave it what I could in #33, which covers all of https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objects/entrezgene/entrezgene.asn minus the XML implementation bits. As I said in the PR, maybe a few magic words from you would help me get on track with what is missing.

@smoe
Contributor Author

smoe commented Dec 14, 2024

Gave scoremat.asn a shot at #34.
