Parsing NCBI Gene XML exports #30

smoe · 2024-12-12T16:56:40Z

Hello,
Thank you very much for what you have already done. My personal interest is in parsing XML exports of gene descriptions, like https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=7161 (Send To File -> XML format).
I would happily contribute what is missing to help with the parsing. Is there an easy way to get me going, or shall I just come up with something that I hope is compatible with your plans?
Many thanks!

This is how it looks:

<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NLM//DTD NCBI-Entrezgene, 21st January 2005//EN" "https://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.dtd">
<Entrezgene-Set>

<Entrezgene>
  <Entrezgene_track-info>
    <Gene-track>
      <Gene-track_geneid>7161</Gene-track_geneid>
      <Gene-track_status value="live">0</Gene-track_status>
      <Gene-track_create-date>
        <Date>
          <Date_std>
            <Date-std>
              <Date-std_year>1998</Date-std_year>
              <Date-std_month>8</Date-std_month>
              <Date-std_day>13</Date-std_day>
            </Date-std>
          </Date_std>
        </Date>
      </Gene-track_create-date>
      <Gene-track_update-date>
        <Date>
          <Date_std>
            <Date-std>
              <Date-std_year>2024</Date-std_year>
              <Date-std_month>12</Date-std_month>
              <Date-std_day>10</Date-std_day>
              <Date-std_hour>8</Date-std_hour>
              <Date-std_minute>46</Date-std_minute>
              <Date-std_second>0</Date-std_second>
            </Date-std>
          </Date_std>
        </Date>
      </Gene-track_update-date>
    </Gene-track>
  </Entrezgene_track-info>
  <Entrezgene_type value="protein-coding">6</Entrezgene_type>
  <Entrezgene_source>
    <BioSource>
      <BioSource_genome value="genomic">1</BioSource_genome>
      <BioSource_origin value="natural">1</BioSource_origin>
      <BioSource_org>
        <Org-ref>
          <Org-ref_taxname>Homo sapiens</Org-ref_taxname>
          <Org-ref_common>human</Org-ref_common>
          <Org-ref_db>
            <Dbtag>
              <Dbtag_db>taxon</Dbtag_db>
              <Dbtag_tag>
                <Object-id>
                  <Object-id_id>9606</Object-id_id>
                </Object-id>
              </Dbtag_tag>
            </Dbtag>
          </Org-ref_db>
          <Org-ref_orgname>
            <OrgName>
              <OrgName_name>
                <OrgName_name_binomial>
                  <BinomialOrgName>
                    <BinomialOrgName_genus>Homo</BinomialOrgName_genus>
                    <BinomialOrgName_species>sapiens</BinomialOrgName_species>
                  </BinomialOrgName>
                </OrgName_name_binomial>
              </OrgName_name>
              <OrgName_attrib>specified</OrgName_attrib>
              <OrgName_lineage>Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo</OrgName_lineage>
              <OrgName_gcode>1</OrgName_gcode>
              <OrgName_mgcode>2</OrgName_mgcode>
              <OrgName_div>PRI</OrgName_div>
            </OrgName>
          </Org-ref_orgname>
        </Org-ref>
      </BioSource_org>
      <BioSource_subtype>
        <SubSource>
          <SubSource_subtype value="chromosome">1</SubSource_subtype>
          <SubSource_name>1</SubSource_name>
        </SubSource>
      </BioSource_subtype>
    </BioSource>
  </Entrezgene_source>
  <Entrezgene_gene>
    <Gene-ref>
      <Gene-ref_locus>TP73</Gene-ref_locus>
      <Gene-ref_desc>tumor protein p73</Gene-ref_desc>
      <Gene-ref_maploc>1p36.32</Gene-ref_maploc>
      <Gene-ref_db>
        <Dbtag>
          <Dbtag_db>HGNC</Dbtag_db>
          <Dbtag_tag>
            <Object-id>
              <Object-id_str>HGNC:12003</Object-id_str>
            </Object-id>
          </Dbtag_tag>
        </Dbtag>
        <Dbtag>
          <Dbtag_db>Ensembl</Dbtag_db>
          <Dbtag_tag>
            <Object-id>
              <Object-id_str>ENSG00000078900</Object-id_str>
            </Object-id>
          </Dbtag_tag>
        </Dbtag>
        <Dbtag>
          <Dbtag_db>MIM</Dbtag_db>
          <Dbtag_tag>
            <Object-id>
              <Object-id_id>601990</Object-id_id>
            </Object-id>
          </Dbtag_tag>
        </Dbtag>
        <Dbtag>
          <Dbtag_db>AllianceGenome</Dbtag_db>
          <Dbtag_tag>
            <Object-id>
              <Object-id_str>HGNC:12003</Object-id_str>
            </Object-id>
          </Dbtag_tag>
        </Dbtag>
      </Gene-ref_db>
      <Gene-ref_syn>
        <Gene-ref_syn_E>P73</Gene-ref_syn_E>
        <Gene-ref_syn_E>CILD47</Gene-ref_syn_E>
      </Gene-ref_syn>
      <Gene-ref_formal-name>
        <Gene-nomenclature>
          <Gene-nomenclature_status value="official"/>
          <Gene-nomenclature_symbol>TP73</Gene-nomenclature_symbol>
          <Gene-nomenclature_name>tumor protein p73</Gene-nomenclature_name>
          <Gene-nomenclature_source>
            <Dbtag>
              <Dbtag_db>HGNC</Dbtag_db>
              <Dbtag_tag>
                <Object-id>
                  <Object-id_str>HGNC:12003</Object-id_str>
                </Object-id>
              </Dbtag_tag>
            </Dbtag>
          </Gene-nomenclature_source>
        </Gene-nomenclature>
      </Gene-ref_formal-name>
    </Gene-ref>
  </Entrezgene_gene>
  <Entrezgene_prot>
    <Prot-ref>
      <Prot-ref_name>
        <Prot-ref_name_E>p53-like transcription factor</Prot-ref_name_E>
        <Prot-ref_name_E>p53-related protein</Prot-ref_name_E>
      </Prot-ref_name>
      <Prot-ref_desc>tumor protein p73</Prot-ref_desc>
    </Prot-ref>
  </Entrezgene_prot>
  <Entrezgene_summary>This gene encodes a member of the p53 family of transcription factors involved in cellular responses to stress and development. It maps to a region on chromosome 1p36 that is frequently deleted in neuroblastoma and other tumors, and thought to contain multiple tumor suppressor genes. The demonstration that this gene is monoallelically expressed (likely from the maternal allele), supports the notion that it is a candidate gene for neuroblastoma. Many transcript variants resulting from alternative splicing and/or use of alternate promoters have been found for this gene, but the biological validity and the full-length nature of some variants have not been determined. [provided by RefSeq, Feb 2011]</Entrezgene_summary>
  <Entrezgene_location>
    <Maps>
      <Maps_display-str>1p36.32</Maps_display-str>
      <Maps_method>
        <Maps_method_map-type value="cyto"/>
      </Maps_method>
    </Maps>
  </Entrezgene_location>
  <Entrezgene_gene-source>
    <Gene-source>
      <Gene-source_src>LocusLink</Gene-source_src>
      <Gene-source_src-int>7161</Gene-source_src-int>
      <Gene-source_src-str2>7161</Gene-source_src-str2>
    </Gene-source>
  </Entrezgene_gene-source>
  <Entrezgene_locus>
    <Gene-commentary>
      <Gene-commentary_type value="genomic">1</Gene-commentary_type>
      <Gene-commentary_heading>Reference GRCh38.p14 Primary Assembly</Gene-commentary_heading>
      <Gene-commentary_label>Chromosome 1 Reference GRCh38.p14 Primary Assembly</Gene-commentary_label>
      <Gene-commentary_accession>NC_000001</Gene-commentary_accession>
      <Gene-commentary_version>11</Gene-commentary_version>
      <Gene-commentary_seqs>
        <Seq-loc>
          <Seq-loc_int>
            <Seq-interval>
              <Seq-interval_from>3652515</Seq-interval_from>
              <Seq-interval_to>3736200</Seq-interval_to>
              <Seq-interval_strand>
                <Na-strand value="plus"/>
              </Seq-interval_strand>
              <Seq-interval_id>
                <Seq-id>
                  <Seq-id_gi>568815597</Seq-id_gi>
                </Seq-id>
              </Seq-interval_id>
            </Seq-interval>
          </Seq-loc_int>
        </Seq-loc>
      </Gene-commentary_seqs>
      <Gene-commentary_products>
        <Gene-commentary>
          <Gene-commentary_type value="mRNA">3</Gene-commentary_type>
          <Gene-commentary_heading>Reference</Gene-commentary_heading>
          <Gene-commentary_label>transcript variant 1</Gene-commentary_label>
          <Gene-commentary_accession>NM_005427</Gene-commentary_accession>
          <Gene-commentary_version>4</Gene-commentary_version>
          <Gene-commentary_genomic-coords>
            <Seq-loc>
              <Seq-loc_mix>
                <Seq-loc-mix>
                  <Seq-loc>
                    <Seq-loc_int>
                      <Seq-interval>
                        <Seq-interval_from>3652515</Seq-interval_from>
                        <Seq-interval_to>3652640</Seq-interval_to>
                        <Seq-interval_strand>
                          <Na-strand value="plus"/>
                        </Seq-interval_strand>
                        <Seq-interval_id>
                          <Seq-id>
                            <Seq-id_gi>568815597</Seq-id_gi>
                          </Seq-id>
                        </Seq-interval_id>
                      </Seq-interval>
                    </Seq-loc_int>
                  </Seq-loc>
...
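
For anyone who wants to poke at such an export right away, here is a minimal sketch using only Python's standard library. It is independent of this project and not a proposal for its API. The embedded string is a trimmed, well-formed excerpt of the record above (DOCTYPE left out), and the names `parse_date_std` and `summarize` are made up for illustration. It also shows that the time fields of `Date-std` have to be treated as optional, since the create date above carries none.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Trimmed, well-formed excerpt of the export above (illustration only).
SAMPLE = """<Entrezgene-Set>
<Entrezgene>
  <Entrezgene_track-info>
    <Gene-track>
      <Gene-track_geneid>7161</Gene-track_geneid>
      <Gene-track_status value="live">0</Gene-track_status>
      <Gene-track_update-date>
        <Date><Date_std><Date-std>
          <Date-std_year>2024</Date-std_year>
          <Date-std_month>12</Date-std_month>
          <Date-std_day>10</Date-std_day>
          <Date-std_hour>8</Date-std_hour>
          <Date-std_minute>46</Date-std_minute>
          <Date-std_second>0</Date-std_second>
        </Date-std></Date_std></Date>
      </Gene-track_update-date>
    </Gene-track>
  </Entrezgene_track-info>
  <Entrezgene_type value="protein-coding">6</Entrezgene_type>
  <Entrezgene_gene>
    <Gene-ref>
      <Gene-ref_locus>TP73</Gene-ref_locus>
      <Gene-ref_desc>tumor protein p73</Gene-ref_desc>
      <Gene-ref_syn>
        <Gene-ref_syn_E>P73</Gene-ref_syn_E>
        <Gene-ref_syn_E>CILD47</Gene-ref_syn_E>
      </Gene-ref_syn>
    </Gene-ref>
  </Entrezgene_gene>
</Entrezgene>
</Entrezgene-Set>"""


def parse_date_std(node):
    """Build a datetime from a Date-std element; missing fields get defaults."""
    def field(name, default=0):
        text = node.findtext(f"Date-std_{name}")
        return int(text) if text is not None else default

    return datetime(field("year"), field("month", 1), field("day", 1),
                    field("hour"), field("minute"), field("second"))


def summarize(gene):
    """Pull a few headline fields out of one Entrezgene element."""
    track = gene.find("Entrezgene_track-info/Gene-track")
    ref = gene.find("Entrezgene_gene/Gene-ref")
    updated = track.find("Gene-track_update-date/Date/Date_std/Date-std")
    return {
        "gene_id": int(track.findtext("Gene-track_geneid")),
        # Enumerated values carry their readable name in the "value" attribute.
        "status": track.find("Gene-track_status").get("value"),
        "type": gene.find("Entrezgene_type").get("value"),
        "symbol": ref.findtext("Gene-ref_locus"),
        "description": ref.findtext("Gene-ref_desc"),
        "synonyms": [e.text for e in ref.findall("Gene-ref_syn/Gene-ref_syn_E")],
        "updated": parse_date_std(updated),
    }


root = ET.fromstring(SAMPLE)
records = [summarize(g) for g in root.findall("Entrezgene")]
print(records[0])
```

An `Entrezgene-Set` can hold several `Entrezgene` records, which is why the sketch returns a list even though this export contains only one.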


PoorRican · 2024-12-12T23:11:08Z

@smoe I welcome any changes you see fit. I tried to adapt the original ASN files as closely as possible. Many of the Entrezgene tags are not included in the Bioseq, but general.rs and the sequence tags will apply.

Just in case, here are the files you'll need: https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objects/entrezgene/

I'll be happy to answer any questions you might have.
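
To illustrate why the general types carry over: the `Dbtag` elements in the sample above wrap an `Object-id` that is either an integer (`Object-id_id`, as for MIM) or a string (`Object-id_str`, as for HGNC). A rough Python sketch of handling that choice follows, using only the standard library. The `Dbtag` dataclass and `parse_dbtag` are hypothetical names for illustration and say nothing about how general.rs models it.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Union

# Two cross-references copied from the sample above.
SAMPLE = """<Gene-ref_db>
  <Dbtag>
    <Dbtag_db>HGNC</Dbtag_db>
    <Dbtag_tag>
      <Object-id>
        <Object-id_str>HGNC:12003</Object-id_str>
      </Object-id>
    </Dbtag_tag>
  </Dbtag>
  <Dbtag>
    <Dbtag_db>MIM</Dbtag_db>
    <Dbtag_tag>
      <Object-id>
        <Object-id_id>601990</Object-id_id>
      </Object-id>
    </Dbtag_tag>
  </Dbtag>
</Gene-ref_db>"""


@dataclass(frozen=True)
class Dbtag:
    """Hypothetical container: database name plus an int-or-str identifier."""
    db: str
    tag: Union[int, str]


def parse_dbtag(node):
    """Read one Dbtag element, keeping integer ids as int and the rest as str."""
    obj = node.find("Dbtag_tag/Object-id")
    as_int = obj.findtext("Object-id_id")
    if as_int is not None:
        tag = int(as_int)
    else:
        tag = obj.findtext("Object-id_str")
    return Dbtag(db=node.findtext("Dbtag_db"), tag=tag)


xrefs = [parse_dbtag(n) for n in ET.fromstring(SAMPLE).findall("Dbtag")]
print(xrefs)
```

Keeping the two variants distinct avoids guessing later whether `601990` was meant as a number or as text.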

smoe · 2024-12-13T17:57:01Z

@PoorRican Thank you so much. I gave it what I could in #33, which covers all of https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objects/entrezgene/entrezgene.asn minus the XML implementation bits. As I said in the PR, maybe a few magic words from you would help me get on track with what is missing.

smoe · 2024-12-14T17:36:58Z

Gave scoremat.asn a shot at #34.
