EzBioCloud

A Python script designed to streamline bioinformatics analysis and facilitate data extraction from EzBioCloud.

EzBioCloud is a public bioscience data and analytics portal focusing on the taxonomy, ecology, genomics, metagenomics, and microbiome of Bacteria and Archaea.

Unfortunately, EzBioCloud does not provide API keys. To work around this, the script automates large-scale microbiome analysis by driving the website through a Selenium WebDriver for Chrome.

The program downloads the key data from EzBioCloud in an orderly way and extracts specific values, e.g. total valid reads, percentage of valid reads, species, and their percentages.

1. Input

First, the script asks the user for:

  • the path where the experiment folder will be created, and the experiment name,
  • the EzBioCloud login and password,
  • all sample IDs.

The FASTQ files for all samples must already be uploaded to EzBioCloud.
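The input step above could be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `ExperimentConfig`, `parse_sample_ids`, and `prompt_for_config` are hypothetical names, and accepting sample IDs as a comma- or space-separated string is an assumption.

```python
import re
from dataclasses import dataclass
from getpass import getpass
from pathlib import Path


@dataclass
class ExperimentConfig:
    # Hypothetical container for the values the user is asked for.
    experiment_dir: Path  # <base path>/<experiment name>
    login: str
    password: str
    sample_ids: list


def parse_sample_ids(raw):
    """Split a comma- or whitespace-separated string into unique IDs, keeping order."""
    ids = [token for token in re.split(r"[,\s]+", raw.strip()) if token]
    return list(dict.fromkeys(ids))


def prompt_for_config():
    """Ask for all inputs interactively; the password is not echoed to the terminal."""
    base = Path(input("Path where the experiment folder should be created: "))
    name = input("Experiment name: ").strip()
    login = input("EzBioCloud login: ").strip()
    password = getpass("EzBioCloud password: ")
    sample_ids = parse_sample_ids(input("Sample IDs (comma or space separated): "))
    return ExperimentConfig(base / name, login, password, sample_ids)
```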

2. Downloading and file management

The WebDriver opens EzBioCloud, logs in, and searches for the first given sample ID.

The .xlsx files and .png charts for genus and species are downloaded and moved into the given folder.

Because changing the download folder location in Chrome through the Selenium WebDriver is problematic, the files are first downloaded into the user's default Downloads folder and then renamed and moved into the given folder. You can change the path of the download folder here:

source_folder = r'C:\Users\Asus\Downloads'

Remember that the download folder MUST be empty.
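A rename-and-move step of this kind might look like the sketch below. It also shows why the download folder has to be empty: every file found there is treated as belonging to the current sample. The function name `move_downloads` and the `<sample ID>_<original name>` naming scheme are assumptions made for illustration, not the script's actual convention.

```python
import shutil
from pathlib import Path


def move_downloads(source_folder, target_folder, sample_id):
    """Move every finished download into target_folder, prefixed with sample_id."""
    source = Path(source_folder)
    target = Path(target_folder)
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(source.iterdir()):
        # Skip sub-folders and Chrome's unfinished downloads (*.crdownload).
        if not path.is_file() or path.suffix == ".crdownload":
            continue
        destination = target / f"{sample_id}_{path.name}"
        shutil.move(str(path), str(destination))
        moved.append(destination)
    return moved
```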

Sample file after this step:

image

3. Create INFO.txt file

The total valid reads and percentage of valid reads values are extracted, and an INFO.txt file is created.

image
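Writing the file itself is straightforward once the two values have been read from the page. The sketch below assumes a simple two-line layout; the real INFO.txt format may differ, and `write_info` is a hypothetical helper name.

```python
from pathlib import Path


def write_info(sample_dir, total_valid_reads, percentage_valid_reads):
    """Write the two read statistics to INFO.txt inside sample_dir (assumed layout)."""
    info_path = Path(sample_dir) / "INFO.txt"
    lines = [
        f"Total valid reads: {total_valid_reads}",
        f"Percentage valid reads: {percentage_valid_reads}%",
    ]
    info_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return info_path
```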

4. Create details.xlsx

The main goal is to create a single details.xlsx file for every sample, based on the downloaded files and the EzBioCloud app. The Excel sheet lists all microbiome genera sorted by percentage and adds a separate Details column with the species detected in the sample for each genus.

The threshold is set to 1%: only genera and species with a percentage above 1% are processed and shown in the final Excel sheet.

Before:

  • Genus file example:

image

...

  • Species file example:

image

...

After:

  • Output details.xlsx file example (final Excel sheet):

image
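The filtering and grouping described in this section can be sketched in plain Python as shown below. It works on rows already read from the genus and species files and leaves the Excel reading and writing aside. Two things here are assumptions: matching a species to a genus by the first word of its name, and the text format used in the Details column. `build_details` is a hypothetical name.

```python
THRESHOLD = 1.0  # percent; only entries strictly above this value are kept


def build_details(genus_rows, species_rows, threshold=THRESHOLD):
    """Return (genus, percentage, details) rows sorted by percentage, descending.

    genus_rows and species_rows are lists of (name, percentage) tuples.
    """
    kept_species = sorted(
        (row for row in species_rows if row[1] > threshold),
        key=lambda row: row[1],
        reverse=True,
    )
    result = []
    for genus, pct in sorted(genus_rows, key=lambda row: row[1], reverse=True):
        if pct <= threshold:
            continue
        # Assumption: a species belongs to a genus if its first word is the genus name.
        details = "; ".join(
            f"{name} ({p}%)"
            for name, p in kept_species
            if name.split(" ", 1)[0] == genus
        )
        result.append((genus, pct, details))
    return result
```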

5. Comparing contig similarity in a taxonomic group

During alignment, EzBioCloud sometimes assigns reads to a taxonomic group instead of a specific species. A taxonomic group is defined as a group of taxa (species/subspecies) that cannot be differentiated solely by 16S rRNA sequences. A typical example is the case of Escherichia coli and Shigella spp., which show almost identical 16S rRNA sequences. It is safer to identify such 16S rRNA sequences as a member of a species group that contains very similar 16S rRNA sequences than to risk wrongly assigning them to E. coli. For example:

image

In this situation, contig data is used to show the most likely species (a contig is a set of overlapping sequences that together represent a consensus region of DNA). The WebDriver performs the following steps:

  1. Find the taxonomic group in the EzBioCloud Taxonomic hierarchy:

image

  2. Take the first contig's top hit:

image

  3. Compare the similarity percentages of all 5 Hit Species Names:

image

In the example above, the first four species names are taken, written in an organized way together with the taxonomic group percentage, and added to the details.xlsx file:

image

Rules for extracting Hit Species Names:

  • Take all Hit Species Names with 100% Similarity.
  • If no Hit Species Name has 100% Similarity, take the Hit Species Names with Similarity above 99%.
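The two rules above translate into a short selection function. This is a sketch under the assumption that the hits have already been scraped into (name, similarity) pairs; `select_hit_species` is a hypothetical name.

```python
def select_hit_species(hits):
    """Pick species names from (hit_species_name, similarity_percent) pairs.

    Rule 1: return every hit with 100% similarity.
    Rule 2: if there is none, return the hits with similarity above 99%.
    """
    exact = [name for name, similarity in hits if similarity >= 100.0]
    if exact:
        return exact
    return [name for name, similarity in hits if similarity > 99.0]
```

If no hit exceeds 99%, the function returns an empty list; the README does not say how that case should be handled.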