rajasankar/protein-clustering

protein-clustering

Dataset for Hierarchical clustering of NCBI protein data

This is an approximate hierarchical clustering created using machine learning. Because an all-against-all comparison is not practical at this scale, approximate methods were used to generate the dataset. The clustering is not based on edit distance or relative score; the score is calculated using streaming methods.
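To give a rough sense of why an exhaustive comparison is out of reach, the sketch below counts the unordered pairs among roughly 7 million sequences (the dataset size quoted in this README), which comes to about 2.45 × 10^13 pairs. The throughput figure is a hypothetical number chosen only for illustration, not a measurement, and the function name is illustrative as well.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Approximate dataset size quoted in this README.
n_proteins = 7_000_000
pairs = pairwise_comparisons(n_proteins)

# Hypothetical throughput, assumed purely for illustration.
comparisons_per_second = 1_000_000
days = pairs / comparisons_per_second / 86_400  # 86,400 seconds per day

print(f"{pairs:,} pairs")
print(f"about {days:,.0f} days at {comparisons_per_second:,} comparisons/s")
```

Because the pair count grows quadratically, doubling the number of sequences roughly quadruples the work, which is the usual motivation for approximate methods.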

A full explanation will be released later.

The hierarchical clustering of 7 million proteins from NCBI data (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/env_nr.gz) can be downloaded from https://www.kaggle.com/rajasankar/hierarchical-clustering-of-7-million-proteins
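For readers who want to inspect the source sequences, here is a minimal sketch of streaming a gzip-compressed FASTA file such as `env_nr.gz` one record at a time. It assumes the file has already been downloaded locally and follows the standard FASTA layout (a `>` header line followed by sequence lines). The function name `read_fasta` is illustrative, and the sketch says nothing about the layout of the Kaggle clustering files, which this README does not describe.

```python
import gzip
from typing import Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header = None
    chunks = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                # Sequence data may be wrapped over several lines.
                chunks.append(line)
    # Emit the final record once the file is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

Iterating with `for header, sequence in read_fasta("env_nr.gz"):` keeps only one record in memory at a time, which matters for a file with millions of entries.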

This dataset may be used only for non-commercial and research purposes. Contact the author for commercial use.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
