Dataset for hierarchical clustering of NCBI protein data
This is an approximate hierarchical clustering created using machine learning. Because an all-against-all comparison is not practical at this scale, approximate methods were used to generate this dataset. The clustering is not based on edit distance or a relative score; scores are calculated using streaming methods.
A full explanation will be released later.
The hierarchical clustering of 7 million proteins from NCBI data (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/env_nr.gz) can be downloaded from https://www.kaggle.com/rajasankar/hierarchical-clustering-of-7-million-proteins
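For readers who want to inspect the source sequences alongside the clustering, the following is a minimal Python sketch of streaming records from a gzip-compressed FASTA file such as env_nr.gz. The function names `iter_fasta` and `iter_fasta_gz` and the local file path are assumptions made for illustration. This is not the author's pipeline, and it does not describe the layout of the files on Kaggle.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def iter_fasta_gz(path: str) -> Iterator[Tuple[str, str]]:
    """Stream records from a gzip-compressed FASTA file one at a time."""
    # Hypothetical helper: `path` is wherever the file was saved locally.
    with gzip.open(path, "rt") as handle:
        yield from iter_fasta(handle)
```

Assuming the file was saved locally as `env_nr.gz`, `for header, seq in iter_fasta_gz("env_nr.gz"): ...` visits one protein at a time, so the full file never has to fit in memory.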
This dataset may be used only for non-commercial and research purposes. Contact the author for commercial use.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.