protein-clustering

Dataset for hierarchical clustering of NCBI protein data

This is an approximate hierarchical clustering created using machine learning. Because an all-against-all comparison is not practical at this scale, approximate methods were used to generate the dataset. The clustering is not based on edit distance or a relative score; scores are calculated using streaming methods.
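To give a sense of why an exhaustive comparison is impractical, the short sketch below counts the unordered pairs among the sequences. It assumes exactly 7,000,000 sequences (the rounded "7 million" figure quoted in this README); the function name `pair_count` is illustrative only and is not part of the dataset or of the author's method.

```python
from math import comb


def pair_count(n: int) -> int:
    """Return the number of unordered pairs among n items, n * (n - 1) / 2."""
    return comb(n, 2)


# Assumption: the rounded "7 million" figure from this README.
n_proteins = 7_000_000
pairs = pair_count(n_proteins)
print(f"{pairs:,} pairwise comparisons")
```

This comes to roughly 2.4 x 10^13 comparisons, which is the motivation for using approximate methods instead.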

A full explanation will be released later.

A hierarchical clustering of 7 million proteins from NCBI data (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/env_nr.gz) can be downloaded from https://www.kaggle.com/rajasankar/hierarchical-clustering-of-7-million-proteins
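The NCBI source file is a gzip-compressed FASTA file. As a minimal sketch of how it could be read one record at a time without loading it all into memory, the helper below uses only the Python standard library. The name `read_fasta` is hypothetical, and this is not the author's processing pipeline, which has not been published.

```python
import gzip
from typing import Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzipped FASTA file, one at a time."""
    header = None
    chunks = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif line and header is not None:
                # Sequence data may be wrapped across several lines.
                chunks.append(line)
    # Emit the final record once the file is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

Iterating with `for header, seq in read_fasta("env_nr.gz"):` (assuming the file has been downloaded to the working directory) would then process each protein in turn.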

This dataset may be used only for non-commercial and research purposes. Contact the author for commercial use.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Files in this repository:

License.md
README.md
cluster27k.tar.gz
flare_27k.json
graph.html
