protclust is a Python library for protein sequence analysis that integrates MMseqs2 for fast clustering and provides tools for creating robust machine learning datasets. It offers cluster-aware data splitting to prevent sequence similarity bias in model evaluation, along with comprehensive protein embedding capabilities for feature generation.
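To illustrate what cluster-aware splitting means in general, here is a minimal standard-library sketch. It is not protclust's API: the function name `cluster_aware_split`, its parameters, and the example cluster labels are all hypothetical, and the labels are assumed to have been computed already (for instance by a clustering tool such as MMseqs2). The idea is that whole clusters are assigned to either the training or the test set, so similar sequences never sit on both sides of the split.

```python
import random
from collections import defaultdict


def cluster_aware_split(cluster_ids, test_fraction=0.2, seed=0):
    """Split record indices so that no cluster appears in both sets.

    Hypothetical helper for illustration only; not part of protclust.
    cluster_ids[i] is the (precomputed) cluster label of record i.
    """
    # Group record indices by their cluster label.
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    # Shuffle the clusters (not the records) reproducibly.
    clusters = list(members)
    random.Random(seed).shuffle(clusters)

    # Assign whole clusters to the test set until it reaches the target
    # size; every remaining cluster goes to the training set.
    target = test_fraction * len(cluster_ids)
    train_idx, test_idx = [], []
    for cid in clusters:
        if len(test_idx) < target:
            test_idx.extend(members[cid])
        else:
            train_idx.extend(members[cid])
    return sorted(train_idx), sorted(test_idx)


# Assumed example labels: ten sequences grouped into six clusters.
cluster_ids = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c5", "c5", "c6"]
train, test = cluster_aware_split(cluster_ids, test_fraction=0.3, seed=42)

# No cluster label is shared between the two sets.
shared = {cluster_ids[i] for i in train} & {cluster_ids[i] for i in test}
print("shared clusters:", shared)
```

Because clusters are moved as whole units, the test set can end up somewhat larger than the requested fraction; a production implementation would likely balance sizes more carefully.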
Updated Mar 21, 2025 - Python