Skip to content

Latest commit

 

History

History
439 lines (320 loc) · 18.1 KB

Documentation.md

File metadata and controls

439 lines (320 loc) · 18.1 KB

OptimalCluster.opticlust.Optimal

To find optimal number of clusters using different optimal clustering algorithms such as elbow, elbow-k_factor, silhouette, gap statistics, gap statistics with standard error, and gap statistics without log.

Parameters

----------

kmeans_kwargs : dict
    Arguements to be passed to the KMeans algorithm 
    which will be used to determine the optimal number of clusters.

----------

Methods

----------

elbow : Determines optimal number of clusters using elbow method. 

    Wikipedia - In cluster analysis, the elbow method is a heuristic 
    used in determining the number of clusters in a data set. The 
    method consists of plotting the explained variation as a function
    of the number of clusters, and picking the elbow of the curve as 
    the number of clusters to use. 
    https://en.wikipedia.org/wiki/Elbow_method_(clustering).
    
    The variations are plotted as a scree plot, and the optimal cluster
    value is taken as either the inflection point (where the angle of the 
    scree plot changes the most), or the point after which the scree plot 
    becomes linear. Variations can be measures using inertia or distortions.

elbow_kf : Determines optimal number of clusters by measuring linearity 
    along with a k factor analysis.
    
    Uses scree plot as descibed in elbow method. After standardizing to a 
    fixed scale, the linearity of the slopes between the points is 
    caluculated by measuring their relative means and standard deviations. 
    The last point where the mean becomes greater than the product of the 
    standard deviation and the standard error weight (which is called the 
    k criteria), is taken as the optimal number of clusters. k factor is 
    measured by finding the percentage of points before the optimal cluster 
    vaue that satisfy the k criteria. In general, overlapping clusters 
    result in a lower k factor value, but this can be remedied by 
    increasing the standard error weight up until the optimal cluster value 
    does not change.  
    
    Link with more details will be added soon
    
silhouette : Determines optimal number of clusters by finding the cluster 
    value that gives the maximum silhouette score (using sklearn's 
    silhouette_score).
    
    For more details on sklearn's silhouette_score visit 
    https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
    
    Wikipedia - The silhouette value is a measure of how similar an object 
    is to its own cluster (cohesion) compared to other clusters (separation).
    The silhouette ranges from −1 to +1, where a high value indicates that 
    the object is well matched to its own cluster and poorly matched to 
    neighboring clusters.
    https://en.wikipedia.org/wiki/Silhouette_(clustering)
    
gap_stat : Determine the optimal number of clusters using the gap statistic 
    as descibed by Tibshirani, Walther and Hastie.
    
    For more details about gap statistic visit
    http://www.web.stanford.edu/~hastie/Papers/gap.pdf
    
    This method determines the optimal cluster value by finding the 
    cluster that provides the highest gap statistic value.
    
gap_stat_se : Determine the optimal number of clusters using the gap statistic 
    along with standard error as descibed by Tibshirani, Walther and Hastie.
    
    For more details about gap statistic with standard error visit
    http://www.web.stanford.edu/~hastie/Papers/gap.pdf
    
    The gap_stat method is not always accurate in cases where clusters are not 
    well defined. This method applies a standard error approach where the first 
    point that satisfies the standard error condition is taken to be the optimal 
    cluster value. 
    
gap_stat_wolog : Determines the optimal cluster value in a similar fashion to 
    the gap statistic method, but without caluclating logarithms as descirbed by 
    Mohajer, Englmeier and Schmid.
    https://core.ac.uk/reader/12172514
    
----------

Example:
>>> from OptimalCluster.opticlust import Optimal
>>> opt = Optimal({'max_iter':200})

----------

Optimal.elbow

Determines optimal number of clusters using elbow method.

     Parameters

    ----------

    df : pandas DataFrame (ideally)
        DataFrame with n features upon which the optimal cluster value needs 
        to be determined.
        
    upper : int, default = 15
        Upper limit of cluster number to be checked (exclusive).
        
    display : boolean, default = False
        If True then a matplotlib plot is displayed. It contains the scree 
        plot with the inertia/distortion values (standardized to a fixed value) 
        on the Y-axis and the corresponding cluster number on the X-axis.
        
    visualize : boolean, default = False
        If True then the _visualize method is called.
    
    function : {'inertia','distortion'}, default = 'inertia'
        The function used to calculate Variation
        
        'inertia' : Variation used in KMeans is inertia
        
        'distortion' : Variation used in KMeans is distortion
        
        For more information visit
        https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
        
    method : {'angle','lin'}, default = 'angle'
        The method used to calculate the optimal cluster value.
        
        'angle' : The point which has the largest angle of inflection is taken as 
            the optimal cluster value. Works well for lesser number of actual clusters.
            
        'lin' : The first point where the sum of the squares of the difference in 
            slopes of every point after it is less than the sq_er parameter is 
            taken as the optimal cluster value. Works well for a larger range of 
            actual cluster values since the sq_er parameter is adjustable. 
            For more details visit *link*
            
    sq_er : int, default = 1
        Only used when method parameter is 'lin'. The first point where the sum 
        of the squares of the difference in slopes of every point after it is 
        less than the sq_er parameter is taken as the optimal cluster value. When 
        there is suspected to be overlapping/not well defined clusters, a lower 
        value of this parameter can help separate these clusters and give a better 
        value for optimal cluster.
        For more details visit *link*
        
    ----------
    
    Example:
    >>> from OptimalCluster.opticlust import Optimal
    >>> opt = Optimal()
    >>> from sklearn.datasets.samples_generator import make_blobs
    >>> x, y = make_blobs(1000, n_features=2, centers=3)
    >>> df = pd.DataFrame(x)
    >>> opt_value = opt.elbow(df)
    Optimal number of clusters is:  3  

    ----------

Optimal.elbow_kf

Determines optimal number of clusters by measuring linearity along with a k factor analysis.

    Parameters

    ----------

    df : pandas DataFrame (ideally)
        DataFrame with n features upon which the optimal cluster value needs 
        to be determined.
        
    upper : int, default = 15
        Upper limit of cluster number to be checked (exclusive).
        
    display : boolean, default = False
        If True then a matplotlib plot is displayed. It contains the scree 
        plot with the inertia/distortion values (standardized to a fixed value) 
        on the Y-axis and the corresponding cluster number on the X-axis.
        
    visualize : boolean, default = False
        If True then the _visualize method is called.
    
    function : {'inertia','distortion'}, default = 'inertia'
        The function used to calculate Variation
        
        'inertia' : Variation used in KMeans is inertia
        
        'distortion' : Variation used in KMeans is distortion
        
        For more information visit
        https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
           
    se_weight : int, default = 0
        The standard error weight parameter used to find the k criteria. In general,
        this parameter is to be increased until a satisfactory k factor is reached, 
        and the corresponding value for cluster number is the optimal cluster. When 
        there are overlapping/not well defined clusters it is better to increase the 
        value of this parameter.
        Note - the value of se_weight will automatically increase by increments of 0.5
        if the k criteria is not satisfied in that iteration.
        For more details visit *link*
        
    ----------
    
    Example:
    >>> from OptimalCluster.opticlust import Optimal
    >>> opt = Optimal()
    >>> from sklearn.datasets.samples_generator import make_blobs
    >>> x, y = make_blobs(1000, n_features=2, centers=5)
    >>> df = pd.DataFrame(x)
    >>> opt_value = opt.elbow_kf(df)
    Optimal number of clusters is:  5  with k_factor: 0.6 .

    ----------

Optimal.silhouette

Determines optimal number of clusters by finding the cluster value that gives the maximum silhouette score (using sklearn's silhouette_score).

    Parameters

    ----------

    df : pandas DataFrame (ideally)
        DataFrame with n features upon which the optimal cluster value needs 
        to be determined.
        
    lower : int, default = 2
        Lower limit of cluster number to be checked (inclusive).
        
    upper : int, default = 15
        Upper limit of cluster number to be checked (exclusive).
        
    display : boolean, default = False
        If True then a matplotlib plot is displayed. It contains the scree 
        plot with the inertia/distortion values (standardized to a fixed value) 
        on the Y-axis and the corresponding cluster number on the X-axis.
        
    visualize : boolean, default = False
        If True then the _visualize method is called.
        
    ----------
    
    Example:
    >>> from OptimalCluster.opticlust import Optimal
    >>> opt = Optimal()
    >>> from sklearn.datasets.samples_generator import make_blobs
    >>> x, y = make_blobs(1000, n_features=3, centers=6)
    >>> df = pd.DataFrame(x)
    >>> opt_value = opt.silhouette(df)
    Optimal number of clusters is:  6

    ----------

Optimal.gap_stat

Determine the optimal number of clusters using the gap statistic as descibed by Tibshirani, Walther and Hastie. Finds optimal cluster value by finding clutser number that produces maximum value of gap statistic.

    Parameters

    ----------

    df : pandas DataFrame (ideally)
        DataFrame with n features upon which the optimal cluster value needs 
        to be determined.
        
    B : int, default = 5
        The number of samples references. 
        For more details about B parameter visit
        http://www.web.stanford.edu/~hastie/Papers/gap.pdf
        
    lower : int, default = 1
        Lower limit of cluster number to be checked (inclusive).
        
    upper : int, default = 15
        Upper limit of cluster number to be checked (exclusive).
        
    display : boolean, default = False
        If True then a matplotlib plot is displayed. It contains the scree 
        plot with the inertia/distortion values (standardized to a fixed value) 
        on the Y-axis and the corresponding cluster number on the X-axis.
        
    visualize : boolean, default = False
        If True then the _visualize method is called.
        
    ----------
    
    Example:
    >>> from OptimalCluster.opticlust import Optimal
    >>> opt = Optimal()
    >>> from sklearn.datasets.samples_generator import make_blobs
    >>> x, y = make_blobs(1000, n_features=3, centers=6)
    >>> df = pd.DataFrame(x)
    >>> opt_value = opt.gap_stat(df)
    6

    ----------

Optimal.gap_stat_se

Determine the optimal number of clusters using the gap statistic along with standard error as descibed by Tibshirani, Walther and Hastie. This method works better than gap_stat when there are custers that are not well defined.

    Parameters

    ----------

    df : pandas DataFrame (ideally)
        DataFrame with n features upon which the optimal cluster value needs 
        to be determined.
        
    B : int, default = 5
        The number of samples references. 
        For more details about B parameter visit
        http://www.web.stanford.edu/~hastie/Papers/gap.pdf
        
    lower : int, default = 1
        Lower limit of cluster number to be checked (inclusive).
        
    upper : int, default = 15
        Upper limit of cluster number to be checked (exclusive).
        
    display : boolean, default = False
        If True then a matplotlib plot is displayed. It contains the scree 
        plot with the inertia/distortion values (standardized to a fixed value) 
        on the Y-axis and the corresponding cluster number on the X-axis.
        
    visualize : boolean, default = False
        If True then the _visualize method is called.
        
    se_weight : int, default = 1
        The standard error weight to be multiplied with the standard error. Although
        Tibshirani, Walther and Hastie descibe a '1 - standard error' approach, this
        weight can be adjusted. Generally, when then there are overapping/not-well 
        defined clusters, an increase in this parameter can give a more accurate 
        optimal cluster value.
        
    ----------
    
    Example:
    >>> from OptimalCluster.opticlust import Optimal
    >>> opt = Optimal()
    >>> from sklearn.datasets.samples_generator import make_blobs
    >>> x, y = make_blobs(1000, n_features=3, centers=6)
    >>> df = pd.DataFrame(x)
    >>> opt_value = opt.gap_stat_se(df)
    6

    ----------

Optimal.gap_stat_wolog

Determines the optimal cluster value in a similar fashion to the gap statistic method, but without caluclating logarithms as descirbed by Mohajer, Englmeier and Schmid. This method may work better when the gap_stat method tends to overestimate.

    Parameters

    ----------

    df : pandas DataFrame (ideally)
        DataFrame with n features upon which the optimal cluster value needs 
        to be determined.
        
    B : int, default = 5
        The number of samples references. 
        For more details about B parameter visit
        http://www.web.stanford.edu/~hastie/Papers/gap.pdf
        
    upper : int, default = 15
        Upper limit of cluster number to be checked (exclusive).
        
    display : boolean, default = False
        If True then a matplotlib plot is displayed. It contains the scree 
        plot with the inertia/distortion values (standardized to a fixed value) 
        on the Y-axis and the corresponding cluster number on the X-axis.
        
    visualize : boolean, default = False
        If True then the _visualize method is called.
        
    method : {'angle','lin','max'}, default = 'lin'
        The method used to calculate the optimal cluster value.
        
        'angle' : The point which has the largest angle of inflection is taken as 
            the optimal cluster value. Works well for lesser number of actual clusters.
            
        'lin' : The first point where the sum of the squares of the difference in 
            slopes of every point after it is less than the sq_er parameter is 
            taken as the optimal cluster value. Works well for a larger range of 
            actual cluster values since the sq_er parameter is adjustable. 
            For more details visit *link*
            
        'max' : max value (not recommended)
        
    se_weight : int, default = 1
        The standard error weight to be multiplied with the standard error. Although
        Tibshirani, Walther and Hastie descibe a '1 - standard error' approach, this
        weight can be adjusted. Generally, when then there are overapping/not-well 
        defined clusters, an increase in this parameter can give a more accurate 
        optimal cluster value.
        
    sq_er : int, default = 1
        Only used when method parameter is 'lin'. The first point where the sum 
        of the squares of the difference in slopes of every point after it is 
        less than the sq_er parameter is taken as the optimal cluster value. When 
        there is suspected to be overlapping/not well defined clusters, a lower 
        value of this parameter can help separate these clusters and give a better 
        value for optimal cluster.
        For more details visit *link*
        
    ----------
    
    Example:
    >>> from OptimalCluster.opticlust import Optimal
    >>> opt = Optimal()
    >>> from sklearn.datasets.samples_generator import make_blobs
    >>> x, y = make_blobs(1000, n_features=3, centers=6)
    >>> df = pd.DataFrame(x)
    >>> opt_value = opt.gap_stat_wolog(df)
    6

    ----------