- Minor: Fixed a potential uncaught error in tests for CRAN "additional issues" via 75cb7d5760be08b0bb
- The commercial Gurobi solver is now available as a backend for the exact anticlustering methods
- The documentation refers to a new anticlustering paper, available from bioRxiv
- Minor fix in `three_phase_search_anticlustering()` for `objective = "dispersion"`
- `categories_to_binary()` no longer uses dummy coding with a reference category, but instead codes each level of a categorical variable with a separate variable (thanks to Gunnar Klau for spotting potential problems with dummy coding).
- `anticlustering()` has a new `method = "2PML"`, which is an improved heuristic when using must-link constraints
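The change to the coding in `categories_to_binary()` can be illustrated with a minimal base R sketch. This is not the package's implementation: `one_column_per_level()` is a hypothetical helper and the data are made up. It only contrasts one indicator column per level with dummy coding that drops a reference category.

```r
# Hypothetical helper (an assumption, not anticlust code):
# one 0/1 indicator column per level of the categorical variable
one_column_per_level <- function(f) {
  f <- as.factor(f)
  sapply(levels(f), function(lv) as.integer(f == lv))
}

colour <- c("red", "green", "blue", "green")

# Coding with a separate variable for each level: 3 levels -> 3 columns
full <- one_column_per_level(colour)

# Dummy coding with a reference category: 3 levels -> 2 columns,
# the reference level ("blue") is represented by a row of zeros
dummy <- model.matrix(~ colour)[, -1, drop = FALSE]

full
dummy
```

With the separate-variable coding, every element has exactly one `1` in its row, so no level is treated differently from the others.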
- `three_phase_search_anticlustering()` implements the three phase search algorithm by Yang et al., contributed by Hannah Hengelbrock (@HanneyAI)
- `anticlustering()` now has an argument `must_link`, which can be used to force elements into the same cluster
- It is now possible that a cluster (e.g., in `anticlustering()`) only has 1 member (this threw an error before)
- `diversity_objective()` is now computed correctly when a cluster only has one member (fixed via 8403fab1461b2cda8)
- Fixed a memory leak in `anticlustering(..., objective = "diversity")` thanks to @HanneyAI (via 24c244faf8b2c0774071)
- `bicriterion_anticlustering()` has new arguments: `dispersion_distances`, `average_diversity`, `init_partitions`, `return`.
- `anticlustering()` now has a new `objective = "average-diversity"`
- In `anticlustering()`, `method = "brusco"` now works for `objective = "variance"` and `objective = "kplus"`
- Added the `lpSolve` solver as backend for the optimal methods; it is now the default solver
- `optimal_anticlustering()` and `optimal_dispersion()` now have an additional argument `time_limit`
- `anticlustering()` now has an argument `cannot_link`, which can be used to forbid pairs of elements from being assigned to the same cluster. When used, this solves the same (NP-hard) graph coloring problem as `optimal_dispersion()`. Unlike the other optimal methods, it prioritizes the Symphony solver when it is found (and otherwise uses `lpSolve`).
- Bug fix in `optimal_dispersion()`: Output element `$edges` no longer includes edges that were investigated in the last iteration of the algorithm (and which are not relevant for finding the optimal dispersion)
- Speed improvements for `anticlustering(..., objective = "diversity")` when using `method = "local-maximum"` and `repetitions` (the restart algorithm is now entirely implemented in C and does not call `method = "exchange"` repeatedly from R)
- `anticlust` now depends on package `lpSolve`
- `optimal_anticlustering()` is a new exported function that gathers all optimal algorithms for anticlustering, both those currently implemented and those added in the future
- `balanced_clustering()` now has an argument `solver`, which can be used to specify the ILP solver when using `method = "ilp"`
- The default selection of ILP solvers in `anticlustering()`, `balanced_clustering()` and `optimal_dispersion()` was changed due to a recurring CRAN issue: If both the Rglpk and the Rsymphony packages are available, GLPK will now be prioritized. This is because the SYMPHONY solver sometimes crashes on Macs (or at least on one CRAN test station). The `optimal_anticlustering()`, `optimal_dispersion()`, and `balanced_clustering()` functions have an argument `solver` that can be used to circumvent this default behaviour.
- `anticlust` now uses `tinytest` instead of `testthat` for unit tests.
- Some minor updates to documentation and vignettes
- Updating all references to the k-plus anticlustering paper after its "actual" publication:
Papenberg, M. (2024). K-plus Anticlustering: An Improved k-means Criterion for Maximizing Between-Group Similarity. British Journal of Mathematical and Statistical Psychology, 77(1), 80--102. https://doi.org/10.1111/bmsp.12315
- `fast_anticlustering()` received another internal change to improve the speed of the re-computation of the objective during the optimization. In particular, updating the objective is now done by only inspecting the two clusters between which an exchange actually took place, instead of re-computing a sum across all clusters.
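The idea of updating the objective from only the two affected clusters can be sketched in base R. This is an illustration under assumptions, not the C code used by `fast_anticlustering()`: `cluster_ss()` is a hypothetical helper, the data are random, and the k-means objective is taken to be the sum of within-cluster squared distances to the cluster centroids.

```r
# Hypothetical helper: squared distances of cluster members to their centroid
cluster_ss <- function(x, members) {
  xc <- x[members, , drop = FALSE]
  sum(sweep(xc, 2, colMeans(xc))^2)
}

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)   # 20 elements, 2 features
clusters <- rep(1:4, each = 5)     # 4 clusters of 5 elements

# Objective stored as one contribution per cluster
per_cluster <- sapply(1:4, function(k) cluster_ss(x, clusters == k))

# Exchange element 1 (cluster 1) with element 20 (cluster 4)
swapped <- clusters
swapped[c(1, 20)] <- clusters[c(20, 1)]

# Local update: only the contributions of clusters 1 and 4 are recomputed
updated <- per_cluster
for (k in c(1, 4)) updated[k] <- cluster_ss(x, swapped == k)
objective_local <- sum(updated)

# Full recomputation across all clusters, for comparison only
objective_full <- sum(sapply(1:4, function(k) cluster_ss(x, swapped == k)))

all.equal(objective_local, objective_full)
```

Both computations give the same value, but the local update touches two clusters regardless of K, which matters when K is large.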
- (Regression) `anticlustering(..., objective = "variance")` uses the pre-0.8.0 implementation to fix some CRAN issues
- `fast_anticlustering()` now has an additional argument `exchange_partners`, which can be used to pass custom exchange partners instead of using the default nearest neighbour search.
- `generate_exchange_partners()` is a new exported function that can be used to address the new argument `exchange_partners` in `fast_anticlustering()`.
- `anticlustering()` received internal changes to ensure that it no longer crashes the computer for about N > 250000 elements.
- `fast_anticlustering()` has been re-implemented in C, which is much faster than the previous R implementation.
- `fast_anticlustering()` now uses an alternative computation of the k-means objective, which reduces run time by an order of magnitude as compared to before.
- Expanded documentation of `fast_anticlustering()`.
- The vignette "Speeding up anticlustering" has been rewritten to reflect that `fast_anticlustering()` is now again the best choice for processing (very) large data sets.
- An exact ILP method is now available for maximizing the dispersion, contributed by Max Diekhoff.
  - `optimal_dispersion()` is a new exported function implementing the method
  - `anticlustering()` makes it available when using `method = "ilp"` and `objective = "dispersion"`
- `kplus_moment_variables()` is a new exported function that generates k-plus variables from a data set
  - Offers some additional flexibility as compared to calling `kplus_anticlustering()`, which generates these variables internally (e.g., use k-plus augmentation on some variables but not all -- such as binary variables)
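As a rough illustration of what k-plus variables are, the following base R sketch appends centered powers of each column to a data set. It rests on assumptions: `moment_variables()` is a hypothetical helper, not the code behind `kplus_moment_variables()`, and details such as standardization may differ in the package.

```r
# Hypothetical helper: augment a data set with centered powers 2..T of each
# column, so that group means of the new columns reflect higher moments
moment_variables <- function(x, T = 2) {
  stopifnot(T >= 2)
  x <- as.matrix(x)
  centered <- sweep(x, 2, colMeans(x))
  powers <- lapply(2:T, function(t) centered^t)
  cbind(x, do.call(cbind, powers))
}

x <- cbind(a = c(1, 2, 3, 6), b = c(0, 0, 4, 4))

# Two original columns plus two columns each for the 2nd and 3rd moment
aug <- moment_variables(x, T = 3)
dim(aug)
```

Applying the augmentation to selected columns only (for example, leaving binary variables untouched) amounts to calling such a helper on a subset of the columns and binding the result back to the rest.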
- `categories_to_binary()` is a new exported function that converts one or several categorical variables into a binary representation
  - Can be used to include categorical variables as part of the optimization criterion in k-means / k-plus anticlustering, see new vignette "Using categorical variables with anticlustering"
- 3 new vignettes have been added to the `anticlust` documentation
- Fixed a bug in `kplus_anticlustering()` that did not correctly implement `preclustering = TRUE`
- It is now possible to use the SYMPHONY solver as backend for the optimal ILP methods.
- Implements some fixes in the internal function `gdc_set()` that finds the greatest common denominator in a set of numbers. The fixes prevent `categorical_sampling()` (which is also called by `anticlustering()` when using the `categories` argument) from potentially running into an infinite loop when combining uneven group sizes via `K` with a `categories` argument.
- `kplus_anticlustering()` now has an argument `T` instead of `moments`, where `T` denotes the number of distribution moments considered during k-plus anticlustering (`moments` was an integer vector specifying each individual moment that should be considered)
  - Explanation: Lower order moments should not be skipped in favour of higher order moments, so the new interface makes more sense.
Major changes
- This release adds a new exported function and removes two others (I very much doubt anyone used those, though -- see below -- if your code is affected, please email me).
- `kplus_anticlustering()` is a new exported function: A new interface function to k-plus anticlustering, implementing the k-plus method as described in "K-plus Anticlustering: An Improved K-means Criterion for Maximizing Between-Group Similarity" (Papenberg, 2023; https://doi.org/10.1111/bmsp.12315). Using `anticlustering(x, K, objective = "kplus")` is still supported and remains unchanged. The new function `kplus_anticlustering()`, however, offers more functionality and nuance with regard to optimizing the k-plus objective family.
- The function `kplus_objective()` was removed.
- The function `mean_sd_obj()` was removed.
Explanations for the rather drastic changes, i.e., removing instead of deprecating functions (that very likely do not affect anyone):
- Given the advanced theoretical background for k-plus anticlustering, the function `kplus_objective()` no longer makes any sense. Given that the k-plus objective is a family of objectives, keeping a function that computes only one special case is more harmful than removing it now. As the k-plus objective basically re-uses the k-means criterion, maintaining a function such as `kplus_objective()` was questionable to begin with.
- Since there is the k-plus anticlustering method now, I did not want to keep the "hacky" way to optimize similarity with regard to means and standard deviations, i.e., using the `mean_sd_obj()` function as `objective` in anticlustering. Please use the k-plus method to optimize similarity with regard to means and standard deviations (you can even extend to skewness, kurtosis, and other higher order moments; see the new `kplus_anticlustering()` function).
Minor changes
- Finally added Marie Luisa Schaper as contributor for contributing her data set
- Some work on documentation
- Some work on docs and examples
- Minor bug fix in C code base via c1a5604f
- `anticlust` now includes the bicriterion algorithm for simultaneously maximizing diversity and dispersion, proposed by Brusco et al. (doi:10.1111/bmsp.12186) and implemented by Martin Breuer (for details see his bachelor thesis)
  - It can be called from the main function `anticlustering()` by setting `method = "brusco"`; in this case, only one of dispersion or diversity is maximized
  - `bicriterion_anticlustering()` -- newly exported in this version -- can be used for more fine-grained usage of the Brusco et al. algorithm, fully using its main functionality to optimize both dispersion and diversity
- Just an update to the documentation: Updating all references to the Papenberg & Klau paper after its "actual" publication in Psychological Methods:
Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301
- Minor bug fix in `plot_clusters()` (via 87f585798)
- `plot_clusters()` now uses the default color palette to highlight the different clusters
- `plot_clusters()` now uses different `pch` symbols when the number of clusters is low (K < 8)
- `anticlustering()` and `categorical_sampling()` now better balance categorical variables when the output groups require different sizes (i.e., if the group sizes do not have any common denominator)
- Some additional input validations for more useful error messages when arguments in `anticlustering()` are not correctly specified
- `anticlustering()` has a new argument `standardize` to standardize the data input before the optimization starts. This is useful to give all variables the same weight in the anticlustering process, regardless of the scaling of the variables. It is especially useful for `objective = "kplus"` to ensure that minimizing differences in means and minimizing differences in variances are equally important.
- Fixes a memory leak in the C code base, via 2c4fe6d
- Internal change: `anticlustering()` with `objective = "dispersion"` now implements the local updating procedure proposed by Martin Breuer. This leads to a considerable speedup when maximizing the dispersion, enabling the fast processing of large data sets.
- `anticlustering()` now has native support for maximizing the dispersion objective by setting `objective = "dispersion"`. The dispersion is the minimum distance between any two elements within the same cluster, see `?dispersion_objective`.
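The definition of the dispersion can be made concrete with a small base R sketch. `dispersion()` below is a hypothetical helper written for illustration and assumes Euclidean distances; it is not the package's `dispersion_objective()`.

```r
# Hypothetical helper: minimum distance between any two elements
# that share a cluster
dispersion <- function(x, clusters) {
  d <- as.matrix(dist(x))
  same <- outer(clusters, clusters, "==")
  diag(same) <- FALSE   # ignore the distance of each element to itself
  min(d[same])
}

x <- matrix(c(0, 1, 5, 7), ncol = 1)

# Close elements grouped together: within-cluster pairs are (0, 1) and (5, 7)
dispersion(x, c(1, 1, 2, 2))

# Close elements separated: within-cluster pairs are (0, 5) and (1, 7)
dispersion(x, c(1, 2, 1, 2))
```

The second partition has the larger dispersion (5 instead of 1), which is the direction in which anticlustering optimizes this criterion.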
- The exchange optimization algorithm for anticlustering has been reimplemented in C, leading to a substantial boost in performance when using one of the supported objectives "diversity", "variance", "dispersion", or "kplus". (Optimizing user-defined objective functions still has to be done in plain R and therefore has not been sped up.)
- `kplus_objective()` is a new function to compute the value of the k-plus criterion given a clustering. See `?kplus_objective` for details.
- In `anticlustering()` and `categorical_sampling()`, the argument `K` can now be used to specify the size of the groups, not just the number of groups. This way, it is easy to request groups of different size. See the help pages `?anticlustering` and `?categorical_sampling` for examples.
- Fixed two minor bugs that prevented the correct transformation of class `dist` to class `matrix` when using the repeated exchange (or "local-maximum") method, see c42e136 and e6fdae5.
- In `anticlustering()`, there is a new option for the argument `method`: "local-maximum". When using `method = "local-maximum"`, the exchange method is repeated until a local maximum is reached. That means after the exchange process has been conducted for each data point, the algorithm restarts with the first element and proceeds to conduct exchanges until the objective cannot be improved. This procedure is more in line with classical neighbourhood search that only terminates when a local optimum is reached.
- In `anticlustering()`, there is now a new argument `repetitions`. It can be used to specify the number of times the exchange procedure (either `method = "exchange"` or `method = "local-maximum"`) is called. `anticlustering()` returns the best partitioning found across all repetitions.
- `anticlustering()` now implements a new objective function, extending the classical k-means criterion, given by `objective = "kplus"`. Using `objective = "kplus"` will minimize differences with regard to both means and standard deviations of the input variables, whereas k-means only focuses on the means. Details on this objective will follow.
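The restart logic behind `method = "local-maximum"` can be sketched in plain R. This is a simplified illustration under assumptions, not the package's implementation: `diversity()`, `exchange_pass()` and `local_maximum()` are hypothetical helpers, the objective is the sum of within-cluster pairwise distances, and every element of another cluster is tried as an exchange partner.

```r
# Diversity: sum of pairwise distances between elements of the same cluster
diversity <- function(d, clusters) {
  same <- outer(clusters, clusters, "==")
  sum(d[same & upper.tri(d)])
}

# One pass of the exchange method: for each element, try all swaps with
# elements of other clusters and keep the best swap if it improves the objective
exchange_pass <- function(d, clusters) {
  for (i in seq_along(clusters)) {
    best <- diversity(d, clusters)
    best_j <- NA
    for (j in which(clusters != clusters[i])) {
      candidate <- clusters
      candidate[c(i, j)] <- clusters[c(j, i)]
      value <- diversity(d, candidate)
      if (value > best) {
        best <- value
        best_j <- j
      }
    }
    if (!is.na(best_j)) clusters[c(i, best_j)] <- clusters[c(best_j, i)]
  }
  clusters
}

# Repeat full passes until a pass no longer improves the objective
local_maximum <- function(d, clusters) {
  repeat {
    before <- diversity(d, clusters)
    clusters <- exchange_pass(d, clusters)
    if (diversity(d, clusters) <= before) return(clusters)
  }
}

set.seed(123)
x <- matrix(rnorm(24), ncol = 2)   # 12 elements, 2 features
d <- as.matrix(dist(x))
init <- rep(1:3, times = 4)        # 3 groups of 4 elements
result <- local_maximum(d, init)
```

Because swaps only exchange two elements, group sizes stay fixed. The loop ends once a complete pass finds no improving swap, which is the local maximum described above.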
- Fixes a bug in `anticlustering()` that led to an incorrect computation of cluster centers with option `objective = "variance"` for unequal cluster sizes, see 2ef6547
- A new exported function: `categorical_sampling()`. Categorical sampling can be used to obtain a stratified split of a data set. Using this function is like calling `anticlustering()` with argument `categories`, but no clustering objective is maximized. The categories are just evenly split between samples, which is very fast (in contrast to the exchange optimization that may take some time for large data sets). Apart from the categorical restriction that balances the frequency of categories between samples, the split is random.
- The function `distance_objective()` was renamed into `diversity_objective()` because there are several clustering objectives based on pairwise distances, e.g. see the new function `dispersion_objective()`.
- `dispersion_objective()` is a new function to compute the dispersion of a given clustering, i.e., the minimum distance between two elements within the same group. Maximizing the dispersion is an anticlustering task, see the help page of `dispersion_objective()` for an example.
- Several changes to the documentation, in particular now highlighting the publication of the paper "Using Anticlustering to Partition Data Sets Into Equivalent Parts" (https://doi.org/10.1037/met0000301) describing the algorithms and criteria used in the package `anticlust`
- In `anticlustering()`, anticluster editing is now by default requested using `objective = "diversity"` (but `objective = "distance"` is still supported and leads to the same behaviour). This change was done because there are several anticlustering objectives based on pairwise distances.
- `anticlustering()` can no longer use an argument `K` of length > 1 with `preclustering = TRUE` because this resulted in undocumented behaviour (this is a good change because it does not make sense to specify an initial assignment of elements to groups via `K` and at the same time request that preclustering handles the initial assignment)
- When using a custom objective function, the order of the required arguments is now reversed: The data comes first, the clustering second.
- Because the order of arguments in custom objective functions was reversed, the function `mean_sd_obj()` now has reversed arguments as well.
- The package vignettes are no longer distributed with the package itself because rendering R Markdown resulted in an error with the development version of R. This may change again in the future when R Markdown no longer throws an error with R devel. The vignette is currently available via the package website (https://m-py.github.io/anticlust/).
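The new argument order for custom objective functions (data first, clustering second) can be illustrated with a small base R sketch. `spread_of_means()` is a hypothetical objective made up for this example. It is evaluated directly here rather than passed to `anticlustering()`, and it assumes that larger values are better.

```r
# Hypothetical custom objective: data first, clustering second.
# Returns the negative spread of group means, summed over all features,
# so that larger values indicate more similar groups.
spread_of_means <- function(x, clusters) {
  x <- as.matrix(x)
  group_means <- rowsum(x, clusters) / as.vector(table(clusters))
  -sum(apply(group_means, 2, function(m) diff(range(m))))
}

x <- matrix(1:6, ncol = 1)

# Group means 2 and 5: spread of 3
unbalanced <- spread_of_means(x, c(1, 1, 1, 2, 2, 2))

# Group means 10/3 and 11/3: spread of 1/3
balanced <- spread_of_means(x, c(1, 2, 2, 1, 1, 2))
```

A function with this signature is the kind of object that would be supplied as the `objective` argument.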
- Improved running speed of generating constraints in integer linear programming variant of (anti)clustering, via 0a870240f8
- In `anticlustering()`, preclustering and categorical constraints can now be used at the same time. In this case, exchange partners are clustered within the same category, using a call to `matching()` passing `categories` to argument `match_within`.
- In `anticlustering()`, it is now possible to use `preclustering = TRUE` for unbalanced data size (e.g., if N = 9 and K = 2).
- In `matching()`, it is now possible to prevent sorting the output by similarity using a new argument `sort_output`. Its default is `TRUE`; setting it to `FALSE` prevents sorting. This avoids some extra computation that is necessary to determine similarity for each cluster.
- Some changes to documentation
- There is now a package website at https://m-py.github.io/anticlust/
- Additional error handling
- Improvements to implementation of k-means anticlustering (i.e., in `anticlustering()` with `objective == "variance"` or in `fast_anticlustering()`)
  - On each exchange iteration, only recomputes distances from clusters whose elements have been swapped (improves run time, which is relevant for larger K).
- Previously, there were only as many exchange partners per element as members in the least frequent category (if argument `categories` was passed). This was not documented behavior and is undesirable. Now, all members from a category may serve as exchange partners, even if the categories have different size.
- `matching()` is a new function for unrestricted or K-partite matching to find groups of similar elements.
- `plot_similarity()` is a new function to plot similarity by cluster (according to the cluster editing criterion)
- All clustering and anticlustering functions now only take one data argument (called `x`) instead of either `features` or `distances`.
- The argument `iv` was removed from `anticlustering()` because it does not fit the anticlustering semantic (anticlustering should make sets «similar» and not dissimilar).
- The random sampling method for anticlustering was removed. This implies that the `anticlustering()` function no longer has an argument `nrep`.
- The functions `initialize_K()` and `generate_exchange_partners()` were removed.
- Dropped support for the commercial integer linear programming solvers CPLEX and gurobi for exact (anti)cluster editing. If this functionality is needed, install version 0.3.0 from GitHub: `remotes::install_github("m-Py/anticlust", ref = "v0.3.0")`
- `mean_sd_obj()` no longer computes the discrepancy of medians, only in means and standard deviations (as the name also suggests).
- In `plot_clusters()`, the arguments `col` and `pch` were removed.
- In `plot_clusters()`, the argument `clustering` was renamed to `clusters`.
- In `generate_partitions()`, the order of the arguments `N` and `K` was switched (the order is now consistent with `n_partitions()`).
- In `balanced_clustering()`, the default `method` was renamed to `"centroid"` from `"heuristic"`.
- Release of the package version used in the manuscript »Using anticlustering to partition a stimulus pool into equivalent parts« (Papenberg & Klau, 2019; https://doi.org/10.31234/osf.io/3razc)