## How to create conda environment for benchmarking

`conda create -n bench -c rapidsai -c conda-forge python=3.7 scikit-learn cuml pandas cudf tqdm`
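
Once the environment is created, activate it before launching any benchmarks (standard conda usage; `bench` is the environment name chosen in the command above):

```bash
# Activate the benchmarking environment created above.
conda activate bench
```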

## Algorithms parameters

You can launch benchmarks for each algorithm separately. The tables below list all supported parameters for each algorithm; an illustrative launch command is sketched after each table.

### General

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| num-threads | int | -1 | The number of threads to use |
| arch | str | ? | Machine architecture, for bookkeeping |
| batch | str | ? | Batch ID, for bookkeeping |
| prefix | str | sklearn | Prefix string, for bookkeeping |
| header | action | False | Output CSV header |
| verbose | action | False | Output extra debug messages |
| data-format | str | numpy | Data format: numpy, pandas, or cudf |
| data-order | str | C | Data order: C (row-major, default) or F (column-major) |
| dtype | np.dtype | np.float64 | Data type: float64 (default) or float32 |
| check-finiteness | action | False | Check finiteness during sklearn input checks (disabled by default) |
| output-format | str | csv | Output format: csv (default) or json |
| time-method | str | mean_min | Method used for time measurements |
| box-filter-measurements | int | 100 | Maximum number of measurements in box filter |
| inner-loops | int | 100 | Maximum inner loop iterations (we take the mean over inner iterations) |
| outer-loops | int | 100 | Maximum outer loop iterations (we take the min over outer iterations) |
| time-limit | float | 10 | Target time to spend on the benchmark |
| goal-outer-loops | int | 10 | The number of outer loops to aim for when automatically picking the number of inner loops. If zero, do not automatically decide the number of inner loops |
| seed | int | 12345 | Seed to pass as random_state |
| dataset-name | str | None | Dataset name |
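
A minimal sketch of how the general options are passed on the command line, assuming each benchmark is a standalone Python script invoked with these flags (the script names are assumptions, not confirmed by this document):

```bash
# BENCH is a placeholder for whichever benchmark script you want to run (the name is an assumption).
BENCH=dbscan.py
# General flags from the table above: data format, precision, output format, and RNG seed.
python $BENCH --data-format cudf --dtype float32 --output-format json --seed 12345
```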

### DBSCAN

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| epsilon | float | 10 | Radius of neighborhood of a point |
| min_samples | int | 5 | The minimum number of samples required in a neighborhood to consider a point a core point |
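
For example, a DBSCAN run could look like the sketch below, assuming the benchmark script is named `dbscan.py` (the script name is an assumption; the flags mirror the table above):

```bash
# dbscan.py is an assumed script name; --epsilon and --min_samples come from the DBSCAN table.
python dbscan.py --epsilon 10 --min_samples 5 --data-format cudf
```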

### RandomForestClassifier

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| criterion | str | gini | gini or entropy. The function to measure the quality of a split |
| split-algorithm | str | hist | hist or global_quantile. The algorithm to determine how nodes are split in the tree |
| num-trees | int | 100 | The number of trees in the forest |
| max-features | float_or_int | None | Upper bound on features used at each split |
| max-depth | int | None | Upper bound on depth of constructed trees |
| min-samples-split | float_or_int | 2 | Minimum samples number for node splitting |
| max-leaf-nodes | int | None | Maximum leaf nodes per tree |
| min-impurity-decrease | float | 0 | Needed impurity decrease for node splitting |
| no-bootstrap | store_false | True | Disable bootstrapping |
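
An illustrative classification-forest run, assuming a script named `df_clsf.py` (the name is an assumption); the flags are those documented above:

```bash
# df_clsf.py is an assumed script name; flags come from the table above.
python df_clsf.py --criterion gini --split-algorithm hist --num-trees 100 --max-depth 16
```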

### RandomForestRegressor

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| criterion | str | gini | gini or entropy. The function to measure the quality of a split |
| split-algorithm | str | hist | hist or global_quantile. The algorithm to determine how nodes are split in the tree |
| num-trees | int | 100 | The number of trees in the forest |
| max-features | float_or_int | None | Upper bound on features used at each split |
| max-depth | int | None | Upper bound on depth of constructed trees |
| min-samples-split | float_or_int | 2 | Minimum samples number for node splitting |
| max-leaf-nodes | int | None | Maximum leaf nodes per tree |
| min-impurity-decrease | float | 0 | Needed impurity decrease for node splitting |
| no-bootstrap | action | True | Disable bootstrapping |
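
The regression forest accepts the same flags; a sketch assuming a script named `df_regr.py` (assumed name):

```bash
# df_regr.py is an assumed script name; flags come from the table above.
python df_regr.py --num-trees 100 --max-features 0.5 --min-samples-split 2
```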

### KMeans

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| init | str | | Initial clusters |
| tol | float | 0 | Absolute threshold |
| maxiter | int | 100 | Maximum number of iterations |
| samples-per-batch | int | 32768 | The number of samples per batch |
| n-clusters | int | | The number of clusters |
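
A sketch of a KMeans launch, assuming a script named `kmeans.py` (assumed name); n-clusters has no listed default, so it is set explicitly:

```bash
# kmeans.py is an assumed script name; flags come from the KMeans table above.
python kmeans.py --n-clusters 10 --maxiter 100 --samples-per-batch 32768
```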

### KNeighborsClassifier

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| n-neighbors | int | 5 | The number of neighbors to use |
| weights | str | uniform | Weight function used in prediction |
| method | str | brute | Algorithm used to compute the nearest neighbors |
| metric | str | euclidean | Distance metric to use |
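
An illustrative nearest-neighbors run, assuming a script named `knn_clsf.py` (assumed name):

```bash
# knn_clsf.py is an assumed script name; flags mirror the table above.
python knn_clsf.py --n-neighbors 5 --weights uniform --method brute --metric euclidean
```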

### LinearRegression

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| no-fit-intercept | action | True | Don't fit intercept (assume data already centered) |
| solver | str | eig | eig or svd. Solver used for training |
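
A sketch of a linear regression run, assuming a script named `linear.py` (assumed name):

```bash
# linear.py is an assumed script name; --solver accepts eig or svd per the table above.
python linear.py --solver eig
```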

### LogisticRegression

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| no-fit-intercept | action | True | Don't fit intercept |
| solver | str | qn | qn or owl. Solver to use |
| maxiter | int | 100 | Maximum iterations for the iterative solver |
| C | float | 1.0 | Regularization parameter |
| tol | float | None | Tolerance for solver |
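
A sketch of a logistic regression run, assuming a script named `log_reg.py` (assumed name); the tolerance value is only an example, since the table lists no default:

```bash
# log_reg.py is an assumed script name; flags come from the table above.
python log_reg.py --solver qn --maxiter 100 --C 1.0 --tol 1e-3
```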

### PCA

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| svd-solver | str | full | auto, full or jacobi. SVD solver to use |
| n-components | int | None | The number of components to find |
| whiten | action | False | Perform whitening |
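
An illustrative PCA run, assuming a script named `pca.py` (assumed name); since whiten is an action flag, passing it enables whitening:

```bash
# pca.py is an assumed script name; --svd-solver accepts auto, full or jacobi.
python pca.py --svd-solver full --n-components 10 --whiten
```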

### Ridge Regression

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| no-fit-intercept | action | True | Don't fit intercept (assume data already centered) |
| solver | str | eig | eig, cd or svd. Solver used for training |
| alpha | float | 1.0 | Regularization strength |
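
A sketch of a ridge regression run, assuming a script named `ridge.py` (assumed name):

```bash
# ridge.py is an assumed script name; flags mirror the table above.
python ridge.py --solver eig --alpha 1.0
```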

### SVC

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| C | float | 0.01 | SVM slack parameter |
| kernel | str | linear | linear or rbf. SVM kernel function |
| gamma | float | None | Parameter for kernel="rbf" |
| max-cache-size | int | 64 | Maximum cache size for SVM |
| tol | float | 1e-16 | Tolerance passed to sklearn.svm.SVC |
| probability | action | True | Use probability for SVC |
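
An illustrative SVC run, assuming a script named `svm.py` (assumed name); gamma is only meaningful with the rbf kernel:

```bash
# svm.py is an assumed script name; the gamma value is an example, as the table lists no default.
python svm.py --C 0.01 --kernel rbf --gamma 0.1 --max-cache-size 64 --tol 1e-16
```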

### TSNE

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| n-components | int | 2 | Dimension of the embedded space |
| early-exaggeration | float | 12.0 | This factor increases the attractive forces between points and allows points to move around more freely, finding their nearest neighbors more easily |
| learning-rate | float | 200.0 | The learning rate for t-SNE is usually in the range [10.0, 1000.0] |
| angle | float | 0.5 | Angular size. This is the trade-off between speed and accuracy |
| min-grad-norm | float | 1e-7 | If the gradient norm is below this threshold, the optimization is stopped |
| random-state | int | 1234 | Determines the random number generator |
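
A sketch of a t-SNE run, assuming a script named `tsne.py` (assumed name):

```bash
# tsne.py is an assumed script name; flags mirror the table above.
python tsne.py --n-components 2 --learning-rate 200.0 --angle 0.5 --random-state 1234
```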

### train_test_split

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ----------- |
| train-size | float | 0.75 | Size of training subset |
| test-size | float | 0.25 | Size of testing subset |
| do-not-shuffle | action | False | Do not perform data shuffle before splitting |
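
An illustrative train_test_split run, assuming a script named `train_test_split.py` (assumed name); do-not-shuffle is an action flag, so passing it disables shuffling:

```bash
# train_test_split.py is an assumed script name; flags mirror the table above.
python train_test_split.py --train-size 0.75 --test-size 0.25 --do-not-shuffle
```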