[Question] How do I fix this issue? #1716
Hi @jordannelson0, have you tried putting your code in a `if __name__ == "__main__":` guard?

I'm not on windows

Oh sorry, that initial error looks very much like it's a windows one, i.e. based on this: "By default, we use …"
Auto-SKL works fine using the datasets the API has integrated, but not with this dataset.

The dataset itself, while large, is extremely clean. Using standard scikit-learn/keras, for example, you can expect results close to 100% accuracy, a testament to the fidelity of the dataset. So despite its size, I don't consider the dataset an issue.

Using all defaults returns the same error(s).
My best advice is to see if you can subsample 100 rows or so and see if that still causes the issue... and if so, subsample down to 50, and so on. If you can construct artificial data that causes this issue then maybe I can help, but otherwise it seems like it's dataset-related. There's not much I can go off of based on what's provided. This part of the traceback:
is just due to the exception being re-raised while the original error is handled. Just to be clear, have you tried using the `if __name__ == "__main__":` guard?
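The subsampling advice above can be sketched as a small helper (the `run_fit` callback and the size ladder are hypothetical, not part of auto-sklearn):

```python
def smallest_failing_size(X, y, run_fit, sizes=(100, 50, 25, 10, 5)):
    """Shrink the subsample until training stops failing; return the
    smallest size that still fails, or None if every size passes."""
    failing = None
    for n in sizes:
        try:
            run_fit(X[:n], y[:n])  # your training call goes here
        except Exception:
            failing = n            # this size still reproduces the failure
        else:
            break                  # first size that trains cleanly
    return failing
```

If even 5 rows fail, the problem is almost certainly dtype- or shape-related rather than driven by dataset size.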
In regards to your last comment, I haven't tried. I'm sorry to admit I'm overloaded with other work atm (I'm doing a PhD). If you have time, and are kind enough to provide me with some code samples to c+p and test, I'd be more than willing.
I took your sample and added the small bit to take 100 samples. If you can provide the prints, that might help. Don't worry, I also work in a research lab and understand it can be busy. Let me know when you can try it.

```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

dataframe = read_csv("Spy.csv", skiprows=0)
dataset = dataframe.values

N_SAMPLES = 100
x = dataset[:N_SAMPLES, 0:9503]
y = dataset[:N_SAMPLES, 9503]
print(x, y)
print(x.dtype, y.dtype)

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# define search
model = AutoSklearnClassifier(
    ensemble_kwargs={"ensemble_size": 1},
    initial_configurations_via_metalearning=0,
    memory_limit=2000,
    time_left_for_this_task=10 * 60,
    per_run_time_limit=60,
    n_jobs=24,
)

# perform the search
model.fit(x_train, y_train)

# summarize
print(model.sprint_statistics())

# evaluate best model
y_hat = model.predict(x_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)
```
Thanks, I'll try this tomorrow and get back to you. I'm in GMT timezone. For reference Thursday 4th Jan GMT. |
```
[[ 6  0  0 ...  0  0  0]
 ...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
```
I ran this with 100, 50, 10 & 5 sample sizes. Same output each time.
I also ran this with an alternate dataset which has the same datatypes and properties: a dataset with a label in the final column. Both datasets are used for binary classification. Each dataset is from a cyber-security background relating to malware on the Android platform. Each column represents a different permission an app does/doesn't have access to, 1 representing access to that permission, 0 the opposite. The final label column has the value of 1 or 0, 1 representing a malicious application and 0 a non-malicious one. I hope this provides some insight into the datasets I'm using.
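Given that description, artificial data of the same shape is easy to build (random 0/1 values, purely illustrative; the shape matches the slicing used in the scripts above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 100, 9503                     # shape used in the thread
X = rng.integers(0, 2, size=(n_rows, n_cols))  # 0/1 permission flags
y = rng.integers(0, 2, size=n_rows)            # 0 = benign, 1 = malicious
```

If fitting on this synthetic matrix reproduces the crash, the problem is shape/dtype-related rather than anything specific to the real data, which would make a self-contained bug report possible.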
And this? With the main guard included?

```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier

if __name__ == "__main__":
    dataframe = read_csv("Spy.csv", skiprows=0)
    dataset = dataframe.values

    N_SAMPLES = 100
    x = dataset[:N_SAMPLES, 0:9503]
    y = dataset[:N_SAMPLES, 9503]
    print(x, y)
    print(x.dtype, y.dtype)

    # Split the dataset into training and testing sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    # define search
    model = AutoSklearnClassifier(
        ensemble_kwargs={"ensemble_size": 1},
        initial_configurations_via_metalearning=0,
        memory_limit=2000,
        time_left_for_this_task=10 * 60,
        per_run_time_limit=60,
        n_jobs=24,
    )

    # perform the search
    model.fit(x_train, y_train)

    # summarize
    print(model.sprint_statistics())

    # evaluate best model
    y_hat = model.predict(x_test)
    acc = accuracy_score(y_test, y_hat)
    print("Accuracy: %.3f" % acc)
```
```
[[ 6  0  0 ...  0  0  0]
 ...

Process finished with exit code 1
```
Okay, so that's a lot more helpful of an error. My guess is that since you have 9000+ features and they are all integers, auto-sklearn is trying to one-hot encode them. This effectively adds X new columns per column, where X is the number of unique integer values in that column. Multiply that by ~9000 and it's likely the dataset size explodes. Estimators like a histogram gradient boosting classifier do not really care about one-hot encoded variables, while something like an MLP will. The only thing I could suggest is to try disabling the data preprocessing (see `autosklearn/estimators.py`, lines 180 to 190 at commit 6732112).
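A rough sketch of why the column count explodes (toy shapes; pandas' `get_dummies` stands in here for whatever encoder auto-sklearn actually applies internally):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 5 integer feature columns, each drawing from up to 10 distinct values
df = pd.DataFrame(rng.integers(0, 10, size=(100, 5)))

# One-hot encoding adds one column per unique value per feature
encoded = pd.get_dummies(df, columns=list(df.columns))
print(df.shape[1], "->", encoded.shape[1])
```

Scale that multiplier up to ~9,500 columns and the encoded matrix can easily blow past a 2000 MB `memory_limit`.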
Maybe another alternative is to convert the data into float dtypes, as then auto-sklearn won't try to one-hot encode them, but I do not know your data and whether these values represent categoricals.
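The float-dtype suggestion can be as simple as a cast before calling `fit` (the toy array below is made up, echoing the integer matrix printed earlier):

```python
import numpy as np

X = np.array([[6, 0, 1], [0, 1, 0]])  # integer-coded features
X_float = X.astype(np.float64)        # float columns read as numerical,
                                      # per the suggestion above, so no
                                      # one-hot encoding is attempted
print(X_float.dtype)
```

The values are unchanged; only the dtype differs, which is what drives the categorical-vs-numerical decision being discussed.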
```python
from pandas import read_csv

if __name__ == "__main__":
    ...
```
Hmmm sorry, I wish that would have worked; you'll likely have to try this example then:
I tried; the memory issue persisted, unfortunately.
```python
from typing import Optional

import autosklearn.classification
from autosklearn.askl_typing import FEAT_TYPE_TYPE


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):
    ...


# Add NoPreprocessing component to auto-sklearn.
autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)

dataframe = read_csv("adware1.csv", skiprows=0)
N_SAMPLES = 100
...

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

clf = autosklearn.classification.AutoSklearnClassifier(...)

# To check that models were found without issue when running examples
assert len(clf.get_models_with_weights()) > 0

# summarize
print(clf.sprint_statistics())

# evaluate best model
y_hat = clf.predict(x_test)
```

I do have this example working with a different smaller dataset.
Here is my code:
Here is the warning I'm receiving:
Followed by:
I have no idea how to fix this. I have been looking for hours and trying different things, even changing datasets, and nothing's worked. Can anyone help, with code snippets preferably?
Expected behaviour
For it to run as normal
Environment and installation: