Imbalance dataset detection enhancement #276

Closed
sepandhaghighi opened this issue Jan 16, 2020 · 12 comments · Fixed by #395
Labels: discussion, enhancement, question

@sepandhaghighi
Owner

sepandhaghighi commented Jan 16, 2020

Description

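PyCM's imbalance detection is weak. In the following example, class 0 is three times larger than each of the other classes, yet the dataset is reported as balanced: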

>>> from pycm import ConfusionMatrix
>>> cm = ConfusionMatrix(matrix={0:{0:60,1:0,2:0,3:0},1:{0:0,1:20,2:0,3:0},2:{0:0,1:0,2:20,3:0},3:{0:0,1:0,2:0,3:20}})
>>> cm.print_matrix()
Predict  0        1        2        3        
Actual
0        60       0        0        0        

1        0        20       0        0        

2        0        0        20       0        

3        0        0        0        20       


>>> cm.imbalance
False
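
The result above is what a plain max/min ratio rule would give. The following is a minimal sketch of such a rule, not PyCM's confirmed implementation: the function name `ratio_imbalance` and the threshold of 3 are assumptions chosen only because they reproduce the output shown.

```python
# Hypothetical helper: the ratio rule and the threshold of 3 are assumptions,
# not taken from PyCM's source.
def ratio_imbalance(matrix, threshold=3):
    """Return True when the largest class outnumbers the smallest by more than `threshold`."""
    # The condition positive (P) of a class is the sum of its row (actual counts).
    populations = [sum(row.values()) for row in matrix.values()]
    smallest = min(populations)
    largest = max(populations)
    if smallest == 0:
        # In this sketch, an empty class next to a non-empty one counts as imbalanced.
        return largest > 0
    return largest / smallest > threshold


matrix = {
    0: {0: 60, 1: 0, 2: 0, 3: 0},
    1: {0: 0, 1: 20, 2: 0, 3: 0},
    2: {0: 0, 1: 0, 2: 20, 3: 0},
    3: {0: 0, 1: 0, 2: 0, 3: 20},
}
print(ratio_imbalance(matrix))  # False: 60 / 20 equals 3, which is not above the threshold
```

Under this assumption the dataset sits exactly on the boundary, so any strict "greater than" comparison lets it pass as balanced.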
@sepandhaghighi
Owner Author

sepandhaghighi commented Jan 16, 2020

There is no standard method for detecting imbalanced datasets in multi-class mode, but I suggest the following method:

[image]

[image]

Any ideas?

Note 1: P is Condition Positive and C is the set of all classes.
Note 2: The 20% comes from 40/60 in binary mode.

@sepandhaghighi
Owner Author

@alirezazolanvari @sadrasabouri
Please join this discussion

sepandhaghighi added the question and enhancement labels on Jan 16, 2020
@sepandhaghighi
Owner Author

@ssheikholeslami
I also invite you to this discussion ;-)

@sadrasabouri
Collaborator

> There is no standard method for detecting imbalanced datasets in multi-class mode, but I suggest the following method:
>
> [image]
>
> [image]
>
> Any ideas?
>
> Note 1: P is Condition Positive and C is the set of all classes.
> Note 2: The 20% comes from 40/60 in binary mode.

How about defining a variable for each class that represents its weight in the main dataset, and then calculating the variance of these weights to determine whether the dataset is imbalanced?
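
A minimal sketch of this idea, assuming that the "weight" of a class is its share of all samples. The helper names `class_weights` and `weight_variance` are hypothetical, and no decision threshold is implied; choosing one is still open.

```python
from statistics import pvariance


def class_weights(populations):
    """Map each class to its share of the whole dataset (shares sum to 1)."""
    total = sum(populations.values())
    return {label: count / total for label, count in populations.items()}


def weight_variance(populations):
    """Population variance of the class shares; 0 means perfectly equal classes."""
    return pvariance(list(class_weights(populations).values()))


balanced = {0: 25, 1: 25, 2: 25, 3: 25}
skewed = {0: 60, 1: 20, 2: 20, 3: 20}

print(weight_variance(balanced))  # 0.0
print(weight_variance(skewed) > weight_variance(balanced))  # True
```

The variance grows as the shares move away from the equal share 1/|C|, so it uses every class rather than only the largest and the smallest.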

@alirezazolanvari
Collaborator

> There is no standard method for detecting imbalanced datasets in multi-class mode, but I suggest the following method:
>
> [image]
>
> [image]
>
> Any ideas?
>
> Note 1: P is Condition Positive and C is the set of all classes.
> Note 2: The 20% comes from 40/60 in binary mode.

I think only the populations of the most and the least populated classes play a role in imbalance detection.
What is the added value of calculating the average or the variance of the populations of the other classes?

@sepandhaghighi
Owner Author

> I think only the populations of the most and the least populated classes play a role in imbalance detection.
> What is the added value of calculating the average or the variance of the populations of the other classes?

I think the method you mentioned here is not accurate enough in all cases:

  1. It's hard to select a general scaling threshold.
  2. This method bypasses the definition of an equal distribution.

@sepandhaghighi
Owner Author

> How about defining a variable for each class that represents its weight in the main dataset, and then calculating the variance of these weights to determine whether the dataset is imbalanced?

Weight is a good idea!
We should consider a way to control the threshold using this weight vector.

@sadrasabouri
Collaborator

> > How about defining a variable for each class that represents its weight in the main dataset, and then calculating the variance of these weights to determine whether the dataset is imbalanced?
>
> Weight is a good idea!
> We should consider a way to control the threshold using this weight vector.

We can calculate a weighted E and then divide it by the sum of the weights, so that E is normalized with respect to the weights.

As for @alirezazolanvari's idea, I couldn't find any counterexample (at least at first glance). Could you please write down an example that doesn't seem imbalanced but has a large difference between the min and the max, or vice versa?

@sepandhaghighi
Owner Author

> As for @alirezazolanvari's idea, I couldn't find any counterexample (at least at first glance). Could you please write down an example that doesn't seem imbalanced but has a large difference between the min and the max, or vice versa?

There is no standard method for imbalanced dataset detection, and @alirezazolanvari's idea is not wrong, but take a look at this example:

Class 1 : 3900
Class 2 : 2700
Class 3 : 2700
Class 4 : 2700

The current method recognizes this distribution as balanced (even with a ratio of 1.5)!
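
The arithmetic behind this example can be checked with a short sketch. It assumes the ratio in question is the largest class population divided by the smallest; the variable names are illustrative only.

```python
populations = {"Class 1": 3900, "Class 2": 2700, "Class 3": 2700, "Class 4": 2700}

# Largest over smallest class population.
ratio = max(populations.values()) / min(populations.values())

# Share of the whole dataset held by each class.
total = sum(populations.values())
shares = {label: count / total for label, count in populations.items()}

print(round(ratio, 2))  # 1.44
print(ratio > 1.5)      # False
print(shares["Class 1"], shares["Class 2"])  # 0.325 0.225
```

The max/min ratio stays below 1.5, even though Class 1 holds 32.5% of the samples against the 25% an equal distribution would give it.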

@alirezazolanvari
Collaborator

We can also define an is_imbalanced flag so that the user can indicate whether the dataset in question is imbalanced. As long as the user does not provide this information, the automatic detection algorithm will be used.
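
One way this override could look, as a rough sketch rather than a proposed API: the function name `resolve_imbalance` is hypothetical, and the automatic fallback here is a simple max/min ratio rule with an assumed threshold of 3, standing in for whichever detection algorithm is finally chosen.

```python
def resolve_imbalance(populations, is_imbalanced=None, threshold=3):
    """Use the user's flag when given; otherwise fall back to automatic detection."""
    if is_imbalanced is not None:
        # The user's explicit statement always wins.
        return bool(is_imbalanced)
    # Assumed fallback: max/min ratio rule (a placeholder for the real algorithm).
    smallest = min(populations.values())
    largest = max(populations.values())
    if smallest == 0:
        return largest > 0
    return largest / smallest > threshold


populations = {1: 3900, 2: 2700, 3: 2700, 4: 2700}
print(resolve_imbalance(populations))                      # False (automatic)
print(resolve_imbalance(populations, is_imbalanced=True))  # True (user override)
```

Using `None` as the default keeps "not specified" distinct from an explicit `False`, so a user can also force a dataset to be treated as balanced.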

@sadrasabouri
Collaborator

If it's OK I can work on this issue for version 3.3.

@sepandhaghighi
Owner Author

> If it's OK I can work on this issue for version 3.3.

🥇

3 participants