Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: HeadTailBreaks raise RecursionError #45

Closed
martinfleis opened this issue Sep 10, 2019 · 1 comment · Fixed by #46
Closed

BUG: HeadTailBreaks raise RecursionError #45

martinfleis opened this issue Sep 10, 2019 · 1 comment · Fixed by #46

Comments

@martinfleis
Copy link
Member

HeadTailBreaks raises RecursionError if the maximum value within data is there twice (or more). Then it simply locks itself in the loop in values[values >= mean] - mean will be always the same and both values will always returned.

Steps to reproduce:

data = np.random.pareto(2, 1000)
data = np.append(data, data.max())

mc.HeadTailBreaks(data)

I assume that once there are only the same values within remaining values, head_tail_breaks should stop:

def head_tail_breaks(values, cuts):
    """
    head tail breaks helper function
    """
    values = np.array(values)
    mean = np.mean(values)
    cuts.append(mean)
    if len(values) > 1:
        if len(set(values)) > 1:  #this seems to fix the issue
            return head_tail_breaks(values[values >= mean], cuts)
    return cuts

However, I am not sure if it is the intended behaviour to stop and keep multiple values in the last bin as it does not reflect the definition of HeadTailBreaks algorithm (but I cannot see another solution). Happy to do a PR if this is how you want to fix that.

@weikang9009
Copy link
Member

Hi @martinfleis, thank you for reporting the bug and suggesting the solution.

I think having multiple values in the last bin is inevitable if these values are identical - it is not possible to allocate several identical values into different bins. The example you gave is a nice illustration - there are essentially two maximum values in the input data:

data = np.random.pareto(2, 1000)
data = np.append(data, data.max()) #now we have two maximum values in the input data

Your solution looks good to me! I think we can safely get rid of if len(values) > 1: and just replace it with if len(set(values)) > 1: . So it will be like:

def head_tail_breaks(values, cuts):
    """
    head tail breaks helper function
    """
    values = np.array(values)
    mean = np.mean(values)
    cuts.append(mean)
    if len(set(values)) > 1:
            return head_tail_breaks(values[values >= mean], cuts)
    return cuts

After making the changes, I rerun the example you provided:

np.random.seed(0)
data = np.random.pareto(2, 1000)
data = np.append(data, data.max())
mc.HeadTailBreaks(data)

The output would have two values in the last bin.
image

If I twist the appended maximum value slightly - add a small value (0.00001) to it to break the tie:

np.random.seed(0)
data = np.random.pareto(2, 1000)
data = np.append(data, data.max()+0.00001)
mc.HeadTailBreaks(data)

The only change to the output classification is an additional bin.
image

So if you'd like to open an PR, I would happy to review and merge it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants