BUG: HeadTailBreaks raise RecursionError #45

martinfleis · 2019-09-10T14:42:36Z

HeadTailBreaks raises RecursionError if the maximum value occurs in the data twice (or more). The helper then gets stuck in endless recursion: once only the tied maxima remain, the mean equals those values, so values[values >= mean] always returns the same array and never shrinks.

Steps to reproduce:

import numpy as np
import mapclassify as mc

data = np.random.pareto(2, 1000)
data = np.append(data, data.max())  # duplicate the maximum value

mc.HeadTailBreaks(data)
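The fixed point described above can be seen without mapclassify. This is only a minimal sketch with a small hypothetical array (not the Pareto sample from the report), repeating the "keep everything at or above the mean" step a few times:

```python
import numpy as np

# Hypothetical toy data with a duplicated maximum (8.0 appears twice).
values = np.array([1.0, 2.0, 8.0, 8.0])

sizes = []
for _ in range(5):
    mean = values.mean()
    # Same filtering step the helper applies on every recursive call.
    values = values[values >= mean]
    sizes.append(len(values))

# After the first pass only the two tied maxima remain; their mean equals
# the values themselves, so the filter keeps both and the size stays at 2.
print(sizes)
print(values)
```

Because the remaining array never shrinks below two elements, a check on len(values) > 1 alone can never end the recursion.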

I assume that once all the remaining values are identical, head_tail_breaks should stop:

def head_tail_breaks(values, cuts):
    """
    head tail breaks helper function
    """
    values = np.array(values)
    mean = np.mean(values)
    cuts.append(mean)
    if len(values) > 1:
        if len(set(values)) > 1:  #this seems to fix the issue
            return head_tail_breaks(values[values >= mean], cuts)
    return cuts

However, I am not sure whether stopping and keeping multiple values in the last bin is the intended behaviour, as it does not reflect the definition of the HeadTailBreaks algorithm (but I cannot see another solution). Happy to do a PR if this is how you want to fix that.


weikang9009 · 2019-09-10T17:40:54Z

Hi @martinfleis, thank you for reporting the bug and suggesting the solution.

I think having multiple values in the last bin is inevitable if these values are identical - it is not possible to allocate several identical values into different bins. The example you gave is a nice illustration - there are essentially two maximum values in the input data:

data = np.random.pareto(2, 1000)
data = np.append(data, data.max()) #now we have two maximum values in the input data

Your solution looks good to me! I think we can safely drop if len(values) > 1: and replace it with if len(set(values)) > 1:, since having more than one distinct value already implies having more than one value. So it will look like:

def head_tail_breaks(values, cuts):
    """
    head tail breaks helper function
    """
    values = np.array(values)
    mean = np.mean(values)
    cuts.append(mean)
    if len(set(values)) > 1:
        return head_tail_breaks(values[values >= mean], cuts)
    return cuts

After making the changes, I reran the example you provided:

np.random.seed(0)
data = np.random.pareto(2, 1000)
data = np.append(data, data.max())
mc.HeadTailBreaks(data)

The output would have two values in the last bin.

If I tweak the appended maximum slightly by adding a small value (0.00001) to break the tie:

np.random.seed(0)
data = np.random.pareto(2, 1000)
data = np.append(data, data.max()+0.00001)
mc.HeadTailBreaks(data)

The only change to the output classification is an additional bin.
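The same comparison can be checked on a tiny input. The sketch below is a standalone copy of the helper as proposed in this thread (not the mapclassify source itself), run on two hypothetical arrays that differ only in whether the maximum is tied:

```python
import numpy as np


def head_tail_breaks(values, cuts):
    """Standalone copy of the helper proposed above, for illustration only."""
    values = np.array(values)
    mean = np.mean(values)
    cuts.append(mean)
    # Recurse only while at least two distinct values remain.
    if len(set(values)) > 1:
        return head_tail_breaks(values[values >= mean], cuts)
    return cuts


# Hypothetical toy inputs: identical except for the last value.
tied = [1.0, 1.0, 2.0, 4.0, 16.0, 16.0]      # maximum appears twice
untied = [1.0, 1.0, 2.0, 4.0, 16.0, 16.5]    # tie broken

tied_cuts = head_tail_breaks(tied, [])
untied_cuts = head_tail_breaks(untied, [])

print(len(tied_cuts), tied_cuts)
print(len(untied_cuts), untied_cuts)
```

With the tie, the recursion stops as soon as only the two equal maxima are left, so both share the final cut. With the tie broken, one extra cut is produced, which matches the extra bin seen in the full example.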

So if you'd like to open a PR, I would be happy to review and merge it!

martinfleis mentioned this issue Sep 10, 2019

BUG: RecursiveError in HeadTailBreaks #46

Merged

weikang9009 closed this as completed in #46 Sep 10, 2019
