HeadTailBreaks raises RecursionError if the maximum value occurs in the data twice (or more). The recursion then locks itself in a loop: once only the duplicated maximum remains, mean equals that value, so values[values >= mean] returns the same values on every call.
Steps to reproduce:
import numpy as np
import mapclassify as mc

data = np.random.pareto(2, 1000)
data = np.append(data, data.max())  # duplicate the maximum value
mc.HeadTailBreaks(data)  # raises RecursionError
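The stuck step can be looked at in isolation. This is a minimal sketch using only NumPy, with a made-up two-element array standing in for the head that remains once only the duplicated maximum is left; it is not the library's code.

```python
import numpy as np

# Toy input (an assumption for illustration): only the duplicated maximum remains.
values = np.array([10.0, 10.0])
mean = np.mean(values)          # the mean equals every element
head = values[values >= mean]   # so the filter removes nothing

# The next recursive call would receive exactly the same array again.
print(np.array_equal(head, values))
```

Because the head never shrinks, each call repeats the previous one until Python's recursion limit is reached.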
I assume that once all remaining values are identical, head_tail_breaks should stop:
def head_tail_breaks(values, cuts):
    """
    head tail breaks helper function
    """
    values = np.array(values)
    mean = np.mean(values)
    cuts.append(mean)
    if len(values) > 1:
        if len(set(values)) > 1:  # this seems to fix the issue
            return head_tail_breaks(values[values >= mean], cuts)
    return cuts
However, I am not sure whether stopping and keeping multiple values in the last bin is the intended behaviour, as it does not reflect the definition of the HeadTailBreaks algorithm (but I cannot see another solution). Happy to do a PR if this is how you want to fix it.
Hi @martinfleis, thank you for reporting the bug and suggesting the solution.
I think having multiple values in the last bin is inevitable if these values are identical - it is not possible to allocate several identical values into different bins. The example you gave is a nice illustration - there are essentially two maximum values in the input data:
data = np.random.pareto(2, 1000)
data = np.append(data, data.max())  # now we have two maximum values in the input data
Your solution looks good to me! I think we can safely get rid of if len(values) > 1: and just replace it with if len(set(values)) > 1:, since an array with more than one distinct value always has more than one element. So it will look like this:
def head_tail_breaks(values, cuts):
    """
    head tail breaks helper function
    """
    values = np.array(values)
    mean = np.mean(values)
    cuts.append(mean)
    if len(set(values)) > 1:
        return head_tail_breaks(values[values >= mean], cuts)
    return cuts
After making the changes, I reran the example you provided:
np.random.seed(0)
data = np.random.pareto(2, 1000)
data = np.append(data, data.max())
mc.HeadTailBreaks(data)
The output would have two values in the last bin.
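The same behaviour can be checked without mapclassify. The sketch below is a standalone copy of the helper as proposed in this thread, not the code shipped in the library. It assumes that the last bin holds every value at or above the second-to-last cut, and that the Pareto sample contains no ties other than the appended maximum.

```python
import numpy as np


def head_tail_breaks(values, cuts):
    """Standalone copy of the helper proposed in this thread."""
    values = np.array(values)
    mean = np.mean(values)
    cuts.append(mean)
    # Recurse only while at least two distinct values remain.
    if len(set(values)) > 1:
        return head_tail_breaks(values[values >= mean], cuts)
    return cuts


np.random.seed(0)
data = np.random.pareto(2, 1000)
data = np.append(data, data.max())  # duplicate the maximum value

cuts = head_tail_breaks(data, [])

# The recursion ends once only the duplicated maximum is left,
# so the final cut is the maximum itself.
print(cuts[-1] == data.max())

# Values at or above the second-to-last cut form the last bin.
print(int(np.sum(data >= cuts[-2])))
```

The recursion terminates here because the check on distinct values fails as soon as the head contains only copies of the maximum.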
If I tweak the appended maximum value slightly, adding a small value (0.00001) to it to break the tie: