Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

partialsort! is broken on 1.6 #40338

Closed
rafaqz opened this issue Apr 4, 2021 · 13 comments
Closed

partialsort! is broken on 1.6 #40338

rafaqz opened this issue Apr 4, 2021 · 13 comments

Comments

@rafaqz
Copy link
Contributor

rafaqz commented Apr 4, 2021

Using Julia 1.6.0 partialsort! sorting with rev breaks above a certain vector length >= ~22 and value of k >= 9, although these may interact.

It's not always clearly reproducible with rand, so I'm spelling this out with the data that triggered it.

With k = 9 and above, the vector is sorted from the end back to 9. At k = 8, it is sorted from the start to 8.

v = [
    2.6137104f0,
    1.0f0,
    0.5f0,
    0.33333334f0,
    0.25f0,
    1.0f0,
    0.70710677f0,
    0.4472136f0,
    0.31622776f0,
    0.24253564f0,
    0.5f0,
    0.4472136f0,
    0.35355338f0,
    0.2773501f0,
    0.2236068f0,
    0.33333334f0,
    0.31622776f0,
    0.2773501f0,
    0.23570228f0,
    0.2f0,
    0.25f0,
    0.24253564f0,
    0.2236068f0,
    0.0f0,
    0.17677669f0,
]
s8 = deepcopy(v)
s9 = deepcopy(v)

partialsort!(s8, 8, rev=true)
partialsort!(s9, 9, rev=true)
julia> # k = 8 is fine? sorted backwards from the start to 8

julia> s8
25-element Vector{Float32}:
 2.6137104
 1.0
 1.0
 0.70710677
 0.5
 0.5
 0.4472136
 0.4472136
 0.35355338
 0.24253564
 0.31622776
 0.25
 0.33333334
 0.2773501
 0.2236068
 0.33333334
 0.31622776
 0.2773501
 0.23570228
 0.2
 0.25
 0.24253564
 0.2236068
 0.0
 0.17677669

julia> # k = 9 is broken, it's sorted backwards from the end to 9

julia> s9
25-element Vector{Float32}:
 0.5
 1.0
 0.5
 2.6137104
 0.4472136
 1.0
 0.70710677
 0.4472136
 0.35355338
 0.33333334
 0.33333334
 0.31622776
 0.31622776
 0.2773501
 0.2773501
 0.25
 0.25
 0.24253564
 0.24253564
 0.23570228
 0.2236068
 0.2236068
 0.2
 0.17677669
 0.0

julia> # Shortening the vector seems to be a full sort

julia> s9_21 = v[1:21];

julia> partialsort!(s9_21, 9, rev=true);

julia> s9_21
21-element Vector{Float32}:
 2.6137104
 1.0
 1.0
 0.70710677
 0.5
 0.5
 0.4472136
 0.4472136
 0.35355338
 0.33333334
 0.33333334
 0.31622776
 0.31622776
 0.2773501
 0.2773501
 0.25
 0.25
 0.24253564
 0.23570228
 0.2236068
 0.2
@KristofferC
Copy link
Member

KristofferC commented Apr 4, 2021

Naively, reading the docs:

partialsort!(v, k; by=<transform>, lt=<comparison>, rev=false)

Partially sort the vector v in place, according to the order specified by by, lt and rev so
that the value at index k (or range of adjacent values if k is a range) occurs at the position
where it would appear if the array were fully sorted via a non-stable algorithm.

And trying this out:

for k in 1:length(v)
     vc = copy(v)
     partialsort!(vc, k, rev=true)
     @assert vc[k] == sort(v; rev=true)[k]
 end

looks like it is ok. So, could you be a bit more specific about what the bug is? Maybe 1:k should have been passed to partialsort! and not just k?

@rafaqz
Copy link
Contributor Author

rafaqz commented Apr 4, 2021

Look at the difference between k = 8 and k = 9... they are sorting from the opposite end of the vector.

Both are in reverse, but:

  • k=9 is sorted from s9[end] = lowest to s9[9] = ninth_highest, leaving the highest values unsorted between 1 and 9.
  • k=8 sorts the first 8 values from s8[1] = highest to s8[8] = eighth_highest, leaving the rest unsorted.

Your assert just happens to be for k, which is correct for both - it the values either side of k that are wrong.

Look at the outputs in this sequence:

for k in 7:10
    vc = copy(v)
    partialsort!(vc, k, rev=true)
    @show k
    display(vc)
    println()
end

There is a break between 8 and 9 where the behaviour switches. The problem is the docs don't say which way it should be sorted! My assumption was from the start of the vector.

Probably it's being tested with k, like your assert, which is correct for both implementations (I'm assuming there are at least 2 optimisations here).

@rafaqz
Copy link
Contributor Author

rafaqz commented Apr 4, 2021

Heres the output, to save you the hassle

k = 7
25-element Vector{Float32}:
 2.6137104
 1.0
 1.0
 0.70710677
 0.5
 0.5
 0.4472136
 0.4472136
 0.35355338
 0.24253564
 0.31622776
 0.25
 0.33333334
 0.2773501
 0.2236068
 0.33333334
 0.31622776
 0.2773501
 0.23570228
 0.2
 0.25
 0.24253564
 0.2236068
 0.0
 0.17677669

k = 8
25-element Vector{Float32}:
 2.6137104
 1.0
 1.0
 0.70710677
 0.5
 0.5
 0.4472136
 0.4472136
 0.35355338
 0.24253564
 0.31622776
 0.25
 0.33333334
 0.2773501
 0.2236068
 0.33333334
 0.31622776
 0.2773501
 0.23570228
 0.2
 0.25
 0.24253564
 0.2236068
 0.0
 0.17677669

k = 9
25-element Vector{Float32}:
 0.5
 1.0
 0.5
 2.6137104
 0.4472136
 1.0
 0.70710677
 0.4472136
 0.35355338
 0.33333334
 0.33333334
 0.31622776
 0.31622776
 0.2773501
 0.2773501
 0.25
 0.25
 0.24253564
 0.24253564
 0.23570228
 0.2236068
 0.2236068
 0.2
 0.17677669
 0.0

k = 10
25-element Vector{Float32}:
 0.5
 1.0
 0.5
 2.6137104
 0.4472136
 1.0
 0.70710677
 0.4472136
 0.35355338
 0.33333334
 0.33333334
 0.31622776
 0.31622776
 0.2773501
 0.2773501
 0.25
 0.25
 0.24253564
 0.24253564
 0.23570228
 0.2236068
 0.2236068
 0.2
 0.17677669
 0.0

@KristofferC
Copy link
Member

Your assert just happens to be for k, which is correct for both - it the values either side of k that are wrong.

Again, looking at the docs (as I wrote in #40338 (comment)) the function only seems to make a claim about what will happen at k and nothing about the other values, or? So if you want to have the first k values sorted, you would pass in 1:k?

@rafaqz
Copy link
Contributor Author

rafaqz commented Apr 4, 2021

You want the e.g. highest 1:k not 1:k sorted.

That's how it worked previously - I know because 1.6 has broken a published algorithm.

@KristofferC
Copy link
Member

Just so I understand, could you point out to me what the result is that is contrary to the documentation.

@rafaqz
Copy link
Contributor Author

rafaqz commented Apr 4, 2021

The docs were always bad - this has changed the behaviour.

I assume what we want is: https://en.wikipedia.org/wiki/Partial_sorting

We don't have that anymore, but we used to

@rafaqz
Copy link
Contributor Author

rafaqz commented Apr 4, 2021

This algorithm now gives either the smallest sorted over k:end or the largest sorted over 1:k, depending on an arbitrary cutoff. Thats not what a partial sort is. It should be that 1:k are the always the largest k values sorted high to low, and the rest can be whatever.

@KristofferC
Copy link
Member

KristofferC commented Apr 4, 2021

It should be that 1:k are the always larges k values, and the rest can be whatever.

Yeah, if you pass in 1:k, no? Like the doc says:

julia> for k in 1:length(v)
            vc = copy(v)
            partialsort!(vc, 1:k, rev=true)
            @assert vc[1:k] == sort(v; rev=true)[1:k]
        end

@rafaqz
Copy link
Contributor Author

rafaqz commented Apr 4, 2021

Ehhhh ok. The docs specify what is returned, and don't really mention what happens to the vector.

I've never actually used the return value... instead reading the sorted section of the underlying array. I guess my misconception of that behavior was reinforced by it actually working how I thought it should for a few years... - it just sorted up to k if you passed k (not 1:k), and whatever the return value was wasn't that relevant.

We should probably make it clear that the algorithm is free to sort the array any way that it likes to get that return value.

Its also odd that it's sorting the long end of the array now, but maybe there's a reason

@rafaqz
Copy link
Contributor Author

rafaqz commented Apr 4, 2021

Maybe, to clarify:

Note that partialsort! does not fully sort the input array, and can sort it in either direction.

@rafaqz rafaqz closed this as completed Apr 4, 2021
@rafaqz
Copy link
Contributor Author

rafaqz commented Apr 4, 2021

And @KristofferC apologies for just not getting the range part!! Somehow it couldn't click that it was just luck that the algorithm had been passing tests for a few years, not actually design.

This was the end of a long afternoon just trying to find how this had broken...

@kmsquire
Copy link
Member

kmsquire commented Apr 5, 2021

Just as a point of reference: partialsort used to be called select and was based on the select algorithm. It was renamed in #23501 to partialsort to avoid confusion with similar terms (e.g., SELECT in SQL, which is a bit different). See also #22791.

With the original name, the meaning of k probably makes more sense (selection of the kth element in the sorted list). Using a range for k was initially implemented as an extension to the original algorithm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants