Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect statistics for sprand #6714

Closed
stevengj opened this issue May 1, 2014 · 30 comments
Closed

incorrect statistics for sprand #6714

stevengj opened this issue May 1, 2014 · 30 comments
Labels
domain:arrays:sparse Sparse arrays kind:bug Indicates an unexpected problem or unintended behavior

Comments

@stevengj
Copy link
Member

stevengj commented May 1, 2014

When looking into #6708, I noticed that the sprand function seems to generate sparse matrices with slightly the wrong density. For example:

julia> mean([nfilled(sprand(100,10,0.1)) for i=1:100])
99.81

julia> mean([nfilled(sprand(100,10,0.1)) for i=1:1000])
99.833

julia> mean([nfilled(sprand(100,10,0.1)) for i=1:10000])
99.7546

julia> mean([nfilled(sprand(100,10,0.1)) for i=1:100000])
99.72917

julia> mean([nfilled(sprand(100,10,0.1)) for i=1:1000000])
99.722943

does not seem to be converging to 100.

@stevengj
Copy link
Member Author

stevengj commented May 1, 2014

Basically, the problem seems to lie in the algorithm for computing a random subset of n elements from 1:N. This is an important general problem (also called reservoir sampling), and it would be nice to have a function in Base that solves this problem by itself (which could be called from sprand).

One possible algorithm is Algorithm S from Knuth volume II, section 3.4.2, which I think is the same as Vitter's algorithm. However, this method requires O(N) time if I remember correctly, so if n << N you want to switch to a different algorithm.

@timholy
Copy link
Sponsor Member

timholy commented May 1, 2014

I spent some time looking into sampling algorithms, and this was the best I could come up with. (I made it up myself, because I was unhappy with some aspect of everything else I found.) Something that's O(N) seems like a nonstarter---I think we have to take seriously the case where density is 0.0001 and the matrix is 10^5 by 10^5.

Did you try increasing the number of iterations on the Newton's method inversion of the birthday problem? Maybe the error is there, or is just due to something dumb like an off-by-one bug (well, here, an off-by-0.3 bug 😄).

But I have to confess that I don't personally see this as a particularly big problem; we're already so much closer to the target density than, say, Matlab, that I wonder this actually matters or is largely an intellectual exercise? But if you think it's important, by all means make improvements.

@stevengj
Copy link
Member Author

stevengj commented May 1, 2014

I think that if I were doing statistical experiments with sparse random matrices, I would be unhappy to be limited to 2 significant digits. If we have a general randsubset function, then it will be even more important to have the right statistics.

I certainly agree that we can't use O(N) algorithms for very sparse subsets. A randsubset function would want to switch between algorithms depending on the parameters.

@timholy
Copy link
Sponsor Member

timholy commented May 1, 2014

Steps that might fix this:

  • Seeing what happens when you increase the number of iterations on Newton's method. I just truncated at 2 iterations by fiat, and a 0.3% error does not seem unreasonable given that no attempt was made to hit the exact target.
  • Rather than rounding the answer to the nearest integer, you'll also have to dither the result by the fractional portion of the answer, generating ifloor(k)+1 random numbers instead of ifloor(k) random numbers 13% of the time if the fractional part is 0.13.

@stevengj
Copy link
Member Author

stevengj commented May 2, 2014

More Newton iterations doesn't help, but dithering seems to help, i.e.:

ik = ifloor(k)
if rand() < k - ik
    ik += 1
end

However, it looks like you are using only an approximate solution to the birthday problem, so there may be some intrinsic error. Is that right?

@timholy
Copy link
Sponsor Member

timholy commented May 2, 2014

It's been so long I really can't remember, but certainly I don't recall (intending) to use an approximation. That doesn't mean I didn't screw it up somehow 😄.

@stevengj
Copy link
Member Author

stevengj commented May 2, 2014

It looks like you are using something like approximation 10 or 11 from this article?. The actual birthday-problem probability is a much more complicated recurrence relation that doesn't look anything like what your Newton iteration could be solving.

@timholy
Copy link
Sponsor Member

timholy commented May 2, 2014

Well, what you want is the expected number of unique values, not the probability of duplications. Let's see if I've done this correctly. If for any specific realization of sampling you let theta_i be 1 if i is chosen at least once, and 0 if it is never chosen, then the number of unique choices is sum(theta_i). Therefore, the expected value K of unique choices is just

N*Pr(theta_1=1),

where N is the number of choices available. Since Pr(theta_1=1) = 1-(1-1/N)^k, where k is the number of samples drawn, it seems the recurrence should indeed be fairly simple. I double-checked to see if I'd done the Newton's method calculation correctly, and to me it looks right.

@timholy
Copy link
Sponsor Member

timholy commented May 2, 2014

@stevengj, it occurs to me that if you asked because you're still seeing some tiny error in the mean number of nonzeros, it could be because we didn't get the dithering quite right. I'll bet the problem arises from the function's curvature between ik and ik+1. Once you've computed ik, you could use the forward expression N*(1-(1-1/N)^z) to calculate the mean number of unique samples for both z=ik and z=ik+1. Then find the linear combination of these two that hits the target mean exactly, and use that as the fractional value in your rand() comparison rather than the fractional value of k.

@stevengj
Copy link
Member Author

stevengj commented May 2, 2014

@timholy, looks like the corrected dithering does the trick!

@timholy
Copy link
Sponsor Member

timholy commented May 2, 2014

Great to hear!

@stevengj
Copy link
Member Author

stevengj commented May 2, 2014

Do you know of any good way to compute 1-(1-x)^m that is accurate when x is small and m is large? In my suggested patch for #6708 I use the binomial expansion when m*x < 1, but I'm wondering if there is a standard name for this kind of function in statistics, ala expm1.

@timholy
Copy link
Sponsor Member

timholy commented May 2, 2014

Good catch, that's indeed something to worry about. How about -expm1(m*log(1-x))?

@timholy
Copy link
Sponsor Member

timholy commented May 2, 2014

Or rather -expm1(m*log1p(-x)).

@stevengj
Copy link
Member Author

stevengj commented May 2, 2014

Right, that should do it. Does this function have a name?

@timholy
Copy link
Sponsor Member

timholy commented May 2, 2014

If it does, I don't know it. But it does seem common enough to be in some math library.

@stevengj
Copy link
Member Author

stevengj commented May 2, 2014

Since the random-subset problem shows up so often and is subtle to implement well, it would be good to put this in an exported subroutine of some sort. (If we adopt my sprand algorithm, it pretty much has to be in a subroutine anyway since I need to construct multiple sorted subsets, and we might as well export it.)

However, there are several variants, so it's not clear what the interface should look like. Given a container A and a desired number m of elements in the subset, one could ask:

  • Does one need exactly m elements, or only m elements on average?
  • Is the output ordered in the same order as A? (A Vector?)
  • Is the length of A known, or is A an iterable where only a single pass is desired (the reservoir problem)?

We could have an interface like:

randsubset(A::AbstractArray, m::Integer; ordered=false, exact=true) = ....
randsubset(itr, m::Integer) = ....reservoir algorithm (always ordered, always exact)....

(For the case of exact=true, we could use e.g. Floyd's algorithm, which is unordered by default if we use a hash table. Or we could use our PriorityQueue and just make it always ordered.)

@timholy
Copy link
Sponsor Member

timholy commented May 2, 2014

The notion of how to "bill" this is some of what held me back from splitting this off from sprand in the first place. But I like your description in the comment in #6708; it makes it clear that it's not designed to give a precise number of outputs.

For hitting the target exactly, Floyd's algorithm looks good (it's basically what Stefan was using, but using a Set should speed it up). It's still vulnerable to the problem that as the density rises it will slow down a lot, because you'll have to reject many of your random numbers. But I suppose we could use the same flip-on-the-halfway point trick so that we only use Floyd for densities less than 0.5.

Regarding order, I used sort in this problem because I knew sparse was going to need to have its inputs sorted anyway. In the more general case, that may not be necessary, and one might boost the performance further using a Set (though we have to figure out how to make it suitable for use in sprand---perhaps the de-duplication can be separated from the generation as separate non-exported functions).

I haven't thought about the reservoir problem at all, so I won't even venture an opinion.

@andreasnoack
Copy link
Member

@stevengj Is reservoir sampling different from usual random sampling without replacement? If not, @lindahua has probably done most of the work already.

using StatsBase
julia> sample(1:10,5,replace=false,ordered=true)
5-element Array{Int64,1}:
  2
  4
  6
  7
 10

see also #6003

@stevengj
Copy link
Member Author

stevengj commented May 2, 2014

@andreasnoackjensen, the reservoir problem is when you don't know the length of A in advance, and you can only do a single pass over the array.

But you're right, I should look at the algorithms in StatsBase. And yes, we are talking about random sampling without replacement.

However, at first glance it looks like Floyd's algorithm is better than self_avoid_sample! in StatsBase, and is simpler.

@stevengj
Copy link
Member Author

stevengj commented May 2, 2014

@timholy, Floyd's algorithm doesn't reject any random numbers. It is just:

# Set S to a random length-m subset of the integers 1:n
function randsubset!(S::Union(Set{Int},IntSet), n::Integer, m::Integer)
    0 <= m <= n || throw(DomainError())
    # Floyd's Õ(m) algorithm from J. Bentley and B. Floyd, "Programming pearls: A sample of brilliance,"
    # Commun. ACM Magazine 30, no. 9, 754--757 (1987).
    empty!(S)
    for j = n-m+1:n
        i = rand(1:j)
        push!(S, i in S ? j : i)
    end
    return S
end

(I also suspect that it is not worth optimizing the density > 0.5 case. Of course, you want it to still be correct, but it should be fine to use an O(n) algorithm in that case, like Knuth's Algorithm S.)

@timholy
Copy link
Sponsor Member

timholy commented May 2, 2014

Gotcha, thanks.

@stevengj
Copy link
Member Author

stevengj commented May 2, 2014

It looks like you don't actually need Newton's method. The number of expected trials k to generate m distinct integers in 1:n is simply k = log1p(-m/n) / log1p(-1/n).

@timholy
Copy link
Sponsor Member

timholy commented May 2, 2014

Nice catch, indeed.

@stevengj
Copy link
Member Author

stevengj commented May 2, 2014

Hmm, if m > n/20 (roughly) it looks like Knuth's Algorithm S is substantially faster than either your algorithm or Floyd's algorithm, and produces ordered output. And according to exercise 8 of Knuth section 3.4.2, it may be possible to substantially speed up algorithm S for the case of m << n, simply by calculating how many elements to skip rather than testing the inputs one at a time. Something like this may end up being the way to go, and apparently it's described in detail by Faster methods for random sampling, Commun. ACM 27, 7 (July 1984). 703-718. Worth looking into, at any rate.

@timholy
Copy link
Sponsor Member

timholy commented May 2, 2014

Sounds good, let's go with whatever works best.

@stevengj
Copy link
Member Author

stevengj commented May 3, 2014

The more I think about it, the more I think we really just want two functions:

randsubset!(S, A) = ... set S to a length(S) random sequential subset of A ...
randsubset!(S, A, p) = ... set S to a random sequential subset of A, including A[i] with independent probability p...

The first can use Vitter's algorithm. The latter is what I need for my column-by-column sprand algorithm to have reasonable statistics, but it is not the statistical distribution produced by your algorithm. The problem I have with your algorithm is, while it produces the desired mean, the distribution around that mean does not seem to be either easily described or likely to be useful in applications.

The good news is that the second function looks quite straightforward to implement efficiently. Basically, you just need to skip sequentially through the array A, adding elements to S one at a time, with the skip interval following a discrete exponential probability distribution, which is easy to generate by the inverse-transform method. And this should hopefully be more efficient than your algorithm, especially since it does not require any auxiliary arrays.

@stevengj
Copy link
Member Author

stevengj commented May 3, 2014

The following randsubset function illustrates what I mean. It seems to be a bit faster than your version, is quite short and simple, does not allocate any temporary arrays, and includes each element of A independently with probability p:

# fill S (resized as needed) with a random sequential subset of A,
# where each element of A is included in S with independent probability p.
function randsubset!(S::AbstractArray, A::AbstractArray, p::Real)
    0 <= p <= 1 || throw(ArgumentError("probability $p not in [0,1]"))
    n = length(A)
    p == 1 && return copy!(resize!(S, n), A)
    empty!(S)
    p == 0 && return S
    sizehint(S, int(p * length(A)))
    # Skip through A, in order, from each element i to the next element i+s
    # included in S. The probability that the next included
    # element is the k-th (k > 0) included element is (1-p)^(k-1) * p, and hence
    # the probability (CDF) that the next element is in {1,...,k} is 1-(1-p)^k = F(k).
    # Hence, we can draw the skip s from this probability distribution via the
    # discrete inverse-transform method: s = iceil(F^{-1}(u)) where u = rand(),
    # which is simply s = iceil(log(rand()) / log1p(-p)).
    i = 0
    L = 1 / log1p(-p)
    while true
        s = log(rand()) * L # note that rand() < 1, so s > 0
        s >= n - i && return S # compare before iceil to avoid overflow
        push!(S, A[i += iceil(s)])
    end
    return S
end
randsubset{T}(A::AbstractArray{T}, p::Real) = randsubset!(T[], A, p)

(I've done some basic statistical tests, but an extra pair of eyeballs would be appreciated.)

@timholy
Copy link
Sponsor Member

timholy commented May 4, 2014

I like it! Is it faster than your speed-improved version, or just the one in base? I imagine the main disadvantage of this approach is the number of times log gets called.

Is the distribution of number produced exactly Poisson? I would guess that this must be approximately true (to reasonably high precision) of my algorithm as well, but I agree that yours seems to have a distribution that would be more easily analyzed.

A couple of minor comments:

nexpected = p*length(A)
sizehint(S, round(nexpected + 5*sqrt(nexpected)))

would greatly reduce the frequency with which one will have to do a memcpy for the last few elements, and increases the size negligibly for large nexpected.

Also, since rand() uses dsfmt_gv_genrand_close_open, the algorithm can return a value of 0, and that would cause havoc. You'd want to check for that and generate a new number if you get 0 (or directly call dsfmt_gv_genrand_open_open, if it's not problematic to mix calls to the two functions).

@stevengj
Copy link
Member Author

stevengj commented May 4, 2014

@timholy, my version in #6726 is faster than all of the versions above.

The distribution is exactly a binomial distribution, which asymptotes to a Poisson distribution.

I thought about allocating a few extra standard deviations as you suggest. I don't think it will actually make much difference in practice because sizehint typically allocates about 1.5× the requested size, on average (since jl_array_grow_end works by successive doubling), but for very large arrays this is no longer the case. In any case, as you point out, the extra few stddevs is a negligible overallocation and is probably worth doing.

My version in #6726 correctly handles the case where rand() returns zero (and log(rand()) is -Inf), so there should be no problem there.

Probably further discussion should happen in #6726.

stevengj added a commit that referenced this issue May 6, 2014
faster sprand, new randsubseq (fixes #6714 and #6708)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:arrays:sparse Sparse arrays kind:bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants