Make covariance and correlation work for any iterators #30
Thanks. That implementation is reasonable, but I wonder whether using the existing methods based on Could you run a few benchmarks since you've already written the code? For example, testing on generators and
Why would it break?
It won't "break", but it won't give you the variance-covariance matrix obtained by treating each column of the matrix as a separate vector. It will fall back on
One thing I wasn't expecting was that this fails:
Whereas
Here are the results of my benchmark. I added a method so that the above covariance between vectors of vectors works. This makes benchmarking easier but will also have to be added since iterators and vectors should behave similarly. Highlights:
Unfortunately given these difficulties this PR is unlikely to be merged before the 1.5 feature freeze.
src/Statistics.jl
Outdated
# core functions

unscaled_covzm(x::AbstractVector{<:Number}) = sum(abs2, x)
unscaled_covzm(x::AbstractVector) = sum(t -> t*t', x)
unscaled_covzm(x::AbstractVector{<:Number}) = sum(_abs2, x)
This method isn't needed if you dispatch in _abs2.
unscaled_covzm(x::AbstractVector{<:Number}) = sum(_abs2, x)
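To illustrate the point about dispatching inside _abs2, here is a minimal sketch. The two _abs2 methods mirror the ones quoted later in this diff; unscaled_covzm_sketch is a hypothetical name used only for this example, not a function in Statistics.

```julia
# Stand-ins mirroring the helpers quoted in the diff.
_abs2(x::Number) = abs2(x)   # scalar element: |x|^2
_abs2(x) = x * x'            # vector element: outer product x * x'

# Hypothetical single method: the per-element dispatch happens inside _abs2,
# so no separate method for AbstractVector{<:Number} is required.
unscaled_covzm_sketch(x::AbstractVector) = sum(_abs2, x)

scalar_result = unscaled_covzm_sketch([1.0, 2.0, 3.0])
matrix_result = unscaled_covzm_sketch([[1.0, 0.0], [0.0, 2.0]])
```

With this shape, a vector of numbers yields a scalar and a vector of vectors yields a matrix, both through the same method.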
src/Statistics.jl
Outdated
# Base.IteratorEltype(x)) / 0
return NaN
end
f = let xmean = xmean, ymean = ymean
I'd just have f take xmean and ymean, and pass that explicitly in the two call places.
I would have to write (x, y) -> _conj(x - xmean, y - ymean). Wouldn't that cause the closure bug?
src/Statistics.jl
Outdated
@@ -504,6 +525,28 @@ function covzm(x::AbstractMatrix, vardim::Int=1; corrected::Bool=true)
A .= A .* b
return A
end
function covzm(x::Any, y::Any; corrected::Bool=true)
Maybe this could just call covm(x, 0, y, 0)? Check whether the compiler is able to optimize for 0 statically.
Have you tried this?
Just did.
julia> x = randn(10_000); y = randn(10_000);
julia> x = x .- Statistics.mean(x); y = y .- Statistics.mean(y);
julia> gx = (xi for xi in x); gy = (yi for yi in y);
julia> @btime Statistics.covzm($gx, $gy)
11.820 μs (0 allocations: 0 bytes)
0.009480946018135657
julia> @btime Statistics.covm($gx, 0, $gy, 0)
14.768 μs (0 allocations: 0 bytes)
0.009480946018135657
Let me know what you think of that difference. It doesn't seem very important to me. The covzm version takes 80% as long as covm with 0s.
src/Statistics.jl
Outdated
# TODO: Understand how to improve this error.
#return Base.mapreduce_empty_iter(t -> _conj(t[2])*t[1]', Base.add_sum, itr,
# Base.IteratorEltype(x)) / 0
return NaN
I don't think mapreduce_empty_iter is really useful outside standard reductions. Here you just need to find the best way to compute NaN so that it works for all types. Maybe this?
# TODO: Understand how to improve this error.
#return Base.mapreduce_empty_iter(t -> _conj(t[2])*t[1]', Base.add_sum, itr,
# Base.IteratorEltype(x)) / 0
return NaN
v = conj(zero(eltype(y)))*zero(eltype(x))'
return (v + v) / 0
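As a rough check of the suggestion above, the sketch below wraps the two suggested lines in a hypothetical helper (empty_cov_sketch is an illustrative name only) and evaluates it for a few element types. It assumes the inputs have a concrete eltype; a generator reports eltype Any, for which zero is not defined, so that case would need separate handling.

```julia
# Hypothetical helper; the body follows the two suggested lines.
function empty_cov_sketch(x, y)
    # Build a zero of the type that a product of elements would have.
    v = conj(zero(eltype(y))) * zero(eltype(x))'
    # Dividing zero by zero gives NaN in the matching floating-point type.
    return (v + v) / 0
end

r64 = empty_cov_sketch(Float64[], Float64[])
r32 = empty_cov_sketch(Float32[], Float32[])
rc  = empty_cov_sketch(ComplexF64[], ComplexF64[])
```

The result type follows the element types of the inputs rather than being fixed to Float64, which is what a bare `return NaN` would give.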
src/Statistics.jl
Outdated
count += 1
z_itr = iterate(z, state)
end
return total ./ (count - Int(corrected))
return total ./ (count - Int(corrected))
return total / (count - Int(corrected))
src/Statistics.jl
Outdated
return NaN
end
count = 1
value, state = z_itr
Something like this would be more readable (if you adapt code elsewhere):
value, state = z_itr
(xi, yi), state = z_itr
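For context, a single-pass loop using the destructured form could look roughly like the sketch below. This is not the PR's code: covzm_iter_sketch is a hypothetical name, it handles only scalar elements, and the empty case is simplified to a plain NaN.

```julia
# Hypothetical zero-mean covariance over two iterators, using iterate directly.
function covzm_iter_sketch(x, y; corrected::Bool=true)
    z = zip(x, y)
    z_itr = iterate(z)
    z_itr === nothing && return NaN   # simplified handling of empty input
    (xi, yi), state = z_itr           # destructure the pair and the state at once
    total = xi * conj(yi)
    count = 1
    z_itr = iterate(z, state)
    while z_itr !== nothing
        (xi, yi), state = z_itr
        total += xi * conj(yi)
        count += 1
        z_itr = iterate(z, state)
    end
    return total / (count - Int(corrected))
end

xs = [-1.0, 0.0, 1.0]   # already centered
ys = [-2.0, 0.0, 2.0]   # already centered
c = covzm_iter_sketch((xi for xi in xs), (yi for yi in ys))
```

Naming the pair elements at the point of destructuring avoids indexing into a tuple later in the loop body.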
src/Statistics.jl
Outdated
function covzm(itr::Any; corrected::Bool=true)
y = iterate(itr)
if y === nothing
return Base.mapreduce_empty_iter(_abs2, Base.add_sum, itr,
+ is used instead of add_sum:
return Base.mapreduce_empty_iter(_abs2, Base.add_sum, itr,
return Base.mapreduce_empty_iter(_abs2, +, itr,
But maybe it would be clearer to use the same approach for all methods (see suggestion for the two-argument case).
covm(x::AbstractVector, xmean, y::AbstractVector, ymean; corrected::Bool=true) =
    covzm(map(t -> t - xmean, x), map(t -> t - ymean, y); corrected=corrected)
covm(x::AbstractVecOrMat, xmean, y::AbstractVecOrMat, ymean, vardim::Int=1; corrected::Bool=true) =
    covzm(x .- xmean, y .- ymean, vardim; corrected=corrected)

# cov (API)
"""
cov(x::Any; corrected::Bool=true)
Better adapt the existing docstring (and method) to only mention iterators, since vectors are just a special case. Same for others.
src/Statistics.jl
Outdated
@@ -517,17 +560,75 @@ end

# covm (with provided mean)
## Use map(t -> t - xmean, x) instead of x .- xmean to allow for Vector{Vector}
## which can't be handled by broadcast
## which can't be handled by broadcastz
## which can't be handled by broadcastz
## which can't be handled by broadcast
(but this comment should just be moved below)
@nalimilan if you are happy with this I can move on to
I think it's OK. We could get rid of that small overhead by defining an internal helper with
Working on correlation now. Shouldn't be too hard.
I am working through this. I want to add There seems to be no Let me know if I should cc Andreas or someone else who has worked with this code more.
I think that method isn't needed since this definition works both for vectors and matrices:
corm(x::AbstractVecOrMat, xmean, y::AbstractVecOrMat, ymean, vardim::Int=1) =
    corzm(x .- xmean, y .- ymean, vardim)
Looking at how tricky the changes need to be to support iterators, I'm more and more inclined to just add methods to
I am starting to agree. I was doing well using the
Tests currently fail due to bug 2, and writing generic code means adding a lot of methods to helper functions. It probably isn't worth it. Take a look at the most recent commit pushed to this branch to make the final call based on an implementation that is pretty much what we want if we avoid
I vote for keeping things simple, at least for now. :-)
A few comments that probably apply also to cor, etc.
src/Statistics.jl
Outdated
return (v + v) / 0
end
count = 1
itri, state = y
Why not use value as elsewhere?
@@ -518,20 +527,37 @@ end
# covm (with provided mean)
## Use map(t -> t - xmean, x) instead of x .- xmean to allow for Vector{Vector}
## which can't be handled by broadcast
covm(itr::Any, itrmean; corrected::Bool=true) =
    covm(map(t -> t - itrmean, x); corrected = corrected)
covm(x::AbstractVector, xmean; corrected::Bool=true) =
This method is identical to the previous one so it's no longer needed.
"""
function cov(x::Any, corrected::Bool=true)
cx = collect(x)
covm(cx, mean(cx); corrected=corrected)
This is going to make another copy of cx. Better call covzm directly.
_abs2(x::Number) = abs2(x)
_abs2(x) = x*x'

_conjmul(x::Number, y::Number) = x * conj(y)
Is this still needed?
This PR is in reference to #35050 in the Julia Repo here. The goal is to make working with iterators more convenient. In particular, this PR will make working with iterators that skip over missing values more convenient.
This PR will improve both cov and cor. However, it starts with cov, covm and covzm. I think I went overboard in avoiding allocations, because covm and covzm are handled via iterate only, the way mean(itr) is handled in the same file. However, for AbstractArrays it is fine for covm to use map and allocate a new vector. The decision on how to proceed will be based on how we would like to handle stateful iterators.
Currently there are 4 new methods added.
covzm(itr::Any; corrected::Bool=true)
covzm(x::Any, y::Any; corrected::Bool=true)
covm(itr::Any, itrmean; corrected::Bool=true)
covm(x::Any, xmean, y::Any, ymean; corrected::Bool=true)
As a consequence, something like
will break.
This is my first PR to a stdlib, so let me know if I am missing any conventions. Thanks!
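On the stateful-iterator question raised in the description, the sketch below shows one likely source of the difficulty: a two-pass computation (mean first, then deviations) consumes a stateful iterator during the first pass. It uses Iterators.Stateful from Base and mean from Statistics; the variable names are illustrative only.

```julia
using Statistics

# A stateful iterator remembers how far it has been consumed.
s = Iterators.Stateful([1.0, 2.0, 3.0])

m = mean(s)             # first pass: consumes every element
leftover = collect(s)   # second pass: nothing is left to iterate
```

A single-pass method built on iterate, or an up-front collect, would avoid the problem, at the cost of either more specialized code or one allocation.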