-
Notifications
You must be signed in to change notification settings - Fork 11
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! Looks good in general, in particular I like that NullableCategoricalArray
is consistently replaced with CategoricalArray{?T}
even when there are no nulls so that we exercize the same code paths as before.
docs/src/man/getting_started.md
Outdated
In many cases we're willing to just ignore missing values and remove them from our vector. We can do that using the `dropnull` function: | ||
|
||
```julia | ||
dropnull(nv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should keep a short introduction to null
and how to skip them. Even if technically that type lives in another package, people won't necessarily be familiar with it (especially those used to NA
/DataArray
).
|
||
```julia | ||
cv = CategoricalArray(["Group A", "Group A", "Group A", | ||
"Group B", "Group B", "Group B"]) | ||
``` | ||
|
||
A companion type, `NullableCategoricalArray`, allows storing missing values in the array: is to `CategoricalArray` what `NullableArray` is to the standard `Array` type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just mention that CategoricalArray
supports null values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an example
nullcount = count(isnull, X) | ||
pnull = 100 * nullcount/length(X) | ||
if pnull != 100 # describe will fail if dropnull returns an empty vector | ||
describe(io, collect(Nulls.skip(X))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's annoying that we have to make a copy just to call describe
. Is there a chance we could get it to work on general iterables? Also, ideally this shouldn't live in DataTables/DataFrames, but maybe for now there's no better solution.
return | ||
end | ||
|
||
function StatsBase.describe{T}(io::IO, X::CategoricalVector{T}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this function needed? Same issue as above: ideally, this shouldn't live in this package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will have to add StatsBase as a dependency to CategoricalArrays in order to move it there, but we could add that to JuliaData/CategoricalArrays.jl#70? I agree though, we just removed StatsBase.describe code from DataTables a few months ago. Same dependency issue for why the StatsBase.describe{T}(io::IO, X::Vector{Union{T, Null}})
version is needed above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A dependency on StatsBase wouldn't be great for CategoricalArrays. But my question was about why the generic method isn't enough. What's so special about CategoricalArrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Mainly because I hadn't noticed I had made the generic method too restrictive with function StatsBase.describe{T}(io::IO, X::Vector{Union{T, Null}})
when it could/should be X::AbstractVector{Union{T, Null}})
. I'll push an update
@@ -391,7 +429,7 @@ function _nonnull!(res, col) | |||
end | |||
end | |||
|
|||
function _nonnull!(res, col::NullableCategoricalArray) | |||
function _nonnull!(res, col::CategoricalArray{T}) where {T >: Null} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you're at it, use AbstractCategoricalArray
. You also don't need to define T
, just do AbstractCategoricalArray{>: Null}
.
test/datatable.jl
Outdated
@@ -148,7 +148,7 @@ module TestDataTable | |||
# @test size(dt, 2) == 3 | |||
# @test typeof(dt[:, 1]) == Vector{?Int} | |||
# @test typeof(dt[:, 2]) == Vector{?Float64} | |||
# @test typeof(dt[:, 3]) == NullableCategoricalVector{String,UInt32} | |||
# @test typeof(dt[:, 3]) == CategoricalVector{?String,UInt32} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be fixed too, using <:
, since it wouldn't work in its current form.
test/io.jl
Outdated
answer = Sys.WORD_SIZE == 64 ? 0x3bd4a4a099ae8aac : 0x5ecdc4bb | ||
@test hash(sprint(printtable, dt)) == answer | ||
Sys.WORD_SIZE == 64 && @test hash(sprint(printtable, dt)) == 0x3031952855e1ebb8 | ||
# answer = Sys.WORD_SIZE == 64 ? 0x875219f56dd55294 : 0x5ecdc4bb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why comment this out?
@@ -102,7 +102,7 @@ module TestJoin | |||
categorical!(dt1, :B) | |||
join(dt1, dt1, on = [:A, :B], kind = :inner) | |||
|
|||
# Test that Array{Nullable} works when combined with NullableArray (#1088) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could still say that this mixes nullable and non-nullable array (the old description looks incorrect?). Same below.
test/sort.jl
Outdated
@@ -4,7 +4,7 @@ module TestSort | |||
dv1 = [9, 1, 8, null, 3, 3, 7, null] | |||
dv2 = [9, 1, 8, null, 3, 3, 7, null] | |||
dv3 = Vector{?Int}(1:8) | |||
cv1 = NullableCategoricalArray(dv1, ordered=true) | |||
cv1 = CategoricalArray{?eltype(dv1)}(dv1, ordered=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to specific the element type AFAICT.
test/utils.jl
Outdated
Length: 4 | ||
Type: Union{Int64, Nulls.Null} | ||
Number Unique: 4 | ||
Type: Int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be ?Int64
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, in #68 you used $Int
: can you do the same here?
test/join.jl
Outdated
|
||
@test join(dt1, dt2, kind=:cross) == | ||
DataTable(Any[repeat([1, 3, 5], inner = 5), | ||
repeat([1, 3, 5], inner = 5), | ||
repeat([0, 1, 2, 3, 4], outer = 3), | ||
repeat([0, 1, 2, 3, 4], outer = 3)], | ||
[:id, :fid, :id_1, :fid_1]) | ||
@test typeof.(join(dt1, dt2, kind=:cross).columns) == | ||
[CategoricalVector{i, DRT} for i in [Int, Float64, Int, Float64]] | ||
@test all(map(<:, typeof.(join(dt1, dt2, kind=:cross).columns), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows us to avoid checking for equality with the DefaultRefType in CategoricalArrays along with the 3rd and 4th parameter type, but strangely just using all
results in an error. It requires map
then all
. Not sure what the bug is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
julia> @test all(map(<:, typeof.(join(dt1, dt2, kind=:cross).columns),
CategoricalVector{T} for T in (Int, Float64, Int, Float64)))
Test Passed
julia> all(<:, typeof.(join(dt1, dt2, kind=:cross).columns), CategoricalVector{T} for T in (Int, Float64, Int, Float64))
ERROR: ArgumentError: reduced dimension(s) must be integers
Stacktrace:
[1] reduced_indices(::Tuple{Base.OneTo{Int64}}, ::Base.Generator{NTuple{4,DataType},##3#4}) at ./reducedim.jl:35
[2] reducedim_initarray at ./reducedim.jl:73 [inlined]
[3] reducedim_initarray at ./reducedim.jl:74 [inlined]
[4] reducedim_init(::Function, ::Base.#&, ::Array{DataType,1}, ::Base.Generator{NTuple{4,DataType},##3#4}) at ./reducedim.jl:114
[5] mapreducedim(::Function, ::Function, ::Array{DataType,1}, ::Base.Generator{NTuple{4,DataType},##3#4}) at ./reducedim.jl:242
[6] all(::Function, ::Array{DataType,1}, ::Base.Generator{NTuple{4,DataType},##3#4}) at ./reducedim.jl:583
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all
with more than two arguments isn't even documented, not sure what it's trying to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it's really nice to have an introduction to Null
. I've a few more stylistic comments and it should be good.
docs/src/man/getting_started.md
Outdated
|
||
To get started, let's examine the `Nullable` type. Objects of this type can either hold a value, or represent a missing value (`null`). For example, this is a `Nullable` holding the integer `1`: | ||
To get started, let's examine the `Null` type. `Null` is a type implemented by [Nulls.jl](https://github.com/JuliaData/Nulls.jl) to represent missing data. `null` is an instance of the type `Null` used to represent a missing value. Note that Nulls.jl is not loaded by default when using DataTables and must be imported seperately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"separately"
docs/src/man/getting_started.md
Outdated
julia> Null | ||
Nulls.Null | ||
|
||
julia> typeof(Null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd skip that line, it's confusing if you don't know what DataType
is.
docs/src/man/getting_started.md
Outdated
``` | ||
|
||
`Nullable` objects support all standard operators, which return another `Nullable`. One of the essential properties of `null` values is that they poison other items. To see this, try to add something like `Nullable(1)` to `Nullable()`: | ||
The `Null` type allows users to create `Vectors` and `DataTable` columns that are of type `Union{T, Null}` where `T` can be any type, e.g. `Int`, `Float64`, `String`, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add something like "allowing for missing values" somewhere in the sentence?
docs/src/man/getting_started.md
Outdated
``` | ||
|
||
The `get` function can be used to extract the value from the [`Nullable`](http://docs.julialang.org/en/stable/manual/types/#nullable-types-representing-missing-values) wrapper when it is not null. For example: | ||
In the above example, you'll see that the element-type of the vector is `?Int64`. The `?T` is used as a shorthand to represent `Union{T, Null}`, and in this case `?Int64` is equivalent to `Union{Int64, Null}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd move this sentence before the example.
docs/src/man/getting_started.md
Outdated
|
||
## The `NullableArray` Type | ||
julia> Nulls.skip(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To illustrate how this can be used, show something like sum(Nulls.skip(X))
?
docs/src/man/getting_started.md
Outdated
``` | ||
|
||
But arrays of `Nullable` are inefficient, both in terms of computation costs and of memory use. `NullableArrays` provide a more efficient storage, and behave like `Array{Nullable}` objects. | ||
`null` elements can be replaced with a user-defined value using `Nulls.replace`, and the companion type `T` in `Union{T, Null}` types can be conveniently accessed using `Nulls.T`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't use the term "companion".
docs/src/man/getting_started.md
Outdated
``` | ||
|
||
Instead of removing `null` values, you can try to convert the `NullableArray` into a normal Julia `Array` using `convert`: | ||
Use `nulls` to generate `Vectors` and `Arrays`, using the optional first argument to specify a companion `Union` type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"to generate nullable ...". Again, I wouldn't say "companion" and just say "to specify the element type".
docs/src/man/getting_started.md
Outdated
## The `DataTable` Type | ||
|
||
The `DataTable` type can be used to represent data tables, each column of which is an array (by default, a `NullableArray`). You can specify the columns using keyword arguments: | ||
The `DataTable` type can be used to represent data tables, each column of which is an Array. You can specify the columns using keyword arguments: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"is a vector" (without upper case)
test/datatable.jl
Outdated
@@ -346,7 +346,8 @@ module TestDataTable | |||
# @test udt == unstack(dt, :variable, :value) == unstack(dt, :id, :variable, :value) | |||
@test udt == DataTable(Any[[1, 2], [1, 5], [2, 6], | |||
[3, 7], [4, 8]], [:id, :a, :b, :c, :d]) | |||
@test all(typeof.(udt.columns) .== NullableCategoricalVector{Int, UInt32}) | |||
|
|||
@test all(c -> c <: CategoricalVector{?Int}, typeof.(udt.columns)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor point, but could use isa
rather than typeof to make the code simpler (would also work elsewhere).
test/io.jl
Outdated
F = Vector{?Int}(1:3), | ||
G = nulls(3), | ||
H = fill(null, 3)) | ||
if Sys.WORD_SIZE == 64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing this only on 64-bit is weird. Better compare with the string directly (as I've argued elsewhere).
docs/src/man/getting_started.md
Outdated
|
||
To get started, let's examine the `Nullable` type. Objects of this type can either hold a value, or represent a missing value (`null`). For example, this is a `Nullable` holding the integer `1`: | ||
To get started, let's examine the `Null` type. `Null` is a type implemented by [Nulls.jl](https://github.com/JuliaData/Nulls.jl) to represent missing data. `null` is an instance of the type `Null` used to represent a missing value. Note that Nulls.jl is not loaded by default when using DataTables and must be imported separately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably reexport everything Nulls.jl provide, as calling using Nulls
all the time is going to be annoying (and a regression from DataArrays).
docs/src/man/getting_started.md
Outdated
``` | ||
|
||
The `get` function can be used to extract the value from the [`Nullable`](http://docs.julialang.org/en/stable/manual/types/#nullable-types-representing-missing-values) wrapper when it is not null. For example: | ||
The `Null` type lets users create `Vector`s and `DataTable` columns with missing values. Here we create a vector with a null value and the element-type of the returned vector is `?Int64`. The syntax `?T`, where `T` can be any type, represents `Union{T, Null}`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@quinnj has removed uses of ?T
in the CategoricalArrays PR since it's deprecated on 0.7. Better stop using it too for now, until T?
is available in 1.0.
Though we'll have to remove the show
methods from Nulls.jl too if we don't want it to appear in the output.
test/io.jl
Outdated
DRT = CategoricalArrays.DefaultRefType | ||
@test sprint(printtable, dt) == """ | ||
"A","B","C","D","E","F","G","H" | ||
1,"a","A","CategoricalArrays.CategoricalValue{Char,$DRT} 'a'","CategoricalArrays.CategoricalValue{String,$DRT} "A"","1",NULL,NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we need to fix printing of CategoricalValue
from printtable
: it should definitely use the compact printing. Also, NULL
should be null
.
BTW, better add a line break before starting the string so that you can drop the excess indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you like the CategoricalValue printing to be formatted? In both master and the nl/null branch of the Nullable -> Null PR in CategoricalArrays I get the following printing behavior in 0.7
julia> x = CategoricalVector(["s1", "s2", "s3"])
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
CategoricalArrays.CategoricalValue{String,UInt32} "s1"
CategoricalArrays.CategoricalValue{String,UInt32} "s2"
CategoricalArrays.CategoricalValue{String,UInt32} "s3"
julia> show(x)
CategoricalArrays.CategoricalValue{String,UInt32}["s1", "s2", "s3"]
julia> showcompact(x)
CategoricalArrays.CategoricalValue{String,UInt32}["s1", "s2", "s3"]
julia> print(x)
CategoricalArrays.CategoricalValue{String,UInt32}["s1", "s2", "s3"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, there's been a change in 0.7 recently. Array printing used to set :compact => true
, and that's no longer the case. The 0.6 behavior is better: it only prints the string when printing a categorical array, but prints the full details when showing a single value. Better tackle this as a separate issue. For now, probably better base print tests on the 0.6 output, which is more or less what we should be aiming for in 0.7 too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The show
change was introduced in JuliaLang/julia#22981, see discussion there.
Ok, I think I've taken care of all of the ?T references. Thanks for the reviews @nalimilan!
|
test/join.jl
Outdated
@test typeof.(l(on).columns) == [Vector{?Int}, | ||
Vector{?Float64}, | ||
Vector{?Float64}] | ||
@test typeof.(l(on).columns) == [Vector{Union{T, Null}} for T in (Int, Float64, Float64)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add line break here and below
Yes please, feel free to run w/ this & that PR. I'm tied up with some other DataStreams tagging issues currently. |
src/abstractdatatable/join.jl
Outdated
|
||
similar_nullable{T >: Null}(dv::AbstractArray{T}, dims::Union{Int, Tuple{Vararg{Int}}}) = | ||
(v = Vector{?Nulls.T(T)}(dims); fill!(v, null); return v) | ||
(v = Vector{Union{Nulls.T(T), Null}}(dims); fill!(v, null); return v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can just be Vector{T} since T >: Null
test/data.jl
Outdated
@@ -266,7 +266,7 @@ module TestData | |||
# m2 = join(dt1, dt2, on = ["a","b"], kind = :outer) | |||
# @test m2[10,:v2] == null | |||
# @test m2[:a] == | |||
# (?String)["x", "x", "y", "y", | |||
# Union{String, Null}["x", "x", "y", "y", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix spacing below
Batch/Row -> Field/Column Update CategoricalArray tests remove references to Nullables and NullableArrays Add check for whether Base.unique! is defined (0.6 vs. 0.7) Update StatsBase.describe code to work with Nulls Update documentation and examples for working with nulls Update printtable to output "null" rather than "NULL" Remove use of ?T to represent Union{T, Null}
@@ -4,16 +4,13 @@ | |||
|
|||
# Like similar, but returns a array that can have nulls and is initialized with nulls | |||
similar_nullable{T}(dv::AbstractArray{T}, dims::Union{Int, Tuple{Vararg{Int}}}) = | |||
(v = Vector{?T}(dims); fill!(v, null); return v) | |||
(v = Vector{Union{T, Null}}(dims); fill!(v, null); return v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we use similar(dv, Union{T, Null})
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that should work, I'll link to this in #66 and test it when I make updates there. Good suggestion!
Batch/Row -> Field/Column
Update CategoricalArray tests
remove references to Nullables and NullableArrays
Add check for whether Base.unique! is defined (0.6 vs. 0.7)
Update StatsBase.describe code to work with Nulls
Update documentation and examples for working with nulls
This is passing for me locally using @nalimilan's JuliaData/CategoricalArrays.jl#70 and @quinnj's null branch on DataStreams. Note this PR is targetted to merge into @quinnj's branch under review in #66 rather than master so we don't lose track of the things we haven't addressed yet in that PR