Skip to content
This repository has been archived by the owner on May 5, 2019. It is now read-only.

Get PR 66 passing tests on Julia 0.7 #83

Merged
merged 1 commit into from
Aug 28, 2017
Merged

Get PR 66 passing tests on Julia 0.7 #83

merged 1 commit into from
Aug 28, 2017

Conversation

cjprybol
Copy link
Contributor

@cjprybol cjprybol commented Aug 3, 2017

Batch/Row -> Field/Column
Update CategoricalArray tests
remove references to Nullables and NullableArrays
Add check for whether Base.unique! is defined (0.6 vs. 0.7)
Update StatsBase.describe code to work with Nulls
Update documentation and examples for working with nulls

This is passing for me locally using @nalimilan's JuliaData/CategoricalArrays.jl#70 and @quinnj's null branch on DataStreams. Note this PR is targetted to merge into @quinnj's branch under review in #66 rather than master so we don't lose track of the things we haven't addressed yet in that PR

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! Looks good in general, in particular I like that NullableCategoricalArray is consistently replaced with CategoricalArray{?T} even when there are no nulls so that we exercize the same code paths as before.

In many cases we're willing to just ignore missing values and remove them from our vector. We can do that using the `dropnull` function:

```julia
dropnull(nv)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should keep a short introduction to null and how to skip them. Even if technically that type lives in another package, people won't necessarily be familiar with it (especially those used to NA/DataArray).


```julia
cv = CategoricalArray(["Group A", "Group A", "Group A",
"Group B", "Group B", "Group B"])
```

A companion type, `NullableCategoricalArray`, allows storing missing values in the array: is to `CategoricalArray` what `NullableArray` is to the standard `Array` type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just mention that CategoricalArray supports null values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an example

nullcount = count(isnull, X)
pnull = 100 * nullcount/length(X)
if pnull != 100 # describe will fail if dropnull returns an empty vector
describe(io, collect(Nulls.skip(X)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's annoying that we have to make a copy just to call describe. Is there a chance we could get it to work on general iterables? Also, ideally this shouldn't live in DataTables/DataFrames, but maybe for now there's no better solution.

return
end

function StatsBase.describe{T}(io::IO, X::CategoricalVector{T})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this function needed? Same issue as above: ideally, this shouldn't live in this package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will have to add StatsBase as a dependency to CategoricalArrays in order to move it there, but we could add that to JuliaData/CategoricalArrays.jl#70? I agree though, we just removed StatsBase.describe code from DataTables a few months ago. Same dependency issue for why the StatsBase.describe{T}(io::IO, X::Vector{Union{T, Null}}) version is needed above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A dependency on StatsBase wouldn't be great for CategoricalArrays. But my question was about why the generic method isn't enough. What's so special about CategoricalArrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Mainly because I hadn't noticed I had made the generic method too restrictive with function StatsBase.describe{T}(io::IO, X::Vector{Union{T, Null}}) when it could/should be X::AbstractVector{Union{T, Null}}). I'll push an update

@@ -391,7 +429,7 @@ function _nonnull!(res, col)
end
end

function _nonnull!(res, col::NullableCategoricalArray)
function _nonnull!(res, col::CategoricalArray{T}) where {T >: Null}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're at it, use AbstractCategoricalArray. You also don't need to define T, just do AbstractCategoricalArray{>: Null}.

@@ -148,7 +148,7 @@ module TestDataTable
# @test size(dt, 2) == 3
# @test typeof(dt[:, 1]) == Vector{?Int}
# @test typeof(dt[:, 2]) == Vector{?Float64}
# @test typeof(dt[:, 3]) == NullableCategoricalVector{String,UInt32}
# @test typeof(dt[:, 3]) == CategoricalVector{?String,UInt32}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be fixed too, using <:, since it wouldn't work in its current form.

test/io.jl Outdated
answer = Sys.WORD_SIZE == 64 ? 0x3bd4a4a099ae8aac : 0x5ecdc4bb
@test hash(sprint(printtable, dt)) == answer
Sys.WORD_SIZE == 64 && @test hash(sprint(printtable, dt)) == 0x3031952855e1ebb8
# answer = Sys.WORD_SIZE == 64 ? 0x875219f56dd55294 : 0x5ecdc4bb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why comment this out?

@@ -102,7 +102,7 @@ module TestJoin
categorical!(dt1, :B)
join(dt1, dt1, on = [:A, :B], kind = :inner)

# Test that Array{Nullable} works when combined with NullableArray (#1088)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could still say that this mixes nullable and non-nullable array (the old description looks incorrect?). Same below.

test/sort.jl Outdated
@@ -4,7 +4,7 @@ module TestSort
dv1 = [9, 1, 8, null, 3, 3, 7, null]
dv2 = [9, 1, 8, null, 3, 3, 7, null]
dv3 = Vector{?Int}(1:8)
cv1 = NullableCategoricalArray(dv1, ordered=true)
cv1 = CategoricalArray{?eltype(dv1)}(dv1, ordered=true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to specific the element type AFAICT.

test/utils.jl Outdated
Length: 4
Type: Union{Int64, Nulls.Null}
Number Unique: 4
Type: Int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be ?Int64?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, in #68 you used $Int: can you do the same here?

test/join.jl Outdated

@test join(dt1, dt2, kind=:cross) ==
DataTable(Any[repeat([1, 3, 5], inner = 5),
repeat([1, 3, 5], inner = 5),
repeat([0, 1, 2, 3, 4], outer = 3),
repeat([0, 1, 2, 3, 4], outer = 3)],
[:id, :fid, :id_1, :fid_1])
@test typeof.(join(dt1, dt2, kind=:cross).columns) ==
[CategoricalVector{i, DRT} for i in [Int, Float64, Int, Float64]]
@test all(map(<:, typeof.(join(dt1, dt2, kind=:cross).columns),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows us to avoid checking for equality with the DefaultRefType in CategoricalArrays along with the 3rd and 4th parameter type, but strangely just using all results in an error. It requires map then all. Not sure what the bug is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

julia> @test all(map(<:, typeof.(join(dt1, dt2, kind=:cross).columns),
                         CategoricalVector{T} for T in (Int, Float64, Int, Float64)))
Test Passed

julia> all(<:, typeof.(join(dt1, dt2, kind=:cross).columns), CategoricalVector{T} for T in (Int, Float64, Int, Float64))
ERROR: ArgumentError: reduced dimension(s) must be integers
Stacktrace:
 [1] reduced_indices(::Tuple{Base.OneTo{Int64}}, ::Base.Generator{NTuple{4,DataType},##3#4}) at ./reducedim.jl:35
 [2] reducedim_initarray at ./reducedim.jl:73 [inlined]
 [3] reducedim_initarray at ./reducedim.jl:74 [inlined]
 [4] reducedim_init(::Function, ::Base.#&, ::Array{DataType,1}, ::Base.Generator{NTuple{4,DataType},##3#4}) at ./reducedim.jl:114
 [5] mapreducedim(::Function, ::Function, ::Array{DataType,1}, ::Base.Generator{NTuple{4,DataType},##3#4}) at ./reducedim.jl:242
 [6] all(::Function, ::Array{DataType,1}, ::Base.Generator{NTuple{4,DataType},##3#4}) at ./reducedim.jl:583

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all with more than two arguments isn't even documented, not sure what it's trying to do.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's really nice to have an introduction to Null. I've a few more stylistic comments and it should be good.


To get started, let's examine the `Nullable` type. Objects of this type can either hold a value, or represent a missing value (`null`). For example, this is a `Nullable` holding the integer `1`:
To get started, let's examine the `Null` type. `Null` is a type implemented by [Nulls.jl](https://github.com/JuliaData/Nulls.jl) to represent missing data. `null` is an instance of the type `Null` used to represent a missing value. Note that Nulls.jl is not loaded by default when using DataTables and must be imported seperately.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"separately"

julia> Null
Nulls.Null

julia> typeof(Null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd skip that line, it's confusing if you don't know what DataType is.

```

`Nullable` objects support all standard operators, which return another `Nullable`. One of the essential properties of `null` values is that they poison other items. To see this, try to add something like `Nullable(1)` to `Nullable()`:
The `Null` type allows users to create `Vectors` and `DataTable` columns that are of type `Union{T, Null}` where `T` can be any type, e.g. `Int`, `Float64`, `String`, etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add something like "allowing for missing values" somewhere in the sentence?

```

The `get` function can be used to extract the value from the [`Nullable`](http://docs.julialang.org/en/stable/manual/types/#nullable-types-representing-missing-values) wrapper when it is not null. For example:
In the above example, you'll see that the element-type of the vector is `?Int64`. The `?T` is used as a shorthand to represent `Union{T, Null}`, and in this case `?Int64` is equivalent to `Union{Int64, Null}`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move this sentence before the example.


## The `NullableArray` Type
julia> Nulls.skip(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To illustrate how this can be used, show something like sum(Nulls.skip(X))?

```

But arrays of `Nullable` are inefficient, both in terms of computation costs and of memory use. `NullableArrays` provide a more efficient storage, and behave like `Array{Nullable}` objects.
`null` elements can be replaced with a user-defined value using `Nulls.replace`, and the companion type `T` in `Union{T, Null}` types can be conveniently accessed using `Nulls.T`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't use the term "companion".

```

Instead of removing `null` values, you can try to convert the `NullableArray` into a normal Julia `Array` using `convert`:
Use `nulls` to generate `Vectors` and `Arrays`, using the optional first argument to specify a companion `Union` type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"to generate nullable ...". Again, I wouldn't say "companion" and just say "to specify the element type".

## The `DataTable` Type

The `DataTable` type can be used to represent data tables, each column of which is an array (by default, a `NullableArray`). You can specify the columns using keyword arguments:
The `DataTable` type can be used to represent data tables, each column of which is an Array. You can specify the columns using keyword arguments:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"is a vector" (without upper case)

@@ -346,7 +346,8 @@ module TestDataTable
# @test udt == unstack(dt, :variable, :value) == unstack(dt, :id, :variable, :value)
@test udt == DataTable(Any[[1, 2], [1, 5], [2, 6],
[3, 7], [4, 8]], [:id, :a, :b, :c, :d])
@test all(typeof.(udt.columns) .== NullableCategoricalVector{Int, UInt32})

@test all(c -> c <: CategoricalVector{?Int}, typeof.(udt.columns))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor point, but could use isa rather than typeof to make the code simpler (would also work elsewhere).

test/io.jl Outdated
F = Vector{?Int}(1:3),
G = nulls(3),
H = fill(null, 3))
if Sys.WORD_SIZE == 64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing this only on 64-bit is weird. Better compare with the string directly (as I've argued elsewhere).


To get started, let's examine the `Nullable` type. Objects of this type can either hold a value, or represent a missing value (`null`). For example, this is a `Nullable` holding the integer `1`:
To get started, let's examine the `Null` type. `Null` is a type implemented by [Nulls.jl](https://github.com/JuliaData/Nulls.jl) to represent missing data. `null` is an instance of the type `Null` used to represent a missing value. Note that Nulls.jl is not loaded by default when using DataTables and must be imported separately.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably reexport everything Nulls.jl provide, as calling using Nulls all the time is going to be annoying (and a regression from DataArrays).

```

The `get` function can be used to extract the value from the [`Nullable`](http://docs.julialang.org/en/stable/manual/types/#nullable-types-representing-missing-values) wrapper when it is not null. For example:
The `Null` type lets users create `Vector`s and `DataTable` columns with missing values. Here we create a vector with a null value and the element-type of the returned vector is `?Int64`. The syntax `?T`, where `T` can be any type, represents `Union{T, Null}`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quinnj has removed uses of ?T in the CategoricalArrays PR since it's deprecated on 0.7. Better stop using it too for now, until T? is available in 1.0.

Though we'll have to remove the show methods from Nulls.jl too if we don't want it to appear in the output.

test/io.jl Outdated
DRT = CategoricalArrays.DefaultRefType
@test sprint(printtable, dt) == """
"A","B","C","D","E","F","G","H"
1,"a","A","CategoricalArrays.CategoricalValue{Char,$DRT} 'a'","CategoricalArrays.CategoricalValue{String,$DRT} "A"","1",NULL,NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we need to fix printing of CategoricalValue from printtable: it should definitely use the compact printing. Also, NULL should be null.

BTW, better add a line break before starting the string so that you can drop the excess indentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you like the CategoricalValue printing to be formatted? In both master and the nl/null branch of the Nullable -> Null PR in CategoricalArrays I get the following printing behavior in 0.7

julia> x = CategoricalVector(["s1", "s2", "s3"])
3-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 CategoricalArrays.CategoricalValue{String,UInt32} "s1"
 CategoricalArrays.CategoricalValue{String,UInt32} "s2"
 CategoricalArrays.CategoricalValue{String,UInt32} "s3"

julia> show(x)
CategoricalArrays.CategoricalValue{String,UInt32}["s1", "s2", "s3"]
julia> showcompact(x)
CategoricalArrays.CategoricalValue{String,UInt32}["s1", "s2", "s3"]
julia> print(x)
CategoricalArrays.CategoricalValue{String,UInt32}["s1", "s2", "s3"]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there's been a change in 0.7 recently. Array printing used to set :compact => true, and that's no longer the case. The 0.6 behavior is better: it only prints the string when printing a categorical array, but prints the full details when showing a single value. Better tackle this as a separate issue. For now, probably better base print tests on the 0.6 output, which is more or less what we should be aiming for in 0.7 too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The show change was introduced in JuliaLang/julia#22981, see discussion there.

@cjprybol
Copy link
Contributor Author

cjprybol commented Aug 24, 2017

Ok, I think I've taken care of all of the ?T references. Thanks for the reviews @nalimilan!

Looks like the only failures left are from the same sorting issue noted by @rofinn in the nulls Categorical Array PR JuliaData/CategoricalArrays.jl#70 (comment). JuliaData/CategoricalArrays.jl#76

test/join.jl Outdated
@test typeof.(l(on).columns) == [Vector{?Int},
Vector{?Float64},
Vector{?Float64}]
@test typeof.(l(on).columns) == [Vector{Union{T, Null}} for T in (Int, Float64, Float64)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add line break here and below

@nalimilan
Copy link
Member

@quinnj Can we merge this into #66? That would make it easier to review the final PR.

@quinnj
Copy link
Member

quinnj commented Aug 28, 2017

Yes please, feel free to run w/ this & that PR. I'm tied up with some other DataStreams tagging issues currently.

@nalimilan
Copy link
Member

@cjprybol Merge if you think it's ready enough. Then we can add the missing pieces directly to #66.


similar_nullable{T >: Null}(dv::AbstractArray{T}, dims::Union{Int, Tuple{Vararg{Int}}}) =
(v = Vector{?Nulls.T(T)}(dims); fill!(v, null); return v)
(v = Vector{Union{Nulls.T(T), Null}}(dims); fill!(v, null); return v)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can just be Vector{T} since T >: Null

test/data.jl Outdated
@@ -266,7 +266,7 @@ module TestData
# m2 = join(dt1, dt2, on = ["a","b"], kind = :outer)
# @test m2[10,:v2] == null
# @test m2[:a] ==
# (?String)["x", "x", "y", "y",
# Union{String, Null}["x", "x", "y", "y",
Copy link
Contributor Author

@cjprybol cjprybol Aug 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix spacing below

Batch/Row -> Field/Column
Update CategoricalArray tests
remove references to Nullables and NullableArrays
Add check for whether Base.unique! is defined (0.6 vs. 0.7)
Update StatsBase.describe code to work with Nulls
Update documentation and examples for working with nulls
Update printtable to output "null" rather than "NULL"
Remove use of ?T to represent Union{T, Null}
@cjprybol cjprybol merged commit 31bd33b into JuliaData:jq/gangy Aug 28, 2017
@cjprybol cjprybol deleted the cjp-jq/gangy branch August 28, 2017 16:37
@@ -4,16 +4,13 @@

# Like similar, but returns a array that can have nulls and is initialized with nulls
similar_nullable{T}(dv::AbstractArray{T}, dims::Union{Int, Tuple{Vararg{Int}}}) =
(v = Vector{?T}(dims); fill!(v, null); return v)
(v = Vector{Union{T, Null}}(dims); fill!(v, null); return v)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we use similar(dv, Union{T, Null})?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that should work, I'll link to this in #66 and test it when I make updates there. Good suggestion!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants