Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Interval bounds field #217

Open
wants to merge 2 commits into
base: v2-DEV
Choose a base branch
from
Open

Conversation

rofinn
Copy link
Member

@rofinn rofinn commented May 18, 2023

Closes #208 and depends on #214

Intervals.Interval now serves a unique purpose over IntervalSets.Interval which we can now document:

  • If you're only working with individual intervals then use IntervalSets.Interval
  • If your arrays of intervals all have the same bounds then use IntervalSets.Interval
  • If your arrays of intervals may have different bounds then use Intervals.Interval

TODO:

  • Encode Interval bounds as an integer field rather than type parameters.
  • Update AnchoredIntervals
  • Finish getting existing tests and docstrings updated
  • Add documentation

Benchmarks:

In the case where they're all the same, it's order of magnitudes faster to use the parameterized version in IntervalSets.jl

julia> @benchmark [IntervalSets.ClosedInterval(i, i+1) for i in 1:100]
BenchmarkTools.Trial: 10000 samples with 973 evaluations.
 Range (min … max):   71.521 ns … 487.010 ns  ┊ GC (min … max): 0.00% …  0.00%
 Time  (median):      96.881 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   108.180 ns ±  48.964 ns  ┊ GC (mean ± σ):  7.43% ± 12.24%

    ▂▅▇█▇▅▄▃▂▂▁                                                 ▂
  ▄▇██████████████▇▅▆▅▄▅▄▁▅▄▁▃▁▃▄▁▁▁▁▁▃▁▁▁▁▁▄▁▅▄▅▅▄▅▆▇██▇████▇▇ █
  71.5 ns       Histogram: log(frequency) by time        347 ns <

 Memory estimate: 1.77 KiB, allocs estimate: 1.

julia> @benchmark [Intervals.Interval(i, i+1) for i in 1:100]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.172 μs …  65.246 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.428 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.471 μs ± 585.522 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▆██▆▄▁                                                 
  ▂▂▂▃▅▇███████▇▆▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  12.2 μs         Histogram: frequency by time         13.7 μs <

 Memory estimate: 2.50 KiB, allocs estimate: 1.

However, once the intervals have been allocated many operations have similar performance.

julia> @benchmark IntervalSets.width.($closed_intervalsets)
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
 Range (min … max):  26.146 ns … 764.684 ns  ┊ GC (min … max):  0.00% … 93.51%
 Time  (median):     32.709 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   38.117 ns ±  39.154 ns  ┊ GC (mean ± σ):  10.16% ±  9.43%

  ▆█▅▂▁               ▁                                        ▁
  █████▇█▅▃▄▅▄▄▄▁▃▃▃▃▅█▆▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▄▃▃▁▃ █
  26.1 ns       Histogram: log(frequency) by time       336 ns <

 Memory estimate: 496 bytes, allocs estimate: 1.

julia> @benchmark Intervals.span.($closed_intervals)
BenchmarkTools.Trial: 10000 samples with 988 evaluations.
 Range (min … max):  48.636 ns … 616.634 ns  ┊ GC (min … max): 0.00% … 81.62%
 Time  (median):     52.459 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.695 ns ±  38.890 ns  ┊ GC (mean ± σ):  6.68% ±  8.92%

  █▅▁                 ▁                                        ▁
  ████▄▄▁▁▁▃▁▁▁▃▁▁▁▁▁▆█▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▄ █
  48.6 ns       Histogram: log(frequency) by time       376 ns <

 Memory estimate: 496 bytes, allocs estimate: 1.

And once we need to deal with potentially mixed bounds the Intervals.Interval performs better... as you'd expect.

julia> @benchmark [IntervalSets.Interval{L, R}(i, i+1) for i in 1:100 for (L, R) in Iterators.product([:open, :closed], [:open, :closed])]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  209.827 μs …   2.811 ms  ┊ GC (min … max): 0.00% … 89.93%
 Time  (median):     215.609 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   232.477 μs ± 196.159 μs  ┊ GC (mean ± σ):  6.87% ±  7.43%

      ▃█▇▄▁                                                      
  ▂▂▄▇█████▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂ ▃
  210 μs           Histogram: frequency by time          265 μs <

 Memory estimate: 302.86 KiB, allocs estimate: 5023.

julia> @benchmark [Intervals.Interval(i, i+1, (L, R)) for i in 1:100 for (L, R) in Iterators.product([Intervals.Open, Intervals.Closed], [Intervals.Open, Intervals.Closed])]
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  54.454 μs …  1.823 ms  ┊ GC (min … max): 0.00% … 96.48%
 Time  (median):     55.620 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.686 μs ± 48.645 μs  ┊ GC (mean ± σ):  2.49% ±  2.86%

   ▂▅▇███▇▅▄▂▂▃▃▂▂▁▁                                          ▂
  ▆██████████████████▇▇▆▆▆▆▆▇▆▇▆▇▇▇▆▆▆▆▆▅▆▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▆▇▆▆ █
  54.5 μs      Histogram: log(frequency) by time      66.2 μs <

 Memory estimate: 43.19 KiB, allocs estimate: 210.

And because we don't need to box the elements operations over all elements are faster.

julia> @benchmark IntervalSets.width.($mixed_intervalsets)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.293 μs … 240.208 μs  ┊ GC (min … max): 0.00% … 92.81%
 Time  (median):     6.773 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.902 μs ±   3.596 μs  ┊ GC (mean ± σ):  0.86% ±  1.62%

        ▁▂▂▄▅▆▇▇█▆▆▆▄▄▄▅▄▁                                     
  ▁▁▂▂▄▇███████████████████▇▆▅▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▁▁▁▁ ▄
  6.29 μs         Histogram: frequency by time        7.99 μs <

 Memory estimate: 3.25 KiB, allocs estimate: 1.

julia> @benchmark Intervals.span.($mixed_intervals)
BenchmarkTools.Trial: 10000 samples with 405 evaluations.
 Range (min … max):  218.133 ns …   4.067 μs  ┊ GC (min … max):  0.00% … 89.90%
 Time  (median):     259.435 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   307.224 ns ± 272.729 ns  ┊ GC (mean ± σ):  14.90% ± 14.28%

  ▇█▅▃▁▁                                                        ▁
  ██████▇▆▅▆▄▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▄▄▃▅▆▇▇▇▇▇█▇ █
  218 ns        Histogram: log(frequency) by time       1.82 μs <

 Memory estimate: 3.25 KiB, allocs estimate: 1.

@rofinn rofinn added this to the Intervals 2.0 milestone May 18, 2023
@rofinn rofinn requested a review from omus as a code owner May 18, 2023 22:51
@codecov
Copy link

codecov bot commented May 18, 2023

Codecov Report

Merging #217 (056e7f7) into master (ba938f6) will decrease coverage by 34.55%.
The diff coverage is 94.78%.

@@             Coverage Diff             @@
##           master     #217       +/-   ##
===========================================
- Coverage   84.31%   49.77%   -34.55%     
===========================================
  Files          12       12               
  Lines         848      870       +22     
===========================================
- Hits          715      433      -282     
- Misses        133      437      +304     
Impacted Files Coverage Δ
src/anchoredinterval.jl 96.55% <87.50%> (-2.58%) ⬇️
src/interval.jl 92.00% <93.67%> (-4.02%) ⬇️
src/Intervals.jl 100.00% <100.00%> (ø)
src/endpoint.jl 98.33% <100.00%> (+0.22%) ⬆️
src/parse.jl 100.00% <100.00%> (ø)

... and 5 files with indirect coverage changes

@rofinn rofinn changed the base branch from master to v2-DEV May 19, 2023 19:25
@omus
Copy link
Collaborator

omus commented May 26, 2023

Encode Interval bounds as an integer field rather than type parameters.

We originally used something very similar before Intervals 1.3 but decided in #102 to switch over to encoding the bounds as parametric type information to reduce memory (the main idea being that a vector of intervals with the same bounds do not need to encode this information for each instance).

I think probably the right approach is to introduce the changes you added here as new interval types (we can possibly rename the original ones too) so end-users can switch between these different types based upon their particular use case to maximize performance.

Copy link
Collaborator

@omus omus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of initial thoughts

Comment on lines +25 to +30
bounds_int(l::Type{Open}, r::Type{Open}) = 0x00
bounds_int(l::Type{Closed}, r::Type{Open}) = 0x01
bounds_int(l::Type{Open}, r::Type{Closed}) = 0x02
bounds_int(l::Type{Closed}, r::Type{Closed}) = 0x03

bounds_types(x::Integer) = bounds_types(Val(UInt8(x)))
Copy link
Collaborator

@omus omus May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed opportunities to use bit masks:

Suggested change
bounds_int(l::Type{Open}, r::Type{Open}) = 0x00
bounds_int(l::Type{Closed}, r::Type{Open}) = 0x01
bounds_int(l::Type{Open}, r::Type{Closed}) = 0x02
bounds_int(l::Type{Closed}, r::Type{Closed}) = 0x03
bounds_types(x::Integer) = bounds_types(Val(UInt8(x)))
bounds_int(l::Type{Open}, r::Type{Open}) = 0x00 | 0x00
bounds_int(l::Type{Closed}, r::Type{Open}) = 0x10 | 0x00
bounds_int(l::Type{Open}, r::Type{Closed}) = 0x00 | 0x01
bounds_int(l::Type{Closed}, r::Type{Closed}) = 0x10 | 0x01
function bounds_types(x::UInt8)
L = if x & 0x20 == 0x20
Unbounded
elseif x & 0x10 == 0x10
Closed
else
Open
end
R = if x & 0x02 == 0x02
Unbounded
elseif x & 0x01 == 0x01
Closed
else
Open
end
return (L, R)
end

Can make this logic even cleaner with something like: https://github.com/JuliaTime/TimeZones.jl/blob/master/src/class.jl

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually my initial pass did this, but I concluded that the value dispatch was more declarative / readable and I didn't notice a performance difference either way... though maybe there's a specific case you're thinking of?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just from a reasoning standpoint it seems sensible to denote each of the 8-bits for a specific purpose. Your values from 0x00 - 0x03 follow that logic but the unbounded values do not. This makes it difficult to reason into what each bit does and use bitwise operators to determine information about bounds.

I'm fine with using Val dispatch if there isn't a performance issue.

abstract type AbstractInterval{T, L <: Bound, R <: Bound} end
# Methods for convert between int and tuple representations for space efficiency
# (Open, Closed) is 16 bytes while the integer represenation is only 1 byte.
# TODO: Convert types to symbols when we switch to extending IntervalSets.jl
Copy link
Collaborator

@omus omus May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been catching up with the motivation behind these changes. Is there rational why using symbols is a better approach than using types? Should IntervalSets.jl be updated to use types?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel like it matters either way. My preference is more focused on making things consistent between the two, so Intervals.jl can extend IntervalSets.jl. As it stands, there doesn't seem to be much advantage to having them be types. We don't leverage the type hierarchy much and we don't store any information in them, so it's just extra names/types to remember. The practical benefit of switching to symbols is that it means we may be able to have Intervals.jl extend IntervalSets.jl with only one breaking release in Intervals.jl. Going the other way around would require a breaking release to IntervalSets.jl and a lot more updates in the ecosystem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the counter argument to using symbols is that by using types you can use the standard type inspection functions to determine what the names are used by this package. Dispatch on these types isn't used commonly outside of this package but I do think it's extremely useful for the internal logic in endpoint.jl.

bounds_types(::Val{0x03}) = (Closed, Closed)

# Extension for backwards support of Unbounded endpoints to avoid changing existing logic
# TODO: Drop these in favour of the approach in IntervalSets.jl
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like an approach has yet to be worked out: JuliaMath/IntervalSets.jl#123

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the question is whether it should be the responsibility of either IntervalSets.jl or Intervals.jl to handle this with yet another custom type. AFAICT, bounded-ness is just a question of whether an endpoint value isfinite which we already have an API for. My preference would be for that API to be handle independently of intervals like in Infinity.jl or Infinities.jl.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree that using isfinite on endpoints does seem like a better interface. From what I recall from the Infinity.jl work the parts where this breaks down is that:

  1. Defining Infinite{T} seems like a great idea for making any type into something that supports infinity but you lose the proper type hierarchy which makes this break down in multiple dispatch. Ideally we'd want to define Infinite{T} <: supertype(T) to make this work.
  2. With 1 being infeasible we're left with making specific types to properly define supertypes. This results in not all types having an alternative infinite type.

This is where it started to make more sense to have the intervals to support infinity themselves. For example Interval{Char,Closed,Unbounded}.

An alternative strategy could be something like:

struct Interval{T}
    first::Union{T,NegInf}
    last::Union{T,PosInf}
end

but I doubt that would be an improvement over the v1 setup.

Comment on lines +41 to +42
struct Interval{T} <: AbstractInterval{T}
bounds::UInt8 # bounds comes first to allow incomplete construction for unbounded intervals
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious if the intention is to have Intervals.jl use the IntervalSets.jl interface isn't this moving in the wrong direction? Or is the plan to update IntervalSets.jl in some way?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the plan is to:

  1. Subtype IntervalSets.AbstractInterval
  2. Provide conversion and promotion rules for different interval types
  3. Provide documentation on when to use each interval type and why
  4. More clearly identify what should live in each package (ie: core API in IntervalSets.jl and collections of intervals in Intervals.jl)

In the end, most packages can just depend on a common API in IntervalSets.jl and just use duck-typing.

Yes, IntervalSets.jl may require some updates, but I'd like to narrow down what those should actually be. AFAIK, we didn't end up using a lot of the flexibility included with this package... maybe Beacon has used more of it? Can you clarify why you mean by "moving in the wrong direction"?

The argument for reverting back to storing bounds in a field rather than parameters is to differentiate the Intervals.Interval from the IntervalSets.Interval types. If you're just working with individual intervals or all intervals have the same bounds then use IntervalSets.Interval. If you're working with large collections of Intervals where you might not have consistent bounds then use the Interval.jl type. It's the same reasoning as using AnchoredInterval if you know that all intervals have the same span.

Again, the goal is to have the two main packages complement rather than compete with each other.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify why you mean by "moving in the wrong direction"?

What I meant by this is that IntervalSets.jl encodes the bound type information as type information which was closer to what was done before this PR than with the bit mask. Now that I understand your plan more with supporting both types this makes more sense.

Yes, IntervalSets.jl may require some updates, but I'd like to narrow down what those should actually be. AFAIK, we didn't end up using a lot of the flexibility included with this package.

I think one of those changes is to have IntervalSets.jl to move from Interval{L,R,T} to Interval{T,L,R}. If we're going to have multiple interval types the element type is going to come first for other implementations like you have here.

The argument for reverting back to storing bounds in a field rather than parameters is to differentiate the Intervals.Interval from the IntervalSets.Interval types. If you're just working with individual intervals or all intervals have the same bounds then use IntervalSets.Interval. If you're working with large collections of Intervals where you might not have consistent bounds then use the Interval.jl type. It's the same reasoning as using AnchoredInterval if you know that all intervals have the same span.

I can get behind having both types of concrete interval exist for different use cases. Where I'm less sure is how these packages should be combined (like you mentioned)

Copy link
Member Author

@rofinn rofinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I've tried to provide a bit more context behind my motivation for the changes. In general the goal is to reduce maintenance requirements and complement IntervalSets.jl rather than compete with it which means:

  1. I don't really care about types vs symbols.
  2. I'm open to renaming things if folks want.
  3. Unifying the APIs should ideally require the fewest breaking changes for the ecosystem
  4. There should be a reason to use one type vs another. It feels like IntervalSets.Interval and Intervals.Interval are competing with each other.

Comment on lines +25 to +30
bounds_int(l::Type{Open}, r::Type{Open}) = 0x00
bounds_int(l::Type{Closed}, r::Type{Open}) = 0x01
bounds_int(l::Type{Open}, r::Type{Closed}) = 0x02
bounds_int(l::Type{Closed}, r::Type{Closed}) = 0x03

bounds_types(x::Integer) = bounds_types(Val(UInt8(x)))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually my initial pass did this, but I concluded that the value dispatch was more declarative / readable and I didn't notice a performance difference either way... though maybe there's a specific case you're thinking of?

abstract type AbstractInterval{T, L <: Bound, R <: Bound} end
# Methods for convert between int and tuple representations for space efficiency
# (Open, Closed) is 16 bytes while the integer represenation is only 1 byte.
# TODO: Convert types to symbols when we switch to extending IntervalSets.jl
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel like it matters either way. My preference is more focused on making things consistent between the two, so Intervals.jl can extend IntervalSets.jl. As it stands, there doesn't seem to be much advantage to having them be types. We don't leverage the type hierarchy much and we don't store any information in them, so it's just extra names/types to remember. The practical benefit of switching to symbols is that it means we may be able to have Intervals.jl extend IntervalSets.jl with only one breaking release in Intervals.jl. Going the other way around would require a breaking release to IntervalSets.jl and a lot more updates in the ecosystem.

bounds_types(::Val{0x03}) = (Closed, Closed)

# Extension for backwards support of Unbounded endpoints to avoid changing existing logic
# TODO: Drop these in favour of the approach in IntervalSets.jl
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the question is whether it should be the responsibility of either IntervalSets.jl or Intervals.jl to handle this with yet another custom type. AFAICT, bounded-ness is just a question of whether an endpoint value isfinite which we already have an API for. My preference would be for that API to be handle independently of intervals like in Infinity.jl or Infinities.jl.

Comment on lines +41 to +42
struct Interval{T} <: AbstractInterval{T}
bounds::UInt8 # bounds comes first to allow incomplete construction for unbounded intervals
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the plan is to:

  1. Subtype IntervalSets.AbstractInterval
  2. Provide conversion and promotion rules for different interval types
  3. Provide documentation on when to use each interval type and why
  4. More clearly identify what should live in each package (ie: core API in IntervalSets.jl and collections of intervals in Intervals.jl)

In the end, most packages can just depend on a common API in IntervalSets.jl and just use duck-typing.

Yes, IntervalSets.jl may require some updates, but I'd like to narrow down what those should actually be. AFAIK, we didn't end up using a lot of the flexibility included with this package... maybe Beacon has used more of it? Can you clarify why you mean by "moving in the wrong direction"?

The argument for reverting back to storing bounds in a field rather than parameters is to differentiate the Intervals.Interval from the IntervalSets.Interval types. If you're just working with individual intervals or all intervals have the same bounds then use IntervalSets.Interval. If you're working with large collections of Intervals where you might not have consistent bounds then use the Interval.jl type. It's the same reasoning as using AnchoredInterval if you know that all intervals have the same span.

Again, the goal is to have the two main packages complement rather than compete with each other.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Storing open-ness as a type parameter makes Postgres parsing type unstable
2 participants