Skip to content
This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Immutable array #6

Closed
wants to merge 4 commits into from
Closed

Immutable array #6

wants to merge 4 commits into from

Conversation

julienrf
Copy link
Contributor

@julienrf julienrf commented Dec 21, 2016

I introduce immutable arrays, a concrete collection that uses an Array as internal representation. It supports fast indexed access and has a compact in-memory representation.

The implementations of the collections operations (filter, map, etc.) were originally provided by the IterableLike trait, and these implementations were all based on iterators, and thus were not appropriate for working with arrays (where, most of the time we know a priori the size of the resulting array).

Instead of inheriting from these “default” implementations and then overriding them, I decided to move the “default” implementations of collection operations into separate traits. As a consequence, IterableMonoTransforms and IterablePolyTransforms are now completely abstract.

Most of the operations on immutable arrays are based on tabulate.

@@ -166,12 +184,19 @@ trait IterableMonoTransforms[+A, +Repr] extends Any {
def drop(n: Int): Repr = fromIterableWithSameElemType(View.Drop(coll, n))

/** The rest of the collection without its first element. */
def tail: Repr = drop(1)
def tail: Repr = if (coll.isEmpty) ??? else drop(1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I think tail should not be defined on empty collections, right?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Precedent is to throw an UnsupportedOperationException, see List.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should go so far as to eliminate non-total methods, given their usefulness. Apart from :: I don't think we have any type that is statically guaranteed to be non-empty.

Copy link
Contributor Author

@julienrf julienrf Jan 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szeiger I’m not eliminating tail, I’m fixing its implementation. It is not supposed to terminate when called on an empty list.

Concerning the non-emptiness, we also have +: and :+, and we could overload the apply factory method to return a non-empty collection if called with at least one element (as I did in #9).

@julienrf
Copy link
Contributor Author

I just added a commit that solves the stackoverflow that happened at compile-time.

@@ -166,12 +184,19 @@ trait IterableMonoTransforms[+A, +Repr] extends Any {
def drop(n: Int): Repr = fromIterableWithSameElemType(View.Drop(coll, n))

/** The rest of the collection without its first element. */
def tail: Repr = drop(1)
def tail: Repr = if (coll.isEmpty) ??? else drop(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should go so far as to eliminate non-total methods, given their usefulness. Apart from :: I don't think we have any type that is statically guaranteed to be non-empty.

*
* Supports efficient indexed access and has a small memory footprint.
*/
class Array[+A] private (private val elements: scala.Array[AnyRef]) extends IndexedSeq[A] with SeqLike[A, Array] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a public constructor or companion apply to make it really useful. Wrapping an existing array without copying is important for performance. Wrapping with copying should also be supported (and be the safe default, of course).

To enable wrapping and copying, you need to take primitive arrays into account as well (like the array-like types in my specialization PR do).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a companion’s apply method (inherited from IterableFactory) which creates an ArrayBuffer and then copies its contents into a fresh new Array.

You think we should also add a public constructor aliasing an Array passed as parameter?

Copy link
Contributor

@szeiger szeiger Jan 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's unsafe but too useful not to have. It doesn't have to be a constructor or factory methods. Depending on the outcome of the mutable vs immutable discussion, it could naturally become Array(...).readOnly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I agree.

@@ -34,6 +35,10 @@ class LazyList[+A](expr: => LazyList.Evaluated[A])
case Some((hd, tl)) => s"$hd #:: $tl"
}
else "LazyList(?)"

// HACK Needed because we inherit from two implementations (one from LinearSeq and one from SeqLikeFromIterable)
override def drop(n: Int): LazyList[A] = super.drop(n)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should LinearSeq have implementations at all or only be a marker trait with a subtype of SeqLikeFromIterable providing the implementations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d be happy to see if it’s possible to remove any implementations from LinearSeq.

Copy link
Contributor Author

@julienrf julienrf Jan 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added a commit that solves the issue. The solution is to not automatically implement drop in terms of head and tail but provide this implementation as a LinearSeqLikeOpt trait that can optionally be mixed into a concrete collection implementation. (Note that the problem still exists for List because we want to inherit from both LinearSeqLikeOpt – to get the optimized drop implementation — and SeqLikeFromIterable — to get the implementations of all the other methods)

@lihaoyi
Copy link

lihaoyi commented Jan 9, 2017

Shouldn't this backed by an unboxed scala.Array[T], instead of an scala.Array[AnyRef] when used with primitives? If we want to aim for a "compact in-memory representation", we definitely need to avoid boxing .

From my research in http://www.lihaoyi.com/post/BenchmarkingScalaCollections.html, using an Array[AnyRef] would not actually save much space over using a Vector[T]; once you have 64 entries in the array the difference is less than 20%, and it goes down to almost nothing as the collections get larger. On the other hand, using an Array[Int] over an Array[Object] is consistently a 5x reduction in memory usage no matter how large the array is, which is expected given you're adding the size of both an object-reference as well as the object-header on top of the field that stores the actual integer.

I think an immutable array won't be generally useful as a compact data-store unless we ensure that it's backed by an unboxed Array. That may require us to change the API, and take classtags during construction and during transformations like map, which might make it unable to implement the same interface as a boxed-collection, but I think it's necessary to be useful. If our immutable array is boxed, the only time it saves significant memory is for tiny collections of less than 10 items. I don't think it's worth creating a whole new collection to satisfy such a narrow use case.

FWIW in my own code I'm currently using my own custom unboxed "immutable" array (which is a read-only view over a growable, unboxed, append-only array buffer):

sealed abstract class Agg[+T] extends (Int => T) with TraversableOnce[T]{

  def apply(i: Int): T
  def length: Int
  def toArray[B >: T: ClassTag]: Array[B]

  def zipWithIndex: Agg[(T, Int)]
  def withFilter(f: T => Boolean): Agg[T]
  def filter(f: T => Boolean): Agg[T]
  def map[V: ClassTag](f: T => V): Agg[V]

  def flatMap[V: ClassTag](f: T => TraversableOnce[V]): Agg[V]

  def last: T
  def lastOption: Option[T]
  def head: T
  def contains[B >: T](t: B): Boolean

  def headOption: Option[T]
  def indices: Range
  def indexOf[B >: T](t: B): Int
  def groupBy[V](key: T => V): Map[V, Agg[T]]
}


/**
  * A custom growable-array of unboxed `T`s.
  */
class Aggregator[T: ClassTag](initialSize: Int = 4) extends Agg[T] {
  private[this] var data = new Array[T](initialSize)
  private[this] var length0 = 0
  def length = length0
  def apply(i: Int) = data(i)
  def append(i: T) = {
    if (length >= data.length) {
      // Grow by 3/2 + 1 each time, same as java.util.ArrayList. Saves a bit
      // of memory over doubling each time, at the cost of more frequent
      // doublings,
      val newData = new Array[T](data.length * 3 / 2 + 1)
      System.arraycopy(data, 0, newData, 0, length)
      data = newData
    }
    data(length) = i
    length0 += 1
  }
  ...
}

This really did help reduce memory usage in my current project by several times (I'm holding a lot of large primitive caches and indices in memory...), so this isn't just a theoretical concern.

@lrytz
Copy link
Member

lrytz commented Jan 9, 2017

This new collection should come with benchmarks and usage recommendations. When should someone pick it over Vector or Array, and when not? Is memory footprint the only advantage? How does the performance of operations compare to Vector and Array? How does it perform on primitive types?

@julienrf
Copy link
Contributor Author

julienrf commented Jan 9, 2017

Shouldn't this backed by an unboxed scala.Array[T], instead of an scala.Array[AnyRef] when used with primitives?

Yes, that’s the plan! I will follow the design proposed in #2.

@DarkDimius
Copy link

How does the performance of operations compare to Vector and Array? How does it perform on primitive types?

@julienrf I'd propose to try(and benchmark) out a version that uses an underlying Array[Any]. The main reason is that HotSpot can unroll loops that iterate on an array and speculate on the first element of the array and perfrom specialized operation on the remaining elements.

@szeiger
Copy link
Contributor

szeiger commented Jan 10, 2017

@lihaoyi Yes, I already have the machinery necessary for that in https://github.com/scala/collection-strawman/blob/specialization/src/main/scala/strawman/collection/CollectionStrawman6.scala#L469. We should use it here, too. I'll rebase and polish up my specialization branch so we can merge it.

Slick also uses its own array-based collections for the AST: https://github.com/slick/slick/blob/master/slick/src/main/scala/slick/util/ConstArray.scala. AFAIR this resulted in a 30% performance improvement for the whole query compiler compared to Scala collections. Switching between different Scala collections (only Vector vs only ArrayBuffer vs whatever was most convenient in a specific case) made almost no difference.

@szeiger
Copy link
Contributor

szeiger commented Jan 12, 2017 via email

@DarkDimius
Copy link

@szeiger if you use Array[Any] the underlying array can be unboxed and stored efficiently, but every access will need to go through ScalaRuntime.

@szeiger
Copy link
Contributor

szeiger commented Jan 12, 2017 via email

@DarkDimius
Copy link

@szeiger, no, it erazes to simply Object, and you'll need methods like https://github.com/lampepfl/dotty/blob/master/library/src/scala/runtime/ScalaRunTime.scala#L82 to access elements inside it.

@DarkDimius
Copy link

@szeiger, same happens with Array[T] where T isn't bounded by AnyRef.

@lihaoyi
Copy link

lihaoyi commented Jan 12, 2017

Not sure if I'm missing something to but new Array[T](initialSize) with T: reflect.ClassTag definitely results in an unboxed array of primitives.

Maybe you need some specialization cleverness to avoid boxing on-access but having an efficient at-rest data structure is trivial, and important.

Copy link
Contributor

@odersky odersky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think immutable.Array should indeed be polymorphic. I would actually have a tendency to make it a value class.

There's also the issue about how to organize the ...Like traits, i.e. whether to split them into completely abstract traits and fromIterable traits. In my view these two topics should be treated separately.

with IterableMonoTransforms[A, C[A @uncheckedVariance]] // sound bcs of VarianceNote
with IterablePolyTransforms[A, C] {
with IterablePolyTransforms[A, C]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to split IterableLike and IterableLikeFromIterable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IterableLike defines abstract methods with no implementation.

IterableLikeFromIterable implements these methods based on a fromIterable(it: Iterable[A]): C[A] method.

In the case of arrays, it didn’t make sense to inherit from operations implemented in terms of fromIterable (most of them are implemented in terms of Array.tabulate).

Does that make sense to you?

@julienrf
Copy link
Contributor Author

julienrf commented Jan 18, 2017

@DarkDimius I changed the implementation to use an Array[Any] (see this commit) and compared the performance using the benchmarks of #19 but noticed no difference between the version using an underlying Array[AnyRef]. I might have done something wrong. Note that I’m creating the array with Array.ofDim[Any](n), so it always uses classOf[Any] at creation time, maybe that’s because of that?

@julienrf
Copy link
Contributor Author

julienrf commented Jan 18, 2017

@lrytz

This new collection should come with benchmarks and usage recommendations. When should someone pick it over Vector or Array, and when not? Is memory footprint the only advantage? How does the performance of operations compare to Vector and Array? How does it perform on primitive types?

I got preliminary answers to these questions. I ran the benchmarks (see #19) and compared List, Vector (the one of Scala 2.12, because our redesign does not include it yet) and immutable.Array.

You can find the comparison of execution time of various operations here. Unsurprisingly, we see that immutable.Array provides the fastest index access (apply method), iteration (foreach) and concatenation (concat), but the worst tail (but we saw that we could circumvent this by using .view.tail).

You can find the comparison of memory footprint here. Slightly more compact than Vector.

@jcdavis
Copy link

jcdavis commented Jan 20, 2017

@julienrf presumably your tail performance probably would be better if tail used arrayCopy?

Im curious what the goal is here - if this is supposed to be a new general purpose immutable Seq as in List/Vector I have a number of thoughts (this being the internet and all) from my years working on a large scala codebase that I would be happy to flesh out.

@lihaoyi
Copy link

lihaoyi commented Jan 20, 2017

Im curious what the goal is here - if this is supposed to be a new general purpose immutable Seq as in List/Vector I have a number of thoughts (this being the internet and all) from my years working on a large scala codebase that I would be happy to flesh out.

FWIW I'd actually prefer an immutable "dumb" flat-array as the "standard" data-type v.s. the current immutable.Vector, after going through my benchmarks

The only thing that Vector gives you is "acceptable" +: :+ tail init performance. In all other cases (lookup, concat, ...) while their asymptotic complexities are only off by a factor of log(n), the constant factors for Vector are pretty heavy:

  • Small Vector[Object]s of size 0-5 take up 4x as much memory as an Array[Object]; for Vector[Int] vs Array[Int], the ratio is more like 10x

  • Concatenation of equal-sized Vectors is about 10x slower than concatenation of equal-sized Arrays, regardless of size, despite whatever structural-sharing Vector provides

Furthermore, for the things where Vector is better than Array, it is generally far worst than List:

  • item-by-item construction via Vector :+ is about 15x slower than List ::

  • item-by-item deconstruction via .tail is 20-150x slower (!)

In general, I think that Vector, while avoiding the huge performance pitfalls of O(n) operations on Array or List, has generally pretty bad performance and memory-usage characteristics across-the-board.

To put it another way, I don't think people should lightly take the 10x slowdown in common operations or the 3-4x increase in memory usage for small collections unless they really need both item-by-item construction and indexed lookup. And I think that's a sufficiently niche use case that pushing people towards an immutable.Array or immutable.List should be the default best practice.

@lrytz
Copy link
Member

lrytz commented Jan 20, 2017

One problem I see is naming: Scala collections that are imported by default are immutable (Map, Set), except for Array. So far that was ok because Array meant low-level array, nothing else. If we introduce immutable.Array, it will be confusing why this is not the one imported by default.

@jcdavis
Copy link

jcdavis commented Jan 20, 2017

@lihaoyi I definitely agree - the 200 byte size (56 bytes for the Vector + 144 bytes for the array) for 1-32 values in particular is awful. However, it also has some good tricks like dirtying which can avoid copies.

My dream for a new immutable Seq is array-backed (With unboxed primitive via specialization or otherwise), with start/end indexes + a dirty flag to allow for fast tail and concat. The array could start at a much more reasonable size (say 4), and increase by 50% or 100% when needed for expanding. Oh, and it definitely implements Builder.sizeHint (unlike Vector )

Making the wrapped array "smarter" could probably be squeezed in 8 bytes (so 16 -> 24), which is an important target because otherwise you are 32 bytes (in x86-64 with CompressedOops land)

@sjrd
Copy link
Member

sjrd commented Jan 20, 2017

On a somewhat related note, Vector currently takes up the bulk of the code size of a smallish application in Scala.js. And there's basically no way around it because zillions of parts of the stdlib create Vectors because they're the default. If they could be treated as a special-purpose collection instead of the default, the wins for Scala.js in terms of code size would be substantial.

@DarkDimius
Copy link

DarkDimius commented Jan 20, 2017

@lihaoyi @jcdavis

My dream for a new immutable Seq is array-backed (With unboxed primitive via specialization or otherwise), with start/end indexes + a dirty flag to allow for fast tail and concat. The array could start at a much more reasonable size (say 4), and increase by 50% or 100% when needed for expanding. Oh, and it definitely implements Builder.sizeHint (unlike Vector )

https://slides.com/mukel/delaying-arrays/ (in order to correctly iterate through slides press space instead of left)
here is a report including benchmarking of a immutable collection prototype I've been working on a year ago, which is baked by Array of Arrays with is improvement over your proposal.
Note that the currently implemented vectors in stdlib are not the best ones we know, @nicolasstucki has worked on RRB vectors that are also used in this presentation.

@DarkDimius
Copy link

@DarkDimius
Copy link

The big thing about this array was that instead of counting complexity as standard algorithmic complexity, I wanted to optimize memory locality.
I believe we've succeeded to a huge degree and it might make sence to include them for consideration as standard immutable datastructure.

@DarkDimius
Copy link

The biggest drawback of the DelayingArrays is that it may indeed consume twice the memory needed to efficiently store the data.
But advantages are also substantial - very simple code while being more efficient for common operations.

@julienrf
Copy link
Contributor Author

julienrf commented Jan 20, 2017

presumably your tail performance probably would be better if tail used arrayCopy?

Yes, good point. And I could do the same for take and drop, too.

Im curious what the goal is here - if this is supposed to be a new general purpose immutable Seq as in List/Vector

First, I wanted to compare the time and space characteristics of an immutable collection backed by an array. I think its characteristics make it worth introducing as a concrete immutable collection type, for use cases where List is not appropriate (and Vector would still be useful because we can see it as a compromise between arrays and lists). Then, collections constructed with Seq(…) or Seq.fill(…) should probably default to immutable.Array, for the reasons mentioned by @lihaoyi and @sjrd. Another reason would be that we could use this datatype to store variadic arguments and that might be easier to interoperate with Java, where variadic arguments are already backed by arrays (I’m not sure about the interoperability because we are probably not going to make immutable.Array a value class). (and then, an eventual by product of this is that we could make scala.Seq = scala.collection.immutable.Seq)

@DarkDimius
Copy link

https://github.com/DarkDimius/DelayingArrays/blob/rewrite/src/main/scala/me/d_d/delaying/DelayingArray.scala - here is a source for a collection that was presented in those reports and presentation.

https://github.com/DarkDimius/DelayingArrays/blob/rewrite/src/main/scala/me/d_d/delaying/DArray.scala - here is a more complex version which I played with where I was trying to squeeze even more performance but it wasn't worth it)

@szeiger
Copy link
Contributor

szeiger commented Jan 20, 2017

Array is not the only non-immutable collection in the default namespace though. Seq is the generic version. The motivation for adding immutable arrays is that they can be used for varargs, which are of type scala.Seq. This is the reason why scala.Seq is currently scala.collection.Seq rather than scala.collection.immutable.Seq.

I'm not convinced by the name immutable.Array. Array is not a collection but exposes native JVM arrays directly. I'd prefer to keep it that way and use a different name like immutable.ArraySeq for a proper collection type.

Regarding the performance, I agree that List and Array-like collections are more suitable than Vector for many use cases, the problem is which one should be the default. Both have terrible performance for some operations. Vector is not particularly good at anything but avoids the degenerate cases. It's easy to avoid picking a default in some cases (like removing Seq.apply) but others are more fundamental to the Scala collections design and harder to get rid of without major design changes. For example, what should Map(1 -> 2).map(_ + _) return?

@Ichoran
Copy link
Contributor

Ichoran commented Jan 21, 2017

We will need to be cautious changing List as the default immutable.Seq since people have relied upon it--wrongly, but their code will break. (Maybe we want it to?)

However, there isn't much problem cutting out the 32-wide trie of Vector and replacing it with anything that has superior characteristic. Vector does nothing dreadfully, and it actually does higher-order operations pretty well. So despite its immensity and shocking constant factors, it's a really good go-to collection for "I don't care if it's fast, I just want it to work, and O(n^2) doesn't count as 'working'". Any other collection which is at least vaguely similar in speed and memory usage is a good alternate candidate. We should try out delaying arrays in some other common usage scenarios like concurrent access. The benchmarks in the paper look good!

@lihaoyi
Copy link

lihaoyi commented Jan 21, 2017

One thing of value in a "dumb" immutable.Array is that people already understand it: it's clear from the start that a copying prepend and append, tail or init are slow, for example. If you want it to be fast, use a mutable.Buffer or linked List.

While this may seem trivial, I would argue that it isn't. In the last 5 years, the fact that Vector#tail is 20-150x slower than List#tail, or that equal-sized Vector++ is 10x slower than Array++ for all sizes of collection, despite structural sharing, has not become common knowledge.

Thus I think the simplicity and ease-of-understandability of a "dumb" immutable.Array implementation is in itself valuable, in addition to really compact memory usage and fast bulk operations. Whether that's valuable enough to overcome the benefits of a more complex implementation (whether RRB-Vectors or Delaying-Arrays or something else) is a separate question.

@Ichoran
Copy link
Contributor

Ichoran commented Jan 21, 2017

@lihaoyi - I completely agree. That we may be able to improve Vector doesn't, I think, have much bearing on whether we want an immutable Array. It's really simple and really fast for what it's really fast at. These are nice qualities to have, and if it's specialized to be even faster that's even better.

@odersky
Copy link
Contributor

odersky commented Jan 22, 2017

I also think we are talking about two different things here

  • An alternative or replacement of immutable.Vector that has better constant factors for some operations.

  • A complement of mutable.Array that maps into essentially the same thing (possibly modulo a wrapper.) That was what [WIP] Add immutable array vc #16 tried to do.

The two do not necessarily have the same best solutions. #6 is clearly about the first issue, not the second (for some time, I misunderstood this point). I therefore think we should not use the name immutable.Array for this. Also, we should look at other possible implementation schemes such as the one referred to by @DarkDimius, and choose only after careful benchmarking.

In any case, this discussion is not critical to reach the the go/no go decision on the strawman. We can do this later once we flesh out the supported implementation types.

@julienrf
Copy link
Contributor Author

Superseeded by #31.

@julienrf julienrf closed this Feb 15, 2017
@julienrf julienrf deleted the immutable-array branch June 28, 2017 09:15
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants