
Implement copy-on-write for all backends #157

Closed
mratsim opened this issue Nov 21, 2017 · 5 comments


mratsim commented Nov 21, 2017

While I love value semantics and still think it is best to avoid this type of question:
[image]

I think:

  • easy syntax,
  • behaviour consistency between backends (Cpu, Cuda, OpenCL),
  • efficient memory allocation,
  • and performance

cannot be ignored.

Most Arraymancer performance issues come from memory allocation and are due to value semantics. The alternative, using unsafeView, unsafeSlice, unsafeSqueeze, etc., is clunky as it litters the codebase with low-level details.

With "copy-on-write", it is possible to be:

  • safe as with value semantics
  • and fast like reference semantics

by only copying when necessary, i.e. when a value is about to be mutated.

Benefits expected:

  • Copies involving non-mutable variables will be shallow by default
  • Slicing of non-mutable variables will be shallow by default
  • Everything involving mutable variables will be deep by default to prevent "unexpected surprises"
  • Keep the behaviour predictable:
    • let assignment from let ==> shallow
    • var assignment from let ==> deep
    • let assignment from var ==> deep
    • There will be no under-the-hood "if refcount == 1" optimization for var assignment.
      That would make Arraymancer depend on the refcounting GC; it would be much better to use destructors and the future =move and =sink operators (see #150) for that.
  • Copy-on-write will work well with unsafe proc chaining (first proc unsafe = all procs unsafe, #151)
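
To make the assignment matrix above concrete, here is a toy Nim snippet (illustrative only: `Toy` stands in for a Tensor, and plain seq value semantics stand in for the copies that COW would perform lazily):

```nim
# Toy value type standing in for a Tensor; the comments describe the
# observable behaviour the COW rules above are meant to guarantee.
type Toy = object
  data: seq[int]

let a = Toy(data: @[1, 2, 3])
let b = a   # let <- let  ==> can be shallow: neither side can mutate
var c = a   # var <- let  ==> deep: c may be mutated later
let d = c   # let <- var  ==> deep: future mutations of c must not leak into d

c.data[0] = 99
doAssert a.data[0] == 1   # a is unaffected
doAssert d.data[0] == 1   # d is unaffected
```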

Limitations:

  • Implicit result will still require unsafeView and unsafeUnsqueeze for shallow copies. (Obviously, slice mutation does not copy but overwrites in place.)
    This is perfectly reasonable in my opinion.

This supersedes #19.


mratsim commented Nov 21, 2017

Still stuck on assignment overloading: nim-lang/Nim#6348 and nim-lang/Nim#6786


mratsim commented Nov 22, 2017

Outcome of my research.

A tentative copy-on-write object to be used in a single-threaded context or a master thread + slave thread managed by OpenMP that only works on already allocated memory:

Forum thread.

{.experimental.}

type
  Storage[T] = ref object
    refcount: int
    data: seq[T]
    # lock / guard ?

  CowObject[T] = object
    metadata: int
    storage: Storage[T]

proc detach[T](c: var CowObject[T]) =
  # Note: this only works without races if only the main thread can access this.
  # Also, the refcount is only incremented on assignment; slices do not increment it.
  if c.storage.refcount == 1:
    return

  let old_store = c.storage
  var fresh_store: Storage[T]
  new fresh_store

  fresh_store.refcount = 1
  deepCopy(fresh_store.data, old_store.data)

  c.storage = fresh_store
  dec old_store.refcount

proc `=`[T](dst: var CowObject[T]; src: CowObject[T]) =
  inc src.storage.refcount
  system.`=`(dst, src)

proc `=destroy`[T](c: CowObject[T]) =
  # Note from the Nim manual: destructors are tied to a variable
  # and will not trigger for, say, slices.
  dec c.storage.refcount

proc toCowObj[T](s: varargs[T]): CowObject[T] {.noInit.} =
  result.metadata = 1337
  new result.storage
  result.storage.data = @s

proc `[]`[T](c: CowObject[T], idx: int): T =
  c.storage.data[idx]

proc `[]`[T](c: var CowObject[T], idx: int): var T =
  detach c
  c.storage.data[idx]

proc `[]=`[T](c: var CowObject[T], idx: int, val: T) =
  detach c
  c.storage.data[idx] = val


proc main() =
  let a = toCowObj(1, 2, 3, 4, 5)
  let b = a
  var c = a
  let d = c
  var e = c
  let f = e
  let g = f
  
  c[1] += 10
  e[2] = 100
  
  
  echo "\n\nMemory location"
  echo "a: [1, 2, 3, 4, 5]: " & $a.repr
  echo "b: [1, 2, 3, 4, 5]: " & $b.repr
  echo "c: [1, 12, 3, 4, 5]: " & $c.repr
  echo "d: [1, 2, 3, 4, 5]: " & $d.repr
  echo "e: [1, 2, 100, 4, 5]: " & $e.repr
  echo "f: [1, 2, 3, 4, 5]: " & $f.repr
  echo "g: [1, 2, 3, 4, 5]: " & $g.repr
  
  echo "\n\n"
  echo "a: [1, 2, 3, 4, 5]: " & $a
  echo "b: [1, 2, 3, 4, 5]: " & $b
  echo "c: [1, 12, 3, 4, 5]: " & $c
  echo "d: [1, 2, 3, 4, 5]: " & $d
  echo "e: [1, 2, 100, 4, 5]: " & $e
  echo "f: [1, 2, 3, 4, 5]: " & $f
  echo "g: [1, 2, 3, 4, 5]: " & $g

main()

Concurrent thread-safe version

Currently there is no plan to make this thread-safe with different Tensors in different threads sharing the same memory. First of all, Nim seqs and the GC are thread-local.

In case this is needed in the future, here is some research around copy-on-write, concurrency, locks, and lock-free or wait-free data structures.

Nim

Nim offers several intrinsics by default:

  • proc atomicInc(memLoc: var int; x: int = 1): int {..}
  • proc atomicDec(memLoc: var int; x: int = 1): int {..}
  • proc addAndFetch(p: ptr int; val: int): int {..}
  • proc cas[T: bool | int | ptr](p: ptr T; oldValue, newValue: T): bool {..}
  • proc cpuRelax()

Undocumented intrinsics in atomics.nim

The locks module, with a very nice "withLock" template.

And several lock examples in the manual
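
A rough sketch of how these intrinsics could make the refcount thread-safe (untested research code, not Arraymancer's implementation; `SharedStorage`, `retain` and `release` are illustrative names):

```nim
# Hypothetical atomic reference counting for a storage block shared
# across threads. atomicInc/atomicDec return the new counter value.
type SharedStorage = ptr object
  refcount: int
  # payload would live here, allocated with allocShared

proc retain(s: SharedStorage) =
  discard atomicInc(s.refcount)

proc release(s: SharedStorage) =
  # Whichever thread drops the count to zero frees the block.
  if atomicDec(s.refcount) == 0:
    deallocShared(s)
```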

Warnings - lock-free is not the solution to everything

Method                           Time (ms)
One Thread                             300
One Thread with Memory Barrier       4,700
One Thread with CAS                  5,700
Two Threads with CAS                18,000
One Thread with Lock                10,000
Two Threads with Lock              118,000

Benchmarks

  • Synchrobench
    • arrays, binary trees, hash tables, linked lists, queues and skip lists that are synchronized with copy-on-write, locks, read-modify-write, read-copy-update and transactional memory.

  • Reference-counting
  • Atomics
  • R/W lock (many readers or one writer)
  • Read-copy-update (RCU)
  • Copy on write
  • Software Transactional Memory


mratsim commented Nov 23, 2017

WIP in #159

Further research


mratsim commented Nov 24, 2017

After tinkering for a few hours, I have definitive proof that copy-on-write does not work cleanly for a neural-net library, so I will use reference semantics by default and require cloning for deep copies.

  • Increasing and decreasing the refcounter in tight loops is quite costly. I didn't benchmark properly on my full bench suite, but my incRef/decRef procs were taking 1-3% of test_nnp_loss (sigmoid_cross_entropy and softmax_cross_entropy).

  • Bis. A potential idea: instead of having a counter, just have a boolean "shared". When doing an assignment, set shared to true; when doing =sink, do not change it. --> when "shared" is true, mutation copies. This avoids keeping track of the refcounter.

  • Except that at the neural network level, Tensors will be referred to by a graph and will always be "shared", defeating the whole purpose of copy-on-write. Even refcounting doesn't work because each Variable holds a reference to the same tape, which holds a reference to all Variables and Tensors --> a.tape = b.tape triggers refcounting on the whole graph.
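
The "shared" boolean idea from the second bullet could be sketched like this (illustrative only, with a toy `MiniTensor` type; as the last bullet explains, it still breaks down once a graph keeps everything permanently "shared"):

```nim
{.experimental.}

# Sketch of the "shared flag" alternative to refcounting:
# assignment marks the storage as shared; mutation copies iff shared.
type
  Buf = ref object
    shared: bool
    data: seq[int]
  MiniTensor = object
    buf: Buf

proc `=`(dst: var MiniTensor, src: MiniTensor) =
  src.buf.shared = true            # any assignment marks the storage shared
  system.`=`(dst, src)

proc `[]=`(t: var MiniTensor, i, v: int) =
  if t.buf.shared:                 # copy-on-write: detach before mutating
    t.buf = Buf(shared: false, data: t.buf.data)
  t.buf.data[i] = v
```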

Last things

  • implicit sharing makes it hard to know whether Arraymancer will implicitly copy or implicitly share, meaning users will have to think about it whenever it matters.
  • to provide "unsafe" semantics with COW tensors, the syntax is system.`=`(dest, src), which is also unwieldy (with deep copy, just use unsafeView; with ref semantics, just use clone)

Edit: summary I did for Nim forum

  1. Tensors wrapped in containers: in neural networks, you create a tensor and then wrap it in a container that will be used in a graph that keeps track of all operations applied to it.
import ../src/arraymancer, sequtils

let ctx = newContext Tensor[int] # Context that will track operations applied on the tensor to compute gradient in the future

let W = ctx.variable toSeq(1..8).toTensor.reshape(2,4) # What is the refcount? --> it's 0

let x = toSeq(11..22).toTensor.reshape(4,3)
let X = ctx.variable x # What is the refcount? it's 1 until x goes out of scope

The refcount and the logic required become hard to follow and will probably lead to an over-engineered solution (or test suite).

  2. Predictability: when everything deep-copies or everything shallow-copies, everything is easier to reason about. If there is a perf or a sharing issue, just add a shallow copy or a clone and you're done.

  3. Workaroundability: since copy-on-write must overload assignment, if you want to ensure a shallow copy, for example, you have to use:

let foo = toCowObj(1, 2, 3, 4, 5)
var bar: CowObject[int]
system.`=`(bar, foo)


mratsim commented Nov 25, 2017

Closed by #160
