
Implement copy-on-write for all backends #157

Closed
mratsim opened this issue Nov 21, 2017 · 5 comments


mratsim commented Nov 21, 2017

While I love value semantics and still think it is best to avoid this type of question:
[image]

I think:

  • easy syntax,
  • behaviour consistency between backends (Cpu, Cuda, OpenCL),
  • efficient memory allocation,
  • and performance

cannot be ignored.

Most Arraymancer performance issues come from memory allocation and are due to value semantics. The alternative, using unsafeView, unsafeSlice, unsafeSqueeze, etc., is clunky as it litters the codebase with low-level details.

With "copy-on-write", it is possible to be:

  • safe as with value semantics
  • and fast like reference semantics

by only copying when necessary, i.e. when a value is about to be mutated.

Benefits expected:

  • Copies involving non-mutable variables will be shallow by default
  • Slicing of non-mutable variables will be shallow by default
  • Everything involving mutable variables will be deep by default to prevent "unexpected surprises"
  • Keep the behaviour predictable:
    • let assignment from let ==> shallow
    • var assignment from let ==> deep
    • let assignment from var ==> deep
    • There will be no under-the-hood "if refcount == 1" optimization for var assignment.
      That would make Arraymancer depend on the refcounting GC; it would be much better to use destructors and the future =move and =sink operators (see #150) for that.
  • Copy-on-write will work well with unsafe proc chaining (first proc unsafe = all procs unsafe, #151)
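
To make the assignment matrix above concrete, here is a toy Nim snippet (illustrative only: `Toy` stands in for a Tensor, and plain seq value semantics stand in for the copies that COW would perform lazily):

```nim
# Toy value type standing in for a Tensor; the comments describe the
# observable behaviour the COW rules above are meant to guarantee.
type Toy = object
  data: seq[int]

let a = Toy(data: @[1, 2, 3])
let b = a   # let <- let  ==> can be shallow: neither side can mutate
var c = a   # var <- let  ==> deep: c may be mutated later
let d = c   # let <- var  ==> deep: future mutations of c must not leak into d

c.data[0] = 99
doAssert a.data[0] == 1   # a is unaffected
doAssert d.data[0] == 1   # d is unaffected
```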

Limitations:

  • Implicit result will still require unsafeView and unsafeUnsqueeze for shallow copies. (Obviously, slice mutation does not copy but overwrites in place.)
    This is perfectly reasonable in my opinion.

This supersedes #19.


mratsim commented Nov 21, 2017

Still stuck on assignment overloading: nim-lang/Nim#6348 and nim-lang/Nim#6786


mratsim commented Nov 22, 2017

Outcome of my research.

A tentative copy-on-write object to be used in a single-threaded context or a master thread + slave thread managed by OpenMP that only works on already allocated memory:

Forum thread.

{.experimental.}

type
  Storage[T] = ref object
    refcount: int
    data: seq[T]
    # lock / guard ?

  CowObject[T] = object
    metadata: int
    storage: Storage[T]

proc detach[T](c: var CowObject[T]) =
  # Note: this only works without races if only the main thread can access this.
  # Also, the refcount is only incremented on assignment; slices do not increment it.
  if c.storage.refcount == 1:
    return

  let old_store = c.storage
  var fresh_store: Storage[T]
  new fresh_store

  fresh_store.refcount = 1
  deepCopy(fresh_store.data, old_store.data)

  c.storage = fresh_store
  dec old_store.refcount

proc `=`[T](dst: var CowObject[T]; src: CowObject[T]) =
  inc src.storage.refcount
  system.`=`(dst, src)

proc `=destroy`[T](c: CowObject[T]) =
  # Note from the Nim manual: destructors are tied to a variable
  # and will not trigger for, say, slices.
  dec c.storage.refcount

proc toCowObj[T](s: varargs[T]): CowObject[T] {.noInit.} =
  result.metadata = 1337
  new result.storage
  result.storage.data = @s

proc `[]`[T](c: CowObject[T], idx: int): T =
  c.storage.data[idx]

proc `[]`[T](c: var CowObject[T], idx: int): var T =
  detach c
  c.storage.data[idx]

proc `[]=`[T](c: var CowObject[T], idx: int, val: T) =
  detach c
  c.storage.data[idx] = val


proc main() =
  let a = toCowObj(1, 2, 3, 4, 5)
  let b = a
  var c = a
  let d = c
  var e = c
  let f = e
  let g = f
  
  c[1] += 10
  e[2] = 100
  
  
  echo "\n\nMemory location"
  echo "a: [1, 2, 3, 4, 5]: " & $a.repr
  echo "b: [1, 2, 3, 4, 5]: " & $b.repr
  echo "c: [1, 12, 3, 4, 5]: " & $c.repr
  echo "d: [1, 2, 3, 4, 5]: " & $d.repr
  echo "e: [1, 2, 100, 4, 5]: " & $e.repr
  echo "f: [1, 2, 3, 4, 5]: " & $f.repr
  echo "g: [1, 2, 3, 4, 5]: " & $g.repr
  
  echo "\n\n"
  echo "a: [1, 2, 3, 4, 5]: " & $a
  echo "b: [1, 2, 3, 4, 5]: " & $b
  echo "c: [1, 12, 3, 4, 5]: " & $c
  echo "d: [1, 2, 3, 4, 5]: " & $d
  echo "e: [1, 2, 100, 4, 5]: " & $e
  echo "f: [1, 2, 3, 4, 5]: " & $f
  echo "g: [1, 2, 3, 4, 5]: " & $g

main()

Concurrent thread-safe version

Currently there is no plan to make this thread-safe with different Tensors in different threads sharing the same memory. First of all, Nim seqs and the GC are thread-local.

In case this is needed in the future, here is some research around copy-on-write, concurrency, locks, and lock-free or wait-free data structures.

Nim

Nim offers several intrinsics by default:

  • proc atomicInc(memLoc: var int; x: int = 1): int {..}
  • proc atomicDec(memLoc: var int; x: int = 1): int {..}
  • proc addAndFetch(p: ptr int; val: int): int {..}
  • proc cas[T: bool | int | ptr](p: ptr T; oldValue, newValue: T): bool {..}
  • proc cpuRelax()

Undocumented intrinsics in atomics.nim

The locks module, with a very nice "withLock" template.

And several lock examples in the manual
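
A rough sketch of how these intrinsics could make the refcount thread-safe (untested research code, not Arraymancer's implementation; `SharedStorage`, `retain` and `release` are illustrative names):

```nim
# Hypothetical atomic reference counting for a storage block shared
# across threads. atomicInc/atomicDec return the new counter value.
type SharedStorage = ptr object
  refcount: int
  # payload would live here, allocated with allocShared

proc retain(s: SharedStorage) =
  discard atomicInc(s.refcount)

proc release(s: SharedStorage) =
  # Whichever thread drops the count to zero frees the block.
  if atomicDec(s.refcount) == 0:
    deallocShared(s)
```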

Warnings - lock-free is not the solution to everything

Method                           Time (ms)
One Thread                             300
One Thread with Memory Barrier       4,700
One Thread with CAS                  5,700
Two Threads with CAS                18,000
One Thread with Lock                10,000
Two Threads with Lock              118,000

Benchmarks

  • Synchrobench
    • arrays, binary trees, hash tables, linked lists, queues and skip lists that are synchronized with copy-on-write, locks, read-modify-write, read-copy-update and transactional memory.

  • Reference-counting
  • Atomics
  • R/W lock (many readers or one writer)
  • Read-copy-update (RCU)
  • Copy on write
  • Software Transactional Memory


mratsim commented Nov 23, 2017

WIP in #159

Further research


mratsim commented Nov 24, 2017

After tinkering for a few hours, I have definitive proof that copy-on-write does not work cleanly for a neural-net library, so I will use reference semantics by default and require cloning for deep copies.

  • Increasing and decreasing the refcounter in tight loops is quite costly. I didn't benchmark properly on my full bench suite, but my incRef/decRef procs were taking 1-3% of test_nnp_loss (sigmoid_cross_entropy and softmax_cross_entropy).

  • Bis. A potential idea: instead of having a counter, just have a boolean "shared". When doing an assignment, set shared to true; when doing =sink, do not change it. --> when "shared" is true, mutation copies. This avoids keeping track of the refcounter.

  • Except that at the neural network level, Tensors will be referred to by a graph and will always be "shared", defeating the whole purpose of copy-on-write. Even refcounting doesn't work because each Variable holds a reference to the same tape, which holds a reference to all Variables and Tensors --> a.tape = b.tape triggers refcounting on the whole graph.
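
The "shared" boolean idea from the second bullet could be sketched like this (illustrative only, with a toy `MiniTensor` type; as the last bullet explains, it still breaks down once a graph keeps everything permanently "shared"):

```nim
{.experimental.}

# Sketch of the "shared flag" alternative to refcounting:
# assignment marks the storage as shared; mutation copies iff shared.
type
  Buf = ref object
    shared: bool
    data: seq[int]
  MiniTensor = object
    buf: Buf

proc `=`(dst: var MiniTensor, src: MiniTensor) =
  src.buf.shared = true            # any assignment marks the storage shared
  system.`=`(dst, src)

proc `[]=`(t: var MiniTensor, i, v: int) =
  if t.buf.shared:                 # copy-on-write: detach before mutating
    t.buf = Buf(shared: false, data: t.buf.data)
  t.buf.data[i] = v
```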

Last things

  • implicit sharing makes it hard to know whether Arraymancer will implicitly copy or implicitly share, meaning users will have to think about it whenever it matters.
  • to provide "unsafe" semantics with COW tensors, the syntax is system.`=`(dest, src), which is also unwieldy (with deep copy, just use unsafeView; with ref semantics, just use clone)

Edit: summary I did for Nim forum

  1. Tensors wrapped in containers: in neural networks, you create a tensor and then wrap it in a container that will be used in a graph that keeps track of all operations applied to it.
import ../src/arraymancer, sequtils

let ctx = newContext Tensor[int] # Context that will track operations applied on the tensor to compute gradient in the future

let W = ctx.variable toSeq(1..8).toTensor.reshape(2,4) # What is the refcount? --> it's 0

let x = toSeq(11..22).toTensor.reshape(4,3)
let X = ctx.variable x # What is the refcount? it's 1 until x goes out of scope

The refcount and the logic required become hard to follow and will probably lead to an over-engineered solution (or test suite).

  2. Predictability: when everything deep-copies or everything shallow-copies, everything is easier to reason about. If there is a perf or a sharing issue, just add a shallow copy or a clone and you're done.

  3. Workaroundability: since copy-on-write must overload assignment, if you want to ensure a shallow copy, for example, you have to use:

let foo = toCowObj(1, 2, 3, 4, 5)
var bar: CowObject[int]
system.`=`(bar, foo)


mratsim commented Nov 25, 2017

Closed by #160
