The "device memory" concept for multicore
Treats the C function stack of the monolithic update step as a "device memory". There is no explicit synchronization. Instead, we implement "update on host" where needed: an update that would affect other tasks is applied directly to the host's value of the tensor cell (for example, by adding to it) and not to the task-local copy, which might be stale.
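The difference between writing back a task-local copy and updating on host can be sketched in C as follows. This is only an illustration under stated assumptions: the names tensor, N_CELLS, add_via_local_copy, add_on_host and run_two_tasks are hypothetical and do not come from the actual code, and the two tasks are run one after the other, so the sketch says nothing about how concurrent writes to the host are handled.

```c
#include <assert.h>
#include <stddef.h>

#define N_CELLS 2u  /* hypothetical tensor size */

/* Hypothetical tensor: a small array of cells. */
typedef struct {
    double cell[N_CELLS];
} tensor;

/* Stale write-back: the task modifies its copy on the C stack (the
 * "device memory") and stores that value back to the host. Any change
 * made to the host since the snapshot was taken is overwritten. */
static void add_via_local_copy(tensor *host, const tensor *snapshot,
                               size_t i, double delta)
{
    assert(i < N_CELLS);
    tensor local = *snapshot;       /* task-local copy, may be stale */
    local.cell[i] += delta;
    host->cell[i] = local.cell[i];
}

/* "Update on host": add to the host's value directly, so the result
 * does not depend on how old the task-local copy is. */
static void add_on_host(tensor *host, size_t i, double delta)
{
    assert(i < N_CELLS);
    host->cell[i] += delta;
}

/* Two tasks start from the same snapshot and each add 1.0 to cell 0.
 * Returns the host's final value of cell 0. */
static double run_two_tasks(int update_on_host)
{
    tensor host = { { 10.0, 0.0 } };
    tensor snapshot = host;         /* both tasks copied this state */

    for (int task = 0; task < 2; ++task) {
        if (update_on_host)
            add_on_host(&host, 0, 1.0);
        else
            add_via_local_copy(&host, &snapshot, 0, 1.0);
    }
    return host.cell[0];
}
```

In this sketch, run_two_tasks(1) keeps both contributions, whereas run_two_tasks(0) loses one of them because the second task writes back a value computed from the stale snapshot.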