[Backend] Chainer support #720

Closed · wants to merge 23 commits
Conversation

BarclayII (Collaborator) commented Jul 19, 2019

Description

Chainer support as requested in #668. It turned out to be a lot more effort than I expected.

In particular, zero-copy between Chainer variables and DGL tensors does not work well. Maybe I missed the correct way to implement this, but here are my issues:

  • Chainer CPU variables directly wrap numpy ndarrays, and I could not make zero-copying between numpy and DGL tensors work.
  • As a result, I had to resort to old-style Chainer custom layers (a rough sketch follows below). This is not a big deal: the primary benefit of new-style layers is second-order gradients, which we can live without for now since we fuse the gradient operators anyway.
  • Another consequence is that I could not get shared memory to work: DGL frees the node features and edge features in shared memory right after zerocopy_from_dgl_ndarray.

The issues are commented in python/dgl/backend/chainer/tensor.py.
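
For readers unfamiliar with the distinction, here is a minimal sketch of the old-style chainer.Function pattern the backend falls back to. This is a toy op, not the PR's actual code: forward and backward operate directly on numpy/cupy ndarrays rather than Variables, which is why second-order gradients are unavailable.

```python
import numpy as np
import chainer
from chainer.backend import get_array_module


class ScaleByTwo(chainer.Function):
    """Toy old-style op: y = 2 * x (illustrative only)."""

    def forward(self, inputs):
        # Old-style Functions receive a tuple of raw ndarrays, not Variables.
        x, = inputs
        xp = get_array_module(x)  # numpy or cupy, depending on the input
        return xp.multiply(x, 2),

    def backward(self, inputs, grad_outputs):
        # Gradients are likewise computed on raw ndarrays, so the resulting
        # graph cannot be differentiated a second time (no double backprop).
        gy, = grad_outputs
        xp = get_array_module(gy)
        return xp.multiply(gy, 2),


x = chainer.Variable(np.ones((3,), dtype=np.float32))
y = ScaleByTwo()(x)
y.grad = np.ones_like(y.array)
y.backward()
print(x.grad)  # [2. 2. 2.]
```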

Other changes include (lots of) rewrites in unit tests, replacing tensor operators with backend functions and numpy functions.

@cmpute I may include one or two examples for Chainer in the (potentially far) future, but it would be greatly appreciated if you could try it out after this PR is merged and maybe contribute an example.

@yzh119 I suspect that this PR would be merged earlier than your pooling PR, so please ping me if you need any help to implement the Chainer equivalents.

Minor change

Added an optional return_edge_ids argument to the DGLGraph.adjacency_matrix_scipy() method, defaulting to True. If False, all nonzero entries are 1 instead of edge IDs.

The reason is that when edge IDs are returned, the entry for edge 0 is stored as 0, which can be confusing in a sparse matrix.
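
A quick usage sketch of the new flag (illustrative only; the construction calls below assume the DGLGraph API of that time):

```python
import dgl

g = dgl.DGLGraph()
g.add_nodes(3)
g.add_edges([0, 1, 2], [1, 2, 0])

# Default behavior: nonzero entries store edge IDs, so the entry for
# edge 0 is stored as 0 and can look like a missing edge.
adj_with_ids = g.adjacency_matrix_scipy()

# New flag: every existing edge is marked with 1 instead.
adj_with_ones = g.adjacency_matrix_scipy(return_edge_ids=False)
```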

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referenced in this PR
  • New CI image is pushed

yzh119 (Member) commented Jul 19, 2019

I understand, btw you are crazy...

cmpute commented Jul 19, 2019

Thanks for your wonderful work! I will see whether I can test it.

@@ -866,8 +866,8 @@ def zerocopy_from_dgl_ndarray(input):
# kernels (see kernel.py), and plug into tensor framework using custom op
# extensions.

def binary_reduce(reducer, binary_op, graph, lhs, rhs, lhs_data, rhs_data,
out_size, lhs_map, rhs_map, out_map):
def binary_reduce(reducer, op, G, A_target, B_target, A, B,

Member:
I think the argument names are the same as the C++ backend; should we refactor the C++ impls too?

Collaborator (author):
It could be another separate PR I think... Lots of stuff to change there.

return grad.reshape(shape)


def sync():

Member:
It looks like an operation unique to Chainer; should we incorporate this?

Collaborator (author):
You mean MXNet?
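
For context, a hedged guess at what such a backend-level sync could look like for Chainer with CuPy; this is not necessarily what the PR implements. MXNet exposes a similar barrier via mx.nd.waitall().

```python
def sync():
    """Block until all pending GPU work has finished (illustrative sketch)."""
    try:
        import cupy
        cupy.cuda.Stream.null.synchronize()
    except ImportError:
        # CPU-only Chainer runs on numpy, which is synchronous already.
        pass
```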

import numpy as np
from chainer.backend import get_array_module as _get_array_module
try:
import cupy

Member:
Would this affect the requirements of DGL?

Collaborator (author):
Yes; Chainer can be installed with or without CuPy.
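
In other words, CuPy stays an optional dependency. A sketch of the usual guard pattern (the HAS_CUPY flag and asnumpy helper names are illustrative, not DGL's actual code):

```python
import numpy as np

try:
    import cupy
    HAS_CUPY = True
except ImportError:
    cupy = None
    HAS_CUPY = False


def asnumpy(x):
    # Illustrative helper: move a cupy array back to host memory if needed.
    if HAS_CUPY and isinstance(x, cupy.ndarray):
        return cupy.asnumpy(x)
    return np.asarray(x)
```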

@@ -3045,7 +3045,7 @@ def edge_subgraph(self, edges, preserve_nodes=False):
sgi = self._graph.edge_subgraph(induced_edges, preserve_nodes=preserve_nodes)
return subgraph.DGLSubGraph(self, sgi)

def adjacency_matrix_scipy(self, transpose=False, fmt='csr'):
def adjacency_matrix_scipy(self, transpose=False, fmt='csr', return_edge_ids=True):

Member:
I agree, and I think it should default to False.

Collaborator (author):
Do we have existing code that depends on this behavior?

Member:
My transformer depends on this behavior, and I will refactor the code.
Some unit tests may depend on this function, but I guess they expect it to return all 1s.

@@ -562,7 +562,7 @@ def edge_subgraph(self, e, preserve_nodes=False):
return SubgraphIndex(gidx, self, induced_nodes, e)

@utils.cached_member(cache='_cache', prefix='scipy_adj')
def adjacency_matrix_scipy(self, transpose, fmt):
def adjacency_matrix_scipy(self, transpose, fmt, return_edge_ids=True):

Member:
The same as in graph.py.

tests/backend/chainer/__init__.py (resolved review comment)
@@ -124,6 +127,8 @@ def _pfc(x):
v = F.tensor([3, 4, 5])
assert _pfc(g.edges[u, v].data['l']) == [1., 1., 1.]

@unittest.skipIf(dgl.env.get_backend() == 'chainer',
'Chainer does not support head gradients')

Member:
What do you mean by head gradients?

BarclayII (Collaborator, author) commented Jul 22, 2019:
tensor.backward(blah)
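
For readers following along: "head gradients" means seeding backward() with an explicit output gradient rather than starting from a scalar loss. A PyTorch sketch of the pattern the skipped tests exercise (values are illustrative):

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2
head_grad = torch.tensor([0.1, 1.0, 10.0])  # caller-supplied dL/dy
y.backward(head_grad)
print(x.grad)  # tensor([ 0.2000,  2.0000, 20.0000])
```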

@@ -69,7 +74,7 @@ def _test(red):
g.update_all(fn.copy_src(src='u', out='m'),
builtin[red](msg='m', out='r1'))
r1 = g.ndata['r1']
F.backward(r1.sum())
F.backward(F.reduce_sum(r1))

Member:
I suppose reduce_sum means taking the sum of all elements?

BarclayII (Collaborator, author) commented Jul 22, 2019:
Yes, and it's only available in the tests.

Member:
OK, then I will refactor my code for the pooling layers following the same naming style.
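
For reference, a hedged sketch of what such a per-backend test helper might look like for the Chainer test backend (the name and location are illustrative, not the PR's exact code):

```python
import chainer.functions as CF


def reduce_sum(x):
    # Sum over all elements of a Variable, yielding a 0-d Variable that
    # can then be passed to F.backward(...) in the tests.
    return CF.sum(x)
```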

@@ -98,7 +98,7 @@ def check_basic(g, nf):
for i in range(1, nf.num_layers):
in_deg = nf.layer_in_degree(i)
out_deg = nf.layer_out_degree(i - 1)
assert F.asnumpy(F.sum(in_deg, 0) == F.sum(out_deg, 0))
assert F.asnumpy(in_deg).sum(0) == F.asnumpy(out_deg).sum(0)

Member:
It seems these two impls are equivalent; why convert to numpy first?

Collaborator (author):
Chainer variables don't support things like a == b.

Member:
I see...
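
A small illustration of the point above (purely illustrative, not from the test suite): comparing the underlying arrays sidesteps the missing element-wise == on chainer.Variable.

```python
import numpy as np
import chainer

a = chainer.Variable(np.array([1.0, 2.0, 3.0], dtype=np.float32))
b = chainer.Variable(np.array([1.0, 2.0, 3.0], dtype=np.float32))

# `a == b` is not supported on Variables, so compare the raw ndarrays.
assert np.array_equal(a.array, b.array)
```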

BarclayII changed the title from [WIP][Backend] Chainer support to [Backend] Chainer support on Jul 22, 2019
# Currently I could not unify the NumPy-DLPack and CuPy-DLPack interaction
# logic, so I resorted to old-style Chainer functions.
# The new array backend ChainerX currently does not support DLPack.
class BinaryReduce(chainer.Function):

Member:
@ylfdq1118 I highly suggest we further separate the logic into framework-specific and framework-agnostic parts. For example, I feel what we need is an adaptor class backend.Function that unifies torch.autograd.Function, mxnet.autograd.Function, and chainer.Function. By doing so, we could move this BinaryReduce class outside of the backend; otherwise it's just too much copy-pasting.
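
A rough sketch of the adaptor idea (entirely hypothetical, not part of this PR): a shared interface that each backend would bridge to its native autograd Function, so kernels like BinaryReduce are written once.

```python
import abc


class Function(abc.ABC):
    """Hypothetical frame-agnostic autograd interface.

    Each backend would provide a thin subclass bridging this to
    torch.autograd.Function, mxnet.autograd.Function, or chainer.Function.
    """

    @abc.abstractmethod
    def forward(self, *inputs):
        """Compute outputs from raw backend arrays."""

    @abc.abstractmethod
    def backward(self, *grad_outputs):
        """Compute input gradients from output gradients."""
```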

jermainewang (Member):

  • Chainer CPU variables directly wrap around numpy NDArrays, and I could not make zerocopying between numpy and DGL tensors work.

I investigated some solutions. Although numpy has an FFI for converting from/to ctypes arrays, it does not handle memory ownership the way DLPack does. Thus, in order to support zero-copy between numpy and DLPack, we need to use numpy's C library (see the PyTorch code for usage). It adds an extra dependency during compilation, but it looks manageable. What do you think? Shall we take that route?
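
To make the ownership gap concrete, a small illustration (not DGL code): numpy's ctypes interop can build a zero-copy view over a raw pointer, but the view carries no deleter, so nothing ties its lifetime to the producer's buffer.

```python
import ctypes
import numpy as np

buf = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)   # stand-in for a DGL-owned buffer
ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_float))

view = np.ctypeslib.as_array(ptr, shape=(4,))     # zero-copy numpy view
view[0] = 42.0
assert buf[0] == 42.0                             # same memory

# Unlike a DLPack capsule, `view` holds no reference to (or deleter for) the
# producer's buffer, so the producer has no safe point to release it.
```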

  • Another consequence is that I could not get shared memory to work: DGL frees the node features and edge features in shared memory right after zerocopy_from_dgl_ndarray.

Could you be more specific about this issue? One side note: @zheng-da and I had a discussion before, and we both think that shared memory should be treated as one of the device types. We also raised an issue (dmlc/dlpack#41) in DLPack to collect feedback.

BarclayII (Collaborator, author) commented Jul 25, 2019

I investigated some solutions. Although numpy has an FFI for converting from/to ctypes arrays, it does not handle memory ownership the way DLPack does. Thus, in order to support zero-copy between numpy and DLPack, we need to use numpy's C library (see the PyTorch code for usage). It adds an extra dependency during compilation, but it looks manageable. What do you think? Shall we take that route?

Sounds fine to me. By asking "Shall we take that route?" do you mean you have other solutions available?

Could you be more specific about this issue? One side note: @zheng-da and I had a discussion before, and we both think that shared memory should be treated as one of the device types. We also raised an issue (dmlc/dlpack#41) in DLPack to collect feedback.

The issue is in _move_data_to_shared_mem_array in python/dgl/contrib/graph_store.py.
The function is supposed to return a framework-specific view of the DGL array in shared memory, meaning that the object it returns is, say, a PyTorch tensor whose contents reside in shared memory. However, since zero-copying from DGL arrays to numpy does not work yet, the resulting Chainer variable would have its memory independent of the DGL array in shared memory, so after returning, the DGL array would be recycled.

As for the device type, I think a better generalization would be having file as a device type, because technically we can store tensors on any storage as long as it supports random access (e.g. shared memory, a file on disk (usually impractical), or other devices that support direct (p)read(2) and (p)write(2)). Just my two cents though.

BarclayII (Collaborator, author):
Closing this since the Chainer team is migrating to PyTorch.

BarclayII closed this on Dec 5, 2019.