Distributed tensorflow on Mesos #1996
/cc @mesosphere
We (@douban) are developing a lightweight mesos framework (named `tfmesos`) for running distributed TensorFlow on Mesos. There's still a dependency conflict on `protobuf`: the Mesos Python bindings require proto2 while TensorFlow requires proto3.
@mckelvin I'm interested in the implementation. Is it possible to share with the rest of us the possibilities this opens up?
@mckelvin Yes, this issue was tagged "contributions welcome", so if you have something to share it could probably be turned into a pull request soon.
@mrry What do you think? Do you have any feedback on how to handle this so that it would be easier to integrate with the TF repository via a PR?
I'm agnostic as to whether this would be better as a standalone repository or integrated into the TensorFlow tree somewhere. Hopefully the integration can be relatively simple and exist as a set of Python scripts somewhere (though I don't have enough experience with Mesos to say). There might be some changes required in the core, so I'll be watching this thread and am prepared to respond to feature requests.
@mrry Some initial work is at https://github.com/douban/tfmesos
@bhack You got it! I've pushed the initial code to https://github.com/douban/tfmesos as well as https://hub.docker.com/r/tfmesos/tfmesos/ (@windreamer is the main developer of tfmesos). If you have mesos+docker installed, you can run the demo now. Before you start, you should pull the tfmesos docker image on the mesos server and slaves.

Notice: there are still some unsolved issues in tfmesos and it is not production ready.

```python
# coding: utf-8
# ~/demo.py
import sys

import tensorflow as tf
from tfmesos import cluster


def main(argv):
    jobs_def = [
        {"name": "ps", "num": 2},
        {"name": "worker", "num": 2},
    ]
    mesos_master = argv[1]
    with cluster(jobs_def, master=mesos_master, quiet=False) as targets:
        with tf.device('/job:ps/task:0'):
            a = tf.constant(10)
        with tf.device('/job:ps/task:1'):
            b = tf.constant(32)
        with tf.device('/job:worker/task:1'):
            op = a + b
        with tf.Session(targets['/job:worker/task:0']) as sess:
            print(sess.run(op))


if __name__ == '__main__':
    main(sys.argv)
```
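For readers puzzling over `targets` in the demo: it is indexed by TF device strings and the looked-up value is handed to `tf.Session`, so its shape is presumably a mapping from job/task strings to grpc endpoints. A toy illustration of that shape (the addresses and port scheme are invented, not what tfmesos actually returns):

```python
# Build a targets-like dict: one invented grpc endpoint per job/task.
jobs_def = [
    {"name": "ps", "num": 2},
    {"name": "worker", "num": 2},
]


def fake_targets(jobs_def, base_port=2222):
    """Map '/job:<name>/task:<i>' strings to made-up grpc endpoints."""
    targets = {}
    port = base_port
    for job in jobs_def:
        for task in range(job["num"]):
            targets["/job:%s/task:%d" % (job["name"], task)] = (
                "grpc://10.0.0.1:%d" % port
            )
            port += 1
    return targets


targets = fake_targets(jobs_def)
print(targets["/job:worker/task:0"])  # → grpc://10.0.0.1:2224
```

Passing such an endpoint string to `tf.Session(target)` is how a client attaches to a remote worker in distributed TensorFlow.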
/cc @mtamburrano
@mckelvin I think that the docker image could be based on https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/docker/README.md.
the main problem is GPU support inside the docker container
@windreamer See also NVIDIA/nvidia-docker#60
Related news: DC/OS is open source now
We are trying to test tensorflow on mesos with GPU. If anybody is interested, see douban/tfmesos#3. We also need to think about smarter automatic device placement in cluster scenarios such as mesos. See also #2126
Hi, I'm a new Mesos contributor and I have a little experience developing Mesos frameworks. While studying the possibility of running TensorFlow as a native framework I found this thread. ^.^ I saw that you're talking about GPU resources in Mesos.
@jvanz We are trying; the experimental work is in douban/tfmesos. But before all of that, I think this JIRA should be fixed first: https://issues.apache.org/jira/browse/MESOS-5186
@windreamer For issue MESOS-5186 I suggest you open a PR directly at https://github.com/apache/mesos/pulls
@windreamer @girving How do you think this could initially be contributed to TF? I think that if you can create a PR we can attract a broader user base.
Sure, it would be an honour to contribute this small code base to TF. However, I think TF prefers k8s over mesos or yarn.
K8s is in-house, but I don't think @mrry is against a Mesos contribution.
by the way,
@bhack @windreamer: We'd be delighted to see TensorFlow working on Mesos (and YARN, and other cluster managers). I don't know much about how the Mesos side works, though.
Ok, we can run a distributed TensorFlow job on Mesos now, I think, although some issues remain.
As we have already discussed, we need to make a decision given the new Mesos container image support: http://mesos.apache.org/documentation/latest/container-image/
@bhack yes, Apache Mesos 1.0 introduces a lot of new features, and I need more time to decide which is the best way.
and unfortunately https://issues.apache.org/jira/browse/MESOS-5186 is still unresolved...
For what it's worth, the only reliable / supported path for GPUs in Mesos today is the unified containerizer. Support for the docker containerizer is currently under development.

Regarding MESOS-5186, what features of proto3 do you require?
@klueska proto3 support is vital, or we will end up with a version conflict between TensorFlow's protobuf (proto3) and the Mesos Python bindings (proto2).

I would like to try the mesos containerizer way of launching the image, but it is still a bit of a mess for me to figure out how to enable this together with GPU support. I need more time, and any suggestion and contribution is definitely welcome!
Also is there a DC/OS plan?
@windreamer I think the reason the JIRA was probably never resolved is that it's not clear the proposed fix is the right one. It may happen to work for your particular use case, but many of mesos's protobufs aren't written in a way that is compatible with proto3 clients. For example, many of mesos's protobufs still contain `required` fields, which proto3 has dropped.

Regarding problems figuring out how to enable GPU support -- I can help with that. We basically mimic the functionality of nvidia-docker so that anything that runs in nvidia-docker should now be able to run in mesos as well. Consider the following example:
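A minimal sketch of launching a GPU-capable agent along these lines; the master address, work dir, and resource counts are made-up placeholders, while the isolation flag names follow the Mesos Nvidia GPU support documentation:

```python
# Assemble a mesos-agent command line with the Nvidia GPU isolators enabled.
# Values such as the master address are illustrative assumptions only.
agent_cmd = [
    "mesos-agent",
    "--master=10.0.0.1:5050",                 # assumed master address
    "--work_dir=/var/lib/mesos",              # assumed work dir
    # Both isolator components below are needed for GPU support:
    "--isolation=cgroups/devices,gpu/nvidia",
    # Advertise one GPU (plus CPU/memory) to the master:
    "--resources=gpus:1;cpus:4;mem:8192",
]

print(" ".join(agent_cmd))
# subprocess.run(agent_cmd) would actually start the agent
# (requires a Mesos installation, so it is left commented out here).
```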
The flags of note here are the `--isolation` components `cgroups/devices` and `gpu/nvidia`.

When launching an agent, both of these isolation components must be set: `cgroups/devices` restricts access to the GPU devices to only those containers that have been granted them, and `gpu/nvidia` injects the necessary Nvidia libraries and volumes into the container.

In addition to these agent isolation flags, Mesos requires frameworks that want to consume GPU resources to have the GPU_RESOURCES framework capability set. Without this, the master will not send an offer to a framework if it contains GPUs. The choice to make frameworks explicitly opt in to this GPU_RESOURCES capability was to keep legacy frameworks from accidentally consuming a bunch of non-GPU resources on any GPU-capable machines in a cluster (and thus blocking your GPU jobs from running). It's not that big a deal if all of your nodes have GPUs, but in a mixed-node environment it can be a big problem.

Finally, the agent has to advertise its GPUs to the master, e.g. via its `--resources` flag.

Hopefully you can extrapolate things from there. Let me know if you have any questions.
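A rough illustration of the opt-in on the framework side, shown here as a plain dict whose field names mirror Mesos' FrameworkInfo protobuf (the user and framework name are invented, and the real scheduler registration machinery is omitted):

```python
# FrameworkInfo-like structure with the GPU_RESOURCES capability set,
# so the master will include GPUs in the offers it sends this framework.
framework_info = {
    "user": "tfuser",                  # assumed unix user
    "name": "tfmesos-demo",            # assumed framework name
    "capabilities": [
        {"type": "GPU_RESOURCES"},     # explicit opt-in to GPU offers
    ],
}

has_gpu_capability = any(
    c["type"] == "GPU_RESOURCES" for c in framework_info["capabilities"]
)
print(has_gpu_capability)  # → True
```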
I played a little bit with running TensorFlow on Mesos, without GPU support. Regarding the protobuf 2 dependency issue, I removed the protobuf 2 dependency from Mesos by not putting the protobuf 2 path on PYTHONPATH when starting the executor; it will then pick up the protobuf 3 dependency if installed.

To make it production-ready, we may have to handle a few failure cases: master failure, agent failure, executor failure, network partition, message loss, service discovery, etc. So I am looking into other Mesos-based frameworks such as Marathon to see whether they handle all the failure cases for us.
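The PYTHONPATH trick described above might be sketched like this (the egg path and the `protobuf-2` marker are made-up placeholders, not actual Mesos paths):

```python
# Drop any bundled protobuf 2 entries from PYTHONPATH before spawning the
# executor, so a system-wide protobuf 3 install is picked up instead.
import os


def strip_proto2(pythonpath, marker="protobuf-2"):
    """Remove every path component containing `marker` from a PYTHONPATH string."""
    return os.pathsep.join(
        p for p in pythonpath.split(os.pathsep) if marker not in p
    )


env = {"PYTHONPATH": os.pathsep.join(
    ["/opt/mesos/lib", "/opt/eggs/protobuf-2.6.1-py2.7.egg"]
)}
env["PYTHONPATH"] = strip_proto2(env["PYTHONPATH"])
print(env["PYTHONPATH"])
```

The cleaned environment would then be passed to whatever spawns the executor process.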
@YuefengZhou Marathon? In the DC/OS flavour?
@bhack I have an experimental Mesos/Marathon cluster set up on my machine. I haven't looked into DC/OS yet, but I guess if it can work on Marathon smoothly with fault tolerance, it is not difficult to switch to a DC/OS setup.
GPU support will be included in Marathon 1.3 (being released in the coming weeks).
@klueska As for proto3: if my understanding is right, protobuf is both a compiler and a runtime library. The proto3 compiler is not backward-compatible with proto2, but once the library is generated, the proto3 runtime is backward-compatible with proto2, at least I believe so. Which compiler is used to build the mesos library is managed by mesos itself; only driver developers (such as us, for our own pymesos) may need to pay attention to this problem. For end users, I believe both proto2 and proto3 are fine. This is why I don't think https://issues.apache.org/jira/browse/MESOS-5186 is a big issue that must be resolved before Mesos 2.0
/cc @vicki-c if interested in this topic.
Why this rapid position change by Google? This was self-assigned by Google just 17 days ago.
Unified containerization support for Apache Mesos 1.0 is merged now: douban/tfmesos#12 (comment)
I accidentally added these tags here and not in the issue I intended to. Sorry.
We have examples running on Marathon in the new repo: github.com/tensorflow/ecosystem. Closing this issue. If you have any changes you want to make, please create pull requests or issues in the new repo.
Thanks a lot to @mckelvin. After we cleaned up the compatibility issues, https://issues.apache.org/jira/browse/MESOS-5186 was committed to master just now.
I know this thread is closed, but I wanted to point out the new release of distributed TensorFlow on DC/OS that we announced today: https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
In the distributed howto: "We are working on tools for launching tasks programmatically, e.g. using a cluster manager like Kubernetes. If there are particular cluster managers for which you'd like to see support, please raise a GitHub issue."
It could be interesting to have CPU and GPU dockerfiles ready for distributed TensorFlow that can run in a scalable way on Mesos (with Marathon) and Kubernetes.
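The launch tooling the howto quote alludes to can be sketched as a function that expands a cluster description into the per-task commands a cluster manager (Marathon, Kubernetes, ...) would submit, using the flag conventions from the distributed TensorFlow howto; `trainer.py` and the host:port pairs are placeholders:

```python
# Expand a {job_name: [host:port, ...]} description into one launch command
# per task. Hosts and the trainer script name are illustrative assumptions.
cluster = {
    "ps": ["10.0.0.1:2222", "10.0.0.2:2222"],
    "worker": ["10.0.0.3:2222", "10.0.0.4:2222"],
}


def task_commands(cluster, script="trainer.py"):
    ps_hosts = ",".join(cluster["ps"])
    worker_hosts = ",".join(cluster["worker"])
    cmds = []
    for job_name, hosts in cluster.items():
        for task_index in range(len(hosts)):
            cmds.append(
                "python %s --ps_hosts=%s --worker_hosts=%s "
                "--job_name=%s --task_index=%d"
                % (script, ps_hosts, worker_hosts, job_name, task_index)
            )
    return cmds


for cmd in task_commands(cluster):
    print(cmd)
```

A cluster manager would then run each command on its assigned machine; the trainer itself would build a `tf.train.ClusterSpec` from the same host lists.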