Spark integration, and examples #28
Thanks for your interest in the project! Currently the parameter server runs stand-alone, outside of Spark, which means you'd have to run the master and servers on a cluster as separate Java processes from Spark (the scripts in the […]).
This project is part of my ongoing Master's thesis. I currently have an implementation of LDA that uses this parameter server and Spark, and that scales up to 2TB of data and thousands of topics on a moderate computing cluster. I will open source that implementation within the next month or so. I will also write more extensive documentation going into more detail on how to set up a cluster, some code examples, and some tuning tips with regard to ExecutionContexts, timeouts, etc. Due to other deadlines regarding my thesis I haven't found the time for this yet.
Ok, thanks. I'd like to see if I can find some time to investigate some implementations on top of this. Do you know how it compares to https://github.com/dmlc/ps-lite in terms of performance, etc.? In terms of ease of integration with frameworks like Spark, something based on Scala/Akka seems much nicer.
I currently have not compared the performance to ps-lite (or any other parameter server, for that matter), but it would be a very interesting thing to measure. If I had to guess I'd say that ps-lite is much faster, especially given the very early alpha state and experimental nature of Glint compared to the very mature implementation of ps-lite. My main goal is indeed to have a parameter server that is very easily integrated with Spark. So far it seems to work well in my practical case (LDA with collapsed Gibbs sampling), but some raw numbers like updates/sec, requests/sec, etc. would be very interesting. I'll keep you updated here when I have some real measurements. Additionally, Glint has no regard for fault tolerance (unlike ps-lite, which offers instantaneous failover), so if a server goes down its data is lost. This is something I definitely wish to address in the future, but it is at the moment outside the scope of the project.
I've started working on some POCs for Spark integration, starting with […]
That sounds great! 👍 I have just now open sourced the LDA implementation here, so you could take a look at that. It is based on some state-of-the-art LDA research, and the many low-level optimizations, caches and buffers make the code base a bit hard to follow. I can give you some pointers: the construction of the count-table matrix for the collapsed Gibbs sampler happens here.
The solver uses Spark to map partitions of our RDD to resampled partitions (effectively performing one LDA iteration) here.
The […] We make extensive use of buffers for performance reasons, so this can obfuscate the code quite a bit. The locks that you see in the code act as a back-pressure mechanism that limits the number of open requests to the parameter servers. This is necessary because the sampler is so fast that it could easily flood the parameter server with asynchronous requests. I will be finished with my thesis in about two weeks, after which I'll have some more time to create some minimal examples that are much easier to read and understand. In the meantime I hope this helps! :-)
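The back-pressure mechanism described in that comment (locks limiting the number of open requests) can be sketched with a counting semaphore wrapped around asynchronous pushes. This is only an illustration using plain Scala Futures, not the actual Glint code; `limitedPush` and the limit of 16 are made up for the example:

```scala
import java.util.concurrent.Semaphore
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Allow at most 16 outstanding asynchronous requests at any time.
val maxOpenRequests = new Semaphore(16)

// Wrap an asynchronous push so the caller blocks once too many
// requests are already in flight, releasing a permit on completion.
def limitedPush(push: () => Future[Boolean]): Future[Boolean] = {
  maxOpenRequests.acquire()
  val result = push()
  result.onComplete(_ => maxOpenRequests.release())
  result
}

// Example: 100 simulated pushes, never more than 16 in flight at once.
val futures = (1 to 100).map { _ => limitedPush(() => Future { true }) }
val ok = Await.result(Future.sequence(futures), 10.seconds).forall(identity)
println(ok)
```

In a real sampler the `push` closure would be a request to the parameter server; the semaphore simply caps concurrency without serializing the requests.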
Great, I will take a deeper look at that code. The general approach is […]
Also, another hint that could be helpful: in my Glint configuration I increased the Akka frame size and heartbeat timeouts. This allows me to send much larger pull and push requests. My configuration file looks like this:
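The configuration file itself did not survive in this copy of the thread. As a rough illustration of the kind of settings described (a larger Akka frame size and more lenient heartbeat timeouts), a Typesafe Config (HOCON) snippet might look like the following; the keys are standard Akka remoting settings, but the values here are placeholders rather than the author's:

```hocon
akka.remote {
  // Allow larger pull/push messages than the Akka default
  netty.tcp.maximum-frame-size = 10 MiB

  // More lenient heartbeats so long-running requests don't
  // get the connection quarantined
  transport-failure-detector {
    heartbeat-interval = 30 s
    acceptable-heartbeat-pause = 120 s
  }
}
```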
I've added slightly more comprehensive documentation at http://rjagerman.github.io/glint/. This might be of use to you, as it includes a short section on Spark integration and serialization (see the getting-started guide). I'm also currently working on getting some benchmarks for common tasks (e.g. logistic regression, SVMs, regression, all-reduce) so we can compare Glint against Spark and other frameworks. In time this will produce some more example code for these common tasks.
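For readers who want a concrete picture of the pull/compute/push pattern that comes up throughout this thread (the kind of loop one would run inside Spark's mapPartitions), here is a minimal runnable sketch. The `ParamStore` class below is a hypothetical in-memory stand-in for a parameter server client, not Glint's API, so the example runs without a cluster; the learning rate and data are made up:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for a parameter server holding a weight vector.
class ParamStore {
  private val weights = TrieMap.empty[Int, Double]
  def pull(keys: Seq[Int]): Seq[Double] =
    keys.map(k => weights.getOrElse(k, 0.0))
  def push(keys: Seq[Int], deltas: Seq[Double]): Unit =
    keys.zip(deltas).foreach { case (k, d) =>
      weights.put(k, weights.getOrElse(k, 0.0) + d)
    }
}

// The shape of one partition's work in e.g. gradient descent:
// pull the weights you need, compute local updates, push the deltas.
val store = new ParamStore
val partition = Seq((0, 1.0), (1, 2.0), (0, 3.0)) // (featureIndex, gradient)

val keys = partition.map(_._1).distinct
val current = store.pull(keys) // a real algorithm would use these weights

// Accumulate gradients per key and scale by a made-up learning rate.
val updates = partition.groupBy(_._1).map { case (k, grads) =>
  (k, -0.1 * grads.map(_._2).sum)
}.toSeq
store.push(updates.map(_._1), updates.map(_._2))

println(store.pull(Seq(0, 1)))
```

The point of the pattern is that each partition talks to the shared store directly instead of funneling everything through a Spark aggregate, which is what makes asynchronous algorithms like collapsed Gibbs sampling practical.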
Thanks Rolf! I have been swamped with Spark 2.0 work, but I'd like to get […]
Hi @rjagerman, after looking at your code I think it's clear how to use your API inside Spark's map or mapPartitions operations, but I find it a little hard to use the API unless I avoid Spark's aggregate, for example when I want to implement gradient descent. Thanks!
Hi @rjagerman, this project looks very interesting and I'd like to explore it a bit more. You mention Spark integration as a goal, has there been work done on that? What about example algorithms using this parameter server?