[Feedback] Initial experience with GraphScope #1466
Replies: 2 comments
-
Thanks for your interest in GraphScope, and thank you for taking the time to give us such long and thoughtful feedback! As a team that has focused on graph computing for many years, we share your feelings about Giraph, as well as some other graph systems: they can be hard for ordinary users to work with. That is what motivated us to develop GraphScope. One of its goals is to make graph computing accessible to more end users, so being easy to set up and develop for has always been among our first considerations. We're glad to hear that the local deployment, the cloud-native design, and the JupyterLab-powered playground earned your compliments! On the second part, we'd also like to share our thoughts with you:
Thank you again, sincerely, for your valuable feedback!
-
Describe the issue
Hello, I had the chance to try this framework for a few hours today. It's very exciting for me and for some of the problems I am researching and starting to solve. I have been looking at different ways to handle graph data for a couple of weeks, and I have found Apache Giraph really difficult to set up and develop for, and PySpark really slow or unfriendly for doing simple filtering on a graph. I wanted to share some of my experience as a new user of GraphScope, in the hope that it can help this tool improve. Please don't take any of this negatively; that isn't my intention at all.
There are a few things I've written below that I think deserve their own issue threads. I didn't go into much detail here because I am planning to submit separate issues; I just wanted to share some first impressions and my own experience getting started with this tool, especially if it helps you improve GraphScope.
General feedback points so far:
The good
Current tools: I discovered this project from this paper. I was excited to see another group trying to solve this issue, and doing it better than existing frameworks. I have had the chance to try both PySpark and Giraph recently for graph processing, and the development experience has been really challenging for me. Right now my company is experimenting ahead of a problem we will want to start solving shortly.
Documentation: I love your docs, and it looks like a lot of effort was dedicated to making them easy for a new user. I was having a very difficult time getting Giraph running, and an even harder time finding documentation and examples for getting a Giraph job running.
Local development: installation is incredibly easy. A single `pip install` for anyone wanting to try this on their local machine is great, and much better than most other frameworks out there.
Kubernetes: I've worked with Kubernetes for a few years at various levels (deploying clusters with scripts, maintaining those clusters, developing services on it, automating DevOps processes for deployment, etc.) and think that it's a great deployment environment and much easier than Hadoop. To be honest, I haven't looked into that area of deployment for GraphScope yet; however, I did share a couple of thoughts below that may be interesting, based on my own prior experiences.
JupyterLab: It was amazing to be able to try GraphScope in 5 mins or less by launching it automatically. I really love this aspect, thanks for making it easy.
Some additional thoughts
Adoption: The first time I read this paper (very quickly), I thought that Alibaba was already using GraphScope in production, and it looked perfect for the problem we want to solve. After re-reading it, I'm under the impression that the production use case involves Spark (PySpark?), Giraph, JanusGraph, and TensorFlow. Now it makes sense why a tool to handle all four of those would be valuable, and would especially help simplify the development experience of an engineer working with graph data. It still isn't very clear whether this project is already being used in production somewhere, and I started wondering whether replacing the existing infrastructure with it is in progress or planned. I have unfortunately seen many projects get abandoned due to changes in business interest, and I'm just wondering if this one has good support from the business.
Documentation: I tried running some of the examples in the docs and hit some compile issues, some due to C++ (I believe) and some others. My session crashed in JupyterLab, and I had a really hard time recovering it because of this line. I had to log out, terminate all of the kernels, and do a few other things to get it running again. I already submitted [BUG] LPA from the tutorial fails to compile #1150, and I will submit the others as soon as I can return to them.
Datasets: I really liked that MAG is available as a starting dataset, but I wasn't able to see some of the properties listed here. Is the provided dataset pre-processed for the demonstration in the docs? From the docs I couldn't tell whether other datasets are available, such as a larger MAG dataset or simply a different one. I have seen other popular graph datasets around; possibly something to consider adding in the future?
GNN example: I ran notebook 10, Revisit classification on citation network on k8s, and noticed that the final accuracy for the trained GNN was 12% (it's quite possible that I misinterpreted it). This isn't a big problem, because I believe the original intention was to show how easy it is to use TensorFlow with graph data, and it really is very easy with this tool. From the perspective of a new user deciding whether to adopt the tool, though, it would be much more compelling if the accuracy were higher; even 65% would be better than a 'probability coin toss'.
Kubernetes: I was personally never a huge fan of Helm. It starts to get really complicated once any business logic is required for maintaining anything past deployment. For example, if there are job failures, you would need an additional custom control plane to handle them. Personally, I think an operator (built around a CRD) fills this use case exactly; I have created one before for automatically managing dynamic routing rules in development environments, and I might be able to provide an example if there's any interest. Is this something you've already considered?
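To illustrate the operator idea above: the core of any Kubernetes operator is a reconcile loop that compares the desired state declared in a custom resource with the observed cluster state and acts on the difference, e.g. restarting failed jobs. Here is a minimal, dependency-free sketch of that pattern in plain Python; all the job names and statuses are hypothetical, not a real GraphScope or Kubernetes API:

```python
# Sketch of the operator/reconcile pattern: compare the desired state
# declared in a (hypothetical) custom resource with the observed cluster
# state, and return the actions needed to converge them.

def reconcile(desired, observed):
    """Return a list of (action, job_name) tuples to converge state."""
    actions = []
    for job, spec in desired.items():
        status = observed.get(job)
        if status is None:
            actions.append(("create", job))       # declared but missing
        elif status == "Failed" and spec.get("restartOnFailure"):
            actions.append(("restart", job))      # the job-failure logic Helm lacks
    for job in observed:
        if job not in desired:
            actions.append(("delete", job))       # running but no longer declared
    return actions

# Desired state, as it might appear in a hypothetical CRD spec.
desired = {
    "analytics-job": {"restartOnFailure": True},
    "interactive-job": {"restartOnFailure": False},
}
# Observed state, as it might be reported by the cluster.
observed = {
    "analytics-job": "Failed",
    "stale-job": "Running",
}

print(reconcile(desired, observed))
# → [('restart', 'analytics-job'), ('create', 'interactive-job'), ('delete', 'stale-job')]
```

In a real operator this loop would be driven by watch events on the custom resource and the actions would be applied through the Kubernetes API, but the failure-handling "business logic" lives in exactly this comparison.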