Figures on very big graphs? #8
Comments
I think this code has only been used on up to 4 billion edges (with the UK-2007 graph), but nothing prevents a larger graph. The main requirement is available memory or swap for the partitioned edge file, even though it is mostly traversed linearly. We have processed the 2014 CC graph (64 billion edges) on timely with a different analysis (subgraph finding) over at: https://github.com/frankmcsherry/dataflow-join

I'm afraid I only know the definition of harmonic centrality, and not efficient algorithms for its computation. Generally, things like betweenness centrality do not have linear-time algorithms, which is bad news on massive graphs. Approximations to betweenness centrality, such as counting only subsets of distances, can execute efficiently, but you need to determine to what extent the approximation works for you.
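For reference, the definition mentioned above: the harmonic centrality of a node v is the sum of 1/d(v, u) over all other reachable nodes u. A minimal single-machine sketch (plain BFS over an adjacency list, not the timely dataflow implementation, and all names here are illustrative) looks like:

```rust
use std::collections::VecDeque;

// Harmonic centrality of `source`: sum over every other node u reachable
// from `source` of 1 / d(source, u), where d is the BFS hop distance.
// Unreachable nodes contribute nothing (their 1/infinity term is zero).
fn harmonic_centrality(adj: &Vec<Vec<usize>>, source: usize) -> f64 {
    let mut dist = vec![usize::MAX; adj.len()];
    let mut queue = VecDeque::new();
    dist[source] = 0;
    queue.push_back(source);
    let mut total = 0.0;
    while let Some(v) = queue.pop_front() {
        for &w in &adj[v] {
            if dist[w] == usize::MAX {
                dist[w] = dist[v] + 1;
                total += 1.0 / dist[w] as f64;
                queue.push_back(w);
            }
        }
    }
    total
}

fn main() {
    // Path graph 0 - 1 - 2: from node 0, H(0) = 1/1 + 1/2 = 1.5
    let adj = vec![vec![1], vec![0, 2], vec![1]];
    println!("{}", harmonic_centrality(&adj, 0)); // 1.5
}
```

One BFS per node makes the exact computation O(n·m) overall, which is why it does not scale to billions of edges without sampling or sketching (e.g. approximating the distance distribution from a subset of sources).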
And, if it helps, you can read more about performance on 1.6 billion edge and 3.7 billion edge graphs at: https://github.com/frankmcsherry/blog/blob/master/posts/2015-07-08.md The short answer is that twenty iterations on the 1.6B edge graph took about 20 seconds, using a cluster of 16 machines. There are scaling numbers there, and the times should probably scale up linearly with the number of edges (perhaps getting worse with the number of nodes, as the random-access working set increases).
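Taking the linear-in-edges assumption above at face value, a back-of-envelope extrapolation to the 64B edge graph is just a ratio (this is an estimate under that assumption, not a measurement; random access on more nodes may make it worse):

```rust
// Extrapolate a measured runtime to a larger graph, assuming time grows
// linearly with edge count. Figures are from the blog post cited above.
fn estimate_secs(measured_edges: f64, measured_secs: f64, target_edges: f64) -> f64 {
    measured_secs * target_edges / measured_edges
}

fn main() {
    // 20 iterations on 1.6B edges took ~20s on 16 machines; scale to 64B edges.
    let est = estimate_secs(1.6e9, 20.0, 64.0e9);
    println!("~{} seconds", est); // ~800 seconds, i.e. roughly 13 minutes
}
```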
Great, I'll try to run it on a big graph, thanks! I noticed that the nodes are u32; I'll need to change them to u64 to handle 4B+ nodes.
Sounds good; the 128B Common Crawl graph has identifiers that still fit in a u32, if that helps, but if you have your own larger graph you may need to modify them.
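Worth noting before making that change: widening node identifiers from u32 to u64 doubles the size of the edge file, which matters given that memory (or swap) for the partitioned edges is the main requirement. A rough sketch of the cost (the `Node` alias is hypothetical, not a name from the repo):

```rust
// A minimal sketch of what widening the node identifier costs in memory.
// Each edge stores two node identifiers, so bytes = edges * 2 * width.
type Node = u64; // was u32 in the project under discussion

fn edge_bytes(edges: u64, bytes_per_node: u64) -> u64 {
    edges * 2 * bytes_per_node
}

fn main() {
    let edges = 64_000_000_000u64; // 2014 CC graph scale
    println!("u32 edges: {} GB", edge_bytes(edges, 4) / 1_000_000_000); // 512 GB
    println!("u64 edges: {} GB", edge_bytes(edges, 8) / 1_000_000_000); // 1024 GB
    println!("Node is {} bytes", std::mem::size_of::<Node>()); // 8
}
```

If the identifier space is sparse, an alternative to widening is remapping node IDs into a dense u32 range before partitioning, which keeps the edge file at half the size.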
> On Mar 2, 2018, at 14:53, Antoine D ***@***.***> wrote:
>
> great I'll try to run it on a big graph, thanks!
> I noticed that the nodes are in u32, I'll need to change them to u64 to handle 4B+ nodes.
Yes, I use a CC dataset with 5B nodes 😕
Hi,
this project seems very interesting! I was wondering whether you had tested it on quite big graphs, like the host-level Common Crawl with tens (or hundreds) of billions of edges. Do you have figures on what cluster would be required to run it, and a rough idea of the time needed to compute pagerank on it?
I'm sorry, I haven't yet delved enough into timely, but would harmonic centrality be as easy to implement as pagerank?
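For context on what the pagerank computation iterates, here is a minimal single-machine power-iteration sketch over an edge list. This is not the timely dataflow implementation and assumes no dangling nodes (every node has at least one outgoing edge); all names are illustrative:

```rust
// One power-iteration pagerank: each round, every node distributes
// alpha * rank / out_degree to its targets, plus a uniform (1 - alpha) / n
// teleport term. Assumes every node has out-degree >= 1.
fn pagerank(n: usize, edges: &[(usize, usize)], iters: usize, alpha: f64) -> Vec<f64> {
    let mut degree = vec![0usize; n];
    for &(src, _) in edges {
        degree[src] += 1;
    }
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..iters {
        let mut next = vec![(1.0 - alpha) / n as f64; n];
        for &(src, dst) in edges {
            next[dst] += alpha * rank[src] / degree[src] as f64;
        }
        rank = next;
    }
    rank
}

fn main() {
    // Tiny directed cycle 0 -> 1 -> 2 -> 0: by symmetry every rank stays 1/3.
    let edges = [(0, 1), (1, 2), (2, 0)];
    let rank = pagerank(3, &edges, 20, 0.85);
    println!("{:?}", rank);
}
```

The distributed version does the same per-edge update, but with edges partitioned across workers and rank exchanges between them each iteration, which is why runtime tracks the edge count.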