
Figures on very big graphs? #8

Open
antoine-de opened this issue Mar 2, 2018 · 5 comments

@antoine-de

Hi,

This project seems very interesting! I was wondering whether you had tested it on quite big graphs, like the host-level Common Crawl with tens (or hundreds) of billions of edges? Do you have some figures on which cluster would be required to run it, and a rough idea of the time needed to compute the pagerank on it?

I'm sorry, I haven't yet delved enough into timely, but would harmonic centrality be as easy to implement as the pagerank?

@frankmcsherry (Owner)

I think this code has only been used on up to 4 billion edges (with the UK-2007 graph), but nothing prevents a larger graph. The main requirement is available memory or swap for the partitioned edge file, even though it is mostly traversed linearly.
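For rough sizing, assuming edges are stored as (u32, u32) pairs as the current code does (so 8 bytes per edge), a back-of-envelope for crawl-scale graphs looks like:

```rust
fn main() {
    // Rough sizing only: 8 bytes per edge with u32 node ids (double with u64 ids).
    let edges: u64 = 64_000_000_000; // e.g. the 2014 CC graph mentioned below
    let bytes = edges * 8;           // size of the partitioned edge file
    println!("~{} GB of memory/swap across the cluster", bytes / 1_000_000_000);
}
```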

We have processed the 2014 CC graph (64 billion edges) on timely with a different analysis (subgraph finding) over at: https://github.com/frankmcsherry/dataflow-join

I'm afraid I only know the definition of harmonic centrality, not efficient algorithms for computing it. Generally, things like betweenness centrality do not have linear-time algorithms, which is bad news on massive graphs. Approximations, like only counting subsets of distances, can execute efficiently, but you need to determine whether the approximation works well enough for you.
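For concreteness, the definition I have in mind is H(v) = Σ_{u ≠ v} 1 / d(u, v), with unreachable pairs contributing nothing. A minimal single-machine sketch (not this repository's dataflow code, just to show the shape of the computation) makes the cost clear: exact computation needs one BFS per node.

```rust
use std::collections::VecDeque;

/// Exact harmonic centrality: H(v) = sum over u != v of 1 / d(u, v),
/// with unreachable u contributing nothing. One BFS per source node,
/// so O(V * (V + E)) on an unweighted graph -- the expensive part.
fn harmonic_centrality(adj: &[Vec<usize>]) -> Vec<f64> {
    let n = adj.len();
    let mut scores = vec![0.0f64; n];
    for src in 0..n {
        // BFS from `src`: dist[v] is d(src, v).
        let mut dist = vec![usize::MAX; n];
        let mut queue = VecDeque::new();
        dist[src] = 0;
        queue.push_back(src);
        while let Some(u) = queue.pop_front() {
            for &w in &adj[u] {
                if dist[w] == usize::MAX {
                    dist[w] = dist[u] + 1;
                    queue.push_back(w);
                }
            }
        }
        // Each node v reached at distance d(src, v) collects 1 / d(src, v).
        for v in 0..n {
            if v != src && dist[v] != usize::MAX {
                scores[v] += 1.0 / dist[v] as f64;
            }
        }
    }
    scores
}

fn main() {
    // Tiny directed example: 0 -> 1 -> 2 and 0 -> 2; prints [0.0, 1.0, 2.0].
    let adj = vec![vec![1, 2], vec![2], vec![]];
    println!("{:?}", harmonic_centrality(&adj));
}
```

One common approximation is to run the outer loop over a random sample of sources and rescale the sums, which trades accuracy for time in a way you would have to evaluate for your use case.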

@frankmcsherry (Owner)

And, if it helps, you can read more about performance on 1.6 billion edge and 3.7 billion edge graphs at:

https://github.com/frankmcsherry/blog/blob/master/posts/2015-07-08.md
https://github.com/frankmcsherry/blog/blob/master/posts/2015-07-31.md

The short answer: twenty iterations on the 1.6B-edge graph took about 20 seconds, using a cluster of 16 machines. There are scaling numbers there, and the times should probably just scale up linearly with the number of edges (perhaps getting worse with the number of nodes, as the random-access working set grows).
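As a back-of-envelope only, extrapolating that figure linearly in the edge count (and ignoring the node-count effect on the random-access working set):

```rust
fn main() {
    // ~20s for twenty iterations on 1.6B edges with 16 machines, scaled
    // linearly in the number of edges; treat as an order-of-magnitude guess.
    let (baseline_edges, baseline_secs) = (1.6e9_f64, 20.0_f64);
    for target_edges in [64e9_f64, 128e9_f64] {
        let est = baseline_secs * target_edges / baseline_edges;
        println!("{:.0}B edges -> ~{:.0} s for twenty iterations", target_edges / 1e9, est);
    }
}
```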

@antoine-de (Author)

Great, I'll try to run it on a big graph, thanks!

I noticed that the nodes are stored as u32; I'll need to change them to u64 to handle 4B+ nodes.
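Something along these lines, I imagine (the actual types in the repo may differ; this is just the kind of change I mean):

```rust
use std::convert::TryInto;

// Hypothetical sketch, not the repository's actual code: widen the node id so
// values above u32::MAX (~4.29B) fit. Note this also doubles the edge-file size.
type Node = u64; // was u32

fn decode_edge(bytes: &[u8]) -> (Node, Node) {
    // Each edge record is read as two little-endian u64 ids instead of two u32s.
    let src = Node::from_le_bytes(bytes[0..8].try_into().unwrap());
    let dst = Node::from_le_bytes(bytes[8..16].try_into().unwrap());
    (src, dst)
}

fn main() {
    // A 16-byte record encoding the edge (5_000_000_000 -> 7).
    let mut rec = Vec::new();
    rec.extend_from_slice(&5_000_000_000u64.to_le_bytes());
    rec.extend_from_slice(&7u64.to_le_bytes());
    assert_eq!(decode_edge(&rec), (5_000_000_000, 7));
}
```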

@frankmcsherry (Owner) commented Mar 2, 2018 via email

@antoine-de (Author)

Yes, I use a CC dataset with 5B nodes 😕
