
Figures on very big graphs? #8

Open
antoine-de opened this issue Mar 2, 2018 · 5 comments

@antoine-de

Hi,

This project seems very interesting! I was wondering whether you had tested it on quite big graphs, like the host-level Common Crawl with tens (or hundreds) of billions of edges? Do you have some figures on which cluster would be required to run it, and a rough idea of the time needed to compute the pagerank on it?

I'm sorry, I haven't yet delved enough into timely, but would harmonic centrality be as easy to implement as the pagerank?

@frankmcsherry (Owner)

I think this code has only been used on up to 4 billion edges (with the UK-2007 graph), but nothing prevents a larger graph. The main requirement is available memory or swap for the partitioned edge file, even though it is mostly traversed linearly.
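For rough sizing, assuming edges are stored as (u32, u32) pairs as the current code does (so 8 bytes per edge), a back-of-envelope for crawl-scale graphs looks like:

```rust
fn main() {
    // Rough sizing only: 8 bytes per edge with u32 node ids (double with u64 ids).
    let edges: u64 = 64_000_000_000; // e.g. the 2014 CC graph mentioned below
    let bytes = edges * 8;           // size of the partitioned edge file
    println!("~{} GB of memory/swap across the cluster", bytes / 1_000_000_000);
}
```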

We have processed the 2014 CC graph (64 billion edges) on timely with a different analysis (subgraph finding) over at: https://github.com/frankmcsherry/dataflow-join

I'm afraid I only know the definition of harmonic centrality, not efficient algorithms for computing it. Generally, things like betweenness centrality do not have linear-time algorithms, which is bad news on massive graphs. Approximations, like only counting subsets of distances, can execute efficiently, but you need to determine whether the approximation works well enough for you.
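For concreteness, the definition I have in mind is H(v) = Σ_{u ≠ v} 1 / d(u, v), with unreachable pairs contributing nothing. A minimal single-machine sketch (not this repository's dataflow code, just to show the shape of the computation) makes the cost clear: exact computation needs one BFS per node.

```rust
use std::collections::VecDeque;

/// Exact harmonic centrality: H(v) = sum over u != v of 1 / d(u, v),
/// with unreachable u contributing nothing. One BFS per source node,
/// so O(V * (V + E)) on an unweighted graph -- the expensive part.
fn harmonic_centrality(adj: &[Vec<usize>]) -> Vec<f64> {
    let n = adj.len();
    let mut scores = vec![0.0f64; n];
    for src in 0..n {
        // BFS from `src`: dist[v] is d(src, v).
        let mut dist = vec![usize::MAX; n];
        let mut queue = VecDeque::new();
        dist[src] = 0;
        queue.push_back(src);
        while let Some(u) = queue.pop_front() {
            for &w in &adj[u] {
                if dist[w] == usize::MAX {
                    dist[w] = dist[u] + 1;
                    queue.push_back(w);
                }
            }
        }
        // Each node v reached at distance d(src, v) collects 1 / d(src, v).
        for v in 0..n {
            if v != src && dist[v] != usize::MAX {
                scores[v] += 1.0 / dist[v] as f64;
            }
        }
    }
    scores
}

fn main() {
    // Tiny directed example: 0 -> 1 -> 2 and 0 -> 2; prints [0.0, 1.0, 2.0].
    let adj = vec![vec![1, 2], vec![2], vec![]];
    println!("{:?}", harmonic_centrality(&adj));
}
```

One common approximation is to run the outer loop over a random sample of sources and rescale the sums, which trades accuracy for time in a way you would have to evaluate for your use case.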

@frankmcsherry (Owner)

And, if it helps, you can read more about performance on 1.6 billion edge and 3.7 billion edge graphs at:

https://github.com/frankmcsherry/blog/blob/master/posts/2015-07-08.md
https://github.com/frankmcsherry/blog/blob/master/posts/2015-07-31.md

The short answer: twenty iterations on the 1.6B-edge graph took about 20 seconds, using a cluster of 16 machines. There are scaling numbers there, and the times should probably just scale up linearly with the number of edges (perhaps getting worse with the number of nodes, as the random-access working set grows).
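As a back-of-envelope only, extrapolating that figure linearly in the edge count (and ignoring the node-count effect on the random-access working set):

```rust
fn main() {
    // ~20s for twenty iterations on 1.6B edges with 16 machines, scaled
    // linearly in the number of edges; treat as an order-of-magnitude guess.
    let (baseline_edges, baseline_secs) = (1.6e9_f64, 20.0_f64);
    for target_edges in [64e9_f64, 128e9_f64] {
        let est = baseline_secs * target_edges / baseline_edges;
        println!("{:.0}B edges -> ~{:.0} s for twenty iterations", target_edges / 1e9, est);
    }
}
```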

@antoine-de (Author)

Great, I'll try to run it on a big graph, thanks!

I noticed that the nodes are stored as u32; I'll need to change them to u64 to handle 4B+ nodes.
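Something along these lines, I imagine (the actual types in the repo may differ; this is just the kind of change I mean):

```rust
use std::convert::TryInto;

// Hypothetical sketch, not the repository's actual code: widen the node id so
// values above u32::MAX (~4.29B) fit. Note this also doubles the edge-file size.
type Node = u64; // was u32

fn decode_edge(bytes: &[u8]) -> (Node, Node) {
    // Each edge record is read as two little-endian u64 ids instead of two u32s.
    let src = Node::from_le_bytes(bytes[0..8].try_into().unwrap());
    let dst = Node::from_le_bytes(bytes[8..16].try_into().unwrap());
    (src, dst)
}

fn main() {
    // A 16-byte record encoding the edge (5_000_000_000 -> 7).
    let mut rec = Vec::new();
    rec.extend_from_slice(&5_000_000_000u64.to_le_bytes());
    rec.extend_from_slice(&7u64.to_le_bytes());
    assert_eq!(decode_edge(&rec), (5_000_000_000, 7));
}
```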

@frankmcsherry (Owner) commented Mar 2, 2018 via email

@antoine-de (Author)

Yes, I use a CC dataset with 5B nodes 😕
