ruig2 edited this page Mar 23, 2019 · 28 revisions

Paper presentations

References to big data visualization

Massive data visualization analysis - analysis of current visualization techniques and main challenges for the future

Download link: https://ieeexplore.ieee.org/abstract/document/7975704

This paper lists a few big data viz tools, including D3, Tableau and so on, some of which we were not familiar with before.

Though the above tools are claimed to handle big data, in practice the data they work on is still static and small compared to what we want to do.

For example, the above tools focus on the rendering part: the input could be the number of people in each country, stored in an Excel file, from which a heat map or count map is generated. Our target, in contrast, is to analyze dynamic data: people are born and die every day, and we want to know the number of people born in each town each week.
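The dynamic query we have in mind (births per town per week, computed over a stream of events) can be sketched in a few lines; the event shape and names below are our own illustration, not from any of the papers:

```python
# Minimal sketch: aggregating a stream of birth events into weekly
# per-town counts -- the kind of dynamic query static tools cannot answer.
from collections import defaultdict
from datetime import date

def week_key(d: date) -> tuple:
    iso = d.isocalendar()
    return (iso[0], iso[1])  # (ISO year, ISO week)

counts = defaultdict(int)

def on_birth_event(town: str, born_on: date) -> None:
    # called once per incoming event; counts stay up to date
    counts[(town, week_key(born_on))] += 1

on_birth_event("Springfield", date(2019, 2, 11))
on_birth_event("Springfield", date(2019, 2, 12))  # same ISO week
on_birth_event("Shelbyville", date(2019, 2, 11))
print(counts[("Springfield", week_key(date(2019, 2, 11)))])  # 2
```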

Large-Scale Graph Visualization and Analytics

Ma, Kwan-Liu, and Chris W. Muelder. "Large-scale graph visualization and analytics." Computer 46.7 (2013): 39-46. Link: https://ieeexplore.ieee.org/abstract/document/6576786

This paper introduces a few key topics in the area of big graph visualization. People are working on new algorithms and developing new tools; however, we did not find an integrated tool (ideally with a web UI) that renders real-time, dynamic big graphs.

Visual Analysis of Large Heterogeneous Social Networks by Semantic and Structural Abstraction

Summarized by Rui Guo.

Link: https://ieeexplore.ieee.org/abstract/document/1703364

It's the paper that proposes the OntoVis system. This paper focuses on heterogeneous networks, in which a node can be either a user or an organization. By drawing users and organizations in the same graph, we can understand the relationships between users, between organizations, and between users and organizations. A few layout algorithms for nodes without inherent coordinates (i.e. nodes that have no fixed location on the canvas) are discussed in the paper.
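The core idea behind such layout algorithms, assigning positions to nodes that have none, can be sketched as a minimal force-directed loop in pure Python; the constants and the tiny user/organization graph below are illustrative, not from the paper:

```python
# Minimal force-directed layout sketch: edges attract, all pairs repel.
# Constants (k, step size, iteration count) are illustrative only.
import math, random

def layout(nodes, edges, iters=200, k=1.0):
    random.seed(42)  # deterministic initial placement for the demo
    pos = {n: [random.random(), random.random()] for n in nodes}
    for _ in range(iters):
        disp = {n: [0.0, 0.0] for n in nodes}
        # repulsion between every pair of nodes
        for a in nodes:
            for b in nodes:
                if a == b:
                    continue
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += dx / d * f
                disp[a][1] += dy / d * f
        # attraction along edges
        for a, b in edges:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            for n, s in ((a, -1), (b, 1)):
                disp[n][0] += s * dx / d * f
                disp[n][1] += s * dy / d * f
        # small damped step toward the computed displacement
        for n in nodes:
            pos[n][0] += 0.01 * disp[n][0]
            pos[n][1] += 0.01 * disp[n][1]
    return pos

pos = layout(["u1", "u2", "org"], [("u1", "org"), ("u2", "org")])
```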

We are not sure whether we'd like to build such a heterogeneous system. The trade-off here: the more types of nodes a graph contains, the more easily users get lost.

Gephi : An Open Source Software for Exploring and Manipulating Networks

Link: https://gephi.org/publications/gephi-bastian-feb09.pdf Summarized by Rui Guo.

Gephi is an open-source project for visualizing graph data. It is the best graph visualization system I have found so far. An introduction/demo video can be found at https://gephi.org/features/ .

Pros of this project are:

  1. Useful features. As the video shows, users can merge or split nodes, run queries, do clustering, and so on. Almost every idea I have come up with for graph visualization (except visualizing in 3D) can be found here.
  2. It is maintained by a startup company, so it is more stable than research projects developed at universities.

Cons of this project are:

  1. It is client-based. You need to download and install it, which can be inconvenient and hard to scale, since big data may be distributed across different machines.
  2. It still targets small data. The paper above was published in 2009, and at that time a large network meant 20,000 nodes.
  3. It has not been very active recently. People are still working on it, but commits are not as frequent as in 2016 and before.

Interactive Visualizations demo from Oxford Internet Institute, University of Oxford

Link to the demo: http://oxfordinternetinstitute.github.io/InteractiveVis/network/# Summarized by Rui Guo.

This is a beautiful visualization demo. It was built in 2012 and uses sigma.js.

Pros:

  • Beautiful UI: 1) the colors of nodes and edges are gentle and easy to distinguish; 2) curved lines are used instead of straight lines.
  • Interactive, with very low latency.
  • Built on sigma.js, which means the potential to use the GPU.

Cons:

  • The data is small and static. It would be great if we could show dynamic data and run queries on the graph (e.g. who followed me yesterday and who unfollowed me).
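The temporal query mentioned above reduces to set differences between follower snapshots; a minimal sketch with made-up account names:

```python
# Follower snapshots from two consecutive days (names are made up).
yesterday = {"alice", "bob", "carol"}
today = {"alice", "carol", "dave"}

# Set difference answers both halves of the query directly.
new_followers = today - yesterday   # {'dave'}
unfollowers = yesterday - today     # {'bob'}
print(new_followers, unfollowers)
```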

graphVizdb: A Scalable Platform for Interactive Large Graph Visualization

Summarized by Rui Guo on 2019-02-07.

One-sentence summary: this paper proposes a web-based tool to visualize big graphs with precomputed node positions (100M nodes and 100M edges).

Pros:

  • Supports large data sets (100M nodes and 100M edges) via offline pre-processing, during which:
  1. locations on the canvas are assigned to nodes by layout algorithms (e.g. a greedy algorithm that puts the part of the graph with the most edges in the center of the canvas);
  2. a few abstraction layers are computed (e.g. by compressing part of the graph into a single node); the layers are similar to Google Maps layers: as you zoom in, you see the provinces of a country, then cities, then towns, and so on;
  3. a B+ tree and an R-tree are used to index nodes and their locations on the canvas. In fact, the experimental results show the DB is not a bottleneck at all.
  • A few operators are supported: exploring the graph vertically (zooming in) and horizontally (panning around) with the R-tree index, and searching tags of nodes and edges with the B+ tree index.
  • Low latency when exploring the graph in the web client. The pre-processing may take an hour, but the online demo reacts in seconds.
  • The overall running time grows linearly when querying a larger part of the same graph.
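The viewport-query role that the R-tree plays can be sketched with a much simpler uniform grid over the precomputed node positions; the cell size and data are illustrative, and a real R-tree handles skewed distributions far better:

```python
# Toy stand-in for graphVizdb's spatial index: a uniform grid over
# precomputed node positions, answering "which nodes are in the current
# viewport" without scanning every node. CELL size is illustrative.
from collections import defaultdict

CELL = 10.0
grid = defaultdict(list)  # (cell_x, cell_y) -> [(node_id, x, y), ...]

def cell_of(x, y):
    return (int(x // CELL), int(y // CELL))

def insert(node_id, x, y):
    grid[cell_of(x, y)].append((node_id, x, y))

def viewport_query(x0, y0, x1, y1):
    """Return ids of nodes whose position lies inside the rectangle."""
    hits = []
    for cx in range(int(x0 // CELL), int(x1 // CELL) + 1):
        for cy in range(int(y0 // CELL), int(y1 // CELL) + 1):
            for node_id, x, y in grid[(cx, cy)]:
                if x0 <= x <= x1 and y0 <= y <= y1:
                    hits.append(node_id)
    return hits

insert("a", 3, 4)
insert("b", 55, 60)
insert("c", 12, 8)
print(viewport_query(0, 0, 20, 20))  # ['a', 'c']
```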

Cons:

  • The web client is the bottleneck. It seems to use the CPU only rather than the GPU, and the statistics show that rendering and delivery take up most of the time. Once a more powerful web front end is developed, we could handle 1) streamed data (e.g. new nodes and edges arriving every day, for which we may want to do the pre-processing incrementally), and 2) advanced graph queries (e.g. users who followed and unfollowed me yesterday), to tell a better story about the system architecture.
  • More advanced interactive operators could be supported. For example, users may want to merge a few nodes into one manually, as in the Gephi demo.
  • Every node has the same size and color. Though nodes and edges are marked with different text tags, they are not distinguishable when looking at the graph. How to render a single node also remains a problem, because one node may have many fields.

Visualizing large knowledge graphs: A performance analysis

Link of the paper: https://www.sciencedirect.com/science/article/pii/S0167739X17323610

Summarized by Rui Guo on Feb 11, 2019

This paper focuses on the data-processing part of the graph visualization pipeline and analyzes its performance by processing the data with Spark.

What we can learn from the paper:

  • We can use Spark and other big data frameworks to pre-process graph data, and the source code of the experiments in this paper is available.
  • A big graph can be handled by a few machines in a short time (the paper says "12 million vertices and 172 million edges can be executed in <25s").
  • The denser a graph is (the more edges it has), the slower the processing. Luckily, real-world graphs are sparse (e.g. a friendship network on Facebook).
  • It collects references and tools for visualizing big data. This paper is a good starting point for getting to know the visualization world.

What we can do differently from this paper:

  • This paper works on the pre-processing part (e.g. PageRank and layout algorithms) for big graphs. We'd like to build an interactive and dynamic system, which means we need to focus on the rendering part that comes after the pre-processing stage. What's more, our input data may be dynamic rather than static (i.e. streamed data coming in every second), and the queries/interactions can be dynamic as well (e.g. querying a sub-graph, selecting different labels).

GraphDL: An Ontology for Linked Data Visualization

Download link: https://link.springer.com/chapter/10.1007/978-3-030-00374-6_33 Summarized by Rui Guo on Feb 11, 2019

Similar to the paper above (Visualizing large knowledge graphs: A performance analysis), though this one is shorter and does not include a performance analysis.

A Hierarchical Framework for Efficient Multilevel Visual Exploration and Analysis

Download link: http://www.semantic-web-journal.net/system/files/swj1227.pdf Summarized by Rui Guo on Feb 14, 2019

It is from the same group as the graphVizdb project. This paper focuses on range queries, e.g. how many users are between 20 and 50 years old. This is a different way to visualize a graph: through charts and histograms rather than by drawing the graph itself on the screen. A new tree data structure (perhaps similar to a B+ tree?) is defined in the paper to speed up such queries.
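The flavor of such range queries over a sorted index can be sketched with Python's bisect; the real system uses its own purpose-built tree, this is just the idea:

```python
# Range-count over a sorted index in O(log n): how many values fall
# inside [lo, hi]. Ages below are made-up sample data.
import bisect

ages = sorted([17, 20, 23, 35, 44, 50, 63])

def range_count(lo, hi):
    # bisect_right finds the end of the range, bisect_left the start
    return bisect.bisect_right(ages, hi) - bisect.bisect_left(ages, lo)

print(range_count(20, 50))  # 5 users between 20 and 50 years old
```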

Interactive Visual Graph Analytics on the Web

Download link: http://graphvis.com/pubs/ahmed-et-al-icwsm15.pdf Website: http://graphvis.com/ Summarized by Rui Guo on Feb 20, 2019

This is a web-based graph data visualization tool developed at Purdue University. However, there is no related video, and I need to apply for access to the online demo. I'll share more details if I am granted access.

Pros:

  • Web-based client and interactive.

Cons:

  • Deals with static data (?);
  • The paper lists lots of fancy features, but it is easy to get lost in them. It might be better to demo a few key features on a specific dataset and show a killer application, rather than a huge combination of visualization algorithms.

Visualizing Dynamic Bitcoin Transaction Patterns

Paper link: https://www.liebertpub.com/doi/pdf/10.1089/big.2015.0056

Demo link: https://imperialcollegelondon.app.box.com/v/bitcoinVis

Summarized by Rui Guo on Feb 21, 2019

This paper focuses on visualizing Bitcoin transactions. Each transaction is presented as a few nodes (input accounts, output accounts) and edges (the transaction). They use sigma.js for the web-based UI and ForceAtlas2 (reference: "ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software") as the layout algorithm. Per our previous discussion, we would like to adopt the same technical tools to develop our system.

Pros of this paper:

  • Very straightforward visualization, with pretty convincing examples showing the effectiveness of the system (e.g. a transaction rate attack, carried out by sending small amounts of money between a few accounts over and over again);

  • Well organized: it introduces how Bitcoin works, describes how the visualization system is built, shares a few examples, and then persuades the reader of the system's effectiveness with feedback from visitors to their group (unlike in the traditional database domain, we cannot compare two visualization systems with benchmarks and experiments).
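The transaction rate attack mentioned above (tiny amounts bounced between a few accounts) is easy to spot once transactions are modeled as edges; a toy sketch with made-up addresses and thresholds:

```python
# Transactions as edges (from, to, amount); addresses are made up.
from collections import Counter

edges = [
    ("A", "B", 0.001), ("B", "A", 0.001), ("A", "B", 0.001),
    ("B", "A", 0.001), ("C", "D", 2.5),
]

# Flag unordered account pairs with many tiny back-and-forth transfers:
# the repeated-small-transfer pattern the paper's visualization exposed.
pair_counts = Counter(frozenset((s, d)) for s, d, amt in edges if amt < 0.01)
suspicious = [pair for pair, n in pair_counts.items() if n >= 3]
print(suspicious)  # the A<->B pair
```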

Future work of the paper:

  • It focuses on one block of transactions; maybe it can be extended across multiple blocks to track transactions on the same account.

Empirical Comparison of Visualization Tools for Larger-Scale Network Analysis

Paper link: https://www.hindawi.com/journals/abi/2017/1278932/

Summarized by Rui Guo on Feb 23, 2019

This paper compares four popular graph viz tools. According to the paper, Gephi is the best for general viz purposes, and Pajek-XXL is the best for scalability (supporting >10 billion nodes).

I googled a little bit, and the homepages and video demos for those tools are:

Tulip (http://tulip.labri.fr/TulipDrupal/)

Cytoscape (https://cytoscape.org/what_is_cytoscape.html)

Pajek (http://mrvar.fdv.uni-lj.si/pajek/)

Exploration and Visualization in the Web of Big Linked Data: A Survey of the State of the Art

Paper Link: https://arxiv.org/pdf/1601.08059.pdf

Summarized by Rui Guo

  • 144 references, and lots of tools mentioned in the paper.

Challenges in big graph viz:

  • Queries -> user-specific
  • Sampling and filtering
  • Aggregation (e.g. clustering)
  • Offline pre-processing, made incremental/progressive for new and dynamic data

Neo4j internal

Book link: https://neo4j.com/graph-databases-book/

Summarized by Rui Guo on March 23, 2019

Chapter 6 of the book covers the internal design of Neo4j. Basically, instead of a B+ tree, Neo4j stores relationships in linked lists. For example, all the friends of Alice are stored in one list, and if we have the head of that friend list, we no longer need to query a B+ tree. To speed up access to the list, every Neo4j record has the same length, which means that given a record id, the record's location can be computed immediately from the fixed record length and the resulting offset.
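The fixed-length record idea can be sketched in a few lines: with every record the same size, a record's byte offset is pure arithmetic, so no tree lookup is needed. The record size and field layout below are illustrative, not Neo4j's actual format:

```python
# Fixed-length record store sketch: offset(id) = id * RECORD_LEN,
# so any record can be read with one arithmetic step plus one seek.
import struct

RECORD_LEN = 12  # e.g. 3 x 4-byte int fields per record (illustrative)

def record_offset(record_id):
    return record_id * RECORD_LEN

# a tiny in-memory "store file" holding three records
store = bytearray(3 * RECORD_LEN)

def write_record(record_id, a, b, c):
    struct.pack_into("<iii", store, record_offset(record_id), a, b, c)

def read_record(record_id):
    return struct.unpack_from("<iii", store, record_offset(record_id))

write_record(2, 7, 8, 9)
print(read_record(2))  # (7, 8, 9)
```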

Currently, the bottleneck of our visualization system is in the frontend and the network. In the near future we might run into backend performance issues, and Neo4j could be one of the solutions there.
