Setting up a mirroring cluster

Georgios Gousios edited this page May 2, 2012 · 17 revisions

General information

The GitHub API rate limit (currently 5000 requests per hour) makes it impossible to retrieve all data linked from events on a single node. For that reason, GHTorrent was designed to work in multiple phases in a distributed fashion. Depending on the data you want to collect, a cluster setup may be necessary.
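
As a rough illustration of the sizing argument, the sketch below computes how many nodes a crawl would need, assuming a quota of 5000 requests per hour and assuming each node gets its own quota. The function name `nodes_needed` and the workload figure of 12000 requests per hour are hypothetical, chosen only to show the arithmetic.

```shell
# Ceiling division: number of nodes needed to serve a given hourly
# request volume when each node is capped at `limit` requests per hour.
nodes_needed() {
  required=$1   # requests per hour the crawl must issue (hypothetical input)
  limit=$2      # per-node hourly quota, e.g. 5000
  echo $(( (required + limit - 1) / limit ))
}

nodes_needed 12000 5000   # prints 3
```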

A full GHTorrent cluster consists of the following types of nodes:

  • Event retrieval nodes: Nodes that query the public GitHub event API for new events. More than one instance is required, both to ensure that no events are lost during spikes in event generation and so that machine or network malfunctions on a single event collection machine do not affect the service.
  • Linked data retrieval nodes: Retrieval of data linked by events is where the GitHub API imposes the most significant restrictions.
  • MongoDB shards: A MongoDB installation can be sharded (have the data spread over multiple nodes) on a per-collection basis. Sharding MongoDB helps both with distributing the storage requirements and with faster querying. Sharding is transparent to the application, so GHTorrent needs no modification to work with a sharded MongoDB. See the MongoDB sharding documentation for more details.
  • RabbitMQ active-active mirrors: RabbitMQ can work in cluster mode for high availability.

Recommended setup

To set up a GHTorrent cluster, at least 2 nodes are required. However, the more nodes are available, the more data can be collected and the more resilient the cluster will be to external errors or mirroring script bugs. The following sections assume a Debian-based operating system.

Setting up software dependencies

On the node

To prepare MongoDB:

$ mongo admin
> db.addUser('github', 'github')
> use github
> db.addUser('github', 'github')

To prepare RabbitMQ:

$ rabbitmqctl add_user github github
$ rabbitmqctl set_permissions -p / github ".*" ".*" ".*"

# The following will enable the RabbitMQ web admin for the github user
# Not necessary to have, but good to debug and diagnose problems
$ rabbitmq-plugins enable rabbitmq_management
$ rabbitmqctl set_user_tags github administrator
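
To confirm that the RabbitMQ account was created, the output of `rabbitmqctl list_users` can be searched for the user name. The sketch below is one way to do that; the helper name `rabbit_user_exists` is hypothetical, and it assumes the user name appears in the first column of the listing.

```shell
# Succeeds if `rabbitmqctl list_users` reports the given user name
# in the first column of its output.
rabbit_user_exists() {
  rabbitmqctl list_users | awk -v u="$1" '$1 == u { found = 1 } END { exit !found }'
}

# Example (run on the RabbitMQ host):
#   rabbit_user_exists github && rabbitmqctl list_permissions -p /
```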

Setting up a GHTorrent node

The following steps will create a new GHTorrent cluster node. To ensure that the mirroring operations will continue even if there are bugs in the code, we use the supervise program (from D.J. Bernstein's daemontools package) which monitors processes and restarts them in case of errors.

  • Install the necessary dependencies
apt-get update
apt-get -y install ruby rubygems git daemontools screen sudo
  • Add a user for the mirroring operations.
adduser github
  • Install the necessary Ruby libraries
gem install amqp mongo json bson_ext
  • Check out the code from GitHub (as the github user). GitHub no longer serves the unauthenticated git:// protocol, so use the HTTPS URL:
git clone https://github.com/gousiosg/github-mirror.git
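  • Select whether the node will be a data retrieval or an event mirror node and set up process supervision. supervise executes the file named run inside the directory it is given, and a relative symlink target is resolved from the link's own directory, so the target is given without the github-mirror/ prefix:
ln -s mirror_events.rb github-mirror/run

# or

ln -s data_retrieval.rb github-mirror/run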
  • Use the config.yaml file to configure the mirroring tools with the locations of MongoDB and RabbitMQ. If the MongoDB and/or RabbitMQ hosts run in a separate network from the host you are currently configuring (or even on the same network, since MongoDB does not provide for SSL encrypted sockets), you can use ssh port forwarding to set up a secure channel between the local machine and the MongoDB/RabbitMQ host(s). To do so, create a github user on the machine where MongoDB/RabbitMQ runs and do the following (the -N flag means no remote command is run, so none is given):
ssh -fN -L 5672:rabbithost:5672 github@rabbithost
ssh -fN -L 27017:mongohost:27017 github@mongohost

The forwarded ports stay open after the controlling terminal exits, because the -f flag sends ssh to the background.

  • Start mirroring
supervise github-mirror
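
Since supervise restarts the run script whenever it exits, a mirror that cannot reach MongoDB or RabbitMQ may end up restarting in a loop. A quick check of the two local ports before running supervise can catch this early. The sketch below assumes netcat (`nc`) is installed, which is not part of the dependency list above, and that the services are reachable on localhost at the ports used in the ssh commands (5672 and 27017). The function names are hypothetical.

```shell
# Succeeds if a TCP port accepts connections (netcat's -z scan mode,
# with a 2-second timeout).
port_open() {
  nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# Check the RabbitMQ (5672) and MongoDB (27017) ports on localhost.
# Returns non-zero if either port is unreachable.
preflight() {
  rc=0
  for port in 5672 27017; do
    if port_open localhost "$port"; then
      echo "ok: localhost:$port"
    else
      echo "unreachable: localhost:$port"
      rc=1
    fi
  done
  return "$rc"
}

# Example:
#   preflight && supervise github-mirror
```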

To keep the mirroring session open on terminal disconnects, use screen or an equivalent program.
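
One way to do this with screen is to start supervise inside a detached session and reattach only when needed. The sketch below assumes the service directory is github-mirror, as in the steps above; the helper name `start_mirror` and the session name `ghtorrent` are arbitrary choices for illustration.

```shell
# Start supervise for the given service directory inside a detached
# screen session (-d -m: start detached, -S: name the session).
start_mirror() {
  dir=$1
  session=${2:-ghtorrent}
  screen -dmS "$session" supervise "$dir"
}

# Example:
#   start_mirror github-mirror   # start in the background
#   screen -r ghtorrent          # reattach to watch the output
#   svstat github-mirror         # daemontools: show up/down state and uptime
```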
