Introduction to the engineering workflow on network data for undergraduates.
The networkie
libray is organized with the following modules:
- gen: Network object generation via various generative models, including loading from custom data.
Current implementation includes
Custom
. There are virtual classes forErdős–Rényi model
andGaussian mixture model
(will be implemented in the future). - partition: Graph partitioning via various methods. Current implementation includes
Louvain
. - transform: Network data transformation between various formats. Current implementation includes the
Edgelist
class. It is designed to transform*.edgelist
txt data to other formats. Current implementation includes transformation between the plain textedgelist
file and the Numpy.ndarraymatrix
format. - viz: Network visualization module.
- dataset: Network data sets for the project.
More broadly, the repository is organized as follows:
- build_tools Continuous integration related scripts.
- dataset Network data sets.
- homework Homework designed for learning network analysis with the library.
- networkie Main library of the
networkie
. - scripts Other scripts, including the script that parses Google Forms' raw csv file.
- tests Scripts for unit tests via
pytest
.
The main function for the files in the root directory is as follows:
- .gitattributes Attributes for the Git repo. Here, we ask Git to ignore counting
*.ipynb
files as part of the code contribution in the repo. - .gitignore Listing of the files (and file extensions) that you would like Git to ignore.
- .travis.yml Configuration file for travis-ci.
- LICENSE License for the project. Here, we use GNU GPLv3 because we cared about sharing improvements.
- pytest.ini Configuration file for the
pytest
framework. - README.md The readme file for the project.
- requirements.txt The libraries that you have to install before running the code of the project smoothly.
- requirements_test.txt The libraries that you have to install before running the code of the project smoothly; when running in a test mode.
- *.ipynb Jupyter Notebooks that are useful for learning.
First things first, please fork this repository on your GitHub. And then, clone this repository locally on your computer.
git clone https://github.com/YOUR-USERNAME/data-engineering-tutorial.git
Now, let's install the libraries needed by the project. We will do so by installing the libraries into existing environment. However, a better practice is to create a new environment for the project.
conda install numpy jupyter networkx pytest
Or, you may simply do:
pip install -r requirements.txt
pip install -r requirements_test.txt
Oh, did I mention that Anaconda (which ships with conda
) is an useful package management tool for Python?
To download Anaconda Distribution, visit here.
If you have pulled the codes, it should now be ready to work! To make sure the code works well, run the unit tests:
pytest
If pytest
shows a green pass sign, you should be ready!
We use these *.ipynb
files to illustrate the use of networkx
in analyzing the in-class social network data.
For more info, please check out the references.
-
Overview - introduction to basic tools, e.g. git, GitHub, Python.
-
Foundations - Reading files, importing data, creating the Graph, adding attributes to the graph.
-
Analysis - centrality measures, community detection with modularity (TODO).
-
Visualization - network visualization.
- Network Analysis Made Simple by Eric Ma on GitHub. We will go through some parts of the repo in the lecture.
- Exploring and Analyzing Network Data with Python, yet another example for using
networkx
, with a focus on analyzing humanities data. - Ball, B., & Newman, M. (2013). Friendship networks and social status. Network Science, 1(1), 16-30. The paper illustrates an interesting research work on generic in-class social network data.