Data Science

Table of Contents

Doc
Open data
Data storage
Markup Language
Languages
Workflow/Pipelines tools
Dataset
Tools
Data structure
Algorithm
Statistics
Big Data and Cloud
Books
Course
Misc

Doc

Three Things About Data Science You Won't Find In the Books

Open data

awesome-public-datasets - A topic-centric list of high-quality open datasets in public domains. By everyone, for everyone!
fivethirtyeight/data - Data and code behind the articles and graphics at FiveThirtyEight https://data.fivethirtyeight.com/

Data storage

New tech

IPFS is the Distributed Web

Markup Language

YAML language tutorial

Languages

Workflow/Pipelines tools

DSL

GNU make, manual, Make command tutorial
Common Workflow Language
snakemake - A workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in Python style. See also: Build bioinformatics pipelines with Snakemake
nextflow - A DSL for data-driven computational pipelines http://nextflow.io
sake - A self-documenting build automation tool

Language-dependent

toil - A scalable, efficient, cross-platform and easy-to-use workflow engine in pure Python
Ruffus - A computation pipeline library for Python. It is open source, powerful, and user-friendly, and is widely used in science and bioinformatics.

Dataset

awesome-public-datasets - An awesome list of (large-scale) public datasets on the Internet. (On-going collection)

Tools

csvkit - A suite of utilities for converting to and working with CSV, the king of tabular file formats. http://csvkit.rtfd.org/
csvtk - Another cross-platform, efficient, practical and pretty CSV/TSV toolkit in Golang http://shenwei356.github.io/csvtk
icdiff - improved colored diff http://www.jefftk.com/icdiff

Data structure

skizze - A probabilistic data structure service and storage
dablooms - A Scalable, Counting, Bloom Filter. Java, Python, Go edition.
inbloom - Cross language bloom filter implementation
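HyperLogLog & HyperLogLog++, Big data counting: how to count a billion objects using only 1.5 KB of memory - The HyperLogLog algorithm, How to quickly estimate the number of unique elements in a huge dataset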
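
To make the idea behind the HyperLogLog links concrete, here is a minimal sketch of the basic estimator in Python. It is an illustration under stated assumptions, not the implementation used by any project listed above: the class name, the choice of SHA-1 as the hash, and the default of 2^11 registers are all assumptions, and the HyperLogLog++ refinements (sparse representation, empirical bias correction) are left out.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 11) -> None:
        # The closed-form alpha below is only valid for m >= 128, i.e. p >= 7.
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # Assumption: use the first 8 bytes of SHA-1 as a 64-bit hash.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        tail_bits = 64 - self.p
        idx = h >> tail_bits                 # top p bits select a register
        tail = h & ((1 << tail_bits) - 1)    # remaining bits feed the rank
        # Rank = 1-based position of the leftmost 1-bit in the tail.
        rank = tail_bits - tail.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> float:
        # Harmonic-mean estimate over all registers.
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

Adding the same item twice leaves the registers unchanged, which is why the structure counts distinct elements. The expected relative error is roughly 1.04 / sqrt(m), so more registers trade memory for accuracy.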

Algorithm

Statistics

p-value
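
As a small worked illustration of what a p-value measures, the sketch below computes an exact one-sided binomial p-value: the probability of seeing at least the observed number of successes if the null hypothesis were true. The coin-flip scenario and the function name are assumptions chosen for illustration; they are not taken from the linked article.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p_null)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # Sum the probability of every outcome at least as extreme as the observed one.
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )
```

For 8 heads in 10 flips of a fair coin, `binomial_p_value(8, 10)` gives 56/1024, about 0.055, so the result would not be significant at the conventional 0.05 level.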

Big Data and Cloud

Books

Mining of Massive Datasets - The book is based on Stanford Computer Science course CS246: Mining Massive Datasets

Course

Stanford Computer Science course CS246: Mining Massive Datasets
Algorithms for Big Data - This class will give you a biased sample of techniques for scalable data analysis

Misc

What to do with “small” data?
Convert xlsx to csv on the Linux command line using ssconvert from Gnumeric
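
The ssconvert conversion can also be scripted. The following is a minimal sketch that wraps the basic `ssconvert input.xlsx output.csv` invocation from Python; the helper names `ssconvert_command` and `xlsx_to_csv` are hypothetical, and it assumes Gnumeric is installed so that `ssconvert` is on the PATH.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def ssconvert_command(xlsx_path: str, csv_path: Optional[str] = None) -> List[str]:
    """Build the ssconvert argument list; the output defaults to the same name with .csv."""
    src = Path(xlsx_path)
    dst = Path(csv_path) if csv_path else src.with_suffix(".csv")
    return ["ssconvert", str(src), str(dst)]


def xlsx_to_csv(xlsx_path: str, csv_path: Optional[str] = None) -> Path:
    """Run ssconvert and return the path of the CSV file it was asked to write."""
    if shutil.which("ssconvert") is None:
        raise RuntimeError("ssconvert not found; install Gnumeric first")
    cmd = ssconvert_command(xlsx_path, csv_path)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[-1])
```

Keeping the command construction separate from the call makes it easy to inspect or log the exact command before running it.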