Table of Contents
- Doc
- Open data
- Data storage
- Markup Language
- Languages
- Workflow/Pipelines tools
- Dataset
- Tools
- Data structure
- Algorithm
- Statistics
- Big Data and Cloud
- Books
- Course
- Misc
- awesome-public-datasets - A topic-centric list of high-quality open datasets in public domains. By everyone, for everyone!
- fivethirtyeight/data - Data and code behind the articles and graphics at FiveThirtyEight https://data.fivethirtyeight.com/
New tech
- IPFS is the Distributed Web
- Data Science with Python & R: Dimensionality Reduction and Clustering
- R vs Python: head to head data analysis | R vs Python: head-to-head data analysis (Chinese translation)
- DataPyR
- Choosing R or Python for data analysis? An infographic
DSL
- GNU make, manual, Make command tutorial (in Chinese)
- Common Workflow Language
- snakemake - Snakemake is a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in Python style. See also: Build bioinformatics pipelines with Snakemake
- nextflow - A DSL for data-driven computational pipelines http://nextflow.io
- sake - A self-documenting build automation tool
Language-dependent
- toil - A scalable, efficient, cross-platform and easy-to-use workflow engine in pure Python
- Ruffus - Ruffus is a computation pipeline library for Python. It is open source, powerful and user-friendly, and widely used in science and bioinformatics.
- awesome-public-datasets - An awesome list of (large-scale) public datasets on the Internet. (On-going collection)
- csvkit - A suite of utilities for converting to and working with CSV, the king of tabular file formats. http://csvkit.rtfd.org/
- csvtk - Another cross-platform, efficient, practical and pretty CSV/TSV toolkit in Golang http://shenwei356.github.io/csvtk
- icdiff - improved colored diff http://www.jefftk.com/icdiff
- skizze - A probabilistic data structure service and storage
- dablooms - A Scalable, Counting, Bloom Filter. Java, Python, Go edition.
- inbloom - Cross language bloom filter implementation
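- HyperLogLog & HyperLogLog++, Big data computing: how to count a billion objects using only 1.5 KB of memory - the HyperLogLog algorithm, How to quickly estimate the number of unique elements in a huge dataset
- YGC's statistics notes (in Chinese)
- "On the scalability of statistical procedures: why the p-value bashers just don't get it" by Jeff Leek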
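The idea behind the HyperLogLog links above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from any of the linked projects: the class name, the choice of SHA-1 as the hash, and the default of 2048 registers are all assumptions made for the example. With 2048 registers of about 6 bits each, a packed version would need roughly 1.5 KB; the plain Python list used here takes far more.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch with 2**p registers (hypothetical helper)."""

    def __init__(self, p=11):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Take 64 bits of a SHA-1 digest (the hash choice is an assumption)
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                     # first p bits pick a register
        rest = h & ((1 << width) - 1)        # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        # Harmonic mean of 2**register across all registers
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for i in range(50000):
    hll.add("user-%d" % i)
print(round(hll.count()))  # an estimate close to 50000, not an exact count
```

Adding the same item again never changes the estimate, since each register only keeps the maximum rank it has seen. That is what makes the structure usable for counting unique elements in a stream.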
- The Only Probability Cheatsheet You'll Ever Need
- Where priors come from (some popular distribution)
- Probabilistic Programming & Bayesian Methods for Hackers
- The Advanced Matrix Factorization Jungle
- R, STATISTICS, PSYCHOLOGY, OPEN SCIENCE, DATA VISUALIZATION
p-value
- Scientific method: Statistical errors
- P value ban: small step for a journal, giant leap for science
- Science Isn’t Broken
- Odds Are, It's Wrong
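- Zuo Er Duo Hao Zi on cloud computing: it all comes down to operations (in Chinese)
- Transaction processing in distributed systems (in Chinese)
- "Sounds impressive but I don't get it" explained: "Big Data" (in Chinese)
- How does Hadoop relate to big data? And to Spark? Complementary? Parallel? (in Chinese)
- Big data architecture in the post-Hadoop era (in Chinese)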
- Elements of Scale: Composing and Scaling Data Platforms
- http://www.analyticsvidhya.com/
- The data scientist's training handbook (in Chinese)
- Mining of Massive Datasets - The book is based on Stanford Computer Science course CS246: Mining Massive Datasets
- Stanford Computer Science course CS246: Mining Massive Datasets
- Algorithms for Big Data - This class will give you a biased sample of techniques for scalable data analysis
- What to do with “small” data?
- To convert xlsx to csv on the Linux command line, use `ssconvert` from Gnumeric
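
For scripted conversions, the `ssconvert` tip can be wrapped in a small Python helper. This is a sketch under assumptions: the function names and file paths are hypothetical, and it relies on `ssconvert` inferring the output format from the `.csv` extension when called as `ssconvert INFILE OUTFILE`.

```python
import shutil
import subprocess


def build_ssconvert_command(xlsx_path, csv_path):
    """Return the argument list for `ssconvert INFILE OUTFILE` (hypothetical helper)."""
    return ["ssconvert", xlsx_path, csv_path]


def xlsx_to_csv(xlsx_path, csv_path):
    """Convert a spreadsheet to CSV by shelling out to Gnumeric's ssconvert."""
    if shutil.which("ssconvert") is None:
        # Gnumeric is not installed or ssconvert is not on PATH
        raise RuntimeError("ssconvert not found; install Gnumeric first")
    # check=True raises CalledProcessError if the conversion fails
    subprocess.run(build_ssconvert_command(xlsx_path, csv_path), check=True)


# Example call (file names are placeholders):
# xlsx_to_csv("report.xlsx", "report.csv")
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.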