Skip to content

Latest commit

 

History

History
89 lines (76 loc) · 3.95 KB

XX00-outlines.asciidoc

File metadata and controls

89 lines (76 loc) · 3.95 KB

Outline

Additional Content

  1. Statistics and sampling — this is a shitshow. Cleaning up the sections on sampling may be worth it, and maybe a paragraph on underflow issues.

    • sampling strategies

    • covariance

    • sketching

  2. Advanced patterns — a dumping ground for things that didn’t make it in to chapters 6-9. Some of this is worth cleaning up.

    • Handling Ties when Ranking Records

    • Selecting a Single Maximal Record Within a Group, Ignoring Ties

    • Selecting Records Having the Top K Values in a Group (discarding ties)

    • Selecting Records Having the Top K Values in a Table

    • Parsing a Delimited String into a Collection of Values

    • Flattening a Bag Generates Many Records

    • Flattening a Tuple Generates Many Columns

    • Flatten on a Bag Generates Many Records from a Field with Many Elements

    • Flatten on a Tuple Folds it into its Parent

    • Generating Data by Distributing Assignments As Input

    • Generating a Sequence Using an Integer Table

    • Generating Pairs

    • Transpose record into attribute-value pairs

    • Decorate-Flatten-Redecorate — Generate a won-loss record

    • Decorate-Flatten-Redecorate — Run Expectancy

    • Co-Grouping Elements from Multiple Tables

    • Cube and rollup

    • Stitch and Over

    • multi-join

  3. Event streams — has some good examples, no real flow. I’d be excited to get geo-ip matching in the book, as it demonstrates a range join.

    • loading and parsing logs

    • histogram of page views

    • funnel analysis (what are the common paths to reach each page)

    • sessionizing

    • geo-IP matching and range queries

    • benign DDOS

  4. Data munging. This is some of the earliest material and thus some of the messiest. I don’t believe this is worth reworking.

  5. Text analysis — the prose is in very rough shape, but the code is worked through.

    • tokenizing a wordbag

    • term statistics — count, relative frequency, range, dispersion

    • removing stopwords

    • Good-Turing smoothing

    • Why you should use crude methods for corrective measures on data

    • document statistics — tf/idf, pmi

  6. Big Data Ecosystem — readable; would like this to go in

  7. Organizing data and 43. Commandline mojo and cat herding — section on data formats, and section on commandline tools: wc, cut, etc. This is actually in decent shape, if you think it’s in scope

    • data formats:

    • JSON,

    • TSV,

    • XML,

    • why XML sucks,

    • Parquet,

    • Other

    • commandline tools

  8. Tips and Gotchas — this is a dumping ground for "little things I’ve learned". If this can go in as just a set of bullet-points, I think some of it is valuable

  9. Direct Java API — readable; if you agree with what’s there, would like this to go in.

  10. Advanced Pig — the material that’s there, on pig config variables and two of the fancy joins, is not too messy. I’d like to at least tell readers about the replicated join, and probably even move it into the earlier chapters. The most we should do here would be to also describe an inline Python UDF and a Java UDF, but there’s no material for that (though I do have code examples of UDFs)

    • Simple Java UDF

    • Simple Python UDF

    • Range Join

    • Replicated (HashMap) Join

    • (Skew Join)

    • Pig tuning and configuration

  11. and 54. Hadoop internals and tuning. As you can see just from the number of files involved this is particularly disorganized.

    • What happens when a job is launched

    • A shallow dive into the HDFS

    • What happens in a Reducer, in detail

    • What happens in a Mapper, in detail

    • What a Happy Mapper looks like

    • What a Happy Reducer looks like

    • How to tune your cluster to your job

    • notable configuration settings

    • USE method

  12. HBase data modeling — mostly readable, but somewhat disconnected from the rest of the book

  13. Appendixes

    • how to get a hadoop cluster (pointers only, no instructions

    • overview of datasets

    • Cheatsheets

    • references