Data is worthless. Actually, it’s worse than worthless. It costs you money to gather, store, manage, replicate and analyze. What you really want is insight — a relevant summary of the essential patterns in that data — produced using relationships to analyze data in context.
Statistical summaries are the purest form of this activity. They will be used repeatedly in the book to come, so now that you’ve seen how Hadoop is used, they’re a good place to focus.
Some statistical measures let you summarize the whole from summaries of the parts: I can count all the votes in the state by summing the votes from each county, and the votes in each county by summing the votes at each polling station. Those types of aggregations — average/standard deviation, correlation, and so forth — are naturally scalable, but just having billions of objects introduces some practical problems you need to avoid. We’ll also use them to introduce Pig, a high-level language for SQL-like queries on large datasets.
Other statistical summaries require assembling context that grows with the size of the whole dataset. The amount of intermediate data required to count distinct objects, extract an accurate histogram, or find the median and other quantiles can become costly and cumbersome. That’s especially unfortunate because so much data at large scale has a long-tail, not a normal (Gaussian), distribution — the median is a far more robust indicator of the "typical" value than the average. (If Bill Gates walks into a bar, everyone in there is a billionaire on average.)
Standard scores are also called z-values, z-scores, normal scores, and standardized variables; the use of "Z" is because the normal distribution is also known as the "Z distribution". They are most frequently used to compare a sample to a standard normal deviate (standard normal distribution, with μ = 0 and σ = 1), though they can be defined without assumptions of normality.
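For reference, the calculation itself is tiny; here is a minimal Ruby sketch (the method name and the sample numbers are ours, purely for illustration):

    # standard score ("z-score") of a value, given the group's mean and standard deviation
    def z_score(val, mu, sigma)
      (val - mu) / sigma.to_f
    end

    z_score(90.0, 72.5, 8.1)   # => ~2.16 -- about two standard deviations above the mean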
-
Calculating Summary Statistics on Groups with Aggregate Functions
-
COUNT_STAR(), Count Distinct, count of nulls, MIN(), MAX(), SUM(), AVG() and STDEV()
-
there is a core set of aggregate functions that we use to summarize the values of a group
-
Use COUNT_STAR() to count records in a group; MIN() and MAX() to find the single largest/smallest values in a group; and SUM() to find the total of all values in a group. The built-in AVG() function returns the arithmetic mean. To find the standard deviation, use the (double check the name) function from Datafu.
-
describe the difference between COUNT and COUNT_STAR. Note that the number of null values is (count_star - count). Recommend always using COUNT_STAR unless you are explicitly conveying that you want to exclude nulls. Make sure we follow that advice.
-
demonstrate this by summarizing players' weight and height by year. Show a stock-market-style candlestick graph of weight and of height (min, avg-STDEV, avg, avg+STDEV, max), with a graph of "volume" (count, count distinct and count_star) below it. Players are getting bigger and stronger; there are more of them as the league and roster sizes grow; and there is more data (fewer nulls) after the early days.
-
the median is hard, so we will wait until the statistics chapter.
-
other summary stats (kurtosis, other higher-moments), no built-in function
-
nested FOREACH (in the previous chapter we found obp, slg and ops from counting stats; now do it for the career).
-
Aggregating Nullable Columns (NULL values don’t get counted in an average. To have them be counted, use a ternary expression to convert NULL values to zero.)
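To make the difference concrete, here is a tiny Ruby illustration with made-up numbers (in Pig or SQL you would reach for a ternary or CASE expression instead):

    vals = [4, nil, 6, nil, 10]

    # AVG-style: nulls are simply ignored
    present            = vals.compact
    avg_ignoring_nulls = present.sum / present.size.to_f           # => 6.666...

    # ternary-to-zero: nulls count as zeroes
    zero_filled        = vals.map { |v| v.nil? ? 0 : v }
    avg_counting_nulls = zero_filled.sum / zero_filled.size.to_f   # => 4.0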
-
Criteria for being in the structural patterns part: in the remainder of the book, (a) there’s only one way to do it; (b) we don’t talk about how it’s done.
Later or here or at all demonstrate combiners in m-r?
TODO: content to come
Join pl-yr on pk-te-yr: pl-pk-te-yr
Group ppty on pl: pl-g-pks-tes-yrs
Agg pgty on pk and yr: pl-g-tes-stadia
Flatten pgty on pk and yr: pl-te-stadia
Teammates: pl-yr to tm-yr-g-pls; cross to tm-yr-g-plaplbs; project to plas-plbs-gs; flatten to pla-plbs; group to pla-g-plbs; distinct to pla-d-plbs (or pl[teammates]); flatten to pla-plb (or teammates)
Weather stations: wstn-info, ws-dt-hr[obs]
Wp: art[text], art[info], art-dt-h[views]
Server logs: ip-url-time[reqs]
UFOs: sightings[plc-dt-tm]
Airports: ap-tm[alid,flid,dest,flights]
Group on year; find COUNT(*), count distinct, MIN(), MAX(), SUM(), AVG(), STDEV(), byte size
SELECT
  MIN(HR)              AS hr_min,
  MAX(HR)              AS hr_max,
  AVG(HR)              AS hr_avg,
  STDDEV_POP(HR)       AS hr_stddev,
  SUM(HR)              AS hr_sum,
  COUNT(*)             AS n_recs,
  COUNT(*) - COUNT(HR) AS hr_n_nulls,
  COUNT(DISTINCT HR)   AS hr_n_distinct -- doesn't count NULL
FROM bat_season bat
;
SELECT
  MIN(nameFirst)                     AS nameFirst_min,
  MAX(nameFirst)                     AS nameFirst_max,
  MIN(CHAR_LENGTH(nameFirst))        AS nameFirst_strlen_min,
  MAX(CHAR_LENGTH(nameFirst))        AS nameFirst_strlen_max,
  MIN(OCTET_LENGTH(nameFirst))       AS nameFirst_bytesize_min,
  MAX(OCTET_LENGTH(nameFirst))       AS nameFirst_bytesize_max,
  AVG(CHAR_LENGTH(nameFirst))        AS nameFirst_strlen_avg,
  STDDEV_POP(CHAR_LENGTH(nameFirst)) AS nameFirst_strlen_stddev,
  LEFT(GROUP_CONCAT(nameFirst), 25)  AS nameFirst_examples,
  SUM(CHAR_LENGTH(nameFirst))        AS nameFirst_strlen_sum,
  COUNT(*)                           AS n_recs,
  COUNT(*) - COUNT(nameFirst)        AS nameFirst_n_nulls,
  COUNT(DISTINCT nameFirst)          AS nameFirst_n_distinct
FROM bat_career bat
;
SELECT
  player_id,
  MIN(year_id)         AS yearBeg,
  MAX(year_id)         AS yearEnd,
  COUNT(*)             AS n_years,
  MIN(HR)              AS hr_min,
  MAX(HR)              AS hr_max,
  AVG(HR)              AS hr_avg,
  STDDEV_POP(HR)       AS hr_stddev,
  SUM(HR)              AS hr_sum,
  COUNT(*)             AS n_recs,
  COUNT(*) - COUNT(HR) AS hr_n_nulls,
  COUNT(DISTINCT HR)   AS hr_n_distinct -- doesn't count NULL
FROM bat_season bat
GROUP BY player_id
ORDER BY hr_max DESC
;
Our next pattern is to transpose fields from each row into records having a column with the field name and a column with the field value, sometimes called attribute-value form.
TODO: content to come
In the structural operations chapter, we brought up the subject of calculating quantiles (an equal-depth histogram), but postponed the discussion, judging it to be fiendishly hard. Calculating even an exact median — the simplest case — in a single map-reduce flow is not just hard, it’s provably impossible (REF cormode paper).
The issue is that you need to get all candidates for the edge of a bin onto the same reducer, and know the number of elements that precede the candidates on your reducer. From the mapper, however, it’s impossible to know what keys to assign without knowing the global distribution — the very thing we want to calculate! /end move to statistics)
SELECT COUNT(*), CEIL(COUNT(*)/2) AS midrow FROM bat_career ;

SELECT G, cols.*
FROM bat_career bat,
     (SELECT COUNT(*) AS n_entries, CEIL(COUNT(*)/2) AS midrow FROM bat_career) cols
ORDER BY HR
LIMIT 1 OFFSET 8954
;
Well, we’ve met another operation with this problem: the sort (ORDER BY) operation. It does a first pass to sample the global distribution of keys, then a full map-reduce to place ordered values on the same reducer. Its numerate younger brother, RANK, will do what we need. The quartiles — the boundaries of the four bins each holding 25% of the values — …
(Show using RANK and then filter; use the "pre-inject and assert global values" trick for the bin size. Handle the detail of needing to average two values when a boundary splits an index, e.g. the median of a table with an even number of rows; see the sketch below.)
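Here is that boundary detail in isolation, as a plain-Ruby sketch (names and data are ours): once the values are ranked, the median is the middle value, or the average of the two middle values when the count is even.

    def median_of_ranked(sorted_vals)
      n   = sorted_vals.size
      mid = n / 2
      n.odd? ? sorted_vals[mid] : (sorted_vals[mid - 1] + sorted_vals[mid]) / 2.0
    end

    median_of_ranked([1, 3, 9, 20])   # => 6.0  (even count: average the two middle values)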
-
Random sampling using the traditional pseudo-random number generators (which can be dangerous; we’ll tell you how to do it right) (use input filename as seed)
-
Consistent sampling returns a fraction of records by key: if a record with the key "chimpanzee" is selected into the sample, all records with that key are selected into the sample.
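A minimal sketch of the idea in Ruby (our own naming, not a library call): hash the key, not the whole record, so every record sharing a key makes the same keep-or-drop decision.

    require 'digest/md5'

    # keep a record whenever the digest of its *key* starts with the given prefix;
    # a two-hex-digit prefix keeps roughly 1/256th of the keys
    def in_consistent_sample?(key, hex_prefix = '00')
      Digest::MD5.hexdigest(key.to_s).start_with?(hex_prefix)
    end

    in_consistent_sample?('chimpanzee')   # same answer for every record keyed 'chimpanzee'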
-
(with/without replacement; weighted)
-
Reservoir sampling selects a given number of records. A uniform reservoir sample with count 100, say, would return 100 records, each with the same chance of being selected, regardless of the size of the dataset.
-
Subuniverse sampling selects a set of records and all records associated with them — useful when you want to be able to do joins on the sampled data, or to select a dense subgraph of a network. (TECH: is "dense subgraph" right?)
-
Stratified sampling: sampling from groups/bins/strata/whatever - http://en.wikipedia.org/wiki/Stratified_sampling
-
Sampling into multiple groups eg for bootstrapping
-
Note that pig sample is mathematically lame (see Datafu for why)
-
Note that pig sample is nice about eliminating records while loading (find out if Datafu does too)
-
Warning: I may have written lies about reservoir sampling; make sure to review.
-
Spatial Sampling
-
Also: generating distributions (use the random.org data set and generate a column for each dist using it)
-
Expand the random.org by taking each r.o number as seed
-
http://opencoursesfree.org/archived_courses/cs.berkeley.edu/~mhoemmen/cs194/Tutorials/prng.pdf
-
"numbers with statistical properties of randomness. Note that I didn’t write “random numbers,” but rather, “numbers with statistical properties of randomness.”"
-
Make sure you have enough bits
-
Even 52 cards gives 52! =~ 2^225 possible orderings — about 225 bits of permutation… you can’t possibly get every permutation for a table of even modest size
-
Make sure you look out for ties and shuffle them as well
-
Do you have to be think-y about the partitioner?
-
Download about (8 years * 365 days * 1 mebibyte) of randoms from random.org. This is, however, only about 90 million 256-bit (32-byte) numbers, or about 380 million 64-bit (8-byte) numbers.
-
Don’t just take (rand mod 25) for a 1-in-25 random sample — you’ll be biased, because 25 isn’t an exact power of two. Instead, reject values of 25 or more and try again.
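Here is that rejection approach as a Ruby sketch (the method name is ours; rand(32) stands in for drawing five random bits):

    # unbiased draw of an integer in 0..24 (a 1-in-25 choice):
    # take 5 bits (0..31); if the value falls outside the target range,
    # throw it away and draw again rather than taking a modulus
    def unbiased_pick_of_25
      loop do
        candidate = rand(32)
        return candidate if candidate < 25
      end
    end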
-
Watch out for non-reentrant rand() — mutex or something (do we need to worry about this in hadoop?)
-
Sampling-with-replacement is the most popular method for sampling from the initial data set to produce a collection of samples for model fitting. This method is equivalent to sampling from a multinomial distribution where the probability of selecting any individual input data point is uniform over the entire data set. Unfortunately, it is not possible to sample from a multinomial distribution across a cluster without using some kind of communication between the nodes (i.e., sampling from a multinomial is not embarrassingly parallel). But do not despair: we can approximate a multinomial distribution by sampling from an identical Poisson distribution on each input data point independently, lending itself to an embarrassingly parallel implementation.
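A sketch of that trick in Ruby, under our own naming: for each input record, independently draw how many times it should appear in the sample, with the Poisson rate set to (desired sample size / dataset size). This uses Knuth's multiply-uniforms method, which is fine for small rates.

    # number of times to emit this record in the with-replacement sample
    def poisson_count(rate)
      threshold = Math.exp(-rate)
      count     = 0
      product   = rand
      while product > threshold
        count   += 1
        product *= rand
      end
      count
    end

    # in the mapper: emit each record poisson_count(sample_size / dataset_size.to_f) times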
Here’s a clip from the PokerStars website (they did their homework):
-
A deck of 52 cards can be shuffled in 52! ways. 52! is about 2^225 (to be precise, 80,658,175,170,943,878,571,660,636,856,404,000,000,000,000,000 ways). We use 249 random bits from both entropy sources (user input and thermal noise) to achieve an even and unpredictable statistical distribution.
-
Furthermore, we apply conservative rules to enforce the required degree of randomness; for instance, if user input does not generate required amount of entropy, we do not start the next hand until we obtain the required amount of entropy from Intel RNG.
-
We use the SHA-1 cryptographic hash algorithm to mix the entropy gathered from both sources to provide an extra level of security
-
We also maintain a SHA-1-based pseudo-random generator to provide even more security and protection from user data attacks
-
To convert random bit stream to random numbers within a required range without bias, we use a simple and reliable algorithm. For example, if we need a random number in the range 0-25:
o we take 5 random bits and convert them to a random number 0-31
o if this number is greater than 25 we just discard all 5 bits and repeat the process
-
This method is not affected by biases related to the modulus operation for generation of random numbers that are not 2^n, n = 1, 2, ..
-
To perform an actual shuffle, we use another simple and reliable algorithm:
o first we draw a random card from the original deck (1 of 52) and place it in a new deck - now original deck contains 51 cards and the new deck contains 1 card
o then we draw another random card from the original deck (1 of 51) and place it on top of the new deck - now original deck contains 50 cards and the new deck contains 2 cards
o we repeat the process until all cards have moved from the original deck to the new deck
-
This algorithm does not suffer from "Bad Distribution Of Shuffles" described in [2]
[2] "How We Learned to Cheat at Online Poker: A Study in Software Security" - http://itmanagement.earthweb.com/entdev/article.php/616221
[3] "The Intel Random Number Generator" - http://www.cryptography.com/resources/whitepapers/IntelRNG.pdf
-- Consistent sample of events
SELECT
  ev.event_id,
  LEFT(MD5(CONCAT(ev.game_id, ev.event_id)), 4) AS evid_hash,
  ev.*
FROM events ev
WHERE LEFT(MD5(CONCAT(ev.game_id, ev.event_id)), 2) = '00';
-- Consistent sample of games -- all events from the game are retained
-- FLO200310030 has gid_hash 0000... but evid_hash 0097 and so passes both
SELECT
  ev.event_id,
  LEFT(MD5(ev.game_id), 4) AS gid_hash,
  ev.*
FROM events ev
WHERE LEFT(MD5(ev.game_id), 2) = '00';
Out of 1,962,193 events in 2010, 7,665 were expected (1/256th of the total); we got 8,159 by game and 7,695 by event.
SELECT n_events, n_events/256, n_by_game, n_by_event
FROM
  (SELECT COUNT(*) AS n_events   FROM events) ev,
  (SELECT COUNT(*) AS n_by_event FROM events WHERE LEFT(MD5(CONCAT(game_id,event_id)),2) = '00') ev_e,
  (SELECT COUNT(*) AS n_by_game  FROM events WHERE LEFT(MD5(game_id),2) = '00') ev_g
;
Don’t generate a random number as a sampling or sort key in a map job. The problem is that map tasks can be restarted (because of speculative execution, a failed machine, and so forth), and with random keys each of those runs will dispatch records differently. It also makes life hard in general when your jobs aren’t predictable run-to-run. You want to make friends with a couple of records early in the source data and keep track of their passage through the full data flow. Similar to the best practice of using intrinsic rather than synthetic keys, it’s always better to use intrinsic metadata — truth should flow from the edge inward.
-
Random sample:
-
fixed size of final sample
-
fixed probability (binomial) for each element
-
spatial sampling
-
with/without replacement
-
weighted
-
by interestingness
-
stratified: partition important features into bins, and sample tastefully to achieve a smooth selection across bins. Think of density of phase space
-
consistent sample: the same sampling parameters on the same population will always return the same sample.
-
-
Algorithms:
-
iterative
-
batch
-
scan
-
reservoir
-
-
graph:
-
sample to preserve connectivity
-
sample to preserve local structure
-
sample to preserve global representation
-
-
random variates
-
Ziggurat algorithm: uses a simple lookup table to accelerate generation of complex distributions
-
We’re not going to worry about extracting samples larger than fit on one reducer.
The simplest kind of sample is a uniform sample selecting a fraction p of the full dataset.
The naive way is to generate a random number for each line and select the line if that number is less than the probability p. Don’t do this.
You want your job to be deterministic: in the large, so that it is predictable and debuggable; in the small, because a mapper may be re-tried if the attempt fails, or while doing speculative execution.
What we’ll do instead is use a standard digest function (for example, the MD5 hash or murmur hash). A digest function turns any key into a fixed-size number, with the important property that any small change in the input string results in an arbitrarily large change in the output number. It’s deterministic (the same input always gives the same output) but effectively washes out all information from the input string.
Then, rather than compare a random number against a fraction, we’ll turn the digest into an integer (by treating the lowest 64 bits as an integer) and compare it to that fraction of the largest 64-bit number.
<remark>confirm that it’s LSB not MSB we want</remark>
A [http://github.com/mrflip/wukong/blob/master/examples/sample_records.rb Ruby example] is available in the wukong examples:
#
# Probabilistically emit some fraction of records/lines
#
# Set the sampling fraction at the command line using the
#   --sampling_fraction=
# option: for example, to take a random 1/1000th of the lines in huge_files,
#   ./examples/sample_records.rb --sampling_fraction=0.001 --go huge_files sampled_files
#
class Mapper < Wukong::Streamer::LineStreamer
  include Wukong::Streamer::Filter

  def initialize(*)
    super
    get_sampling_threshold
  end

  # randomly decide to emit +sampling_fraction+ fraction of lines
  def emit? line
    digest_i(line) < @sampling_threshold
  end

  protected

  # Uses the sampling_fraction, a real value between 0 and 1 giving the fraction of lines
  # to emit: at sampling_fraction=1 all records are emitted, at 0 none are.
  #
  # @return [Integer] between 0 and MAX_DIGEST; values below the sampling_threshold should be emitted.
  def get_sampling_threshold
    if not options[:sampling_fraction] then raise ArgumentError, "Please supply a --sampling_fraction -- a real value between 0 and 1" ; end
    @sampling_threshold = (Float(options[:sampling_fraction]) * MAX_DIGEST).to_i
  end

  # @return [Integer] the last 64 bits of the record's md5 hash
  def digest_i(record)
    Digest::MD5.digest(record.to_s).unpack('Q*').last
  end

  # One more than the largest possible digest int that digest_i will return.
  MAX_DIGEST = 2 ** 64
end

# Execute the script with nil reducer
Script.new( Mapper, nil ).run
-
See this rapleaf blog post for why randomness is considered harmful.
Another, often faster, way of doing random sampling is to generate a geometrically-distributed (rather than uniformly-distributed) series of skip lengths: for each value R, your mapper skips R lines and emits the next one. See the section on statistics for how to get a geometrically-distributed number.
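A sketch of the skip-length generator in Ruby (our own naming, assuming a sampling fraction p): instead of flipping a coin per line, draw how many lines to skip before the next emitted one.

    # geometrically-distributed gap: number of lines to skip for sampling fraction p
    def next_skip(p)
      (Math.log(1.0 - rand) / Math.log(1.0 - p)).floor
    end

    # in the mapper: decrement the skip counter on each line; when it hits zero,
    # emit the line and draw a fresh skip with next_skip(p)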
We want to generate a sample of fixed size N_s — say, 1000 arbitrary records — no matter how large or small the dataset. (Clearly, if the dataset is smaller than N_s, you will emit the full dataset.)
Suppose you assigned every record an arbitrary sample key, and sorted on that key. Choosing the first N_s records would be a fair way to get our sample. In fact, this is how most card games work: shuffle the records (cards) into an arbitrary order, and draw a fixed-size batch of cards from the collection.
But! of course, a total sort is very expensive. As you may guess, it’s unnecessary.
Each mapper creates a "reservoir" of size N_s for the rows it will select. Add each record to the reservoir; if there are then more than N_s occupants, reject the record with the highest sample index. (In practice, you won’t even add a record if it would be that highest record.) A Fibonacci heap (implementing a priority queue) makes this very efficient.
Ruby’s stdlib has a SortedSet class — a Set that guarantees that its elements are yielded in sorted order (according to the return values of their #<=> methods) when iterating over them.
Each mapper outputs the sampling index of each preserved row as the key, and the rest of the row as the value;
It’s essential that you keep the sampling index given by the first pass.
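Here is a minimal sketch of that per-mapper reservoir (class and method names are ours). It uses a plain sorted array for clarity; the Fibonacci heap or SortedSet mentioned above would make the eviction cheaper.

    class Reservoir
      def initialize(max_size)
        @max_size = max_size
        @members  = []            # [sample_key, record] pairs, kept sorted by sample key
      end

      def add(sample_key, record)
        # don't even add the record if it would immediately be the highest one
        return if @members.size >= @max_size && sample_key >= @members.last[0]
        @members << [sample_key, record]
        @members.sort_by! { |key, _| key }
        @members.pop if @members.size > @max_size    # evict the highest sample key
      end

      # emit: sample key as the key, the rest of the row as the value
      def each(&blk) ; @members.each(&blk) ; end
    end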
To make a set of uniformly distributed, uncorrelated numbers — say, an n1-by-n2 matrix of them:
matrix  = []
n_elems = n1 * n2
(0...n1).each do |ii|
  matrix[ii] = row = []
  (0...n2).each do |jj|
    row[jj] = digest( ii*n2 + jj ).fdiv(MAX_DIGEST)   # normalize the digest to a float in [0,1)
  end
end
simplify and summarize patterns in data: even simple counts and frequencies require some craft at large scale; measures that require any global context, like the median, become fiendish.
Describe long-tail and normal distribution
Build intuition about long-tail
Log counts and combinators (see blekko posts)
Libraries:
-
-
Distribution — probability distributions; uses fast GSL or Statistics2 if available
-
-
ruby-gsl-ng — interface to the GSL, JRuby-compatible. Incomplete tho.
Line numbering is astonishingly hard. Each reducer needs to know how many records came before it — but that information is non-local. It’s a total sort with a degree of difficulty.
Choose how you’re going to partition and order the data.
Pig comes with a sampling partitioner to do a total sort with no pre-knowledge of the key distribution. If you know something about the data, you can partition it yourself, saving a pass over the data. Temperature has a non-uniform distribution (there are more warm and chilly days than boiling or frigid days). Let’s be thoughtful but not overthink it; we’ll take 25 C as the midpoint.
# TODO: inverse distribution
return low_bound if temp < -30
return hi_bound  if temp >= 80
(temp - 25.0)
sidebar: If the partition is the filename: in Pig, look at the PathPartitioner; in Wukong and Hadoop streaming, use the ENV['map_input_file'] environment variable. To partition randomly, see Consistent Random Sampling.
For each record, emit a
[partition_key] B [key] [record]
Each mapper must also emit how many keys it saw for each partition key, and send that list to every partition:
0 A [p0_count, p1_count, ... pN_count] 1 A [p0_count, p1_count, ... pN_count] ... N A [p0_count, p1_count, ... pN_count]
Now each reducer can figure out, in partition order, which partition it is; sum the counts for every partition that precedes it; and number its own records starting from that offset, as sketched below.
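A sketch of that reducer-side bookkeeping (plain Ruby, our own naming): each reducer's starting line number is the total count from every partition that sorts before its own.

    def starting_offset(my_partition, counts_by_partition)
      counts_by_partition
        .select { |partition, _count| partition < my_partition }
        .sum    { |_partition, count| count }
    end

    # partition 2 of { 0 => 1000, 1 => 1500, 2 => 800 } numbers its records starting at 2500
    starting_offset(2, { 0 => 1000, 1 => 1500, 2 => 800 })   # => 2500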
Compare: count bins (histogram) for 100M rows — what about when there are 10,000 possible values? 1M values? 1B possible values?
observations:
-
it’s hadoop, we don’t have to be total wusses
-
the reducer already does a sort, so worrying about order N beyond that is silly
set of numbers with exact ranks below and above — might not be members though
cap on output from any mapper — target of data sent to each reducer
if binning is easy then data sent to the reducer will be small; if not, it will be large
can also adjust the partition vs local sort
-
single precision: 24 bits of mantissa, exp -126 to +127 (8 bits)
-
double precision: 53 bits, exp -1022 to +1023 (11 bits)
-
avoid subtracting nearly equal numbers [1]
-
avoid adding small numbers to very large numbers
-
stop and think any time you are mixing addition and exponentiation
-
stop and think any time you make a number huge, then regular ( log(fact(x)) )
-
stop and think any time you make a number tiny, then regular ( 1 - exp(-exp(x)) )
Times stored as floats are sneaky examples of this
use special functions for the following:
-
log(1 + x) when x might be small: log_one_plus(x)
-
log(1 + exp(x)): log_one_plus_exp(x)
-
log( -log(1 - x) ): complementary_log_log(x)
-
1 - exp(-exp(x)): complementary_log_log_inverse(x)
-
log(n!): log_fact(x)
-
exp(x) - 1: exp_x_minus_one(x)
-
log( x / (1-x) ): the "logit" function, logit(x)
-
exp(x) / (1 + exp(x)): inverse logit, inverse_logit(x)
-
log(logit(x)): log logit, log_logit(x)
-
Ratios of factorials: log( 200! / (190! * 10!) ) = logfact(200) - logfact(190) - logfact(10)
Always add a comment explaining why you used these crazy functions, or some helpful soul will "refactor" your code to be simpler (and, unwittingly, wrong).
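To make a couple of these concrete, here are minimal Ruby sketches following the small-argument approximations listed above (the function names follow the text, not any particular library):

    def log_one_plus(x)
      # log(1 + x); for tiny x, use the series x - x**2/2 so the "1 + x" doesn't swallow x
      x.abs > 1e-4 ? Math.log(1.0 + x) : (1.0 - 0.5 * x) * x
    end

    def exp_x_minus_one(x)
      # exp(x) - 1; for tiny x, use the series x + x**2/2
      x.abs > 1e-4 ? Math.exp(x) - 1.0 : (1.0 + 0.5 * x) * x
    end

    log_one_plus(1e-12)   # => 1.0e-12; the naive Math.log(1 + 1e-12) keeps only a few good digits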
-
def approx_eq(xx,yy) (xx - yy).abs < TOL ; end
-
If you want to find weirdness, 0.1 cannot be exactly represented in a float.
-
to uniquely represent in decimal,
-
for a float: 9 decimal digits of mantissa, plus exponent and sign (e.g. %12.9e)
-
for a double: 17 decimal digits of mantissa (plus sign and exponent)
-
note %a — a direct hex representation of the floating-point number: "%a" % 1e-100 is "0x1.bff2ee48e053p-333"; "%a" % 0.1 is "0x1.999999999999ap-4"
-
TODO: ensure some calculations would cause underflow, overflow, etc with float/int; and maybe even double/long.
reference: Avoiding Overflow, Underflow, and Loss of Precision
reference: Don’t Store That in a Float
Generate an exponential variate (and, by taking its floor, a geometric variate):

expdist  = Math.log( rand ) / log_one_plus( -geomparam )
geomdist = expdist.floor
Error function                  | Math.erf    | http://www.ruby-doc.org/core-1.9.3/Math.html#method-c-erf
Complementary error function    | Math.erfc   | http://www.ruby-doc.org/core-1.9.3/Math.html#method-c-erfc
Inverse error function          |
Phi (standard normal CDF)       | 0.5 * ( 1 + erf( x / sqrtof2 ) )
Phi inverse                     | sqrt(2.0) * ierf(2.0*x - 1.0), or the rational approximation:

    CC0, CC1, CC2 = [2.515517, 0.802853, 0.010328]
    DD0, DD1, DD2 = [1.432788, 0.189269, 0.001308]
    def approx_rational(t)
      numerator   = (CC2*t + CC1)*t + CC0
      denominator = ((DD2*t + DD1)*t + DD0)*t + 1.0
      t - numerator / denominator
    end
    def inv_phi(p)
      if p < 0.5 then -approx_rational( Math.sqrt(-2.0 * Math.log(p)) )
      else             approx_rational( Math.sqrt(-2.0 * Math.log(1.0 - p)) )
      end
    end

Gamma                           | Math.gamma  | http://www.ruby-doc.org/core-1.9.3/Math.html#method-c-gamma
Log Gamma                       | Math.lgamma | http://www.ruby-doc.org/core-1.9.3/Math.html#method-c-lgamma
log(1 + x) for small x          | (fabs(x) > 1e-4) ? log(1.0 + x) : (-0.5*x + 1.0)*x
exp(x) - 1 for small x          | (fabs(x) > 1e-4) ? exp(x) - 1.0 : (0.5*x + 1.0)*x
log(n!)                         | Math.lgamma(n+1)
fraction+exponent of a float    | Math.frexp  | fr, ex = Math.frexp(val)  # fr a float, ex an int; fr * 2**ex == val and Math.ldexp(fr, ex) == val
Uniform distributed (0.2 uSec)  | rand
Gamma distributed               |
Normal distributed (1.9 uSec)   | mean + stddev * Math.sqrt(-2.0 * Math.log(uniform)) * Math.cos(2.0 * Math::PI * uniform)
                                | mean + stddev * Math.sqrt(-2.0 * Math.log(uniform)) * Math.sin(2.0 * Math::PI * uniform)
Beta distributed                | uu = gammadist(a, 1) ; vv = gammadist(b, 1) ; uu / (uu + vv)
Cauchy distributed              | median + scale * Math.tan(Math::PI * (uniform - 0.5))
Chi-square distributed          | gammadist(0.5 * degrees_of_freedom, 2.0)
Exponential distributed         | -mean * Math.log(uniform)
Inverse gamma distributed       | 1.0 / gammadist(shape, 1.0 / scale)
Laplace distributed             | mean + Math.log(2) + ((u < 0.5 ? 1 : -1) * scale * Math.log(u < 0.5 ? u : 1 - u))
Log normal distributed          | Math.exp(normal(mu, sigma))
Poisson distributed             |
Student-t distributed           | normal / ((chi_square(degrees_of_freedom) / degrees_of_freedom) ** 0.5)
Weibull distributed             | scale * ((-Math.log(uniform)) ** (1.0 / shape))
Geometric distributed           | expdist( -1.0 / log_one_plus(-geomparam) ).floor

Binomial probability:

    # @param pp [Float]   probability of success
    # @param qq [Float]   probability of failure (1 - pp)
    # @param mm [Integer] number of successes
    # @param nn [Integer] number of failures
    def binomial_prob(pp, qq, mm, nn)
      log_bin  = lgamma(mm + nn + 1.0)
      log_bin -= lgamma(nn + 1.0) + lgamma(mm + 1.0)
      log_bin += (mm * log(pp)) + (nn * log(qq))
      exp(log_bin)
    end
references:
-
John D Cook’s Stand Alone Code
-
ealdent’s simple-random, CPOL (MIT-compatible) license
The naive method is var = ( sum(x^2) - (sum(x)^2 / count) ) / (count - 1). (If it’s the entire population, divide by count, not count-1; the difference is negligible for large count.)
But wait!! We’re subtracting two possibly-close numbers, breaking the cardinal rule of numerical computing.
Welford’s method calculates these moments in a streaming fashion, in one pass. It avoids the danger of loss of numerical precision present in the naive approach.
field :count, Integer, doc: "Number of records seen so far"
field :mm,    Float,   doc: "A running estimate of the mean"
field :ss,    Float,   doc: "A running proportion of the variance; the variance is `ss / (count-1)`"

class Welford
  def initialize
    first_row(0.0)
  end

  def first_row(first_val)
    @count = 0
    @mm    = first_val
    @ss    = 0.0
  end

  def process(val)
    @count += 1
    diff    = val - @mm
    @mm    += diff / @count
    @ss    += diff * (val - @mm)   # second factor uses the *updated* mean
  end

  def stop
    emit( results )
  end

  def results
    [ count, mean, variance, stddev, mm, ss ]
  end

  def mean
    return 0.0 if count < 1
    mm
  end

  def variance
    return 0.0 if count < 2
    ss / (count - 1)
  end

  def stddev
    Math.sqrt(variance)
  end
end

class WelfordReducer
  mm_all = sum{|count, mm| count * mm } / sum{|count| count }
  ss_all = sum{ FIXME }
end
Weighted:
class WeightedWelford < Welford
  def first_row(first_val)
    @total_weight = 0.0
    super
  end

  def process(val, weight)
    @count          += 1   # count the observation here; the unweighted super would mis-apply the update
    new_total_weight = @total_weight + weight
    diff             = val - @mm
    rr               = diff * weight / new_total_weight
    @mm             += rr
    @ss             += @total_weight * diff * rr
    @total_weight    = new_total_weight
  end

  def variance
    ( ss * count.to_f / total_weight ) / (count - 1)
  end
end

class WeightedWelfordReducer
  mm_all = sum{ FIXME: what goes here }
  ss_all = sum{ FIXME: what goes here }
end
Naively:
class Naive < Welford
  field :sum,    Float, doc: "The simple sum of all the numbers"
  field :sum_sq, Float, doc: "The simple sum of squares for all the numbers"

  def first_row(*)
    @sum    = 0
    @sum_sq = 0
    super
  end

  def process(val)
    @sum    += val
    @sum_sq += val * val
    super
  end

  def results
    super + [ naive_mean, naive_variance, naive_stddev, sum, sum_sq ]
  end

  def naive_mean     ; ( sum / count )                              ; end
  def naive_variance ; ( sum_sq - ((sum * sum)/count) ) / (count-1) ; end
  def naive_stddev   ; Math.sqrt(naive_variance)                    ; end
end
Directly:
class DirectMoments < Naive
  field :known_count, doc: "The already-computed final count of all values"
  field :known_mean,  doc: "The already-computed final mean of all values"
  field :sum_dev_sq,  doc: "A running sum of the squared difference between each value and the mean"
  field :sdsq_adj,    doc: "A compensated-summation correction of the running sum"

  def first_row(*)
    @sum_dev_sq = 0
    @sdsq_adj   = 0
    super
  end

  def process(val)
    @sum_dev_sq += (val - known_mean)**2
    @sdsq_adj   += (val - known_mean)
    super
  end

  def results
    super + [ direct_mean, direct_variance, direct_stddev, compsum_variance, @sum_dev_sq, @sdsq_adj ]
  end

  def direct_mean     ; known_mean                 ; end
  def direct_stddev   ; Math.sqrt(direct_variance) ; end
  def direct_variance ; sum_dev_sq / (count - 1)   ; end
  def compsum_variance
    ( sum_dev_sq - (sdsq_adj**2 / count) ) / (count-1)
  end
end
To find higher moments,
-
each partition calculates the statistical moments (g0, mu, var, alpha_3, alpha_4)
-
for a time series, g0 is the duration; for a series, it's the count
-
-
now get g_mo_part(mo, part) := mm(mo, part) * g0(part)
-
then raw_moment(mo) := g_mo_all / g0_all
-
from raw moments, get central moments: theta_mo(mo) := Expectation[ (val - mean)**mo ]
-
finally
-
mean_all := m_1_all
-
var_all := theta_2_all
-
alpha_3_all := theta_3_all / (var_all ** 1.5)
-
alpha_4_all := theta_4_all / (var_all ** 2)
-
references:
-
John Cook’s Accurately computing running variance, who in turn cites
-
"Chan, Tony F.; Golub, Gene H.; LeVeque, Randall J. (1983). Algorithms for Computing the Sample Variance: Analysis and Recommendations. The American Statistician 37, 242-247."
-
"Ling, Robert F. (1974). Comparison of Several Algorithms for Computing Sample Means and Variances. Journal of the American Statistical Association, Vol. 69, No. 348, 859-866."
-
class CompensatedSummer
  field :tot, Float, doc: "Total of all values seen so far"
  field :adj, Float, doc: "Accumulated adjustment to total"

  def first_record(val)
    self.tot = val
    self.adj = 0
  end

  def process(val)
    old_tot = @tot
    adj_val = val - @adj
    @tot    = old_tot + adj_val
    @adj    = (@tot - old_tot) - adj_val
  end
end
Consider this diagram, adapted from What Every Computer Scientist Should Know About Floating-Point Arithmetic
      ____total____
  +         _valH_ _valL_
  =   ___tmptot____

      ___tmptot____
  -   ____total____
  =         _valH_

            _valH_
  -         _valH_ _valL_
  =                _valL_   (-corr)
do:
( 1 / (count-1) ) * sum[ ((val_x - mean_x) / stddev_x) * ((val_y - mean_y) / stddev_y) ]
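A direct two-pass Ruby transcription of that formula, with our own naming, just to pin down the terms (this is the naive version; the surrounding discussion is about computing it accurately at scale):

    def correlation(xs, ys)
      n      = xs.size
      mean_x = xs.sum / n.to_f
      mean_y = ys.sum / n.to_f
      sd_x   = Math.sqrt(xs.sum { |x| (x - mean_x)**2 } / (n - 1))
      sd_y   = Math.sqrt(ys.sum { |y| (y - mean_y)**2 } / (n - 1))
      xs.zip(ys).sum { |x, y| ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) } / (n - 1)
    end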
To combine covariance of two sets,
CovAB = Cov_A + Cov_B + ( (mean_x_a - mean_x_b) * (mean_y_a - mean_y_b) * (count_a * count_b / (count_a + count_b)) )
REFERENCE: How to calculate correlation accurately
count = x_vals.size
sx    = x_vals.sum.to_f
sy    = y_vals.sum.to_f
stt   = 0.0
sts   = 0.0
x_vals.zip(y_vals).each do |xval, yval|
  t    = xval - (sx / count)
  stt += t * t
  sts += t * yval
end
slope     = sts / stt
intercept = (sy - sx*slope) / count
To make a naive algorithm fail,
num_samples = 1e6

def generate_samples
  xvals = num_samples.times.map{|i| x_offset + i * x_spread }
  yvals = xvals.map{|xval| (actual_slope * xval) + actual_intercept + (actual_variance * normaldist()) }
end

# large constant offset causes loss of precision:
actual_slope     = 3
actual_intercept = 1e10
actual_variance  = 100
x_offset         = 1e10
x_spread         = 1
generate_samples(...)

# very large slope causes inaccurate intercept:
actual_slope     = 1e6
actual_intercept = 50
actual_variance  = 1
x_offset         = 0
x_spread         = 1e6
generate_samples(...)
-
John Cook, Comparing two ways to fit a line to data
break numbers into bins where we can conveniently do exact Bignum math.
running totals — int_part, frac_part; frac_part = frac_part * smallest possible
keep sums using
-
Random Sampling from Databases, Frank Olken, 1993
-
RBTree for ruby
-
Stack Overflow: How to pick random (small) data samples using Map/Reduce? answer by Bkkbrad
From @dataspora’s [http://github.com/dataspora/big-data-tools/blob/master/samplen.py Big Data Tools]:
"The reservoir sampling algorithm outputs a sample of N lines from a file of undetermined size. It does so in a single pass, using memory proportional to N. These two features — (i) a constant memory footprint and (ii) a capacity to operate on files of indeterminate size — make it ideal for working with very large data sets common to event processing. "
Working python code (see [http://github.com/dataspora/big-data-tools/blob/master/samplen.py Big Data Tools] for current version):
import sys
import random

if len(sys.argv) == 3:
    input = open(sys.argv[2],'r')
elif len(sys.argv) == 2:
    input = sys.stdin
else:
    sys.exit("Usage: python samplen.py <lines> <?file>")

N = int(sys.argv[1])
sample = []

for i,line in enumerate(input):
    if i < N:
        sample.append(line)
    elif i >= N and random.random() < N/float(i+1):
        replace = random.randint(0,len(sample)-1)
        sample[replace] = line

for line in sample:
    sys.stdout.write(line)
-
Holistic vs algebraic aggregations
-
Underflow and the "Law of Huge Numbers"
-
Approximate holistic aggs: Median vs remedian; percentile; count distinct (hyperloglog)
-
Count-min sketch for most frequent elements
-
Approx histogram
-
Counting
-
total burgers sold - total commits, repos,
-
computing a running and/or smoothed average
-
standard deviation
-
sampling
-
uniform
-
top k
-
reservoir
-
?rolling top-k / reservoir sampling?
-
algebraic vs holistic aggregate
-
use countmin sketch to turn a holistic aggregate into an algebraic aggregate
-
quantile or histogram
-
numeric stability
-
If x and y agree to m bits, up to m bits can be lost in computing x - y.