-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wire up Standard Deviation aggregate function. #1573
Conversation
@pauldix is it worth doing a partial mean in the Map even though we still need to send all the data to the reduce to get diffs for std dev? It will distribute a small part of the load before reduce. Thoughts? |
@corylanou nah, not worth it. The time to pipe the data over the network is what will dominate here |
func MapStddev(itr Iterator, e *Emitter, tmax int64) { | ||
var values []float64 | ||
|
||
for k, v := itr.Next(); k != 0; k, v = itr.Next() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should have something here that emits the values after it gets to a certain size. Like say 1000. This may fall over later if you try to emit a values array that is super massive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so I can effectively "batch" emits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're already putting all the values into a single emit. I would just say that you need to limit the size of that. So you can do multiple emits per map
LGTM. |
Wire up Standard Deviation aggregate function.
Wait, we want population stddev, don't we? Also, that string "undefined" for undefined stddevs is going to break everyone trying to work with this data. If JSON supported floats I would suggest that it be NaN, but it doesn't. Given that, I'd say we're best off making it 0. I know that's mathematically incorrect, but it's going to be the most sensible thing for people actually working with this. |
@pauldix Aren't we technically working with samples? Since the original measurements are theoretically continuous functions, I'm not sure we have a population. |
I don't see why we'd throw out one of the measurements |
http://en.wikipedia.org/wiki/Bessel%27s_correction "While there are n independent samples, there are only n − 1 independent residuals, as they sum to 0." |
@toddboom ah ok. I'm am enlightened ;) |
Fixes #1475