This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Make our initialization sampling more robust #180

Closed
kaituo opened this issue Jun 26, 2020 · 16 comments
Labels: enhancement (New feature or request)

Comments

kaituo (Member) commented Jun 26, 2020

Is your feature request related to a problem? Please describe.
Currently, we sample 24 points from the history for cold start, with neighboring samples 60 points apart. If there is any data hole, we stop moving forward. For example, if the 3rd sample is missing, we end up with only 61 data points (with interpolation and other optimizations). This causes long initialization.

Describe the solution you'd like
We can still move forward when there is a data hole and interpolate only between samples that are present. Continuous shingles will then be used for cold start.
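
For illustration, a minimal sketch of the proposed behavior, assuming a hypothetical helper (these class and method names are not the plugin's actual API): interpolate only between two samples that are both present, and let a hole start a new continuous segment instead of aborting cold start.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch, not the plugin's actual API: interpolate only between present samples. */
public class ColdStartInterpolation {

    /**
     * @param samples     the sampled values, Optional.empty() where the query found no data
     * @param pointsApart distance between neighboring samples (e.g. 60 intervals)
     * @return continuous segments of interpolated points; a hole simply ends one segment
     *         and the next present pair starts another, instead of aborting cold start
     */
    static List<double[]> interpolatePresentSamples(List<Optional<Double>> samples, int pointsApart) {
        List<double[]> segments = new ArrayList<>();
        List<Double> current = new ArrayList<>();
        for (int i = 0; i + 1 < samples.size(); i++) {
            Optional<Double> left = samples.get(i);
            Optional<Double> right = samples.get(i + 1);
            if (left.isPresent() && right.isPresent()) {
                // both endpoints exist: fill the gap with pointsApart linear steps;
                // the right endpoint is added when it becomes the next pair's left, or below
                double start = left.get();
                double step = (right.get() - start) / pointsApart;
                for (int j = 0; j < pointsApart; j++) {
                    current.add(start + j * step);
                }
            } else if (!current.isEmpty()) {
                // a hole: close the current segment with its last present sample (the left endpoint)
                current.add(left.get());
                segments.add(current.stream().mapToDouble(Double::doubleValue).toArray());
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            // the series ended on a present pair; close the final segment with the last sample
            current.add(samples.get(samples.size() - 1).get());
            segments.add(current.stream().mapToDouble(Double::doubleValue).toArray());
        }
        return segments;
    }
}
```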

kaituo added the enhancement label on Jun 26, 2020
wnbts (Contributor) commented Jun 29, 2020

Sampling-based data collection suffers from missing data. To eliminate this issue, the robust solution is to collect all points by scanning.

kaituo (Member, Author) commented Jun 29, 2020

Why do we have to go to the other extreme and collect all of the points? Even with all of the samples, we may still suffer from missing data. Also, what do you think of my proposed solution?

wnbts (Contributor) commented Jun 29, 2020

The goal should be to make data collection robust enough that it won't become a recurring problem in the future. The proposed solution is a step in that direction but is still vulnerable to missing data points, so the problem will recur.

wnbts (Contributor) commented Jun 29, 2020

Regarding the "extreme" comment, this is not an extreme use because:

a. the workload is limited to a specified number of samples, which is far less than what visualization/exploration uses;
b. the workload occurs only once per model;
c. on sparse streams, the process results in no aggregation work; and
d. the same approach is used elsewhere in Elasticsearch.

kaituo (Member, Author) commented Jun 29, 2020

> Regarding the "extreme" comment, this is not an extreme use because: a. the workload is limited to a specified number of samples, which is far less than what visualization/exploration uses; b. the workload occurs only once per model; c. on sparse streams, the process results in no aggregation work; and d. the same approach is used elsewhere in Elasticsearch.

a. How many samples are you going to collect? 1440 samples?
b. Agreed.
c. You still need to deal with sparse data when collecting all data points, e.g., when more than 25% of the samples are missing.
d. Agreed.

I meant we don't need to second-guess ourselves so much. The current solution is fast, cheap, and relatively accurate. You and Chris have spent a lot of time getting the current approach to work; my proposal adds a minor improvement on top of it.

wnbts (Contributor) commented Jun 30, 2020

Thanks. I believe an exhaustive search over a reasonably limited range is the right solution to this issue and to any issues of this nature in the future. The initial assumption that the data stream is dense and continuous enough for data points to be easily and correctly interpolated has proven false, and real-world data increasingly shows the opposite. Continuing with that bias does not just slow down data collection and worsen the user experience; it also hurts the quality of the data and the model.

The model needs max(24 hours/interval, 256) samples. To allow for some missing data points, the request can ask for max(24 hours/interval, 512) samples.

It is true that for very sparse data even an exhaustive search does not guarantee enough data points, but it is a best effort guaranteed by the system.
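
For concreteness, a hedged sketch of that arithmetic; the constant and method names below are mine, not necessarily what the plugin uses.

```java
import java.time.Duration;

/** Hypothetical sketch of the sample-count arithmetic described above. */
public class ColdStartSampleCount {

    static final int MIN_TRAIN_SAMPLES = 256; // samples the model needs at minimum
    static final int MIN_QUERY_SAMPLES = 512; // extra headroom for missing points

    /** Samples the model needs: max(24 hours / interval, 256). */
    static long trainSamples(Duration interval) {
        return Math.max(Duration.ofHours(24).toMillis() / interval.toMillis(), MIN_TRAIN_SAMPLES);
    }

    /** Samples to request: max(24 hours / interval, 512), to tolerate some missing points. */
    static long querySamples(Duration interval) {
        return Math.max(Duration.ofHours(24).toMillis() / interval.toMillis(), MIN_QUERY_SAMPLES);
    }

    public static void main(String[] args) {
        // a 1-minute interval: 24h covers 1440 points, which already exceeds both minimums
        System.out.println(trainSamples(Duration.ofMinutes(1)));  // 1440
        System.out.println(querySamples(Duration.ofMinutes(1)));  // 1440
        // a 10-minute interval: 24h covers only 144 points, so the minimums apply
        System.out.println(trainSamples(Duration.ofMinutes(10))); // 256
        System.out.println(querySamples(Duration.ofMinutes(10))); // 512
    }
}
```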

kaituo (Member, Author) commented Jun 30, 2020

How would you deal with missing data in your exhaustive search?

wnbts (Contributor) commented Jun 30, 2020

Since missing data points from an exhaustive search are known to be missing, they are filtered out of the training data.

ylwu-amzn (Contributor) commented Jul 7, 2020

> Since missing data points from an exhaustive search are known to be missing, they are filtered out of the training data.

Just to make sure I understand correctly: we plan to query the last max(24 hours/interval, 512) data points and then feed all of them into the model directly, without accounting for missing data points? If the first data point is present but the following 200 data points are missing, should we still feed all the data points into the model? Should we train the model on non-consecutive data points?

wnbts (Contributor) commented Jul 7, 2020

> Just to make sure I understand correctly: we plan to query the last max(24 hours/interval, 512) data points and then feed all of them into the model directly, without accounting for missing data points? If the first data point is present but the following 200 data points are missing, should we still feed all the data points into the model? Should we train the model on non-consecutive data points?

If only the first point is present and the rest are missing, the first point is unusable and discarded. The model is trained on consecutive data points, as currently designed.

ylwu-amzn (Contributor) commented Jul 7, 2020

> If only the first point is present and the rest are missing, the first point is unusable and discarded. The model is trained on consecutive data points, as currently designed.

For example, we get the last 512 data points; the first 100 data points exist, [101, 399] are missing, and [400, 512] exist. Will we only feed [400, 512] into the model since those data points are consecutive?

What if [400, 510] exist but 511 and 512 are missing? Should we drop all the data points? How about using some interpolation to fill 511 and 512, as Kaituo suggested, and then feeding [400, 512] into the model for training?

wnbts (Contributor) commented Jul 7, 2020

> For example, we get the last 512 data points; the first 100 data points exist, [101, 399] are missing, and [400, 512] exist. Will we only feed [400, 512] into the model since those data points are consecutive?
>
> What if [400, 510] exist but 511 and 512 are missing? Should we drop all the data points? How about using some interpolation to fill 511 and 512, as Kaituo suggested, and then feeding [400, 512] into the model for training?

In the first example, [0, 100] and [400, 512] will be used.

In the second example, [400, 512] will be used.

As long as a data block is continuous, the block will be used.

ylwu-amzn (Contributor) commented

> As long as a data block is continuous, the block will be used.

Does "the block is continuous" mean there are at least two consecutive data points? Trying to understand the algorithm for the cases below:
1. [0] exists, [1, 399] missing, [400, 512] exist: will we feed [400, 512] into the model?
2. [0, 1] exist, [2, 399] missing, [400, 512] exist: will we feed [0, 1] and [400, 512] into the model?
3. [0, 1] exist, [2, 399] missing, [400, 510] exist, [511, 512] missing: will we feed [0, 1] and [400, 510] into the model?

wnbts (Contributor) commented Jul 8, 2020

> Does "the block is continuous" mean there are at least two consecutive data points? Trying to understand the algorithm for the cases below:
> 1. [0] exists, [1, 399] missing, [400, 512] exist: will we feed [400, 512] into the model?
> 2. [0, 1] exist, [2, 399] missing, [400, 512] exist: will we feed [0, 1] and [400, 512] into the model?
> 3. [0, 1] exist, [2, 399] missing, [400, 510] exist, [511, 512] missing: will we feed [0, 1] and [400, 510] into the model?

A continuous block means there are at least shingle-size consecutive data points (sketched below).

1. Correct.
2. Since the shingle size is 8, only [400, 512] is used.
3. For the same reason, only [400, 510] is used.
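
A hedged sketch of the rule described in this reply (class and method names are hypothetical; the shingle size of 8 is taken from the thread): split the queried range into continuous runs of present points and keep only runs at least one shingle long.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not the plugin's actual code: keep only blocks at least one shingle long. */
public class ContinuousBlocks {

    static final int SHINGLE_SIZE = 8; // per the discussion above

    /**
     * @param points the queried range, with Double.NaN marking a missing point
     * @return continuous runs of present points whose length is at least SHINGLE_SIZE
     */
    static List<double[]> usableBlocks(double[] points) {
        List<double[]> blocks = new ArrayList<>();
        int start = -1; // start index of the current run of present points, or -1 if no run is open
        for (int i = 0; i <= points.length; i++) {
            boolean present = i < points.length && !Double.isNaN(points[i]);
            if (present && start < 0) {
                start = i; // a new continuous run begins
            } else if (!present && start >= 0) {
                if (i - start >= SHINGLE_SIZE) {
                    // the run is long enough to form at least one shingle; keep it
                    blocks.add(Arrays.copyOfRange(points, start, i));
                }
                start = -1; // the run ends at a missing point (or at the end of the range)
            }
        }
        return blocks;
    }
}
```

Under this rule, case 2 above keeps only [400, 512] because the [0, 1] run is shorter than one shingle, matching the answers given here.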

ylwu-amzn (Contributor) commented

> A continuous block means there are at least shingle-size consecutive data points.
>
> 1. Correct.
> 2. Since the shingle size is 8, only [400, 512] is used.
> 3. For the same reason, only [400, 510] is used.

Thanks. Using data blocks whose size is >= the shingle size makes sense. Is it reasonable to lower the limit, e.g., require the data block size to be >= 90% of the shingle size and interpolate the other 10% of missing data points? That way we can handle more use cases and still get a reasonable AD result. We may need some testing to verify the AD result and confirm which percentage is the bottom line.

One edge case: 7 consecutive data points exist, then one data point is missing, then another 7 consecutive data points exist, and so on:

[0, 6] exist,
[7] missing,
[8, 14] exist,
[15] missing,
[16, 22] exist,
...

In this case the data is not very sparse, yet if we require data block size >= shingle size, we will drop all of the data blocks.
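
A hedged sketch of the relaxation suggested here (the helper name is hypothetical, and the thread does not say whether the shipped fix works this way): fill an isolated single missing point by linear interpolation between its neighbors before splitting into blocks, so the pattern above still yields usable blocks.

```java
/** Hypothetical sketch: fill isolated single-point gaps so near-continuous data still forms blocks. */
public class GapFilling {

    /**
     * @param points the queried range, with Double.NaN marking a missing point
     * @return a copy in which any single missing point between two present neighbors
     *         is replaced by the average of those neighbors (a one-point linear interpolation)
     */
    static double[] fillIsolatedGaps(double[] points) {
        double[] filled = points.clone();
        for (int i = 1; i + 1 < filled.length; i++) {
            boolean gapIsIsolated = Double.isNaN(points[i])
                && !Double.isNaN(points[i - 1])
                && !Double.isNaN(points[i + 1]);
            if (gapIsIsolated) {
                // linear interpolation of a single missing point is just the midpoint of its neighbors
                filled[i] = (points[i - 1] + points[i + 1]) / 2.0;
            }
        }
        return filled;
    }
}
```

Applied before a block-selection pass like the one sketched earlier, the edge case above becomes one continuous block of 23 points ([0, 22]) instead of three runs of 7 that would all be dropped.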

wnbts (Contributor) commented Jul 14, 2020

The change has been checked in. Closing the issue.

wnbts closed this as completed on Jul 14, 2020