Document datehistogram with long offsets #93328

Merged
@@ -80,7 +80,8 @@ time zone.
One month is the interval between the start day of the month and time of
day and the same day of the month and time of the following month in the specified
time zone, so that the day of the month and time of day are the same at the start
and end.
and end. Note that the day may differ if an
<<search-aggregations-bucket-datehistogram-offset-months,`offset` is used that is longer than a month>>.

`quarter`, `1q` ::

@@ -543,6 +544,94 @@ NOTE: The start `offset` of each bucket is calculated after `time_zone`
adjustments have been made.
// end::offset-note[]

[[search-aggregations-bucket-datehistogram-offset-months]]
===== Long offsets over calendar intervals

Offsets are typically specified in units smaller than the `calendar_interval`: for example,
hours when the interval is days, or days when the interval is months.
If every calendar interval has the same length, or the `offset` is shorter than one unit of the calendar
interval (for example, less than `+24h` for `days` or less than `+28d` for `months`),
then each bucket has a repeating start. For example, `+6h` for `days` results in all buckets
starting at 6am each day. An offset of `+30h` also results in buckets starting at 6am, except when crossing
days that change from standard time to daylight saving time or vice versa.

This effect is much more pronounced for months, because each month differs in length
from at least one of its adjacent months.
To demonstrate this, consider eight documents, each with a date field on the 20th day of one of the
eight months from January to August 2022.

When querying for a date histogram over the calendar interval of months, the response returns one bucket per month, each with a single document.
Each bucket has a key named after the first day of the month, plus any offset.
For example, an offset of `+19d` results in buckets with keys like `2022-01-20`.

[source,console,id=datehistogram-aggregation-offset-example-19d]
--------------------------------------------------
"buckets": [
{ "key_as_string": "2022-01-20", "key": 1642636800000, "doc_count": 1 },
{ "key_as_string": "2022-02-20", "key": 1645315200000, "doc_count": 1 },
{ "key_as_string": "2022-03-20", "key": 1647734400000, "doc_count": 1 },
{ "key_as_string": "2022-04-20", "key": 1650412800000, "doc_count": 1 },
{ "key_as_string": "2022-05-20", "key": 1653004800000, "doc_count": 1 },
{ "key_as_string": "2022-06-20", "key": 1655683200000, "doc_count": 1 },
{ "key_as_string": "2022-07-20", "key": 1658275200000, "doc_count": 1 },
{ "key_as_string": "2022-08-20", "key": 1660953600000, "doc_count": 1 }
]
--------------------------------------------------
// TESTRESPONSE[skip:no setup made for this example yet]
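
The keys in the `+19d` response above can be reproduced with a small Python sketch.
It assumes that a bucket key is found by subtracting the offset from the document date,
rounding down to the first of the month in UTC, and adding the offset back.
The function name `month_bucket_key` and the midnight UTC document timestamps are
illustrative assumptions, not part of Elasticsearch.

```python
from datetime import datetime, timedelta, timezone


def month_bucket_key(doc_time, offset):
    """Return the assumed bucket key for a monthly interval with an offset."""
    # Undo the offset, then round down to the first instant of the month.
    shifted = doc_time - offset
    month_start = shifted.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    # Re-apply the offset to obtain the start of the bucket.
    return month_start + offset


# Eight documents dated the 20th of January to August 2022 (midnight UTC assumed).
docs = [datetime(2022, m, 20, tzinfo=timezone.utc) for m in range(1, 9)]

for doc in docs:
    key = month_bucket_key(doc, timedelta(days=19))
    # Print the key as a date and as epoch milliseconds.
    print(key.date().isoformat(), int(key.timestamp() * 1000))
```

Passing `timedelta(days=28)` instead shows the March document landing in a bucket keyed `2022-03-01`.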

With the offset increased to `+20d`, each document appears in the bucket that starts in the previous month,
and all bucket keys still fall on the same day of the month, as normal.
However, increasing the offset further to `+28d` turns
what used to be a February bucket into `"2022-03-01"`.

[source,console,id=datehistogram-aggregation-offset-example-28d]
--------------------------------------------------
"buckets": [
{ "key_as_string": "2021-12-29", "key": 1640736000000, "doc_count": 1 },
{ "key_as_string": "2022-01-29", "key": 1643414400000, "doc_count": 1 },
{ "key_as_string": "2022-03-01", "key": 1646092800000, "doc_count": 1 },
{ "key_as_string": "2022-03-29", "key": 1648512000000, "doc_count": 1 },
{ "key_as_string": "2022-04-29", "key": 1651190400000, "doc_count": 1 },
{ "key_as_string": "2022-05-29", "key": 1653782400000, "doc_count": 1 },
{ "key_as_string": "2022-06-29", "key": 1656460800000, "doc_count": 1 },
{ "key_as_string": "2022-07-29", "key": 1659052800000, "doc_count": 1 }
]
--------------------------------------------------
// TESTRESPONSE[skip:no setup made for this example yet]
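
To see which offsets keep the bucket start on the same day of every month, the following
Python sketch adds an offset in days to the first of each month of 2022 and collects the
resulting days of the month. The helper `month_bucket_starts` is a hypothetical name used
only for this illustration, and the sketch works on plain UTC calendar dates.

```python
from datetime import date, timedelta


def month_bucket_starts(year, offset_days):
    """Start date of each monthly bucket: first of the month plus the offset."""
    return [date(year, m, 1) + timedelta(days=offset_days) for m in range(1, 13)]


for offset_days in (19, 27, 28, 30):
    # A single distinct day means every bucket starts on the same day of the month.
    days = sorted({start.day for start in month_bucket_starts(2022, offset_days)})
    print(f"+{offset_days}d -> distinct start days: {days}")
```

Offsets that yield more than one distinct day are the ones for which bucket keys stop lining up from month to month.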

If we continue to increase the offset, the 30-day months will also shift into the next month,
so that three of the eight buckets have different days than the other five.
In fact, if we keep going, we will find cases where two documents appear in the same bucket.
Documents that were originally 30 days apart can be shifted into the same 31-day month bucket.

For example, for `+50d` we see:

[source,console,id=datehistogram-aggregation-offset-example-50d]
--------------------------------------------------
"buckets": [
{ "key_as_string": "2022-01-20", "key": 1642636800000, "doc_count": 1 },
{ "key_as_string": "2022-02-20", "key": 1645315200000, "doc_count": 2 },
{ "key_as_string": "2022-04-20", "key": 1650412800000, "doc_count": 2 },
{ "key_as_string": "2022-06-20", "key": 1655683200000, "doc_count": 2 },
{ "key_as_string": "2022-08-20", "key": 1660953600000, "doc_count": 1 }
]
--------------------------------------------------
// TESTRESPONSE[skip:no setup made for this example yet]
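
The document counts in the `+50d` response above can be checked with a short Python sketch.
It assumes the same rounding rule as before (subtract the offset, round down to the first
of the month, add the offset back) and uses plain UTC dates. The name `bucket_key` is
hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def bucket_key(doc_date, offset_days):
    """Assumed key of the monthly bucket that contains doc_date."""
    offset = timedelta(days=offset_days)
    # Round the un-shifted date down to the first of its month, then shift back.
    return (doc_date - offset).replace(day=1) + offset


# One document on the 20th of each month from January to August 2022.
docs = [date(2022, m, 20) for m in range(1, 9)]

# Count how many documents share each bucket key for an offset of +50d.
counts = Counter(bucket_key(doc, 50) for doc in docs)
for key in sorted(counts):
    print(key.isoformat(), counts[key])
```

Only non-empty buckets are counted here, which matches the five keys listed in the response.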

It is therefore always important, when using `offset` with `calendar_interval` bucket sizes,
to understand the consequences of using offsets larger than the interval size.

More examples:

* If the goal is to, for example, have an annual histogram where each year starts on the 5th February,
you could use `calendar_interval` of `year` and `offset` of `+35d`, and each year will be shifted identically,
because the offset includes only January, which is the same length every year.
However, if the goal is to have the year start on the 5th March instead, this technique will not work because
the offset includes February, which is one day longer in leap years.
* If you want a quarterly histogram starting on a date within the first month of the year, it will work,
but as soon as you push the start date into the second month by having an offset longer than a month, the
quarters will all start on different dates.
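
The two cases above can be checked with plain date arithmetic. The following Python sketch
computes how many days separate 1st January from a target start date in several years, and
which days of the month a quarterly bucket starts on for a given offset. The helper names
`offsets_to_reach` and `quarter_start_days` are hypothetical, and only UTC calendar dates
are considered.

```python
from datetime import date, timedelta


def offsets_to_reach(month, day, years):
    """Distinct day offsets from 1st January needed to reach month/day."""
    return {(date(y, month, day) - date(y, 1, 1)).days for y in years}


def quarter_start_days(year, offset_days):
    """Day of the month on which each quarterly bucket starts."""
    quarter_months = (1, 4, 7, 10)
    return [(date(year, m, 1) + timedelta(days=offset_days)).day for m in quarter_months]


years = range(2020, 2025)  # includes the leap years 2020 and 2024

# One value means a single fixed offset works in every year.
print("5th February:", offsets_to_reach(2, 5, years))
# More than one value means no single fixed offset works.
print("5th March:", offsets_to_reach(3, 5, years))

# Offset shorter than a month versus longer than a month.
print("quarters, +10d:", quarter_start_days(2022, 10))
print("quarters, +40d:", quarter_start_days(2022, 40))
```

A target date works with a fixed offset only when the set of required offsets has a single element.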

[[date-histogram-keyed-response]]
==== Keyed Response
