Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal Amendment: UTF-8 migration #30

Merged
merged 10 commits into from
Dec 5, 2023
Merged

Conversation

ywwg
Copy link
Member

@ywwg ywwg commented Nov 15, 2023

This proposal expands upon the existing UTF-8 proposal, providing more detail into how we plan to handle the migration of metrics data, ensuring that it remains queryable during the transition period. This is critical to smooth migration of OTEL data.

@ywwg
Copy link
Member Author

ywwg commented Nov 15, 2023

TODO: need to clarify expansion method, otherwise first draft ready to go

@ywwg ywwg marked this pull request as ready for review November 16, 2023 19:29
@beorn7
Copy link
Member

beorn7 commented Nov 16, 2023

I added @roidelapluie and @bwplotka as reviewers of the original UTF-8 design doc, and @gouthamve as he was involved discussing the approach.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this is quite a smart way to have better UX on those, often already ingested, Otel yolo underscore metrics. Great work!

The collisions are tricky case here indeed. Hard to say if there will be case that might surprise users. We don't have specific examples that semantically makes sense. Given we cannot rule out the surprises completely, I would vote for dropping that storage trick. It's quite a complexity, yet still prone to surprise situations. Additionally, if we have block with mixed client versions (prone to collisions) and it's get compacted into 2 days or 2w block (e.g. Thanos), that means whole 2w is prone to collisions right? Or we change compaction code too?

I would say let's make unsupported input for this migration feature explicitly unsupported, not "sometimes" unsupported. I wonder if we can simply rely on the query API parameter (and feature flag) to allow some debuggability/quick opt-out, without storage deduction and be explicit in docs etc

Which brings an interesting question, would this feature be part of an official PromQL spec? (aka require every PromQL exposed backend to have "UTF-8 compatibility broad" lookup, to be compatible) or is it an optional feature?

Having explicit regex makes sense.

Quick suggestions:

  • Put storage trick to alternatives as some future improvement if we know about specific collision case which makes sense.
  • Propose query API param (?)
  • Clarify if this is part of the official PromQL spec or just a migration friendly Prometheus feature


### Proposed Solution

To help alleviate this confusion we first propose to bump the version number in the tsdb meta.json file. On a per-block basis, the query code can check the version number and know if the data was written with an old version of the database code. This helps distinguish the first case.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we don't need to bump as the existence of the new entry would tell us if new or old code was used, right? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version number bump indicates if the block was written by utf-capable prometheus, but it does not indicate if the clients sent utf8 metrics or pre-escaped metrics (or if the block contains a combination of both).

Copy link
Member

@gouthamve gouthamve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more, I think the implementation could get complicated.

Right now, the TSDB interface has no concept of "ingestion-version", and I would be uncomfortable plumbing it through for just this use-case.

Instead could we do the following:

-promql.utf8_migration.enabled=true
-promql.utf8_migration.until=<date-time> (optional)

The migration would then become:

  1. Set the first flag
  2. Move everything over to UTF-8
  3. Wait a bit
  4. Set the second flag
  5. Once retention is passed, unset the first flag

This might work tbh. The challenge is likely to be that step 2. will take a long time as most folks don't have control over the clients.


We must consider edge cases in which a blocks database has persisted metrics or labels that may have been written by different client versions. There are multiple ways this can (and will) happen:

* A newer client persists names to an older database version. In this case, names would be escaped with the U__ syntax. If the database is upgraded, newer blocks will be written in UTF-8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by database upgrade? Do you mean a new Prometheus version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that's what I meant here, I can fix the language

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@bwplotka
Copy link
Member

Yea, I like that simple logic @gouthamve proposed.

@ywwg
Copy link
Member Author

ywwg commented Nov 20, 2023

-promql.utf8_migration.enabled=true

is the idea that 100% of queries would get replacement-expansion, rather than trying to figure out which blocks are which?

@ywwg
Copy link
Member Author

ywwg commented Nov 20, 2023

or does the date-time determine for what periods the expansion will take place?

@bwplotka
Copy link
Member

bwplotka commented Nov 20, 2023

Both (:

If the -promql.utf8_migration.until=<date-time> is not provided, we query all blocks with extended lookup. If it's provided we perform extended lookup only until given time. (If I understand @gouthamve correctly).

@gouthamve
Copy link
Member

Yup, what Bartek explained.

@ywwg
Copy link
Member Author

ywwg commented Nov 21, 2023

Addressed notes

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Epic, thanks! Final suggestions 🤗 (I hope), otherwise LGTM

proposals/2023-11-13-utf8-migration.md Outdated Show resolved Hide resolved
### Proposed Solution

For queries to return correct data we must differentiate the three cases above, and to do that we first propose to bump the version number in the tsdb meta.json file.
On a per-block basis, the query code can check the version number and know if the data was written with an old version of the Prometheus code.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how this is useful. What query will do if it sees old block? When UTF-8 is queried? I think we need to still do this "migration broad search" for those, no? As we don't know what escape method clients used.

Is it as an optimization to immediately say no result on those blocks for non escaped UTF-8 lookup?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I suppose the only real benefit is removing the UTF-8 query from the list of possibilities. The other escapings may all be possible. Do you think that maybe we don't need a version number bump at all and can just use the flag/date-based logic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I am fine with both.

It feels like small change to bump version and might unlock some optimizations, so I am fine keeping this up, just wanted to clarify.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yeah, I guess the only optimization is that we never have to run the original UTF-8 query on blocks with an older version. Not sure if that is worth the effort of versioning the blocks. But maybe, as @bwplotka said, the effort is actually quite low, and we might see additional uses of the version number later.

proposals/2023-11-13-utf8-migration.md Outdated Show resolved Hide resolved
ywwg and others added 8 commits November 22, 2023 14:53
This proposal expands upon the existing UTF-8 proposal, providing more detail into how we plan to handle the migration of metrics data, ensuring that it remains queryable during the transition period

Signed-off-by: Owen Williams <owen.williams@grafana.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Change approach, use flags instead of trying to auto-detect mixed blocks.

Signed-off-by: Owen Williams <owen.williams@grafana.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Owen Williams <owen-github@ywwg.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Owen Williams <owen-github@ywwg.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
Signed-off-by: Owen Williams <owen.williams@grafana.com>
@ywwg
Copy link
Member Author

ywwg commented Nov 22, 2023

had to fix some commit signing, hopefully I didn't break anything

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! 💪🏽

proposals/2023-11-13-utf8-migration.md Outdated Show resolved Hide resolved
proposals/2023-11-13-utf8-migration.md Outdated Show resolved Hide resolved
proposals/2023-11-13-utf8-migration.md Outdated Show resolved Hide resolved
@bwplotka
Copy link
Member

Which brings an interesting question, would this feature be part of an official PromQL spec? (aka require every PromQL exposed backend to have "UTF-8 compatibility broad" lookup, to be compatible) or is it an optional feature?

I assume the answer is NO, it's optional backend feature and not part of the PromQL spec (:

Signed-off-by: Owen Williams <owen.williams@grafana.com>
@ywwg
Copy link
Member Author

ywwg commented Nov 27, 2023

I assume the answer is NO, it's optional backend feature and not part of the PromQL spec (:

I agree, I think this is an optional feature and not a requirement.

@ywwg ywwg requested a review from bwplotka November 27, 2023 19:06
Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much. I think this is substantially good to go. My comments are mostly about wordsmithing and clarifications.

proposals/2023-11-13-utf8-migration.md Outdated Show resolved Hide resolved
proposals/2023-11-13-utf8-migration.md Outdated Show resolved Hide resolved
proposals/2023-11-13-utf8-migration.md Outdated Show resolved Hide resolved
proposals/2023-11-13-utf8-migration.md Outdated Show resolved Hide resolved

All of these situations can be summarized as follows:

1. **Old Data** -- Data written with old Prometheus code: all names are guaranteed not to be UTF-8.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like it includes the case where an old Prometheus has ingested from new producers (and names might include escaped names).

Or is this referring to old Prometheus and old producers, so even escaping in names can be ruled out?

Could you clarify?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After having read through all the below, I would say what's meant here is "if at all, we will have escaped names, so we would never query for UTF-8 names, but only try out the specified escaping schemas".

Still, I think the difference to the mixed data case should be made clearer.

In principle, there might be a scenario where the user knows for sure that they won't have any escaping at all, for example if they had a pure Prometheus stack so far (no OTel etc.), but they would like to use UTF-8 names, so they still have a period of mixed versions deployed (new and old producers, new and old ingesters). I guess it's fine to not implement a specific optimization for that use case (which would be that we don't need a broad search for the old blocks at all), but it would be good if that case is described and that we disregard it as an informed decision.

proposals/2023-11-13-utf8-migration.md Outdated Show resolved Hide resolved
proposals/2023-11-13-utf8-migration.md Show resolved Hide resolved
### Proposed Solution

For queries to return correct data we must differentiate the three cases above, and to do that we first propose to bump the version number in the tsdb meta.json file.
On a per-block basis, the query code can check the version number and know if the data was written with an old version of the Prometheus code.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yeah, I guess the only optimization is that we never have to run the original UTF-8 query on blocks with an older version. Not sure if that is worth the effort of versioning the blocks. But maybe, as @bwplotka said, the effort is actually quite low, and we might see additional uses of the version number later.

proposals/2023-11-13-utf8-migration.md Outdated Show resolved Hide resolved
Signed-off-by: Owen Williams <owen.williams@grafana.com>
@ywwg
Copy link
Member Author

ywwg commented Nov 30, 2023

all notes addressed, please let me know if I missed anything

@ywwg
Copy link
Member Author

ywwg commented Dec 4, 2023

merge ping?

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I assume all other reviewers are also happy at this point. Please follow up if not.

@beorn7 beorn7 merged commit d3e3498 into prometheus:main Dec 5, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants