prometheus · beorn7 · Dec 5, 2023 · Nov 15, 2023 · Nov 17, 2023 · Nov 21, 2023
diff --git a/proposals/2023-11-13-utf8-migration.md b/proposals/2023-11-13-utf8-migration.md
@@ -0,0 +1,181 @@
+# Amendment: Enabling Smooth Migration of UTF-8 Metric and Label names
+
+* **Owners:**
+  * `<@author: owen.williams@grafana.com>`
+
+* **Implementation Status:** N/A
+
+* **Related Issues and PRs:**
+  * [GH Issue](https://github.com/prometheus/prometheus/issues/12630)
+  * [PR](https://github.com/grafana/mimir-prometheus/pull/476) (TODO: needs to be rebased on upstream prom)
+
+* **Other docs or links:**
+  * [Parent Proposal](https://github.com/prometheus/proposals/blob/main/proposals/2023-08-21-utf8.md)
+
+> TL;DR: This is an amendment to the existing UTF-8 proposal that provides more detail on backwards compatibility and migration scenarios.
+
+## Why
+
+As part of making Prometheus compatible with UTF-8, we want users to be able to use the full new character set without being locked in to the legacy Prometheus naming requirements.
+This requires that we consider the migration of older metric names to newer ones, and design a way that Prometheus can correctly query data during the transition period.
+This transition period will last an amount of time equal to the retention policy of a given database, and for some deployments, this could be years.
+
+## Goals
+
+* Allow queries to transparently read data from blocks generated by combinations of old and new versions of tsdb and scraping or remote-write clients.
+* Minimize edge cases where behavior is undefined or suboptimal or risks bad results.
+* Don't incur too much configuration or performance overhead in querying mixed data.
+
+### Audience
+
+The audience for this amendment is users who plan to migrate existing Prometheus deployments to support UTF-8 metric and label names and who want to ensure continuity in query behavior through the upgrade process.
+
+## Non-Goals
+
+We do not promise smooth accommodation of every edge case, especially pathological ones (see Name Collisions below).
+In those instances, users may not be able to turn on UTF-8 support, or may need to rename metrics or labels.
+
+## How
+
+Given a query for a UTF-8 metric or label name, the tsdb will look for that name in on-disk blocks whether those blocks were written in native UTF-8 or in one of the supported name-escaping patterns.
+Those series will be located even when a single block has one metric written in more than one way.
+The tsdb will differentiate those blocks based on entries in the meta.json and two new flags.
+
+The solution described here is valid for all forms of metrics ingestion, whether via scrape or via remote-write. In the following text, the target being scraped and the remote-write sender are both called metric producers.
+
+### Mixed-Format Scenarios
+
+We must consider edge cases in which a tsdb has persisted metric or label names written by different versions of code.
+There are multiple ways this can (and will) happen:
+
+* An older version of Prometheus ingests names from a newer metrics producer. In this case, names would be escaped with any of the available escaping methods. If Prometheus is upgraded, newer blocks will be written in UTF-8.
+* A newer Prometheus receives names from an older producer, which is later upgraded. In this case, older names might be escaped using the replace-with-underscores method, and newer names will be UTF-8. This will often happen when Prometheus is receiving OpenTelemetry metrics.
+* A newer Prometheus receives names from a mix of new and old producers, in which case the same block could contain escaped and UTF-8 data representing the same intended names.
+
+At query time, there will be a problem: some data may have been written in UTF-8 and other data in an escaping format.
+The query code will not know which encoding to look for.
+In order to ensure consistent querying, the backwards-compatibility design must account for these scenarios, making trade-offs when needed.
+
+All of these situations can be summarized as follows:
+
+1. **Old Data** -- Data written with old Prometheus code: all names will conform to the legacy naming requirements, so we would never query for UTF-8 names. Some names may be escaped, so we will try the specified escaping schemes.
+2. **Mixed Data** -- Data written with new Prometheus code by one or more old producers (and possibly new producers as well): No guarantees, some names could be escaped, others not.
+3. **New Data** -- Data written with new Prometheus code by new producers: all names are guaranteed to be UTF-8-compatible.
+
+### Time Scope
+
+The issue of mixed-format blocks will persist for the retention period of the tsdb.
+For some deployments this means only 14 days, for others it may be on the order of years of persisted old data.
+
+### Proposed Solution
+
+For queries to return correct data we must differentiate the three cases above, and to do that we first propose to bump the version number in the tsdb meta.json file.
+On a per-block basis, the query code can check the version number and know if the data was written with an old version of the Prometheus code.
+This helps distinguish the first case.
+
+Second, we will add two new flags. Together they tell the query code which escapings to try and up to which date-time blocks may contain mixed data, which distinguishes the second case from the third.
+
+* `-promql.utf8_broad_lookup.escape_formats`: This flag tells the PromQL engine which escaping methods might have been previously used to escape UTF-8 characters. It is then used to transparently repeat series lookups, once per escaping format, for metric names or label names in which UTF-8 characters are spotted. Available values will be a short enum representing underscores, U__, or dots-only escaping.
+* `-promql.utf8_broad_lookup.until=<date-time>`: This flag indicates the latest date-time (inclusive) for blocks that may contain mixed data. Any data after this moment are exclusively UTF-8. If this flag is unspecified, all data will be queried using the escape_formats list.
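+
+A minimal sketch of how the query code might combine the meta.json version and the two flags to decide which encodings to try for one block, written in Python for brevity. The names `UTF8_META_VERSION` and `encodings_for_block`, the version number `2`, and the choice to compare the block's minimum timestamp against the `until` value are assumptions made for illustration and are not specified by this proposal:
+
+```python
+from datetime import datetime, timezone
+from typing import List, Optional
+
+# Hypothetical meta.json version written by UTF-8-aware Prometheus code.
+UTF8_META_VERSION = 2
+
+
+def encodings_for_block(
+    meta_version: int,
+    block_min_time: datetime,
+    escape_formats: List[str],
+    until: Optional[datetime],
+) -> List[str]:
+    """Return the name encodings to try when querying one block."""
+    if meta_version < UTF8_META_VERSION:
+        # Old Data: names are never stored as UTF-8, only escaped.
+        return list(escape_formats)
+    if until is None or block_min_time <= until:
+        # Mixed Data: try native UTF-8 plus every configured escaping.
+        return ["utf8"] + list(escape_formats)
+    # New Data: everything after `until` is exclusively UTF-8.
+    return ["utf8"]
+```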
+
+#### Migration Timeline
+
+A Prometheus migration to UTF-8 will follow this timeline:
+
+1. Prometheus is upgraded and UTF-8 support enabled. The `-promql.utf8_broad_lookup.escape_formats` flag is set immediately, enabling the multi-lookup behavior and listing the possible escaping schemes.
+2. Producers are gradually upgraded to UTF-8.
+3. `-promql.utf8_broad_lookup.until` is set to the last date-time when data was ingested from a non-UTF-8 producer.
+4. Wait for the retention period to elapse such that the broad-lookup-until date is expired (could be years).
+5. The migration is complete. Remove `-promql.utf8_broad_lookup.escape_formats` and `-promql.utf8_broad_lookup.until` as they are no longer needed.
+
+### Querying Mixed Blocks
+
+The last major challenge is correctly returning data for queries of blocks that contain mixed data.
+For the mixed-format scenarios, at query time, we will look for **all possible** escapings of a name in order to locate the correct data.
+We propose to do this by expanding a lookup for a UTF-8 metric or label name into a limited set of possible escapings as specified by the escape_formats flag:
+
+1. **UTF-8**
+2. **underscore-replaced**: All unsupported characters are converted to underscores.
+3. **U__ escaping**: As described in the UTF-8 proposal, strings with invalid characters can be escaped by prepending `U__` and replacing all invalid characters with `_[UTF8 value]_`.
+4. **[Datadog proxy](https://github.com/grafana/mimir-proxies/blob/main/pkg/datadog/ddprom/naming.go#L30-L34) escaping pattern**: "`.`" becomes "`_dot_`" and "`_`" becomes "`__`".
+
+In PromQL, the expansion would look something like this under the hood:
+
+User-generated query:
+
+`{"my.utf8.metric", "my.label"="value"}`
+
+Expanded queries (all possible escapings):
+
+* `{"my.utf8.metric", "my.label"="value"}`
+* `{"my_utf8_metric", "my_label"="value"}`
+* `{"U__my_2E_utf8_2E_metric", "U__my_2E_label"="value"}`
+* `{"my_dot_utf8_dot_metric", "my_dot_label"="value"}`
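+
+The expansion above can be sketched as follows, in Python for brevity. This is an illustration only: the function names and the format keys `underscores`, `u`, and `dots` are hypothetical, the sketch uses the legacy metric-name charset for both metric and label names, and it follows only the escaping rules as described in this document (for example, it does not treat literal underscores specially in the `U__` scheme):
+
+```python
+import re
+from typing import Callable, Dict, List
+
+# Characters outside the legacy metric-name charset need escaping.
+_INVALID = re.compile(r"[^a-zA-Z0-9_:]")
+
+
+def escape_underscores(name: str) -> str:
+    # Replace every unsupported character with "_".
+    return _INVALID.sub("_", name)
+
+
+def escape_u(name: str) -> str:
+    # Prefix with U__ and replace each invalid character with _HEX_.
+    return "U__" + _INVALID.sub(lambda m: "_%X_" % ord(m.group()), name)
+
+
+def escape_dots(name: str) -> str:
+    # Datadog-proxy style: double the underscores first, then spell out dots.
+    return name.replace("_", "__").replace(".", "_dot_")
+
+
+# Hypothetical keys standing in for the escape_formats enum values.
+ESCAPERS: Dict[str, Callable[[str], str]] = {
+    "underscores": escape_underscores,
+    "u": escape_u,
+    "dots": escape_dots,
+}
+
+
+def expand_name(name: str, escape_formats: List[str]) -> List[str]:
+    """Return the UTF-8 name followed by each configured escaping, deduplicated."""
+    if not _INVALID.search(name):
+        # Legacy-compatible names need no expansion.
+        return [name]
+    candidates = [name]
+    for fmt in escape_formats:
+        escaped = ESCAPERS[fmt](name)
+        if escaped not in candidates:
+            candidates.append(escaped)
+    return candidates
+```
+
+Under these assumptions, `expand_name("my.utf8.metric", ["underscores", "u", "dots"])` yields the four metric names listed above, in the same order.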
+
+The escape_formats flag mentioned above enables the behavior and specifies which of the escaping schemes might be in use.
+If an administrator knows that no metrics will use the `U__` pattern, it can be safely skipped.
+Hypothetically, if additional replacement patterns are found, they could be easily added to the list of possible configuration options as a minor update.
+
+Redundant lookups will increase query time, but the hope is that index lookups are fast enough that the penalty will be small.
+We will do performance testing to identify possible issues.
+
+### Regex lookups
+
+If the user is querying for metrics using a regex lookup for the `__name__` label, attempting to rewrite that query to account for other name encodings would be overly complex and error-prone.
+Therefore we will not try to rewrite the regex to account for multiple escaping methods and the regex will be passed through as-is.
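+In this case, users will need to write regexes that account for metric name changes during the transition period themselves, for example `{__name__=~"my[._]utf8[._]metric"}` to match both a UTF-8 name and its underscore-escaped form.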
+Since regex queries on metrics names are relatively rare and the domain of advanced users, we feel this is an acceptable approach.
+
+### Name Collisions
+
+In most cases, we do not anticipate bad query results due to name collisions in the case where names are escaped by an old producer using the underscore method.
+This is because collisions would occur at write time, when the colliding names are written to the database.
+Any problems with collisions will occur well before a migration to UTF-8 support takes place.
+Therefore, we leave query behavior undefined for names that collide because of underscore replacement.
+
+Hypothetically, there could be collisions in the following situation:
+
+1. A database has incoming names generated by an old producer that escapes names with underscores.
+2. That database also has incoming names ingested by a new producer in UTF-8.
+3. There is a UTF-8 name that collides with a similar name ingested by the old producer.
+
+For example, "service.name" is being ingested from an old producer, and that is getting escaped to "service_name" by that producer at ingest time.
+At the same time, a different metric called "service/name" is being ingested in native UTF-8 from a newer producer.
+The error occurs when the user tries to query for "service/name": because both producers were contributing data to the same blocks, the query will be expanded to look for "service_name" and will accidentally grab the metrics meant for "service.name".
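+
+A small Python sketch of why the lookup collides, assuming the underscore method replaces every character outside the legacy charset with `_` (the helper name `escape_underscores` is hypothetical):
+
+```python
+import re
+
+
+def escape_underscores(name: str) -> str:
+    # Replace every character outside the legacy charset with "_".
+    return re.sub(r"[^a-zA-Z0-9_:]", "_", name)
+
+
+# Name as stored after the old producer escapes "service.name".
+stored_by_old_producer = escape_underscores("service.name")
+# Escaped candidate generated when expanding a query for "service/name".
+lookup_for_new_name = escape_underscores("service/name")
+
+# Both are "service_name", so the expanded query also matches the old series.
+collides = stored_by_old_producer == lookup_for_new_name
+```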
+
+The short answer to avoiding this scenario is **don't do that**. Specifically: if any old producers are present, avoid constructing metric or label names that could collide; if that is unavoidable, don't mix old and new producers.
+
+As long as all the producers are new, users do not need to worry about collisions -- "service.name" and "service/name" will be stored separately and the queries will never have to be expanded to include the escaped "service_name" possibility.
+
+This situation seems contrived enough that we are comfortable not supporting it.
+
+## Discarded Approaches
+
+### Record the oldest producer version used to write data
+
+A previous draft suggested recording the oldest producer version used to ingest data in order to determine which blocks might have mixed data.
+This approach was overly complicated and would require a lot of plumbing to make it work.
+There were also potential issues with block compaction and trying to make sure that the metadata is merged correctly.
+Ultimately we decided that having administrators declare transition dates was an easier approach.
+
+### Rewrite Old Data
+
+We could have required that users rewrite their tsdb blocks to "upgrade" them to UTF-8 and undo the escaping.
+This approach seems tedious, difficult, and dangerous -- what if something goes wrong during rewriting?
+Requiring massive data rewrites is not a reasonable ask of users.
+
+### Lookup Table / Per-Name Config
+
+We considered recording a lookup table or per-name configuration that would describe how UTF-8 metrics and labels might be stored in old data blocks.
+This approach would be faster than doing query expansion, but would create extra operational overhead -- lookup tables would have to be correct and exhaustive.
+
+Because names are stored in the index, query expansion is not expensive enough to justify the extra operational overhead.
+
+### No Migration -- Write Both Versions
+
+We very briefly considered the idea of having the tsdb write all versions for a name as long as the user configured it that way.
+That way queries for both the native UTF-8 name and the escaped name would succeed.
+When the migration was complete, users could turn off double-writing and only write UTF-8.
+
+This approach would cause an explosion of on-disk usage.
+As disk is one of the most expensive resources, this approach was quickly discarded.