[Design] Expectations Data Model #24

Closed

mikaylathompson opened this issue Dec 1, 2022 · 9 comments
mikaylathompson (Collaborator) commented Dec 1, 2022

Meta notes:


Modeling expectations is a complex topic that is inevitably going to evolve with our understanding of the issues that we encounter as we progress.

The proposal below is intended to be a first draft--enough to get us started, and hopefully not something that will hamstring us going forward. I have attempted to prioritize serving our current needs and looking one step ahead over solving all problems at once.

Basics

An Expectation is the fundamental unit of the knowledge base and assessment framework. An Expectation says "we expect X, given conditions Y." It can be used by the assessment tool to tell users "these are behaviors that change between your current and desired versions," and by the validation tool as a checklist of tests to be run to verify the expected behavior of an upgrade or migration.

I think the concept is made clearer by a few examples:

  • There should be N documents in the cluster.
  • The type mapping of field Y should be Z.
  • Running the query Q against this index should return the result R.
  • Access to document D should be denied for user U.

Each of these has a number of assumed conditions behind it:

  • it applies to certain versions of Elasticsearch/OpenSearch
  • it depends on the data that was loaded into the index
  • it may depend on the plugins enabled and their configuration
  • it may depend on the configuration of the cluster

An expectation therefore can be thought of as having two components:

  1. the conditions,
  2. the verifiable expectation itself.

Down the line, we'd like to be able to describe the conditions and verifiable in an abstracted format in the Expectation. However, my current opinion is that attempting to do so right now, with our limited knowledge, is premature and will hamstring us as we discover new types of conditions and verifiables.

Additionally, we don't want the implementation details to be in the Expectation. Implementation details here might mean the exact data or query that would verify the expectation. Including them ties the expectation to those details and, as discussed below, Expectations are used in many parts of our system where the details may differ or be irrelevant.

Therefore, the current proposal is that Expectations should be as simple as possible--stripped down to their core:

  • a unique identifier
  • a human-oriented description of the expectation
  • optional: a version specification
  • optional: a collection of tags that describe its relevance to plugins, specialty datatypes, or specific applications (see further discussion in Downselecting).

The implementation--exactly what it means to check the conditions and test the verifiable--is left up to the tests associated with each implementation (see Execution).

Format

For the sake of portability and human readability, Expectations are JSON objects. (Discussion point: any reason why YAML would be preferable?)

As an example:

{
	"id": "consistent-document-count",
	"description": "There should be the same number of documents in the each index before and after a migration or upgrade."
}

A slightly more complex example that includes two Expectations, and version ranges:

[
{
	"id": "doy-format-date-range-query-bug",
	"description": "Given 1/ an index created in ES 7.x; 2/ documents with dates uploaded in a yyyy-DDD (day of year) format, and a query that uses an inclusive date range, the results of the query will fail to include documents where the date of the document is the same day as the inclusive endpoint of the range. See https://github.com/opensearch-project/OpenSearch/issues/4285",
	"versions": {
		"gte": "ES7.0.0",
		"lt": "OS2.4.0"
	}
},
{
	"id": "doy-format-date-range-query-bug-fix",
	"description": "In contrast to `doy-format-date-range-query-bug`, inclusive date range queries on documents with yyyy-DDD (day of year) date formats behave correctly after OS 2.4.0, and before ES 7.x (when the bug was introduced).",
	"versions": [{
		"gte": "OS2.4.0"
	},
	{
		"lt": "ES7.0.0"
	}]
}
]

Use in our tools (Assessment, Validation, Testing Framework)

Assessment

The assessment tool is run to give customers a report on the changes they should expect when upgrading to a given version.
Down the line, the tool may be able to introspect their cluster to determine which version and what plugins or specific data features they use and then provide them with detailed changes and upgrade suggestions.

In the short term, the tool will likely accept a source and target version and pull the expectations that are relevant to that use case and assemble them into a report for the user. The version filtering (see Downselecting below) allows this to happen easily, and will expand to allow for filtering based on specific tags.

Expectations won't be executed as part of the assessment tool--they'll simply be reported to the user based on the metadata.

Validation (and testing framework)

The term validation is a slightly overloaded one--see [Proposal] Upgrades Project Workstreams Update for a discussion of validation vs. the testing framework as context for this section.

Expectations are relevant to both the testing framework (is this system behaving as expected?) and validation (did the upgrade work?) and their execution looks the same in the short term for these two tools.

The testing framework is--among other things--used as CI/CD testing of both upgrade mechanisms & OpenSearch versions, as an eventual replacement for the backwards compatibility tests. Executing the Expectations against the pre- and post-upgrade clusters gives us confidence that they are behaving as expected and that our assessment reports are accurate.

As new versions are released we can find places where they break our expectations by including the new release as a target version in the testing framework. Eventually, it's important that there is a tight feedback loop between pre-release versions and the testing framework to alert code authors when their changes break our expectations. Some of these will likely be bugs, and others may be intentional feature changes that require updated expectations. This is also a mechanism to ensure that our backwards compatibility guarantees are being upheld.

Execution

Executing an Expectation means verifying whether a given cluster behaves as we expect it to via a reproducible test.

Currently (in the format proposed above), Expectations don't contain enough information to be run on their own. The goal is for the Expectation description to be precise enough that a human can write a more structured/specific description of the testing protocol (i.e. index a given set of documents, run a specific query, and then compare the results to an expected result). This allows the Expectation to be free of specific data or implementation details. Down the line, there might be multiple implementations for an Expectation--for instance, one for the testing framework and another for testing on a real cluster without injecting data.

At a very high level, our current proposal is to use the Robot Framework as the layer between the Expectation and the code that directly implements it. Once we have a library of constructs built up (in code), this will allow users to contribute implementations for expectations without needing to write low-level code themselves.

This topic will be elaborated on (and implemented) more in the very near future -- another issue should be opened to discuss exactly what this looks like.

Reporting

Reporting is generally the process of communicating about the expectations to users. It happens in the context of the assessment tool (near future), validation tool (medium future) and testing framework (immediate future).

Assessment Tool

The assessment tool reports the relevant expectations without any information about their implementation or execution. For this purpose, the most relevant information is the description included in the model, which may be reported to the user in full.

As a very rough example:

Behavior changes expected in an upgrade from ES 7.10.2 to OS 2.4.0
In 7.10.2:
	doy-format-date-range-query-bug
	Given 1/ an index created in ES 7.x; 2/ documents with dates uploaded in a yyyy-DDD (day of year) format,
	and a query that uses an inclusive date range, the results of the query will fail to include documents where
	the date of the document is the same day as the inclusive endpoint of the range.
	See https://github.com/opensearch-project/OpenSearch/issues/4285

In 2.4.0:
	doy-format-date-range-query-bug-fix
	In contrast to `doy-format-date-range-query-bug`, inclusive date range queries on documents with
	yyyy-DDD (day of year) date formats behave correctly after OS 2.4.0, and before ES 7.x (when the bug was introduced).

This provides the user with a description of the changes, as well as the Expectation id which allows them to look up more information or the implementation.

Testing Framework

In the Testing Framework, the goal of reporting is to communicate to the user whether the system behaved as expected for all expectations tested.

If we proceed with using the Robot Framework for implementing Expectations as mentioned above, it has built-in human-friendly reporting that we can leverage to communicate the results.

Note about Expectations exceeding implementations

We will likely write Expectations faster than we implement them. Particularly as outside contributors report issues they encounter during an upgrade, creating the Expectation is the necessary first step--this is the equivalent of reporting a bug without having yet reproduced or fixed it.

When we make a report for the testing framework, we should report on all relevant expectations, even those without an implementation. Expectations that can't be tested yet (because they don't have an implementation) should be visibly flagged as such -- this both creates a checklist of what we have left to implement and alerts the user that they may want to do additional testing independently.

Expectation Knowledge Base

The Knowledge Base is the collection of all expectations--capturing our knowledge about the expected behavior of the system.

The goal of the structure of the Knowledge Base is to be human and machine friendly. A script should be able to parse through the knowledge base and pull out all relevant expectations (based on version range/tags), and a human should be able to find the expectation that was called out by the assessment tool or that failed in the testing framework.

The simplest possible structure would be a single JSON file, but this is moderately difficult for human searching and becomes quite unwieldy as the collection grows.

Instead, I suggest that most expectations should live on their own in a JSON file named ${expectation_id}.json, in a single directory. This makes it very easy for a human or script to find a specific expectation, and very easy for a script to collect all expectations. If we find that this needs to be subdivided in the future, it's simple to add subdirectories for expectations with a given plugin tag or similar topical categorizations.

There may be some cases, like the date range bug used as an example above, where it's clearer for human comprehension to put two very closely related expectations (i.e. a bug and its fix) in the same file. I think that's a pragmatic decision, and the script(s) reading in the expectations should accommodate it.

Downselecting

We've been using the term downselecting to talk about the process of deciding which expectations should be executed or reported on in a given case. Not all expectations will be relevant to all users.

A few example dimensions on which we might want to downselect:

  • version
  • datatypes ("include geospatial")
  • plugins ("include k-nn")

Version is a special case--it's more fundamental to the data model, and it also exists as a semi-continuous range of versions, where it's both annoying and verbose to have to specify every single version to which an expectation applies. For that reason, it's pulled out into its own top-level field in the data model, and uses approximately the semantics of the OpenSearch range datatype. Note that the range may be open-ended (e.g. "applies to all versions including and after 2.3" would have just "gte": "2.3.0", with no lt). Versions are prefixed with "ES" (Elasticsearch) or "OS" (OpenSearch) to avoid collisions going forward.

This format allows us to build tools later that easily select all expectations relevant to a specific version, without having to re-tag many expectations every time there's a new release.

For other dimensions, an array of tags is included in each Expectation. These should specify any plugins and specialty datatypes (TODO: define this precisely!) to which the expectation is relevant.

Phases of Implementation

Phase 0: All expectations run (no downselecting). Tags are used for informational purposes.

Phase 1: Expectations are manually selected ("select expectations for version 7.10.2 and tagged geospatial")

Phase 2: Expectations are selected based on customer goals ("I'm upgrading from 7.10 to 2.4 and my data includes geospatial info and I use the security plugin")

In the short term, Expectations have tags, but there's no downselecting mechanism. All of the Expectations are pulled from the library and provided to the test runner/report generator.

In the medium term, tags can be passed in via the test config. The downselector filters the library to pull out Expectations that have the given tag(s).

In the long term, customer goals and data/config characteristics are either provided by the customer or inferred from their setup. These are mapped to various tags and those tags are passed in to the downselector.

mikaylathompson (Collaborator, Author)

Flagging @chelma for visibility. No need to comment at this stage unless you see fundamental mismatches. I'll be expanding on this into the empty sub-headings over the next few days.

chelma (Member) commented Dec 6, 2022

Per discussion yesterday - my recommendation would be to split an expectation (the number of docs should remain the same between the starting and ending cluster after an upgrade) from its implementation (there should be exactly 37 docs in the end cluster). The core of the knowledge base would be the library of expectations, which includes an identifier, a human-readable description of the expectation, and the conditions it applies under (e.g. which engine/plugin versions, etc.). For the initial cut of the validation tests, we can manually write tests that implement the expectations and put a system in place to ensure all the expectations have an implementing test, and that no tests are run without an associated expectation. We can add things like auto-generation of test stubs based on expectations or even fully automated test generation later on. We can add hooks into the Upgrade Testing Framework to automate ingestion of data at the correct point in the lifecycle.

mikaylathompson changed the title from "[DRAFT] [Design] Expectations Data Model" to "[Design] Expectations Data Model" on Dec 13, 2022
chelma (Member) commented Dec 21, 2022

Overall, looks good. A few comments.

  • We're getting some better clarity on what the Validation Tool will be and it's likely to be less focused on applying the KB expectations and more focused on validating performance and cost of a new, post-migration cluster. My current thinking is that the Assessment Tool is the ultimate consumer of the KB/Expectations, not the Validation Tool.
  • Syntax for expressing version ranges could be improved; right now it seems easy to make ambiguous and/or contradictory ranges. As an example, is this valid?
{
    "gte": "ES7.0.0",
    "lt": "OS2.4.0",
    "eq": "ES6.3.0"
}
  • Also, is there a difference between these two specifications?
[{
    "gte": "OS2.4.0"
},
{
    "lt": "ES7.0.0"
}]
{
    "gte": "OS2.4.0",
    "lt": "ES7.0.0"
}

mikaylathompson (Collaborator, Author)

@chelma

  • re: Validation tool -- Yeah, I agree with what you wrote here. I had this section in my original outline and then our stance shifted enormously over that week, so I emphasized testing framework over validation, but that could be clarified more. I'd still say that some of the expectations may be relevant for validation (e.g. document count, maybe expectations around fairly consistent/expected changes in size on disk, etc.). But I'm willing to say that that's a topic we'll deal with down the road when we have a better sense. Fair?

  • re: versioning ranges -- Good call out. I have a sense of the definition in my head, but 1/ I didn't actually write it down, and 2/ you're right about potentially contradictory ranges. I'm going to start with more fully defining it, but let me know if it's still confusing and I can take a step back and try a different approach.


Versioning ranges:

A versions entry defines the one or more versions relevant to a specific Expectation; it can be a single object or a list of objects.

If it is a list, the applicable versions are the union of each of the objects within that list (i.e. they're OR-ed together).

For each individual object, the two types are:

  • a single version object, which has one eq statement that specifies an exact version (e.g. {"eq": "ES6.3.0"})
  • a range version object, which has either one or two statements. The statements can be a minimum bound statement (gt or gte), a maximum bound statement (lt or lte), or one of each. The applicable versions are the intersection of these criteria (i.e. they're AND-ed together). This means that if it has both a minimum and a maximum statement, the minimum value must be less than the maximum value (or else they have no intersection and the object specifies zero versions).

To apply these to your examples:

{
    "gte": "ES7.0.0",
    "lt": "OS2.4.0",
    "eq": "ES6.3.0"
}

is not a valid version object because it combines eq and gt/lt statements.

[{
    "gte": "OS2.4.0"
},
{
    "lt": "ES7.0.0"
}]

is a valid versions entry. It specifies versions that are either greater than or equal to OS 2.4.0 or less than ES 7.0.0.

{
    "gte": "OS2.4.0",
    "lt": "ES7.0.0"
}

is a valid entry, but not a useful one, because it specifies versions that are both greater than or equal to OS 2.4.0 and less than ES 7.0.0, and no versions meet those criteria.

Perhaps it would be useful to write a quick utility script that takes in a version range and outputs every released version that meets the criteria.

chelma (Member) commented Dec 21, 2022

I'm inclined towards making our versioning system simple and obvious over flexible, as I'm pretty confident that people will consistently get this wrong in its current suggested form and it's unclear whether we'll ever need to do complex range definitions. It feels like we can increase the flexibility later if/when we need to, yeah? The difference between these two range definitions is impossible to understand at a glance without having carefully read a README:

{
    "gte": "ES7.0.0",
    "lt": "OS2.4.0"
}
[
    {"gte": "ES7.0.0"},
    {"lt": "OS2.4.0"}
]

It also feels like we can get rid of the eq operator by just having folks define a specific version as a very tightly bound range:

{
    "gte": "ES7.0.0",
    "lte": "ES7.0.0"
}

Can we simplify this entire problem in a way to make it harder to get version specification wrong? Or make it more explicit somehow, maybe by including the AND/OR keywords? Keep in mind that if a Cluster Admin wants to add an Expectation, their bandwidth may be quite low and they may give up if they can't intuit this quickly.

Still open to discussion on the topic, to be clear... just have some concerns.

mikaylathompson (Collaborator, Author)

Totally fine with getting rid of eq -- I agree with your solution there.

I actually do think we need "complex" ranges. I ended up including this because my example expectation (basically a bug that was introduced and then fixed two major versions later) needed it, so I don't think that it's that crazy or far off. We could define it as three expectations--expectation-pre-bug-introduction, expectation-with-bug, expectation-post-bug-fix--each with a simple range, but I think that's significantly worse than two expectations: expectation-with-bug and expectation-without-bug.

But you're right about it being confusing at first glance. I think explicitly including AND/OR might make sense. I'll take a step back on it and come back with another proposal.

lewijacn (Collaborator)

Very on board with most of what I read; a couple of thoughts that came to mind:

  • Definitely feel the proper starting place for linking an expectation to, say, a Robot Framework test should simply be the expectation ID. As the robot tests grow and improve, I am curious how far the gap between most expectations and the actual robot tests will be, and whether we can bridge that gap. It will be something to look at further down the road with a larger sample size.
  • When we talk about downselecting, I feel almost all the criteria we would have for the core (version, datatypes, etc.) could also be applied to plugins--consider a case like "I only want to see expectations for version X of the k-NN plugin when using Y datatypes". This is to say that I'm expecting plugins to need a similar JSON structure for their definition, in which case things could get a bit tricky when plugins don't follow the same release cadence as the core, or with the complexity of having version ranges for both. Let me know if you have any thoughts here.
  • Going off your point @mikaylathompson, there is this question of what expectations we should create when we discover a bug, and I haven't come to a good answer on it. We could capture all versions like you mentioned, say with-bug and without-bug, using version ranges, and this would give us confidence that all versions are behaving as we expect. I almost feel we need some sort of label to identify which of these expectations are, say, "business as usual" expectations vs. "bug identified" expectations, and maybe a "bug identified" expectation always warrants a "business as usual" expectation as well. As someone coming in looking to write one, I would probably write the bug expectation and stop there, so I wonder if having these grouped somehow in a simple pattern would be beneficial.

gregschohn (Collaborator)

Thanks for putting this together; I think it's a great start. I love having granular and verifiable building blocks to bridge what we think is/should be true to the real world. There are a lot of comments below on how to embrace structure as this evolves, to keep things consistent, clean, and most of all, easy to reason with. There are some philosophical tenets that may be worth challenging in this design as we look further into the future. Think of when expectations begin to break down--or, more likely, become more specific than was originally thought. It may turn out that not EVERY datatype can be migrated without incident between two versions, etc. How we mature into that process is what I'm looking toward.

A huge meta comment: come up with terms for UAT tests on live clusters with real data ("validation") vs. proactive tests to assert what we believe to be true ("tests", in this design's parlance).

There's a distinction being made between version and all other preconditions. Versions are special, but the other preconditions seem like they could just be lumped into the description, written by a human with good intentions. That seems artificial and will limit the ability to compose new expectations and to reason about them in automated ways. The context of a test will be more than just a version, though that might be the first step in establishing context. Datatypes in indices and active plugins will have similar configuration requirements. There could be classes of datatypes; plugins may need to have their own configurations. Infrastructure/performance concerns aren't identified yet in this document--do you want to make sure that when a node is dropped, the performance of a set of interactions isn't more than n% worse than before? Those get more complicated, especially to nail with high fidelity, but from a modeling perspective, why wouldn't you want them to be first-class attributes like version?

Taking it even further--what would happen if you extended this to test some expectations between different types of search/analytics services altogether? If I add 10 items to a Solr instance here and the same 10 to OpenSearch on this target, will the query return the same 3 items? Do I expect it to return different items? This framework seems like it could be super valuable for tracking all kinds of knowledge.

My hope would be that these attributes eventually drive a lot of the filtering that you discuss in downselecting - which seems like it will be matching deployed services with expectations/tests. Ideally, those same attributes can help in test configurations. If you add post-conditions to actions (maybe not expectations), then you could string pieces together to build up contexts that you need, doing tests along the way.

Separately, you mention “introspecting” on a cluster - that would be great, but in the meantime, you can model what a cluster would look like & allow it to be provided manually.

You mention "suggestions"--how would we know what to suggest? Would that eventually be some more attributes in the expectation model? Would we have a separate documentation store that keys off of expectation ids? Who would maintain those recommendations? Would we want to vet them? (IDs on Stack Overflow seem like they could be a great place to start.)

I can't tell yet exactly what an expectation is--can they be both binary (between sources and targets) and unary (it doesn't matter where I'm coming from)? How does one specify metadata and tests for those different cases?

There's some reliance on best intentions for the description--and on whether a test even needs to be provided. These expectations should be able to stand on their own without much oversight--I doubt that we'll be testing this test code to the same degree as the underlying systems under test, especially given the complexity of these tests. That means that your models need to be super solid up front. When you say that you don't want to be reliant on implementation details, what implementation details specifically are you trying to avoid? If it's a language or runtime environment, there are other ways to mitigate the risk of lock-in. I've had success in modeling constructs in a language of my choice that is easy to reason with. I can then provide code to transform my model into whatever runtime(s) I need. I can write another program to project one model into another--maybe the test implementations would fall away if I was keeping them inline, but all of the metadata that wasn't dependent upon the runtime can be easily preserved. Now I can start to make sure that my models are complete in a programmatic and easy way, rather than hoping.

Saying that you want to aggressively decouple “implementation details” to be more independent means that you're going to lose a lot of systematic validations that you could otherwise get for free - and not get too much in return.

On the flip side, it seems that you're hoping to use the Robot Framework to spit out some human-readable contents. Is there a way to get structured contents too? We'll want to log results for every run, and doing it with some structure could help us in the discovery process to find when expectations were faulty or when regressions had occurred. I'd like to see test results (eventually) be added to a datastore like a document DB that can also double as a projection of the knowledge base. That feels like you're projecting an implementation detail into your presentation layer, which will make it much harder to abstract away in the future. Are there already some abstractions to simplify that? Secondly--could one write, or carry over, tests in other languages, like Java (I'm thinking of the 'BWC'/backwards-compatibility tests)?

As I alluded to already, I'm wary about letting expectations go into the repo without corresponding tests.

Avoid saying "Tags are used for informational purposes." If you say that once, the implication may be that that's all they're ever for.

gregschohn (Collaborator)

I actually do think we need "complex" ranges. I ended up including this because my example expectation (basically a bug that was introduced and then fixed two major versions later) needed it, so I don't think that it's that crazy or far off. We could define it as three expectations--expectation-pre-bug-introduction, expectation-with-bug, expectation-post-bug-fix--each with a simple range, but I think that's significantly worse than two expectations: expectation-with-bug and expectation-without-bug.

Wow - this use case creates a lot of questions. The first is: do we think that this should be a permanent state that we're designing for, or should we say that some versions are recalled because of known bugs that were subsequently fixed? I don't know if it matters that the expectation worked for, say, N.m and for N.m+1 it now fails. Either way, for N.m+1, the user will see a warning that an expectation didn't hold, right?
