[Design] Expectations Data Model #24
Comments
Flagging @chelma for visibility. No need to comment at this stage unless you see fundamental mismatches. I'll be expanding this into the empty sub-headings over the next few days.
Per discussion yesterday - my recommendation would be to split an expectation (the number of docs should remain the same between the starting and ending cluster after an upgrade) from its implementation (there should be exactly 37 docs in the end cluster). The core of the knowledge base would be the library of expectations, which includes an identifier, a human-readable description of the expectation, and the conditions it applies under (e.g. which engine/plugin versions, etc.). For the initial cut of the validation tests, we can manually write tests that implement the expectations and put a system in place to ensure all the expectations have an implementing test and no tests are run without an associated expectation. We can add things like auto-generation of test stubs based on expectations or even fully-automated test generation later on. We can add hooks into the Upgrade Testing Framework to automate ingestion of data at the correct point in the lifecycle.
Overall, looks good. A few comments.
Versioning ranges: a versions entry may be either a single object or a list of objects. If it is a list, the applicable versions are the union of the objects within that list (i.e. they're OR-ed together). For each individual object, the two types are:
To apply these to your examples:
is not a valid version object because it combines
is a valid versions entry. It specifies versions that are either greater than OS 2.4.0 or less than ES 7.0.0.
is a valid entry, but not a useful one, because it specifies versions that are both greater than OS 2.4 and less than ES 7.0, and there are no versions that meet both criteria. Perhaps it would be a useful utility to write a quick script that takes in a version range and outputs every released version that meets it.
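To make the OR-ed-list semantics concrete, below is a minimal sketch of the utility script suggested above. Everything in it is illustrative: the released-version list is a tiny sample, the supported operators (gt/gte/lt/lte) are assumed from the OpenSearch range semantics mentioned later in the proposal, and the rule that all ES releases sort before all OS releases is an assumption of the sketch, not part of the design.

```python
"""Sketch only: expand a version-range entry into the released versions it matches."""

# A tiny, incomplete sample of released versions, oldest to newest (illustrative).
RELEASED = ["ES 6.8.23", "ES 7.0.0", "ES 7.10.2", "OS 1.3.7", "OS 2.3.0", "OS 2.4.0", "OS 2.4.1"]

def sort_key(version):
    """Order versions; assumes every ES release predates every OS release."""
    engine, number = version.split()
    return (0 if engine == "ES" else 1, tuple(int(p) for p in number.split(".")))

def matches(version, range_obj):
    """True if `version` satisfies a single {gt/gte/lt/lte} range object (bounds AND-ed)."""
    key = sort_key(version)
    checks = {
        "gte": lambda bound: key >= sort_key(bound),
        "gt":  lambda bound: key > sort_key(bound),
        "lte": lambda bound: key <= sort_key(bound),
        "lt":  lambda bound: key < sort_key(bound),
    }
    return all(checks[op](bound) for op, bound in range_obj.items())

def expand(ranges):
    """A list of range objects is OR-ed together, per the proposal."""
    return [v for v in RELEASED if any(matches(v, r) for r in ranges)]

# "greater than OS 2.4.0 OR less than ES 7.0.0" -- two objects in a list
print(expand([{"gt": "OS 2.4.0"}, {"lt": "ES 7.0.0"}]))   # ['ES 6.8.23', 'OS 2.4.1']
# "greater than OS 2.4.0 AND less than ES 7.0.0" -- one object, no version qualifies
print(expand([{"gt": "OS 2.4.0", "lt": "ES 7.0.0"}]))     # []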
I'm inclined towards making our versioning system simple and obvious rather than flexible, as I'm pretty confident that people will consistently get this wrong in its currently suggested form, and it's unclear whether we'll ever need complex range definitions. It feels like we can increase the flexibility later if/when we need to, yeah? The difference between these two range definitions is impossible to understand at a glance without having carefully read a README:
It also feels like we can get rid of the
Can we simplify this entire problem in a way that makes it harder to get the version specification wrong? Or make it more explicit somehow, maybe by including the AND/OR keywords? Keep in mind that if a Cluster Admin wants to add an Expectation, their bandwidth may be quite low and they may give up if they can't intuit this quickly. Still open to discussion on the topic, to be clear... just have some concerns.
Totally fine with getting rid of it. I actually do think we need "complex" ranges. I ended up including this because my example expectation (basically a bug that was introduced and then fixed 2 major versions later) needed it. So I don't think that it's that crazy or far off. We could define it as three expectations instead. But you're right about it being confusing at first glance. I think explicitly including AND/OR might make sense. I'll take a step back on it and come back with another proposal.
Very on board with most of what I read; a couple of thoughts came to mind:
Thanks for putting this together & I think it's a great start. I love having granular and verifiable building blocks to bridge what we think is/should be true to the real world. There are a lot of comments below on how to embrace structure as this evolves to keep things consistent, clean, and most of all, easy to reason with.

There are some philosophical tenets that may be worth challenging here in this design as we look further into the future. Think of when expectations begin to break down - or, more likely, become more specific than was originally thought. It may turn out that not EVERY datatype can be migrated w/out incident between two versions, etc. How we mature into that process is what I'm looking toward...

A huge meta comment is to come up w/ terms for UAT tests on live clusters w/ real data ("validation") vs. proactive tests to assert what we believe to be true ("tests"? in this design's parlance).

There's a distinction being made between version and all other preconditions. Versions are special, but other preconditions seem like they could just be lumped into the description, written by a human with good intentions. That seems artificial and will limit the ability to compose new expectations and to reason in automated ways. The context of a test will be more than just a version, though that might be the first step in establishing context. Datatypes in indices and active plugins will have similar configuration requirements. There could be classes of datatypes; plugins may need to have their own configurations. Infrastructure/performance concerns aren't identified yet in this document - do you want to make sure that when a node is dropped, the performance of a set of interactions isn't more than n% worse than before? Those get more complicated, especially to nail with high fidelity, but from a modeling perspective, why wouldn't you want them to be first-class attributes like version?

Taking it even further - what would happen if you extended this to test some expectations between different types of search/analytics services altogether? If I add 10 items to a SOLR instance here & the same 10 to OpenSearch on this target, will the query return the same 3 items? Do I expect it to return different items? This framework seems like it could be super valuable to track all kinds of knowledge.

My hope would be that these attributes eventually drive a lot of the filtering that you discuss in downselecting - which seems like it will be matching deployed services with expectations/tests. Ideally, those same attributes can help in test configurations. If you add post-conditions to actions (maybe not expectations), then you could string pieces together to build up the contexts that you need, doing tests along the way.

Separately, you mention "introspecting" on a cluster - that would be great, but in the meantime, you can model what a cluster would look like & allow it to be provided manually.

You mention "suggestions" - how would we know what to suggest? Would that eventually be some more attributes in the expectation model? Would we have a separate documentation store that keys off of expectation ids? Who would maintain those recommendations? Would we want to vet them? (Ids on Stack Overflow seem like they could be a great place to start.)

I can't tell what an expectation is yet - can they be binary (between sources and targets) as well as unary (it doesn't matter where I'm coming from)? How does one specify metadata and tests for those different cases?
There's some reliance on best intentions for the description - and on whether a test even needs to be provided. These expectations should be able to stand on their own w/out much oversight - I doubt that we'll be testing this test code to the same degree as the underlying systems under test, especially given the complexity of these tests. That means that your models need to be super solid up front.

When you say that you don't want to be reliant on implementation details, what implementation details specifically are you trying to avoid? If it's a language or runtime environment, there are other ways to mitigate the risk of lock-in. I've had success modeling constructs in a language of my choice that is easy to reason with. I can then provide code to transform my model into whatever runtime(s) I need. I can write another program to project one model into another - maybe the test implementations would fall away if I was keeping them inline, but all of the metadata that wasn't dependent upon the runtime can be easily preserved. Now I can start to make sure that my models are complete in a programmatic and easy way, rather than hoping. Saying that you want to aggressively decouple "implementation details" to be more independent means that you're going to lose a lot of systematic validations that you could otherwise get for free - and not get too much in return.

On the flip side, it seems that you're hoping to use the Robot Framework to spit out some human-readable content. Is there a way to get structured content too? We'll want to log results for every run, and doing it with some structure could help us in the discovery process to find when expectations were faulty or when regressions had occurred. I'd like to see test results (eventually) be added to a datastore like a document DB that can also double as a projection of the knowledge base. That feels like you're projecting an implementation detail into your presentation layer that will make it much harder to abstract away in the future. Are there already some abstractions to simplify that?

Secondly - could one write, or carry over, tests in other languages, like Java (I'm thinking of the 'BWC'/backward-compatibility tests)?

As I alluded to already, I'm wary about letting expectations go into the repo without corresponding tests.

Avoid saying "Tags are used for informational purposes." If you say that once, the implication may be that that's all they're ever for.
Wow - this use case creates a lot of questions. The first is: do we think that this should be a permanent state that we're designing for, or should we say that some versions are recalled because of known bugs that were subsequently fixed? I don't know if it matters that the expectation worked for, say, N.m and now fails for N.m+1. Either way, for N.m+1, the user will see a warning that an expectation didn't hold, right?
Meta notes:
Modeling expectations is a complex topic that is inevitably going to evolve with our understanding of the issues that we encounter as we progress.
The proposal below is intended to be a first draft--enough to get us started, and hopefully not something that will hamstring us going forward. I have attempted to prioritize serving our current needs and looking one step ahead over solving all problems at once.
Basics
An Expectation is the fundamental unit of the knowledge base and assessment framework. An Expectation says "we expect X, given conditions Y" and can be used as part of the assessment tool to tell users "these are behaviors that change between your current and desired versions", and by the validation tool as a checklist of tests to be run to verify the expected behavior of an upgrade or migration.
I think the concept is made more clear by giving a few examples:
Each of these has a number of assumed conditions behind it:
An expectation therefore can be thought of as having two components:
Down the line, we'd like to be able to describe the conditions and the verifiable in an abstracted format within the Expectation. However, my current opinion is that attempting to do so right now, with our limited knowledge, is premature and will hamstring us as we discover new types of conditions and verifiables.
Additionally, we don't want the implementation details to be in the Expectation. Implementation details here might mean the exact data or query that would verify the expectation. Including them ties the expectation to those details and, as discussed below, Expectations are used in many parts of our system where the details may differ or be irrelevant.
Therefore, the current proposal is that Expectations should be as simple as possible--stripped down to their core: an id, a human-readable description, the version range they apply to, and tags for other conditions (see Downselecting). The implementation--exactly what it means to check the conditions and test the verifiable--is left up to the tests associated with each Expectation (see Execution).
Format
For the sake of portability and human readability, Expectations are JSON objects. (Discussion point: is there any reason why YAML would be preferable?)
As an example:
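A minimal sketch of what a single Expectation object might look like, assuming the fields described in this proposal (id, description, versions, tags); the field names and values below are illustrative only:

```json
{
  "id": "doc-count-unchanged-0001",
  "description": "The number of documents in an index is the same before and after the upgrade.",
  "versions": [
    { "gte": "ES 6.8.0" }
  ],
  "tags": []
}
```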
A slightly more complex example that includes two Expectations and version ranges:
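A sketch of the two-Expectation case, loosely based on the bug-then-fix scenario discussed in the comments; the ids, version bounds, and descriptions are placeholders rather than real bug data. Note that a list under versions is OR-ed together, while bounds inside a single object are AND-ed:

```json
[
  {
    "id": "date-range-bug-0002",
    "description": "A hypothetical date range query bug is present: queries may return incorrect results.",
    "versions": [
      { "gte": "ES 7.0.0", "lt": "OS 1.0.0" }
    ],
    "tags": ["datatype:date"]
  },
  {
    "id": "date-range-correct-0003",
    "description": "Date range queries return correct results (the hypothetical bug above is absent or fixed).",
    "versions": [
      { "lt": "ES 7.0.0" },
      { "gte": "OS 1.0.0" }
    ],
    "tags": ["datatype:date"]
  }
]
```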
Use in our tools (Assessment, Validation, Testing Framework)
Assessment
The assessment tool is run to give customers a report on what changes they should expect for the upgrade to a given version.
Down the line, the tool may be able to introspect their cluster to determine which version and what plugins or specific data features they use and then provide them with detailed changes and upgrade suggestions.
In the short term, the tool will likely accept a source and target version and pull the expectations that are relevant to that use case and assemble them into a report for the user. The version filtering (see Downselecting below) allows this to happen easily, and will expand to allow for filtering based on specific tags.
Expectations won't be executed as part of the assessment tool--they'll simply be reported to the user based on their metadata.
Validation (and testing framework)
The term validation is a slightly overloaded one--see [Proposal] Upgrades Project Workstreams Update for a discussion of validation vs. the testing framework, as context for this section.
Expectations are relevant to both the testing framework (is this system behaving as expected?) and validation (did the upgrade work?) and their execution looks the same in the short term for these two tools.
The testing framework is--among other things--used as CI/CD testing of both upgrade mechanisms & OpenSearch versions, as an eventual replacement for the backwards compatibility tests. Executing the Expectations against the pre- and post-upgrade clusters gives us confidence that they are behaving as expected and that our assessment reports are accurate.
As new versions are released we can find places where they break our expectations by including the new release as a target version in the testing framework. Eventually, it's important that there is a tight feedback loop between pre-release versions and the testing framework to alert code authors when their changes break our expectations. Some of these will likely be bugs, and others may be intentional feature changes that require updated expectations. This is also a mechanism to ensure that our backwards compatibility guarantees are being upheld.
Execution
Executing an Expectation means verifying whether a given cluster behaves as we expect it to via a reproducible test.
Currently (in the format proposed above), Expectations don't contain enough information to be run on their own. The goal is for the Expectation description to be precise enough that a human can write a more structured/specific description of the testing protocol (i.e. index a given set of documents, run a specific query, and then compare the results to an expected result). This allows the Expectation to be free of specific data or implementation details. Down the line, there might be multiple implementations for an Expectation--for instance, one for the testing framework and another for testing on a real cluster without injecting data.
At a very high level, our current proposal is to use the Robot Framework as the layer in between the Expectation and the directly implementing code. Once we have a library of constructs built up (in code), this will allow users to contribute implementations for expectations without needing to write low-level code themselves.
This topic will be elaborated on (and implemented) more in the very near future -- another issue should be opened to discuss exactly what this looks like.
Reporting
Reporting is generally the process of communicating about the expectations to users. It happens in the context of the assessment tool (near future), validation tool (medium future) and testing framework (immediate future).
Assessment Tool
The assessment tool reports the relevant expectations without any information about their implementation or execution. For this purpose, the most relevant information is the description included in the model, which may be reported to the user in full.
As a very rough example:
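The sketch below is a purely illustrative mock-up of what such a report entry might contain; the layout, ids, and descriptions are placeholders:

```
Upgrade assessment: ES 7.10.2 -> OS 2.4.0

  [date-range-bug-0002] A hypothetical date range query bug is present:
      queries may return incorrect results.
  ...
```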
This provides the user with a description of the changes, as well as the Expectation id which allows them to look up more information or the implementation.
Testing Framework
In the Testing Framework, the goal of reporting is to communicate to the user whether the system behaved as expected for all expectations tested.
If we proceed with using the Robot Framework for implementing Expectations as mentioned above, it has built-in human-friendly reporting that we can leverage to communicate the results.
Note about Expectations exceeding implementations
We will likely write Expectations faster than we implement them. Particularly as outside contributors report issues they encounter during an upgrade, creating the Expectation is the necessary first step -- this is the equivalent of reporting a bug without having yet reproduced or fixed it.
When we make a report for the testing framework, we should report on all relevant expectations, even those without an implementation. Expectations that can't be tested yet (because they don't have an implementation) should be visibly flagged as such -- this both creates a checklist of what we have left to implement and alerts the user that they may want to do additional testing independently.
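As a concrete illustration of this reporting rule, here is a minimal Python sketch (not the actual implementation) that pairs expectations from the knowledge base with the ids their implementing tests declare, and flags both unimplemented expectations and tests without an associated expectation. The directory layout and the idea that implementations declare an expectation id are assumptions.

```python
"""Sketch only: report on every expectation, flagging those without an implementation."""
import json
from pathlib import Path

def load_expectations(knowledge_base_dir):
    """Assumes one ${expectation_id}.json file per expectation (or a small list per file)."""
    expectations = {}
    for path in Path(knowledge_base_dir).glob("*.json"):
        loaded = json.loads(path.read_text())
        for exp in loaded if isinstance(loaded, list) else [loaded]:
            expectations[exp["id"]] = exp
    return expectations

def coverage_report(expectations, implemented_ids):
    """`implemented_ids` is the set of expectation ids the test implementations declare."""
    lines = []
    for exp_id, exp in sorted(expectations.items()):
        status = "TESTABLE" if exp_id in implemented_ids else "NOT YET IMPLEMENTED"
        lines.append(f"[{status}] {exp_id}: {exp['description']}")
    for orphan in sorted(implemented_ids - expectations.keys()):
        lines.append(f"[NO EXPECTATION] test references unknown expectation {orphan}")
    return "\n".join(lines)

# Example usage (paths and ids are illustrative):
# expectations = load_expectations("knowledge_base/")
# print(coverage_report(expectations, {"doc-count-unchanged-0001"}))
```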
Expectation Knowledge Base
The Knowledge Base is the collection of all expectations--capturing our knowledge about the expected behavior of the system.
The goal of the structure of the Knowledge Base is to be human and machine friendly. A script should be able to parse through the knowledge base and pull out all relevant expectations (based on version range/tags), and a human should be able to find the expectation that was called out by the assessment tool or that failed in the testing framework.
The simplest possible structure would be a single JSON file, but this is moderately difficult for human searching and becomes quite unwieldy as the collection grows.
Instead, I suggest that most expectations should live on their own in a JSON file named `${expectation_id}.json`, in a single directory. This makes it very easy for a human or a script to find a specific expectation, and very easy for a script to collect all expectations. If we find that this needs to be subdivided in the future, it's simple to add subdirectories for expectations with a given plugin tag or similar topical categorizations. There may be some cases, like the date range bug used as an example above, where it's clearer for human comprehension to put two very closely related expectations (i.e. a bug and its fix) in the same file; I think that's a pragmatic decision, and the script(s) reading in the expectations should accommodate it.
Downselecting
We've been using the term downselecting to talk about the process of deciding which expectations should be executed or reported on in a given case. Not all expectations will be relevant to all users.
A few example dimensions on which we might want to downselect:
Version is a special case--it's both more fundamental to the data model, and it also exists as a semi-continuous range of versions, where it's both annoying and verbose to have to specify every single version to which an expectation applies. For that reason, it's pulled out to its own top-level field in the data model, and uses approximately OpenSearch range datatype semantics. Note that the range may be open-ended (e.g. "applies to all versions including and after 2.3" would have just a `"gte": "2.3.0"`, with no `lt`). Versions are prefaced with "ES" (Elasticsearch) or "OS" (OpenSearch) to avoid collisions going forward.

This format allows us to build tools later that easily select all expectations relevant to a specific version, without having to re-tag many expectations every time there's a new release.
For other dimensions, an array of tags is included in each Expectation. These should specify any plugins and specialty datatypes (TODO: define this precisely!) to which the expectation is relevant.
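A minimal sketch of what a downselector could look like under these rules: the versions list is OR-ed, an absent or empty list is treated here as "applies to all versions", and every requested tag must be present. The range-comparison logic is passed in by the caller (e.g. the `matches` function from the version-range sketch earlier in the thread), since the exact comparison semantics are still under discussion.

```python
"""Sketch only: filter expectations by cluster version and tags."""

def downselect(expectations, cluster_version, required_tags=(), matches_range=None):
    """Return the expectations relevant to `cluster_version` and `required_tags`.

    `matches_range(version, range_obj) -> bool` is supplied by the caller;
    a list of range objects is OR-ed together, per the proposal.
    """
    selected = []
    for exp in expectations:
        ranges = exp.get("versions", [])
        version_ok = not ranges or any(matches_range(cluster_version, r) for r in ranges)
        tags_ok = set(required_tags) <= set(exp.get("tags", []))
        if version_ok and tags_ok:
            selected.append(exp)
    return selected

# e.g. downselect(all_expectations, "OS 2.4.0", ["geospatial"], matches_range=matches)
```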
Phases of Implementation
**Phase 0:** All expectations run (no downselecting). Tags are used for informational purposes.

**Phase 1:** Expectations are manually selected ("select expectations for version `7.10.2` and tagged `geospatial`").

**Phase 2:** Expectations are selected based on customer goals ("I'm upgrading from 7.10 to 2.4 and my data includes geospatial info and I use the security plugin").
In the short term, Expectations have tags, but there's no downselecting mechanism. All of the Expectations are pulled from the library and provided to the test runner/report generator.
In the medium term, tags can be passed in via the test config. The downselector filters the library to pull out Expectations that have the given tag(s).
In the long term, customer goals and data/config characteristics are either provided by the customer or inferred from their setup. These are mapped to various tags and those tags are passed in to the downselector.