Develop a tool to generate "evil"/edge case datasets for OpenSearch #9
Comments
I like this idea. The verification in our current backwards compatibility tests is closer to a shallow health check (since the focus of the test is typically to see if the cluster is healthy) rather than a deep test that vets the tricky edge cases - which seems to be what is being proposed here 👍 Some high-level thoughts: Firstly, it seems to me that there are several sub-ideas in this proposal:

It's probably better to track these in their own meta-issues, since each of them is composed of several facets that change in applicability and expected behavior based on the version of OpenSearch being tested. Wdyt?

Secondly, I see that the proposal describes the data set as one that "emphasizes edge cases" but also "approaches a comprehensive set of data types and use-cases". Given that we've already got test data spread out across the BWC tests and the benchmarks suite, I'm concerned that trying to be comprehensive will end with this being "yet another standard". Instead, could we keep the data set focused purely on edge cases?

On a similar note, I think we should also be conservative about adding performance-related use-cases to this data set - edge cases that are consistently reproducible are probably the only good candidates to include. Everything else should be covered by the benchmarks suite, IMO. I would also prefer to include such "evil performance" tests after we've fleshed out the "evil functionality" data set, since the performance tests will ideally require tooling to execute them against a cluster.
I want to make sure I’m understanding the objective correctly before I go too far. This tool would be able to generate a dataset (with particularly “evil” data) that we would expect to potentially cause a change in behavior/output when upgrading from a particular current OS version to a target OS version. This dataset could then be provided for users to test on their own test cluster.

After this, I am not sure I follow point 3 concerning a set of queries and expected responses; is this something that would be valuable to a user, or would it be mainly for, say, our own automated testing in OS? I imagine having a query a user could run on their initial cluster with “evil” data and then run again on their upgraded cluster to visually see some differences could be helpful, but I’m curious for anyone’s thoughts here as I may be missing the intent.

I do share similar concerns about making the divide between BWC tests / Benchmark tests / “Evil” tests clear and not leaving ambiguity as to where a test case should belong. I am not aware of any comprehensive edge case testing that is done in OS, especially with respect to functional changes between versions, so this seems like a good unique area to target that would probably have more overlap with Lucene tests than anything currently in OS. As a starting point it may be useful to focus here, even with static data, as this gets rounded out.
This is a good beginning to what could be a very long work stream. Breaking this up into separate but related (cited) work streams, as kartg mentions, will likely provide better visibility.

My high level comment would be to carefully consider the complexity of the surfaces that you need to test. If the “devil’s in the details”, you’ll have to consider lots of different ways to tweak various details. Trying to do that in a single monolithic/shared test suite may turn into a maintenance nightmare - making sure that each new case doesn’t cause other cases to regress. Setting up ways to manage, reuse, and compose the details and complexity will let others contribute additional tests and data, as well as allow for tests that are cheaper to run - meaning that we can run them more often, which is always a good thing. In addition to the composability, modeling the data and validations’ metadata can provide super valuable feedback into other automated systems to surface differences across environments.

I understand kartg's concern that we have some datasets and you’re creating yet another one. However, if we want to find specific edge cases that triggered issues, it’s probably going to be easiest if we have a solid palette of stock contexts that we can pull from and then add one detail to create a breaking edge case. I’d recommend looking into how the BWC tests are specified and what they’re testing. Long term, they should be unified, and in a way that minimizes the total effort - meaning that if BWC tests are specified and have data already ready, use that. If there ARE additional things that you need to add, go through that refactoring exercise and start to prepare for a backport.

There are a couple of key reasons that these edge tests can’t just lift the BWC tests wholesale:
1. Migration tests are expected to break; BWC tests are expected to work.
2. When tests break - or show differences - will those differences be expected? If they are, what metadata do we want to keep track of to let our tools reason about what the differences are?
3. Migration test scenarios will have many more contexts than the BWC tests - including many cases that we don’t know about a priori.

Smaller specific bits of feedback are as follows…
Some specific thoughts on modeling:
Thank you all for your feedback, this has been really helpful. Addressing one clarification point first from @lewijacn 's comment:
Yes, I think the queries are going to be an integral part. Some issues might pop up from just indexing the dataset, but a lot more of the issues are going to be queries that break or data that is interpreted as a different type, for instance. Users won't be aware of these issues unless 1/ they're already running extensive tests and very proficient at looking into this (in which case, this might not benefit them much) or 2/ we provide them with the queries and the "answer key" to know what they should expect to see. On to some of the other points, this is my summary of some of the common threads I see in the feedback.
Do you feel like that captures the bulk of your points? I'm working on an updated proposal that slims this down to a much narrower first deliverable that emphasizes providing the minimum necessary data and queries to illustrate each issue, while focusing purely on edge cases. This will likely start as a purely static dataset, where the only axis of configurability is selecting which tests to include.
I've updated the proposal in the top comment to reflect the feedback and suggest a significantly different course. Please take a look and provide feedback on the new proposal.
This proposal has been significantly modified as of 11/5. The original proposal can be expanded at the bottom of this post. Note that comments before 11/5 refer to the original proposal.
Is your feature request related to a problem? Please describe.
The behavior of Elasticsearch/OpenSearch changes between versions in ways both intentional (new features or datatypes, deprecated features) and unintentional (bugs). These changes impact upgrades and the decisions users make around them.
It would be useful to have a dataset that intentionally probed edge cases and behavior changes and could be used to verify the behavior of various versions and in various situations. As an example of how we would use this dataset: we could upload it to a cluster, run a series of queries, migrate/upgrade the cluster, and re-run the queries to ensure that the behavior has stayed the same (or changed only in expected ways).
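To make that workflow concrete, here is a minimal sketch of the verify-upgrade-verify loop, assuming a cluster at http://localhost:9200 and a hypothetical datapoint directory (`datapoints/dotted-field-names/`) with a bulk file and a query file. All names and paths are illustrative, and in practice the "before" and "after" steps would run around a real migration rather than in a single script.

```python
# Sketch only: assumes a local cluster and an illustrative datapoint layout.
import json
import requests

CLUSTER = "http://localhost:9200"

def index_bulk(path: str) -> None:
    # The file is assumed to already be newline-delimited JSON in _bulk format.
    with open(path) as f:
        resp = requests.post(
            f"{CLUSTER}/_bulk",
            data=f.read(),
            headers={"Content-Type": "application/x-ndjson"},
        )
    resp.raise_for_status()

def run_query(index: str, path: str) -> dict:
    with open(path) as f:
        return requests.post(f"{CLUSTER}/{index}/_search", json=json.load(f)).json()

# Before upgrading: load the "evil" data and record a baseline response.
index_bulk("datapoints/dotted-field-names/bulk.ndjson")
baseline = run_query("evil-dotted-fields", "datapoints/dotted-field-names/query.json")

# After upgrading: re-run the identical query and compare the part we care about.
current = run_query("evil-dotted-fields", "datapoints/dotted-field-names/query.json")
print("unchanged" if baseline["hits"]["total"] == current["hits"]["total"] else "differs")
```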
This would be a living dataset—there will be many cases relevant to future versions of OpenSearch that we're not aware of today.
Describe alternatives you've considered
The backwards compatibility tests (especially the Full Cluster Restart tests) use some randomized testing data (source), including a few that seem targeted towards specific edge cases (e.g. the "field.with.dots" field name), but they're quite small and limited. Additionally, they largely don't focus on edge cases and there's no concept of behavior changing across versions.
The other related material I've found is the OpenSearch Benchmark Workloads. Per my understanding, this is a collection of datasets with accompanying operations—index, update, aggregations to run, etc. The datasets seem to cover a broad list of realistic use-case scenarios, and are therefore interesting, but tailored to a different purpose. None of them seem to intentionally target the edge cases of interest here.
A previous version of this proposal suggested that datasets should be randomly generated, with the option to test scale- or performance-related limits. In this version, the suggestion has been scaled down to focus specifically on edge cases that are consistently reproducible.
Describe the solution you'd like
I'd like to create a library of data "points" that can be used independently or together to
Each datapoint (not necessarily a single document) would be a directory that contains one or more of:
Datapoints would generally create and use their own index to prevent interference between tests. This also allows the index name to be hard-coded into the bulk JSON file and the query.
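One possible way to model such a datapoint in the runner tooling is sketched below; the file names, fields, and tags are hypothetical placeholders, not a committed layout.

```python
# Illustrative model of a datapoint directory; names and fields are assumptions.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Datapoint:
    """A self-contained edge case: its own index, data, query, and answer key."""
    root: Path                    # e.g. datapoints/dotted-field-names/
    index: str                    # dedicated index, hard-coded in bulk.ndjson and query.json
    description: str              # which bug, feature, or API change this illustrates
    tags: list[str] = field(default_factory=list)  # e.g. ["mapping", ">=2.0"], used later for filtering

    @property
    def bulk_file(self) -> Path:
        return self.root / "bulk.ndjson"

    @property
    def query_file(self) -> Path:
        return self.root / "query.json"

    @property
    def expected_file(self) -> Path:
        return self.root / "expected.json"
```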
Phase 0:
Data points are created with data & queries to illustrate known bugs, features, and API changes. They are run manually by a user (documentation provided) for each use case the user is interested in, and the actual query result can be compared to the expected result.
Phase 1:
A "runner" script is added that can take a list of test cases, run them all, and show which did not give the expected result.
Phase 2:
Test cases can be tagged with specific versions or areas of interest (e.g. test cases for a specific plugin) and the runner script can select all datapoints meeting a specific use case.
After phase 1, this has a large amount of potential overlap with the future of #24 and the validation framework, so I haven't attempted to extrapolate too far down the path of what comes next.
Original Proposal
Is your feature request related to a problem? Please describe.
I (and my team) would like to make use of a consistent dataset for testing on OpenSearch that emphasizes edge cases--in our case, this would be very helpful for testing migrations and upgrades. While there are plenty of sample datasets out there (some mentioned below), our hope for this one is that it's a fairly comprehensive dataset that can capture intentional or unintentional differences in behavior in various settings, such as different versions.
A few categories we're aware of wanting to test: all currently existing data types, cases where dynamic field mapping behavior has changed, cases where new data formats were added, cases that approach the size limits for each field type, anywhere bugs have been fixed in various versions for ingestion or storage of specific field types. We're expecting to find more as we go and would love suggestions.
This would be a living dataset—there will be cases relevant to future versions of OpenSearch that we're not aware of today.
As an example of how we would use this dataset: we could upload it to a cluster, run a series of queries, migrate/upgrade the cluster, and re-run the queries to ensure that the behavior has stayed the same (or changed only in expected ways).
Describe alternatives you've considered
The backwards compatibility tests (especially the Full Cluster Restart tests) use some randomized testing data (source), including a few that seem targeted towards specific edge cases (e.g. the "field.with.dots" field name).
The other related material I've found is the OpenSearch Benchmark Workloads. Per my understanding, this is a collection of datasets with accompanying operations—index, update, aggregations to run, etc. The datasets seem to cover a broad list of realistic use-case scenarios, and are therefore interesting, but tailored to a different purpose. None of them seem to intentionally target the edge cases of interest here.
I haven't come across other similar datasets, but would love to be pointed in their direction if they exist.
Describe the solution you'd like
Requirements for the dataset:
Given the requirements outlined above, it seems more feasible to create a script that randomly generates appropriate data on demand than to maintain a fixed dataset.
With this approach in place, there's a 7th requirement:
The user can provide (likely via a CLI) their requirements. For an MVP, this is probably just the number of documents (or total size of data) and an optional seed. Future iterations could accept the set of fields to include (requirement 4 above) and the output format (requirement 5).
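As a sketch of what that MVP command line might look like (argument names are placeholders, not a committed interface):

```python
# Possible shape of the MVP CLI: document count plus an optional seed.
import argparse
import random

parser = argparse.ArgumentParser(description="Generate an 'evil' edge-case dataset")
parser.add_argument("--num-docs", type=int, default=1000,
                    help="number of documents to generate")
parser.add_argument("--seed", type=int, default=None,
                    help="optional seed for reproducible output")
args = parser.parse_args()

rng = random.Random(args.seed)  # the same seed yields the same dataset, which aids debugging
print(f"generating {args.num_docs} documents (seed={args.seed})")
```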
Setting aside input and export related functionality, the core of this script would be very similar to libraries like faker.js/python faker that generate realistic fake data, and looking into their architecture may be helpful. For some specialized fields, it's possible that leveraging one of these libraries could be useful.
In the code, there needs to be a mapping between fields and functions that generate appropriate data. Many of these will be very basic—random alphanumeric string, random int, etc.—with some more complicated ones (e.g. IP ranges or data that satisfies a specific edge case). Adding a new field to the dataset will require creating the generator function and adding it to the mapping under the field name.
For each field that's added, there also may (or may not) be 1/ one or more queries associated with the field (and their expected values), and 2/ an index field mapping entry.
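To illustrate the shape of that mapping, here is a small sketch; the field names, generators, and optional mapping entries are all hypothetical examples rather than the actual set of cases.

```python
# Illustrative field-to-generator mapping; entries and mappings are assumptions.
import random
import string

rng = random.Random(42)  # seeded for reproducibility

def random_keyword(max_len: int = 16) -> str:
    return "".join(rng.choices(string.ascii_lowercase + string.digits,
                               k=rng.randint(1, max_len)))

def random_int() -> int:
    return rng.randint(-2**31, 2**31 - 1)

def dotted_field_edge_case() -> dict:
    # Targets dynamic-mapping behavior for field names containing dots.
    return {"field.with.dots": random_keyword()}

# Each entry can optionally carry an explicit index mapping (and, per the note
# above, associated queries with expected results).
FIELD_GENERATORS = {
    "keyword_field": {"generate": random_keyword, "mapping": {"type": "keyword"}},
    "int_field": {"generate": random_int, "mapping": {"type": "integer"}},
    "dotted_field": {"generate": dotted_field_edge_case, "mapping": None},
}

def generate_document() -> dict:
    doc = {}
    for name, spec in FIELD_GENERATORS.items():
        value = spec["generate"]()
        if isinstance(value, dict):
            # Edge-case generators may return whole sub-objects rather than scalars.
            doc.update(value)
        else:
            doc[name] = value
    return doc

print(generate_document())
```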
It's possible that for some types of queries ("how many times does 'elephant' occur"), the randomized data is a poor match. As we encounter these cases, I think having a second, static dataset would be helpful. Adopting the benchmark workloads might be a good fit for this use case.
Specific Questions