Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single JSON schema validation in early boot #1175

Merged
merged 20 commits into from
Jan 18, 2022

Conversation

blackboxsw
Copy link
Collaborator

@blackboxsw blackboxsw commented Jan 7, 2022

Move full JSON schema validation of user-data to just after user-data has been collected.

The goals or this refactor are the following:

  • Provide a single cohesive static representation of all cloud-config schema in one file.
  • Avoid repetitive validate_cloudconfig_schema calls out of each individual cc_* module and run full schema validation once against user-data
  • Eventually publish full static versioned JSON schema file both in cloud-init packaging and to external JSON validation services

TODO

  • Create a useful single commit message for this PR
  • Add integration test
  • Followup PR: Pull in the rest of @TheRealFalcon's schema definitions for cloudinit/config/cc_[a-d].* modules into config/cloud-init-schema*.json

Proposed Commit Message

schema: single JSON schema validation in early boot

Package a single versioned JSON schema file at
/etc/cloud/schema/cloud-init-schema-<version>.json.

Add a Paths.schema_dir attribute and handle upgrade path
pickle deserialization.

Perform validate_cloudconfig_schema call to just after the
user-data is consumed. This will allow single validation of all
user-data against the full schema instead of
repetitive validatation calls against each cloud-config module
(cloudinit.config.cc_*) sub-schemas.

This branch defines the simple apt_pipelining schema and
migrates existing cc_apk_configure into cloud-init-schema-1.0.json.

The expectation will be additional branches to migrate from legacy
"schema" attributes inside each cloud-config module toward unique
cc_<module_name> definitions in the global shema file under "$defs"
of cloud-init-schema-X.Y..json.

Before legacy sub-schema definitions are migrated the following
funcs grew support to read sub-schemas from both static
cloud-init-schema.json and the individual cloud-config module
"schema" attributes:

- get_schema: source base schema file from cloud-init-schema.json
  and supplement with all legacy cloud-config module "schema" defs
- get_meta_doc: optional schema param so cloud-config modules
  no longer provide the own local sub-schemas
- _get_property_doc: render only documentation of sub-schema based
  on meta['id'] provided
- validate_cloudconfig_schema: allow optional schema param


Additionally, fix two minor bugs in _schemapath_for_cloudconfig:
- `cloud-init devel schema --annotate` which results in a Traceback
  if two keys at the same indent level have invalid types.
- exit early on empty cloud-config to avoid a Traceback on the CLI

Individual commits

Author: Chad Smith chad.smith@canonical.com
Date: Fri Jan 7 13:47:22 2022 -0700

schema: define schema meta for cc_apt_pipelining

commit bc6fcf5
Author: Chad Smith chad.smith@canonical.com
Date: Fri Jan 7 13:36:35 2022 -0700

schema: support legacy module schema and compisite global schema

Ultimately cloud-init wants a single JSON schema file defining
composite schema for all cc_* modules instead of each module
defining it's own sub-schema.

Legacy schema definitions took the form or a "schema" attribute
in each cc_* module. The "new" way would be adding schema
$defs to the single config/cloud-init-schema-<version> file.

End goal is to publish those static composite JSON schema files
to schema validation services to allow externals to consume and
validate user-data before spending resources on invalid
instance launches.

During migration from legacy schema definitions in each cc_*
module, cloudinit needs to support cc_* modules which define
schema localling within the cc_* module as a "schema" attribute.

- get_schema: source config/cloud-init-schema* and supplement with
  legacy cc_*.schema definitions to bridge gap between legacy
  cc_*.schema defitions and the "new" full config/cloud-init-schema.
- Add support for load_doc to handle legacy and combined schema defs
- get_meta_doc will validate against global composite schema when 'schema'
  param absent
- get_meta_doc will also limit scope of documentation to the specific
  meta["id"] attribute as defined in global schema $defs. To allow modules
  to avoid passing in their own sub-schema
- get_meta_doc needs to find_modules containing "meta" instead of "schema" as

s
non-legacy schema definitions exist in config/cloud-init-schema* instead of
in python dicts as a cc_*.schema attribute

commit e0f754b
Author: Chad Smith chad.smith@canonical.com
Date: Fri Jan 7 12:52:41 2022 -0700

schema: deliver single JSON schema file and Paths.schema_dir

- Define single JSON schema file to allow for validation of full
  schema for provided user-data.
- Package schema file in /etc/cloud/cloud-init-1.0.json
- Add Paths.schema_dir and unpickle upgrade test handling

commit f1a7c75
Author: Chad Smith chad.smith@canonical.com
Date: Wed Jan 5 12:45:29 2022 -0700

schema: perform validate_cloudcfg_schema once in early boot

Validate full schema one time early in boot just after user-data is
processed. This avoid 51+ repeated calls to validate_cloudcfg_schema
on subsections of the schema.

Moving validation to one place against the whole schema will allow
for a single concise error/warning about schema violations in one place

commit e0f754b
Author: Chad Smith chad.smith@canonical.com
Date: Fri Jan 7 12:52:41 2022 -0700

schema: deliver single JSON schema file and Paths.schema_dir

- Define single JSON schema file to allow for validation of full
  schema for provided user-data.
- Package schema file in /etc/cloud/cloud-init-1.0.json
- Add Paths.schema_dir and unpickle upgrade test handling

commit f1a7c75
Author: Chad Smith chad.smith@canonical.com
Date: Wed Jan 5 12:45:29 2022 -0700

schema: perform validate_cloudcfg_schema once in early boot

Validate full schema one time early in boot just after user-data is
processed. This avoid 51+ repeated calls to validate_cloudcfg_schema
on subsections of the schema.

Moving validation to one place against the whole schema will allow
for a single concise error/warning about schema violations in one place
in logs.

Additional Context

Test Steps

# validate schema docs and validation work for cc_apt_pipelining by setting up Paths to point to your local schema
cat > /etc/cloud/cloud.cfg.d/dev-schema.cfg <<EOF
system_info:
 paths:
   schema_dir: /YOUR_CLOUD_INIT_REPO_PATH/cloud-init/config
EOF
python3 -m cloudinit.cmd.main devel schema --docs cc_apt_pipelining
# check legacy schema docs still show up
python3 -m cloudinit.cmd.main devel schema --docs cc_apt_bootcmd

# Check annotation failures on legacy and "new" schema definitions
$ cat > invalid-legacy-and-new-schema.yaml <<EOF
#cloud-config
apt_pipelining: bogus
bootcmd: -1
EOF

$ PYTHONPATH=. python3 -m cloudinit.cmd.main devel schema -c  invalid-legacy-and-new-schema.yaml --annotate

# integration test 
tox -e integration-tests tests/integration_tests/modules/test_cli.py::test_invalid_userdata_schema

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • I have updated or added any documentation accordingly

@blackboxsw blackboxsw force-pushed the single-schema-validation branch 5 times, most recently from 9a04f5b to 2f75ba4 Compare January 10, 2022 02:51
Copy link
Member

@TheRealFalcon TheRealFalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good with some minor comments inline.

One other issue. I noticed tests/unittests/config/test_schema.py isn't validating the cc_apt_pipelining examples. If I add something like apt_pipelining: ['abc', 1.5] to the examples list, it still passes, though I'm not sure why.

os.path.join(paths.schema_dir, "cloud-init-schema-*")
)
full_schema = None
for schema_file in schema_files:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we error or warning if we find more than one schema file? As written, this will loop through all of the schema files and use the last one found as the full_schema. Is there ever a reason to have more than one schema file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheRealFalcon what I was uncertain of when I first wrote this was whether we'd support plugin schemas in the future. I think we ended up having a discussion afterward that we would not look to support a parts-dir style composite schema. We would prefer downstream/private module owners to upstream their schema if they want schema validation coverage. So, I think it's safe to read one schema file only, certainly for this iteration. I'll drop the loop and hardcode the schema we read.

cloudinit/config/schema.py Show resolved Hide resolved
@blackboxsw blackboxsw force-pushed the single-schema-validation branch 2 times, most recently from 1fd4d48 to aa319ee Compare January 12, 2022 23:08
Copy link
Collaborator Author

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes/comments for reviewers.

}

__doc__ = get_meta_doc(meta, schema)
__doc__ = get_meta_doc(meta)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -10,7 +10,7 @@

from cloudinit import log as logging
from cloudinit import temp_utils, templater, util
from cloudinit.config.schema import get_meta_doc, validate_cloudconfig_schema
from cloudinit.config.schema import get_meta_doc
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migrated schema modules will no longer define "schema" attribute or run validate_cloudconfig_schema since we run schema validation one-time against the full schema after the user-data is processed in main.py:480

@@ -593,7 +618,7 @@ def load_doc(requested_modules: list) -> str:
for mod_name in all_modules:
if "all" in requested_modules or mod_name in requested_modules:
(mod_locs, _) = importer.find_module(
mod_name, ["cloudinit.config"], ["schema"]
mod_name, ["cloudinit.config"], ["meta"]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New-style schema definitions in cc_* modules will only have a "meta" attribute, not a local schema.

)
full_schema = None
if len(schema_files) > 1:
LOG.warning(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to break the world if we don't have proper schema definitions. A warning will suffice.

Comment on lines 402 to 411
read_cfg_paths.return_value = paths
# New-style schema $defs exist in config/cloud-init-schema*.json
for schema_file in Path(cloud_init_project_dir("config/")).glob(
"cloud-init-schema*.json"
):
util.copy(schema_file, paths.schema_dir)
schema = get_schema()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to make a json_schema fixture at some point, but I didn't think to create that in this branch. This is a lot of boilerplate setup for us to pull into all 51 test_cc_* modules.



class TestAptPipelining(CiTestCase):
class TestAptPipelining:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally want to move away for CiTestCase subclassese wherever possible to get the benefits of pytest features/fixtures/etc.

@blackboxsw blackboxsw force-pushed the single-schema-validation branch from e3150cf to d5ae0d8 Compare January 13, 2022 03:00
@blackboxsw
Copy link
Collaborator Author

Copy link
Member

@TheRealFalcon TheRealFalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, except we're missing the schema property descriptions from the docs.
image

That's the only major thing. Additionally:

For the parametrized tests, I would consider only testing for invalid schemas, and adding the valid schemas to the meta examples. These get tested via test_schema.py, and add some additional documentation for how to use the module. Not necessary for this PR, but something to consider.

Make sure you rebase if you haven't recently. I merged a new community provided keyboard module yesterday that included its own schema. It shouldn't cause any problems, but just make sure no unit tests are affected with the new structure.

cloudinit/config/cc_apt_pipelining.py Show resolved Hide resolved
Validate full schema one time early in boot just after user-data is
processed. This avoid 51+ repeated calls to validate_cloudcfg_schema
on subsections of the schema.

Moving validation to one place against the whole schema will allow
for a single concise error/warning about schema violations in one place
in logs.
- Define single JSON schema file to allow for validation of full
  schema for provided user-data.
- Package schema file in /etc/cloud/cloud-init-1.0.json
- Add Paths.schema_dir and unpickle upgrade test handling
Ultimately cloud-init wants a single JSON schema file defining
composite schema for all cc_* modules instead of each module
defining it's own sub-schema.

Legacy schema definitions took the form or a "schema" attribute
in each cc_* module. The "new" way would be adding schema
$defs to the single config/cloud-init-schema-<version> file.

End goal is to publish those static composite JSON schema files
to schema validation services to allow externals to consume and
validate user-data before spending resources on invalid
instance launches.

During migration from legacy schema definitions in each cc_*
module, cloudinit needs to support cc_* modules which define
schema localling within the cc_* module as a "schema" attribute.

- get_schema: source config/cloud-init-schema* and supplement with
  legacy cc_*.schema definitions to bridge gap between legacy
  cc_*.schema defitions and the "new" full config/cloud-init-schema.
- Add support for load_doc to handle legacy and combined schema defs
- get_meta_doc will validate against global composite schema when 'schema'
  param absent
- get_meta_doc will also limit scope of documentation to the specific
  meta["id"] attribute as defined in global schema $defs. To allow modules
  to avoid passing in their own sub-schema
- get_meta_doc needs to find_modules containing "meta" instead of "schema" as
  non-legacy schema definitions exist in config/cloud-init-schema* instead of
  in python dicts as a cc_*.schema attribute
Add tests for _schemapath_for_cloudconfig as it was
incorrectly constructing paths prefixes for list items.

Prevent concatonation of path_prefix to a key if the
key is the same indent_depth as prior line.
To allow for building docs locally in the project-dir, get_schema
needs to source ./config/cloud-init-schema*json which is data files
outside of the typical python module paths.

get_schema will prefer CLOUD_INIT_SCHEMA_DIR environment
variable over Paths.schema_dir if provided.

Allow tox.ini to passenv CLOUD_INIT_SCHEMA_DIR into the tox env.

To generate docs from project directory:
 CLOUD_INIT_SCHEMA_DIR=config tox -e doc

To validate schema docs from project directory:
 CLOUD_INIT_SCHEMA_DIR=config PYTHONPATH=. python3 \
     -m cloudinit.cmd.main devel schema --docs all
@blackboxsw blackboxsw force-pushed the single-schema-validation branch from d141767 to 2b43a2e Compare January 13, 2022 21:38
@TheRealFalcon
Copy link
Member

@blackboxsw , hmmm, I can't build docs now.

(cloud-init36) newt:cloud-init (single-schema-validation) $ tox -re doc
GLOB sdist-make: /home/james/c/cloud-init/setup.py
ERROR: invocation failed (exit code 1), logfile: /home/james/c/cloud-init/.tox/log/GLOB-0.log
========================================================================================== log start ==========================================================================================
git describe version (None) differs from cloudinit.version (21.4)
Please get the latest upstream tags.
As an example, this can be done with the following:
$ git remote add upstream https://git.launchpad.net/cloud-init
$ git fetch upstream --tags
Traceback (most recent call last):
  File "setup.py", line 331, in <module>
    version=get_version(),
  File "setup.py", line 73, in get_version
    ver = subprocess.check_output(cmd)
  File "/home/james/.pyenv/versions/3.6.10/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/home/james/.pyenv/versions/3.6.10/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/home/james/envs/cloud-init36/bin/python', 'tools/read-version']' returned non-zero exit status 1.

=========================================================================================== log end ===========================================================================================
ERROR: FAIL could not package project - v = InvocationError('/home/james/envs/cloud-init36/bin/python setup.py sdist --formats=zip --dist-dir .tox/dist', 1)

If I switch to another branch I don't get this. Have you seen this?

@blackboxsw
Copy link
Collaborator Author

blackboxsw commented Jan 13, 2022

git describe version (None) differs from cloudinit.version (21.4)

@TheRealFalcon, Yes your poor trace there is telling you that tools/read-version can't find git tags to properly git describe the local directory. But the ordering of your output seems mixed up. I would not have expected to see the toos/read-version non-zero exit output as a result of the git pull --tags call though. To you have githooks in your env that trigger something ?

I'd suspect a second git pull --tags should help there?

git describe version (None) differs from cloudinit.version (21.4)
...
subprocess.CalledProcessError: Command '['/home/james/envs/cloud-init36/bin/python', 'tools/read-version']' returned non-zero exit status 1.

I also can't reproduce this on a system with

git clone --depth 1 git@github.com:canonical/cloud-init.git
cd cloud-init
git remote add ups https://git.launchpad.net/cloud-init
git fetch ups --tags

Copy link
Member

@TheRealFalcon TheRealFalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @blackboxsw , this looks good.

I think we can remove the schema_dir from the conftest. I'll let you do that and then you're free to merge.

conftest.py Outdated Show resolved Hide resolved
@blackboxsw blackboxsw merged commit 4ba6fd2 into canonical:main Jan 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants