-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support --empty flag for schema-only dry runs #8971
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide. |
This comment was marked as outdated.
This comment was marked as outdated.
184b6d9
to
8c99a48
Compare
Spike Report The crux of this issue is in resolving
I've opted for (2) in this spike and would recommend it for its relative simplicity and given there is some precedence for doing a small amount of sql templating in the adapter python module. |
691942a
to
05606ee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple questions but generally LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does using a limit of sys.maxsize
instead of None
make the code a little cleaner?
Then we might not even need resolve_limit
.
I can see how that could clean this up but it would mean that every rendered relation would be wrapped in a (select * limit max_size) - whether we are using the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline, 🚢
Adapters need to be provided the ability to override the SQL statement here at minimum, because using a LIMIT clause in this way in BigQuery means that users will still be billed the full cost as if they had run their model for real and they will be very confused by this. BigQuery does provide some facility to avoid this but it would require either overriding the SQL statement here, or overriding the behaviour of |
I just happened to land that PR randomly looking at the latest commits and I think I've an interesting opened discussion that correlate with your intent here: That feature is a great step toward easily dry run.
Yet it's very manual (but working fairly well because developers love copy pasting existing models). However I would like to do the following on CI:
It feels like to do so properly, having |
Opened a new issue in dbt-labs/docs.getdbt.com: dbt-labs/docs.getdbt.com#4572 |
@github-christophe-oudar That's exactly the motivation behind this @mwstanleyft In my experience, while |
@mwstanleyft -- thanks for highlighting this issue for adapters in general and BigQuery specifically. For BigQuery, I believe the default implementation of No limit => full scan Actually - it seems like BigQuery scans 0 bytes if a limit 0 is provided: Regardless though - adapter implementations can overwrite the default behaviour (in python, not in user-space) of We'll definitely keep this issue in mind as we continue to refine sampling / limiting beyond this 'empty' use case. |
* moving types_pb2.py to common/events * Restore warning on unpinned git packages (#9157) * Support --empty flag for schema-only dry runs (#8971) * Fix ensuring we produce valid jsonschema artifacts for manifest, catalog, sources, and run-results (#9155) * Drop `all_refs=True` from jsonschema-ization build process Passing `all_refs=True` makes it so that Everything is a ref, even the top level schema. In jsonschema land, this essentially makes the produced artifact not a full schema, but a fractal object to be included in a schema. Thus when `$id` is passed in, jsonschema tools blow up because `$id` is for identifying a schema, which we explicitly weren't creating. The alternative was to drop the inclusion of `$id`. Howver, we're intending to create a schema, and having an `$id` is recommended best practice. Additionally since we were intending to create a schema, not a fractal, it seemed best to create to full schema. * Explicity produce jsonschemas using DRAFT_2020_12 dialect Previously were were implicitly using the `DRAFT_2020_12` dialect through mashumaro. It felt wise to begin explicitly specifying this. First, it is closest in available mashumaro provided dialects to what we produced pre 1.7. Secondly, if mashumaro changes its default for whatever reason (say a new dialect is added, and mashumaro moves to that), we don't want to automatically inherit that. * Bump manifest version to v12 Core 1.7 released with manifest v11, and we don't want to be overriding that with 1.8. It'd be weird for 1.7 and 1.8 to both have v11 manifests, but for them to be different, right? * Begin including schema dialect specification in produced jsonschema In jsonschema's documentation they state > It's not always easy to tell which draft a JSON Schema is using. > You can use the $schema keyword to declare which version of the JSON Schema specification the schema is written to. > It's generally good practice to include it, though it is not required. and > For brevity, the $schema keyword isn't included in most of the examples in this book, but it should always be used in the real world. Basically, to know how to parse a schema, it's important to include what schema dialect is being used for the schema specification. The change in this commit ensures we include that information. * Create manifest v12 jsonschema specification * Add change documentation for jsonschema schema production fix * Bump run-results version to v6 * Generate new v6 run-results jsonschema * Regenerate catalog v1 and sources v3 with fixed jsonschema production * Update tests to handle bumped versions of manifest and run-results --------- Co-authored-by: Jeremy Cohen <jeremy@dbtlabs.com> Co-authored-by: Michelle Ark <MichelleArk@users.noreply.github.com> Co-authored-by: Quigley Malcolm <QMalcolm@users.noreply.github.com>
* remove dbt.contracts.connection imports from adapter module * Move events to common (#8676) * Move events to common * More Type Annotations (#8536) * Extend use of type annotations in the events module. * Add return type of None to more __init__ definitions. * Still more type annotations adding -> None to __init__ * Tweak per review * Allow adapters to include python package logging in dbt logs (#8643) * add set_package_log_level functionality * set package handler * set package handler * add logging about stting up logging * test event log handler * add event log handler * add event log level * rename package and add unit tests * revert logfile config change * cleanup and add code comments * add changie * swap function for dict * add additional unit tests * fix unit test * update README and protos * fix formatting * update precommit --------- Co-authored-by: Peter Webb <peter.webb@dbtlabs.com> * fix import * move types_pb2.py from events to common/events * move agate_helper into common * Add utils module (#8910) * moving types_pb2.py to common/events * split out utils into core/common/adapters * add changie * remove usage of dbt.config.PartialProject from dbt/adapters (#8909) * remove usage of dbt.config.PartialProject from dbt/adapters * add changie --------- Co-authored-by: Colin <colin.rogers@dbtlabs.com> * move agate_helper unit tests under tests/unit/common * move agate_helper into common (#8911) * move agate_helper into common * add changie --------- Co-authored-by: Colin <colin.rogers@dbtlabs.com> * remove dbt.flags.MP_CONTEXT usage in dbt/adapters (#8931) * remove dbt.flags.LOG_CACHE_EVENTS usage in dbt/adapters (#8933) * Refactor Base Exceptions (#8989) * moving types_pb2.py to common/events * Refactor Base Exceptions * update make_log_dir_if_missing to handle str * move remaining adapters exception imports to common/adapters --------- Co-authored-by: Michelle Ark <michelle.ark@dbtlabs.com> * Remove usage of dbt.deprecations in dbt/adapters, enable core & adapter-specific (#9051) * Decouple adapter constraints from core (#9054) * Move constraints to dbt.common * Move constraints to contracts folder, per review * Add a changelog entry. * move include/global_project to adapters (#8930) * remove adapter.get_compiler (#9134) * Move adapter logger to adapters (#9165) * moving types_pb2.py to common/events * Move AdapterLogger to adapter folder * add changie * delete accidentally merged types_pb2.py * Move the semver package to common and alter references. (#9166) * Move the semver package to common and alter references. * Alter leftover references to dbt.semver, this time using from syntax. --------- Co-authored-by: Mila Page <versusfacit@users.noreply.github.com> * Refactor EventManager setup and interaction (#9180) * moving types_pb2.py to common/events * move event manager setup back to core, remove ref to global EVENT_MANAGER and clean up event manager functions * move invocation_id from events to first class common concept * move lowercase utils to common * move lowercase utils to common * ref CAPTURE_STREAM through method * add changie * first pass: adapter migration script (#9160) * Decouple macro generator from adapters (#9149) * Remove usage of dbt.contracts.relation in dbt/adapters (#9207) * Remove ResultNode usage from connections (#9211) * Add RelationConfig Protocol for use in Relation.create_from (#9210) * move relation contract to dbt.adapters * changelog entry * first pass: clean up relation.create_from * type ignores * type ignore * changelog entry * update RelationConfig variable names * Merge main into feature/decouple-adapters-from-core (#9240) * moving types_pb2.py to common/events * Restore warning on unpinned git packages (#9157) * Support --empty flag for schema-only dry runs (#8971) * Fix ensuring we produce valid jsonschema artifacts for manifest, catalog, sources, and run-results (#9155) * Drop `all_refs=True` from jsonschema-ization build process Passing `all_refs=True` makes it so that Everything is a ref, even the top level schema. In jsonschema land, this essentially makes the produced artifact not a full schema, but a fractal object to be included in a schema. Thus when `$id` is passed in, jsonschema tools blow up because `$id` is for identifying a schema, which we explicitly weren't creating. The alternative was to drop the inclusion of `$id`. Howver, we're intending to create a schema, and having an `$id` is recommended best practice. Additionally since we were intending to create a schema, not a fractal, it seemed best to create to full schema. * Explicity produce jsonschemas using DRAFT_2020_12 dialect Previously were were implicitly using the `DRAFT_2020_12` dialect through mashumaro. It felt wise to begin explicitly specifying this. First, it is closest in available mashumaro provided dialects to what we produced pre 1.7. Secondly, if mashumaro changes its default for whatever reason (say a new dialect is added, and mashumaro moves to that), we don't want to automatically inherit that. * Bump manifest version to v12 Core 1.7 released with manifest v11, and we don't want to be overriding that with 1.8. It'd be weird for 1.7 and 1.8 to both have v11 manifests, but for them to be different, right? * Begin including schema dialect specification in produced jsonschema In jsonschema's documentation they state > It's not always easy to tell which draft a JSON Schema is using. > You can use the $schema keyword to declare which version of the JSON Schema specification the schema is written to. > It's generally good practice to include it, though it is not required. and > For brevity, the $schema keyword isn't included in most of the examples in this book, but it should always be used in the real world. Basically, to know how to parse a schema, it's important to include what schema dialect is being used for the schema specification. The change in this commit ensures we include that information. * Create manifest v12 jsonschema specification * Add change documentation for jsonschema schema production fix * Bump run-results version to v6 * Generate new v6 run-results jsonschema * Regenerate catalog v1 and sources v3 with fixed jsonschema production * Update tests to handle bumped versions of manifest and run-results --------- Co-authored-by: Jeremy Cohen <jeremy@dbtlabs.com> Co-authored-by: Michelle Ark <MichelleArk@users.noreply.github.com> Co-authored-by: Quigley Malcolm <QMalcolm@users.noreply.github.com> * Move BaseConfig to Common (#9224) * moving types_pb2.py to common/events * move BaseConfig and assorted dependencies to common * move ShowBehavior and OnConfigurationChange to common * add changie * Remove manifest from catalog and connection method signatures (#9242) * Add MacroResolverProtocol, remove lazy loading of manifest in adapter.execute_macro (#9243) * remove manifest from adapter.execute_macro, replace with MacroResolver + remove lazy loading * rename to MacroResolverProtocol * pass MacroResolverProtcol in adapter.calculate_freshness_from_metadata * changelog entry * fix adapter.calculate_freshness call * pass context to MacroQueryStringSetter (#9248) * moving types_pb2.py to common/events * remove manifest from adapter.execute_macro, replace with MacroResolver + remove lazy loading * rename to MacroResolverProtocol * pass MacroResolverProtcol in adapter.calculate_freshness_from_metadata * changelog entry * fix adapter.calculate_freshness call * pass context to MacroQueryStringSetter * changelog entry --------- Co-authored-by: Colin <colin.rogers@dbtlabs.com> * add macro_context_generator on adapter (#9251) * moving types_pb2.py to common/events * remove manifest from adapter.execute_macro, replace with MacroResolver + remove lazy loading * rename to MacroResolverProtocol * pass MacroResolverProtcol in adapter.calculate_freshness_from_metadata * changelog entry * fix adapter.calculate_freshness call * add macro_context_generator on adapter * fix adapter test setup * changelog entry * Update parser to support conversion metrics (#9173) * added ConversionTypeParams classes * updated parser for ConversionTypeParams * added step to populate input_measure for conversion metrics * version bump on DSI * comment back manifest generating line * updated v12 schemas * added tests * added changelog * Add typing for macro_context_generator, fix query_header_context --------- Co-authored-by: Colin <colin.rogers@dbtlabs.com> Co-authored-by: William Deng <33618746+WilliamDee@users.noreply.github.com> * Pass mp_context to adapter factory (#9275) * moving types_pb2.py to common/events * require core to pass mp_context to adapter factory * add changie * fix SpawnContext annotation * Fix include for decoupling (#9286) * moving types_pb2.py to common/events * fix include path in MANIFEST.in * Fix include for decoupling (#9288) * moving types_pb2.py to common/events * fix include path in MANIFEST.in * add index.html to in MANIFEST.in * move system client to common (#9294) * moving types_pb2.py to common/events * move system.py to common * add changie update README * remove dbt.utils from semver.py * remove aliasing connection_exception_retry * Update materialized views to use RelationConfigs and remove refs to dbt.utils (#9291) * moving types_pb2.py to common/events * add AdapterRuntimeConfig protocol and clean up dbt-postgress core imports * add changie * remove AdapterRuntimeConfig * update changelog * Add config field to RelationConfig (#9300) * moving types_pb2.py to common/events * add config field to RelationConfig * merge main into feature/decouple-adapters-from-core (#9305) * moving types_pb2.py to common/events * Update parser to support conversion metrics (#9173) * added ConversionTypeParams classes * updated parser for ConversionTypeParams * added step to populate input_measure for conversion metrics * version bump on DSI * comment back manifest generating line * updated v12 schemas * added tests * added changelog * Remove `--dry-run` flag from `dbt deps` (#9169) * Rm --dry-run flag for dbt deps * Add changelog entry * Update test * PR feedback * adding clean_up methods to basic and unique_id tests (#9195) * init attempt of adding clean_up methods to basic and unique_id tests * swapping cleanup method drop of test_schema to unique_schema to test breakage on docs_generate test * moving the clean_up method down into class BaseDocsGenerate * remove drop relation for unique_schema * manually define alternate_schema for clean_up as not being seen as part of project_config * add changelog * remove unneeded changelog * uncomment line that generates new manifest and delete manifest our changes created * make sure the manifest test is deleted and readd older version of manifest.json to appease test * manually revert file to previous commit * Revert "manually revert file to previous commit" This reverts commit a755419. --------- Co-authored-by: William Deng <33618746+WilliamDee@users.noreply.github.com> Co-authored-by: Jeremy Cohen <jeremy@dbtlabs.com> Co-authored-by: Matthew McKnight <91097623+McKnight-42@users.noreply.github.com> * resolve merge conflict on unparsed.py (#9309) * moving types_pb2.py to common/events * Update parser to support conversion metrics (#9173) * added ConversionTypeParams classes * updated parser for ConversionTypeParams * added step to populate input_measure for conversion metrics * version bump on DSI * comment back manifest generating line * updated v12 schemas * added tests * added changelog * Remove `--dry-run` flag from `dbt deps` (#9169) * Rm --dry-run flag for dbt deps * Add changelog entry * Update test * PR feedback * adding clean_up methods to basic and unique_id tests (#9195) * init attempt of adding clean_up methods to basic and unique_id tests * swapping cleanup method drop of test_schema to unique_schema to test breakage on docs_generate test * moving the clean_up method down into class BaseDocsGenerate * remove drop relation for unique_schema * manually define alternate_schema for clean_up as not being seen as part of project_config * add changelog * remove unneeded changelog * uncomment line that generates new manifest and delete manifest our changes created * make sure the manifest test is deleted and readd older version of manifest.json to appease test * manually revert file to previous commit * Revert "manually revert file to previous commit" This reverts commit a755419. --------- Co-authored-by: William Deng <33618746+WilliamDee@users.noreply.github.com> Co-authored-by: Jeremy Cohen <jeremy@dbtlabs.com> Co-authored-by: Matthew McKnight <91097623+McKnight-42@users.noreply.github.com> * Resolve unparsed.py conflict (#9311) * Update parser to support conversion metrics (#9173) * added ConversionTypeParams classes * updated parser for ConversionTypeParams * added step to populate input_measure for conversion metrics * version bump on DSI * comment back manifest generating line * updated v12 schemas * added tests * added changelog * Remove `--dry-run` flag from `dbt deps` (#9169) * Rm --dry-run flag for dbt deps * Add changelog entry * Update test * PR feedback * adding clean_up methods to basic and unique_id tests (#9195) * init attempt of adding clean_up methods to basic and unique_id tests * swapping cleanup method drop of test_schema to unique_schema to test breakage on docs_generate test * moving the clean_up method down into class BaseDocsGenerate * remove drop relation for unique_schema * manually define alternate_schema for clean_up as not being seen as part of project_config * add changelog * remove unneeded changelog * uncomment line that generates new manifest and delete manifest our changes created * make sure the manifest test is deleted and readd older version of manifest.json to appease test * manually revert file to previous commit * Revert "manually revert file to previous commit" This reverts commit a755419. --------- Co-authored-by: William Deng <33618746+WilliamDee@users.noreply.github.com> Co-authored-by: Jeremy Cohen <jeremy@dbtlabs.com> Co-authored-by: Matthew McKnight <91097623+McKnight-42@users.noreply.github.com> --------- Co-authored-by: colin-rogers-dbt <111200756+colin-rogers-dbt@users.noreply.github.com> Co-authored-by: Peter Webb <peter.webb@dbtlabs.com> Co-authored-by: Colin <colin.rogers@dbtlabs.com> Co-authored-by: Mila Page <67295367+VersusFacit@users.noreply.github.com> Co-authored-by: Mila Page <versusfacit@users.noreply.github.com> Co-authored-by: Jeremy Cohen <jeremy@dbtlabs.com> Co-authored-by: Quigley Malcolm <QMalcolm@users.noreply.github.com> Co-authored-by: William Deng <33618746+WilliamDee@users.noreply.github.com> Co-authored-by: Matthew McKnight <91097623+McKnight-42@users.noreply.github.com> Co-authored-by: Chenyu Li <chenyu.li@dbtlabs.com>
I finally tried this feature and found one major caveat: There are at least few solutions that could be explored to solve that problem:
|
Hey @github-christophe-oudar, I'm curious about your issue but I'm not following, can you clarify? Why would |
@seub it's because of dbt-labs/dbt-adapters#124, it's fixed since then |
resolves #8980
Problem
Solution
BaseRelation
changeslimit
toBaseRelation
BaseRelation.render_limited
: depending on the value ofself.limit
, this method templates sql to wrap a rendered relation in an 'empty' select statement. Further reasoning in the spike reportBaseRelation.__str__
to useself.render_limited
ifself.limit
is set--empty
flag torun
andbuild
commands.resolve_limit
on theBaseResolver
BaseResolver
plumbs itsresolve_limit
to theRelation
instance it is creating, which is responsible for rendering itself given an optionallimit
parameter.Checklist