fix: Change datatype of column type in BaseColumn to allow larger datatype names for complexed columns #17360

Merged

Conversation

cccs-joel
Contributor

SUMMARY

Some columns, such as those representing a complex structure (array, struct, enum, or a combination of these), may require more than 32 characters to store the datatype name. Changing the column to TEXT with no length limit was suggested by @villebro in the 1st associated issue listed below.
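
For illustration, a minimal sketch of what such a migration could look like; the concrete table name `table_columns` and the exact type objects are assumptions, not the literal diff in this PR:

```python
# Sketch only, assuming the concrete table is `table_columns`; the actual
# migration in this PR may differ in table coverage and revision details.
import sqlalchemy as sa
from alembic import op


def upgrade():
    # Widen `type` so long complex datatype names (nested arrays/structs/enums)
    # are no longer limited to 32 characters.
    with op.batch_alter_table("table_columns") as batch_op:
        batch_op.alter_column(
            "type",
            existing_type=sa.VARCHAR(length=32),
            type_=sa.Text(),
            existing_nullable=True,
        )


def downgrade():
    # Revert to the original VARCHAR(32); values longer than 32 characters
    # would be truncated or rejected depending on the database backend.
    with op.batch_alter_table("table_columns") as batch_op:
        batch_op.alter_column(
            "type",
            existing_type=sa.Text(),
            type_=sa.VARCHAR(length=32),
            existing_nullable=True,
        )
```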

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before: (screenshot)
After: (screenshot)

TESTING INSTRUCTIONS

After the migration, the easiest way to test is to edit an existing dataset and change the type value of a column (using the legacy datasource editor) to something longer than 32 characters. Superset should accept the change and confirm the row was updated in the database.
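
Alternatively, the migrated column type can be checked by reflecting the table; a sketch below, where the connection URI and the `table_columns` table name are placeholders for illustration, not Superset configuration:

```python
# Sketch: confirm the reflected type of the `type` column after upgrading.
# The connection URI and table name are assumptions, not Superset defaults.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://superset:superset@localhost/superset")
for col in inspect(engine).get_columns("table_columns"):
    if col["name"] == "type":
        # Expect TEXT (or the backend's equivalent) after the migration.
        print(col["type"])
```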

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided: downtime is minimal; the script takes about a second to execute.
  • Introduces new feature or API
  • Removes existing feature or API

@cccs-joel cccs-joel requested a review from a team as a code owner November 5, 2021 17:43
@betodealmeida betodealmeida changed the title Fix: Change datatype of column type in BaseColumn to allow larger datatype names for complexed columns fix: Change datatype of column type in BaseColumn to allow larger datatype names for complexed columns Nov 5, 2021
@betodealmeida betodealmeida added the risk:db-migration PRs that require a DB migration label Nov 5, 2021
@codecov

codecov bot commented Nov 5, 2021

Codecov Report

Merging #17360 (bb42450) into master (485852d) will decrease coverage by 0.04%.
The diff coverage is 100.00%.

❗ Current head bb42450 differs from pull request most recent head 7aebfa5. Consider uploading reports for the commit 7aebfa5 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17360      +/-   ##
==========================================
- Coverage   68.11%   68.07%   -0.05%     
==========================================
  Files        1653     1653              
  Lines       66374    66374              
  Branches     7121     7121              
==========================================
- Hits        45211    45182      -29     
- Misses      19266    19295      +29     
  Partials     1897     1897              
Flag Coverage Δ
hive 81.78% <100.00%> (ø)
postgres ?
presto 82.07% <100.00%> (+<0.01%) ⬆️
python 82.55% <100.00%> (-0.10%) ⬇️
sqlite 81.88% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/connectors/base/models.py 88.19% <100.00%> (ø)
superset/datasets/schemas.py 96.61% <100.00%> (ø)
superset/sql_validators/postgres.py 50.00% <0.00%> (-50.00%) ⬇️
superset/databases/commands/update.py 85.71% <0.00%> (-8.17%) ⬇️
superset/common/utils/dataframe_utils.py 85.71% <0.00%> (-7.15%) ⬇️
superset/databases/commands/create.py 82.35% <0.00%> (-5.89%) ⬇️
superset/reports/commands/log_prune.py 85.71% <0.00%> (-3.58%) ⬇️
superset/commands/importers/v1/utils.py 89.13% <0.00%> (-2.18%) ⬇️
superset/databases/api.py 90.94% <0.00%> (-2.10%) ⬇️
superset/db_engine_specs/postgres.py 96.36% <0.00%> (-0.91%) ⬇️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d2299c...7aebfa5. Read the comment docs.

@etr2460
Member

etr2460 commented Nov 5, 2021

Should we use a VARCHAR with a longer character limit here instead of TEXT? I think there are performance implications to using TEXT, but I'm not enough of a backend engineer to know for sure.

@cccs-joel
Contributor Author

> Should we use a VARCHAR with a longer character limit here instead of TEXT? I think there are performance implications to using TEXT, but I'm not enough of a backend engineer to know for sure.

We could... but how much longer? We deal with complex columns (think many levels of nested elements), and the schema becomes the default type when saving the query as a dataset in SQL Lab. Others reported similar symptoms in the issues above. But yeah, more than happy to hear from the engineers about the potential side effects of this change.
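
To give a sense of scale, a hypothetical Presto/Trino-style nested type name (not taken from this PR) already blows past 32 characters:

```python
# Hypothetical complex column type name for illustration; real nested
# structs/arrays saved from SQL Lab can run to thousands of characters.
complex_type = (
    "ROW(user ROW(id BIGINT, name VARCHAR, tags ARRAY(VARCHAR)), "
    "events ARRAY(ROW(ts TIMESTAMP, payload MAP(VARCHAR, VARCHAR))))"
)
print(len(complex_type))  # far more than the old 32-character limit
```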

@etr2460
Member

etr2460 commented Nov 5, 2021

How long are we talking? More than 255 characters? More than 1000? Sorry, I don't know just how complex this can get.

@ktmud
Member

ktmud commented Nov 5, 2021

I'm wondering if we can have two columns for this... store 95% of cases in VARCHAR with a reasonable limit and use TEXT to store large ENUMs and more advanced structs. Then there can be some helper functions to translate the complex types to more generic types to be used by the UI.

@cccs-joel
Contributor Author

> How long are we talking? More than 255 characters? More than 1000? Sorry, I don't know just how complex this can get.

I have use cases with more than 3,000 characters; it's hard to predict.

@github-actions
Contributor

⚠️ @cccs-joel Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@betodealmeida
Member

> Should we use a VARCHAR with a longer character limit here instead of TEXT? I think there are performance implications to using TEXT, but I'm not enough of a backend engineer to know for sure.

I know for a fact that for Postgres there's no cost in using TEXT instead of VARCHAR, and it might even be faster in some cases. Not sure about MySQL and other DBs.

Member

@betodealmeida betodealmeida left a comment

Looks good.

For reference, in the models for SIP-68 I'm also using TEXT for column types.

@villebro
Member

> Should we use a VARCHAR with a longer character limit here instead of TEXT? I think there are performance implications to using TEXT, but I'm not enough of a backend engineer to know for sure.

> I know for a fact that for Postgres there's no cost in using TEXT instead of VARCHAR, and it might even be faster in some cases. Not sure about MySQL and other DBs.

It's also my experience that VARCHAR and TEXT have pretty similar performance on all databases I've used. I don't think it will have any performance impact in this case.

Member

@villebro villebro left a comment

LGTM - just wondering if we should add a note in UPDATING.md, as this migration may take some time to complete on large deployments?

@etr2460
Member

etr2460 commented Nov 16, 2021

Thanks to people smarter than I for double-checking the perf implications.

We definitely should have a note in UPDATING though, as this will probably require a table lock on MySQL DBs. Otherwise, LGTM.

@ktmud
Member

ktmud commented Nov 16, 2021

IIRC, there are some implications on search performance for VARCHAR vs TEXT in MySQL. Search with indexing is either not possible or much slower for TEXT, depending on which storage engine you use and which MySQL version you are on.

@cccs-joel
Contributor Author

> Thanks to people smarter than I for double-checking the perf implications.
>
> We definitely should have a note in UPDATING though, as this will probably require a table lock on MySQL DBs. Otherwise, LGTM.

Sorry, I'm new to this process, should I write this note?

@cccs-joel
Contributor Author

> IIRC, there are some implications on search performance for VARCHAR vs TEXT in MySQL. Search with indexing is either not possible or much slower for TEXT, depending on which storage engine you use and which MySQL version you are on.

Thanks for your input, I appreciate it as this is not my expertise. I still need guidance on whether we should use a VARCHAR with a higher limit or TEXT for that specific column.

@github-actions
Contributor

⚠️ @cccs-joel Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@betodealmeida
Member

> Thanks to people smarter than I for double-checking the perf implications.
> We definitely should have a note in UPDATING though, as this will probably require a table lock on MySQL DBs. Otherwise, LGTM.
>
> Sorry, I'm new to this process, should I write this note?

Yeah, just add it to the file (https://github.com/apache/superset/blob/master/UPDATING.md) under the next release.

@betodealmeida
Member

> IIRC, there are some implications on search performance for VARCHAR vs TEXT in MySQL. Search with indexing is either not possible or much slower for TEXT, depending on which storage engine you use and which MySQL version you are on.
>
> Thanks for your input, I appreciate it as this is not my expertise. I still need guidance on whether we should use a VARCHAR with a higher limit or TEXT for that specific column.

Looks like with MySQL we can still use TEXT, but we need to specify a length in order to have an index: https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-column-prefixes. So we could change to TEXT, and later still add an index if needed.

Performance-wise there seems to be an extra cost when operating on TEXT (https://dba.stackexchange.com/a/222182), but I think it's safe to assume that we're only going to do simple scans on this table, so it should be fine from what I understand.

Since we don't know the maximum expected size of this column I think it's OK to:

  1. Switch to TEXT
  2. If needed in the future, add an index prefix after consulting the community on the size (sketched below)
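
If that ever becomes necessary, a prefix index on the TEXT column could be added in a later migration along these lines (a sketch, not part of this PR; the index name, table name, and prefix length of 32 are assumptions):

```python
# Sketch of a possible follow-up migration adding a prefix index on `type`.
from alembic import op


def upgrade():
    op.create_index(
        "ix_table_columns_type",
        "table_columns",
        ["type"],
        # MySQL requires an explicit prefix length to index a TEXT column;
        # other backends ignore this dialect-specific argument.
        mysql_length={"type": 32},
    )


def downgrade():
    op.drop_index("ix_table_columns_type", table_name="table_columns")
```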

@cccs-joel
Contributor Author

cccs-joel commented Nov 24, 2021

> Thanks to people smarter than I for double-checking the perf implications.
> We definitely should have a note in UPDATING though, as this will probably require a table lock on MySQL DBs. Otherwise, LGTM.
>
> Sorry, I'm new to this process, should I write this note?
>
> Yeah, just add it to the file (https://github.com/apache/superset/blob/master/UPDATING.md) under the next release.
Done here: #17541, unless you want me to do it in this pull request.

@cccs-joel cccs-joel mentioned this pull request Nov 24, 2021
@github-actions
Contributor

⚠️ @cccs-joel Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Contributor

github-actions bot commented Dec 3, 2021

⚠️ @cccs-joel Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@cccs-joel
Contributor Author

Can someone take a look at this PR? Some checks didn't pass for obscure reasons, but other than that it seems ready to go.

Member

@betodealmeida betodealmeida left a comment

@cccs-joel this looks great, but we need to update the down revision in your migration before merging.

Review comment on UPDATING.md (outdated, resolved)

# revision identifiers, used by Alembic.
revision = "3ba29ecbaac5"
down_revision = "b92d69a6643c"
Member

I think your tests are failing because your migration is now introducing a second HEAD. You can change the down revision here and in line 20 to abe27eaf93db (which you can see if you run superset db heads).
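
For reference, applying that suggestion would leave the identifiers block in the migration reading roughly as follows (a sketch based on the comment above):

```python
# revision identifiers, used by Alembic.
revision = "3ba29ecbaac5"
down_revision = "abe27eaf93db"  # current head, as reported by `superset db heads`
```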

@github-actions
Contributor

⚠️ @cccs-joel Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

Labels: 🏷️ bot, risk:db-migration (PRs that require a DB migration), size/M, 🚢 1.5.0