Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 Update destination-s3 to handle the new data types protocol #20088

Merged
merged 18 commits into from
Dec 15, 2022

Conversation

suhomud
Copy link
Contributor

@suhomud suhomud commented Dec 5, 2022

What

New version protocol support V1. See https://github.com/airbytehq/airbyte/blob/master/airbyte-protocol/protocol-models/src/main/resources/airbyte_protocol/well_known_types.yaml

How

In order to support new protocol the base-java-s3 module should handle new json schema types:

   STRING_V1("WellKnownTypes.json#/definitions/String", Schema.Type.STRING),
   INTEGER_V1("WellKnownTypes.json#/definitions/Integer", Schema.Type.INT),
   NUMBER_V1("WellKnownTypes.json#/definitions/Number", Schema.Type.DOUBLE),
   BOOLEAN_V1("WellKnownTypes.json#/definitions/Boolean", Schema.Type.BOOLEAN),
   BINARY_DATA_V1("WellKnownTypes.json#/definitions/BinaryData", Schema.Type.BYTES),
   DATE_V1("WellKnownTypes.json#/definitions/Date", Schema.Type.INT),
   TIMESTAMP_WITH_TIMEZONE_V1("WellKnownTypes.json#/definitions/TimestampWithTimezone", Schema.Type.LONG),
   TIMESTAMP_WITHOUT_TIMEZONE_V1("WellKnownTypes.json#/definitions/TimestampWithoutTimezone", Schema.Type.LONG),
   TIME_WITH_TIMEZONE_V1("WellKnownTypes.json#/definitions/TimeWithTimezone", Schema.Type.LONG),
   TIME_WITHOUT_TIMEZONE_V1("WellKnownTypes.json#/definitions/TimeWithoutTimezone", Schema.Type.LONG),

V0 protocol is still supported but marked as @Deprecated:

  @Deprecated
   STRING_V0("string", null, Schema.Type.STRING),
   @Deprecated
   NUMBER_INT_V0("number", "integer", Schema.Type.INT),
   @Deprecated
   NUMBER_BIGINT_V0("string", "big_integer", Schema.Type.STRING),
   @Deprecated
   NUMBER_FLOAT_V0("number", "float", Schema.Type.FLOAT),
   @Deprecated
   NUMBER_V0("number", null, Schema.Type.DOUBLE),
   @Deprecated
   INTEGER_V0("integer", null, Schema.Type.INT),
   @Deprecated
   BOOLEAN_V0("boolean", null, Schema.Type.BOOLEAN),
   @Deprecated
   NULL("null", null, Schema.Type.NULL);

This PR Support era and string numbers for json-aero-converter adding the string numbers and new Date type support. Released version is 1.0.2

Recommended reading order

  1. JsonSchemaType.java new datatypes
  2. JsonToAvroSchemaConverter.java build avro schema with new protocol.
    Test related changes:
  3. AdvancedTestDataComparator.java Updated regex to match new date format. Fixed isDateTimeWithTzValue since it was not working and was not able to match date time with timezone. New compareTime method. Added workoround TEST_DATASET_IGNORE_LIST those values are used as string value in the dataset i.e. in exchange_rate_messages.txt and cannot pass some tests (probably need to be re-visit)
  4. S3AvroParquetTestDataComparator.java new S3 specific test data comparator (the old one renamed to S3BaseAvroParquetTestDataComparator)
  5. S3 destination integration test now return ProtocolVersion.V1 v1 test dataset

🚨 User Impact 🚨

The only noticable thing is when Avro schema has [int, string] types it will try to produce number from the string so "123" become 123 in result Avro object.

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with Airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the connector is published, connector added to connector index as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here
Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with Airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub and connector version bumped by running the /publish command described here
Connector Generator
  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed

Tests

Unit

Put your unit tests output here.

Integration

Put your integration tests output here.

Acceptance

Put your acceptance tests output here.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 5, 2022

Affected Connector Report

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to do the following as needed:

  • Run integration tests
  • Bump connector or module version
  • Add changelog
  • Publish the new version

✅ Sources (0)

Connector Version Changelog Publish
  • See "Actionable Items" below for how to resolve warnings and errors.

❌ Destinations (47)

Connector Version Changelog Publish
destination-aws-datalake 0.1.1
destination-azure-blob-storage 0.1.6
destination-bigquery 1.2.8
destination-bigquery-denormalized 1.2.8
destination-cassandra 0.1.4
destination-clickhouse 0.2.1
destination-clickhouse-strict-encrypt 0.2.1 🔵
(ignored)
🔵
(ignored)
destination-csv 0.2.10
destination-databricks 0.3.1
destination-dev-null 0.2.7 🔵
(ignored)
🔵
(ignored)
destination-doris 0.1.0
destination-dynamodb 0.1.7
destination-e2e-test 0.2.4
destination-elasticsearch 0.1.6
destination-elasticsearch-strict-encrypt 0.1.6 🔵
(ignored)
🔵
(ignored)
destination-gcs 0.2.12
destination-iceberg 0.1.0
destination-jdbc 0.3.14 🔵
(ignored)
🔵
(ignored)
destination-kafka 0.1.10
destination-keen 0.2.4
destination-kinesis 0.1.5
destination-local-json 0.2.11
destination-mariadb-columnstore 0.1.7
destination-mongodb 0.1.9
destination-mongodb-strict-encrypt 0.1.9 🔵
(ignored)
🔵
(ignored)
destination-mqtt 0.1.3
destination-mssql 0.1.22
destination-mssql-strict-encrypt 0.1.22 🔵
(ignored)
🔵
(ignored)
destination-mysql 0.1.20
destination-mysql-strict-encrypt 0.1.21
(mismatch: 0.1.20)
🔵
(ignored)
🔵
(ignored)
destination-oracle 0.1.19
destination-oracle-strict-encrypt 0.1.19 🔵
(ignored)
🔵
(ignored)
destination-postgres 0.3.26
destination-postgres-strict-encrypt 0.3.26 🔵
(ignored)
🔵
(ignored)
destination-pubsub 0.2.0
destination-pulsar 0.1.3
destination-r2 0.1.0
destination-redis 0.1.4
destination-redpanda 0.1.0
destination-redshift 0.3.51
destination-rockset 0.1.4
destination-s3 0.3.18
destination-s3-glue 0.1.1
destination-scylla 0.1.3
destination-snowflake 0.4.40
destination-tidb 0.1.0
destination-yugabytedb 0.1.0
  • See "Actionable Items" below for how to resolve warnings and errors.

👀 Other Modules (1)

  • base-normalization

Actionable Items

(click to expand)

Category Status Actionable Item
Version
mismatch
The version of the connector is different from its normal variant. Please bump the version of the connector.

doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.
Changelog
doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.

changelog missing
There is no chnagelog for the current version of the connector. If you are the author of the current version, please add a changelog.
Publish
not in seed
The connector is not in the seed file (e.g. source_definitions.yaml), so its publication status cannot be checked. This can be normal (e.g. some connectors are cloud-specific, and only listed in the cloud seed file). Please double-check to make sure that it is not a bug.

diff seed version
The connector exists in the seed file, but the latest version is not listed there. This usually means that the latest version is not published. Please use the /publish command to publish the latest version.

@suhomud
Copy link
Contributor Author

suhomud commented Dec 6, 2022

/test connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/3627141020
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/3627141020
No Python unittests run

Build Passed

Test summary info:

All Passed

@suhomud
Copy link
Contributor Author

suhomud commented Dec 6, 2022

/test connector=connectors/destination-aws-datalake

🕑 connectors/destination-aws-datalake https://github.com/airbytehq/airbyte/actions/runs/3627244532
❌ connectors/destination-aws-datalake https://github.com/airbytehq/airbyte/actions/runs/3627244532
🐛 https://gradle.com/s/ys5y5wp6xhv4c

Build Failed

Test summary info:

Could not find result summary

@suhomud
Copy link
Contributor Author

suhomud commented Dec 6, 2022

/test connector=connectors/destination-azure-blob-storage

🕑 connectors/destination-azure-blob-storage https://github.com/airbytehq/airbyte/actions/runs/3627244867
❌ connectors/destination-azure-blob-storage https://github.com/airbytehq/airbyte/actions/runs/3627244867
🐛 https://gradle.com/s/qpsxz42ririwe

Build Failed

Test summary info:

Could not find result summary

@suhomud
Copy link
Contributor Author

suhomud commented Dec 6, 2022

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/3627247085
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/3627247085
Python tests coverage:

Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
normalization/transform_config/__init__.py                            2      0   100%
normalization/transform_catalog/reserved_keywords.py                 14      0   100%
normalization/transform_catalog/__init__.py                           2      0   100%
normalization/destination_type.py                                    14      0   100%
normalization/__init__.py                                             4      0   100%
normalization/transform_catalog/destination_name_transformer.py     166      8    95%
normalization/transform_catalog/table_name_registry.py              174     34    80%
normalization/transform_config/transform.py                         189     48    75%
normalization/transform_catalog/utils.py                             51     14    73%
normalization/transform_catalog/dbt_macro.py                         22      7    68%
normalization/transform_catalog/catalog_processor.py                147     80    46%
normalization/transform_catalog/transform.py                         61     38    38%
normalization/transform_catalog/stream_processor.py                 595    400    33%
-------------------------------------------------------------------------------------
TOTAL                                                              1441    629    56%

Build Passed

Test summary info:

All Passed

@suhomud
Copy link
Contributor Author

suhomud commented Dec 6, 2022

/test connector=connectors/destination-azure-blob-storage

🕑 connectors/destination-azure-blob-storage https://github.com/airbytehq/airbyte/actions/runs/3630532492
✅ connectors/destination-azure-blob-storage https://github.com/airbytehq/airbyte/actions/runs/3630532492
No Python unittests run

Build Passed

Test summary info:

All Passed

@suhomud
Copy link
Contributor Author

suhomud commented Dec 6, 2022

/test connector=connectors/destination-bigquery-denormalized

🕑 connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/3630538852
✅ connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/3630538852
Python tests coverage:

Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
normalization/transform_config/__init__.py                            2      0   100%
normalization/transform_catalog/reserved_keywords.py                 14      0   100%
normalization/transform_catalog/__init__.py                           2      0   100%
normalization/destination_type.py                                    14      0   100%
normalization/__init__.py                                             4      0   100%
normalization/transform_catalog/destination_name_transformer.py     166      8    95%
normalization/transform_catalog/table_name_registry.py              174     34    80%
normalization/transform_config/transform.py                         189     48    75%
normalization/transform_catalog/utils.py                             51     14    73%
normalization/transform_catalog/dbt_macro.py                         22      7    68%
normalization/transform_catalog/catalog_processor.py                147     80    46%
normalization/transform_catalog/transform.py                         61     38    38%
normalization/transform_catalog/stream_processor.py                 595    400    33%
-------------------------------------------------------------------------------------
TOTAL                                                              1441    629    56%

Build Passed

Test summary info:

All Passed

@suhomud
Copy link
Contributor Author

suhomud commented Dec 6, 2022

/test connector=connectors/destination-databricks

🕑 connectors/destination-databricks https://github.com/airbytehq/airbyte/actions/runs/3630575136
✅ connectors/destination-databricks https://github.com/airbytehq/airbyte/actions/runs/3630575136
No Python unittests run

Build Passed

Test summary info:

All Passed

@suhomud
Copy link
Contributor Author

suhomud commented Dec 6, 2022

/test connector=connectors/destination-gcs

🕑 connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/3630609252
❌ connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/3630609252
🐛 https://gradle.com/s/5gwwtad3zztzw

Build Failed

Test summary info:

Could not find result summary

@suhomud
Copy link
Contributor Author

suhomud commented Dec 6, 2022

/test connector=connectors/destination-r2

🕑 connectors/destination-r2 https://github.com/airbytehq/airbyte/actions/runs/3630705152
❌ connectors/destination-r2 https://github.com/airbytehq/airbyte/actions/runs/3630705152
🐛 https://gradle.com/s/flnd2r2aixmrg

Build Failed

Test summary info:

Could not find result summary

@suhomud
Copy link
Contributor Author

suhomud commented Dec 7, 2022

/test connector=connectors/destination-gcs

🕑 connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/3640853305
❌ connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/3640853305
🐛

@suhomud
Copy link
Contributor Author

suhomud commented Dec 8, 2022

/test connector=connectors/destination-gcs

🕑 connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/3645715065
❌ connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/3645715065
🐛

@suhomud
Copy link
Contributor Author

suhomud commented Dec 8, 2022

/test connector=connectors/destination-gcs

🕑 connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/3647818498
❌ connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/3647818498
🐛 https://gradle.com/s/lrknh25sq762i

Build Failed

Test summary info:

Could not find result summary

@suhomud
Copy link
Contributor Author

suhomud commented Dec 12, 2022

/test connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/3677576537
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/3677576537
No Python unittests run

Build Passed

Test summary info:

All Passed

@suhomud
Copy link
Contributor Author

suhomud commented Dec 12, 2022

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/3677579432
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/3677579432
Python tests coverage:

Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
normalization/transform_config/__init__.py                            2      0   100%
normalization/transform_catalog/reserved_keywords.py                 14      0   100%
normalization/transform_catalog/__init__.py                           2      0   100%
normalization/destination_type.py                                    14      0   100%
normalization/__init__.py                                             4      0   100%
normalization/transform_catalog/destination_name_transformer.py     166      8    95%
normalization/transform_catalog/table_name_registry.py              174     34    80%
normalization/transform_config/transform.py                         189     48    75%
normalization/transform_catalog/utils.py                             51     14    73%
normalization/transform_catalog/dbt_macro.py                         22      7    68%
normalization/transform_catalog/catalog_processor.py                147     80    46%
normalization/transform_catalog/transform.py                         61     38    38%
normalization/transform_catalog/stream_processor.py                 595    400    33%
-------------------------------------------------------------------------------------
TOTAL                                                              1441    629    56%

Build Passed

Test summary info:

All Passed

@suhomud
Copy link
Contributor Author

suhomud commented Dec 12, 2022

how hard would it be to add a parameter to only use v0/v1 mode? Right now the logic for both is mixed together, so when we (eventually) migrate everything to v1, it'll be hard to remove the v0 logic.

Then destination connectors would explicitly need to be upgraded to use v1. Which also limits the blast radius in case there's a hidden bug here, since they're not going to mysteriously have new behavior after getting published.

Not sure that I fully understand (probably let's sync about it). Generally speaking the only adaptation that I see for removal of V0 logic is to remove old schema json to avro types https://github.com/airbytehq/airbyte/pull/20088/files#diff-ccb0476def819bb3177cf9458fb6939bc25c42ea516305171e1e11be2a3d1a39R32 (It's marked as deprecated) Plus some places here https://github.com/airbytehq/airbyte/pull/20088/files#diff-d25b772db4efd1799ea8a085d4acc27f7f5393fda2dca2498c2d27391e6148deR235
The json-aero-converter is not required adaptation

Copy link
Contributor

@edgao edgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚛

@suhomud
Copy link
Contributor Author

suhomud commented Dec 14, 2022

/test connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/3698229535
❌ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/3698229535
🐛 https://gradle.com/s/6jdh75x3x2asa

Build Failed

Test summary info:

Could not find result summary

@suhomud
Copy link
Contributor Author

suhomud commented Dec 15, 2022

@edgao as was discussed I run destination s3 integration tests with dataset for V0. Looks good to me:
image

@octavia-squidington-iv octavia-squidington-iv added the area/documentation Improvements or additions to documentation label Dec 15, 2022
@suhomud
Copy link
Contributor Author

suhomud commented Dec 15, 2022

/publish connector=connectors/destination-s3

🕑 Publishing the following connectors:
connectors/destination-s3
https://github.com/airbytehq/airbyte/actions/runs/3704497254


Connector Did it publish? Were definitions generated?
connectors/destination-s3

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@suhomud
Copy link
Contributor Author

suhomud commented Dec 15, 2022

/publish connector=connectors/destination-s3

🕑 Publishing the following connectors:
connectors/destination-s3
https://github.com/airbytehq/airbyte/actions/runs/3705336937


Connector Did it publish? Were definitions generated?
connectors/destination-s3

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets December 15, 2022 15:51 — with GitHub Actions Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets December 15, 2022 15:51 — with GitHub Actions Inactive
@suhomud suhomud merged commit 1b8376e into master Dec 15, 2022
@suhomud suhomud deleted the suhomud/destination-s3_support_v1_protocol branch December 15, 2022 16:27
@edgao
Copy link
Contributor

edgao commented Dec 23, 2022

@suhomud - I did a test with some more complex data types and I think there's some (maybe?) slightly wrong behavior with union types. Specifically, if you have a schema with oneOf: [{$ref: String}, {$ref: Integer}], it will always become an avro string. But If you have oneOf: [Integer, String] then it will parse integers when possible, and fall back to string the rest of the time. Similar behavior for e.g. an array's items schema.

Though... we get the same thing with timestamps, which is maybe(?) preexisting behavior. So maybe we're just OK with this?

Test procedure:

  1. Create a directory ~/Desktop/protocol_test
  2. Write three files into that directory:
    1. catalog.json
      {
        "streams": [
          {
            "stream": {
              "name": "protocol_test",
              "namespace": "public",
              "json_schema": {
                "type": "object",
                "properties": {
                  "id": {"$ref": "WellKnownTypes.json#/definitions/Integer"},
                  "str_int_data": {
                    "oneOf": [
                      {"$ref": "WellKnownTypes.json#/definitions/String"},
                      {"$ref": "WellKnownTypes.json#/definitions/Integer"}
                    ]
                  },
                  "str_timestamp_data": {
                    "oneOf": [
                      {"$ref": "WellKnownTypes.json#/definitions/String"},
                      {"$ref": "WellKnownTypes.json#/definitions/TimestampWithTimezone"}
                    ]
                  }
                }
              },
              "default_cursor_field": [],
              "supported_sync_modes": [
                "full_refresh",
                "incremental"
              ],
              "source_defined_primary_key": [["id"]]
            },
            "sync_mode": "full_refresh",
            "primary_key": [["id"]],
            "cursor_field": [],
            "destination_sync_mode": "overwrite"
          }
        ]
      }
    2. config.json (replace access_key_id and secret_access_key with real values from https://console.cloud.google.com/security/secret-manager/secret/SECRET_DESTINATION-S3_CREDS/versions?project=dataline-integration-testing)
      {
        "format": {
          "format_type": "Avro",
          "compression_codec": {
            "codec": "no compression"
          }
        },
        "s3_endpoint": "",
        "access_key_id": "<redacted>",
        "s3_bucket_name": "airbyte-integration-test-destination-s3",
        "s3_bucket_path": "edgao_test",
        "s3_bucket_region": "us-east-2",
        "secret_access_key": "<redacted>"
      }
    3. inp.jsonl
      {"type": "RECORD", "record": {"stream": "protocol_test", "namespace": "public", "emitted_at": 1602637589000, "data": {"id": "1","str_int_data": "42", "str_timestamp_data": "2022-01-23T12:34:56.789Z"}}}
  3. cat ~/Desktop/protocol_test/inp.jsonl | docker run -i -v ~/Desktop/protocol_test:/in airbyte/destination-s3:dev write --catalog /in/catalog.json --config /in/config.json
  4. Download https://dlcdn.apache.org/avro/avro-1.11.1/java/avro-tools-1.11.1.jar to somewhere
  5. Download the avro file (e.g. aws s3 cp s3://airbyte-integration-test-destination-s3/edgao_test/public/protocol_test/2022_12_23_1671832609043_0.avro ~/Desktop/replicated.avro)
  6. java -jar path/to/avro-tools.jar tojson ~/Desktop/replicated.avro (you can also use java -jar path/to/avro-tools.jar getschema ~/Desktop/replicated.avro to see the avro schema)

Observe that str_int_data and str_timestamp_data both got converted to avro strings. But if you edit catalog.json to swap the order of the oneOf, then they get converted to int/long.

@suhomud
Copy link
Contributor Author

suhomud commented Dec 27, 2022

@edgao great analysis! I think this behaviour was here before this PR and it is related to json-avro-converter lib. So if the first type in union is a string then it checks if it is an instance of string. In our case it is true, therefor nothing prevent converter to put it as a string because it is a valid string

We can think about making readUnion more intelligent one the json-avro-converter side. What I can see is to make some parsing order for example if union has TimestampWithTimezone and String types then first it will "try" to parse TimestampWithTimezone and if failed String. More complex example is numbers since user can has a data with numbers stored as string so here we can still have a mismatch

@edgao
Copy link
Contributor

edgao commented Jan 2, 2023

nice, makes sense. Let's just keep this behavior for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update destination-s3 to handle the new data types protocol
5 participants