[docs] Vendor-Specific limitations added to troubleshooting sections (#38351)

Co-authored-by: Stephane Geneix <147216312+stephane-airbyte@users.noreply.github.com>
evantahler and stephane-airbyte authored May 21, 2024
1 parent 5dac10a commit a914027
Showing 12 changed files with 285 additions and 85 deletions.
25 changes: 12 additions & 13 deletions docs/integrations/destinations/postgres.md
@@ -2,21 +2,11 @@

This page guides you through the process of setting up the Postgres destination connector.

:::caution
## Warning

Postgres, while an excellent relational database, is not a data warehouse. Please only consider using postgres as a destination for small data volumes (e.g. less than 10GB) or for testing purposes. For larger data volumes, we recommend using a data warehouse like BigQuery, Snowflake, or Redshift.
:::warning

1. Postgres is likely to perform poorly with large data volumes. Even postgres-compatible
destinations (e.g. AWS Aurora) are not immune to slowdowns when dealing with large writes or
updates over ~100GB. Especially when using [typing and deduplication](/using-airbyte/core-concepts/typing-deduping) with `destination-postgres`, be sure to
monitor your database's memory and CPU usage during your syncs. It is possible for your
destination to 'lock up', and incur high usage costs with large sync volumes.
2. When attempting to scale a postgres database to handle larger data volumes, scaling IOPS (disk throughput) is as important as increasing memory and compute capacity.
3. Postgres column [name length limitations](https://www.postgresql.org/docs/current/limits.html)
are likely to cause collisions when used as a destination receiving data from highly-nested and
flattened sources, e.g. `{63 byte name}_a` and `{63 byte name}_b` will both be truncated to
`{63 byte name}` which causes postgres to throw an error that a duplicate column name was
specified. This limit is applicable to table names too.
Postgres, while an excellent relational database, is not a data warehouse. Please only consider using Postgres as a destination for small data volumes (e.g., less than 10GB) or for testing purposes. For larger data volumes, we recommend using a data warehouse like BigQuery, Snowflake, or Redshift. Learn more [here](/integrations/destinations/postgres/postgres-troubleshooting#postgres-is-not-a-data-warehouse).

:::

@@ -261,6 +251,15 @@ Now that you have set up the Postgres destination connector, check out the follo
- [Migrate from mysql to postgres](https://airbyte.com/tutorials/migrate-from-mysql-to-postgresql)
- [Postgres replication](https://airbyte.com/tutorials/postgres-replication)

## Vendor-Specific Connector Limitations

:::warning

Not all implementations or deployments of a database will be the same. This section lists specific limitations and known issues with the connector based on _how_ or
_where_ it is deployed.

:::

## Changelog

| Version | Date | Pull Request | Subject |
35 changes: 35 additions & 0 deletions docs/integrations/destinations/postgres/postgres-troubleshooting.md
@@ -0,0 +1,35 @@
# Troubleshooting Postgres Destinations

## Connector Limitations

### Postgres is not a Data Warehouse

:::danger

Postgres, while an excellent relational database, is not a data warehouse. Please only consider using Postgres as a destination for small data volumes (e.g., less than 10GB) or for testing purposes. For larger data volumes, we recommend using a data warehouse like BigQuery, Snowflake, or Redshift.

:::

1. Postgres is likely to perform poorly with large data volumes. Even postgres-compatible
destinations (e.g. AWS Aurora) are not immune to slowdowns when dealing with large writes or
updates over ~100GB. Especially when using [typing and deduplication](/using-airbyte/core-concepts/typing-deduping) with `destination-postgres`, be sure to
monitor your database's memory and CPU usage during your syncs. It is possible for your
destination to 'lock up', and incur high usage costs with large sync volumes.
2. When attempting to scale a postgres database to handle larger data volumes, scaling IOPS (disk throughput) is as important as increasing memory and compute capacity.
3. Postgres column [name length limitations](https://www.postgresql.org/docs/current/limits.html)
are likely to cause collisions when used as a destination receiving data from highly-nested and
flattened sources, e.g. `{63 byte name}_a` and `{63 byte name}_b` will both be truncated to
`{63 byte name}` which causes postgres to throw an error that a duplicate column name was
specified. This limit is applicable to table names too.
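
As an illustration, here is a minimal sketch, assuming a reachable PostgreSQL instance and the `psycopg2` package (the DSN and table name below are hypothetical), that reproduces the duplicate-column error caused by 63-byte identifier truncation:

```python
# Minimal sketch: two distinct column names that share their first 63 bytes
# are truncated to the same identifier, so PostgreSQL rejects the table.
import psycopg2

long_prefix = "x" * 63  # already at PostgreSQL's 63-byte identifier limit

conn = psycopg2.connect("dbname=airbyte_test user=postgres")  # hypothetical DSN
conn.autocommit = True
with conn.cursor() as cur:
    try:
        cur.execute(
            f'CREATE TABLE collision_demo ("{long_prefix}_a" text, "{long_prefix}_b" text)'
        )
    except psycopg2.ProgrammingError as exc:
        # SQLSTATE 42701 (duplicate_column): both names were truncated to the
        # same 63-byte identifier.
        print(f"collision after truncation: {exc}")
conn.close()
```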

### Vendor-Specific Connector Limitations

:::warning

Not all implementations or deployments of a database will be the same. This section lists specific limitations and known issues with the connector based on _how_ or _where_ it is deployed.

:::

#### Disk Access

The Airbyte Postgres destination relies on sending files to the database's temporary storage and then loading them in bulk. If your Postgres database does not have access to the `/tmp` file system, data loading will not succeed.
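
As a quick check, a sketch like the following, assuming `psycopg2` and a role allowed to run server-side `COPY` (superuser or `pg_write_server_files`; the DSN and file name are hypothetical), probes whether the database server itself can write to `/tmp`:

```python
# Minimal probe: server-side COPY writes on the database host's filesystem,
# so a failure here suggests the server cannot stage files under /tmp.
import psycopg2

conn = psycopg2.connect("dbname=airbyte_test user=postgres")  # hypothetical DSN
conn.autocommit = True
with conn.cursor() as cur:
    try:
        cur.execute("COPY (SELECT 1) TO '/tmp/airbyte_tmp_probe.csv' WITH (FORMAT csv)")
        print("server-side /tmp is writable")
    except psycopg2.Error as exc:
        print(f"server cannot write to /tmp: {exc}")
conn.close()
```
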
23 changes: 18 additions & 5 deletions docs/integrations/destinations/s3.md
@@ -19,7 +19,7 @@ Otherwise, if you are using AWS credentials you must provide the following:
- **Access Key ID**
- **Secret Access Key**

If you are using an Instance Profile, you may omit the Access Key ID and Secret Access Key,
If you are using an Instance Profile, you may omit the Access Key ID and Secret Access Key,
as well as the Role ARN.

Additionally the following prerequisites are required:
@@ -46,9 +46,11 @@ support for AWS's
[shared responsibility model](https://aws.amazon.com/compliance/shared-responsibility-model/)

#### Create a Bucket Policy

1. Open the [IAM console](https://console.aws.amazon.com/iam/home#home).
2. In the IAM dashboard, select **Policies**, then click **Create Policy**.
3. Select the **JSON** tab, then paste the following JSON into the Policy editor (be sure to substitute in your bucket name):

```json
{
"Version": "2012-10-17",
@@ -83,14 +85,17 @@ At this time, object-level permissions alone are not sufficient to successfully
#### Authentication Option 1: Using an IAM Role (Most secure)

<!-- env:cloud -->

:::note
This authentication method is currently in the testing phase. To enable it for your workspace, please contact our Support Team.
:::

<!-- /env:cloud -->

1. In the IAM dashboard, click **Roles**, then **Create role**. <!-- env:oss -->
2. Choose the appropriate trust entity and attach the policy you created.
3. Set up a trust relationship for the role. For example, for the **AWS account** trusted entity, use the default AWS account on your instance (it will be used to assume the role). To use an **External ID**, set it as an environment variable: `export AWS_ASSUME_ROLE_EXTERNAL_ID="{your-external-id}"`. Edit the trust relationship policy to reflect this:

```
{
"Version": "2012-10-17",
@@ -109,11 +114,14 @@ This authentication method is currently in the testing phase. To enable it for y
}
]
}
```
```

<!-- /env:oss -->
<!-- env:cloud -->

2. Choose the **AWS account** trusted entity type.
3. Set up a trust relationship for the role. This allows the Airbyte instance's AWS account to assume this role. You will also need to specify an external ID, which is a secret key that the trusting service (Airbyte) and the trusted role (the role you're creating) both know. This ID is used to prevent the "confused deputy" problem. The External ID should be your Airbyte workspace ID, which can be found in the URL of your workspace page. (A sketch for verifying the assumed role appears after these steps.) Edit the trust relationship policy to include the external ID:

```
{
"Version": "2012-10-17",
@@ -133,11 +141,12 @@ This authentication method is currently in the testing phase. To enable it for y
]
}
```

<!-- /env:cloud -->

4. Complete the role creation and note the Role ARN.
5. Select **Attach policies directly**, then find and check the box for your new policy. Click **Next**, then **Add permissions**.
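
As a rough way to verify the trust relationship above, a sketch like the following (assuming `boto3` is installed; the role ARN and external ID are placeholders) attempts to assume the role with the external ID and prints when the temporary credentials expire:

```python
# Minimal sketch: assume the role using the external ID configured in the
# trust policy; a successful call returns temporary credentials.
import boto3

sts = boto3.client("sts")
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/airbyte-s3-destination",  # hypothetical ARN
    RoleSessionName="airbyte-trust-check",
    ExternalId="your-airbyte-workspace-id",  # must match sts:ExternalId in the trust policy
)
print("temporary credentials expire at:", response["Credentials"]["Expiration"])
```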


#### Authentication Option 2: Using an IAM User

Use an existing or create new
@@ -212,7 +221,7 @@ Use an existing or create new
on how to create an instanceprofile. _ We recommend creating an Airbyte-specific user. This user
will require
[read and write permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_s3_rw-bucket.html)
to objects in the staging bucket. _ If the Access Key and Secret Access Key are not provided, the
to objects in the staging bucket. \_ If the Access Key and Secret Access Key are not provided, the
authentication will rely either on the Role ARN using STS Assume Role or on the instanceprofile.
5. _ **Secret Access Key** _ Corresponding key to
the above key id. _ Make sure your S3 bucket is accessible from the machine running Airbyte. _
@@ -237,7 +246,7 @@ Use an existing or create new
placeholders, as they won't be recognized.
<!-- /env:oss -->

5. Click `Set up destination`.
6. Click `Set up destination`.

The full path of the output data with the default S3 Path Format
`${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_` is:
@@ -494,6 +503,10 @@ In order for everything to work correctly, it is also necessary that the user wh
}
```

## Limitations & Troubleshooting

To see connector limitations or to troubleshoot your S3 connector, see our [S3 troubleshooting guide](/integrations/destinations/s3/s3-troubleshooting).

## CHANGELOG

| Version | Date | Pull Request | Subject |
16 changes: 16 additions & 0 deletions docs/integrations/destinations/s3/s3-troubleshooting.md
@@ -0,0 +1,16 @@
# Troubleshooting S3 Destinations

## Connector Limitations

### Vendor-Specific Connector Limitations

:::warning

Not all implementations or deployments of S3-compatible destinations will be the same. This section lists specific limitations and known issues with the connector based on _how_ or _where_ it is deployed.

:::

#### Linode Object Storage

Linode Object Storage does not properly return ETags after they are set, and Airbyte relies on ETags to verify data integrity. This makes Linode Object Storage currently incompatible with Airbyte.
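
To illustrate the dependency, here is a minimal sketch, assuming `boto3` and a hypothetical endpoint, bucket, and key, of the kind of ETag check that breaks when a provider does not return proper ETags (for a single-part, unencrypted upload, S3 reports the MD5 of the object body as its ETag):

```python
# Minimal sketch: upload a small object and compare its MD5 against the ETag
# the service reports (valid for single-part, unencrypted uploads).
import hashlib
import boto3

s3 = boto3.client("s3", endpoint_url="https://us-east-1.linodeobjects.com")  # hypothetical endpoint
body = b"id,name\n1,airbyte\n"
resp = s3.put_object(Bucket="my-bucket", Key="etag-probe.csv", Body=body)  # hypothetical bucket/key

expected = hashlib.md5(body).hexdigest()
returned = resp["ETag"].strip('"')
print("ETag matches body MD5" if returned == expected else f"ETag mismatch: {returned} != {expected}")
```
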
42 changes: 3 additions & 39 deletions docs/integrations/sources/mongodb-v2.md
@@ -172,43 +172,7 @@ When Schema is not enforced there is no way to deselect fields as all fields ar

## Limitations & Troubleshooting

### MongoDB Oplog and Change Streams

[MongoDB's Change Streams](https://www.mongodb.com/docs/manual/changeStreams/) are based on the [Replica Set Oplog](https://www.mongodb.com/docs/manual/core/replica-set-oplog/). This has retention limitations. Syncs that run less frequently than the retention period of the Oplog may encounter issues with missing data.

We recommend adjusting the Oplog size for your MongoDB cluster to ensure it holds at least 24 hours of changes. For optimal results, we suggest expanding it to maintain a week's worth of data. To adjust your Oplog size, see the corresponding tutorials for [MongoDB Atlas](https://www.mongodb.com/docs/atlas/cluster-additional-settings/#set-oplog-size) (fully-managed) and [MongoDB shell](https://www.mongodb.com/docs/manual/tutorial/change-oplog-size/) (self-hosted).

If you are running into an issue similar to "invalid resume token", it may mean you need to:

1. Increase the Oplog retention period.
2. Increase the Oplog size.
3. Increase the Airbyte sync frequency.

You can run the commands outlined [in this tutorial](https://www.mongodb.com/docs/manual/tutorial/troubleshoot-replica-sets/#check-the-size-of-the-oplog) to verify the current size of your Oplog. The expected output is:

```yaml
configured oplog size: 10.10546875MB
log length start to end: 94400 (26.22hrs)
oplog first event time: Mon Mar 19 2012 13:50:38 GMT-0400 (EDT)
oplog last event time: Wed Oct 03 2012 14:59:10 GMT-0400 (EDT)
now: Wed Oct 03 2012 15:00:21 GMT-0400 (EDT)
```
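
A sketch along these lines, assuming `pymongo` and a hypothetical replica-set connection string, reports a similar oplog window directly from the `local.oplog.rs` collection:

```python
# Minimal sketch: read the first and last oplog entries and report the window
# they span, so it can be compared against the Airbyte sync interval.
from pymongo import MongoClient

client = MongoClient("mongodb://user:pass@host:27017/?replicaSet=rs0")  # hypothetical URI
oplog = client["local"]["oplog.rs"]

first = oplog.find_one(sort=[("ts", 1)])
last = oplog.find_one(sort=[("ts", -1)])
window_hours = (last["ts"].time - first["ts"].time) / 3600
print(f"oplog window: {window_hours:.1f} hours")
```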

When importing a large MongoDB collection for the first time, the import duration might exceed the Oplog retention period. The Oplog is crucial for incremental updates, and an invalid resume token will require the MongoDB collection to be re-imported to ensure no source updates were missed.

### Supported MongoDB Clusters

- Only supports [replica set](https://www.mongodb.com/docs/manual/replication/) cluster type.
- TLS/SSL is required by this connector. TLS/SSL is enabled by default for MongoDB Atlas clusters. To enable TLS/SSL connections for a self-hosted MongoDB instance, please refer to the [MongoDB Documentation](https://docs.mongodb.com/manual/tutorial/configure-ssl/).
- Views, capped collections and clustered collections are not supported.
- Empty collections are excluded from schema discovery.
- Collections with different data types for the values in the `_id` field among the documents in a collection are not supported. All `_id` values within the collection must be the same data type.
- Atlas DB clusters are only supported in a dedicated M10 tier and above. Lower tiers may fail during connection setup.

### Schema Discovery & Enforcement

- Schema discovery uses [sampling](https://www.mongodb.com/docs/manual/reference/operator/aggregation/sample/) of the documents to collect all distinct top-level fields. This value is universally applied to all collections discovered in the target database. The approach is modelled after [MongoDB Compass sampling](https://www.mongodb.com/docs/compass/current/sampling/) and is used for efficiency. By default, 10,000 documents are sampled. This value can be increased up to 100,000 documents to increase the likelihood that all fields will be discovered. However, the trade-off is time, as a higher value will take the process longer to sample the collection. A sketch of this sampling approach appears after this list.
- When running with Schema Enforced set to `false`, there is no attempt to discover any schema. See more in [Schema Enforcement](#Schema-Enforcement).
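
The sketch below, assuming `pymongo` and hypothetical database and collection names, mirrors that sampling approach: draw a `$sample` of documents and collect their distinct top-level keys.

```python
# Minimal sketch: sample documents and gather the distinct top-level field
# names, similar in spirit to sampling-based schema discovery.
from pymongo import MongoClient

client = MongoClient("mongodb://user:pass@host:27017/?replicaSet=rs0")  # hypothetical URI
collection = client["my_database"]["my_collection"]  # hypothetical names

discovered_fields = set()
for document in collection.aggregate([{"$sample": {"size": 10_000}}]):
    discovered_fields.update(document.keys())
print(sorted(discovered_fields))
```
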
To see connector limitations or to troubleshoot your MongoDB connector, see our [MongoDB troubleshooting guide](/integrations/sources/mongodb-v2/mongodb-v2-troubleshooting).

## Configuration Parameters

@@ -231,8 +195,8 @@ For more information regarding configuration parameters, please see [MongoDb Doc
## Changelog

| Version | Date | Pull Request | Subject |
|:--------|:-----------| :------------------------------------------------------- |:----------------------------------------------------------------------------------------------------------|
| 1.3.13 | 2024-05-09 | [36851](https://github.com/airbytehq/airbyte/pull/36851) | Support reading collection with a binary _id type. |
| :------ | :--------- | :------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------- |
| 1.3.13 | 2024-05-09 | [36851](https://github.com/airbytehq/airbyte/pull/36851) | Support reading collection with a binary \_id type. |
| 1.3.12 | 2024-05-07 | [36851](https://github.com/airbytehq/airbyte/pull/36851) | Upgrade debezium to version 2.5.1. |
| 1.3.11 | 2024-05-02 | [37753](https://github.com/airbytehq/airbyte/pull/37753) | Chunk size(limit) should correspond to ~1GB of data. |
| 1.3.10 | 2024-05-02 | [37781](https://github.com/airbytehq/airbyte/pull/37781) | Adopt latest CDK. |
@@ -1,16 +1,16 @@
# MongoDb Migration Guide
# Mongo DB Migration Guide

## Upgrading to 1.0.0

This version introduces a general availability version of the MongoDB V2 source connector, which leverages
[Change Data Capture (CDC)](https://docs.airbyte.com/understanding-airbyte/cdc) to improve the performance and
[Change Data Capture (CDC)](/understanding-airbyte/cdc) to improve the performance and
reliability of syncs. This version provides better error handling, incremental delivery of data and improved
reliability of large syncs via frequent checkpointing.

**THIS VERSION INCLUDES BREAKING CHANGES FROM PREVIOUS VERSIONS OF THE CONNECTOR!**

The changes will require you to reconfigure your existing MongoDB V2 configured source connectors. To review the
breaking changes and to learn how to upgrade the connector, refer to the [MongoDB V2 source connector documentation](mongodb-v2#upgrade-from-previous-version).
breaking changes and to learn how to upgrade the connector, refer to the [MongoDB V2 source connector documentation](/integrations/sources/mongodb-v2#upgrade-from-previous-version).
Additionally, you can manually update existing connections prior to the next scheduled sync to perform the upgrade or
re-create the source using the new configuration.

@@ -22,4 +22,4 @@ Worthy of specific mention, this version includes:
- Sampling of fields for schema discovery
- Required SSL/TLS connections

Learn more about what's new in the connection, view the updated documentation [here](mongodb-v2).
To learn more about what's new in the connection, view the updated documentation [here](/integrations/sources/mongodb-v2/).
