Primary Keys

IF

A table without a Primary Keys is replicated.

THEN
  • Initial job: Table is created without Primary Key and no NOT NULL columns.
  • Subsequent jobs: If using Key-based Incremental or Log-based Incremental Replication, data will be added to the table in an Append-Only fashion.

    If using Full Table Replication, the table will be overwritten in its entirety during each job.

IF

A table with a single Primary Key is replicated.

THEN
  • Initial job: Table is created without Primary Key and no NOT NULL columns. Primary Key info is stored as a comment on the table.
  • Subsequent jobs: If using Key-based Incremental or Log-based Incremental Replication, data will be de-duped based on the Primary Key column and upserted into the table.

    If using Full Table Replication, the table will be overwritten in its entirety during each sync.

IF

A table with multiple Primary Keys is replicated.

THEN
  • Initial job: Table is created without Primary Key and no NOT NULL columns. Primary Key info is stored as a comment on the table.
  • Subsequent jobs: If using Key-based Incremental or Log-based Incremental Replication, data will be de-duped based on the Primary Key columns and upserted into the table.

    If using Full Table Replication, the table will be overwritten in its entirety during each job.

IF

The table’s Primary Key(s) is/are changed.

THEN

If using Key-based Incremental or Log-based Incremental Replication, data will be added to the table in an Append-Only fashion.

If using Full Table Replication, the table will be overwritten in its entirety during each job.

IF

The table’s Primary Key comment is either removed or changed to be incorrectly formatted.

THEN

Data will continue to be loaded into the table in an Append-Only fashion.


Replication Keys

IF

A table is replicated where the Replication Key column contains NULL values.

THEN
  • During the initial job, the table will be created and all rows will be replicated.
  • During subsequent jobs, only rows with populated Replication Keys will be replicated and persisted to Redshift.

Tables

IF

A table is replicated where the Replication Key column contains NULL values.

THEN
  • During the initial job, the table will be created and all rows will be replicated.
  • During subsequent jobs, only rows with populated Replication Keys will be replicated and persisted to Redshift.
IF

A table name contains more characters than allowed by Redshift.

THEN

Redshift will reject all data for the table.

Rejected records are logged in the _sdc_rejected table of the integration's schema. Learn more.

AND

Redshift will surface the following error:

Table name [TABLE] is too long for Redshift
FIX IT

If possible, change the table name in the source to be less than Redshift’s character limit of 127 characters characters.

Use the _sdc_rejected table to identify the root of the issue.

IF

A table arrives with more columns than Redshift allows.

THEN

Redshift will reject all data for the table.

Rejected records are logged in the _sdc_rejected table of the integration's schema. Learn more.

AND

Redshift will surface the following error:

ERROR: too many columns
FIX IT

If possible, deselect some columns to allow Stitch to load data into Redshift for the table. Redshift has a limit of 1,600 columns columns per table.

Use the _sdc_rejected table to identify the root of the issue.

IF

A column contains only NULL values.

THEN

Only columns that are populated in at least one record will be created in Redshift.

For example: On the left is a table as it exists in the source; on the right is the table in Redshift as replicated by Stitch:

SOURCE     FIRST JOB  
id updated_at [rep-key] confirmed id updated_at [rep-key]
1 2017-06-01   1 2017-06-01

Prior to the next job, a new record is added to the table with a populated value in the confirmed column. Because the confirmed column contains at least one value, Stitch will create this column in Redshift:

SOURCE     SECOND JOB    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1 2017-06-01   1 2017-06-01  
2 2017-06-02 true 2 2017-06-02 true

Note In this example, confirmed is NULL for record 1 in Redshift. If record 1’s confirmed value is ever updated, its Replication Key value must also change to ensure Stitch detects the changes.

IF

A table is tracked that contains entirely NULL columns.

THEN
  • During the initial job, the table will be created with only Stitch’s _sdc columns.
  • During subsequent job, only columns that are populated in at least one record will be created in Redshift.
IF

Columns in a table using Incremental Replication are backfilled.

THEN

Note: While this example is typical of Key-based Incremental Replication, it is also true for tables using Log-based Incremental Replication. In the latter scenario, think of the updated_at column as a log position, which is what Stitch uses as a Replication Key when reading from database logs.

If the Replication Key used by the table is updated to a value greater than or equal to the last recorded maximum value, the backfilled records will be selected for replication.

Consider the example below. On the left is a table as it exists in the source; on the right is the table in Redshift as replicated by Stitch:

SOURCE     FIRST JOB    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1     1    
2     2    
3 2017-06-01 true 3 2017-06-01 true

A new record is added in the source with a populated Replication Key value. Stitch replicates record 4 and saves 2017-06-02 as the Replication Key value:

SOURCE     SECOND JOB    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1     1    
2     2    
3 2017-06-01 true 3 2017-06-01 true
4 2017-06-02 false 4 2017-06-02 false

Before the next job, the confirmed column is backfilled for records 1 and 2. However, because the Replication Key for record 1 is still NULL, only record 2 will be replicated:

SOURCE     THIRD JOB    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1   true 1    
2 2017-06-03 false 2 2017-06-03 false
3 2017-06-01 true 3 2017-06-01 true
4 2017-06-02 false 4 2017-06-02 false

For record 1’s confirmed value to be reflected in Redshift, its Replication Key in the source must be populated with a value that is greater than or equal to the last maximum saved value. In this example, that’s 2017-06-03:

SOURCE     THIRD JOB    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1 2017-06-03 true 1 2017-06-03 true
2 2017-06-03 false 2 2017-06-03 false
3 2017-06-01 true 3 2017-06-01 true
4 2017-06-02 false 4 2017-06-02 false

If there are a large number of backfilled records in a source table, resetting the table’s Replication Keys or temporarily using Full Table Replication may be required to replicate the backfilled values.

IF

A new column is added in a SaaS integration/data source to a table that’s currently set to replicate.

THEN
  1. Stitch will detect the new column during the integration’s next structure sync.
  2. The column will be automatically set to replicate.
  3. If the table uses Key-based Incremental Replication, only records with a Replication Key value greater than or equal to the last saved bookmark value will be selected for replication.

    This means that if values for the new column were backfilled in existing records, the only way to have Stitch replicate these values is to either:

    • Update the Replication Key values for all records in the source, OR
    • Reset the integration’s Replication Keys in Stitch
  4. In the destination, the column will be appended to the end of the table.
IF

A new column is added by you to a Stitch-generated table in Redshift.

THEN

Columns may be added to tables created by Stitch as long as they are nullable, meaning columns don’t have NOT NULL constraints.

IF

A column is deleted at the source.

THEN

How a deleted column is reflected in Redshift depends on the table’s Replication Method.

Consider the example below. On the left is a table as it exists in the source; on the right is the table in Redshift:

SOURCE     DESTINATION    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1 2017-06-03 true 1 2017-06-03 true
2 2017-06-03 false 2 2017-06-03 false

Before the next job, the confirmed column will be deleted in the source. Additionally, new records will be added to the table.

Log-based Incremental Replication

Due to how binary logs are structured, changes to a source table - including adding or removing columns, changing data types, etc. - require manual intervention before replication can continue. Refer to the Log-based Incremental Replication documentation for more info.

Key-based Incremental Replication

For tables using Key-based Incremental Replication, default NULLs will be placed into the column going forward and the column will remain in the destination.

Below is the source table with the deleted confirmed column and how this would be reflected in Redshift:

SOURCE   DESTINATION    
id updated_at [rep-key] id updated_at [rep-key] confirmed
1 2017-06-03 1 2017-06-03 true
2 2017-06-03 2 2017-06-03 false
3 2017-06-04 3 2017-06-04  
4 2017-06-04 4 2017-06-04  

Full Table Replication

For tables using Full Table Replication: during the next job, the entire table will be overwritten. The table’s schema will reflect the column’s removal in the source:

SOURCE   DESTINATION  
id updated_at [rep-key] id updated_at [rep-key]
1 2017-06-03 1 2017-06-03
2 2017-06-03 2 2017-06-03
3 2017-06-04 3 2017-06-04
4 2017-06-04 4 2017-06-04
IF

You remove a column from a Stitch-replicated table in your destination.

THEN

The result of deleting a column from a Stitch-generated table depends on the type of column being removed:

  • Primary Key columns: Removing Primary Key columns will result in data being added to the table in an Append-Only fashion. As Stitch uses Primary Keys to de-dupe data, deleting Primary Key columns may result in duplicate data.
  • General columns: This refers to all columns that are not prepended by _sdc or suffixed by a data type. For example: customer_zip, but not customer_zip__st

    If new data is detected for the removed column, Stitch will re-create it in Redshift.

    Note: An integration must support tracking individual columns AND you must deselect the column in Stitch for the column removal to be permanent.

  • _sdc columns: Removing a Stitch replication column will prevent Stitch from upserting replicated data into Redshift.
  • Columns with data type suffixes: Removing a column created as result of accommodating multiple data types will prevent Stitch from loading replicated data into the table. This applies to columns with names such as: customer_zip__st, customer_zip__int, etc.

Columns

IF

A table is replicated where the Replication Key column contains NULL values.

THEN
  • During the initial job, the table will be created and all rows will be replicated.
  • During subsequent jobs, only rows with populated Replication Keys will be replicated and persisted to Redshift.
IF

Stitch detects multiple data types for a single column.

THEN

For example: Stitch first detected that order_confirmed contained BOOLEAN data, but during a subsequent job, detected STRING values. To accommodate data of varying types, Stitch will:

  1. Rename the original column by appending the first-detected data type to the column name. Ex: order_confirmed__bl
  2. Store data for the first-detected data type in the renamed column. In this example, only BOOLEAN values will be stored in order_confirmed__bl
  3. Create additional columns to store the other data types - one for each data type detected - and append the data type to the column name. In this example, a order_confirmed__st column will be created to store STRING values.
IF

A VARCHAR column is replicated to Redshift.

THEN

Redshift will type the column as VARCHAR(n), where n is the width of the widest record in the column.

For example: During the initial replication job, data in a VARCHAR column is found to have a width greater than 128. Redshift will create type this column as VARCHAR(256) to accommodate the data.

IF

VARCHAR data is loaded that exceeds the current maximum size for the column.

THEN

Redshift will widen the column to accommodate the width of the largest record up to the limit of 65K.

For example: during the initial replication job, a VARCHAR column is created and typed as VARCHAR(128).

During a subsequent job, Stitch identifies data for this column that exceeds the current size maximum of 128 but is less than 256.

To accommodate the data, the column will become VARCHAR(256).

IF

A VARCHAR column is replicated with data that exceeds Redshift’s size limit of 65K.

THEN

Redshift will truncate the data to the maximumm width of 65K.

IF

A column containing date data with timezone info is replicated to Redshift.

THEN

Redshift will store the data as TIMESTAMP WITHOUT TIME ZONE. In Redshift, TIMESTAMP WITHOUT TIME ZONE data is stored without timezone information and expressed as UTC.

Consider the following timestamp:

2017-08-14 11:24:02 GMT-0400 (EDT)

This data would be stored as the following in Redshift:

2017-08-14 11:24:02

For more info on this data type, refer to Amazon’s Datetime Type documentation.

IF

A column contains timestamp data that is outside Redshift’s supported range.

THEN

Redshift will reject the records that fall outside the supported range.

Rejected records are logged in the _sdc_rejected table of the integration's schema. Learn more.

AND

Redshift will surface the following error:

timestamp out of range for Redshift on [TIMESTAMP]
FIX IT

To resolve the error, offending values in the source must be changed to be within Redshift’s timestamp range of 4713 BC to 294276 AD.

Use the _sdc_rejected table to identify the root of the issue.

IF

A column contains integer data that is outside Redshift’s supported range.

THEN

Redshift will reject the records that fall outside the supported range.

Rejected records are logged in the _sdc_rejected table of the integration's schema. Learn more.

AND

Redshift will surface the following error:

integer out of range for Redshift on [INTEGER]
FIX IT

To resolve the error, offending values in the source must be changed to be within Redshift’s limit of -9223372036854775808 to 9223372036854775807.

Use the _sdc_rejected table to identify the root of the issue.

IF

A column contains decimal data.

THEN

Redshift will truncate the data.

IF

A column contains decimal data that is outside Redshift’s supported range.

THEN

Redshift will reject the records that fall outside the supported range.

Rejected records are logged in the _sdc_rejected table of the integration's schema. Learn more.

AND

Redshift will surface the following error:

decimal out of range for Redshift on [DECIMAL]
FIX IT

To resolve the error, offending values in the source must be changed to be within Redshift’s limit for decimal data.

Use the _sdc_rejected table to identify the root of the issue.

IF

A table arrives with more columns than Redshift allows.

THEN

Redshift will reject all data for the table.

Rejected records are logged in the _sdc_rejected table of the integration's schema. Learn more.

AND

Redshift will surface the following error:

ERROR: too many columns
FIX IT

If possible, deselect some columns to allow Stitch to load data into Redshift for the table. Redshift has a limit of 1,600 columns columns per table.

Use the _sdc_rejected table to identify the root of the issue.

IF

A column contains only NULL values.

THEN

Only columns that are populated in at least one record will be created in Redshift.

For example: On the left is a table as it exists in the source; on the right is the table in Redshift as replicated by Stitch:

SOURCE     FIRST JOB  
id updated_at [rep-key] confirmed id updated_at [rep-key]
1 2017-06-01   1 2017-06-01

Prior to the next job, a new record is added to the table with a populated value in the confirmed column. Because the confirmed column contains at least one value, Stitch will create this column in Redshift:

SOURCE     SECOND JOB    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1 2017-06-01   1 2017-06-01  
2 2017-06-02 true 2 2017-06-02 true

Note In this example, confirmed is NULL for record 1 in Redshift. If record 1’s confirmed value is ever updated, its Replication Key value must also change to ensure Stitch detects the changes.

IF

A table is tracked that contains entirely NULL columns.

THEN
  • During the initial job, the table will be created with only Stitch’s _sdc columns.
  • During subsequent job, only columns that are populated in at least one record will be created in Redshift.
IF

Columns in a table using Incremental Replication are backfilled.

THEN

Note: While this example is typical of Key-based Incremental Replication, it is also true for tables using Log-based Incremental Replication. In the latter scenario, think of the updated_at column as a log position, which is what Stitch uses as a Replication Key when reading from database logs.

If the Replication Key used by the table is updated to a value greater than or equal to the last recorded maximum value, the backfilled records will be selected for replication.

Consider the example below. On the left is a table as it exists in the source; on the right is the table in Redshift as replicated by Stitch:

SOURCE     FIRST JOB    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1     1    
2     2    
3 2017-06-01 true 3 2017-06-01 true

A new record is added in the source with a populated Replication Key value. Stitch replicates record 4 and saves 2017-06-02 as the Replication Key value:

SOURCE     SECOND JOB    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1     1    
2     2    
3 2017-06-01 true 3 2017-06-01 true
4 2017-06-02 false 4 2017-06-02 false

Before the next job, the confirmed column is backfilled for records 1 and 2. However, because the Replication Key for record 1 is still NULL, only record 2 will be replicated:

SOURCE     THIRD JOB    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1   true 1    
2 2017-06-03 false 2 2017-06-03 false
3 2017-06-01 true 3 2017-06-01 true
4 2017-06-02 false 4 2017-06-02 false

For record 1’s confirmed value to be reflected in Redshift, its Replication Key in the source must be populated with a value that is greater than or equal to the last maximum saved value. In this example, that’s 2017-06-03:

SOURCE     THIRD JOB    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1 2017-06-03 true 1 2017-06-03 true
2 2017-06-03 false 2 2017-06-03 false
3 2017-06-01 true 3 2017-06-01 true
4 2017-06-02 false 4 2017-06-02 false

If there are a large number of backfilled records in a source table, resetting the table’s Replication Keys or temporarily using Full Table Replication may be required to replicate the backfilled values.

IF

A new column is added in a SaaS integration/data source to a table that’s currently set to replicate.

THEN
  1. Stitch will detect the new column during the integration’s next structure sync.
  2. The column will be automatically set to replicate.
  3. If the table uses Key-based Incremental Replication, only records with a Replication Key value greater than or equal to the last saved bookmark value will be selected for replication.

    This means that if values for the new column were backfilled in existing records, the only way to have Stitch replicate these values is to either:

    • Update the Replication Key values for all records in the source, OR
    • Reset the integration’s Replication Keys in Stitch
  4. In the destination, the column will be appended to the end of the table.
IF

A new column is added by you to a Stitch-generated table in Redshift.

THEN

Columns may be added to tables created by Stitch as long as they are nullable, meaning columns don’t have NOT NULL constraints.

IF

A column is deleted at the source.

THEN

How a deleted column is reflected in Redshift depends on the table’s Replication Method.

Consider the example below. On the left is a table as it exists in the source; on the right is the table in Redshift:

SOURCE     DESTINATION    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1 2017-06-03 true 1 2017-06-03 true
2 2017-06-03 false 2 2017-06-03 false

Before the next job, the confirmed column will be deleted in the source. Additionally, new records will be added to the table.

Log-based Incremental Replication

Due to how binary logs are structured, changes to a source table - including adding or removing columns, changing data types, etc. - require manual intervention before replication can continue. Refer to the Log-based Incremental Replication documentation for more info.

Key-based Incremental Replication

For tables using Key-based Incremental Replication, default NULLs will be placed into the column going forward and the column will remain in the destination.

Below is the source table with the deleted confirmed column and how this would be reflected in Redshift:

SOURCE   DESTINATION    
id updated_at [rep-key] id updated_at [rep-key] confirmed
1 2017-06-03 1 2017-06-03 true
2 2017-06-03 2 2017-06-03 false
3 2017-06-04 3 2017-06-04  
4 2017-06-04 4 2017-06-04  

Full Table Replication

For tables using Full Table Replication: during the next job, the entire table will be overwritten. The table’s schema will reflect the column’s removal in the source:

SOURCE   DESTINATION  
id updated_at [rep-key] id updated_at [rep-key]
1 2017-06-03 1 2017-06-03
2 2017-06-03 2 2017-06-03
3 2017-06-04 3 2017-06-04
4 2017-06-04 4 2017-06-04
IF

You remove a column from a Stitch-replicated table in your destination.

THEN

The result of deleting a column from a Stitch-generated table depends on the type of column being removed:

  • Primary Key columns: Removing Primary Key columns will result in data being added to the table in an Append-Only fashion. As Stitch uses Primary Keys to de-dupe data, deleting Primary Key columns may result in duplicate data.
  • General columns: This refers to all columns that are not prepended by _sdc or suffixed by a data type. For example: customer_zip, but not customer_zip__st

    If new data is detected for the removed column, Stitch will re-create it in Redshift.

    Note: An integration must support tracking individual columns AND you must deselect the column in Stitch for the column removal to be permanent.

  • _sdc columns: Removing a Stitch replication column will prevent Stitch from upserting replicated data into Redshift.
  • Columns with data type suffixes: Removing a column created as result of accommodating multiple data types will prevent Stitch from loading replicated data into the table. This applies to columns with names such as: customer_zip__st, customer_zip__int, etc.
IF

Two columns are replicated that canonicalize to the same name.

THEN

For example: a table containing both CustomerId and customerid columns.

Redshift will reject the records and create a log for the rejected records in the _sdc_rejected table in that integration’s schema.

Rejected records are logged in the _sdc_rejected table of the integration's schema. Learn more.

AND

Redshift will surface the following error:

Field collision on [COLUMN_NAME]
FIX IT

If possible, re-name one of the columns in the source so that both column names will be unique when replicated to Redshift.

Use the _sdc_rejected table to identify the root of the issue.

IF

A column is replicated that has a mixed-case name.

THEN

Redshift will convert letters to lowercase. For example:

Columns in Source Columns in Redshift
CuStOmErId customerid
customerID customerid
IF

A column is replicated that has a name with spaces.

THEN

Redshift will maintain spaces. For example:

Columns in Source Columns in Redshift
customer id customer id
CUSTOMER ID customer id
IF

A column is replicated with a name that contains unsupported special characters.

THEN

Redshift will remove all special characters. Note that this doesn’t include underscores or dollar signs. For example:

Columns in Source   Columns in Redshift
customer!id   customerid
!CUSTOMERID   customerid
IF

A column is replicated with a name that begins with a non-letter.

THEN

This scenario is unremarkable to Redshift.

IF

A column name contains more characters than allowed by Redshift.

THEN

Redshift will reject columns with names that exceed the column character limit. Other columns in the table will persist to Redshift.

Rejected records are logged in the _sdc_rejected table of the integration's schema. Learn more.

AND

Redshift will surface the following error:

Column name [COLUMN] is too long for Redshift
FIX IT

If possible, change the column name in the source to be less than Redshift’s character limit of 115 characters characters.

Use the _sdc_rejected table to identify the root of the issue.


Data Typing

IF

Stitch detects multiple data types for a single column.

THEN

For example: Stitch first detected that order_confirmed contained BOOLEAN data, but during a subsequent job, detected STRING values. To accommodate data of varying types, Stitch will:

  1. Rename the original column by appending the first-detected data type to the column name. Ex: order_confirmed__bl
  2. Store data for the first-detected data type in the renamed column. In this example, only BOOLEAN values will be stored in order_confirmed__bl
  3. Create additional columns to store the other data types - one for each data type detected - and append the data type to the column name. In this example, a order_confirmed__st column will be created to store STRING values.
IF

A VARCHAR column is replicated to Redshift.

THEN

Redshift will type the column as VARCHAR(n), where n is the width of the widest record in the column.

For example: During the initial replication job, data in a VARCHAR column is found to have a width greater than 128. Redshift will create type this column as VARCHAR(256) to accommodate the data.

IF

VARCHAR data is loaded that exceeds the current maximum size for the column.

THEN

Redshift will widen the column to accommodate the width of the largest record up to the limit of 65K.

For example: during the initial replication job, a VARCHAR column is created and typed as VARCHAR(128).

During a subsequent job, Stitch identifies data for this column that exceeds the current size maximum of 128 but is less than 256.

To accommodate the data, the column will become VARCHAR(256).

IF

A VARCHAR column is replicated with data that exceeds Redshift’s size limit of 65K.

THEN

Redshift will truncate the data to the maximumm width of 65K.

IF

A column containing date data with timezone info is replicated to Redshift.

THEN

Redshift will store the data as TIMESTAMP WITHOUT TIME ZONE. In Redshift, TIMESTAMP WITHOUT TIME ZONE data is stored without timezone information and expressed as UTC.

Consider the following timestamp:

2017-08-14 11:24:02 GMT-0400 (EDT)

This data would be stored as the following in Redshift:

2017-08-14 11:24:02

For more info on this data type, refer to Amazon’s Datetime Type documentation.

IF

A column contains timestamp data that is outside Redshift’s supported range.

THEN

Redshift will reject the records that fall outside the supported range.

Rejected records are logged in the _sdc_rejected table of the integration's schema. Learn more.

AND

Redshift will surface the following error:

timestamp out of range for Redshift on [TIMESTAMP]
FIX IT

To resolve the error, offending values in the source must be changed to be within Redshift’s timestamp range of 4713 BC to 294276 AD.

Use the _sdc_rejected table to identify the root of the issue.

IF

A column contains integer data that is outside Redshift’s supported range.

THEN

Redshift will reject the records that fall outside the supported range.

Rejected records are logged in the _sdc_rejected table of the integration's schema. Learn more.

AND

Redshift will surface the following error:

integer out of range for Redshift on [INTEGER]
FIX IT

To resolve the error, offending values in the source must be changed to be within Redshift’s limit of -9223372036854775808 to 9223372036854775807.

Use the _sdc_rejected table to identify the root of the issue.

IF

A column contains decimal data.

THEN

Redshift will truncate the data.

IF

A column contains decimal data that is outside Redshift’s supported range.

THEN

Redshift will reject the records that fall outside the supported range.

Rejected records are logged in the _sdc_rejected table of the integration's schema. Learn more.

AND

Redshift will surface the following error:

decimal out of range for Redshift on [DECIMAL]
FIX IT

To resolve the error, offending values in the source must be changed to be within Redshift’s limit for decimal data.

Use the _sdc_rejected table to identify the root of the issue.


Schema Changes

IF

Stitch detects multiple data types for a single column.

THEN

For example: Stitch first detected that order_confirmed contained BOOLEAN data, but during a subsequent job, detected STRING values. To accommodate data of varying types, Stitch will:

  1. Rename the original column by appending the first-detected data type to the column name. Ex: order_confirmed__bl
  2. Store data for the first-detected data type in the renamed column. In this example, only BOOLEAN values will be stored in order_confirmed__bl
  3. Create additional columns to store the other data types - one for each data type detected - and append the data type to the column name. In this example, a order_confirmed__st column will be created to store STRING values.
IF

Data is replicated to Redshift that is nested, containing many top-level properties and potentially nested sub-properties.

THEN

The core object - or the top-level properties - will be split into columns in a table.

Any properties contained within a nested array - or sub-properties - will be “flattened” into subtables.

For example: below is the raw JSON data for an orders table:

{
  "order_id":"1234",                  // order_id and customer_id are top-level properties
  "customer_id":"100",
  "line_items":[                      // start of nested data (array)
      {
          "product_id":"5008798",     // sub-property
          "price":"5.99"              // sub-property
      }
    ]
}

Because there is a nested array in the JSON data, two tables will be created: one named orders containing the top-level properties, and another sub-table named orders___line_items that contains the sub-properties.

A tutorial on how Stitch de-nests and stores these types of records can be found in the Nested JSON Structures & Row Count Impact Guide.

IF

VARCHAR data is loaded that exceeds the current maximum size for the column.

THEN

Redshift will widen the column to accommodate the width of the largest record up to the limit of 65K.

For example: during the initial replication job, a VARCHAR column is created and typed as VARCHAR(128).

During a subsequent job, Stitch identifies data for this column that exceeds the current size maximum of 128 but is less than 256.

To accommodate the data, the column will become VARCHAR(256).

IF

A new column is added in a SaaS integration/data source to a table that’s currently set to replicate.

THEN
  1. Stitch will detect the new column during the integration’s next structure sync.
  2. The column will be automatically set to replicate.
  3. If the table uses Key-based Incremental Replication, only records with a Replication Key value greater than or equal to the last saved bookmark value will be selected for replication.

    This means that if values for the new column were backfilled in existing records, the only way to have Stitch replicate these values is to either:

    • Update the Replication Key values for all records in the source, OR
    • Reset the integration’s Replication Keys in Stitch
  4. In the destination, the column will be appended to the end of the table.
IF

A new column is added by you to a Stitch-generated table in Redshift.

THEN

Columns may be added to tables created by Stitch as long as they are nullable, meaning columns don’t have NOT NULL constraints.

IF

A column is deleted at the source.

THEN

How a deleted column is reflected in Redshift depends on the table’s Replication Method.

Consider the example below. On the left is a table as it exists in the source; on the right is the table in Redshift:

SOURCE     DESTINATION    
id updated_at [rep-key] confirmed id updated_at [rep-key] confirmed
1 2017-06-03 true 1 2017-06-03 true
2 2017-06-03 false 2 2017-06-03 false

Before the next job, the confirmed column will be deleted in the source. Additionally, new records will be added to the table.

Log-based Incremental Replication

Due to how binary logs are structured, changes to a source table - including adding or removing columns, changing data types, etc. - require manual intervention before replication can continue. Refer to the Log-based Incremental Replication documentation for more info.

Key-based Incremental Replication

For tables using Key-based Incremental Replication, default NULLs will be placed into the column going forward and the column will remain in the destination.

Below is the source table with the deleted confirmed column and how this would be reflected in Redshift:

SOURCE   DESTINATION    
id updated_at [rep-key] id updated_at [rep-key] confirmed
1 2017-06-03 1 2017-06-03 true
2 2017-06-03 2 2017-06-03 false
3 2017-06-04 3 2017-06-04  
4 2017-06-04 4 2017-06-04  

Full Table Replication

For tables using Full Table Replication: during the next job, the entire table will be overwritten. The table’s schema will reflect the column’s removal in the source:

SOURCE   DESTINATION  
id updated_at [rep-key] id updated_at [rep-key]
1 2017-06-03 1 2017-06-03
2 2017-06-03 2 2017-06-03
3 2017-06-04 3 2017-06-04
4 2017-06-04 4 2017-06-04
IF

You remove a column from a Stitch-replicated table in your destination.

THEN

The result of deleting a column from a Stitch-generated table depends on the type of column being removed:

  • Primary Key columns: Removing Primary Key columns will result in data being added to the table in an Append-Only fashion. As Stitch uses Primary Keys to de-dupe data, deleting Primary Key columns may result in duplicate data.
  • General columns: This refers to all columns that are not prepended by _sdc or suffixed by a data type. For example: customer_zip, but not customer_zip__st

    If new data is detected for the removed column, Stitch will re-create it in Redshift.

    Note: An integration must support tracking individual columns AND you must deselect the column in Stitch for the column removal to be permanent.

  • _sdc columns: Removing a Stitch replication column will prevent Stitch from upserting replicated data into Redshift.
  • Columns with data type suffixes: Removing a column created as result of accommodating multiple data types will prevent Stitch from loading replicated data into the table. This applies to columns with names such as: customer_zip__st, customer_zip__int, etc.

Data Warehouse Changes

IF

The table’s Primary Key(s) is/are changed.

THEN

If using Key-based Incremental or Log-based Incremental Replication, data will be added to the table in an Append-Only fashion.

If using Full Table Replication, the table will be overwritten in its entirety during each job.

IF

The table’s Primary Key comment is either removed or changed to be incorrectly formatted.

THEN

Data will continue to be loaded into the table in an Append-Only fashion.

IF

A new column is added by you to a Stitch-generated table in Redshift.

THEN

Columns may be added to tables created by Stitch as long as they are nullable, meaning columns don’t have NOT NULL constraints.

IF

You remove a column from a Stitch-replicated table in your destination.

THEN

The result of deleting a column from a Stitch-generated table depends on the type of column being removed:

  • Primary Key columns: Removing Primary Key columns will result in data being added to the table in an Append-Only fashion. As Stitch uses Primary Keys to de-dupe data, deleting Primary Key columns may result in duplicate data.
  • General columns: This refers to all columns that are not prepended by _sdc or suffixed by a data type. For example: customer_zip, but not customer_zip__st

    If new data is detected for the removed column, Stitch will re-create it in Redshift.

    Note: An integration must support tracking individual columns AND you must deselect the column in Stitch for the column removal to be permanent.

  • _sdc columns: Removing a Stitch replication column will prevent Stitch from upserting replicated data into Redshift.
  • Columns with data type suffixes: Removing a column created as result of accommodating multiple data types will prevent Stitch from loading replicated data into the table. This applies to columns with names such as: customer_zip__st, customer_zip__int, etc.
IF

Indices are applied to Stitch-generated columns in the destination.

THEN

Stitch will respect the index application.

IF

SORT and/or DISTRIBUTION Keys are applied to Stitch-generated columns in the destination.

THEN

This is not applicable to Redshift.

IF

You switch to a different destination of the same type.

THEN

This means the destination type is still Redshift, Stitch may just be connected a different database in Redshift.

  • For tables using Key-based or Log-based Incremental Replication, replication will continue using the Replication’s Key last saved maximum value. To re-replicate historical data, resetting Replication Keys is required.
  • For tables using Full Table Replication, the table will be fully replicated into the new destination during the next successful job.
  • For webhook integrations, some data loss may be possible due to the continuous, real-time nature of webhooks. Historical data must either be backfilled or re-played.
IF

You revoke Stitch’s schema and/or table permissions to Redshift.

THEN

Data will fail to load and you will be notified.

AND

Redshift will surface the following error:

ERROR: permission denied for relation [SCHEMA/TABLE]
FIX IT

Verify and, if necessary, re-grant the Stitch user’s permissions.

IF

You revoke Stitch’s access to Redshift entirely.

THEN

Data will fail to load and you will be notified.

AND

Redshift will surface the following error:

FATAL: password authentication failed for user '[STITCH_USER]'
FIX IT

Re-authorize Stitch’s access to Redshift by ensuring the Stitch user’s credentials are correct and the required permissions are granted.



Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.