Amazon S3 CSV feature snapshot

A high-level look at Stitch's Amazon S3 CSV integration, including release status, useful links, and the features supported in Stitch.

STITCH

  • Release Status: Released
  • Supported By: Stitch
  • Stitch Plan: Free
  • Supported Versions: n/a
  • Singer GitHub Repository: Amazon S3 CSV Repository

CONNECTION METHODS

  • SSH Connections: Unsupported
  • SSL Connections: Unsupported

REPLICATION SETTINGS

  • Anchor Scheduling: Supported
  • Advanced Scheduling: Unsupported
  • Table-level Reset: Unsupported
  • Configurable Replication Methods: Unsupported

REPLICATION METHODS

  • Log-based Replication: Unsupported
  • Key-based Replication: Supported
  • Full Table Replication: Unsupported

DATA SELECTION

  • Table Selection: Supported
  • Column Selection: Supported
  • View Replication: Unsupported

TRANSPARENCY

  • Extraction Logs: Supported
  • Loading Reports: Supported

Connecting Amazon S3 CSV

Amazon S3 CSV setup requirements

To set up Amazon S3 CSV in Stitch, you need:

  • An Amazon Web Services (AWS) account. Signing up is free - go to https://aws.amazon.com to create an account if you don’t have one already.

  • Permissions in AWS Identity Access Management (IAM) that allow you to create policies, create roles, and attach policies to roles. This is required to grant Stitch authorization to your S3 bucket.

  • To verify that column names in CSV files adhere to your destination’s length limit for column names. If a column name exceeds the destination’s limit, the destination will reject the column. Compliant columns will persist to the destination.

    Column name limits vary by destination:

    • Amazon S3 - Not applicable to this destination

    • BigQuery - Limited to 128 characters

    • Azure SQL Data Warehouse - Limited to 128 characters

    • Panoply - Limited to 115 characters

    • PostgreSQL - Limited to 59 characters

    • Redshift - Limited to 115 characters

    • Snowflake - Limited to 251 characters
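
    For example: a 70-character column name would be rejected by a PostgreSQL destination (59-character limit), but would persist to a Snowflake destination (251-character limit).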


Step 1: Retrieve your Amazon Web Services account ID

  1. Sign into your Amazon Web Services (AWS) account.
  2. Click the user menu, located between the bell and Global menus in the top-right corner of the page.
  3. Click My Account.
  4. In the Account Settings section of the page, locate the Account Id field:

    An AWS account ID, highlighted in the AWS Account Settings page

Keep this handy - you’ll need it to complete the next step.

Step 2: Add Amazon S3 CSV as a Stitch data source

  1. If you aren’t signed into your Stitch account, sign in now.
  2. On the Stitch Dashboard page, click the Add Integration button.

  3. Locate and click the Amazon S3 icon.
  4. Fill in the fields as follows:

    • Integration Name: Enter a name for the integration. This is the name that will display on the Stitch Dashboard for the integration; it’ll also be used to create the schema in your destination.

      For example, the name “Stitch Amazon S3 CSV” would create a schema called stitch_amazon_s3_csv in the destination. Note: The schema name cannot be changed after the integration is saved.

    • S3 Bucket: Enter the name of the bucket you want to replicate data from. Enter only the bucket name - no URLs, https://, or S3 parts. For example: com-test-stitch-bucket

    • AWS Account ID: Paste the AWS account ID you retrieved in Step 1.

Step 3: Configure tables

Next, you’ll indicate which CSV file(s) you want to include for replication. You can include a single CSV file, or map several CSV files to a table.

In the following sections, we’ll walk you through how to configure a table in Stitch.

Step 3.1: Define the table's search settings

In this step, you’ll tell Stitch which files in your S3 bucket you want to replicate data from. To do this, you’ll use the Search Pattern and Directory fields.

Step 3.1.1: Define the Search Pattern

The Search Pattern field defines the search criteria Stitch should use for selecting and replicating CSV files. This field accepts regular expressions, which can be used to include a single file or multiple CSV files.

The search pattern you use depends on how data for a particular entity is updated. Consider these examples:

Scenario 1: Single file, periodically updated

  • How updates are made: A single CSV file is periodically updated with new and updated customer data.

  • File name: customers.csv

  • Search pattern: Because there will only ever be one file, you could enter the exact name of the file in your S3 bucket:

    customers\.csv

  • Matches: customers.csv, exactly

Scenario 2: Multiple files, generated daily

  • How updates are made: A new CSV file is created every day that contains new and updated customer data. Old files are never updated after they’re created.

  • File name: customers-<string>.csv, where <string> is a unique, random string

  • Search pattern: To ensure new and updated files are identified, you’d want to enter a search pattern that would match all files beginning with customers, regardless of the string in the file name:

    (customers-).*\.csv

  • Matches:

    • customers-reQDSwNG6U.csv
    • customers-xaPTXfN4tD.csv
    • customers-MBJMhCbNCp.csv
    • etc.

When creating a search pattern, keep the following in mind:

  • Special characters such as periods (.) have special meaning in regular expressions. To match them exactly, they’ll need to be escaped. For example: \.
  • Stitch uses Python for regular expressions, which may vary in syntax from other varieties. Try using PyRegex to test your expressions before saving the integration in Stitch, or sanity-check them locally as shown in the sketch below.
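
Because Stitch’s matching is based on Python regular expressions, you can sanity-check a search pattern locally with Python’s re module before saving the integration. This is a minimal sketch - the object keys are hypothetical, and Stitch’s exact matching semantics (such as whether matches are anchored) may differ:

    import re

    # The search patterns from the examples above.
    single_file_pattern = r"customers\.csv"
    multi_file_pattern = r"(customers-).*\.csv"

    # Hypothetical object keys you might find in an S3 bucket.
    keys = [
        "customers.csv",
        "customers-reQDSwNG6U.csv",
        "data-exports/lists/customers-xaPTXfN4tD.csv",
        "orders.csv",
    ]

    for key in keys:
        # re.search matches the pattern anywhere in the string, so overly
        # broad patterns will also match keys you may not have intended.
        print(key, bool(re.search(multi_file_pattern, key)))
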
Step 3.1.2: Limit file search to a specific directory

The Directory field limits the location of the file search Stitch performs during replication jobs. When defined, Stitch will only search for files in this location and select the ones that match the search pattern.

To define a specific location in the S3 bucket, enter the directory path into the Directory field. For example: data-exports/lists. Note: This field is not a regular expression.

While using this field is optional, limiting the search to a single location may make extraction more efficient.

Step 3.2: Define the table's name

In the Table Name field, enter a name for the table. Keep in mind that each destination has its own rules for how tables can be named. For example: Amazon Redshift table names can’t exceed 127 characters.

If the table name exceeds the destination’s character limit, the destination will reject the table entirely. Refer to the documentation for your destination for more info about table naming rules.

Step 3.3: Define the table's Primary Key

In the Primary Key field, enter one or more header fields (separated by commas) Stitch can use to identify unique rows. For example:

account_id,date

If undefined, Stitch will load data into the table in an append-only fashion. This means that existing rows in the destination won’t be updated, but will be appended to the end of the table. Refer to the Primary Keys and Append-Only Replication section below for more info and examples.

Step 3.4: Specify datetime fields

In the Specify datetime fields field, enter one or more header fields (separated by commas) that should appear in the destination table as datetime fields instead of strings. For example:

created_at,updated_at

If columns are not specified and values cannot be parsed as dates, Stitch will load them as nullable strings. Refer to the Determining data types section for more info on how Stitch identifies data types.
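
For example: specifying created_at ensures that values like 2021-06-01 12:00:00 load as datetimes, rather than relying on Stitch’s type inference to detect them.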

Step 3.5: Configure additional tables

If you want to add another table, click the Configure another table? link below the Specify datetime fields field. Otherwise, move on to the Sync historical data section.

Stitch doesn’t enforce a limit on the number of tables that you can configure for a single integration.

Step 4: Define the historical sync

In the Sync Historical Data section, you’ll define the Start Date Stitch uses to select files during the initial replication job.

For example: You’ve added a customers.*\.csv search pattern and set the integration’s historical Start Date to 1 year. During the initial replication job, Stitch will fully replicate the contents of all files that match the search pattern that have been modified in the past year.

During subsequent replication jobs, Stitch will only replicate the files that have been modified since the last job ran.

As files included in a replication job are replicated in full during each job, how data is added to updated files can impact your row count. Refer to the Incremental Replication for Amazon S3 CSV section for more info on initial and subsequent replication jobs for Amazon S3 CSV.
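
For example: if a 10,000-row file is modified before every replication job, Stitch will replicate all 10,000 rows each time - even if only a single row changed.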

Step 5: Create a replication schedule

In the Replication Frequency section, you’ll create the integration’s replication schedule. An integration’s replication schedule determines how often Stitch runs a replication job, and the time that job begins.

Amazon S3 CSV integrations support the following replication scheduling methods:

  • Replication Frequency
  • Anchor scheduling

To keep your row usage low, consider setting the integration to replicate less frequently. See the Understanding and Reducing Your Row Usage guide for tips on reducing your usage.

Step 6: Grant access to your bucket using AWS IAM

Next, Stitch will display a Grant Access to Your Bucket page. This page contains the info you need to configure bucket access for Stitch, which is accomplished via an IAM policy and role.

Note: Saving the integration before you’ve completed the steps below will result in connection errors.

Step 6.1: Create an IAM policy

An IAM policy is a JSON-based access policy that manages permissions to bucket resources. The policy Stitch provides is auto-generated and unique to the specific bucket you entered in the setup page.

The top-level permissions the Stitch IAM policy grants are described below.

s3:GetObject

  • GET Object: Allows for the retrieval of objects from Amazon S3.

  • HEAD Object: Allows for the retrieval of metadata from an object without returning the object itself.

s3:ListBucket

  • GET Bucket (List Objects): Allows for the return of some or all (up to 1,000) of the objects in a bucket.

  • HEAD Bucket: Used to determine if a bucket exists and access is allowed.
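
For reference, a policy granting these two permissions typically has the following shape. This is an illustrative sketch only - the bucket name below is the example from Step 2, and you should always paste the exact policy Stitch generates for your integration:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": "arn:aws:s3:::com-test-stitch-bucket/*"
        },
        {
          "Effect": "Allow",
          "Action": ["s3:ListBucket"],
          "Resource": "arn:aws:s3:::com-test-stitch-bucket"
        }
      ]
    }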

To create the IAM policy:

  1. In AWS, navigate to the IAM service by clicking the Services menu and typing IAM.
  2. Click IAM once it displays in the results.
  3. On the IAM home page, click Policies in the menu on the left side of the page.
  4. Click Create Policy.
  5. In the Create Policy page, click the JSON tab.
  6. Select everything currently in the text field and delete it.
  7. In the text field, paste the Stitch IAM policy.
  8. Click Review policy.
  9. On the Review Policy page, give the policy a name. For example: stitch_s3
  10. Click Create policy.

Step 6.2: Create an IAM role for Stitch

In this step, you’ll create an IAM role for Stitch and apply the IAM policy from the previous step. This will ensure that Stitch is visible in any logs and audits.

To create the role, you’ll need the Account ID, External ID, and Role Name values provided on the Stitch Grant Access to Your Bucket page.

  1. In AWS, navigate to the IAM Roles page.
  2. Click Create Role.
  3. On the Create Role page:
    1. In the Select type of trusted entity section, click the Another AWS account option.
    2. In the Account ID field, paste the Account ID from Stitch. Note: This isn’t your AWS account ID from Step 1 - this is the Account ID that displays in Stitch on the Grant Access to Your Bucket page.
    3. In the Options section, check the Require external ID box.
    4. In the External ID field that displays, paste the External ID from the Stitch Grant Access to Your Bucket page:

      Account ID and External ID fields mapped from Stitch to AWS
    5. Click Next: Permissions.
  4. On the Attach permissions page:
    1. Search for the policy you created in Step 6.1.
    2. Once located, check the box next to it in the table.
    3. Click Next: Tags.
  5. If you want to enter any tags, do so on the Add tags page. Otherwise, click Next: Review.
  6. On the Review page:
    1. In the Role name field, paste the Role Name from the Stitch Grant Access to Your Bucket page:

      Role name field mapped from Stitch to AWS

      Remember: Role names are unique to the Stitch Amazon S3 CSV integration they’re created for. Attempting to use the same role for multiple integrations will cause connection errors.

    2. Enter a description in the Role description field. For example: Stitch role for Amazon S3 CSV integration.
    3. Click Create role.

Step 6.3: Check and save the connection in Stitch

After you’ve created the IAM policy and role, you can save the integration in Stitch. When finished, click Check and Save.

Step 7: Select data to replicate

The last step is to select the tables and columns you want to replicate.

When you track a table, you’ll also need to define its Replication Key.

To select tables and columns:

  1. In the Integration Details page, click the Tables to Replicate tab.
  2. Locate a table you want to replicate.
  3. Click the checkbox next to the object’s name. A green checkmark means the object is set to replicate.
  4. If there are child objects, they’ll automatically display and you’ll be prompted to select some.
  5. After you set a table to replicate, the Settings page will display. Note: When you track a table, by default all columns will also be tracked.

  6. In the Settings page, define the table’s Replication Method and, if using Key-based Incremental Replication, its Replication Key.

  7. Repeat this process for every table you want to replicate.

  8. Click the Finalize Your Selections button to save your data selections.

Initial and historical replication jobs

After you finish setting up Amazon S3 CSV, its Sync Status may show as Pending on either the Stitch Dashboard or in the Integration Details page.

For a new integration, a Pending status indicates that Stitch is in the process of scheduling the initial replication job for the integration. This may take some time to complete.

Free historical data loads

The first seven days of replication, beginning when data is first replicated, are free. Rows replicated from the new integration during this time won’t count towards your quota. Stitch offers this as a way of testing new integrations, measuring usage, and ensuring historical data volumes don’t quickly consume your quota.


Amazon S3 CSV Replication

In this section:

  • Determining data types
  • Incremental Replication for Amazon S3 CSV
  • Primary Keys and Append-Only Replication

Determining data types

To determine a column’s data type, Stitch will analyze the first 1,000 lines of (up to) the first five files included for a given table.

Stitch’s Amazon S3 CSV integration will load data from CSV files and type it as one of the following data types:

  • DATETIME - If a value can’t be parsed as a date, Stitch will load the column as a nullable string. To ensure dates are typed properly, specify them during setup.

  • INTEGER

  • NUMBER

  • STRING
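
Conceptually, the type inference resembles the sketch below. This is illustrative only - it is not Stitch’s actual implementation, and the parsing rules (for example, which date formats are recognized) are assumptions:

    from datetime import datetime

    def infer_type(values):
        """Guess a column type from sample string values - illustrative only."""
        def is_int(v):
            try:
                int(v)
                return True
            except ValueError:
                return False

        def is_number(v):
            try:
                float(v)
                return True
            except ValueError:
                return False

        def is_datetime(v):
            try:
                datetime.fromisoformat(v)  # assumes ISO 8601; real parsing may be broader
                return True
            except ValueError:
                return False

        non_empty = [v for v in values if v != ""]
        if all(is_int(v) for v in non_empty):
            return "INTEGER"
        if all(is_number(v) for v in non_empty):
            return "NUMBER"
        if all(is_datetime(v) for v in non_empty):
            return "DATETIME"
        return "STRING"  # anything unparseable falls back to a nullable string

    print(infer_type(["1", "2", "3"]))        # INTEGER
    print(infer_type(["1.5", "2"]))           # NUMBER
    print(infer_type(["2021-06-01", ""]))     # DATETIME
    print(infer_type(["abc", "2021-06-01"]))  # STRING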

Incremental Replication for Amazon S3 CSV

While data from Amazon S3 CSV integrations is replicated using Key-based Incremental Replication, the behavior for this integration differs subtly from other integrations.

The table below compares Key-based Incremental Replication and Replication Key behavior for Amazon S3 CSV to that of other integrations.

What’s used as a Replication Key?

  • Amazon S3 CSV: The time a file is modified.
  • Other integrations: A column or columns in a table.

Are Replication Keys inclusive?

  • Amazon S3 CSV: No. Only files with a modification timestamp greater than the last saved bookmark are replicated.
  • Other integrations: Yes. Rows with a Replication Key value greater than or equal to the last saved bookmark are replicated.

What’s replicated during a replication job?

  • Amazon S3 CSV: The entire contents of a modified file.
  • Other integrations: Only new or updated rows in a table.
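
The file selection behavior described above can be sketched with boto3, the AWS SDK for Python. The bucket name, pattern, and bookmark below are assumptions for illustration - this is not Stitch’s internal code:

    import re
    from datetime import datetime, timezone

    import boto3  # AWS SDK for Python

    BUCKET = "com-test-stitch-bucket"                      # example bucket from Step 2
    SEARCH_PATTERN = re.compile(r"(customers-).*\.csv")    # example search pattern
    bookmark = datetime(2021, 6, 1, tzinfo=timezone.utc)   # last saved bookmark (assumed format)

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    to_replicate = []
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            # A file is selected only if it matches the search pattern AND was
            # modified strictly after the bookmark (the exclusive comparison
            # described in the table above).
            if SEARCH_PATTERN.search(obj["Key"]) and obj["LastModified"] > bookmark:
                to_replicate.append(obj["Key"])

    # Each selected file is then replicated in full, and the bookmark advances
    # to the greatest LastModified value seen.
    print(to_replicate)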

Primary Keys and Append-Only Replication

For destinations that support upserts (that is, updating existing rows), Stitch uses Primary Keys to de-dupe data during loading. Primary Keys are used to identify unique rows within a table and ensure that only the most recently updated version of that record appears in your data warehouse.

If Primary Keys aren’t specified during setup, Stitch will load data using Append-Only Replication. This means that existing rows in the destination won’t be updated, but instead appended to the end of the table.

Additionally, Stitch will append a column (__sdc_primary_key) to the table to function as a Primary Key. Note: Appending this column will not enable Stitch to de-dupe data, as a unique value will be inserted every time a row is loaded, regardless of whether it’s ever been replicated before. This means that a record can have multiple __sdc_primary_key values, each of them unique.
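
For example: if the same file is replicated during three consecutive jobs, each of its records will appear in the destination table three times, each copy with a different __sdc_primary_key value.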



Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.