Amazon S3 CSV feature snapshot

A high-level look at Stitch's Amazon S3 CSV integration, including release status, useful links, and the features supported in Stitch.

STITCH

  • Release Status: Released
  • Supported By: Stitch
  • Stitch Plan: Free
  • Supported Versions: n/a
  • Singer GitHub Repository: Amazon S3 CSV Repository

CONNECTION METHODS

  • SSH Connections: Unsupported
  • SSL Connections: Unsupported

REPLICATION SETTINGS

  • Anchor Scheduling: Supported
  • Advanced Scheduling: Unsupported
  • Table-level Reset: Unsupported
  • Configurable Replication Methods: Unsupported

REPLICATION METHODS

  • Log-based Replication: Unsupported
  • Key-based Replication: Supported
  • Full Table Replication: Unsupported

DATA SELECTION

  • Table Selection: Supported
  • Column Selection: Supported
  • View Replication: Unsupported

TRANSPARENCY

  • Extraction Logs: Supported
  • Loading Reports: Supported

Connecting Amazon S3 CSV

Amazon S3 CSV setup requirements

To set up Amazon S3 CSV in Stitch, you need:

  • An Amazon Web Services (AWS) account. Signing up is free - go to https://aws.amazon.com to create an account if you don’t have one already.

  • Permissions in AWS Identity Access Management (IAM) that allow you to create policies, create roles, and attach policies to roles. This is required to grant Stitch authorization to your S3 bucket.

  • To verify that column names in CSV files adhere to your destination’s length limit for column names. If a column name exceeds the destination’s limit, the destination will reject the column. Compliant columns will persist to the destination.

    Column name limits vary by destination:

    • Amazon S3 - Not applicable to this destination

    • BigQuery - Limited to 128 characters

    • Azure SQL Data Warehouse - Limited to 128 characters

    • Panoply - Limited to 115 characters

    • PostgreSQL - Limited to 59 characters

    • Redshift - Limited to 115 characters

    • Snowflake - Limited to 251 characters
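
    For example: a 70-character column name would be rejected by a PostgreSQL destination (59-character limit), but would persist to a Snowflake destination (251-character limit).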


Step 1: Retrieve your Amazon Web Services account ID

  1. Sign into your Amazon Web Services (AWS) account.
  2. Click the user menu, located between the bell and Global menus in the top-right corner of the page.
  3. Click My Account.
  4. In the Account Settings section of the page, locate the Account Id field:

    An AWS account ID, highlighted in the AWS Account Settings page

Keep this handy - you’ll need it to complete the next step.

Step 2: Add Amazon S3 CSV as a Stitch data source

  1. If you aren’t signed into your Stitch account, sign in now.
  2. On the Stitch Dashboard page, click the Add Integration button.

  3. Locate and click the Amazon S3 icon.
  4. Fill in the fields as follows:

    • Integration Name: Enter a name for the integration. This is the name that will display on the Stitch Dashboard for the integration; it’ll also be used to create the schema in your destination.

      For example, the name “Stitch Amazon S3 CSV” would create a schema called stitch_amazon_s3_csv in the destination. Note: The schema name cannot be changed after the integration is saved.

    • S3 Bucket: Enter the name of the bucket you want to replicate data from. Enter only the bucket name - no URLs, https://, or S3 parts. For example: com-test-stitch-bucket

    • AWS Account ID: Paste the AWS account ID you retrieved in Step 1.

Step 3: Configure tables

Next, you’ll indicate which CSV file(s) you want to include for replication. You can include a single CSV file, or map several CSV files to a table.

In the following sections, we’ll walk you through how to configure a table in Stitch.

Step 3.1: Define the table's search settings

In this step, you’ll tell Stitch which files in your S3 bucket you want to replicate data from. To do this, you’ll use the Search Pattern and Directory fields.

Step 3.1.1: Define the Search Pattern

The Search Pattern field defines the search criteria Stitch should use for selecting and replicating CSV files. This field accepts regular expressions, which can be used to include a single file or multiple CSV files.

The search pattern you use depends on how data for a particular entity is updated. Consider these examples:

Scenario 1: Single file, periodically updated

  • How updates are made: A single CSV file is periodically updated with new and updated customer data.

  • File name: customers.csv

  • Search pattern: Because there will only ever be one file, you could enter the exact name of the file in your S3 bucket:

    customers\.csv

  • Matches: customers.csv, exactly

Scenario 2: Multiple files, generated daily

  • How updates are made: A new CSV file is created every day that contains new and updated customer data. Old files are never updated after they’re created.

  • File name: customers-<string>.csv, where <string> is a unique, random string

  • Search pattern: To ensure new and updated files are identified, you’d want to enter a search pattern that would match all files beginning with customers, regardless of the string in the file name:

    (customers-).*\.csv

  • Matches:

    • customers-reQDSwNG6U.csv
    • customers-xaPTXfN4tD.csv
    • customers-MBJMhCbNCp.csv
    • etc.

When creating a search pattern, keep the following in mind:

  • Special characters such as periods (.) have special meaning in regular expressions. To match them exactly, they’ll need to be escaped. For example: \.
  • Stitch uses Python for regular expressions, which may vary in syntax from other varieties. Try using PyRegex to test your expressions before saving the integration in Stitch, or sanity-check them locally as shown in the sketch below.
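
Because Stitch’s matching is based on Python regular expressions, you can sanity-check a search pattern locally with Python’s re module before saving the integration. This is a minimal sketch - the object keys are hypothetical, and Stitch’s exact matching semantics (such as whether matches are anchored) may differ:

    import re

    # The search patterns from the examples above.
    single_file_pattern = r"customers\.csv"
    multi_file_pattern = r"(customers-).*\.csv"

    # Hypothetical object keys you might find in an S3 bucket.
    keys = [
        "customers.csv",
        "customers-reQDSwNG6U.csv",
        "data-exports/lists/customers-xaPTXfN4tD.csv",
        "orders.csv",
    ]

    for key in keys:
        # re.search matches the pattern anywhere in the string, so overly
        # broad patterns will also match keys you may not have intended.
        print(key, bool(re.search(multi_file_pattern, key)))
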
Step 3.1.2: Limit file search to a specific directory

The Directory field limits the location of the file search Stitch performs during replication jobs. When defined, Stitch will only search for files in this location and select the ones that match the search pattern.

To define a specific location in the S3 bucket, enter the directory path into the Directory field. For example: data-exports/lists. Note: This field is not a regular expression.

While using this field is optional, limiting the search to a single location may make extraction more efficient.

Step 3.2: Define the table's name

In the Table Name field, enter a name for the table. Keep in mind that each destination has its own rules for how tables can be named. For example: Amazon Redshift table names can’t exceed 127 characters.

If the table name exceeds the destination’s character limit, the destination will reject the table entirely. Refer to the documentation for your destination for more info about table naming rules.

Step 3.3: Define the table's Primary Key

In the Primary Key field, enter one or more header fields (separated by commas) Stitch can use to identify unique rows. For example:

account_id,date

If undefined, Stitch will load data into the table in an append-only fashion. This means that existing rows in the destination won’t be updated, but will be appended to the end of the table. Refer to the Primary Keys and Append-Only Replication section below for more info and examples.

Step 3.4: Specify datetime fields

In the Specify datetime fields field, enter one or more header fields (separated by commas) that should appear in the destination table as datetime fields instead of strings. For example:

created_at,updated_at

If columns are not specified and values cannot be parsed as dates, Stitch will load them as nullable strings. Refer to the Determining data types section for more info on how Stitch identifies data types.
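
For example: specifying created_at ensures that values like 2021-06-01 12:00:00 load as datetimes, rather than relying on Stitch’s type inference to detect them.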

Step 3.5: Configure additional tables

If you want to add another table, click the Configure another table? link below the Specify datetime fields field. Otherwise, move on to the Sync historical data section.

Stitch doesn’t enforce a limit on the number of tables that you can configure for a single integration.

Step 4: Define the historical sync

In the Sync Historical Data section, you’ll define the Start Date Stitch uses to select files during the initial replication job.

For example: You’ve added a customers.*\.csv search pattern and set the integration’s historical Start Date to 1 year. During the initial replication job, Stitch will fully replicate the contents of all files that match the search pattern that have been modified in the past year.

During subsequent replication jobs, Stitch will only replicate the files that have been modified since the last job ran.

As files included in a replication job are replicated in full during each job, how data is added to updated files can impact your row count. Refer to the Incremental Replication for Amazon S3 CSV section for more info on initial and subsequent replication jobs for Amazon S3 CSV.
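
For example: if a 10,000-row file is modified before every replication job, Stitch will replicate all 10,000 rows each time - even if only a single row changed.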

Step 5: Create a replication schedule

In the Replication Frequency section, you’ll create the integration’s replication schedule. An integration’s replication schedule determines how often Stitch runs a replication job, and the time that job begins.

Amazon S3 CSV integrations support the following replication scheduling methods:

  • Replication Frequency
  • Anchor scheduling

To keep your row usage low, consider setting the integration to replicate less frequently. See the Understanding and Reducing Your Row Usage guide for tips on reducing your usage.

Step 6: Grant access to your bucket using AWS IAM

Next, Stitch will display a Grant Access to Your Bucket page. This page contains the info you need to configure bucket access for Stitch, which is accomplished via an IAM policy and role.

Note: Saving the integration before you’ve completed the steps below will result in connection errors.

Step 6.1: Create an IAM policy

An IAM policy is a JSON-based access policy that manages permissions to bucket resources. The policy Stitch provides is auto-generated and unique to the specific bucket you entered in the setup page.

The top-level permissions the Stitch IAM policy grants are described below.

s3:GetObject

  • GET Object: Allows for the retrieval of objects from Amazon S3.

  • HEAD Object: Allows for the retrieval of metadata from an object without returning the object itself.

s3:ListBucket

  • GET Bucket (List Objects): Allows for the return of some or all (up to 1,000) of the objects in a bucket.

  • HEAD Bucket: Used to determine if a bucket exists and access is allowed.
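
For reference, a policy granting these two permissions typically has the following shape. This is an illustrative sketch only - the bucket name below is the example from Step 2, and you should always paste the exact policy Stitch generates for your integration:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": "arn:aws:s3:::com-test-stitch-bucket/*"
        },
        {
          "Effect": "Allow",
          "Action": ["s3:ListBucket"],
          "Resource": "arn:aws:s3:::com-test-stitch-bucket"
        }
      ]
    }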

To create the IAM policy:

  1. In AWS, navigate to the IAM service by clicking the Services menu and typing IAM.
  2. Click IAM once it displays in the results.
  3. On the IAM home page, click Policies in the menu on the left side of the page.
  4. Click Create Policy.
  5. In the Create Policy page, click the JSON tab.
  6. Select everything currently in the text field and delete it.
  7. In the text field, paste the Stitch IAM policy.
  8. Click Review policy.
  9. On the Review Policy page, give the policy a name. For example: stitch_s3
  10. Click Create policy.

Step 6.2: Create an IAM role for Stitch

In this step, you’ll create an IAM role for Stitch and apply the IAM policy from the previous step. This will ensure that Stitch is visible in any logs and audits.

To create the role, you’ll need the Account ID, External ID, and Role Name values provided on the Stitch Grant Access to Your Bucket page.

  1. In AWS, navigate to the IAM Roles page.
  2. Click Create Role.
  3. On the Create Role page:
    1. In the Select type of trusted entity section, click the Another AWS account option.
    2. In the Account ID field, paste the Account ID from Stitch. Note: This isn’t your AWS account ID from Step 1 - this is the Account ID that displays in Stitch on the Grant Access to Your Bucket page.
    3. In the Options section, check the Require external ID box.
    4. In the External ID field that displays, paste the External ID from the Stitch Grant Access to Your Bucket page:

      Account ID and External ID fields mapped from Stitch to AWS
    5. Click Next: Permissions.
  4. On the Attach permissions page:
    1. Search for the policy you created in Step 6.1.
    2. Once located, check the box next to it in the table.
    3. Click Next: Tags.
  5. If you want to enter any tags, do so on the Add tags page. Otherwise, click Next: Review.
  6. On the Review page:
    1. In the Role name field, paste the Role Name from the Stitch Grant Access to Your Bucket page:

      Role name field mapped from Stitch to AWS

      Remember: Role names are unique to the Stitch Amazon S3 CSV integration they’re created for. Attempting to use the same role for multiple integrations will cause connection errors.

    2. Enter a description in the Role description field. For example: Stitch role for Amazon S3 CSV integration.
    3. Click Create role.

Step 6.3: Check and save the connection in Stitch

After you’ve created the IAM policy and role, you can save the integration in Stitch. When finished, click Check and Save.

Step 7: Select data to replicate

The last step is to select the tables and columns you want to replicate.

When you track a table, you’ll also need to define its Replication Key.

To select tables and columns:

  1. In the Integration Details page, click the Tables to Replicate tab.
  2. Locate a table you want to replicate.
  3. Click the checkbox next to the object’s name. A green checkmark means the object is set to replicate.
  4. If there are child objects, they’ll automatically display and you’ll be prompted to select some.
  5. After you set a table to replicate, the Settings page will display. Note: When you track a table, by default all columns will also be tracked.

  6. In the Settings page, define the table’s Replication Method and, if using Key-based Incremental Replication, its Replication Key.

  7. Repeat this process for every table you want to replicate.

  8. Click the Finalize Your Selections button to save your data selections.

Initial and historical replication jobs

After you finish setting up Amazon S3 CSV, its Sync Status may show as Pending on either the Stitch Dashboard or in the Integration Details page.

For a new integration, a Pending status indicates that Stitch is in the process of scheduling the initial replication job for the integration. This may take some time to complete.

Free historical data loads

The first seven days of replication, beginning when data is first replicated, are free. Rows replicated from the new integration during this time won’t count towards your quota. Stitch offers this as a way of testing new integrations, measuring usage, and ensuring historical data volumes don’t quickly consume your quota.


Amazon S3 CSV Replication

In this section:

  • Determining data types
  • Incremental Replication for Amazon S3 CSV
  • Primary Keys and Append-Only Replication

Determining data types

To determine a column’s data type, Stitch will analyze the first 1,000 lines of (up to) the first five files included for a given table.

Stitch’s Amazon S3 CSV integration will load data from CSV files and type it as one of the following data types:

  • DATETIME - If a value can’t be parsed as a date, Stitch will load the column as a nullable string. To ensure dates are typed properly, specify them during setup.

  • INTEGER

  • NUMBER

  • STRING
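
Conceptually, the type inference resembles the sketch below. This is illustrative only - it is not Stitch’s actual implementation, and the parsing rules (for example, which date formats are recognized) are assumptions:

    from datetime import datetime

    def infer_type(values):
        """Guess a column type from sample string values - illustrative only."""
        def is_int(v):
            try:
                int(v)
                return True
            except ValueError:
                return False

        def is_number(v):
            try:
                float(v)
                return True
            except ValueError:
                return False

        def is_datetime(v):
            try:
                datetime.fromisoformat(v)  # assumes ISO 8601; real parsing may be broader
                return True
            except ValueError:
                return False

        non_empty = [v for v in values if v != ""]
        if all(is_int(v) for v in non_empty):
            return "INTEGER"
        if all(is_number(v) for v in non_empty):
            return "NUMBER"
        if all(is_datetime(v) for v in non_empty):
            return "DATETIME"
        return "STRING"  # anything unparseable falls back to a nullable string

    print(infer_type(["1", "2", "3"]))        # INTEGER
    print(infer_type(["1.5", "2"]))           # NUMBER
    print(infer_type(["2021-06-01", ""]))     # DATETIME
    print(infer_type(["abc", "2021-06-01"]))  # STRING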

Incremental Replication for Amazon S3 CSV

While data from Amazon S3 CSV integrations is replicated using Key-based Incremental Replication, the behavior for this integration differs subtly from other integrations.

The table below compares Key-based Incremental Replication and Replication Key behavior for Amazon S3 CSV to that of other integrations.

What’s used as a Replication Key?

  • Amazon S3 CSV: The time a file is modified.
  • Other integrations: A column or columns in a table.

Are Replication Keys inclusive?

  • Amazon S3 CSV: No. Only files with a modification timestamp greater than the last saved bookmark are replicated.
  • Other integrations: Yes. Rows with a Replication Key value greater than or equal to the last saved bookmark are replicated.

What’s replicated during a replication job?

  • Amazon S3 CSV: The entire contents of a modified file.
  • Other integrations: Only new or updated rows in a table.
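
The file selection behavior described above can be sketched with boto3, the AWS SDK for Python. The bucket name, pattern, and bookmark below are assumptions for illustration - this is not Stitch’s internal code:

    import re
    from datetime import datetime, timezone

    import boto3  # AWS SDK for Python

    BUCKET = "com-test-stitch-bucket"                      # example bucket from Step 2
    SEARCH_PATTERN = re.compile(r"(customers-).*\.csv")    # example search pattern
    bookmark = datetime(2021, 6, 1, tzinfo=timezone.utc)   # last saved bookmark (assumed format)

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    to_replicate = []
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            # A file is selected only if it matches the search pattern AND was
            # modified strictly after the bookmark (the exclusive comparison
            # described in the table above).
            if SEARCH_PATTERN.search(obj["Key"]) and obj["LastModified"] > bookmark:
                to_replicate.append(obj["Key"])

    # Each selected file is then replicated in full, and the bookmark advances
    # to the greatest LastModified value seen.
    print(to_replicate)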

Primary Keys and Append-Only Replication

For destinations that support upserts (that is, updating existing rows), Stitch uses Primary Keys to de-dupe data during loading. Primary Keys are used to identify unique rows within a table and ensure that only the most recently updated version of that record appears in your data warehouse.

If Primary Keys aren’t specified during setup, Stitch will load data using Append-Only Replication. This means that existing rows in the destination won’t be updated, but instead appended to the end of the table.

Additionally, Stitch will append a column (__sdc_primary_key) to the table to function as a Primary Key. Note: Appending this column will not enable Stitch to de-dupe data, as a unique value will be inserted every time a row is loaded, regardless of whether it’s ever been replicated before. This means that a record can have multiple __sdc_primary_key values, each of them unique.
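
For example: if the same file is replicated during three consecutive jobs, each of its records will appear in the destination table three times, each copy with a different __sdc_primary_key value.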



Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.