Querying Append-Only Tables | Stitch Documentation

The querying strategy outline here can be applied to any table that is loaded in an Append-Only manner. This is applicable to BigQuery and Amazon S3 (CSV) destinations.

Append-Only Replication is a type of Incremental Replication where newly replicated data is appended to the end of a table. Existing rows are not updated - updates are added to the end of the table as new rows.

For tables using Incremental Replication, Stitch currently loads data into Google BigQuery in an append-only fashion. This means that as time goes on, tables will wind up containing many different versions of the same row.

Data stored this way can provide insights and historical details about how those rows have changed over time - creating a timeline of the status changes of an order record, for example - but in some cases, you might just want the latest version of the table.

Grab the latest version of every row

In each Stitch-generated integration table, you’ll see a few columns prepended with _sdc. The column we’ll focus on here is the _sdc_sequence column. This column is a Unix epoch (down to the millisecond) attached to the record during replication and can help determine the order of all the versions of a row.

Stitch uses these sequence values in a few places to correctly order rows for loading, but it can be also used to grab the latest version of a record in an append-only table.

Let’s take a look at an example. Assume we have an orders table that contains:

A Primary Key of id,
The system _sdc columns added by Stitch, and
Other order attribute columns

If you wanted to create a snapshot of the latest version of this table, you could run a query like this:

    SELECT DISTINCT o.*
      FROM [stitch-analytics-bigquery-123:ecommerce.orders] o
INNER JOIN (
     SELECT id,
            MAX(_sdc_sequence) AS seq,
            MAX(_sdc_batched_at) AS batch
    FROM [stitch-analytics-bigquery-123:ecommerce.orders]
    GROUP BY id) oo
ON o.id = oo.id
AND o._sdc_sequence = oo.seq
AND o._sdc_batched_at = oo.batch

This approach uses a subquery to get a single list of every row’s Primary Key, maximum sequence number, and maximum batched at timestamp. Since it’s possible to have duplicate records in your warehouse, the query also selects only distinct records of the latest version of the row. It then joins the original table to the Primary Key, maximum sequence, and maximum batched at, which makes all other column values available for querying.

Create views in your data warehouse

To make this easier, you can turn queries like the one above into a view. We recommend this approach because a view will encapsulate all the logic and simplify the process of querying against the latest version of your data.

For more info on creating views in the data warehouses Stitch supports, check out these docs:

Questions? Feedback?

Did this article help? If you have questions or feedback, feel free to submit a pull request with your suggestions, open an issue on GitHub, or reach out to us.

Related	Troubleshooting
Loading Data into Your Destination Table Schema Changes Nested Structures & Row Count Impact Querying Append-Only Tables	Data Discrepancy Guide Identifying Rejected Records