There are many ways to ingest data. The list below is not exhaustive; it is more a set of simple "go-to" techniques for your "toolbox". Without belaboring the subject, your ingestion technique will depend on whether your data arrives as CSV, JSON, XML, fixed-length string, or Parquet files, whether it is a high or low volume, variety, and velocity dataset, and whether your reporting runs on a real-time, hourly, daily, weekly, or monthly cadence. Below, I have provided three basic techniques that I keep in my "toolbox" and that have evolved over the last 5 years of using Snowflake on various ingestion challenges.
Technique A: I typically use this technique for ingesting data from an ERP, CRM, or data mastering database where the volume and velocity are not extreme.
1, 2, 3: You create a pull script (2) that bulk pulls the data from a table in the source database (1), using a change control timestamp in the table (if available) or a change control log in the database system itself. Depending on the volume and the change control available for the table, you can choose to pull the whole table each time or pull only new and changed records, and write the result to a Parquet or CSV file (3). Change control methods vary depending on the database or table design.
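To make that concrete, here is a minimal sketch of the kind of source-side query a pull script might run. The ORDERS table, LAST_MODIFIED_TS column, and :last_successful_pull bind variable are hypothetical names; your ERP or CRM will have its own table and change control column.

    -- Incremental pull: only rows changed since the last successful run.
    -- All names here are placeholders for your own source objects.
    SELECT *
    FROM ORDERS
    WHERE LAST_MODIFIED_TS > :last_successful_pull   -- high-water mark saved by the script
    ORDER BY LAST_MODIFIED_TS;
    -- The script then writes the result set to a Parquet or CSV file (step 3).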
4: You then load the file into Snowflake with COPY INTO (4). I like ingesting Parquet files because they are compressed and ingest extremely fast into Snowflake.
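As a rough sketch, the load might look like the following, assuming the Parquet files sit on a hypothetical stage named INGEST_STAGE and the file columns match the staging table columns (all object names are illustrative):

    -- Bulk load the Parquet files into a raw staging table.
    COPY INTO RAW.STAGE_ORDERS
    FROM @INGEST_STAGE/orders/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;  -- map Parquet columns to table columns by name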
5: Dedup the table (5). This is where you need to know the primary key of the table. If you need to, you can treat the table like a Slowly Changing Dimension and use a non-destructive method for identifying the most current records and deprecating the duplicates. I typically use a soft delete timestamp, which gives me a window into historical changes. If it is a high volume table with a high volume of changes, you may wish to perform a hard delete instead.
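Here is one way to sketch the soft delete approach, assuming a primary key of ORDER_ID, a load timestamp column LOADED_AT, and a DELETED_AT column used as the soft delete marker (all names illustrative):

    -- Keep the most recent row per primary key; stamp older duplicates as
    -- soft deleted instead of removing them, preserving a history window.
    UPDATE RAW.STAGE_ORDERS t
    SET DELETED_AT = CURRENT_TIMESTAMP()
    FROM (
        SELECT ORDER_ID, MAX(LOADED_AT) AS MAX_LOADED_AT
        FROM RAW.STAGE_ORDERS
        GROUP BY ORDER_ID
    ) latest
    WHERE t.ORDER_ID = latest.ORDER_ID
      AND t.LOADED_AT < latest.MAX_LOADED_AT
      AND t.DELETED_AT IS NULL;
    -- For a hard delete, a DELETE with the same join condition would remove
    -- the older duplicates outright.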
6: Publish (6) the table in the Store when all other supporting tables are also ready to be published. Publish tables as a group, with any dependent tables published at the same time. Snowflake's cloning process happens extremely fast, so any read blocking that you might incur will not be noticeable. Make sure you use Snowflake's ability to keep the security grants (the COPY GRANTS clause) when using the CREATE OR REPLACE statement on an existing table.
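A sketch of that publish step, with RAW and STORE as placeholder schema names:

    -- Publish by cloning the deduplicated staging table into the Store.
    -- COPY GRANTS keeps the security grants of the table being replaced.
    CREATE OR REPLACE TABLE STORE.ORDERS
      CLONE RAW.STAGE_ORDERS
      COPY GRANTS;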
7, 8: You will then create a VIEW (7) that transforms the data into a shape that fits your target table for the warehouse. Selecting against that VIEW, you can either do an incremental INSERT or a full replacement of the table via CREATE TABLE (8). For small tables you can just replace the whole table and save yourself the complexity of change control and incremental processing.
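For illustration, here is a minimal sketch of the VIEW plus a full replacement build; an incremental approach would instead INSERT only the new or changed rows selected from the same VIEW. All object names are assumptions:

    -- Transform VIEW shaped to the warehouse target table.
    CREATE OR REPLACE VIEW WAREHOUSE.VW_ORDERS AS
    SELECT
        ORDER_ID,
        CUSTOMER_ID,
        ORDER_TS::DATE AS ORDER_DATE,
        AMOUNT
    FROM STORE.ORDERS
    WHERE DELETED_AT IS NULL;   -- current records only

    -- Small table: full replacement, no incremental change control needed.
    CREATE OR REPLACE TABLE WAREHOUSE.STAGE_FACT_ORDERS AS
    SELECT * FROM WAREHOUSE.VW_ORDERS;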
9: Clone the staging table into the published schema area (9).
Technique B: I typically use this technique when I run across JSON, XML, or fixed-length string files to ingest. It's like Technique A, but it has an extra transformation and table staging step to flatten the semi-structured data and place it into one or more staging tables.
5.2, 5.4: This is where you would create a VIEW (5.2) that flattens the JSON or XML, or performs some special handling on the fixed-length string file. You would then incrementally insert into, or do a full replacement of, the staging table (5.4), and proceed to the other steps to further transform and publish.
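As a rough sketch, a flattening VIEW for JSON landed in a VARIANT column can use Snowflake's LATERAL FLATTEN; the table, column, and JSON path names below are made up for illustration:

    -- Flatten a nested items array out of the raw JSON payload.
    CREATE OR REPLACE VIEW RAW.VW_EVENTS_FLAT AS
    SELECT
        e.RAW_JSON:eventId::STRING        AS EVENT_ID,
        e.RAW_JSON:eventTs::TIMESTAMP_NTZ AS EVENT_TS,
        i.value:sku::STRING               AS SKU,
        i.value:qty::NUMBER               AS QTY
    FROM RAW.STAGE_EVENTS e,
         LATERAL FLATTEN(input => e.RAW_JSON:items) i;

    -- Full replacement of the flattened staging table (5.4).
    CREATE OR REPLACE TABLE RAW.STAGE_EVENTS_FLAT AS
    SELECT * FROM RAW.VW_EVENTS_FLAT;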
Technique C: This technique is especially useful when you ingest a high volume table into Snowflake. It's like Technique A from steps 1 through 6, but no transformations are performed; you go straight to publishing via cloning. This technique assumes that you can trust the upstream data source for data quality and that the shape is exactly what you will be using for reporting. For really fast ingestion, I suggest Parquet files.
Please note that when dealing with high volume, high velocity data, there are several other techniques that can help you meet an intense operational or business reporting cadence. The above list of techniques is not exhaustive, but rather a set of "go-to" techniques for your "toolbox".