Saturday, April 10, 2021

How To: Linear Regression for Big Data in Snowflake

BI tools like Looker can perform linear regression on small datasets, and they do it in memory.  As awesome as these tools are, they are limited to small datasets.  And when you need to feed that data into a gradient boosting tool like XGBoost, you will most likely want to perform the linear regression in SQL.

Below are two examples: one that can be run in any database that supports CTEs, and one that is specific to Snowflake.  I have also projected the trend line beyond the sample data to show you how to do simple projections, and I include R2 (R-squared) so you can evaluate the model fit.
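
For reference, the generic version below implements ordinary least squares directly: SLOPE = sum((x - avg(x)) * (y - avg(y))) / sum((x - avg(x))^2), INTERCEPT = avg(y) - SLOPE * avg(x), and R2 = 1 - sum((y - predicted_y)^2) / sum((y - avg(y))^2).  These map one-to-one to the SLOPE, CALC, and RSQUARED CTEs.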

Sample Data

CREATE OR REPLACE TEMPORARY TABLE DATA_SAMPLE (X INTEGER,Y INTEGER);

INSERT INTO DATA_SAMPLE (X,Y)
VALUES (1,41),
       (2,963),
       (3,4506),
       (4,13557),
       (5,40422),
       (6,61484),
       (7,85075),
       (8,118436),
       (9,134457),
       (10,180253);

Generic SQL Logic

WITH X_Y_AVG AS (
    SELECT   X
           , Y
           , AVG(X) OVER() AS X_AVG
           , AVG(Y) OVER() AS Y_AVG
      FROM DATA_SAMPLE
)
, SLOPE AS (
    SELECT   SUM((X - X_AVG) * (Y - Y_AVG)) / SUM((X - X_AVG) * (X - X_AVG)) AS SLOPE
           , MAX(X_AVG) AS X_AVG_MAX
           , MAX(Y_AVG) AS Y_AVG_MAX
      FROM X_Y_AVG
)
, CALC AS (
    SELECT   SLOPE
           , Y_AVG_MAX - X_AVG_MAX * SLOPE AS INTERCEPT
      FROM SLOPE
)
, TREND_LINE AS (
        SELECT   X_Y_AVG.X
               , X_Y_AVG.Y
               , X_Y_AVG.X_AVG
               , X_Y_AVG.Y_AVG
               , CALC.SLOPE
               , CALC.INTERCEPT
               , (X_Y_AVG.X * CALC.SLOPE) + CALC.INTERCEPT AS TREND_LINE_Y
          FROM X_Y_AVG
          CROSS JOIN CALC
         UNION ALL
        SELECT   PROJECTION_X.X
               , NULL
               , NULL
               , NULL
               , CALC.SLOPE
               , CALC.INTERCEPT
               , (PROJECTION_X.X * CALC.SLOPE) + CALC.INTERCEPT AS TREND_LINE_Y
          FROM (VALUES (11),
                       (12),
                       (13),
                       (14),
                       (15),
                       (16)) AS PROJECTION_X (X)
          CROSS JOIN CALC
  )
, RSQUARED AS (
    SELECT 1-(SUM(POWER(Y - TREND_LINE_Y,2)))/(SUM(POWER(Y - Y_AVG,2))) AS R2
      FROM TREND_LINE
)
SELECT   TREND_LINE.X
       , TREND_LINE.Y
       , TREND_LINE.SLOPE
       , TREND_LINE.INTERCEPT
       , TREND_LINE.TREND_LINE_Y
       , CASE WHEN TREND_LINE.Y IS NOT NULL THEN (SELECT R2 FROM RSQUARED) END AS R2
  FROM TREND_LINE
 ORDER BY TREND_LINE.X
;

Snowflake-Specific SQL Logic Using the REGR_SLOPE & REGR_INTERCEPT Functions

WITH CALC AS (
    SELECT   REGR_SLOPE(Y, X) AS SLOPE
           , REGR_INTERCEPT(Y, X) AS INTERCEPT
     FROM DATA_SAMPLE
)
SELECT   DATA_SAMPLE.X
       , DATA_SAMPLE.Y
       , CALC.SLOPE
       , CALC.INTERCEPT
       , (DATA_SAMPLE.X * CALC.SLOPE) + CALC.INTERCEPT AS TREND_LINE_Y
       , REGR_R2(DATA_SAMPLE.Y, "TREND_LINE_Y") OVER() AS R2
  FROM DATA_SAMPLE
  CROSS JOIN CALC
 UNION ALL
SELECT   PROJECTION_X.X
       , NULL
       , CALC.SLOPE
       , CALC.INTERCEPT
       , (PROJECTION_X.X * CALC.SLOPE) + CALC.INTERCEPT AS TREND_LINE_Y
       , NULL
  FROM (SELECT COLUMN1 AS X
          FROM VALUES  (11),
                       (12),
                       (13),
                       (14),
                       (15),
                       (16)) AS PROJECTION_X 
  CROSS JOIN CALC
  ORDER BY X;  



Friday, April 09, 2021

How To: Snowflake Data Ingestion Techniques

There are many ways to ingest data.  The list of techniques below is not exhaustive; it is more a set of simple "go-to" techniques for your toolbox.  Without belaboring the subject, your ingestion technique will depend on whether your data arrives as CSV, JSON, XML, fixed-length string, or Parquet files; whether it is a high or low volume, variety, and velocity dataset; and whether your reporting runs on a "real-time", hourly, daily, weekly, or monthly cadence.  Below are three basic techniques from my toolbox that have evolved over the last 5 years of using Snowflake on various ingestion challenges.


Technique A: I typically use this technique for ingesting data from ERP, CRM, or data mastering databases in which the volume or velocity is not extreme.

1, 2, 3:  You would create a pull script (2) that bulk pulls the data from a table in the source database (1), using a change control timestamp in the table (if available) or the change control log of the database system itself.  Depending on the volume and the change control available for the table, you can choose to pull the whole table each time or pull only new and changed records, and land the result in a Parquet or CSV file (3).  Change control methods vary depending on the database or table design.
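
For example, a minimal sketch of an incremental pull query run against the source system (the ERP.CUSTOMER table, LAST_MODIFIED_TS column, and watermark parameter are hypothetical names for illustration):

SELECT *
  FROM ERP.CUSTOMER                                  -- source system table (hypothetical)
 WHERE LAST_MODIFIED_TS > :LAST_SUCCESSFUL_PULL_TS;  -- watermark saved from the previous run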

4:  You then COPY the file INTO Snowflake (4).  I like ingesting Parquet files because they are compressed and ingest extremely fast into Snowflake.
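
As a minimal sketch, assuming the file landed in a named stage @ERP_STAGE and a raw landing table with a single VARIANT column (both names are hypothetical):

CREATE TABLE IF NOT EXISTS RAW.CUSTOMER_RAW (SRC VARIANT);

COPY INTO RAW.CUSTOMER_RAW
  FROM (SELECT $1 FROM @ERP_STAGE/customer/)  -- $1 is the Parquet row as a VARIANT
  FILE_FORMAT = (TYPE = PARQUET);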

5:  Dedup the table (5).  This is where you need to know the primary key of the table.  If needed, you can treat the table like a slowly changing dimension and use a non-destructive method to identify the most current records and deprecate the duplicates.  I typically use a soft delete timestamp, which gives me a window into historical changes.  If it is a high-volume table and the changes are also high volume, you may wish to perform a hard delete instead.
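
A sketch of the soft-delete dedup, assuming the staged table has CUSTOMER_ID as its primary key plus LOAD_TS and SOFT_DELETED_TS columns (all hypothetical names):

-- Deprecate everything except the most recent record per primary key.
UPDATE STAGING.CUSTOMER
   SET SOFT_DELETED_TS = CURRENT_TIMESTAMP()
  FROM (SELECT CUSTOMER_ID AS KEEP_ID, MAX(LOAD_TS) AS MAX_LOAD_TS
          FROM STAGING.CUSTOMER
         WHERE SOFT_DELETED_TS IS NULL
         GROUP BY CUSTOMER_ID) KEEP
 WHERE CUSTOMER_ID = KEEP.KEEP_ID
   AND LOAD_TS < KEEP.MAX_LOAD_TS
   AND SOFT_DELETED_TS IS NULL;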

6:  Publish (6) the table in the Store when all other supporting tables are also ready to be published.  Publish tables as a group, with any dependent tables published at the same time.  Snowflake's cloning process is extremely fast, so any read blocking you might incur will not be noticeable.  Make sure you use Snowflake's ability to keep the security grants when using the CREATE OR REPLACE statement on an existing table.
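
For example, the publish step can be a clone that keeps the existing privileges on the replaced table (the STAGING and PUBLISHED schema names are hypothetical):

CREATE OR REPLACE TABLE PUBLISHED.CUSTOMER
  CLONE STAGING.CUSTOMER
  COPY GRANTS;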

7, 8:  You will then create a VIEW (7) that transforms the data into a shape that fits your target table for the warehouse.  Selecting against that VIEW, you can either do an incremental INSERT or a full replacement of the table via CREATE TABLE (8).  For small tables, you can just replace the whole table and save yourself the complexity of change control and incremental processing.
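
A sketch of that view-plus-replace pattern, with hypothetical schema, table, and column names:

-- Transformation view that shapes the staged data for the warehouse table.
CREATE OR REPLACE VIEW TRANSFORM.CUSTOMER_V AS
SELECT   CUSTOMER_ID
       , UPPER(TRIM(CUSTOMER_NAME)) AS CUSTOMER_NAME
       , CREATED_TS::DATE           AS CREATED_DATE
  FROM STORE.CUSTOMER
 WHERE SOFT_DELETED_TS IS NULL;

-- Full replacement for small tables; switch to an incremental INSERT for large ones.
CREATE OR REPLACE TABLE STAGING_WAREHOUSE.DIM_CUSTOMER AS
SELECT * FROM TRANSFORM.CUSTOMER_V;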

9: Clone the Staging Table into the published schema area (9).


Technique B:  I typically use this technique when I run across JSON, XML, or fixed-length string files to ingest.  It is like Technique A, but it has an extra transformation and table staging step to flatten the semi-structured data and place it into one or more staging tables.

5.2, 5.4:  This is where you would create a VIEW (5.2) that flattens the JSON or XML, or performs some special handling on the fixed-length string file.  You would then incrementally insert into, or do a full replacement of, the staging table (5.4), and then proceed to the other steps to further transform and publish.
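
For example, assuming the raw JSON landed in a VARIANT column SRC on RAW.ORDERS_RAW and each document carries a nested "lines" array (all hypothetical names), the flattening view might look like:

CREATE OR REPLACE VIEW STAGING.ORDERS_FLATTEN_V AS
SELECT   SRC:order_id::NUMBER        AS ORDER_ID
       , SRC:customer_id::NUMBER     AS CUSTOMER_ID
       , LINE.VALUE:sku::STRING      AS SKU
       , LINE.VALUE:quantity::NUMBER AS QUANTITY
  FROM RAW.ORDERS_RAW
     , LATERAL FLATTEN(INPUT => SRC:lines) LINE;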


Technique C:  This technique is especially useful when you ingest a high-volume table into Snowflake.  It is like Technique A from steps 1-6, but no transformations are performed; you go straight to publishing via cloning.  This technique assumes that you can trust the upstream data source for data quality and that the shape is exactly what you will be using for reporting.  For really fast ingestion, I would suggest Parquet files.


Please note that when dealing with high-volume, high-velocity data, there are several other techniques that can help you meet an intense operational or business reporting cadence.  The above list of techniques is not exhaustive, but rather a set of "go-to" techniques for your toolbox.