Friday, February 04, 2022

How To: Segment High Volume Events into 15 mins Ticks in Snowflake

Segmenting your events results in exploding your data.   When you have billions of event records needing to be segmented into 15 minute ticks, this could end up being slow and expensive to process.  The smaller the segmentation requirement the more explosion.   

Word to the Wise:  Don't join a big event table to another big table ,or in this case, a large segmentation dataset.   You want the segmentation dataset to be as small as possible to help the sql optimizer to be able to hold at least one side of the join in memory and no spill to disk.

Here is my solution that worked very well in Snowflake.   

Please Note: it uses my simple GENERATE_TIMELINE  function that I posted earlier.  And the SEGMENT_TICK CTE dates don't have to change ever.  

The below example is a 15 min example(900 seconds).  Just change the $GRAIN_SECONDS to what you need for segment size.    Your event table can be as large as you require and cover as may months as you need.  The SEGMENT_TICK CTE remains as a tiny dataset.

SET GRAIN_SECONDS = 900;

WITH SEGMENT_TICK AS (
    -- Get epoch seconds starts in 1970  
    SELECT DATE_PART(EPOCH_SECONDS, TIME_START) AS SECONDS 
      FROM TABLE(GENERATE_TIMELINE(
                      '1970-01-01'
                    , '1970-01-02 03:00:00'  
                    , $GRAIN_SECONDS)
                )
)
--------------------------------------------------------------------
-- Set up test cases
--------------------------------------------------------------------
, TEST_DATA AS (SELECT    COLUMN1 AS TEST_CASE
                        , COLUMN2 AS EVENT_START_TIMESTAMP_LOCAL
                        , COLUMN3 AS EVENT_END_TIMESTAMP_LOCAL
                        , DATE_PART(EPOCH_SECONDS, "EVENT_START_TIMESTAMP_LOCAL") % 86400 AS EVENT_START_SECONDS
                        , DATEDIFF(SECONDS, "EVENT_START_TIMESTAMP_LOCAL", "EVENT_END_TIMESTAMP_LOCAL") AS EVENT_DURATION
                  FROM (VALUES    (   'EVENT AFTER MIDNIGHT'                         
                                    , '2021-11-12 01:00:00'::TIMESTAMP_NTZ
                                    , '2021-11-12 02:00:00'::TIMESTAMP_NTZ)
                                , (   'EVENT BEFORE MIDNIGHT'                         
                                    , '2021-11-11 20:00:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 21:00:00'::TIMESTAMP_NTZ) 
                                , (   'EVENT STRADDLES MIDNIGHT'                       
                                    , '2021-11-11 23:00:00'::TIMESTAMP_NTZ
                                    , '2021-11-12 01:00:00'::TIMESTAMP_NTZ)
                                , (   'NOT ON GRAIN EVENT LATE'                       
                                    , '2021-11-11 13:05:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 14:05:00'::TIMESTAMP_NTZ)
                                , (   'NOT ON GRAIN EVENT EARLY'                      
                                    , '2021-11-11 16:55:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 17:55:00'::TIMESTAMP_NTZ)
                                , (   'NOT ON GRAIN EVENT EARLY & LATE'                      
                                    , '2021-11-11 16:55:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 18:05:00'::TIMESTAMP_NTZ)
                                , (   'CROSSING THE 3 AM BOUNDRY'                    
                                    , '2021-11-11 02:55:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 03:55:00'::TIMESTAMP_NTZ)
                       )
                )
--------------------------------------------------------------------
-- Calc Event Start seconds and Event Duration
--------------------------------------------------------------------
, EVENT_GRAIN AS (SELECT  T.EVENT_START_TIMESTAMP_LOCAL
                        , T.EVENT_END_TIMESTAMP_LOCAL
                        , T.EVENT_START_SECONDS
                        , T.EVENT_DURATION
                       
                        -- Calc Start Time
                        , DATEADD(SECONDS, S.SECONDS, T.EVENT_START_TIMESTAMP_LOCAL::DATE) AS SEGMENT_TIME_START
                        -- Calc End Time
                        , DATEADD(SECONDS, $GRAIN_SECONDS, "SEGMENT_TIME_START") SEGMENT_TIME_END
                    FROM SEGMENT_TICK S
                    JOIN TEST_DATA T
                        -- Overlap Window logic
                        -- T.Start < S.End
                        -- S.Start < T.End
                     ON T.EVENT_START_SECONDS < S.SECONDS + $GRAIN_SECONDS              
                    AND S.SECONDS < T.EVENT_START_SECONDS + T.EVENT_DURATION   
                )    
-- Return Results
SELECT * 
  FROM EVENT_GRAIN;
 
 

No comments: