Friday, February 04, 2022

How To: Segment High-Volume Events into 15-Minute Ticks in Snowflake

Segmenting events explodes your data.  When you have billions of event records that need to be segmented into 15-minute ticks, processing can become slow and expensive.  The smaller the segment size, the greater the explosion.

Word to the Wise:  Don't join a big event table to another big table, or in this case, a large segmentation dataset.   You want the segmentation dataset to be as small as possible so the SQL optimizer can hold at least one side of the join in memory and avoid spilling to disk.

Here is my solution that worked very well in Snowflake.   

Please note: it uses the simple GENERATE_TIMELINE function that I posted earlier, and the dates in the SEGMENT_TICK CTE never have to change.

The example below uses a 15-minute grain (900 seconds).  Just change $GRAIN_SECONDS to the segment size you need.  Your event table can be as large as you require and cover as many months as you need; the SEGMENT_TICK CTE remains a tiny dataset.

SET GRAIN_SECONDS = 900;

WITH SEGMENT_TICK AS (
    -- Tick offsets in epoch seconds (the epoch starts 1970-01-01)
    SELECT DATE_PART(EPOCH_SECONDS, TIME_START) AS SECONDS 
      FROM TABLE(GENERATE_TIMELINE(
                      '1970-01-01'
                    , '1970-01-02 03:00:00'  
                    , $GRAIN_SECONDS)
                )
)
--------------------------------------------------------------------
-- Set up test cases
--------------------------------------------------------------------
, TEST_DATA AS (SELECT    COLUMN1 AS TEST_CASE
                        , COLUMN2 AS EVENT_START_TIMESTAMP_LOCAL
                        , COLUMN3 AS EVENT_END_TIMESTAMP_LOCAL
                        , DATE_PART(EPOCH_SECONDS, "EVENT_START_TIMESTAMP_LOCAL") % 86400 AS EVENT_START_SECONDS
                        , DATEDIFF(SECONDS, "EVENT_START_TIMESTAMP_LOCAL", "EVENT_END_TIMESTAMP_LOCAL") AS EVENT_DURATION
                  FROM (VALUES    (   'EVENT AFTER MIDNIGHT'                         
                                    , '2021-11-12 01:00:00'::TIMESTAMP_NTZ
                                    , '2021-11-12 02:00:00'::TIMESTAMP_NTZ)
                                , (   'EVENT BEFORE MIDNIGHT'                         
                                    , '2021-11-11 20:00:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 21:00:00'::TIMESTAMP_NTZ) 
                                , (   'EVENT STRADDLES MIDNIGHT'                       
                                    , '2021-11-11 23:00:00'::TIMESTAMP_NTZ
                                    , '2021-11-12 01:00:00'::TIMESTAMP_NTZ)
                                , (   'NOT ON GRAIN EVENT LATE'                       
                                    , '2021-11-11 13:05:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 14:05:00'::TIMESTAMP_NTZ)
                                , (   'NOT ON GRAIN EVENT EARLY'                      
                                    , '2021-11-11 16:55:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 17:55:00'::TIMESTAMP_NTZ)
                                , (   'NOT ON GRAIN EVENT EARLY & LATE'                      
                                    , '2021-11-11 16:55:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 18:05:00'::TIMESTAMP_NTZ)
                                , (   'CROSSING THE 3 AM BOUNDARY'
                                    , '2021-11-11 02:55:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 03:55:00'::TIMESTAMP_NTZ)
                       )
                )
--------------------------------------------------------------------
-- Calc Event Start seconds and Event Duration
--------------------------------------------------------------------
, EVENT_GRAIN AS (SELECT  T.EVENT_START_TIMESTAMP_LOCAL
                        , T.EVENT_END_TIMESTAMP_LOCAL
                        , T.EVENT_START_SECONDS
                        , T.EVENT_DURATION
                       
                        -- Calc Start Time
                        , DATEADD(SECONDS, S.SECONDS, T.EVENT_START_TIMESTAMP_LOCAL::DATE) AS SEGMENT_TIME_START
                        -- Calc End Time
                        , DATEADD(SECONDS, $GRAIN_SECONDS, "SEGMENT_TIME_START") SEGMENT_TIME_END
                    FROM SEGMENT_TICK S
                    JOIN TEST_DATA T
                        -- Overlap Window logic
                        -- T.Start < S.End
                        -- S.Start < T.End
                     ON T.EVENT_START_SECONDS < S.SECONDS + $GRAIN_SECONDS              
                    AND S.SECONDS < T.EVENT_START_SECONDS + T.EVENT_DURATION   
                )    
-- Return Results
SELECT * 
  FROM EVENT_GRAIN;
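
As a sanity check on the overlap logic: the 'EVENT AFTER MIDNIGHT' test case (01:00-02:00) should fan out to four rows with SEGMENT_TIME_START values of 01:00, 01:15, 01:30, and 01:45, and the 'EVENT STRADDLES MIDNIGHT' case (23:00-01:00) should fan out to eight rows running from 23:00 on 2021-11-11 through 00:45 on 2021-11-12.  That straddle case is why the SEGMENT_TICK timeline extends past 24 hours to 1970-01-02 03:00:00.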
 
 

Thursday, February 03, 2022

Baseline Model: Business Process Sales to Engineering Cycle BPMN

I typically use Business Process Modeling Notation (BPMN) in my technical documentation to stay technology neutral where appropriate, but it is also very useful for thinking through business-level problems where only people or companies are involved.  There is one catch, though: most business-level people cannot read a BPMN model easily, so you will need to translate it into English.  The following is an example of a business process I documented in BPMN a while back to fix a communication issue between teams.


Baseline Model: Document Management Conceptual Model

Years ago I created this conceptual-level data model to capture the essence of the data definitions required for general document management.   It's a baseline.
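
To give a feel for the shape of such a baseline, here is a minimal, hypothetical sketch in DDL form; the entity and column names below are illustrative only and are not taken from the original diagram.

-- Hypothetical sketch only: illustrative entities for general document management.
CREATE TABLE DOCUMENT_TYPE (
    DOCUMENT_TYPE_ID    NUMBER        PRIMARY KEY
  , DOCUMENT_TYPE_NAME  VARCHAR       -- e.g. contract, invoice, policy
);

CREATE TABLE DOCUMENT (
    DOCUMENT_ID         NUMBER        PRIMARY KEY
  , DOCUMENT_TYPE_ID    NUMBER        REFERENCES DOCUMENT_TYPE (DOCUMENT_TYPE_ID)
  , TITLE               VARCHAR
  , CREATED_TIMESTAMP   TIMESTAMP_NTZ
);

CREATE TABLE DOCUMENT_VERSION (
    DOCUMENT_VERSION_ID NUMBER        PRIMARY KEY
  , DOCUMENT_ID         NUMBER        REFERENCES DOCUMENT (DOCUMENT_ID)
  , VERSION_NUMBER      NUMBER
  , STORAGE_LOCATION    VARCHAR       -- pointer to the stored content
  , AUTHORED_BY         VARCHAR
  , AUTHORED_TIMESTAMP  TIMESTAMP_NTZ
);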




What Is: Data Modeling

Basic Concepts of Data Modeling

Data modeling is an applied discipline for defining the language (nouns and verbs) of the business.   The results of a modeling exercise are a series of data models and a data dictionary that are used to educate and bring awareness of the meaning, relationships, and value of the data across the business.   Data models are most commonly used to define the data requirements for a company’s transactional and data warehouse needs and are vital to data management and governance.


Approaches: Top-Down, Bottom-Up, and POC

Top-Down: A process of extracting information from subject matter experts for a given business area and producing a model that documents the business data requirements that IT can use to build, modify, or buy services.   You can speed up this process by starting with reference models and modifying areas that deviate from the reference models.


Bottom-Up:  A process of reverse-engineering data structures into “As-Built” data models in an effort to take inventory of the current data ecosystem.  Gap analysis can be performed by determining the gap between the Top-Down and the Bottom-Up produced data models. 


Proof of Concept (POC):  A process where the data requirements are discovered by means of creating a prototype.


Data Model Perspective

There are many types of data model perspectives.  The following are the most common.


Conceptual Data Model:  Describes the language of the business at a summary level, communicating the subject and scope to all stakeholders.   These models describe entities and their relationships without going into primary keys or attribute-level detail.


Logical Data Model:  Describes the language of the business in detail while remaining technology neutral and following normalization rules.  The logical model is primarily used to communicate the data requirements to the business, development, and operational teams.


Physical Data Model:  Describes the actual schema design, optimized and targeted for a specific technology (transactional database, analytical database, NoSQL database, JSON file, CSV file, etc.).   These models are used primarily to communicate the actual physical design to the development, test, and operational teams.

  • As-Built Physical Data Model:  Describes the physical data model as reverse-engineered from a production database.
  • As-Documented Physical Data Model:  Describes the physical data model as documented.  These documents may deviate from what is actually out in production.
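
As a small, hypothetical illustration of how the perspectives describe the same concept (the Customer example below is mine, not drawn from any specific model):

-- Conceptual: "A Customer is a party that purchases our products" -- entities and relationships only.
-- Logical:    CUSTOMER (Customer Id PK, Customer Name, Email Address), related one-to-many to ORDER, fully normalized.
-- Physical:   the schema as actually deployed and tuned for the target platform, for example:
CREATE TABLE CUSTOMER (
    CUSTOMER_ID    NUMBER        NOT NULL
  , CUSTOMER_NAME  VARCHAR(200)  NOT NULL
  , EMAIL_ADDRESS  VARCHAR(320)
  , CONSTRAINT PK_CUSTOMER PRIMARY KEY (CUSTOMER_ID)
);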


Reference Data Models

Data modeling is a relatively mature science, and most business areas out in the wild have been logically mapped out over the years.   As a result, there are several mature data models that are considered “reference” models.  Using these reference models as a baseline saves a lot of time and resources, which can be refocused on the unique needs of the business that differ from the reference models.   Len Silverston, Paul Agnew, and David Hay have published several model patterns and reference designs.  I, too, have thrown my hat in the ring with my models here on this website.


Data Normalization

The first, second, and third rules of data normalization were defined by Edgar F. Codd.   Codd and Raymond F. Boyce combined their efforts to define Boyce-Codd Normal Form (BCNF) in 1974.  Data normalization is a process that organizes information in a way that clearly defines all the relevant entities, their attributes, and the relationships between entities.   Each entity should represent only one concept, such as a person, place, or thing, and must have a key that uniquely identifies each entity record.  The key must contain only the attributes necessary to make each entity record unique and nothing but those attributes.  Non-key attributes included in the entity’s definition must be separate from the entity’s unique key and must intrinsically belong to the entity.  Any redundancy must be removed.
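
As a small, hypothetical example of these rules in practice (the table and column names are mine), a flat order record that repeats customer attributes on every row can be split so that each entity carries only the attributes that intrinsically belong to it:

-- Before: ORDER_FLAT (ORDER_ID, ORDER_DATE, CUSTOMER_ID, CUSTOMER_NAME, CUSTOMER_EMAIL)
--         repeats customer attributes on every order row.
-- After:  each entity has one key and only its own attributes; the redundancy is removed.
CREATE TABLE CUSTOMER (
    CUSTOMER_ID     NUMBER  NOT NULL PRIMARY KEY
  , CUSTOMER_NAME   VARCHAR
  , CUSTOMER_EMAIL  VARCHAR
);

CREATE TABLE SALES_ORDER (
    ORDER_ID        NUMBER  NOT NULL PRIMARY KEY
  , ORDER_DATE      DATE
  , CUSTOMER_ID     NUMBER  NOT NULL REFERENCES CUSTOMER (CUSTOMER_ID)
);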


Data Integrity & Performance

Data integrity refers to the assurance that the information being managed in a data system is consistent and accurate over the life of the data.  Data integrity is critical in the design of a transactional database system.  Normalizing your data and applying foreign key constraints are a natural way to design a transactional system that inherently enforces data integrity.   But a balance must be struck with the performance needs of the system.  This typically requires optimizing the design by introducing derived and duplicate data elements for faster reporting and scaling.  Optimization may also include turning off foreign key constraints, collapsing or partitioning tables, reshaping tables, or changing a many-to-many relationship into a one-to-many relationship.  In other words, selective de-normalization of the design is required to tune a database to perform as required under stress.   As these optimizations are introduced, data integrity problems may arise, so extra processes must be put in place to monitor and protect the data.
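
Continuing the hypothetical CUSTOMER/SALES_ORDER example from the normalization section, selective de-normalization might duplicate the customer name and store a derived order total on the order table for faster reporting, which in turn calls for an extra integrity check:

-- Duplicated and derived attributes added purely for reporting performance (hypothetical example).
ALTER TABLE SALES_ORDER ADD COLUMN CUSTOMER_NAME VARCHAR;        -- duplicated from CUSTOMER
ALTER TABLE SALES_ORDER ADD COLUMN ORDER_TOTAL   NUMBER(18,2);   -- derived from order lines

-- Periodic integrity check: flag orders whose duplicated name has drifted from the source of truth.
SELECT O.ORDER_ID
  FROM SALES_ORDER O
  JOIN CUSTOMER    C ON C.CUSTOMER_ID = O.CUSTOMER_ID
 WHERE O.CUSTOMER_NAME IS DISTINCT FROM C.CUSTOMER_NAME;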


Agile Data Modeling and the Long Term View

Agile development can be applied to data modeling by utilizing reference models and limiting the scope of the modeling effort to the storyboard.   But it is also critical to think of the big picture when it comes to enterprise-wide reporting and data management.  Do not silo your data modeling effort too much.  A conceptual or logical data model focused on the broad view is of great value to the whole team.


Data Modeling Tools

Modeling tools enable a data modeler to create graphical representations of the data models following very specific notation rules.  Some tools can do both data and process modeling using Business Process Modeling Notation (BPMN) and link the two together.  My favorite over the years is Visual Paradigm.   Cheap and just works. 

How To: A Pattern for Pipeline Stored Procedures in Snowflake

I created a typical pattern for my stored procedures within the pipeline part of the system.   Here is my "go-to" pattern for starting a stored procedure in Snowflake.  It is a JavaScript stored procedure, of course, and uses the cloning technique to isolate processes from access interruption for downstream users and systems.  Other types of stored procedures may need more control flow logic in the JavaScript.

CREATE OR REPLACE PROCEDURE {Schema}.{Pipeline Process Name}()

  RETURNS VARCHAR
  LANGUAGE javascript
  EXECUTE AS CALLER
  AS
  $$
     sqlStatements = 
    `
----------------------------------------------- 
-- Process: {Place Process Name Here}   
-- Typical Run: {X} minutes on {Y} warehouse
-- Typical Output: {Z} Records
----------------------------------------------- 
CREATE OR REPLACE TABLE {Staging Schema}.{Table Name} 
AS
SELECT * FROM {Staging Schema}.{Process Name};
GRANT OWNERSHIP ON {Staging Schema}.{Table Name} TO ROLE {Security Role};
CREATE OR REPLACE TABLE {Publishing Schema}.{Table Name} CLONE {Staging Schema}.{Table Name};
GRANT OWNERSHIP ON TABLE {Publishing Schema}.{Table Name} TO ROLE {Security Role};

----------------------------------------------- 
-- Process: {Place Process Name Here}   
-- Typical Run: {X} minutes on {Y} warehouse
-- Typical Output: {Z} Records
----------------------------------------------- 
CREATE OR REPLACE TABLE {Staging Schema}.{Table Name} 
AS
SELECT * FROM {Staging Schema}.{Process Name};
GRANT OWNERSHIP ON {Staging Schema}.{Table Name} TO ROLE {Security Role};
CREATE OR REPLACE TABLE {Publishing Schema}.{Table Name} CLONE {Staging Schema}.{Table Name};
GRANT OWNERSHIP ON TABLE {Publishing Schema}.{Table Name} TO ROLE {Security Role};
`;
    
        statementList = sqlStatements.split(";") ;
        index = 0;
        
      try {
        for (index in statementList) {
            // don't execute empty strings.
            if (statementList[index].trim()) {
                rs = snowflake.execute({sqlText: statementList[index]});
                }
            }          
        return "Succeeded: " + statementList.length + " Statements Executed.";   // Return a success/error indicator.
        }
    catch (err)  {
        throw "Failed: " + err + ".  SQL Executed(" + statementList[index] +  ")";   // Return a success/error indicator.
        }
  $$;
GRANT OWNERSHIP ON PROCEDURE {Staging Schema}.{Pipeline Process Name}() TO ROLE {Security Role}; 


  /*

  CALL {Staging Schema}.{Pipeline Process Name}();

  */
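
One caveat of this pattern, as a design choice: because the script is split on every semicolon, the embedded SQL must not contain semicolons inside string literals or comments, or those statements will be broken apart and fail.  Keeping the pipeline steps to simple CREATE ... AS SELECT, CLONE, and GRANT statements, as above, avoids the issue.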



How To: Generating Simple Timeline Records in Snowflake

When you need to segment (a.k.a. fan out) your events into a 10-second grain (or whatever grain you need), you will need to generate new records representing that grain across time.   This function enables you to do just that and join against your event table.   In the join it is best to use a form of overlapping-window join; see my recent post for what that logic looks like.

CREATE OR REPLACE FUNCTION GENERATE_TIMELINE(ARG_START_TIME STRING, ARG_END_TIME STRING, ARG_GRAIN INTEGER)
  RETURNS TABLE ( TIME TIMESTAMP_NTZ, TIME_START TIMESTAMP_NTZ, TIME_END TIMESTAMP_NTZ )
  AS 
 $$
            SELECT * 
              FROM (
                  SELECT DATEADD(SECOND, ((ROW_NUMBER() OVER(ORDER BY 1)) - 1) * ARG_GRAIN, ARG_START_TIME::TIMESTAMP_NTZ) AS TIME
                         , TIME AS TIME_START
                         , DATEADD(SECOND, ARG_GRAIN-1, TIME) AS TIME_END
                     FROM TABLE(generator(rowcount => 1000000))
                    )
              WHERE TIME BETWEEN ARG_START_TIME AND ARG_END_TIME  
 $$;
 

SELECT *
  FROM TABLE(GENERATE_TIMELINE('2021-06-01 05:55:03','2021-06-01 07:05:15',10));
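
Each generated row carries a TIME_START and a TIME_END that is ARG_GRAIN - 1 seconds later, so in the example above the first row covers 05:55:03 through 05:55:12, and consecutive rows tile the timeline with no gaps or overlaps.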

How To: Generate Dates from a CHRON Pattern in Snowflake

I created this logic to generate dates based on a cron pattern.  It is useful when you have specific date patterns from which you want to pull events for continual processing in a pipeline.

CREATE OR REPLACE FUNCTION GENERATE_CHRON_PATTERNED_TIMESTAMPS (CHRON_PATTERN  STRING, DURATION NUMBER)
 RETURNS TABLE (  CHRON_START_TIMESTAMP TIMESTAMP
                , CHRON_END_TIMESTAMP TIMESTAMP
                , CHRON_PATTERN STRING
                , MIN_PATTERN STRING
                , HOUR_PATTERN STRING
                , DAY_PATTERN STRING
                , MON_PATTERN STRING
                , WDAY_PATTERN STRING
                , YEAR_PATTERN STRING
                , MIN STRING
                , HOUR STRING
                , DAY STRING
                , MON STRING
                , WDAY STRING
                , YEAR STRING
               )
  AS
$$
WITH PARSE_CHRON_PATTERN AS (SELECT CHRON_PATTERN as chron_pattern
                       , '[0-9*-\.\/]+' as parse_rule
                       ,  REGEXP_SUBSTR(chron_pattern,parse_rule,1,1) min_pattern
                       ,  REGEXP_SUBSTR(chron_pattern,parse_rule,1,2) hour_pattern
                       ,  REGEXP_SUBSTR(chron_pattern,parse_rule,1,3) day_pattern
                       ,  REGEXP_SUBSTR(chron_pattern,parse_rule,1,4) mon_pattern
                       ,  REGEXP_SUBSTR(chron_pattern,parse_rule,1,5) wday_pattern
                       ,  COALESCE(REGEXP_SUBSTR(chron_pattern,parse_rule,1,6), '*') year_pattern

                 )
,
EXPLODE_LIST AS (
            SELECT P.*
                   , MIN_LIST.VALUE AS MIN
                   , HOUR_LIST.VALUE AS HOUR
                   , DAY_LIST.VALUE AS DAY
                   , MON_LIST.VALUE AS MON
                   , WDAY_LIST.VALUE AS WDAY
                   , YEAR_LIST.VALUE AS YEAR
                    
              FROM PARSE_CHRON_PATTERN P,
              lateral split_to_table(P.min_pattern, ',') MIN_LIST,
              lateral split_to_table(P.hour_pattern, ',') HOUR_LIST,
              lateral split_to_table(P.day_pattern, ',') DAY_LIST,
              lateral split_to_table(P.mon_pattern, ',') MON_LIST,
              lateral split_to_table(P.wday_pattern, ',') WDAY_LIST,
              lateral split_to_table(P.year_pattern, ',') YEAR_LIST
)
, WILD_CARD_RANGE AS (
                         SELECT CHRON_PATTERN
                              , PARSE_RULE
                              , MIN_PATTERN
                              , HOUR_PATTERN
                              , DAY_PATTERN
                              , MON_PATTERN
                              , WDAY_PATTERN
                              , YEAR_PATTERN
                              , IFF(MIN = '*', '0-59', MIN) AS MIN
                              , IFF(HOUR = '*', '0-23', HOUR) AS HOUR
                              , IFF(DAY = '*', '1-31', DAY) AS DAY
                              , IFF(MON = '*', '1-12', MON) AS MON
                              , IFF(WDAY = '*', '0-6', WDAY) AS WDAY
                              , IFF(YEAR = '*' or YEAR IS NULL, '2019-' || DATE_PART(YEAR, CURRENT_DATE())::STRING, YEAR) AS YEAR
                          FROM EXPLODE_LIST
 )
, EXPLODE_RANGE AS (SELECT  E.CHRON_PATTERN
                          --, E.PARSE_RULE
                          , E.MIN_PATTERN
                          , E.HOUR_PATTERN
                          , E.DAY_PATTERN
                          , E.MON_PATTERN
                          , E.WDAY_PATTERN
                          , E.YEAR_PATTERN
                          , COALESCE(MIN_LIST.NUMBER::STRING, E.MIN) AS MIN
                          , COALESCE(HOUR_LIST.NUMBER::STRING, E.HOUR) AS HOUR
                          , COALESCE(DAY_LIST.NUMBER::STRING, E.DAY) AS DAY
                          , COALESCE(MON_LIST.NUMBER::STRING, E.MON) AS MON
                          , COALESCE(WDAY_LIST.NUMBER::STRING, E.WDAY) AS WDAY
                          , COALESCE(YEAR_LIST.NUMBER::STRING, E.YEAR) AS YEAR
                      FROM WILD_CARD_RANGE E
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER MIN_LIST
                        ON MIN_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.MIN,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.MIN,'-',2))
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER HOUR_LIST
                        ON HOUR_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.HOUR,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.HOUR,'-',2))
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER  DAY_LIST
                        ON DAY_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.DAY,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.DAY,'-',2))
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER MON_LIST
                        ON MON_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.MON,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.MON,'-',2))
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER WDAY_LIST
                        ON WDAY_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.WDAY,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.WDAY,'-',2))
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER YEAR_LIST
                        ON YEAR_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.YEAR,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.YEAR,'-',2))
                     
                    )
 
 SELECT   try_to_timestamp(YEAR || '-' || MON || '-' || DAY || ' ' || HOUR || ':' || MIN || ':00', 'yyyy-mm-dd hh24:mi:ss') AS CHRON_START_TIMESTAMP
        , DATEADD(SECONDS, DURATION, CHRON_START_TIMESTAMP) AS CHRON_END_TIMESTAMP
        , *
   FROM EXPLODE_RANGE
 HAVING CHRON_START_TIMESTAMP IS NOT NULL
    AND DATE_PART(WEEKDAY, CHRON_START_TIMESTAMP) = WDAY
    AND CHRON_START_TIMESTAMP <= CURRENT_TIMESTAMP()
  ORDER BY CHRON_START_TIMESTAMP DESC               
 $$
 ;
 
 
SELECT * FROM TABLE (GENERATE_CHRON_PATTERNED_TIMESTAMPS('0 17,19-23 1,15 * 1,2,4-6', 3600));
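
Note that the function joins to MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER, which isn't defined in this post.  I'm assuming it is just a numbers table with a single NUMBER column covering 0 through at least the current year; if you need to recreate it, a sketch like the following should do (adjust the schema name and range to your environment):

CREATE TABLE IF NOT EXISTS MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER AS
SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1 AS NUMBER   -- 0 through 2999
  FROM TABLE(GENERATOR(ROWCOUNT => 3000));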


How To: Detect Daylight Savings Time in Snowflake

 I created this function to easily determine if a date is in Daylight Savings Time or Standard Time.    


CREATE OR REPLACE FUNCTION IS_DAYLIGHT_SAVINGS(DATE TIMESTAMP)
  RETURNS VARCHAR
  LANGUAGE JAVASCRIPT
AS
$$
// US rules: Daylight Saving Time starts on the second Sunday in March at 2 AM
var marchFirstDayType = new Date(DATE.getFullYear(), 2, 1).getDay();
var dstStartDay = ((7 - marchFirstDayType) % 7) + 1 + 7;   // day of month of the second Sunday in March
var dstStartDate = new Date(DATE.getFullYear(), 2, dstStartDay, 2);
// US rules: Daylight Saving Time ends on the first Sunday in November at 2 AM
var novemberFirstDayType = new Date(DATE.getFullYear(), 10, 1).getDay();
var dstEndDay = ((7 - novemberFirstDayType) % 7) + 1;      // day of month of the first Sunday in November
var dstEndDate = new Date(DATE.getFullYear(), 10, dstEndDay, 2);
// DST test: on or after the March transition and before the November transition
if (DATE >= dstStartDate && DATE < dstEndDate) {
    return 1;
}
else {
    return 0;
}
$$;

SELECT IS_DAYLIGHT_SAVINGS('2021-11-07 02:00') AS NOV, 
       IS_DAYLIGHT_SAVINGS('2021-03-14 02:00') AS MARCH  ;
SELECT IS_DAYLIGHT_SAVINGS('2021-11-07 01:59') AS NOV, 
       IS_DAYLIGHT_SAVINGS('2021-03-14 01:59') AS MARCH  ;
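
With the US transition dates for 2021 (DST began March 14 at 2 AM and ended November 7 at 2 AM), the first query should return 0 for NOV and 1 for MARCH, and the second should return 1 for NOV and 0 for MARCH.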

How To: Simple String Similarity Function in Snowflake

I created this function to help perform simple title-matching logic for TV show metadata sourced from two different data providers.    I found it satisfactory for my needs at the time.  Obviously, it may need a human in the loop to deal with titles that are so different that this process doesn't work.


CREATE OR REPLACE FUNCTION STRING_SIMILARITY (STR1 string, STR2 string, SUB_STRING_LENGTH variant, CASE_SENSITIVE boolean) 

  RETURNS VARCHAR
  LANGUAGE JAVASCRIPT
AS
$$
// Compare character n-grams (bigrams when SUB_STRING_LENGTH = 2) and return a Dice-style similarity score.
if (CASE_SENSITIVE == false) {
    STR1 = STR1.toLowerCase();
    STR2 = STR2.toLowerCase();
}
if (STR1.length < SUB_STRING_LENGTH || STR2.length < SUB_STRING_LENGTH)
    return 0;
// Count each n-gram in the first string.
const map = new Map();
for (let i = 0; i < STR1.length - (SUB_STRING_LENGTH - 1); i++) {
    const substr1 = STR1.substr(i, SUB_STRING_LENGTH);
    map.set(substr1, map.has(substr1) ? map.get(substr1) + 1 : 1);
}
// Count n-grams of the second string that are still available in the first.
let match = 0;
for (let j = 0; j < STR2.length - (SUB_STRING_LENGTH - 1); j++) {
    const substr2 = STR2.substr(j, SUB_STRING_LENGTH);
    const count = map.has(substr2) ? map.get(substr2) : 0;
    if (count > 0) {
        map.set(substr2, count - 1);
        match++;
    }
}
// Dice-style coefficient: 2 * matches / total n-grams across both strings.
return (match * 2) / (STR1.length + STR2.length - ((SUB_STRING_LENGTH - 1) * 2));
$$
;
SELECT STRING_SIMILARITY ('Startrek', 'Star Trek', 2, FALSE) ;
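
For reference, the example above lowercases both strings, compares character bigrams, and returns 0.8: six bigrams match, and the score is (6 * 2) / (8 + 9 - 2).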


Tuesday, February 01, 2022

Baseline Conceptual A.R.T.I.S.T. API Structure

Over the years working in the media world, and having architected an AI system centered on content recognition and classification, I think I have come to some clarity on how I might go about building a more human-accessible and maintainable API.   This is currently an exploratory effort.

An AI system can be complex, with terms like HITL, labeling, training, inference, classification, NLP, OCR, machine learning, neural networks, and metadata mastering.   It's all a bit intimidating to someone new.   So I decided to borrow from the well-established art community to mitigate that issue a bit, since most people are familiar with its terms: Artists, Curators, Conservators, Critics, and Art Galleries, to name a few key concepts.   The intent is to organize the API in a way that makes it easier to navigate and understand.

The API design below is a high-level entry-point design and doesn't cover the objects and functions under them (yet):

Entry Points:

  • Gallery - The Gallery is where people go to see the art.
    • This API entry point is where the UI will search and access all data, video, and images.
  • Critic - The Critic evaluates a piece of art based on various criteria.
    • This API entry point is where the UI can enable the audience to rate a video or manage AI scoring of the video.   There can be many types of ratings and scores: social rating, content rating, and diversity scoring, to name a few.
  • Conservator - The Conservator provides the care, reconstruction/preservation, and safekeeping of a piece of art.
    • This API entry point is where labeling, training, inference, and classification are performed by people or AI to "reconstruct" the information that was lost or made inaccessible by the originator of the video (which is always the case).  The resulting inferred information enables the video content to be indexed and easily searched, accessed, and critiqued.
  • Curator - The Curator identifies, sources, and collects pieces of art for an art gallery or private collector.
    • This API entry point is where the data mastering, sourcing, and capturing of a video is performed.


Each entry point has its own objects and functions relating to its area.   They will contain the necessary C.R.U.D. functionality along with the appropriate report access, process execution, and state management.  Please refer to the data model in the previous post to get an idea of what those objects and functions might be.

API Layout

This is a first take on laying out a clean organization of the lower levels of the API before the actual functions.  It's not complete, and I will let your imagination expand on it.

Additionally, rather than cluttering up the diagram below with a bunch of C.R.U.D., I am assuming it exists where appropriate.   I will modify this as time goes on and my thoughts become clearer.