Monday, February 21, 2022

Technology: ARTIST MDM Technology Selection

 

This is Part II of the previous post.

This is an exploration of how I might implement a master data management and release service. There are many ways, technologies, and platforms that could implement this. I chose AWS as the cloud service provider and Databricks for server orchestration, rather than going deep into EKS or Kubernetes, for reasons of team efficiency. Refactoring from Databricks to EKS or Kubernetes in production can be taken on at a future time, when the budget and size of the team allow.

Ultimate Technical Goal

  • Support 20 Billion Episodes & Movies
  • Support 100 concurrent curators
  • Support Localization of text
  • Support Change Control 
  • Support Data Ticket tracking
  • Support HITL (Human in the Loop)
  • Data Structure to be flexible and expandable

Approach

  • Use standard services so it's easy to find intermediate-level developers
  • Keep it simple & maintainable

Teams

  • Data Engineering - Pipeline, Warehouse, Containers
  • Video Prep - Video Capture, Pre-Processing, Video Security and Storage
  • Data Science - ML & NLP packages
  • UI/UX Dev - ARTIST Website, HITL, Data Ticketing, and Data Collection UI

Preliminary Technology Choice 

  • Use S3 buckets
  • Organize data using S3 folders and naming conventions
  • Store raw pre-ingested data as JSON documents
  • Use Aurora transactional databases
    • MDM Editor Database
    • Data Ticketing and Tracking Database
  • Databricks
    • Server orchestration
    • Job Management
    • Machine Learning
    • Data Pulls from 3rd Party APIs
  • ECS/ECR/ELB/Fargate for Data Collection (MDM UI) Website
  • Use Lambda for internal API
  • Use JSON to store the schema, data entry rules, UI presentation, and editing layouts (see the sketch after this list)
  • AWS OpenSearch 
    • Connector to Aurora for indexing searches
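
To make those JSON-driven layouts concrete, here is a minimal sketch of how the MDM Editor Database could hold both the curated records and the JSON documents that drive the schema, data entry rules, and UI layouts. This assumes the PostgreSQL-compatible flavor of Aurora, and the table and column names are hypothetical, not the actual ARTIST schema.

-- Hypothetical sketch: entity definitions plus JSON-driven editing metadata.
CREATE TABLE entity_type (
    entity_type_id   SERIAL PRIMARY KEY,
    name             VARCHAR(100) NOT NULL,   -- e.g. 'series', 'episode', 'talent'
    schema_json      JSONB NOT NULL,          -- field definitions and data types
    entry_rules_json JSONB NOT NULL,          -- data entry and validation rules
    ui_layout_json   JSONB NOT NULL           -- UI presentation and editing layout
);

CREATE TABLE entity_record (
    entity_record_id BIGSERIAL PRIMARY KEY,
    entity_type_id   INT NOT NULL REFERENCES entity_type (entity_type_id),
    payload_json     JSONB NOT NULL,          -- the curated master data itself
    updated_by       VARCHAR(100) NOT NULL,   -- curator who made the change
    updated_at       TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Keeping the schema, rules, and layouts as JSON columns keeps the data structure flexible and expandable, which lines up with the goals above.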

Topological Diagram



Please refer to my previous post on MDM design for details on the data and process models.

Friday, February 18, 2022

Lessons Learned: An eight-year venture in creating data products

Preface

I worked for a data product & AI company for eight years. The company made many attempts to define a series of data products that could be sold to the media industry. We were successful in many ways on a technical level, but in the end we failed as a business. Many things contributed to that failure. This posting is a memorial to all the hard work we did and a way to mourn the loss we all suffered when we closed.

My hat goes off to our CEO and COO for their dedication and commitment to the company. To our scientists, I sit in awe at what they accomplished in creating machine learning packages that detected so many things we as humans take for granted. To our data engineers who tirelessly built and integrated all the subsystems of our service. To our full stack developer who made all our data visible and beautiful. To our account success managers who walked and talked with our customers through our data products. To our public relations personnel who converted our complex jargon into words that our customers could understand and made YouTube video presentations about the media industry. To our sales people who opened doors, closed deals, and kept the wheels moving for our company. To our VCs who had faith in us and the technology and financed our way. To all of you, thank you! You rock and I will remember you all.

The Customer

We had some strong ties to some studios and networks through our CEO and COO. It was obvious that we would capitalize on these relationships to get in doors that would otherwise be impenetrable for most startups. We were very successful at getting conversations going and starting pilot projects with Disney, Paramount, Universal, HBO, Warner Brothers, Sinclair, Weather Channel, and others. It looked very promising and we were all very excited. And because of that success, we got support from venture capitalists to help us develop our platform and data products.

It turns out that these media studios and networks were all very curious. They had a keen interest in collecting actionable metrics on their movies, series, or news broadcasts.

Metrics Like the Following:
  • Predict Video Ratings before Release
  • Detect Talent Screen Time
  • Calculate Talent Diversity Scores
  • Detect Topics Presented
  • Detect News Coverage
  • Find Props & Set Pieces
  • Collect the Movie Credits on Videos in their Vault
  • Suggest the Appropriate amount of On-Site Coverage required for News (On-Site is expensive)
  • Suggest the Appropriate amount of Spanglish for a Mixed-Language Audience
  • Suggest the Appropriate amount of News Coverage for a Topic
  • Measure the Retention power of each 10-second Segment of a Video
  • Measure the Retention power of each episode of a series
  • Determine what Type of Content had Statistically Significant Retention for the Audience
The list could go on and on. So many questions, but which ones would provide meaningful value for our customers and a sustainable future for our company?

Customers Needed:
  • Help increasing sales or lowering their costs
  • A data product that fits their budget
  • A data product that was simple to understand  
  • A high quality data product 
  • A durable data product that would be available for years to come
  • A data product that could easily integrate & have low friction with their daily business process 
  • A data product with 24/7 technical support
We Needed:
  • High sales in a large enough market space to support the company long term
  • A data product that had a low cost to manufacture 
  • A data product that had a high enough profit margin to survive if not thrive
  • High subscription renewals
  • Low sales turnaround
  • A data product we exclusively owned
Once we understood our customer's specific needs and gathered the necessary requirements, we would create a pilot with each customer, run a short list of videos through processing, and provide them the results to review. This is where it got "interesting". On the one hand, all our customers were surprised it could be done. They were satisfied with the results and gave high praise to our work both inside and outside their company. But on the other hand, after the pilot was over, the customers would typically delay signing any contracts for months. This delay resulted in one of two outcomes. One outcome was that they would smile, thank us for the work, and state that they didn't have the budget. The second outcome was that we would find out that our main "point-of-contact" at the studio had moved on to another position and was no longer able to assist us in moving forward with a contract. From a sales perspective this basically meant we had to start the sales process all over again.

What We Built

First of all, we had to analyze the video and audience behavior and report back to the customer a meaningful and actionable result. At the same time, we needed a platform that enabled us to create reproducible results in a systematic, predictable, and high-volume way. So we built and productionalized an AI service that could process video and combine it with second-by-second audience behavior to determine what resonated with the audience. A visualization UI was built to allow our customers to see the second-by-second timeline of all the detections in sync with the video playback. And we summarized a series of scores that enabled our customers to see, at a glance, the performance of the video for talent and content. It was a dream and now it was a reality. We created a fully productionalized service that processed our customers' videos every day and reported the final results back to them.

Our detectors could do the following:
  • Talent
    • Time on Screen 
    • Time Speaking
  • Style
    • Pace
    • Complexity in Vernacular and Length of Sentences
    • Language Classification of use of Personal Viewpoint (Me, I, Feel, Believe, etc...)
  • Emotions
    • Sad, Angry, Fear, Positivity, Surprise
  • Talent's Setting
    • News
      • Local
      • On Site
      • National
    • Scripted
      • Outside Country
      • Outside City
      • Indoors
  • Shot 
  • Screen Motion
  • Topic Classification 
  • News Coverage
    • Crime, Weather, Politics, Lifestyle, Social Justice, Sports, Traffic, Tech/Science, Economy/Business, Education, Public Health
  • Scripted Storyline
    • Custom per series or genre 

The Ultimate Tragedy 

I don't think I could point to just one thing that was the source of our ultimate demise. We were technically adept at the hard-problem challenge. We were well networked with the media industry and we were financed well enough to execute. At our height we had three data scientists, three data engineers, one full stack (UI) developer, one user experience (UX) specialist, one product manager (PM), two sales representatives, one account manager, one public relations specialist, an excellent COO, and an awesome and seasoned CEO. Along with a couple of temporary advisors, we ran pretty cheap in comparison to similar startups. So what happened?

If one is to do a little bit of "Monday Morning Quarterbacking" here, I would summarize it this way:

ONE: Stock Market Instability
We had VCs willing to continue funding our efforts and had $1.3 million promised to fund us through to the end of 2022. But the stock market instability made it difficult for the VCs to make the funds available. Either their funds dried up or access to them was too distant in the future.

TWO: Too early to the video content analysis game using AI
We couldn't get enough supporting foundational customers who were willing to subscribe to our service long term at a supportive rate. The traditional media world was in the curious stage and wasn't ready for a long-term commitment. We got a lot of feedback about what questions they wanted answered, but nothing that seemed to keep a customer coming back for more on a daily or weekly basis on a budget that was acceptable. It seemed that once the questions were answered, their curiosity was over and they moved on to something else.

THREE: Couldn't solidify a strong enough use case for the data products
Neither we nor our customers could figure out what data products would help them in their daily workflow. Their curiosity was not enough. It turned out that once a movie was in a form in which our service could process the video, it was too late to have any meaningful contribution to most of the steps of film creation (The Seven Steps to Movie Making). Maybe we could have used our data products to suggest to a studio the renditions for other countries, but then again there is a firmly established process there. Or maybe we could have helped a studio's library department perform analysis on the inventory of their existing films, but this was a very finite venture with an even more finite budget.

The facts are that the traditional media world is 100 years old, strongly networked, and heavily committed to previous contracts; their processes are firmly established and their budgets are tight. A lot of their money is spent on actors, directors, marketing, FX, sets, etc. To squeeze the cost of an "unproven" data product into a firmly established process and budget is a very hard nut to crack. It's like getting your first credit card: you have to prove you are reliable and will make payments before anyone will issue you one. A catch-22 scenario.

FOUR: Competition Grew Fast
AI competition grew in a very short period of time. We dedicated our whole company to this effort from 2018 until we closed our doors in 2022. In that period of time, a lot of competing companies and money consolidated and networked faster than we could.

FIVE: Cloud Providers were setting the cost expectations to customers
In the beginning, our service was priced at $30 per minute of video. At that time there wasn't much competition out there. Available identity detection services tended to be hit and miss. We, on the other hand, used a mixed approach of person tracking, face detection, and identity detection to create a much more accurate representation of a talent on screen, even when the face was obscured. Our customers compared us with AWS, Azure, and Google, and our accuracy beat the competition hands down. But these cloud providers make money through the usage of compute and storage services, which means that the cost of all the other supporting services they provide is offset. Consequently, AWS's service for identity detection is extremely cheap at $0.15 per minute of video (for a 90-minute film, that is roughly $2,700 at our rate versus $13.50 at theirs). That sets customers' expectations of cost per minute of processed video at under $1 a minute, and much closer to $0.15. Ironically, we were using AWS compute and storage services ourselves. Although our direct customers understood that our costs included compute and storage, it was still hard for them to evangelize the large price differential within their own organization.

SIX: Legal Access to Video was Difficult
There was a bit of friction in gaining access to videos, and this friction slowed our delivery down considerably, by a month or two. If we asked the customer to provide videos, we had to step through a lot of legal hoops and lawyer fees to make this happen. Then, once we had that nailed down, setting up the method of transport was always different per customer due to security requirements. Sometimes the customer didn't want to go through their own red tape to bother. So we had to approach it several different ways, which added to our complexity. We finally settled on capturing video from OTA (Over the Air), RTMP, HLS, and S3 buckets. Each one came with its own technical issues and video recording quality problems, but at least we could move forward with a pilot.

SEVEN: Customers Want to Build it Themselves
Time and time again, it turned out that companies just wanted to build it themselves. They would turn to our company to pilot an idea to see if it could be done. Once they were satisfied, they would proceed to do it themselves. The tools and technology available now afford them the ability to build their own solutions and outsource the work to where wages can be 1/4-1/3 the cost of someone in the USA. They gain full control of the intellectual property and they can build it the way they want, using the tools and platforms they feel have a long-term future for them.
    

Just Buy Our Company

You would think that several companies would be knocking down our door offering to buy the company just for its efficient and effective data science and data engineering team, let alone the platform we pioneered. But no, that wasn't going to happen. Not to say there wasn't a desire from our partners and customers to do so. It gets complicated when a company buys another company. You have to get alignment with all stakeholders on both sides, and they all must have the willpower to overcome any obstacles. It only takes one stakeholder to significantly slow or spoil the effort. And one did.


Twenty / Twenty Hindsight

Now that I have time to reflect upon all the years we put into building a platform and data products, I think I have a better grasp of where the money is for these kinds of data products. If you study The Seven Steps to Movie Making, you will find that the budget is pretty tight across the steps. And through experience, I witnessed that it was hard for a customer to find the desire and justification to adjust the budget for any of their steps to accommodate our data products. But there was one area we didn't look into nor have access to: the Distribution step.

Distribution is where the industry makes its money. The film must be distributed for the producers to make their money back. They make lucrative distribution deals amongst the cinemas and streaming services such as Amazon Prime, Netflix, HBO, etc. Another factor is that these deals extend a film's reach enough to rake in the money needed to ensure the return on investment.

There are 14 billion episodes and 100 million movies internationally (I suspect those numbers include renditions). How in the world does anyone make a decision on what licenses to buy at that volume? You'd have to classify all of them in a consistent and reliable way. Humans are OK at classification, but scaling up to hundreds if not tens of thousands of paid people to classify the videos in a consistent manner is not possible. The only real solution is to apply a machine learning approach that can classify the existing catalog and keep up with the steadily increasing volume of new episodes and movies coming out each month internationally.

After understanding the distribution space, I have personally come to the conclusion that we were chasing the wrong customers. We needed to look into partnering with, or becoming, a content licensing exchange and research service that brokers license deals for distribution. We could have used the AI's analytical summary of a video to help prospective license buyers make the best buying choices, providing them a searchable interface with filters to find the right videos to examine more closely.

Possible filters:
  • Actual Social Rating Score
  • Predicted Rating Score
  • Standardized set of Genres
  • A Specific Talent's Screen Time
  • Various Diversity scores of the film
  • Pace of the Film
  • Vernacular Complexity of the Film
  • Topic Coverage  
  • Percentage of Emotions Covered
  • Localized and Internationalized Content Rating 
  • Percentage of Each Language Spoken
But there is still one really big problem. How does one source and temporarily capture 14 billion videos and process them to create the classifications for potential buyers to filter on? And what about the costs of doing so? I have no answer, not a legal one anyway. OK, I lie; there is one way: create a platform that enables anyone to create their own custom streaming service, complete with custom curated licensed content from anywhere around the world, with a business model that is equitable to the content owners, the custom streaming service creators, and the audience. I can only imagine the amount of funding required to accomplish this "Think Big" idea.

One Not So Little Idea Left on the Table

As our runway for launching a successful data product was coming to a close, we were in the midst of defining a data product that would assist short-form video content creators on YouTube and Facebook in building their audience. There is huge money to be made in short-form video. MrBeast, for example, grossed $54 million in 2021, making him the highest-earning YouTube content creator. This gives you an idea of how much money can be generated in short-form video, and it's growing. We determined that the first 10-30 seconds of a video are the most critical in attracting and retaining an audience. If we could analyze the first 30 seconds of a creator's video and compare it with the videos of their best competitors, we could provide them useful measurements and suggestions before they even publish their video.

So Long and Thanks for all the Fish

It was fun, it was exhilarating, it was fascinating, it was humbling, it was exhausting, it was terrifying, it was an experience of a lifetime. I would not give it up for anything. What a trip. I think I can speak for all of us at the company when I say we gained so much experience with the media industry and the challenges it faces now in the 21st century. We are forever humbled by the natural biological functions of our minds and how they can so easily detect and classify things in an actionable way with just a tiny energy budget. We totally respect anyone who can make a successful data product and keep it relevant in an ever-changing world in a sea of data products.

These are my points of advice on building data products, for anyone stumbling upon my post:
  • Understand ALL the major processes in a vertical industry
  • Determine the most painful process points
  • Become a part of the industry's process flow
  • Target the RIGHT customer
  • Have the goal to be THE Source of Truth for a prospective dataset
  • Keep it a simple build 
  • Make it a quick and easy sell
  • Keep the "Friction" low for customers to adopt into their daily lives
  • Provide data that enables customers to take action on a daily or weekly basis
  • Network and market like crazy and fast
  • Expect your first several data products to fail
  • Keep churning out data products until something sticks
  • You'll have tons of competition, so expect the big boys to move in on your territory
Never give up! Never surrender!  -- Galaxy Quest






Saturday, February 12, 2022

Baseline Model: ARTIST's Ingestion, Data Mastering, & Data Ticketing

Curating media information about talent, series, movies, and credits in a growing international media industry is challenging for the following reasons:

  • Yet to be published data
  • Sparsely populated data
  • Slowly changing data
  • Incorrect data
  • Duplicate data
  • Insufficient Synopsis or Bio summaries  
  • Localization Issues - Bad translations or lack of translation
  • Data "Easter Eggs" - Data that someone put in that is not appropriate for the customer
  • Missing Talent and Character Portrait Images
  • Missing Movie, Series, and Episode Posters
  • Image capturing, persistence, and refreshing - Talent portraits, for example
  • Image "Easter Eggs" - Images that someone put in that are not appropriate for the customer

Data Ingestion Process Model

The following covers ingestion, conforming, data shaping, title matching, the Easter egg hunt, and data quality measurement. Notice the different storage areas and the human-in-the-loop requirement for the curator to review, edit, and approve the data before it goes into production. Eventually this structure will allow extension to perform AI auto editing and release. My brain is just not quite there yet on that process.







Data Issue & Ticketing Process Model

The following is a general baseline process for managing data problems. The main point is separating the act of reporting an issue from the act of issuing a ticket. Many people can report an issue, but only one ticket should be created for that field for a release. State management and workflow are critical as well.



Data Issue & Ticketing Data Model

The following is a general baseline data model for tracking issues and tickets. The mapping back to the entity and field is a critical point. Tracking the original value in the ticket is important as well, and not just for posterity: we can use it in machine learning to detect issues before data even gets released to the public or reaches the curators' review process.
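
As a rough illustration of those points, here is a minimal SQL sketch (hypothetical table and column names, not the actual ARTIST model) showing the issue-to-ticket separation, the mapping back to the entity and field, and the captured original value. The unique constraint is what enforces one ticket per field per release, no matter how many people report the problem.

-- Illustrative sketch only; names are hypothetical.
CREATE TABLE data_issue (
    issue_id     BIGINT PRIMARY KEY,
    entity_name  VARCHAR(100) NOT NULL,    -- which entity the problem was reported on
    entity_id    BIGINT NOT NULL,
    field_name   VARCHAR(100) NOT NULL,    -- which field the problem was reported on
    reported_by  VARCHAR(100) NOT NULL,    -- many people can report the same problem
    reported_at  TIMESTAMP NOT NULL,
    description  VARCHAR(2000)
);

CREATE TABLE data_ticket (
    ticket_id      BIGINT PRIMARY KEY,
    entity_name    VARCHAR(100) NOT NULL,
    entity_id      BIGINT NOT NULL,
    field_name     VARCHAR(100) NOT NULL,
    release_id     BIGINT NOT NULL,
    original_value VARCHAR(2000),          -- value at the time the ticket was opened
    status         VARCHAR(30) NOT NULL,   -- e.g. OPEN, IN_REVIEW, RESOLVED
    -- one ticket per entity/field/release, regardless of how many issues were reported
    CONSTRAINT uq_ticket UNIQUE (entity_name, entity_id, field_name, release_id)
);

CREATE TABLE issue_ticket_map (
    issue_id  BIGINT NOT NULL REFERENCES data_issue (issue_id),
    ticket_id BIGINT NOT NULL REFERENCES data_ticket (ticket_id),
    PRIMARY KEY (issue_id, ticket_id)
);

The original_value column is also what would feed the machine learning idea above.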




Media Data Model

The following is a general baseline for the Media Data Model. Please see this post for further details.





What Is: Data Governance, SLAs, and Data Contracts

Don't let perfection be the enemy of the good. All companies struggle against entropy in process and data. It's a never-ending process. Each day in which we succeed in conforming a dataset, fill in more values in a sparsely populated dataset, or protect customers from bad data and "Data Vandalism" is a good day to be celebrated.

Data governance is a very intense discipline that spreads across the infrastructure of a company. It's a daunting task to implement it in a way that has traction and a meaningful "Time to Value" for the company's investment. The following diagram gives a summary of what Data Governance is all about in a nutshell. (I created this "Pamphlet" 5 years ago, but it is still mostly relevant. Sorry for the small print; if you download and save it, then open it up in an editor, it will be easier to read.)


When you are a data provider, Data Governance is a core part of your business, centered around guaranteeing the SLA for your customers. The diagram below gives a summary of what an SLA and a Data Contract are. (I created this "Pamphlet" 5 years ago too, but again it is still mostly relevant. Again, sorry for the small print; if you download and save it, then open it up in an editor, it will be easier to read.)




Tuesday, February 08, 2022

Baseline: ARTIST's Media Video AI Data Lake Structure

To create an architecture for the ARTIST Media Video AI system described in an earlier post, we need a data lake to store the videos, frames, audio, and important content metadata about the video, which can be used by machine learning, HITL labeling, and model training. We also need to store the machine learning results and aggregated reports.

Security is important for video, so it's advisable to keep the source videos separated from the machine learning results and reports.

The ARTIST Vault should have a structure like the one below, with the production S3 bucket having the appropriate security lockdown so that only the processes for video capturing, pre-processing, HITL labeling, training, and ML have access to this S3 bucket.



The ARTIST results from the machine learning and any aggregated reports should be stored in an S3 bucket structured like the one below. The production security can be looser, because you should have complete legal ownership of this data.


The ML output can be stored as JSON or Parquet files. I chose to present the data below as JSON to be more universal and human-accessible for the blog, but I suggest you use Parquet files.

Important note: All the ML outputs should have a flexible but conformed structure to make downstream ingestion and processing easier. The flexibility is isolated within the "output" element, which contains the ML result output whose layout differs per ML process. Here is an example and recommended baseline ML output structure:

[
   {  "process_service_provider":"Super Duper ML Scooper",
      "process_name": "Landscape Detector",
      "process_version": "5.1.98",
      "processed_timestamp": "2022-02-19 03:43:20",
      "video_path": "//{your domain}-{environment}-vault/.../video/video.mp4",
      "article_id":"16466982-4518-4551-6278-3d10a32612a1",
      "article_name":"To My Backyard and Back Again, a Gardener's Tale.",
      
      "events":[
               {
                  "start_millisecond": 39020,
                  "end_millisecond": 80930,
                  "event_id": "1",
                  
                  "output": [
                     {
                     {whatever tags or ids you need for down stream mappings}:"202",
                     {whatever human friendly values you want for easier future debugging}:"Blah blah",
                     {whatever detection name}:{{value}},
                     }, ... more outputs
                  ]
               }, ... more events
      ]
   }
]

Oh, and about the meta files you may have noticed: meta.json contains the video formatting details, system mapping keys, and provenance details, and the content_meta.json files contain all the metadata about what we already know about the video, provided either by the creator/producer of the video or via some service like IMDB, TVMaze, TVDB, TVTime, TiVo, Gracenote, etc.

Sunday, February 06, 2022

Baseline: ARTIST - Media Video AI High Level Architecture

This represents a high-level architecture for a Media Video AI system. I call it the {A}rt of {R}eporting, {T}raining, {I}nference, & {S}tate {T}racking, or A.R.T.I.S.T. for short.


  0. CI/CD: Jenkins continuous integration & delivery of Docker container environments.
  1. Capture & Prep: Captures Video and prepares it for processing by ML
  2. Model Training: Training of models to be used by Detectors.
  3. HITL & Data Collection DB:  Transactional DB for managing data entry and labeling. 
  4. HITL Team: Outsourced team to perform labeling and data entry.
  5. Job & State Mgmt: Job Management for scheduling and running ML tasks.
  6. Job & State Mgmt DB:  Transactional DB for managing processes and states.
  7. Detectors: Inference Engines for detecting content in video/audio/text
  8. Videos & ML Results S3 Buckets: Video, Frames, Audio, and ML Detection Storage.
  9. 3rd Party ML Services: Voice to Text, and other types of NLP or video detection service
  10. Audience Behavior & Ratings Warehouse:  Storage and large volume processing warehouse DB
  11. 3rd Party Watch Log Providers:  Watch Event log data providers & audience/critic panels
  12. Gallery Data Warehouse:  Final data warehousing of the Gallery for integration with other services
  13. Gallery DB Cache:  Gallery data distributed across the world and localized to its common language 
  14. Gallery UI:  Public UI for customers to view the media Gallery




Friday, February 04, 2022

How To: Segment High Volume Events into 15 Minute Ticks in Snowflake

Segmenting your events results in exploding your data. When you have billions of event records that need to be segmented into 15-minute ticks, this can end up being slow and expensive to process. The smaller the segment, the greater the explosion.

Word to the Wise:  Don't join a big event table to another big table, or in this case, a large segmentation dataset. You want the segmentation dataset to be as small as possible to help the SQL optimizer hold at least one side of the join in memory with no spill to disk.

Here is my solution that worked very well in Snowflake.   

Please note: it uses my simple GENERATE_TIMELINE function that I posted earlier, and the SEGMENT_TICK CTE dates never have to change.

The example below uses a 15-minute grain (900 seconds). Just change $GRAIN_SECONDS to the segment size you need. Your event table can be as large as you require and cover as many months as you need; the SEGMENT_TICK CTE remains a tiny dataset.

SET GRAIN_SECONDS = 900;

WITH SEGMENT_TICK AS (
    -- Get epoch seconds starts in 1970  
    SELECT DATE_PART(EPOCH_SECONDS, TIME_START) AS SECONDS 
      FROM TABLE(GENERATE_TIMELINE(
                      '1970-01-01'
                    , '1970-01-02 03:00:00'  
                    , $GRAIN_SECONDS)
                )
)
--------------------------------------------------------------------
-- Set up test cases
--------------------------------------------------------------------
, TEST_DATA AS (SELECT    COLUMN1 AS TEST_CASE
                        , COLUMN2 AS EVENT_START_TIMESTAMP_LOCAL
                        , COLUMN3 AS EVENT_END_TIMESTAMP_LOCAL
                        , DATE_PART(EPOCH_SECONDS, "EVENT_START_TIMESTAMP_LOCAL") % 86400 AS EVENT_START_SECONDS
                        , DATEDIFF(SECONDS, "EVENT_START_TIMESTAMP_LOCAL", "EVENT_END_TIMESTAMP_LOCAL") AS EVENT_DURATION
                  FROM (VALUES    (   'EVENT AFTER MIDNIGHT'                         
                                    , '2021-11-12 01:00:00'::TIMESTAMP_NTZ
                                    , '2021-11-12 02:00:00'::TIMESTAMP_NTZ)
                                , (   'EVENT BEFORE MIDNIGHT'                         
                                    , '2021-11-11 20:00:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 21:00:00'::TIMESTAMP_NTZ) 
                                , (   'EVENT STRADDLES MIDNIGHT'                       
                                    , '2021-11-11 23:00:00'::TIMESTAMP_NTZ
                                    , '2021-11-12 01:00:00'::TIMESTAMP_NTZ)
                                , (   'NOT ON GRAIN EVENT LATE'                       
                                    , '2021-11-11 13:05:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 14:05:00'::TIMESTAMP_NTZ)
                                , (   'NOT ON GRAIN EVENT EARLY'                      
                                    , '2021-11-11 16:55:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 17:55:00'::TIMESTAMP_NTZ)
                                , (   'NOT ON GRAIN EVENT EARLY & LATE'                      
                                    , '2021-11-11 16:55:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 18:05:00'::TIMESTAMP_NTZ)
                                , (   'CROSSING THE 3 AM BOUNDRY'                    
                                    , '2021-11-11 02:55:00'::TIMESTAMP_NTZ
                                    , '2021-11-11 03:55:00'::TIMESTAMP_NTZ)
                       )
                )
--------------------------------------------------------------------
-- Calc Event Start seconds and Event Duration
--------------------------------------------------------------------
, EVENT_GRAIN AS (SELECT  T.EVENT_START_TIMESTAMP_LOCAL
                        , T.EVENT_END_TIMESTAMP_LOCAL
                        , T.EVENT_START_SECONDS
                        , T.EVENT_DURATION
                       
                        -- Calc Start Time
                        , DATEADD(SECONDS, S.SECONDS, T.EVENT_START_TIMESTAMP_LOCAL::DATE) AS SEGMENT_TIME_START
                        -- Calc End Time
                        , DATEADD(SECONDS, $GRAIN_SECONDS, "SEGMENT_TIME_START") SEGMENT_TIME_END
                    FROM SEGMENT_TICK S
                    JOIN TEST_DATA T
                        -- Overlap Window logic
                        -- T.Start < S.End
                        -- S.Start < T.End
                     ON T.EVENT_START_SECONDS < S.SECONDS + $GRAIN_SECONDS              
                    AND S.SECONDS < T.EVENT_START_SECONDS + T.EVENT_DURATION   
                )    
-- Return Results
SELECT * 
  FROM EVENT_GRAIN;
 
 

Thursday, February 03, 2022

Baseline Model: Business Process Sales to Engineering Cycle BPMN

I typically use Business Process Modeling Notation (BPMN) in my technical documentation for a system to stay technology neutral where appropriate, but it is also very useful for thinking through business-level problems where only people or companies are involved. Though there is one problem: most business-level people cannot read a BPMN model easily, so you will need to translate it into English. The following is an example of a business process documented using BPMN, which I did a while back to fix a communication issue between teams.


Baseline Model: Document Management Conceptual Model

Years ago I created this conceptual-level data model to capture the essence of what kind of data definitions are required for general document management. It's a baseline.




What Is: Data Modeling

Basic Concepts of Data Modeling

Data modeling is an applied discipline for defining the languages (nouns and verbs) of the business.   The results of a modeling exercise are a series of data models and a data dictionary that are used to educate and bring awareness of the meaning, relationship, and value of the data across the business.   Data models are most commonly used to define the data requirements for a company’s transactional and data warehouse needs and are vital to data management and governance. 


Approaches: Top-Down, Bottom-Up, and POC

Top-Down: A process of extracting information from subject matter experts for a given business area and producing a model that documents the business data requirements that IT can use to build, modify, or buy services.   You can speed up this process by starting with reference models and modifying areas that deviate from the reference models.


Bottom-Up:  A process of reverse-engineering data structures into “As-Built” data models in an effort to take inventory of the current data ecosystem.  Gap analysis can be performed by determining the gap between the Top-Down and the Bottom-Up produced data models. 


Proof of Concept (POC):  A process where the data requirements are discovered by means of creating a prototype.


Data Model Perspective

There are many types of data model perspectives. The following are the most common.


Conceptual Data Model:  Describes the language of the business at a summary level which communicates the subject and scope to all stakeholders.   These models will describe entities and their relationships without going into the primary keys or attribute level of the model.


Logical Data Model:  Describes the language of the business in detail while being technologically neutral while following normalization rules.  The logical model is primarily used to communicate the data requirements to the business, development, and operational teams.


Physical Data Model:  Describes the actual schema design which is optimized and targeted for a specific technology (transactional database, analytical database, No SQL Database, JSON file, CSV File, etc…).   These models are used primarily to communicate the actual physical design to the development, test, and operational teams. 

  • As-Built Physical Data Model:  Describes the physical data model as reverse-engineered from a production database.
  • As-Documented Physical Data Model:  Describes the physical data model as documented.  These documents may deviate from what is actually out in production.


Reference Data Models

Data modeling is a relatively mature science and most business areas out in the wild have been logically mapped out over the years. As a result, there are several mature data models that are considered "Reference" models. Using these reference models as a baseline saves a lot of time and resources, which can be refocused on the unique needs of the business that differ from the "Reference" models. Len Silverston, Paul Agnew, and David Hay have published several model patterns and reference designs. I too have thrown my hat in the ring with my models here on this website.


Data Normalization

The first, second, and third rules of data normalization were defined by Edgar F. Codd. Codd and Raymond F. Boyce combined their efforts to define the Boyce-Codd Normal Form (BCNF) in 1974. Data normalization is a process that organizes information in a way that clearly defines all the relevant entities, their attributes, and the relationships between entities. Each entity should represent only one concept, such as a person, place, or thing, and must have a "key" that uniquely identifies each entity record. The key must contain only the necessary attributes that make each entity record unique and must be comprised of nothing but those attributes. Non-key attributes included in the entity's definition must be separated from the entity's unique key and must intrinsically belong to the entity. Any redundancy must be removed.
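
To make that concrete, here is a small, hypothetical example: a flat credits table that mixes talent, title, and role attributes, and a normalized version in which each entity has its own key and carries only its own attributes.

-- Denormalized: talent and title attributes repeat on every credit row.
CREATE TABLE credit_flat (
    talent_name  VARCHAR(200),
    talent_birth DATE,
    title_name   VARCHAR(200),
    title_year   INT,
    role_name    VARCHAR(200)
);

-- Normalized: one concept per table, each with its own unique key.
CREATE TABLE talent (
    talent_id    BIGINT PRIMARY KEY,
    talent_name  VARCHAR(200) NOT NULL,
    talent_birth DATE                      -- belongs intrinsically to talent
);

CREATE TABLE title (
    title_id   BIGINT PRIMARY KEY,
    title_name VARCHAR(200) NOT NULL,
    title_year INT
);

CREATE TABLE credit (
    talent_id BIGINT NOT NULL REFERENCES talent (talent_id),
    title_id  BIGINT NOT NULL REFERENCES title (title_id),
    role_name VARCHAR(200) NOT NULL,
    PRIMARY KEY (talent_id, title_id, role_name)
);

Non-key attributes such as talent_birth now depend only on their own entity's key, and the redundancy across credit rows is gone.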


Data Integrity & Performance

Data integrity refers to the assurance that the information being managed in a data system is consistent and accurate over the life of the data. It is critical in the design of a transactional database system to have data integrity. Normalization of your data, along with foreign key constraints, is a natural way to help design a transactional system that inherently enforces data integrity. But a balance must be struck to deal with the performance needs of the system. This typically requires optimizing the design by introducing derived and duplicate data elements for faster reporting and scaling needs. Optimization may also include turning off foreign key constraints, collapsing or partitioning tables, reshaping a table, or changing a many-to-many relationship into a one-to-many relationship. In other words, selective de-normalization of the design is required to tune a database to perform as required under stress. As these optimizations are introduced, data integrity problems may arise, so extra processes must be put into place to monitor and protect the data.
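
As a small, hypothetical illustration of that trade-off (reusing the talent and credit tables from the sketch above), a reporting table might carry a derived credit count so dashboards avoid the join and aggregation at query time; the cost is an extra process to keep the duplicated value consistent and monitored.

-- Selective de-normalization for reporting: a derived, duplicated value.
CREATE TABLE talent_report (
    talent_id    BIGINT PRIMARY KEY,
    talent_name  VARCHAR(200) NOT NULL,    -- duplicated from talent
    credit_count INT NOT NULL              -- derived from the credit table
);

-- The derived value must be refreshed (and monitored) by a separate process.
INSERT INTO talent_report (talent_id, talent_name, credit_count)
SELECT t.talent_id, t.talent_name, COUNT(*) AS credit_count
  FROM talent t
  JOIN credit c
    ON c.talent_id = t.talent_id
 GROUP BY t.talent_id, t.talent_name;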


Agile Data Modeling and the Long Term View

Agile development can be applied to data modeling by utilizing reference models and limiting the scope of the modeling effort to the storyboard. But it is also very critical to think of the big picture when it comes to enterprise-wide reporting and data management. Do not silo your data modeling effort too much. A conceptual or logical data model focused on the broad view is a great value to the whole team.


Data Modeling Tools

Modeling tools enable a data modeler to create graphical representations of the data models following very specific notation rules.  Some tools can do both data and process modeling using Business Process Modeling Notation (BPMN) and link the two together.  My favorite over the years is Visual Paradigm.   Cheap and just works. 

How To: Pattern for Pipeline Stored Procedure Created in Snowflake

I created a typical pattern for my stored procedures within the pipeline part of the system. Here is my "go-to" pattern for starting a stored procedure in Snowflake. This is a JavaScript stored procedure, of course, and it uses the cloning technique to isolate processes from access interruption for downstream users and systems. Other types of stored procedures may need more control-flow logic in JavaScript.

CREATE OR REPLACE PROCEDURE {Staging Schema}.{Pipeline Process Name}()

  RETURNS VARCHAR
  LANGUAGE javascript
  EXECUTE AS CALLER
  AS
  $$
     sqlStatements = 
    `
----------------------------------------------- 
-- Process: {Place Process Name Here}   
-- Typical Run: {X} minutes on {Y} warehouse
-- Typical Output: {Z} Records
----------------------------------------------- 
CREATE OR REPLACE TABLE {Staging Schema}.{Table Name} 
AS
SELECT * FROM {Staging Schema}.{Process Name};
GRANT OWNERSHIP ON {Staging Schema}.{Table Name} TO ROLE {Security Role};
CREATE OR REPLACE TABLE {Publishing Schema}.{Table Name} CLONE {Staging Schema}.{Table Name};
GRANT OWNERSHIP ON TABLE {Publishing Schema}.{Table Name} TO ROLE {Security Role};

----------------------------------------------- 
-- Process: {Place Process Name Here}   
-- Typical Run: {X} minutes on {Y} warehouse
-- Typical Output: {Z} Records
----------------------------------------------- 
CREATE OR REPLACE TABLE {Staging Schema}.{Table Name} 
AS
SELECT * FROM {Staging Schema}.{Process Name};
GRANT OWNERSHIP ON {Staging Schema}.{Table Name} TO ROLE {Security Role};
CREATE OR REPLACE TABLE {Publishing Schema}.{Table Name} CLONE {Staging Schema}.{Table Name};
GRANT OWNERSHIP ON TABLE {Publishing Schema}.{Table Name} TO ROLE {Security Role};
`;
    
        statementList = sqlStatements.split(";") ;
        index = 0;
        
      try {
        for (index in statementList) {
            // don't execute empty strings.
            if (statementList[index].trim()) {
                rs = snowflake.execute({sqlText: statementList[index]});
                }
            }          
        return "Succeeded: " + statementList.length + " Statements Executed.";   // Return a success/error indicator.
        }
    catch (err)  {
        throw "Failed: " + err + ".  SQL Executed(" + statementList[index] +  ")";   // Return a success/error indicator.
        }
  $$;
GRANT OWNERSHIP ON PROCEDURE {Staging Schema}.{Pipeline Process Name}() TO ROLE {Security Role}; 


  /*

  CALL {Staging Schema}.{Pipeline Process Name}();

  */



How To: Generating Simple Timeline Records in Snowflake

When you need to segment (a.k.a. fan out) your events into a 10-second grain (or whatever grain you need), you will need to generate new records representing that grain across time. This function enables you to do just that and join against your event table. In the join it is best to use a form of overlapping window join; please look at my recent post for what that logic looks like.

CREATE OR REPLACE FUNCTION GENERATE_TIMELINE(ARG_START_TIME STRING, ARG_END_TIME STRING, ARG_GRAIN INTEGER)
  RETURNS TABLE ( TIME TIMESTAMP_NTZ, TIME_START TIMESTAMP_NTZ, TIME_END TIMESTAMP_NTZ )
  AS 
 $$
            SELECT * 
              FROM (
                  SELECT DATEADD(SECOND, ((ROW_NUMBER() OVER(ORDER BY 1)) - 1) * ARG_GRAIN, ARG_START_TIME::TIMESTAMP_NTZ) AS TIME
                         , TIME AS TIME_START
                         , DATEADD(SECOND, ARG_GRAIN-1, TIME) AS TIME_END
                     FROM TABLE(generator(rowcount => 1000000))
                    )
              WHERE TIME BETWEEN ARG_START_TIME AND ARG_END_TIME  
 $$;
 

SELECT *
  FROM TABLE(GENERATE_TIMELINE('2021-06-01 05:55:03','2021-06-01 07:05:15',10))

How To: Generate Dates from a CHRON Pattern in Snowflake

I created this logic to generate dates based on a cron pattern (spelled CHRON in the code). It is useful when you have specific date patterns from which you want to pull events for continual processing in a pipeline. Note that the function joins to a small numbers table (MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER) to expand ranges such as 19-23 into individual values.

CREATE OR REPLACE FUNCTION GENERATE_CHRON_PATTERNED_TIMESTAMPS (CHRON_PATTERN  STRING, DURATION NUMBER)
 RETURNS TABLE (  CHRON_START_TIMESTAMP TIMESTAMP
                , CHRON_END_TIMESTAMP TIMESTAMP
                , CHRON_PATTERN STRING
                , MIN_PATTERN STRING
                , HOUR_PATTERN STRING
                , DAY_PATTERN STRING
                , MON_PATTERN STRING
                , WDAY_PATTERN STRING
                , YEAR_PATTERN STRING
                , MIN STRING
                , HOUR STRING
                , DAY STRING
                , MON STRING
                , WDAY STRING
                , YEAR STRING
               )
  AS
$$
WITH PARSE_CHRON_PATTERN AS (SELECT CHRON_PATTERN as chron_pattern
                       , '[0-9*-\.\/]+' as parse_rule
                       ,  REGEXP_SUBSTR(chron_pattern,parse_rule,1,1) min_pattern
                       ,  REGEXP_SUBSTR(chron_pattern,parse_rule,1,2) hour_pattern
                       ,  REGEXP_SUBSTR(chron_pattern,parse_rule,1,3) day_pattern
                       ,  REGEXP_SUBSTR(chron_pattern,parse_rule,1,4) mon_pattern
                       ,  REGEXP_SUBSTR(chron_pattern,parse_rule,1,5) wday_pattern
                       ,  COALESCE(REGEXP_SUBSTR(chron_pattern,parse_rule,1,6), '*') year_pattern

                 )
,
EXPLODE_LIST AS (
            SELECT P.*
                   , MIN_LIST.VALUE AS MIN
                   , HOUR_LIST.VALUE AS HOUR
                   , DAY_LIST.VALUE AS DAY
                   , MON_LIST.VALUE AS MON
                   , WDAY_LIST.VALUE AS WDAY
                   , YEAR_LIST.VALUE AS YEAR
                    
              FROM PARSE_CHRON_PATTERN P,
              lateral split_to_table(P.min_pattern, ',') MIN_LIST,
              lateral split_to_table(P.hour_pattern, ',') HOUR_LIST,
              lateral split_to_table(P.day_pattern, ',') DAY_LIST,
              lateral split_to_table(P.mon_pattern, ',') MON_LIST,
              lateral split_to_table(P.wday_pattern, ',') WDAY_LIST,
              lateral split_to_table(P.year_pattern, ',') YEAR_LIST
)
, WILD_CARD_RANGE AS (
                         SELECT CHRON_PATTERN
                              , PARSE_RULE
                              , MIN_PATTERN
                              , HOUR_PATTERN
                              , DAY_PATTERN
                              , MON_PATTERN
                              , WDAY_PATTERN
                              , YEAR_PATTERN
                              , IFF(MIN = '*', '0-59', MIN) AS MIN
                              , IFF(HOUR = '*', '0-23', HOUR) AS HOUR
                              , IFF(DAY = '*', '1-31', DAY) AS DAY
                              , IFF(MON = '*', '1-12', MON) AS MON
                              , IFF(WDAY = '*', '0-6', WDAY) AS WDAY
                              , IFF(YEAR = '*' or YEAR IS NULL, '2019-' || DATE_PART(YEAR, CURRENT_DATE())::STRING, YEAR) AS YEAR
                          FROM EXPLODE_LIST
 )
, EXPLODE_RANGE AS (SELECT  E.CHRON_PATTERN
                          --, E.PARSE_RULE
                          , E.MIN_PATTERN
                          , E.HOUR_PATTERN
                          , E.DAY_PATTERN
                          , E.MON_PATTERN
                          , E.WDAY_PATTERN
                          , E.YEAR_PATTERN
                          , COALESCE(MIN_LIST.NUMBER::STRING, E.MIN) AS MIN
                          , COALESCE(HOUR_LIST.NUMBER::STRING, E.HOUR) AS HOUR
                          , COALESCE(DAY_LIST.NUMBER::STRING, E.DAY) AS DAY
                          , COALESCE(MON_LIST.NUMBER::STRING, E.MON) AS MON
                          , COALESCE(WDAY_LIST.NUMBER::STRING, E.WDAY) AS WDAY
                          , COALESCE(YEAR_LIST.NUMBER::STRING, E.YEAR) AS YEAR
                      FROM WILD_CARD_RANGE E
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER MIN_LIST
                        ON MIN_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.MIN,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.MIN,'-',2))
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER HOUR_LIST
                        ON HOUR_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.HOUR,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.HOUR,'-',2))
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER  DAY_LIST
                        ON DAY_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.DAY,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.DAY,'-',2))
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER MON_LIST
                        ON MON_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.MON,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.MON,'-',2))
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER WDAY_LIST
                        ON WDAY_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.WDAY,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.WDAY,'-',2))
                      LEFT OUTER JOIN MEDIA_CONFIG.CHRON_PATTERN_SEQ_NUMBER YEAR_LIST
                        ON YEAR_LIST.NUMBER BETWEEN TRY_TO_NUMBER(SPLIT_PART(E.YEAR,'-',1)) AND TRY_TO_NUMBER(SPLIT_PART(E.YEAR,'-',2))
                     
                    )
 
 SELECT   try_to_timestamp(YEAR || '-' || MON || '-' || DAY || ' ' || HOUR || ':' || MIN || ':00', 'yyyy-mm-dd hh24:mi:ss') AS CHRON_START_TIMESTAMP
        , DATEADD(SECONDS, DURATION, CHRON_START_TIMESTAMP) AS CHRON_END_TIMESTAMP
        , *
   FROM EXPLODE_RANGE
 HAVING CHRON_START_TIMESTAMP IS NOT NULL
    AND DATE_PART(WEEKDAY, CHRON_START_TIMESTAMP) = WDAY
    AND CHRON_START_TIMESTAMP <= CURRENT_TIMESTAMP()
  ORDER BY CHRON_START_TIMESTAMP DESC               
 $$
 ;
 
 
SELECT * FROM TABLE (GENERATE_CHRON_PATTERNED_TIMESTAMPS('0 17,19-23 1,15 * 1,2,4-6', 3600));