Data Glass
Personal Technical Diary on Data & Process Models, Data Warehousing, Planning, Techniques, and Memoirs.
Sunday, January 14, 2024
ARTIST AI Media Service
I created an architectural specification document, complete with process, sequence, and ERD diagrams, for the ARTIST AI Media Service. It's a work in progress.
Monday, February 21, 2022
Technology: ARTIST MDM Technology Selection
Ultimate Technical Goal
- Support 20 Billion Episodes & Movies
- Support 100 concurrent curators
- Support Localization of text
- Support Change Control
- Support Data Ticket tracking
- Support HITL (Human in the Loop)
- Data Structure to be flexible and expandable
Approach
- Use standard services so it's easy to find intermediate-level developers
- Keep it simple & maintainable
Teams
- Data Engineering - Pipeline, Warehouse, Containers
- Video Prep - Video Capture, Pre-Processing, Video Security and Storage
- Data Science - ML & NLP packages
- UI/UX Dev - ARTIST Website, HITL, Data Ticketing, and Data Collection UI
Preliminary Technology Choice
- Use S3 buckets
- Organize data using S3 folders and naming conventions
- Store raw pre-ingested data as JSON documents
- Use Aurora transactional databases
- MDM Editor Database
- Data Ticketing and Tracking Database
- Databricks
- Server orchestration
- Job Management
- Machine Learning
- Data Pulls from 3rd Party APIs
- ECS/ECR/ELB/Fargate for Data Collection (MDM UI) Website
- Use Lambda for internal API
- Use JSON to store schema, data entry rules, UI presentation, and editing layouts (see the sketch after this list)
- AWS OpenSearch
- Connector to Aurora for indexing searches
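For the JSON-driven layouts mentioned above, a minimal sketch of a single field definition might look like the following; the element names and rules are illustrative assumptions, not a finalized schema:

```json
{
  "entity": "Movie",
  "field": "original_title",
  "dataEntryRules": {
    "required": true,
    "maxLength": 250,
    "localizable": true
  },
  "uiPresentation": {
    "label": "Original Title",
    "control": "text",
    "editLayout": "general_info_panel",
    "readOnlyForRoles": ["viewer"]
  }
}
```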
Topological Diagram
Friday, February 18, 2022
Lessons Learned: An eight-year venture in creating data products
Preface
The Customer
- Predict Video Ratings before Release
- Detect Talent Screen Time
- Calculate Talent Diversity Scores
- Detect Topics Presented
- Detect News Coverage
- Find Props & Set Pieces
- Collect the Movie Credits on Videos in their Vault
- Suggest the Appropriate amount of On Site coverage required for news (On Site is expensive)
- Suggest the Appropriate amount of Spanglish for a mixed-language audience
- Suggest the Appropriate amount of News Coverage for a topic
- Measure the Retention power of each 10-second segment of the video
- Measure the Retention power of each episode of a series
- Determine what Type of Content had Statistically Significant Retention for the Audience
- Help increase sales or lower their costs
- A data product that fits their budget
- A data product that was simple to understand
- A high quality data product
- A durable data product that would be available for years to come
- A data product that could easily integrate & have low friction with their daily business process
- A data product with 24/7 technical support
- High sales in a large enough market space to support the company long term
- A data product that had a low cost to manufacture
- A data product that had a high enough profit margin to survive if not thrive
- High subscription renewals
- Low sales turnaround time
- A data product we exclusively owned
What We Built
- Talent
- Time on Screen
- Time Speaking
- Style
- Pace
- Complexity in Vernacular and Length of Sentences
- Language Classification of use of Personal Viewpoint (Me, I, Feel, Believe, etc...)
- Emotions
- Sad, Angry, Fear, Positivity, Surprise
- Talent's Setting
- News
- Local
- On Site
- National
- Scripted
- Outside Country
- Outside City
- Indoors
- Shot
- Screen Motion
- Topic Classification
- News Coverage
- Crime, Weather, Politics, Lifestyle, Social Justice, Sports, Traffic, Tech/Science, Economy/Business, Education, Public Health
- Scripted Storyline
- Custom per series or genre
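To make the taxonomy above concrete, here is a small, hypothetical sketch of how a single per-talent detection record could be represented; the field names and values are illustrative only, not the schema we actually shipped:

```json
{
  "videoId": "ep-1234",
  "talent": {
    "name": "Jane Doe",
    "timeOnScreenSeconds": 305.0,
    "timeSpeakingSeconds": 190.5,
    "style": {
      "pace": "fast",
      "vernacularComplexity": 0.62,
      "personalViewpointRatio": 0.18
    },
    "emotions": {
      "sad": 0.05,
      "angry": 0.10,
      "fear": 0.02,
      "positivity": 0.70,
      "surprise": 0.13
    },
    "setting": {
      "coverage": "News",
      "location": "On Site",
      "indoors": false
    }
  },
  "topics": ["Weather", "Traffic"]
}
```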
The Ultimate Tragedy
The facts are that the traditional media world is 100 years old, strongly networked, and heavily committed to previous contracts; their processes are firmly established and their budgets are tight. A lot of their money is spent on actors, directors, marketing, FX, sets, etc... Squeezing the cost of an "Unproven" data product into a firmly established process and budget is a very hard nut to crack. It's like getting your first credit card: you have to prove you are reliable and will make the payments before anyone will give you one. A catch-22 scenario.
There was a fair amount of friction in gaining access to videos, and it slowed our delivery down considerably, by a month or two. If we asked the customer to provide videos, we had to jump through a lot of legal hoops and pay lawyer fees to make it happen. Then, once we had that nailed down, setting up the method of transport was always different per customer due to their security requirements. Sometimes the customer didn't want to bother going through their own red tape. So we had to approach it several different ways, which added to our complexity. We finally settled on capturing video from OTA (Over the Air), RTMP, HLS, and S3 buckets. Each one came with its own technical issues and video recording quality problems, but at least we could move forward with a pilot.
Time and time again, it turned out that companies just wanted to build it themselves. They would turn to our company to pilot an idea to see if it could be done. Once satisfied, they would proceed to do it themselves. The tools and technology available now afford them the ability to provide their own solutions and outsource the work to where wages can be 1/4-1/3 the cost of someone in the USA. They gain full control of the intellectual property, and they can build it the way they want using the tools and platforms they feel have a long-term future for them.
Just Buy Our Company
Twenty / Twenty Hindsight
- Actual Social Rating Score
- Predicted Rating Score
- Standardized set of Genres
- A Specific Talent's Screen Time
- Various Diversity scores of the film
- Pace of the Film
- Vernacular Complexity of the Film
- Topic Coverage
- Percentage of Emotions Covered
- Localized and Internationalized Content Rating
- Percentage of Each Language Spoken
One Not So Little Idea Left on the Table
So Long and Thanks for all the Fish
- Understand ALL the major processes in a vertical industry
- Determine the most painful process points
- Become a part of the industry's process flow
- Target the RIGHT customer
- Have the goal to be THE Source of Truth for a prospective dataset
- Keep it a simple build
- Make it a quick and easy sell
- Keep the "Friction" low for customers to adopt into their daily lives
- Provide data that enables customers to take action on a daily or weekly basis
- Network and market like crazy and fast
- Expect your first several data products to fail
- Keep churning out data products until something sticks
- You'll have tons of competition, so expect the big boys to move in on your territory
Saturday, February 12, 2022
Baseline Model: ARTIST's Ingestion, Data Mastering, & Data Ticketing
Curating media information about talent, series, movies, and credits in a growing international media industry is challenging for the following reasons:
- Yet to be published data
- Sparsely populated data
- Slowly changing data
- Incorrect data
- Duplicate data
- Insufficient Synopsis or Bio summaries
- Localization Issues - Bad translations or lack of translation
- Data "Easter Eggs" - Data that someone put in that is not appropriate for the customer
- Missing Talent and Character Portrait Images
- Missing Movie, Series, and Episode Posters
- Image capturing, persistence, and refreshing - Talent portraits, for example
- Image "Easter Eggs" - Images that someone put in that are not appropriate for the customer
Data Ingestion Process Model
Data Issue & Ticketing Process Model
Data Issue & Ticketing Data Model
Media Data Model
What Is: Data Governance, SLAs, and Data Contracts
Don't let perfection be the enemy of the good. All companies struggle against entropy in process and data. It's a never-ending process. Each day in which we succeed in conforming a dataset, filling in more values in a sparsely populated dataset, or protecting customers from bad data and "Data Vandalism" is a good day to be celebrated.
Data governance is a very intense discipline that spreads across the infrastructure of a company. It's a daunting task to implement it in a way that has traction and delivers some meaningful "Time to Value" for the company's investment. The following diagram gives a summary of what Data Governance is all about in a nutshell. (I created this "Pamphlet" 5 years ago, but it is still mostly relevant. Sorry for the small print; if you download it and open it in an editor, it will be easier to read.)
When you are a data provider, Data Governance is a core part of your business and is centered on guaranteeing the SLA for your customers. The diagram below gives a summary of what an SLA and a Data Contract are. (I created this "Pamphlet" 5 years ago too, and again it is still mostly relevant. Again, sorry for the small print; if you download it and open it in an editor, it will be easier to read.)
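To make the idea concrete, here is a minimal, hypothetical sketch of the kinds of expectations a data contract backing such an SLA might capture; the dataset name, targets, and fields below are illustrative assumptions, not taken from the pamphlet:

```json
{
  "dataset": "daily_ratings_feed",
  "owner": "data-engineering",
  "deliverySchedule": "daily by 06:00 UTC",
  "schemaVersion": "1.3",
  "qualityTargets": {
    "completenessPercent": 99.0,
    "duplicateRowsAllowed": 0,
    "maxDeliveryDelayHours": 4
  },
  "supportChannel": "24/7 on-call",
  "changeNotificationLeadTimeDays": 30
}
```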
Tuesday, February 08, 2022
Baseline: ARTIST's Media Video AI Data Lake Structure
To create an architecture for the ARTIST Media Video AI system described in an earlier post, we need a data lake to store the videos, frames, audio, and important content metadata about the video, which can be used by machine learning, HITL labeling, and model training. We also need to store the machine learning results and aggregated reports.
Security is important for video, so it's advisable to keep the video assets separated from the machine learning results and reports.
The ARTIST Vault should have a structure like the one below, with the production S3 bucket having the appropriate security lockdown so that only the processes for video capturing, pre-processing, HITL labeling, training, and ML have access to this S3 bucket.
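Here is a sketch of the kind of layout this implies; the bucket and prefix names are hypothetical placeholders, not the actual Vault naming convention:
- artist-vault-prod (locked-down production S3 bucket)
  - videos/{series_id}/{episode_id}/source.mp4
  - frames/{series_id}/{episode_id}/{frame_number}.jpg
  - audio/{series_id}/{episode_id}/audio.wav
  - metadata/{series_id}/{episode_id}/content.json
- artist-ml-results-prod (separate bucket and access policy for ML results and reports)
  - detections/{detector_name}/{episode_id}/result.json
  - reports/{report_name}/{run_date}/report.parquet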
The output from the ML can be stored as JSON or Parquet files. I chose to present the data below as JSON to be more universal and human readable for the blog, but I suggest you use Parquet files.
Important note: All the ML outputs should have a flexible but conformed structure to make them easier to ingest and process downstream. The flexibility is isolated within the "output" element, which contains the ML result whose layout differs per ML process. Here is an example of the recommended baseline ML output structure:
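This sketch assumes hypothetical field names for the conformed envelope; only the "output" element comes from the note above, and its contents vary per detector:

```json
{
  "schemaVersion": "1.0",
  "mlProcess": "talent_screen_time_detector",
  "modelVersion": "2022-02-01",
  "videoId": "ep-1234",
  "runTimestamp": "2022-02-08T14:32:00Z",
  "status": "success",
  "output": {
    "talentId": "tal-987",
    "timeOnScreenSeconds": 412.5,
    "segments": [
      {"startSeconds": 10.0, "endSeconds": 25.5},
      {"startSeconds": 301.0, "endSeconds": 698.0}
    ]
  }
}
```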
Sunday, February 06, 2022
Baseline: ARTIST - Media Video AI High Level Architecture
This represents a high level architecture for a Media Video AI system. I call it the {A}rt of {R}eporting, {T}raining, {I}nference, & {S}tate {T}racking, or A.R.T.I.S.T. for short.
- Capture & Prep: Captures Video and prepares it for processing by ML
- Model Training: Training of models to be used by Detectors.
- HITL & Data Collection DB: Transactional DB for managing data entry and labeling.
- HITL Team: Outsourced team to perform labeling and data entry.
- Job & State Mgmt: Job Management for scheduling and running ML tasks.
- Job & State Mgmt DB: Transactional DB for managing processes and states.
- Detectors: Inference Engines for detecting content in video/audio/text
- Videos & ML Results S3 Buckets: Video, Frames, Audio, and ML Detection Storage.
- 3rd Party ML Services: Voice to Text, and other types of NLP or video detection service
- Audience Behavior & Ratings Warehouse: Storage and large volume processing warehouse DB
- 3rd Party Watch Log Providers: Watch Event log data providers & audience/critic panels
- Gallery Data Warehouse: Final data warehousing of the Gallery for integration with other services
- Gallery DB Cache: Gallery data distributed across the world and localized to its common language
- Gallery UI: Public UI for customers to view the media Gallery
Friday, February 04, 2022
How To: Segment High Volume Events into 15 Minute Ticks in Snowflake
Segmenting your events results in exploding your data. When you have billions of event records that need to be segmented into 15 minute ticks, this can end up being slow and expensive to process. The smaller the segmentation interval, the greater the explosion.
Word to the Wise: Don't join a big event table to another big table, or in this case, a large segmentation dataset. You want the segmentation dataset to be as small as possible to help the SQL optimizer hold at least one side of the join in memory without spilling to disk.
Here is my solution that worked very well in Snowflake.
Please note: it uses my simple GENERATE_TIMELINE function that I posted earlier, and the SEGMENT_TICK CTE dates never have to change.
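The sketch below illustrates the same pattern with only built-in Snowflake functions (a GENERATOR-based tick list standing in for GENERATE_TIMELINE) and hypothetical table and column names; it is an outline of the approach, not the exact production query:

```sql
-- Sketch: join a huge event table to a tiny, fixed tick list instead of
-- exploding each event into its own timeline rows.
WITH SEGMENT_TICK AS (
    -- 192 fifteen-minute offsets (48 hours' worth) so events that cross
    -- midnight still pick up ticks on the following day; this list never changes.
    SELECT (ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1) * 15 AS TICK_MINUTE_OFFSET
    FROM TABLE(GENERATOR(ROWCOUNT => 192))
)
SELECT
    E.EVENT_ID,
    DATEADD(MINUTE, T.TICK_MINUTE_OFFSET, DATE_TRUNC('DAY', E.EVENT_START_TS)) AS TICK_START_TS
FROM WATCH_EVENTS E        -- hypothetical billion-row event table
JOIN SEGMENT_TICK T
  -- keep only the ticks that overlap the event's start/end window
  ON DATEADD(MINUTE, T.TICK_MINUTE_OFFSET,      DATE_TRUNC('DAY', E.EVENT_START_TS)) < E.EVENT_END_TS
 AND DATEADD(MINUTE, T.TICK_MINUTE_OFFSET + 15, DATE_TRUNC('DAY', E.EVENT_START_TS)) > E.EVENT_START_TS;
```

Each event fans out to one row per overlapping 15 minute tick, while the SEGMENT_TICK side stays at 192 rows no matter how many events there are, so the optimizer can easily keep it in memory.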
Thursday, February 03, 2022
Baseline Model: Business Process Sales to Engineering Cycle BPMN
Baseline Model: Document Management Conceptual Model
Years ago I created this conceptual-level data model to capture the essence of what kind of data definitions are required for general document management. It's a baseline.
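As a rough, hypothetical rendering of the kinds of entities such a baseline usually covers (not the actual diagram, and the names are illustrative), the core definitions might look like this in SQL:

```sql
-- Hypothetical, simplified document management entities; the real
-- conceptual model may differ in naming and relationships.
CREATE TABLE DOCUMENT (
    DOCUMENT_ID    BIGINT PRIMARY KEY,
    TITLE          VARCHAR(250) NOT NULL,
    DOCUMENT_TYPE  VARCHAR(50),            -- e.g. contract, spec, memo
    OWNER_PARTY_ID BIGINT,
    CREATED_AT     TIMESTAMP NOT NULL
);

CREATE TABLE DOCUMENT_VERSION (
    DOCUMENT_VERSION_ID BIGINT PRIMARY KEY,
    DOCUMENT_ID         BIGINT NOT NULL REFERENCES DOCUMENT (DOCUMENT_ID),
    VERSION_NUMBER      INT NOT NULL,
    STORAGE_URI         VARCHAR(1000),     -- where the physical file lives
    AUTHOR_PARTY_ID     BIGINT,
    CREATED_AT          TIMESTAMP NOT NULL
);

CREATE TABLE DOCUMENT_TAG (
    DOCUMENT_ID BIGINT NOT NULL REFERENCES DOCUMENT (DOCUMENT_ID),
    TAG         VARCHAR(100) NOT NULL,
    PRIMARY KEY (DOCUMENT_ID, TAG)
);
```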