Tuesday, February 08, 2022

Baseline: ARTIST's Media Video AI Data Lake Structure

To create an architecture for the ARTIST Media Video AI system described in an earlier post, we need a data lake to store the videos, frames, audio, and important content metadata about the video, which can be used by machine learning, HITL labeling, and model training.   We also need to store the machine learning results and aggregated reports.

Security is important for video, so it's advisable to keep the source video data separated from the machine learning results and reports.

The ARTIST Vault should have a structure like the one below, with the production S3 bucket locked down so that only the processes for video capturing, pre-processing, HITL labeling, training, and ML have access to this S3 bucket.
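As a sketch, one possible vault layout could look like the following; the prefix names are my own illustrative assumptions, built from the content types and meta files mentioned in this post:

```
{your domain}-{environment}-vault/
   {article_id}/
      video/               <- original captured video files
      frames/              <- pre-processed still frames
      audio/               <- extracted audio tracks
      meta.json            <- video formatting, system mapping keys, provenance
      content_meta.json    <- known content metadata (creator/producer or IMDB, TVMaze, etc.)
```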



The ARTIST results from the machine learning and any aggregated reports should be stored in an S3 bucket structured like the one below.  The production security can be looser, because you should have complete legal ownership of this data.
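Again as a sketch only, a results bucket might be laid out like this; the bucket name and prefixes are hypothetical, keyed by the same article_id used in the vault:

```
{your domain}-{environment}-results/
   {article_id}/
      ml/
         {process_name}/{process_version}/output.parquet
      reports/
         aggregated_report.parquet
```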


The ML output can be stored as JSON or Parquet files.   I chose to present the data below as JSON to be more universal and human-accessible for the blog, but I suggest you use Parquet files.
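As a minimal sketch of getting from the nested JSON form to a Parquet-friendly flat layout, the snippet below flattens one event per row (the field names follow the baseline structure in this post; the values and the flattening itself are my own illustration):

```python
import json

# A trimmed example record following the baseline ML output structure
# (values are illustrative, not real results).
ml_output = json.loads("""
[
   {
      "process_name": "Landscape Detector",
      "process_version": "5.1.98",
      "article_id": "16466982-4518-4551-6278-3d10a32612a1",
      "events": [
         {"start_millisecond": 39020, "end_millisecond": 80930, "event_id": "1"}
      ]
   }
]
""")

# Flatten to one row per event so the data maps cleanly onto a
# columnar format like Parquet.
rows = [
    {
        "process_name": rec["process_name"],
        "process_version": rec["process_version"],
        "article_id": rec["article_id"],
        **event,
    }
    for rec in ml_output
    for event in rec["events"]
]

print(rows[0]["start_millisecond"])   # -> 39020
```

Each flattened row dict can then be written out with, for example, pandas' `DataFrame.to_parquet` (which requires pyarrow or fastparquet to be installed).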

Important note:   All the ML outputs should have a flexible but conformed structure to make them easier to ingest and process downstream.  The flexibility is isolated within the "output" element, which contains the ML result output and whose layout differs per ML process.  Here is an example and recommended baseline ML output structure:

[
   {  "process_service_provider":"Super Duper ML Scooper",
      "process_name": "Landscape Detector",
      "process_version": "5.1.98",
      "processed_timestamp": "2022-02-19 03:43:20",
      "video_path": "//{your domain}-{environment}-vault/.../video/video.mp4",
      "article_id":"16466982-4518-4551-6278-3d10a32612a1",
      "article_name":"To My Backyard and Back Again, a Gardener's Tale.",
      
      "events":[
               {
                  "start_millisecond": 39020,
                  "end_millisecond": 80930,
                  "event_id": "1",
                  
                  "output": [
                     {
                     {whatever tags or ids you need for down stream mappings}:"202",
                     {whatever human friendly values you want for easier future debugging}:"Blah blah",
                     {whatever detection name}:{{value}},
                     }, ... more outputs
                  ]
               }, ... more events
      ]
   }
]
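Because downstream ingestion relies on that conformed envelope, a small check like the following can reject malformed outputs early.  This is a sketch; the required-field lists are my reading of the example above, and "output" is deliberately left free-form:

```python
# Required top-level and per-event fields in the baseline envelope;
# "output" stays free-form per ML process, so only its presence is checked.
RECORD_FIELDS = {"process_service_provider", "process_name", "process_version",
                 "processed_timestamp", "video_path", "article_id",
                 "article_name", "events"}
EVENT_FIELDS = {"start_millisecond", "end_millisecond", "event_id", "output"}

def conforms(record: dict) -> bool:
    """True if the record carries all the conformed envelope fields."""
    if not RECORD_FIELDS <= record.keys():
        return False
    return all(EVENT_FIELDS <= ev.keys() for ev in record["events"])

sample = {
    "process_service_provider": "Super Duper ML Scooper",
    "process_name": "Landscape Detector",
    "process_version": "5.1.98",
    "processed_timestamp": "2022-02-19 03:43:20",
    "video_path": "//{your domain}-{environment}-vault/.../video/video.mp4",
    "article_id": "16466982-4518-4551-6278-3d10a32612a1",
    "article_name": "To My Backyard and Back Again, a Gardener's Tale.",
    "events": [{"start_millisecond": 39020, "end_millisecond": 80930,
                "event_id": "1", "output": [{"tag_id": "202"}]}],
}
print(conforms(sample))   # -> True
```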

Oh, and about the meta files you may have noticed: meta.json contains the video formatting details, system mapping keys, and provenance details, while the content_meta.json files contain all the metadata we already know about the video, provided either by the creator/producer of the video or via a service like IMDB, TVMaze, TVDB, TVTime, TiVo, Gracenote, etc.
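As a sketch, a meta.json might look like the following; every field name here is an illustrative assumption beyond the "formatting, mapping keys, provenance" description above:

```
{
   "video": { "container": "mp4", "codec": "h264", "width": 1920,
              "height": 1080, "duration_milliseconds": 1320000 },
   "system_mapping": { "article_id": "16466982-4518-4551-6278-3d10a32612a1" },
   "provenance": { "captured_by": "capture-service",
                   "captured_timestamp": "2022-02-19 03:40:01" }
}
```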
