Monday, February 21, 2022

Technology: ARTIST MDM Technology Selection

 

This is Part II, continuing from the previous post.

This is an exploration of how I might implement a master data management (MDM) and release service. There are many technologies and platforms that could be used to implement this. I chose AWS as the cloud service provider and Databricks for server orchestration, rather than going deep into EKS or Kubernetes, for reasons of team efficiency. Refactoring to EKS or Kubernetes in production instead of Databricks can be taken up in the future, when the budget and the size of the team allow.

Ultimate Technical Goal

  • Support 20 Billion Episodes & Movies
  • Support 100 concurrent curators
  • Support Localization of text
  • Support Change Control 
  • Support Data Ticket tracking
  • Support HITL (Human in the Loop)
  • Data Structure to be flexible and expandable

Approach

  • Use standard services so it's easy to find intermediate-level developers
  • Keep it simple & maintainable

Teams

  • Data Engineering - Pipeline, Warehouse, Containers
  • Video Prep - Video Capture, Pre-Processing, Video Security and Storage
  • Data Science - ML & NLP packages
  • UI/UX Dev - ARTIST Website, HITL, Data Ticketing, and Data Collection UI

Preliminary Technology Choice 

  • Use S3 buckets
  • Organize data using S3 folders and naming conventions
  • Store raw pre-ingested data as JSON documents
  • Use Aurora transactional databases
    • MDM Editor Database
    • Data Ticketing and Tracking Database
  • Databricks
    • Server orchestration
    • Job Management
    • Machine Learning
    • Data Pulls from 3rd Party APIs
  • ECS/ECR/ELB/Fargate for Data Collection (MDM UI) Website
  • Use Lambda for internal API
  • Use JSON to store schema, data entry rules, UI presentation and editing layouts
  • AWS OpenSearch 
    • Connector to Aurora for indexing searches
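
As a rough illustration of two of the choices above (S3 naming conventions for raw JSON documents, and JSON-stored data entry rules), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the key layout (`raw/source=.../entity=.../dt=...`), the rule document format, the field names, and the helper names `raw_key` and `validate` are all hypothetical and are not part of the actual ARTIST design.

```python
import json
from datetime import date

# Hypothetical rules document, as it might be stored as JSON in S3 or Aurora.
# The layout (entity / fields / type / required / min / max_length) is assumed.
RULES_JSON = """
{
  "entity": "episode",
  "fields": {
    "title":        {"type": "string",  "required": true,  "max_length": 200},
    "season":       {"type": "integer", "required": false, "min": 0},
    "release_year": {"type": "integer", "required": true,  "min": 1888}
  }
}
"""

# Maps rule type names to the Python types they accept.
TYPE_MAP = {"string": str, "integer": int}


def raw_key(source, entity, ingest_date, record_id):
    """Build an S3 object key for a raw, pre-ingested JSON document.

    S3 has no real folders; the '/'-separated prefixes only act like them.
    The prefix scheme here is an assumed convention, not a requirement.
    """
    return (
        f"raw/source={source}/entity={entity}/"
        f"dt={ingest_date.isoformat()}/{record_id}.json"
    )


def validate(record, rules):
    """Check a record against the JSON rules and return a list of error strings."""
    errors = []
    for name, rule in rules["fields"].items():
        value = record.get(name)
        if value is None:
            if rule.get("required", False):
                errors.append(f"{name}: required")
            continue
        expected = TYPE_MAP[rule["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name}: expected {rule['type']}")
            continue
        if "max_length" in rule and len(value) > rule["max_length"]:
            errors.append(f"{name}: longer than {rule['max_length']}")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
    return errors


if __name__ == "__main__":
    rules = json.loads(RULES_JSON)
    print(raw_key("vendor_a", "episode", date(2022, 2, 21), "ep-001"))
    print(validate({"title": "Pilot", "season": -1}, rules))
```

The idea behind keeping the rules in JSON is that the same document could drive both the curator UI and any server-side checks, so adding a field would not require a code change. A production version would more likely use a standard such as JSON Schema than a hand-rolled validator like this one.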

Topological Diagram



Please refer to my previous post on MDM design for details on the data and process models.
