Friday, December 25, 2020

Proven Snowflake Data Mesh Design Suggestions

I have found that Snowflake is an extremely fast and easy way to get your enterprise data warehouse up in the cloud securely, and it supports forward-looking design ideas like the "Data Mesh".  Snowflake has been fast and reliable for my projects, even high-volume ones ingesting 275 million records a day or more.
  1. I highly suggest you divide your warehouse database into separate schemas.  This keeps the noise down for the analysts and developers maintaining the warehouse and speeds up delivery.  Each schema represents a specific state of the data.  There are two schema types: Staging and Published.  On top of that, I recommend three schemas: Config (or name it Reference if you prefer), Store, and Insight.  The Staging type holds the RAW state of the data, or data that is still in flux as it is transformed.  Note that the Staging type applies across all three schemas (Config, Store, and Insight), so each of them has its own staging area.  The Store schema holds the published data: cleaned up, deduplicated, flattened from JSON into tables, and safe to be consumed by downstream services or analysts.  (A minimal layout sketch appears after this list.)
  2. The Insight schema is the highly modeled design, shaped and aggregated in a way that is ready to be used for reporting.  Notice that it also has its own Staging.
  3. The Config (or Reference) schema is the place to warehouse your configuration metadata and mastered data for domains and hierarchies.
  4. Each domain exports from its transactional database into an S3 bucket path.  Data ingestion into Snowflake via AWS S3 is extremely simple to do, and Snowflake handles many data formats such as Parquet, JSON, and CSV with ease.  (An ingestion sketch appears after this list.)
  5. In the spirit of supporting "Data Mesh" concepts, you may find it more convenient all around to create a separate warehouse database for each business domain, kept apart from the main enterprise reporting warehouse.  This reduces what the main enterprise reporting warehouse needs to publish to only the necessary aggregates; deeper analysis can be done by going to the specific domain database.  The Data Mesh approach reduces bottlenecks by letting you scale out your teams independently for development, maintenance, operations, and subject matter expertise in each business domain without introducing heavy dependencies.
  6. Most warehouses need some kind of master data that is created by the company and change controlled.  I have explored all kinds of ways to master data, but I find it simplest to create an AWS RDS database, model the data you need, and add the fields necessary for change control.  As for a UI, if the changes are not high volume I wouldn't bother: set up a change order process and have a DBA make the changes.  If the changes are high volume, build a UI purpose-built for mastering data by a team of people.  (A table sketch appears after this list.)
  7. Complex warehouses have some kind of configuration that drives reporting to the end customer.  This is like master data, except it is geared more toward pipeline control and setting arguments like thresholds, percentage ranges, and so on.
  8. You may need to process unstructured data or make predictions based on data in the warehouse.  Such tools do not run within the Snowflake environment.
    • If you have unstructured data like video, audio, or images, you will probably use some kind of machine learning for image and audio processing and then import the results into your warehouse.
    • If you need to make predictions, you will probably use a gradient boosting tool like XGBoost.  You'll export data from the warehouse and import the inference output back into your warehouse, where it can then be used to serve predictions.  (A sketch appears after this list.)
  9. Business intelligence reporting tools are essential within the enterprise.  I personally have found that Looker is an excellent match with Snowflake, and I have been able to deliver some extremely complex reporting with it.  Creating new reports with a tool like Looker is orders of magnitude faster than building reports with a web development team.
  10. Data quality is critical for a data warehouse.  I would start by validating all results for the basics (G.O.L.D.) and then work on other tolerance checks (a sketch of these checks appears after this list):
    • Gaps in the data
    • Orphaned dimensions
    • Late datasets
    • Duplicate Records
  11. Snowflake makes cloud data sharing extremely easy.  You can share internally or externally to the company.  This lets analysts, systems, or partners who need data go directly to the warehouse and export what they need, based on their security rights.  (A sharing sketch appears after this list.)
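
To make the schema layout in items 1 through 3 concrete, here is a minimal sketch using the snowflake-connector-python package.  The account credentials, database name, warehouse, and the staging-per-schema naming are all hypothetical placeholders; it is just one way to express the Staging/Published split described above.

```python
# Hypothetical layout: one database, with a Staging and a Published schema
# for each of the Config, Store, and Insight areas.
import snowflake.connector

SCHEMAS = {
    "CONFIG_STAGING":  "config/reference data still in flux",
    "CONFIG":          "published configuration and reference data",
    "STORE_STAGING":   "raw / in-flux domain data",
    "STORE":           "cleaned, deduplicated, flattened, safe to consume",
    "INSIGHT_STAGING": "report models still being built",
    "INSIGHT":         "highly modeled, aggregated, report-ready data",
}

conn = snowflake.connector.connect(
    account="my_account",      # placeholder account identifier
    user="my_user",
    password="my_password",
    role="SYSADMIN",
    warehouse="LOAD_WH",
)
cur = conn.cursor()
try:
    cur.execute("CREATE DATABASE IF NOT EXISTS ENTERPRISE_DW")
    for name, purpose in SCHEMAS.items():
        cur.execute(
            f"CREATE SCHEMA IF NOT EXISTS ENTERPRISE_DW.{name} COMMENT = '{purpose}'"
        )
finally:
    cur.close()
    conn.close()
```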
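
For item 4, here is a sketch of landing one domain's S3 export into a staging table.  The bucket path, credentials, and table names are hypothetical, and in practice you may prefer a storage integration over inline credentials.

```python
# Land a domain's Parquet export from S3 into a raw staging table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    role="SYSADMIN", warehouse="LOAD_WH",
    database="ENTERPRISE_DW", schema="STORE_STAGING",
)
cur = conn.cursor()
try:
    # File format and external stage pointing at the domain's export path.
    cur.execute("CREATE FILE FORMAT IF NOT EXISTS ORDERS_PARQUET TYPE = PARQUET")
    cur.execute("""
        CREATE STAGE IF NOT EXISTS ORDERS_STAGE
          URL = 's3://acme-orders-export/daily/'
          CREDENTIALS = (AWS_KEY_ID = 'PLACEHOLDER' AWS_SECRET_KEY = 'PLACEHOLDER')
    """)
    # Land the raw rows as VARIANT; downstream jobs flatten them into STORE.
    cur.execute("CREATE TABLE IF NOT EXISTS ORDERS_RAW (payload VARIANT)")
    cur.execute("""
        COPY INTO ORDERS_RAW
        FROM @ORDERS_STAGE
        FILE_FORMAT = (FORMAT_NAME = 'ORDERS_PARQUET')
    """)
finally:
    cur.close()
    conn.close()
```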
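
For item 6, here is a sketch of a mastered table in an AWS RDS PostgreSQL instance with simple change-control fields.  The endpoint, table, and column names are hypothetical; the point is only that the fields required for change control live alongside the mastered data.

```python
# A mastered reference table with change-control columns, created in RDS.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS master_product_hierarchy (
    product_code     TEXT PRIMARY KEY,
    product_name     TEXT NOT NULL,
    category         TEXT NOT NULL,
    -- change-control fields
    change_order_id  TEXT NOT NULL,         -- ticket that authorized the change
    changed_by       TEXT NOT NULL,         -- DBA or service account
    effective_from   DATE NOT NULL,
    effective_to     DATE,                  -- NULL = currently effective
    created_at       TIMESTAMP NOT NULL DEFAULT now(),
    updated_at       TIMESTAMP NOT NULL DEFAULT now()
);
"""

conn = psycopg2.connect(
    host="master-data.example.rds.amazonaws.com",  # hypothetical RDS endpoint
    dbname="masterdata", user="mdm_admin", password="change_me",
)
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```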
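
For item 8, here is a sketch of the round trip: export features from the warehouse, train an XGBoost model outside of Snowflake, and load the scored output back for reporting.  Table names, columns, and the churn target are hypothetical placeholders.

```python
# Train outside Snowflake, write predictions back in.
# Requires the pandas/pyarrow extras of snowflake-connector-python.
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas
import xgboost as xgb

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="ML_WH", database="ENTERPRISE_DW", schema="STORE",
)
cur = conn.cursor()

# 1. Export training data from the warehouse (column names come back uppercase).
df = cur.execute(
    "SELECT customer_id, feature_1, feature_2, churned FROM customer_features"
).fetch_pandas_all()

# 2. Train outside of Snowflake; CHURNED is assumed to be a 0/1 label.
features = df[["FEATURE_1", "FEATURE_2"]]
model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(features, df["CHURNED"])

# 3. Import the inference output back into the warehouse for reporting.
scored = df[["CUSTOMER_ID"]].copy()
scored["CHURN_SCORE"] = model.predict_proba(features)[:, 1]
cur.execute(
    "CREATE TABLE IF NOT EXISTS CHURN_PREDICTIONS (CUSTOMER_ID NUMBER, CHURN_SCORE FLOAT)"
)
write_pandas(conn, scored, "CHURN_PREDICTIONS")

cur.close()
conn.close()
```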
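
For item 10, here is a sketch of the G.O.L.D. checks expressed as plain SQL and run on a schedule.  The table and column names (a calendar table in Config, an orders fact and customers dimension in Store, a load_date column) are hypothetical.

```python
# Basic G.O.L.D. checks run against the warehouse; any returned rows are failures.
import snowflake.connector

CHECKS = {
    # Gaps in the data: calendar days with no fact rows in the last 30 days.
    "gaps": """
        SELECT d.calendar_date
        FROM config.calendar d
        LEFT JOIN store.orders o ON o.order_date = d.calendar_date
        WHERE d.calendar_date >= DATEADD(day, -30, CURRENT_DATE)
        GROUP BY d.calendar_date
        HAVING COUNT(o.order_id) = 0
    """,
    # Orphaned dimensions: facts pointing at a customer key that does not exist.
    "orphans": """
        SELECT o.order_id
        FROM store.orders o
        LEFT JOIN store.customers c ON c.customer_id = o.customer_id
        WHERE c.customer_id IS NULL
    """,
    # Late datasets: nothing loaded yet for today.
    "late": """
        SELECT COUNT(*) AS rows_today
        FROM store.orders
        WHERE load_date = CURRENT_DATE
        HAVING COUNT(*) = 0
    """,
    # Duplicate records: the same business key appearing more than once.
    "duplicates": """
        SELECT order_id, COUNT(*) AS copies
        FROM store.orders
        GROUP BY order_id
        HAVING COUNT(*) > 1
    """,
}

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="DQ_WH", database="ENTERPRISE_DW",
)
cur = conn.cursor()
for name, sql in CHECKS.items():
    failures = cur.execute(sql).fetchall()
    if failures:
        print(f"DQ check '{name}' failed: {len(failures)} offending rows")
cur.close()
conn.close()
```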
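
For item 11, here is a sketch of sharing an Insight table with another Snowflake account.  The share name, table, and consumer account are hypothetical placeholders.

```python
# Create a share, grant access to one Insight table, and add a consumer account.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    role="ACCOUNTADMIN",  # creating shares requires the appropriate privileges
)
cur = conn.cursor()
for stmt in [
    "CREATE SHARE ENTERPRISE_INSIGHTS",
    "GRANT USAGE ON DATABASE ENTERPRISE_DW TO SHARE ENTERPRISE_INSIGHTS",
    "GRANT USAGE ON SCHEMA ENTERPRISE_DW.INSIGHT TO SHARE ENTERPRISE_INSIGHTS",
    # DAILY_SALES is a hypothetical report-ready table in the Insight schema.
    "GRANT SELECT ON TABLE ENTERPRISE_DW.INSIGHT.DAILY_SALES TO SHARE ENTERPRISE_INSIGHTS",
    # The consumer account identifier below is a placeholder.
    "ALTER SHARE ENTERPRISE_INSIGHTS ADD ACCOUNTS = partner_account",
]:
    cur.execute(stmt)
cur.close()
conn.close()
```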

As for processing pipelines and job schedules, I have found it most useful to build as much of the processing as close to the data as possible.  Creating stored procedures and executing them with Snowflake's Task scheduler keeps the data warehouse self-contained whenever it does not depend on pulling data from APIs, non-Snowflake databases, or file locations Snowflake cannot reach (a sketch follows below).  When you do need to pull from other sources such as an API or another database, I suggest using Python with a cron schedule and placing the logic on a cheap AWS EC2 instance (also sketched below); most data services publish example Python logic for pulling from their API that you can copy and paste.  You might also find Databricks of great value here, and it adds further value by letting you apply other complex logic like XGBoost, neural networks, NLP, and the like.  Over the years I have found that complex ETL services and tools are not required and add unnecessary expense; personally, I would rather spend that money on more Snowflake and Databricks processing time.  That said, a modern DAG job manager can be useful if you keep the design elegant.
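
Here is a sketch of the "keep the logic close to the data" approach: a JavaScript stored procedure that publishes cleaned rows, scheduled with a Snowflake Task on a cron expression.  The procedure body, table names, and schedule are hypothetical.

```python
# Create a stored procedure and a task that runs it every morning at 02:00 UTC.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="LOAD_WH", database="ENTERPRISE_DW", schema="STORE",
)
cur = conn.cursor()

# A JavaScript stored procedure that moves cleaned rows from staging to store.
cur.execute("""
CREATE OR REPLACE PROCEDURE PUBLISH_ORDERS()
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  snowflake.execute({sqlText:
    "INSERT INTO store.orders SELECT * FROM store_staging.orders_clean"});
  return "published";
$$
""")

# A task that calls the procedure on a cron schedule.
cur.execute("""
CREATE OR REPLACE TASK PUBLISH_ORDERS_TASK
  WAREHOUSE = LOAD_WH
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  CALL PUBLISH_ORDERS()
""")
# Tasks are created suspended; resuming requires the EXECUTE TASK privilege.
cur.execute("ALTER TASK PUBLISH_ORDERS_TASK RESUME")

cur.close()
conn.close()
```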
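
And here is a sketch of the external-pull pattern: a small Python script meant to be run by cron on a cheap EC2 instance, pulling from a source API and landing the raw JSON in a staging table.  The API URL, token, and table names are hypothetical placeholders.

```python
# Pull from an external API and land the raw JSON in a Snowflake staging table.
import json
import requests
import snowflake.connector

API_URL = "https://api.example.com/v1/orders"   # hypothetical source API
API_TOKEN = "REPLACE_ME"

def pull_and_load():
    # 1. Pull the latest records from the source system's API.
    resp = requests.get(
        API_URL, headers={"Authorization": f"Bearer {API_TOKEN}"}, timeout=60
    )
    resp.raise_for_status()
    records = resp.json()

    # 2. Land the raw JSON into a staging table as VARIANT rows.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="my_password",
        warehouse="LOAD_WH", database="ENTERPRISE_DW", schema="STORE_STAGING",
    )
    cur = conn.cursor()
    try:
        cur.execute("CREATE TABLE IF NOT EXISTS ORDERS_API_RAW (payload VARIANT)")
        cur.executemany(
            "INSERT INTO ORDERS_API_RAW (payload) SELECT PARSE_JSON(%s)",
            [(json.dumps(rec),) for rec in records],
        )
    finally:
        cur.close()
        conn.close()

if __name__ == "__main__":
    # Example crontab entry on the EC2 box (runs hourly):
    # 0 * * * * /usr/bin/python3 /opt/pipelines/pull_orders.py
    pull_and_load()
```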