Thursday, February 03, 2022

What Is: Data Modeling

Basic Concepts of Data Modeling

Data modeling is an applied discipline for defining the language (nouns and verbs) of the business.  The results of a modeling exercise are a series of data models and a data dictionary that are used to educate stakeholders and build awareness of the meaning, relationships, and value of data across the business.  Data models are most commonly used to define the data requirements for a company’s transactional and data warehouse needs and are vital to data management and governance.


Approaches: Top-Down, Bottom-Up, and POC

Top-Down: A process of extracting information from subject matter experts for a given business area and producing a model that documents the business data requirements, which IT can then use to build, modify, or buy services.  You can speed up this process by starting with reference models and modifying only the areas where the business deviates from them.


Bottom-Up:  A process of reverse-engineering existing data structures into “As-Built” data models in order to take inventory of the current data ecosystem.  Gap analysis can then be performed by comparing the Top-Down models against the Bottom-Up models.
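
As a minimal sketch of the Bottom-Up approach, the Python snippet below pulls the table definitions out of an existing SQLite database; the file name orders.db is a hypothetical example, and the same idea applies to any engine that exposes its system catalog.

    import sqlite3

    # Connect to an existing database (the file name is a hypothetical example).
    conn = sqlite3.connect("orders.db")

    # SQLite keeps the as-built DDL for every table in its sqlite_master catalog.
    cursor = conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'")
    for table_name, ddl in cursor:
        # Each statement documents the table as it actually exists,
        # ready to be compared against the Top-Down model during gap analysis.
        print(f"-- {table_name}")
        print(ddl)
    conn.close()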


Proof of Concept (POC):  A process where the data requirements are discovered by means of creating a prototype.


Data Model Perspectives

There are many data model perspectives.  The following are the most common.


Conceptual Data Model:  Describes the language of the business at a summary level, communicating the subject and scope to all stakeholders.  These models describe entities and their relationships without going into primary keys or attribute-level detail.
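
As a rough illustration (using hypothetical entity names), a conceptual model boils down to nothing more than entities and the relationships between them:

    # Entities and relationships only; no keys, no attributes.
    conceptual_model = {
        "entities": ["Customer", "Order", "Product"],
        "relationships": [
            ("Customer", "places", "Order"),
            ("Order", "contains", "Product"),
        ],
    }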


Logical Data Model:  Describes the language of the business in detail while remaining technology-neutral and following normalization rules.  The logical model is primarily used to communicate the data requirements to the business, development, and operational teams.
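
Here is a minimal sketch of the same hypothetical model at the logical level: the entities now carry attributes and unique identifiers, but nothing is tied to any particular database technology.

    from dataclasses import dataclass

    @dataclass
    class Customer:
        customer_id: int   # unique identifier for each customer record
        name: str
        email: str

    @dataclass
    class Order:
        order_id: int      # unique identifier for each order record
        customer_id: int   # reference to the Customer who placed the order
        order_date: str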


Physical Data Model:  Describes the actual schema design, optimized and targeted for a specific technology (transactional database, analytical database, NoSQL database, JSON file, CSV file, etc.).  These models are used primarily to communicate the actual physical design to the development, test, and operational teams.  A brief sketch follows the two variants below.

  • As-Built Physical Data Model:  Describes the physical data model as reverse-engineered from a production database.
  • As-Documented Physical Data Model:  Describes the physical data model as documented.  These documents may deviate from what is actually out in production.
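
Continuing the same hypothetical example, the sketch below shows a physical model targeted at one specific technology (SQLite, chosen only for illustration): concrete column types, constraints, and an index picked for that engine.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the example
    conn.executescript("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            email       TEXT NOT NULL UNIQUE
        );
        CREATE TABLE customer_order (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
            order_date  TEXT NOT NULL
        );
        -- A physical-only concern: an index chosen for this engine's query patterns.
        CREATE INDEX idx_order_customer ON customer_order (customer_id);
    """)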


Reference Data Models

Data modeling is a relatively mature science, and most business areas out in the wild have been logically mapped out over the years.  As a result, there are several mature data models that are considered “Reference” models.  Using these reference models as a baseline saves a lot of time and resources, which can be refocused on the unique needs of the business that differ from the “Reference” models.  Len Silverston, Paul Agnew, and David Hay have published several model patterns and reference designs.  I too have thrown my hat in the ring with my models here on this website.


Data Normalization

The first, second, and third normal forms of data normalization were defined by Edgar F. Codd.  Codd and Raymond F. Boyce combined their efforts to define the Boyce-Codd Normal Form (BCNF) in 1974.  Data normalization is a process that organizes information in a way that clearly defines all the relevant entities, their attributes, and the relationships between entities.  Each entity should represent only one concept, such as a person, place, or thing, and must have a “key” that uniquely identifies each entity record.  The key must contain the attributes necessary to make each entity record unique and nothing but those attributes.  Non-key attributes included in the entity’s definition must be separate from the entity’s unique key and must intrinsically belong to the entity.  Any redundancy must be removed.
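
The toy example below (hypothetical data) shows the basic idea: the flat rows repeat the customer’s name and email on every order, so those facts are factored out into a customer entity keyed by customer_id, and each order keeps only its own attributes plus a reference to that key.

    # Denormalized rows: customer facts are repeated on every order.
    flat_rows = [
        {"order_id": 1, "customer_id": 10, "name": "Ada", "email": "ada@example.com"},
        {"order_id": 2, "customer_id": 10, "name": "Ada", "email": "ada@example.com"},
    ]

    customers = {}  # each customer fact stored exactly once, keyed by customer_id
    orders = []     # order attributes plus a reference to the owning customer
    for row in flat_rows:
        customers[row["customer_id"]] = {"name": row["name"], "email": row["email"]}
        orders.append({"order_id": row["order_id"], "customer_id": row["customer_id"]})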


Data Integrity & Performance

Data integrity refers to the assurance that the information being managed in a data system is consistent and accurate over the life of the data.  It is critical in the design of a transactional database system to have data integrity.  Normalizing your data and using foreign key constraints are natural ways to design a transactional system that inherently enforces data integrity.  But a balance must be struck with the performance needs of the system.  This typically requires optimizing the design by introducing derived and duplicate data elements for faster reporting and scaling.  Optimization may also include turning off foreign key constraints, collapsing or partitioning tables, reshaping tables, or changing a many-to-many relationship into a one-to-many relationship.  In other words, selective de-normalization of the design is required to tune a database to perform as required under stress.  As these optimizations are introduced, data integrity problems may arise, so extra processes must be put into place to monitor and protect the data.
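
As a minimal sketch of foreign key enforcement (SQLite used purely as an example engine; the table names are hypothetical), the snippet below shows the database itself rejecting an order that points at a customer that does not exist.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript("""
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE customer_order (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer (customer_id)
        );
    """)
    conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
    conn.execute("INSERT INTO customer_order VALUES (100, 1)")   # valid reference
    try:
        conn.execute("INSERT INTO customer_order VALUES (101, 99)")  # no such customer
    except sqlite3.IntegrityError as err:
        print("Rejected by the database:", err)

Turning that constraint off, or adding duplicate columns for reporting speed, shifts the burden of keeping the data consistent from the database onto your own processes.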


Agile Data Modeling and the Long Term View

Agile development can be applied to data modeling by utilizing reference models and limiting the scope of the modeling effort to the storyboard.  But it is also critical to think of the big picture when it comes to enterprise-wide reporting and data management.  Do not silo your data modeling effort too much.  A conceptual or logical data model focused on the broad view is of great value to the whole team.


Data Modeling Tools

Modeling tools enable a data modeler to create graphical representations of data models following very specific notation rules.  Some tools can do both data and process modeling using Business Process Modeling Notation (BPMN) and link the two together.  My favorite over the years has been Visual Paradigm.  Cheap, and it just works.
