Welcome to our "Data Dictionary" series, where we explore key concepts in data analytics, engineering, and architecture in a clear and engaging way. We’re making definitions the focal point of the series because we believe knowing and grasping technical terminology is key to comprehending the data ecosystem.
Whether you're a beginner or a seasoned professional in the field of data, our goal is to empower you with the knowledge and understanding of these important topics. We understand that data can be overwhelming and the jargon can be confusing, and we're here to help you navigate through it all.
Our first entry in the series is the topic of data modeling, a fundamental concept in data management that is essential for organizing and structuring data. We chose to tackle data modeling first because it is the building block that underpins so much of what’s to come as we delve deeper into the beautiful and complex world of data.
So, without further ado, let’s begin our data journey, and answer the central question: what is data modeling?
Data Modeling, by Definition
Data modeling is the process of creating a representation of data in a computer system. To design a data model, you need to define the structure of the data, including the relationships between different pieces of data, and the rules that govern how the data can be stored and accessed.
The Different Types of Data Models
When data architects and engineers refer to a "data model," they typically mean a diagram that visualizes data objects and the connections between them. Picture a chart of boxes and arrows that can be a high-level outline (a Conceptual Model), a more detailed view of the data's structure and rules (a Logical Model), or its database-specific implementation (a Physical Model).
The Purpose of Data Modeling
Regardless of the data model's degree of specificity, its central objective is flexible, efficient, and easy-to-understand database design. To achieve that goal, the data architect will first work with business stakeholders to conceptually understand data sources and their use cases, before building on that foundation to develop the logic layer, or making any decisions on technical implementations and software tools.
From Diagrams to Databases
So, how does an abstract concept like “modeling data” become concrete? In practice, the diagrams will eventually take on the form of a physical database, managed by a Database Management System (DBMS), like MySQL or Postgres.
Don't get lost in the acronyms. You’re familiar with databases. They’re just tables, columns, and rows.
Tables, Columns, and Rows
A database is organized into a series of tables, each of which stores a specific type of data. Each table has a set of rows, called records, and a set of columns, called fields. Each record in a table represents a unique instance of the data, and each field in a record contains a specific piece of information about that instance.
Focusing on Fields
Determining the fields in a table, and how they relate to other tables, is one of the most important decisions a database designer has to make. So much so that you may hear the fields of a table being loosely referred to as the database schema. (The database schema is a core concept and comprises more than just the column names, but we’ll get to that in our next dictionary entry.)
To create a field, you must give it a name, and assign it a specific data type, such as integer, string, or date, which limits the kind of data that can be stored in that field.
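To make tables, records, and fields concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, field names, and sample values are illustrative assumptions, not part of any real system. Note that SQLite is more lenient about enforcing declared types than systems like MySQL or Postgres, so treat this as a picture of how fields are declared rather than of how strictly they are policed.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Each field gets a name and a declared data type.
# (Hypothetical table and field names, chosen for illustration.)
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER,
        full_name   TEXT,
        signup_date DATE
    )
    """
)

# One row (record) describes one customer.
conn.execute(
    "INSERT INTO customers (customer_id, full_name, signup_date) VALUES (?, ?, ?)",
    (1, "Ada Lovelace", "2023-01-15"),
)

# PRAGMA table_info lists each column; positions 1 and 2 hold its name and declared type.
fields = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customers)")]
record = conn.execute("SELECT * FROM customers").fetchone()

print(fields)
print(record)
```

The first print shows the fields (columns) that define the table's shape, and the second shows a single record (row) that fills those fields with values.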
The Importance of Primary Keys
The Primary Key ("PK") is the most essential field in any table. Sometimes referred to as just “the key,” this field is a unique identifier that helps link or “join” records from one table to another table. With it, every record or row in a table has its own label, so other tables can point to exactly that record.
For example, in a database table that contains information about customers, a standard Primary Key would be the customer's ID number. A combination of the customer's first and last name might look like an alternative, but two customers can share a name, so it makes an unreliable key. Because each customer's ID is unique and does not repeat for any other customer, you can use it to quickly find the customer you want.
A common use case would then be to leverage the Primary Key to link the unique customer record to a table of product orders. Because each order in the orders table is related to one customer, you can use the Primary Key of the customers table as a Foreign Key ("FK") in the orders table, link the two, and identify which customer placed an order.
A Foreign Key can be thought of as a reference. It points to a Primary Key of another table, allowing the database to maintain a relationship between the data in the two tables. When the database enforces the foreign key constraint, it preserves referential integrity: an order cannot exist unless it points to a real customer. The shared key also gives more complex queries, like join operations, a reliable column to match on.
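The Primary Key and Foreign Key rules described above can be sketched with Python's built-in sqlite3 module. The table and column names are hypothetical, and the `try_insert` helper exists only for this illustration. One SQLite-specific assumption: foreign key enforcement is off by default and has to be switched on per connection with a PRAGMA; other database systems typically enforce it without that step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is enabled on the connection.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customers (customer_id),
        product_name TEXT NOT NULL
    )
    """
)

conn.execute("INSERT INTO customers VALUES (1, 'Ada Lovelace')")
conn.execute("INSERT INTO orders VALUES (100, 1, 'Keyboard')")


def try_insert(sql):
    """Hypothetical helper: run an INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


# A second customer with the same Primary Key violates uniqueness.
duplicate_customer = try_insert("INSERT INTO customers VALUES (1, 'Grace Hopper')")
# An order for customer 999, who does not exist, violates referential integrity.
orphan_order = try_insert("INSERT INTO orders VALUES (101, 999, 'Monitor')")

print(duplicate_customer)
print(orphan_order)
```

Both inserts are refused by the database itself, which is the point of declaring keys in the model rather than relying on application code to keep the data consistent.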
Joins are both powerful and widely used because they retrieve data from more than one table in a single query. In this case, one query could fetch all the orders of a specific customer.
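As a rough sketch of that single query, the example below joins a customers table to an orders table and fetches every order placed by one customer. It uses Python's built-in sqlite3 module, and all table names, column names, and sample rows are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        full_name   TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customers (customer_id),
        product_name TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada Lovelace'), (2, 'Grace Hopper');
    INSERT INTO orders VALUES
        (100, 1, 'Keyboard'),
        (101, 1, 'Monitor'),
        (102, 2, 'Mouse');
    """
)

# Match each order to its customer through the shared key,
# then keep only the orders that belong to customer 1.
rows = conn.execute(
    """
    SELECT c.full_name, o.order_id, o.product_name
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE c.customer_id = ?
    ORDER BY o.order_id
    """,
    (1,),
).fetchall()

for row in rows:
    print(row)
```

The customer's name lives only in the customers table and the product only in the orders table, yet the join returns them side by side in each result row.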
The Perils of Poor Database Design
Even in the basic example of the customers and orders tables, you can already begin to see what would make for a good or bad database design.
If the data model did not include a Primary Key, or generated one that was not actually unique, records could not be matched accurately across tables, and the sales data would be either double-counted or under-represented.
If that’s obvious, consider this more subtle point: if the customers table included fields that were not strictly limited to identifying properties of the customer (such as name, location, email, etc.), those fields could overlap and conflict with the fields in the orders table (such as product name, SKU, quantity, price, etc.).
Going one level deeper, consider the structure of the orders table. Depending on how the products are classified and counted, you could run into unnecessarily costly computation problems trying to segment product purchasing behavior across groups of customers.
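The first pitfall, a key that is not actually unique, is easy to reproduce. The sketch below uses Python's built-in sqlite3 module with a deliberately flawed customers table that declares no Primary Key, so the same customer can be stored twice. The table names, column names, and amounts are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Flawed on purpose: no PRIMARY KEY, so duplicate customer_ids are allowed.
    CREATE TABLE customers (
        customer_id INTEGER,
        full_name   TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        price_cents INTEGER
    );
    -- The same customer was loaded twice.
    INSERT INTO customers VALUES (1, 'Ada Lovelace'), (1, 'Ada Lovelace');
    INSERT INTO orders VALUES (100, 1, 5000), (101, 1, 2500);
    """
)

# Revenue taken straight from the orders table.
true_total = conn.execute("SELECT SUM(price_cents) FROM orders").fetchone()[0]

# Revenue after joining to customers: each order matches both duplicate rows.
joined_total = conn.execute(
    """
    SELECT SUM(o.price_cents)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    """
).fetchone()[0]

print(true_total, joined_total)
```

Because every order matches two customer rows, the joined total comes out at twice the real figure. Declaring `customer_id` as a Primary Key would have rejected the duplicate row before it could distort any report.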
How to Avoid The Pain
An experienced data engineer has seen things break, and can anticipate many of the pitfalls that plague databases that were created ad hoc, without a strategy or plan. The larger datasets become, the more sources they draw from, and the more stakeholders they have to serve, the more complicated they get, and the more vulnerable they become to the high costs of shaky data foundations.
At Product Pair, we specialize in helping organizations like yours navigate the complex world of data. Our team of data experts can provide a customized approach to help you unlock the full potential of your data. We offer a range of services, from data modeling and data engineering to data analytics and visualization.
If you're ready to take your data to the next level and see the results for yourself, we would love to hear from you. We are currently offering free consultations to help you understand how our services can benefit your organization.
With a free consultation, you'll have an opportunity to speak with one of our experts, discuss your data needs, get answers to any questions you may have, and learn more about the solutions we offer. Click here to schedule your free consultation or contact us at firstname.lastname@example.org, and one of our team members will be in touch with you shortly. We look forward to helping you achieve your data goals.