Database Schemas: [Guide] Definitions, Types, Models, and Common Applications
In our last data dictionary entry, we covered Data Modeling, a topic that naturally led to this entry on database schemas. That's because database schemas are the end result or output of a data modeling process. In fact, data modeling and database schemas are so interrelated that you may hear the terms "data model" and "database schema" used interchangeably.
In this post, we're covering database schemas separately and in more depth, because they form the foundation of not just every data system, but also any software application. Everything from a simple customer intake form to a robust machine learning model relies on the skeleton of a database schema at its core.
So, what are database schemas?
A database schema is a set of rules that describe the structure, organization, and characteristics of data in a database. It defines the tables, records, and fields in which data will be stored, providing a definition to which users can refer as they access and update information in the database.
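To make the definition concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers table and its fields are hypothetical, chosen only for illustration; a real schema would reflect your own data.

```python
import sqlite3

# Hypothetical "customers" table, used only for illustration.
SCHEMA = """
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,  -- unique identifier for each record
    name  TEXT NOT NULL,        -- required field
    city  TEXT                  -- optional field
);
"""

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(SCHEMA)

# Ask the database to describe the table the schema just defined.
# Each PRAGMA row is (cid, name, type, notnull, default, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customers)")]
print(columns)  # [('id', 'INTEGER'), ('name', 'TEXT'), ('city', 'TEXT')]
```

The schema here is just the CREATE TABLE statement: it names the table and its fields, and says what kind of value each field holds, before any record exists.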
Because it is often drawn as a visual diagram, a database schema is frequently the only resource that can provide a bird's-eye view of your entire database. Schemas are living documents that can (and should) be updated, but for the most part, they are relatively stable.
Database schema versus database instance
Because you may occasionally hear the term "database instance" in the same conversation as a database schema, we want to quickly define it and save you any confusion.
A database instance is a snapshot of the data in a database at a given moment in time. It contains actual values for all the properties defined by the schema. Database instances change over time as new information is added or existing information is modified or deleted.
A small instance is often shown as an illustrative example of a database schema. It's just easier to grasp an actual name (like "John Smith") and location (like "Los Angeles CA") than to deduce the data output from reading the field names, data types, rules, and relationships in a table.
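The distinction can be sketched in a few lines of Python with sqlite3. The customers table and the helper names schema_of and instance_of are hypothetical, introduced only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT NOT NULL, city TEXT)")

def schema_of(connection):
    """Return the stored CREATE statement, i.e. the schema."""
    return connection.execute(
        "SELECT sql FROM sqlite_master WHERE name = 'customers'"
    ).fetchone()[0]

def instance_of(connection):
    """Return every row currently in the table, i.e. the instance."""
    return connection.execute(
        "SELECT name, city FROM customers ORDER BY rowid"
    ).fetchall()

schema_before = schema_of(conn)
instance_before = instance_of(conn)  # no rows yet

conn.execute("INSERT INTO customers VALUES ('John Smith', 'Los Angeles CA')")

schema_after = schema_of(conn)
instance_after = instance_of(conn)

print(schema_before == schema_after)  # True: the schema did not change
print(instance_before)                # []
print(instance_after)                 # [('John Smith', 'Los Angeles CA')]
```

Adding a record changes the instance but leaves the schema untouched, which is why the schema can serve as a stable reference while the data underneath it keeps moving.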
Why are database schemas essential?
As the overview of the database, a schema is the best tool to help you quickly identify potential problems and design issues in your database architecture. It also helps ensure that only valid data is stored and that data is retrieved properly. Without a schema, it would be nearly impossible to ensure data integrity and consistency.
What's more, because a schema provides the blueprint of the data, it helps facilitate better data understanding and greater information organization, in turn leading to optimized data retrieval and more effective queries. Going one step further, schemas also help illuminate data performance issues and storage inefficiencies.
Another key use of a schema is as a record of the database in times of change. As business needs evolve, data sources are updated, or engineering leadership turns over, database schemas are a vital reference point.
Three database schema types
There are three types of database schemas: conceptual, logical, and physical. Each of these schema types covers the same fundamental information (data sources and their relationships) but differs in its degree of complexity and level of technical specificity. We'll cover each schema type, starting from the top with conceptual schemas, the broadest, and drilling down to the most detailed, physical schemas.
A conceptual data schema is the highest level of abstraction, providing a general overview of the entire database system. It defines the entities and their relationships in a way that can be understood by both technical and non-technical users. A conceptual schema does not include any implementation details, such as data types or constraints, but instead focuses on the overall structure of the data. This type of schema is useful for understanding how different parts of an organization interact with each other and how data flows between them.
The conceptual database schema can be thought of as the "business-facing" data model, used to display data sources and use cases with product and operations stakeholders.
Logical data schemas are more detailed than conceptual schemas, providing information about the specific tables, fields, and relationships within a database. They also include information about data types and constraints that must be followed when entering data into the system. Logical schemas are used to ensure that all data entered into a system is valid and consistent with other records in the database.
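As a rough sketch of how data types and constraints keep invalid data out, the Python snippet below declares a hypothetical products table in SQLite and tries four inserts. The table, its rules, and the try_insert helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        sku   TEXT PRIMARY KEY,               -- must be unique
        name  TEXT NOT NULL,                  -- must be present
        price REAL NOT NULL CHECK (price >= 0) -- must not be negative
    )
""")

def try_insert(row):
    """Attempt an insert and report whether the schema accepted it."""
    try:
        conn.execute("INSERT INTO products (sku, name, price) VALUES (?, ?, ?)", row)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

results = [
    try_insert(("A-1", "Widget", 9.99)),   # valid row
    try_insert(("A-2", None, 4.50)),       # violates NOT NULL on name
    try_insert(("A-3", "Gadget", -1)),     # violates CHECK on price
    try_insert(("A-1", "Duplicate", 1.0)), # violates PRIMARY KEY uniqueness
]

for outcome in results:
    print(outcome)
```

Only the first row is accepted. The other three are rejected by the database itself, so the rules hold no matter which application writes the data.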
A logical data schema is often where a data architect will spend most of her time. That's because a logical schema commands a level of detail that forces the bulk of the thinking behind a data model. It's where you have to make crucial decisions about how to define entities, what attributes to assign them, how to relate them with each other (one-to-one, one-to-many, or many-to-many), and which rules to govern them with. At this stage, some data engineers will go so far as to define different views of the data (specific queries designed to present a customized view of a section of the data), and write stored procedures (custom functions developed to perform certain repetitive data computation tasks).
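Here is a minimal sketch of two of those decisions, a many-to-many relationship and a view, using Python's sqlite3 module. The students, courses, and enrollments tables are hypothetical. Stored procedures are left out because SQLite does not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT NOT NULL);

    -- Junction table: one row per (student, course) pair models many-to-many.
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL REFERENCES students(id),
        course_id  INTEGER NOT NULL REFERENCES courses(id),
        PRIMARY KEY (student_id, course_id)
    );

    -- A view is a saved query that presents a customized slice of the data.
    CREATE VIEW course_rosters AS
        SELECT c.title AS course, s.name AS student
        FROM enrollments e
        JOIN students s ON s.id = e.student_id
        JOIN courses  c ON c.id = e.course_id;

    INSERT INTO students VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO courses  VALUES (10, 'Databases'), (20, 'Statistics');
    INSERT INTO enrollments VALUES (1, 10), (2, 10), (1, 20);
""")

roster = conn.execute(
    "SELECT course, student FROM course_rosters ORDER BY course, student"
).fetchall()
print(roster)
# [('Databases', 'Ada'), ('Databases', 'Grace'), ('Statistics', 'Ada')]
```

A one-to-many relationship would need only a foreign key column on the "many" side; the junction table is what lets a student take several courses and a course hold several students.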
Physical data schemas are the most detailed type of database schema, providing information about how data is actually stored in a system. This includes details such as file locations, indexing strategies, partitioning schemes, and storage formats for each table in the database.
At the level of granularity of a physical database schema, the data architect actually has to select the Database Management System (DBMS). That's because each DBMS has its own unique structure and quirky set of rules that can cause a lot of headaches down the line if they're not explicitly accounted for in the data model. An experienced data architect will plan for the fact that MySQL, PostgreSQL, Oracle, and DynamoDB, for example, are each best suited for specific use cases, and require certain data types and patterns from the database schema.
Because of their precision, physical schemas are a key technical resource for engineers looking to optimize performance when accessing or manipulating large amounts of data.
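One physical-level decision, indexing, can be sketched with SQLite from Python. The orders table and the index name idx_orders_customer are hypothetical, and the exact wording of the query plan varies between SQLite versions, so the snippet only checks whether the index is mentioned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "SELECT total FROM orders WHERE customer_id = 42"

def plan(connection):
    """Return SQLite's description of how it would run QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

plan_without_index = plan(conn)  # a full scan of the table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = plan(conn)     # a search that uses the new index

print(plan_without_index)
print(plan_with_index)
```

The logical schema is identical before and after the index is created. Only the physical layer changed, and with it the way the same query is executed.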
SQL versus NoSQL database schemas
As mentioned, different DBMS have their own specific rules, standards, strengths, and limitations. Unwinding the intricacies of each DBMS is outside the scope of this entry (and a monumental task!), but for now, we can cover the differences between the two main database types at a high level.
A SQL database schema defines how data is stored and organized in a relational database: your standard table, with a fixed set of columns and as many rows as you need. (The schema constrains the shape of each row, not the number of rows; any limits on volume come from the specific DBMS and its storage, not from the relational model.)
SQL databases use Structured Query Language (SQL) to create and manage the database structure. This language provides commands for creating tables, inserting data into them, updating existing records, deleting records, and querying the database for specific information.
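The commands listed above can be sketched end to end with Python's sqlite3 module. The tasks table is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE: define the structure.
conn.execute("""
    CREATE TABLE tasks (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        done  INTEGER NOT NULL DEFAULT 0
    )
""")

# INSERT: add records.
conn.executemany(
    "INSERT INTO tasks (title) VALUES (?)",
    [("write schema",), ("load data",), ("old task",)],
)

# UPDATE: change an existing record.
conn.execute("UPDATE tasks SET done = 1 WHERE title = ?", ("write schema",))

# DELETE: remove a record.
conn.execute("DELETE FROM tasks WHERE title = ?", ("old task",))

# SELECT: query for specific information.
remaining = conn.execute("SELECT title, done FROM tasks ORDER BY id").fetchall()
print(remaining)  # [('write schema', 1), ('load data', 0)]
```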
SQL is the bread and butter of database schemas. It's so ubiquitous that you can assume it's what most people are implicitly referring to when they say "database schema."
As the name "Not Only SQL" suggests, NoSQL databases are defined in contrast to the standard choice, relational or SQL databases. They are a less rigid, more dynamic alternative to data tables, favored in certain use cases for their flexibility, speed, ease of use, and ability to scale horizontally.
In fact, it's sort of ironic to discuss NoSQL databases in an entry on database schemas. They're sometimes preferred by data engineers precisely because their flexibility means that they require less upfront planning and data modeling.
There are four main types of NoSQL databases: document databases, key-value databases, wide-column stores, and graph databases. We'll cover them briefly.
Document databases save information in documents that resemble JSON objects, with their key-value pairs. Values can be of different types, such as strings, numbers, booleans, arrays, or objects.
Key-value databases are a simpler type of database where each item is a unique key paired with a value. They're used in applications that handle a high volume of continuous reads and writes.
Wide-column stores store data in tables, rows, and dynamic columns. This type of database is often used for storing large amounts of data with different attributes that need to be accessed quickly.
Graph databases are used for applications built around complex relationships between data points, such as social networks or recommendation systems. They use nodes (data points) connected by edges (relationships) to represent the data structure.
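To compare the four shapes side by side, the sketch below models each one with plain Python data structures. These are stand-ins for illustration only, not the API of any real NoSQL product, and every name and value in it is made up.

```python
import json

# Document: one self-contained, JSON-like record with nested values.
document = {
    "_id": "u1",
    "name": "John Smith",
    "active": True,
    "tags": ["lead", "west-coast"],
    "address": {"city": "Los Angeles", "state": "CA"},
}

# Key-value: an opaque value looked up by a unique key.
key_value = {
    "session:u1": json.dumps({"cart_items": 3}),
    "session:u2": json.dumps({"cart_items": 0}),
}

# Wide-column: rows share a table, but each row may carry different columns.
wide_column = {
    "u1": {"name": "John Smith", "email": "john@example.com"},
    "u2": {"name": "Jane Doe", "phone": "555-0100", "plan": "pro"},
}

# Graph: nodes connected by labeled edges.
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "follows", "carol"),
]

def followed_by(person):
    """Return everyone that `person` follows, in alphabetical order."""
    return sorted(
        target
        for source, relation, target in edges
        if source == person and relation == "follows"
    )

print(document["address"]["city"])                     # Los Angeles
print(json.loads(key_value["session:u1"])["cart_items"])  # 3
print(sorted(wide_column["u2"]))                       # ['name', 'phone', 'plan']
print(followed_by("alice"))                            # ['bob', 'carol']
```

None of these structures was declared ahead of time, which is the flexibility described above. The trade-off is that nothing stops two records from disagreeing about their own shape.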
As you can imagine, each of these database types has its own data modeling conventions and best practices.
Some common schema designs and models
Just as there are distinct schema specifications for particular database types, so too are there common schema models for certain use cases and scenarios. Because most schema designs fall into several tried and true patterns, it's helpful to know, identify and apply them. Below are some of the most commonly used database schema models you're likely to encounter as a data engineer, in order of simplicity.
A flat database schema model is a simple and straightforward structure that stores data in a single table.
Common use cases for a flat database schema include simple data collection tasks, like storing contact information generated from a lead form, or basic transactional data from invoices.
The main advantage of using a flat database schema is its simplicity and ease of use. It requires minimal setup and maintenance, making it ideal for smaller datasets, early-stage startups, or toy projects.
Obviously, a flat database schema is not suitable for larger data sets or substantial computations, so its biggest drawback is that you will soon outgrow it if your application scales.
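A minimal sketch of a flat schema, again in Python with sqlite3, shows both the appeal and the limitation. The invoices table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Everything lives in one table, so customer details repeat on every invoice.
conn.execute("""
    CREATE TABLE invoices (
        invoice_no     INTEGER PRIMARY KEY,
        customer_name  TEXT,
        customer_email TEXT,
        amount         REAL
    )
""")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [
        (1, "John Smith", "john@example.com", 120.0),
        (2, "John Smith", "john@example.com", 80.0),
        (3, "Jane Doe", "jane@example.com", 45.5),
    ],
)

# Simple questions need no joins at all.
total_by_customer = conn.execute(
    "SELECT customer_name, SUM(amount) FROM invoices "
    "GROUP BY customer_name ORDER BY customer_name"
).fetchall()
print(total_by_customer)  # [('Jane Doe', 45.5), ('John Smith', 200.0)]

# The cost of repetition: one changed email means rewriting several rows.
cursor = conn.execute(
    "UPDATE invoices SET customer_email = ? WHERE customer_name = ?",
    ("j.smith@example.com", "John Smith"),
)
rows_touched = cursor.rowcount
print(rows_touched)  # 2
```

With three invoices the duplication is harmless. As the table grows, every repeated value becomes another place for the data to drift out of sync.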
The relational database schema model is the standard table structure we've generally referred to throughout this post. This type of schema is often used for applications that require complex relationships between data points because it allows users to easily join multiple tables together in order to access the data they need.
Typical use cases for a relational database schema include storing customer information, product catalogs, or orders. Although NoSQL graph databases are the go-to for tangled webs of data produced from social networks and recommendation systems, relational database schemas have the range to handle that level of data complexity, too.
Relational database schemas are the go-to data model because of their logical structure, broad application, near-universal adoption, and scalability. However, every row must fit a predefined set of columns, which makes them a suboptimal choice for handling unstructured data like images and videos.
In a star database schema, a primary fact table sits in the middle, and smaller dimensional tables connect to it, forming a "star"-like shape. The center of the star is a single table of "facts," and stores important primary data like sales orders. The secondary dimensional tables all contain attributes of the primary data and link back to the fact table. The entire database revolves around the fact table.
A star data model is optimized for analytical queries over historical data. As a result, star database schemas are the natural choice for data warehouses, data marts, and business intelligence tools concerned with backward-looking data visualizations. Conversely, they are not well-suited for applications that need to handle real-time data, such as live website event streams.
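As a small sketch of the star shape, the Python snippet below builds one fact table and two dimension tables in SQLite. The table names (fact_sales, dim_product, dim_date) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes of the facts.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    -- Fact table: the center of the star, one row per sale.
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        quantity   INTEGER,
        revenue    REAL
    );

    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_date    VALUES (202401, 2024, 1), (202402, 2024, 2);
    INSERT INTO fact_sales  VALUES
        (1, 1, 202401, 2, 2000.0),
        (2, 2, 202401, 1, 300.0),
        (3, 1, 202402, 1, 1000.0);
""")

# A typical backward-looking question: revenue per category per month.
revenue_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date    d ON d.date_id    = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(revenue_by_category)
# [('Electronics', 1, 2000.0), ('Electronics', 2, 1000.0), ('Furniture', 1, 300.0)]
```

Every dimension is exactly one join away from the fact table, which keeps reporting queries like this one short and predictable.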
A snowflake schema is a more complex variation of the star model. Each dimension table can have additional dimensions branching off from it, making the model resemble a literal snowflake. It still has a fact table at its center, but the multiple dimension tables that connect to it can be further normalized into smaller tables.
The snowflake schema trades the star model's simplicity for less redundancy: normalized dimensions store each attribute once, which suits complex, highly dimensional analytics. Think of it like the Ph.D. of database schemas. It's powerful, but it also requires many joins to fetch data, and should not be tried by amateurs!
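To illustrate the extra joins, here is a sketch in Python with sqlite3 in which a product dimension is normalized into its own category table. All table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The category now lives in its own table, branching off the product dimension.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product (
        product_id  INTEGER PRIMARY KEY,
        name        TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        revenue    REAL
    );

    INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Furniture');
    INSERT INTO dim_product  VALUES (1, 'Laptop', 1), (2, 'Phone', 1), (3, 'Desk', 2);
    INSERT INTO fact_sales   VALUES (1, 1, 1000.0), (2, 2, 500.0), (3, 3, 300.0);
""")

# Reaching the category name takes two joins: fact -> product -> category.
revenue_by_category = conn.execute("""
    SELECT c.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(revenue_by_category)  # [('Electronics', 1500.0), ('Furniture', 300.0)]
```

Each category name is stored once instead of once per product, at the price of a longer join chain for every query that needs it.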
And many more...
In this post, we managed to introduce some of the most common database schemas, but the variations are infinite as we optimize a database for its set of data sources, particular use cases, and desired results.
At Product Pair, we understand that every business has unique data needs and requirements. That's why we offer a free consultation to help you determine the best database schema for your particular use case. Our experienced team of data professionals can provide expert advice on the different types of schemas available and how they can be tailored to meet your specific needs. We can also help you identify any potential pitfalls or challenges that may arise during the implementation process. Feel free to schedule a call with us directly, and we'll get you started on the data journey!