Database Normalization and You

This is a repost of a post I wrote years ago (circa 2018). At the time I was working on some This was educational content to teach data scientists some software engineering fundamentals. The normalization examples are still accurate, though my approach to technical writing has evolved since then. Including it for completeness.

Data comes to us in a variety of shapes, sizes and formats. In its raw form, data is unruly, a garbled mess of unusable insight. Within data however lie relationships and connections, a network of disjointed information that tell a cohesive story. In order to gain knowledge from data, we need to make sense of the mangled network.

Normalization seeks to model relationships in data in a way that is easy to follow, easy to combine, and easy to maintain. Normalization is the act of splitting apart a disjointed web of datum and organizing it into a reasonable structure, guided by a set of rules pertaining to the relationships of attributes. These rules, outlined by relational database pioneer Edgar F. Codd, are referred to as the “Normal Forms”.

First Normal Form

In First Normal Form, all attributes within a record set are indivisible and atomic. No one field can contain more than one piece of related information. Let’s take a look at an example. Consider the following relation:

staff_num	staff_name	staff_email
123	John Doe	john.doe@domain.com,jdoe@ajob.org
456	Jane Doe	jane@domain.com,jane@anemployer.net
789	Robert Tables	littlebobbytables@domain.com

Since the staff_email column contains multiple comma-delimited entries, it is not atomic. You would not, for example, be able to easily select distinct employee email addresses.

To shift this to First Normal Form, we need to split the staff_email column from the relation, leaving two related relations.

staff_num	staff_name
123	John Doe
456	Jane Doe
789	Robert Tables

staff_num	staff_email
123	john.doe@domain.com
123	jdoe@anemployer.org
456	jane@domain.com

By moving emails to a separate relation, we can now, through a join or subquery, get unique records containing a staff member and their associated email addresses.

Note: As an aside, the above relation also adheres to Third Normal Form.

Second Normal Form

For a relation to be in Second Normal Form, non-prime attributes (ancillary columns) must be dependent on the candidate key in its entirety, not just a subset thereof. In other words, a field must relate back the every attribute of the candidate key, not just one or two attributes.

Consider the following team/role relation, with a candidate key (project_num, team).

project_num	team	role	team_hq
123	Team 1	User Interface	New York
123	Team 2	Database Design	San Francisco
465	Team 2	API Development	San Francisco

Values in the team_hq column are only dependent on the team attribute of the candidate key. They are in no way dependent on the project number. To move this to Second Normal Form, we need to split the team_hq column from the relation, leaving two related relations.

project_num	team	role
123	Team 1	User Interface
123	Team 2	Database Design
465	Team 2	API Development
789	Team 2	User Interface

team	team_hq
Team 1	New York
Team 2	San Francisco

We can still apply joins to easily retrieve the team’s headquarters, but we no longer have a maintenance burden of updating each record associated with a team if a team were to move.

Third Normal Form

For a relation to be in Third Normal Form, a table must first be in Second Normal Form, and all attributes must be determined only by a candidate key. In other words, a field must be directly attributable to the candidate key. The relation will be free of superfluous content.

Consider the following staff/manager relation, with a surrogate key (staff_num).

staff_num	staff_name	manager_num	manager_name
123	John Doe	987	Sara Manageer
456	Jane Doe	654	Jay Deboss
789	Robert Tables	321	Elle Hefe

While staff_name and manager_num both directly relate to the surrogate key staff_num, manager_name does not.

To move this to Third Normal Form, we need to split the manager_name column from the relation, leaving two related relations.

staff_num	staff_name	manager_num
123	John Doe	987
456	Jane Doe	654
789	Robert Tables	321

manager_num	manager_name
987	Sara Manageer
654	Jay Deboss
321	Elle Hefe

We can still apply joins to easily retrieve the manager’s name, but we no longer have a maintenance burden of updating names when staff members change direct reports.

Boyce-Codd Normal Form

For a relation to be in Boyce-Codd Normal Form, a table must first be in Third Normal Form. Additionally, it must be free of reverse dependencies. In other words, if two attributes on a relation have a Many:One relationship, we can infer the many from the one.

Note: Boyce-Codd Normal Form is somewhere between Third Normal and Fourth Normal Forms. Think of it as Three-and-a-Half Normal Form.

Consider the following staff member to skill and language relation, with a surrogate key (staff_num).

staff_num	language	skill
123	JavaScript	User Interface
123	JavaScript	Data Visualization
456	Python	Data Science
789	Python	API Development

In our hypothetical company, management has required that all User Interface and Data Visualization activities are done using JavaScript. All Data Science and API Development are done in Python. While we know that a language can have multiple components, a skill is attributed to one and only one language. We can glean the language from the skill.

staff_num	skill
123	User Interface
123	Data Visualization
456	Data Science
456	Data Visualization
789	API Development

language	skill
JavaScript	User Interface
JavaScript	Data Visualization
Python	Data Science
JavaScript	Data Visualization
Python	API Development

Through joins we can learn the team member’s language from their skill.

Fourth Normal Form

For a relation to be in Fourth Normal Form, a table must first be in Third Normal Form. The relationship should be free of multivalued dependencies. In other words, querying should not leave any ambiguity or imply any results.

Note: Fourth Normal Form is not regularly encountered.

Consider the following project staffing relation. Let’s assume a project will need one team member to fill each role. In this case, each project has an assigned staff member, as well as an assigned project need. There is no relationship between project need and staff member.

project_num	staff_num	project_need
12345	123	User Interface
12345	456	Data Science
46578	789	API Development
78901	123	Data Visualization

If we were to query project 12345, we’d receive tuples that imply a staff member fulfills a specific role. We’d get (132, User Interface) and (456, Data Science).

To make this less ambiguous and remove an implications, split it into two relations.

project_num	staff_num
12345	123
12345	456
46578	789
78901	123

project_num	project_need
12345	User Interface
12345	Data Science
45678	API Development
78901	Data Visualization

We can then join the two tables on project number to get staffing and project need.

Fifth Normal Form

For a relation to be in Fifth Normal Form, a table must first be in Fourth Normal Form. When rejoining decomposed structures, we should not gain or lose any attributes. So when we query with part of the table, we shouldn’t gain or lose any additional records.

Note: Fifth Normal Form is not regularly encountered.

Consider the following project staffing relation. In this case, a project has assigned staff, and with them come certain abilities. All three attributes combine to form a candidate key.

project_num	staff_num	project_asset
12345	123	User Interface
12345	456	Data Science
45678	789	API Development
78901	123	Data Visualization

Problems arise when we try to add new entries to the relation. If we added a need for Data Visualization to 45678, we’d have a missing staff number attribute. If we were to add a new staff member with two skills, we must add two separate entries.

We can get around this by joining three separate relations. One staff to project, one skill to staff, and one skill to project.

project_num	staff_num
12345	123
12345	456
45678	789
78901	123

staff_num	project_asset
123	User Interface
456	Data Science
789	API Development
123	Data Visualization

project_num	project_asset
12345	User Interface
12345	Data Science
45678	API Development
78901	Data Visualization

So what form should I use?

The normal forms are nothing more than guiding principles to database structure. They are not hard, fast rules or requirements. When you’re developing a database, adherence to normal forms is highly dependent on the project and data at hand, and is entirely up to developers’ discretion