This is a repost of a post I wrote years ago (circa 2018). At the time I was working on some This was educational content to teach data scientists some software engineering fundamentals. The normalization examples are still accurate, though my approach to technical writing has evolved since then. Including it for completeness.
Data comes to us in a variety of shapes, sizes and formats. In its raw form, data is unruly, a garbled mess of unusable insight. Within data however lie relationships and connections, a network of disjointed information that tell a cohesive story. In order to gain knowledge from data, we need to make sense of the mangled network.
Normalization seeks to model relationships in data in a way that is easy to follow, easy to combine, and easy to maintain. Normalization is the act of splitting apart a disjointed web of datum and organizing it into a reasonable structure, guided by a set of rules pertaining to the relationships of attributes. These rules, outlined by relational database pioneer Edgar F. Codd, are referred to as the “Normal Forms”.
First Normal Form
In First Normal Form, all attributes within a record set are indivisible and atomic. No one field can contain more than one piece of related information. Let’s take a look at an example. Consider the following relation:
| staff_num | staff_name | staff_email |
|---|---|---|
| 123 | John Doe | john.doe@domain.com,jdoe@ajob.org |
| 456 | Jane Doe | jane@domain.com,jane@anemployer.net |
| 789 | Robert Tables | littlebobbytables@domain.com |
Since the staff_email column contains multiple comma-delimited entries, it is not atomic. You would not, for example, be able to easily select distinct employee email addresses.
To shift this to First Normal Form, we need to split the staff_email column from the relation, leaving two related relations.
| staff_num | staff_name |
|---|---|
| 123 | John Doe |
| 456 | Jane Doe |
| 789 | Robert Tables |
| staff_num | staff_email |
|---|---|
| 123 | john.doe@domain.com |
| 123 | jdoe@anemployer.org |
| 456 | jane@domain.com |
By moving emails to a separate relation, we can now, through a join or subquery, get unique records containing a staff member and their associated email addresses.
Note: As an aside, the above relation also adheres to Third Normal Form.
Second Normal Form
For a relation to be in Second Normal Form, non-prime attributes (ancillary columns) must be dependent on the candidate key in its entirety, not just a subset thereof. In other words, a field must relate back the every attribute of the candidate key, not just one or two attributes.
Consider the following team/role relation, with a candidate key (project_num, team).
| project_num | team | role | team_hq |
|---|---|---|---|
| 123 | Team 1 | User Interface | New York |
| 123 | Team 2 | Database Design | San Francisco |
| 465 | Team 2 | API Development | San Francisco |
Values in the team_hq column are only dependent on the team attribute of the candidate key. They are in no way dependent on the project number. To move this to Second Normal Form, we need to split the team_hq column from the relation, leaving two related relations.
| project_num | team | role |
|---|---|---|
| 123 | Team 1 | User Interface |
| 123 | Team 2 | Database Design |
| 465 | Team 2 | API Development |
| 789 | Team 2 | User Interface |
| team | team_hq |
|---|---|
| Team 1 | New York |
| Team 2 | San Francisco |
We can still apply joins to easily retrieve the team’s headquarters, but we no longer have a maintenance burden of updating each record associated with a team if a team were to move.
Third Normal Form
For a relation to be in Third Normal Form, a table must first be in Second Normal Form, and all attributes must be determined only by a candidate key. In other words, a field must be directly attributable to the candidate key. The relation will be free of superfluous content.
Consider the following staff/manager relation, with a surrogate key (staff_num).
| staff_num | staff_name | manager_num | manager_name |
|---|---|---|---|
| 123 | John Doe | 987 | Sara Manageer |
| 456 | Jane Doe | 654 | Jay Deboss |
| 789 | Robert Tables | 321 | Elle Hefe |
While staff_name and manager_num both directly relate to the surrogate key staff_num, manager_name does not.
To move this to Third Normal Form, we need to split the manager_name column from the relation, leaving two related relations.
| staff_num | staff_name | manager_num |
|---|---|---|
| 123 | John Doe | 987 |
| 456 | Jane Doe | 654 |
| 789 | Robert Tables | 321 |
| manager_num | manager_name |
|---|---|
| 987 | Sara Manageer |
| 654 | Jay Deboss |
| 321 | Elle Hefe |
We can still apply joins to easily retrieve the manager’s name, but we no longer have a maintenance burden of updating names when staff members change direct reports.
Boyce-Codd Normal Form
For a relation to be in Boyce-Codd Normal Form, a table must first be in Third Normal Form. Additionally, it must be free of reverse dependencies. In other words, if two attributes on a relation have a Many:One relationship, we can infer the many from the one.
Note: Boyce-Codd Normal Form is somewhere between Third Normal and Fourth Normal Forms. Think of it as Three-and-a-Half Normal Form.
Consider the following staff member to skill and language relation, with a surrogate key (staff_num).
| staff_num | language | skill |
|---|---|---|
| 123 | JavaScript | User Interface |
| 123 | JavaScript | Data Visualization |
| 456 | Python | Data Science |
| 789 | Python | API Development |
In our hypothetical company, management has required that all User Interface and Data Visualization activities are done using JavaScript. All Data Science and API Development are done in Python. While we know that a language can have multiple components, a skill is attributed to one and only one language. We can glean the language from the skill.
| staff_num | skill |
|---|---|
| 123 | User Interface |
| 123 | Data Visualization |
| 456 | Data Science |
| 456 | Data Visualization |
| 789 | API Development |
| language | skill |
|---|---|
| JavaScript | User Interface |
| JavaScript | Data Visualization |
| Python | Data Science |
| JavaScript | Data Visualization |
| Python | API Development |
Through joins we can learn the team member’s language from their skill.
Fourth Normal Form
For a relation to be in Fourth Normal Form, a table must first be in Third Normal Form. The relationship should be free of multivalued dependencies. In other words, querying should not leave any ambiguity or imply any results.
Note: Fourth Normal Form is not regularly encountered.
Consider the following project staffing relation. Let’s assume a project will need one team member to fill each role. In this case, each project has an assigned staff member, as well as an assigned project need. There is no relationship between project need and staff member.
| project_num | staff_num | project_need |
|---|---|---|
| 12345 | 123 | User Interface |
| 12345 | 456 | Data Science |
| 46578 | 789 | API Development |
| 78901 | 123 | Data Visualization |
If we were to query project 12345, we’d receive tuples that imply a staff member fulfills a specific role. We’d get (132, User Interface) and (456, Data Science).
To make this less ambiguous and remove an implications, split it into two relations.
| project_num | staff_num |
|---|---|
| 12345 | 123 |
| 12345 | 456 |
| 46578 | 789 |
| 78901 | 123 |
| project_num | project_need |
|---|---|
| 12345 | User Interface |
| 12345 | Data Science |
| 45678 | API Development |
| 78901 | Data Visualization |
We can then join the two tables on project number to get staffing and project need.
Fifth Normal Form
For a relation to be in Fifth Normal Form, a table must first be in Fourth Normal Form. When rejoining decomposed structures, we should not gain or lose any attributes. So when we query with part of the table, we shouldn’t gain or lose any additional records.
Note: Fifth Normal Form is not regularly encountered.
Consider the following project staffing relation. In this case, a project has assigned staff, and with them come certain abilities. All three attributes combine to form a candidate key.
| project_num | staff_num | project_asset |
|---|---|---|
| 12345 | 123 | User Interface |
| 12345 | 456 | Data Science |
| 45678 | 789 | API Development |
| 78901 | 123 | Data Visualization |
Problems arise when we try to add new entries to the relation. If we added a need for Data Visualization to 45678, we’d have a missing staff number attribute. If we were to add a new staff member with two skills, we must add two separate entries.
We can get around this by joining three separate relations. One staff to project, one skill to staff, and one skill to project.
| project_num | staff_num |
|---|---|
| 12345 | 123 |
| 12345 | 456 |
| 45678 | 789 |
| 78901 | 123 |
| staff_num | project_asset |
|---|---|
| 123 | User Interface |
| 456 | Data Science |
| 789 | API Development |
| 123 | Data Visualization |
| project_num | project_asset |
|---|---|
| 12345 | User Interface |
| 12345 | Data Science |
| 45678 | API Development |
| 78901 | Data Visualization |
So what form should I use?
The normal forms are nothing more than guiding principles to database structure. They are not hard, fast rules or requirements. When you’re developing a database, adherence to normal forms is highly dependent on the project and data at hand, and is entirely up to developers’ discretion