Database
normalization:
Database
normalization is the process of organizing the attributes and tables of a
relational database to minimize data redundancy.
Primarily,
purpose of normalization is the following:
1.
Arranging data into logical groupings such that
each group describes a small part of the whole system. In other words, refactoring
a table into smaller tables without losing information.
2.
Eliminating/minimizing redundant data stored in
the database.
3.
Organizing the data such that when you modify
it, you make change only at one place.
4.
Building a database in which you can access and
manipulate the data quickly and efficiently without compromising the integrity
of the data in storage.
A
typical example of normalization is that an entity’s unique ID is stored
everywhere in the system but its name is held only in one table. The name can
be updated more easily in one row of one table.
A typical update in such an example would be Symantec Company changing
its CEO to Michael Brown. The update would be done in one place and immediately
the correct CEO name would be displayed throughout the system.
Why do we need Normalization?
Without
normalization database systems can be inaccurate, slow, inefficient and they
might not produce the data you expect. Insertion, Updation and Deletion
anomalies are very frequent if database is not normalized.
Lets
take an example of student and courses he/she is enrolled in.
|
StudentID
|
StudentName
|
StudentAddress
|
StudentCity
|
Courses_enrolled
|
|
101
|
Alex
|
12
Ellis st
|
Milpitas
|
Python
|
|
102
|
Adam
|
310
Yosemite ct
|
Palo
Alto
|
Web
services
|
|
103
|
Kat
|
99
Buena vista ave
|
Cupertino
|
Java
|
|
101
|
Alex
|
12
Ellis st
|
Milpitas
|
Java
|
Update Anomaly: The same information is
expressed on multiple rows on the above table. A change of address for a
particular student will potentially need to be applied to multiple records (one
for each course). If the update doesn’t happen successfully, we could leave the
data in inconsistent state.
Insertion Anomaly: Suppose for a new
admission, we have a studentID, name, address, city of the student but if the
student has not yet enrolled for any courses then we have to insert null there,
leading to insertion anomaly.
Deletion Anomaly: If student ID:102 who
has enrolled only in one course and temporarily he drops it, when we delete the
row, entire student information will be deleted along with it.
Normalization
Rule:
Normalization
rules are divided into the following normal form:
1.
First Normal Form (1NF)
2.
Second Normal Form (2NF)
3.
Third Normal Form (3NF)
4.
Boyce and Codd (BCNF)
5.
Forth Normal Form (4NF)
6.
Fifth Normal Form (5NF)
1. First Normal Form – No repeating
elements or group of elements.
A.
It contains only atomic values – An atomic value is a value, which cannot be
divided. For eg in the table below studentid 101 has enrolled in two courses
namely Python and Java. Hence the table below is not in 1NF.
|
StudentID
|
StudentName
|
StudentAddress
|
StudentCity
|
Courses_enrolled
|
|
101
|
Alex
|
12
Ellis st
|
Milpitas
|
Python,
Java
|
|
102
|
Adam
|
310
Yosemite ct
|
Palo
Alto
|
Web
services
|
|
103
|
Kat
|
99
Buena vista ave
|
Cupertino
|
Java
|
In
the first normal form, the previous table can be represented in the following
way:
|
StudentID
|
StudentName
|
StudentAddress
|
StudentCity
|
Courses_enrolled
|
|
101
|
Alex
|
12
Ellis st
|
Milpitas
|
Python
|
|
102
|
Adam
|
310
Yosemite ct
|
Palo
Alto
|
Web
services
|
|
101
|
Alex
|
12
Ellis st
|
Milpitas
|
Java
|
|
103
|
Kat
|
99
Buena vista ave
|
Cupertino
|
Java
|
Using
the first normal form, data redundancy increases, as there will be many columns
with same data in multiple rows but each row as a whole will be unique.
A primary
key is a column (or group of columns) that uniquely identifies each row. In the
above table, there is no single column that uniquely identifies each row.
However, if we put a number of columns together we can satisfy this
requirement. The two columns that together uniquely identify each row are
StudentID and Courses_enrolled. Therefore, together they qualify to be used as
the table’s primary key. However, this database design does not meet the more
stringent requirements of the second normal form.
B.
There are no repeating groups - A repeating group means that a table contains
two or more columns that are closely related. For eg the table below records a
student who has enrolled in two courses in course1 and course2. This table is
not in 1NF because course1, course2 are all repeating the same attribute.
|
StudentID
|
StudentName
|
StudentAddress
|
StudentCity
|
Course1
|
Course2
|
|
101
|
Alex
|
12
Ellis st
|
Milpitas
|
Python
|
Java
|
|
102
|
Adam
|
310
Yosemite ct
|
Palo
Alto
|
Web
services
|
Null
|
|
103
|
Kat
|
99
Buena vista ave
|
Cupertino
|
Java
|
Null
|
In
the first normal form, the previous table can be represented in the following
way:
|
StudentID
|
StudentName
|
StudentAddress
|
StudentCity
|
Courses_enrolled
|
|
101
|
Alex
|
12
Ellis st
|
Milpitas
|
Python
|
|
102
|
Adam
|
310
Yosemite ct
|
Palo
Alto
|
Web
services
|
|
101
|
Alex
|
12
Ellis st
|
Milpitas
|
Java
|
|
103
|
Kat
|
99
Buena vista ave
|
Cupertino
|
Java
|
The
First Normal Form addresses two issues:
A.
A row of data cannot contain repeating groups of
similar data (atomicity) and
B.
Each row of data must have a unique identifier
(Primary key)
2. Second Normal Form: No partial dependency
on a concatenated key.
For
second normal form, we test the table for partial dependencies on a
concatenated key. This means that for a table that has a concatenated primary
key, each column in the table that is not part of the primary
key must depend upon the entire concatenated key for its
existence. If any column only depends upon one part of the concatenated key,
then we say that the entire table has failed Second Normal Form and we must
create another table to rectify the failure.
|
StudentID
|
StudentName
|
StudentAddress
|
StudentCity
|
Courses_enrolled
|
|
101
|
Alex
|
12
Ellis st
|
Milpitas
|
Python
|
|
102
|
Adam
|
310
Yosemite ct
|
Palo
Alto
|
Web
services
|
|
101
|
Alex
|
12
Ellis st
|
Milpitas
|
Java
|
|
103
|
Kat
|
99
Buena vista ave
|
Cupertino
|
Java
|
In
the above table, StudentID and Courses_enrolled is the concatenated primary
key.
The
table has two rows for StudentID 101 to include multiple courses the student
has enrolled for. While this is searchable and follows 1NF, it is inefficient
use of space. Also, attributes like
Student Name, Address, City depends on Student ID and not on the Courses he has
enrolled for. To achieve 2NF, we will have to split the tables as follows:
Student
table
|
StudentID
|
StudentName
|
StudentAddress
|
StudentCity
|
|
101
|
Alex
|
12
Ellis st
|
Milpitas
|
|
102
|
Adam
|
310
Yosemite ct
|
Palo
Alto
|
|
103
|
Kat
|
99
Buena vista ave
|
Cupertino
|
Courses
table
|
StudentID
|
Courses_enrolled
|
|
101
|
Python
|
|
101
|
Java
|
|
103
|
Java
|
|
102
|
Webservices
|
Now,
both the tables qualify for the Second normal form and will not suffer from
update anomalies.