Sunday, April 12, 2015

Database - Normalization and De-normalization concepts

*In progress* 

Database normalization:
Database normalization is the process of organizing the attributes and tables of a relational database to minimize data redundancy.

Primarily, purpose of normalization is the following:
1.     Arranging data into logical groupings such that each group describes a small part of the whole system. In other words, refactoring a table into smaller tables without losing information.
2.     Eliminating/minimizing redundant data stored in the database.
3.     Organizing the data such that when you modify it, you make change only at one place.
4.     Building a database in which you can access and manipulate the data quickly and efficiently without compromising the integrity of the data in storage.

A typical example of normalization is that an entity’s unique ID is stored everywhere in the system but its name is held only in one table. The name can be updated more easily in one row of one table.  A typical update in such an example would be Symantec Company changing its CEO to Michael Brown. The update would be done in one place and immediately the correct CEO name would be displayed throughout the system.

Why do we need Normalization?
Without normalization database systems can be inaccurate, slow, inefficient and they might not produce the data you expect. Insertion, Updation and Deletion anomalies are very frequent if database is not normalized.
Lets take an example of student and courses he/she is enrolled in.

StudentID
StudentName
StudentAddress
StudentCity
Courses_enrolled
101
Alex
12 Ellis st
Milpitas
Python
102
Adam
310 Yosemite ct
Palo Alto
Web services
103
Kat
99 Buena vista ave
Cupertino
Java
101
Alex
12 Ellis st
Milpitas
Java

Update Anomaly: The same information is expressed on multiple rows on the above table. A change of address for a particular student will potentially need to be applied to multiple records (one for each course). If the update doesn’t happen successfully, we could leave the data in inconsistent state.

Insertion Anomaly: Suppose for a new admission, we have a studentID, name, address, city of the student but if the student has not yet enrolled for any courses then we have to insert null there, leading to insertion anomaly.

Deletion Anomaly: If student ID:102 who has enrolled only in one course and temporarily he drops it, when we delete the row, entire student information will be deleted along with it.

Normalization Rule:

Normalization rules are divided into the following normal form:
1.     First Normal Form (1NF)
2.     Second Normal Form (2NF)
3.     Third Normal Form (3NF)
4.     Boyce and Codd (BCNF)
5.     Forth Normal Form (4NF)
6.     Fifth Normal Form (5NF)

1. First Normal Form – No repeating elements or group of elements.
A. It contains only atomic values – An atomic value is a value, which cannot be divided. For eg in the table below studentid 101 has enrolled in two courses namely Python and Java. Hence the table below is not in 1NF.

StudentID
StudentName
StudentAddress
StudentCity
Courses_enrolled
101
Alex
12 Ellis st
Milpitas
Python, Java
102
Adam
310 Yosemite ct
Palo Alto
Web services
103
Kat
99 Buena vista ave
Cupertino
Java

In the first normal form, the previous table can be represented in the following way:

StudentID
StudentName
StudentAddress
StudentCity
Courses_enrolled
101
Alex
12 Ellis st
Milpitas
Python
102
Adam
310 Yosemite ct
Palo Alto
Web services
101
Alex
12 Ellis st
Milpitas
Java
103
Kat
99 Buena vista ave
Cupertino
Java

Using the first normal form, data redundancy increases, as there will be many columns with same data in multiple rows but each row as a whole will be unique.
A primary key is a column (or group of columns) that uniquely identifies each row. In the above table, there is no single column that uniquely identifies each row. However, if we put a number of columns together we can satisfy this requirement. The two columns that together uniquely identify each row are StudentID and Courses_enrolled. Therefore, together they qualify to be used as the table’s primary key. However, this database design does not meet the more stringent requirements of the second normal form.

B. There are no repeating groups - A repeating group means that a table contains two or more columns that are closely related. For eg the table below records a student who has enrolled in two courses in course1 and course2. This table is not in 1NF because course1, course2 are all repeating the same attribute.

StudentID
StudentName
StudentAddress
StudentCity
Course1
Course2
101
Alex
12 Ellis st
Milpitas
Python
 Java
102
Adam
310 Yosemite ct
Palo Alto
Web services
Null
103
Kat
99 Buena vista ave
Cupertino
Java
Null


In the first normal form, the previous table can be represented in the following way:

StudentID
StudentName
StudentAddress
StudentCity
Courses_enrolled
101
Alex
12 Ellis st
Milpitas
Python
102
Adam
310 Yosemite ct
Palo Alto
Web services
101
Alex
12 Ellis st
Milpitas
Java
103
Kat
99 Buena vista ave
Cupertino
Java

The First Normal Form addresses two issues:
A.     A row of data cannot contain repeating groups of similar data (atomicity) and
B.     Each row of data must have a unique identifier (Primary key)

2. Second Normal Form: No partial dependency on a concatenated key.

For second normal form, we test the table for partial dependencies on a concatenated key. This means that for a table that has a concatenated primary key, each column in the table that is not part of the primary key must depend upon the entire concatenated key for its existence. If any column only depends upon one part of the concatenated key, then we say that the entire table has failed Second Normal Form and we must create another table to rectify the failure.


StudentID
StudentName
StudentAddress
StudentCity
Courses_enrolled
101
Alex
12 Ellis st
Milpitas
Python
102
Adam
310 Yosemite ct
Palo Alto
Web services
101
Alex
12 Ellis st
Milpitas
Java
103
Kat
99 Buena vista ave
Cupertino
Java

In the above table, StudentID and Courses_enrolled is the concatenated primary key.
The table has two rows for StudentID 101 to include multiple courses the student has enrolled for. While this is searchable and follows 1NF, it is inefficient use of space.  Also, attributes like Student Name, Address, City depends on Student ID and not on the Courses he has enrolled for. To achieve 2NF, we will have to split the tables as follows:

Student table
StudentID
StudentName
StudentAddress
StudentCity
101
Alex
12 Ellis st
Milpitas
102
Adam
310 Yosemite ct
Palo Alto
103
Kat
99 Buena vista ave
Cupertino

Courses table
StudentID
Courses_enrolled
101
Python
101
Java
103
Java
102
Webservices

Now, both the tables qualify for the Second normal form and will not suffer from update anomalies.




1 comment:

  1. Very nice posting. Your article us quite informative. Thanks for the same. Our service also helps you to market your products with various marketing strategies, right from emails to social media. Whether you seek to increase ROI or drive higher efficiencies at lower costs, Pegasi Media Group is your committed partner will provide b2bleads.
    ATG Users

    ReplyDelete