Geek In Me!: April 2015

*In progress*

Database normalization:

Database normalization is the process of organizing the attributes and tables of a relational database to minimize data redundancy.

Primarily, purpose of normalization is the following:

1. Arranging data into logical groupings such that each group describes a small part of the whole system. In other words, refactoring a table into smaller tables without losing information.

2. Eliminating/minimizing redundant data stored in the database.

3. Organizing the data such that when you modify it, you make change only at one place.

4. Building a database in which you can access and manipulate the data quickly and efficiently without compromising the integrity of the data in storage.

A typical example of normalization is that an entity’s unique ID is stored everywhere in the system but its name is held only in one table. The name can be updated more easily in one row of one table. A typical update in such an example would be Symantec Company changing its CEO to Michael Brown. The update would be done in one place and immediately the correct CEO name would be displayed throughout the system.

Why do we need Normalization?

Without normalization database systems can be inaccurate, slow, inefficient and they might not produce the data you expect. Insertion, Updation and Deletion anomalies are very frequent if database is not normalized.

Lets take an example of student and courses he/she is enrolled in.

StudentID	StudentName	StudentAddress	StudentCity	Courses_enrolled
101	Alex	12 Ellis st	Milpitas	Python
102	Adam	310 Yosemite ct	Palo Alto	Web services
103	Kat	99 Buena vista ave	Cupertino	Java
101	Alex	12 Ellis st	Milpitas	Java

Update Anomaly: The same information is expressed on multiple rows on the above table. A change of address for a particular student will potentially need to be applied to multiple records (one for each course). If the update doesn’t happen successfully, we could leave the data in inconsistent state.

Insertion Anomaly: Suppose for a new admission, we have a studentID, name, address, city of the student but if the student has not yet enrolled for any courses then we have to insert null there, leading to insertion anomaly.

Deletion Anomaly: If student ID:102 who has enrolled only in one course and temporarily he drops it, when we delete the row, entire student information will be deleted along with it.

Normalization Rule:

Normalization rules are divided into the following normal form:

1. First Normal Form (1NF)

2. Second Normal Form (2NF)

3. Third Normal Form (3NF)

4. Boyce and Codd (BCNF)

5. Forth Normal Form (4NF)

6. Fifth Normal Form (5NF)

1. First Normal Form – No repeating elements or group of elements.

A. It contains only atomic values – An atomic value is a value, which cannot be divided. For eg in the table below studentid 101 has enrolled in two courses namely Python and Java. Hence the table below is not in 1NF.

StudentID	StudentName	StudentAddress	StudentCity	Courses_enrolled
101	Alex	12 Ellis st	Milpitas	Python, Java
102	Adam	310 Yosemite ct	Palo Alto	Web services
103	Kat	99 Buena vista ave	Cupertino	Java

In the first normal form, the previous table can be represented in the following way:

StudentID	StudentName	StudentAddress	StudentCity	Courses_enrolled
101	Alex	12 Ellis st	Milpitas	Python
102	Adam	310 Yosemite ct	Palo Alto	Web services
101	Alex	12 Ellis st	Milpitas	Java
103	Kat	99 Buena vista ave	Cupertino	Java

Using the first normal form, data redundancy increases, as there will be many columns with same data in multiple rows but each row as a whole will be unique.

A primary key is a column (or group of columns) that uniquely identifies each row. In the above table, there is no single column that uniquely identifies each row. However, if we put a number of columns together we can satisfy this requirement. The two columns that together uniquely identify each row are StudentID and Courses_enrolled. Therefore, together they qualify to be used as the table’s primary key. However, this database design does not meet the more stringent requirements of the second normal form.

B. There are no repeating groups - A repeating group means that a table contains two or more columns that are closely related. For eg the table below records a student who has enrolled in two courses in course1 and course2. This table is not in 1NF because course1, course2 are all repeating the same attribute.

StudentID	StudentName	StudentAddress	StudentCity	Course1	Course2
101	Alex	12 Ellis st	Milpitas	Python	Java
102	Adam	310 Yosemite ct	Palo Alto	Web services	Null
103	Kat	99 Buena vista ave	Cupertino	Java	Null

In the first normal form, the previous table can be represented in the following way:

StudentID	StudentName	StudentAddress	StudentCity	Courses_enrolled
101	Alex	12 Ellis st	Milpitas	Python
102	Adam	310 Yosemite ct	Palo Alto	Web services
101	Alex	12 Ellis st	Milpitas	Java
103	Kat	99 Buena vista ave	Cupertino	Java

The First Normal Form addresses two issues:

A. A row of data cannot contain repeating groups of similar data (atomicity) and

B. Each row of data must have a unique identifier (Primary key)

2. Second Normal Form: No partial dependency on a concatenated key.

For second normal form, we test the table for partial dependencies on a concatenated key. This means that for a table that has a concatenated primary key, each column in the table that is not part of the primary key must depend upon the entire concatenated key for its existence. If any column only depends upon one part of the concatenated key, then we say that the entire table has failed Second Normal Form and we must create another table to rectify the failure.

StudentID	StudentName	StudentAddress	StudentCity	Courses_enrolled
101	Alex	12 Ellis st	Milpitas	Python
102	Adam	310 Yosemite ct	Palo Alto	Web services
101	Alex	12 Ellis st	Milpitas	Java
103	Kat	99 Buena vista ave	Cupertino	Java

In the above table, StudentID and Courses_enrolled is the concatenated primary key.

The table has two rows for StudentID 101 to include multiple courses the student has enrolled for. While this is searchable and follows 1NF, it is inefficient use of space. Also, attributes like Student Name, Address, City depends on Student ID and not on the Courses he has enrolled for. To achieve 2NF, we will have to split the tables as follows:

Student table

StudentID	StudentName	StudentAddress	StudentCity
101	Alex	12 Ellis st	Milpitas
102	Adam	310 Yosemite ct	Palo Alto
103	Kat	99 Buena vista ave	Cupertino

Courses table

StudentID	Courses_enrolled
101	Python
101	Java
103	Java
102	Webservices

Now, both the tables qualify for the Second normal form and will not suffer from update anomalies.

References: http://en.wikipedia.org/wiki/Database_normalization

Geek In Me!

Sunday, April 12, 2015

Database - Normalization and De-normalization concepts