Database Normalization … taming the wild horse

For any application to use database and hence data effectively it is very important that data is organized properly. The most common problem for database is redundancy in data. If table has highly redundant data, that means for every operation (like insert, update or delete) it takes more effort to find correct record. And it can be even troublesome when we simply search for record and we get redundant results which is just waste of space and time. To overcome this problem of redundancy in data, a specific procedure is developed which is called “Normalization” . As we know, in RDBMSes (like SQL Server, Oracle, MySQL) data is stored based on relation. And this relations are used in process of normalization to reduce redundancy in data and hence improving overall performance of system.

Unfortunately, normalization is the process that can be used to properly arrange data is applied during very initial phase of database designing which is called logical modeling. But, most of the time when you realize that you have screwed up and have not properly optimized database it is already late. The concept of normalization was firstly introduced by E.F. Codd, who is known to us as an inventor of relational database management system. Actually he published a paper containing 12 rules to define what is relational database actually meant. All modern DBMSes follow these rules and if we consider while designing database, it can have significant effect later.

Concepts of normalization can be explained with following example. Consider following table which shows student details including chosen subject and advisor for that subject.

Un-Normalized Data

Student#	Advisor	Adv.-Room	Class1	Class2	Class3
1022	Jones	412	101-07	143-01	159-02
4123	Smith	216	201-01	211-02	214-01

First Normalization (1NF) : It is very basic rule of normalization. It dis-allows multivalued attributes & composite attributes. So, above shown table has to be rearranged as shown below.

Student#	Advisor	Adv.-Room	Class#
1022	Jones	412	101-07
1022	Jones	412	143-01
1022	Jones	412	159-02
4123	Smith	216	201-01
4123	Smith	216	211-02

But as you can see that data is still redundant

Second Normalization (2NF) : It states that each non key values must be fully functional dependent on primary key or composite primary key (in case of not having single unique key). And data should also satisfy 1NF. And they should be placed in different table. In above table (1NF), we can see that class# is not fully dependent on primary key (which is student#), so we can create two tables Students, and Registration by separating attributes.

Student#	Advisor	Adv.-Room
students	table
1022	Jones	412
4123	Smith	216

Student#	Class#
Registration	table
1022	101-07
1022	143-01
1022	159-02
4123	201-01
4123	211-02

Third Normalization Form (3NF) : It has been said that, an average human analyzes and arranges data in 3rd normalized form by nature. So, if you look at the design sometimes you may “feel” that something is wrong (but to be honest, it needs more than feeling heart_broken ). Third normalization prohibits transitive dependency. Also, data should satisfy 2NF conditions too. For example, in above table, Advisor Room is not directly dependent on Primary key Students. Actually relation is something like, Advisor Room –> Advisor –> Student# , which is a transitive dependency. So, we need to separate that column and put it in different table.

So finally we will have two tables from Student Table, Students and Advisor.

Student#	Advisor
Student:
1022	Jones
4123	Smith

Advisor	Adv.-Room
Advisor:
Jones	412
Smith	216

An so in the end, we finally have 3 tables, Student, Advisor and Registration. And as we can see, the data which was once highly redundant, is now properly arranged and less redundant. Which naturally solved problem we discussed at the beginning of this post, like insert, update, delete and ofcourse search.

Normal forms can be informally defined as, the key, the whole key and nothing but the key. (does it remind of something ??)

Actually we don’t need to further normalize our data, but if there further normalization techniques are available, which are called, Fourth Normalized form (4NF), Fifth Normalized Form (5NF), Boyce-Codd Normal Form (BCNF) and Sixth Normalization Form (6NF). But as a note you should remember that, as data gets highly normalized number of joins to get required results gets increased and that can seriously raise response time (as joins are in general very expensive operations.), so in general we don’t see normalization beyond 3NF. Even in some cases when allowed, Database Architects prefer to denormalize (reversing normalization by joining tables) which may improve performance with introduction to redundant data.

This is enough for today.

Its Just A Thought … fingerscrossed

Gaurang Sign

Leave a Reply