Indexing in simple term can be explained as a pointer to some specific information. It works nearly same as indexes at end of book, just fine the keyword you are looking for at index and get the page number where it is located (when I was studying,
mostly sometimes I thought that best way to read a book is to begin from index , because it gives you precise location of information).
If you ask any developer or DBA about importance of indexes in database then at first they will look at you like or … and then probably they will lecture you for next 1 hour about how it is really important to have good indexes (are there bad ones too ? … of course yes) for database. I will talk about bad indexes at end, but first why we really need them ??
Well, basic use of any DBMS is to store data, right ?? And as I said, index is basically pointed to some information or in our case some record. So basically when we execute any query, Query Optimizer of Database Engine first looks for existence of any Index for the column on which we are performing operation (NOTE: before running ANY query, Query Optimizer generates a plan to execute query called EXECUTION PLAN, which is used to analyze/troubleshoot performance of query)!!! And if index exists on column in question then it becomes really easy (generally this is process, but things can get complicated at complex queries) to get results … by easy means execution time of that query reduces significantly. Of course if you have used SQL Server or even any DBMS then you know that there is no general answer for any question and you will hear “it depends” a lot from senior guys … which sometimes makes you doubtful about their
qualification understanding … but they are correct, I mean even though I am writing it now there are many “it depends” scenario we can talk about but then this post will be more like epic then simple introduction … but I will try to mention some points in the end of post.
So, how we decide where we need to create an index ?? Answer is simple (mostly), you look for the column/s that you query the most and create index on that column. And you will instantly see performance improvement in your query (again it depends )… Now, I use SQL Server mostly so I am gonna write about different type of Indexes (yes there are more of them) offered by SQL Server.
1. Clustered Index: In general clustered indexes are created (or automatically created) on Primary key of a table (and again “it depends”) … Primary Key can be defined as a unique column in whole table which has non-repeating (and NOT NULL) values in rows. For example we have Employee table in database, then most common Primary Key can be assumed as Employee ID (yes not employee SSN, because of security, and other reasons). So, no two employee will have same emplyeeID and that will make it unique, non-repeating, non-null value. So we create an index on employee ID. When we create clustered index on column, whole table data physically gets sorted and stored with reference to that key value of index in B-Tree. And data itself is stored at Datapages of last Leaf Nodes of B-Tree (B-Tree is used because of its superior performance for overall search, sort, update, delete, insert operations.). As data is physically gets sorted and stored, it is not possible to have more then 1 clustered index per table.
2. Non Clustered Index: Well, this is another type of index created on columns. This type of index is generally created on columns that are used frequently in query (and/or on candidate keys), generally NOT created on Primary Keys. So, if take previous example of Employee table, then you can use Manager ID or Department ID (lots of IDs, because table is highly normalized and each Entity that doesn’t depend on Key value is separated). When we define non Clustered index, data doesn’t get stored or sorted at all, and so leaf node of B-tree contains only pointer to data pages of Clustered index or a Heap (coming next). And yes, we can create n numbers of non-clustered index (technically that number is 999)
3. Heap Tables: In practical scenarios we at least create Clustered Index on a table, but there are cases were user might want to skip the creation part or when it is removed for re-arranging data in data pages or for other maintenance tasks. So, in short the table without any clustered index is called HEAP. If you have experience with SQL Development then you might think why in the name of holy cow I want my table without any Clustered Index ?? But believe me, if chosen wisely it has some real advantage … you will understand more after reading next paragraph…
So overall having Clustered index and Non-Clustered Indexes on table is really good for query performance. But my friend there is nothing like “free lunch” in this world … remember I was talking about bad indexes ?? well now its time to explain it. As I already explained, when we create non-clustered index data is no data is physically stored, all information is accessed via pointers residing at leaf node of non-clustered index pointing towards data pages of clustered index. But, still non-clustered indexes takes up some place to store all that information. So higher the number more additional space we need to store non-clustered index information. Also, when we do any DML operation (insert, update or delete) on table, corresponding indexes too have to update that information in their b-tree structure. So as you can imagine if we have highly transactional OLTP (Online Transactional Processing) Database where record changes at very high rate per minute, this whole non-cluster index update process takes up high number of resources and the situation gets worse when you have created high number of indexes on table …. so I say it again … there is nothing like “free lunch” in this world … … and this are two main reasons why you should avoid creating lots of un-necessary indexes (there are ways to find required index too, via DMVs… that is for some other time) … And so it is the reason for some cases were HEAP performs well.
Also remember that having proper index surely gives good performance but in some cases it may not give results (performance wise) as we expect. The main reason for that is, indexes are only good for their key value (column on which we have defined index), so if our query is NOT using that key value then we will not see performance improvement. Also there are cases were index scans are done (which is in a way equivalent to Table scans) which are not good in prospect of performance. In these cases we create some special kind of indexes called Covered Index, but lets leave it for some other time. Also, I personally believe that to have good understanding of how Indexes are useful or how they work, you should know better understanding of B-Tree storage structure and SQL Server storage architecture.
I think this is enough for this time. May be sometime I will write more about indexes as they are vital part of any DBMS to perform good.
Its Just a Thought …