What is an index – a theory (part 1/4 – dense and sparse indexes)

Since I had written a few posts about indexes before, I thought it would be good to shed some light on theory behind indexes in SQL Server. It would be divided into four parts:

  1. Dense and sparse indexes (this post).
  2. B-trees – searching for a value and operation efficiency.
  3. B-trees – inserting an item, operation cost and page splits.
  4. B-trees – deleting an item, operation cost and page merge.

By no means I am comfortable with saying that it’s a definitive source of knowledge about indexes in SQL Server. It’s a result of research conducted by me based on various sources. If you are interested to know more, try reading those books first:

  • Molina, Ullman Database Systems – The Complete Book, chapter 13
  • Knuth The Art of Computer Programming, chapter 6.2.4
  • Cormen, Leiserson, Rivest, Stein Introduction to Algorithms, chapter 18

Whenever I’m using online content by other authors, it’s always used by their consent.

Let’s start with a definition of an index. An index is a data structure that takes a property of records as input – typically the value of one or more fields – and finds the records with that property quickly [Molina, Ullman]. It is equivalent to say that an index allows us to find a record without having to scan all the possible records. The fields upon which an index is based are called search key, or just key.

To get an idea about indexes work, let’s consider following simplification. A sequential file is a data file for a relation which is sorted by primary key (we are assuming that all relations have a primary key). Note that it has some implications – all operations (search, insert, delete) have average complexity O(n), where n is a number of records. If a sequential file grows really big (has a lot of records), using it may become a problem.

If records are big, we could devise following method of indexing – create a file, called index file, in which we keep following pairs for all records from our sequential file – primary key value and pointer to the record in sequential file. Such index is called dense index, since it contains all values of primary key. Of course, it does not give much of a performance boost, but if the records in a table are big it’s easier (quicker) to scan such index than a table itself (especially if the index fits in memory). The data retrieval is done by one disk I/O since the index contains pointer to data in data file. The complexity measured in terms of number of records stays the same – it’s still O(n) on the average.

If a dense index were too big to give a benefit, we could take a shortcut and have a pointer not to every record, but, say, every fourth record. This is a sparse index, which trades a bit of speed of dense index for smaller size. In our case a sparse index is four times smaller than a dense index, so searching it would follow such procedure (we are searching for primary key value of x):

  1. Find the biggest value y in a sparse index, such as y <= x.
  2. Go to data file and check if primary key value in a record referenced by a pointer from sparse index is equal to x. If yes, we’re done.
  3. [Loop 4 times] Check the next record if primary key value is equal to x – if yes, search is complete.
  4. If primary key value of x is not found in the loop, it is not contained in the relation.

Note that step 1 in this procedure is also O(n) on the average, but with a lower coefficient hidden in a big O notation. It is also possible that a partial table scan from step 3 induces additional I/O, so having a sparse index “too sparse” might cause extra I/O.

In order to overcome this, we can create multi-level index structure – first level being either dense or sparse, whereas next levels have to be sparse indexes (dense index in levels 2+ does not provide any performance boost). The idea is to benefit from smaller record size in index, so the index page contains more records and it requires less I/O. Therefore the procedure above would be changed to (remember that 4 is our choice here, it can be a different number):

  1. Find the biggest value y in a sparse index, such as y <= x.
  2. Go to first-level index, record next 4 records and look up the primary key value of x.
  3. If primary key value of x is not found, it is not contained in the relation.

This kind of structure becomes very similar to B-tree with one difference – a tree can grow as deep as necessary by itself, but multi-level index has arbitrary depth. Owing to this, the pessimistic example in multi-level index still has O(n) complexity, but the tree lookup will be O(log n). With this comment, it is time to end part 1 of my index theory summary and return in the future with more information about B-trees.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s