What is an index – a theory (part 1/4 – dense and sparse indexes)

Since I had written a few posts about indexes before, I thought it would be good to shed some light on theory behind indexes in SQL Server. It would be divided into four parts:

  1. Dense and sparse indexes (this post).
  2. B-trees – searching for a value and operation efficiency.
  3. B-trees – inserting an item, operation cost and page splits.
  4. B-trees – deleting an item, operation cost and page merge.

By no means I am comfortable with saying that it’s a definitive source of knowledge about indexes in SQL Server. It’s a result of research conducted by me based on various sources. If you are interested to know more, try reading those books first:

  • Molina, Ullman Database Systems – The Complete Book, chapter 13
  • Knuth The Art of Computer Programming, chapter 6.2.4
  • Cormen, Leiserson, Rivest, Stein Introduction to Algorithms, chapter 18

Whenever I’m using online content by other authors, it’s always used by their consent.

Let’s start with a definition of an index. An index is a data structure that takes a property of records as input – typically the value of one or more fields – and finds the records with that property quickly [Molina, Ullman]. It is equivalent to say that an index allows us to find a record without having to scan all the possible records. The fields upon which an index is based are called search key, or just key.

To get an idea about indexes work, let’s consider following simplification. A sequential file is a data file for a relation which is sorted by primary key (we are assuming that all relations have a primary key). Note that it has some implications – all operations (search, insert, delete) have average complexity O(n), where n is a number of records. If a sequential file grows really big (has a lot of records), using it may become a problem.

If records are big, we could devise following method of indexing – create a file, called index file, in which we keep following pairs for all records from our sequential file – primary key value and pointer to the record in sequential file. Such index is called dense index, since it contains all values of primary key. Of course, it does not give much of a performance boost, but if the records in a table are big it’s easier (quicker) to scan such index than a table itself (especially if the index fits in memory). The data retrieval is done by one disk I/O since the index contains pointer to data in data file. The complexity measured in terms of number of records stays the same – it’s still O(n) on the average.

If a dense index were too big to give a benefit, we could take a shortcut and have a pointer not to every record, but, say, every fourth record. This is a sparse index, which trades a bit of speed of dense index for smaller size. In our case a sparse index is four times smaller than a dense index, so searching it would follow such procedure (we are searching for primary key value of x):

  1. Find the biggest value y in a sparse index, such as y <= x.
  2. Go to data file and check if primary key value in a record referenced by a pointer from sparse index is equal to x. If yes, we’re done.
  3. [Loop 4 times] Check the next record if primary key value is equal to x – if yes, search is complete.
  4. If primary key value of x is not found in the loop, it is not contained in the relation.

Note that step 1 in this procedure is also O(n) on the average, but with a lower coefficient hidden in a big O notation. It is also possible that a partial table scan from step 3 induces additional I/O, so having a sparse index “too sparse” might cause extra I/O.

In order to overcome this, we can create multi-level index structure – first level being either dense or sparse, whereas next levels have to be sparse indexes (dense index in levels 2+ does not provide any performance boost). The idea is to benefit from smaller record size in index, so the index page contains more records and it requires less I/O. Therefore the procedure above would be changed to (remember that 4 is our choice here, it can be a different number):

  1. Find the biggest value y in a sparse index, such as y <= x.
  2. Go to first-level index, record next 4 records and look up the primary key value of x.
  3. If primary key value of x is not found, it is not contained in the relation.

This kind of structure becomes very similar to B-tree with one difference – a tree can grow as deep as necessary by itself, but multi-level index has arbitrary depth. Owing to this, the pessimistic example in multi-level index still has O(n) complexity, but the tree lookup will be O(log n). With this comment, it is time to end part 1 of my index theory summary and return in the future with more information about B-trees.

Index maintenance revisited

It’s almost a month since my post on index maintenance based on fragmentation, in which I attempted to show what can be done using simple script and some DMVs. There was some discussion on LinkedIn group SQLDBA about it and I would refer to it with few comments.

  1. The purpose of my script was purely to demonstrate the existence of sys.dm_db_index_physical_stats DMF, which in conjunction with other DMV yields very interesting results.
  2. The script presented by me was very simple, if not simplistic. It does not work well with partitioned indexes (my own findings), offline databases (as pointed out by Sten Westerback) and – what’s more important – it’s only criteria of operation is index fragmentation being completely unaware of size or other constraints. This does not make it a good candidate for running unmodified in production environments.
  3. As per another comment – indeed Microsoft provides very good example in Books Online regarding sys.dm_db_index_physical_stats, which deals with first two of points mentioned above.
  4. Finally, there is a fantastic maintenance tool developed by Ola Hallengren, which gives all kinds of options when it comes to database maintenance. It is successfully used by many DBAs all around world – if you need something to be working out-of-the-box, you might want to give it a try.

The reason why I’m writing this is that I don’t want to be blamed for taking a proof of concept (as my script was) and turning it into a primary maintenance solution. As usual – read, test, ask questions, adapt, share your findings, we all will benefit from that.

How to find potential missing indexes in a database

In previous posts (here and here), I described the methods I use to monitor and maintain indexes. This is a next chapter – let’s try to find candidates for new indexes or hints for extending existing ones. Of course, there are few methods to achieve this goal:

  1. Find a candidate query and analyze it in Management Studio (applies to SQL Server 2008 or newer). This reduces to another problem – how to find a suitable query? You can use Profiler, but you have to spend some time working on filtering plus you can miss the query, if you’re not patient enough. You can also use reports in Management Studio – right click on server, select Reports –> Standard Reports and choose Peformance – Top Queries by Average IO and Top Queries by Total IO. Queries in Top Queries in Average IO are the best candidates for optimizing, especially if they are frequently used.
    As for analysis – you can do it yourself manually or take advantage of new SSMS feature in 2008 – if you display execution plan, it will mention an index which could be helpful and projected benefit if such index is created.
  2. If you have a query or a bunch of them, you can provide them as input to Database Engine Tuning Advisor and get the recommendation.
  3. You can also check sys.dm_db_missing_index_details DMV to get information of all possible missing indexes in a database. This is actually the method used in SSMS (mentioned in step 1) – if it stumbles upon a scan or a seek which would benefit from additional index, it updates stats in the DMV and returns information to the user. If you want to find all indexes you can use following script:
    select 
    	mid.statement as [Table name], 
    	mid.equality_columns, 
    	mid.inequality_columns, 
    	mid.included_columns, 
    	isnull(migs.unique_compiles, 0) as [Compile count], 
    	isnull(migs.user_seeks, 0) as [User seeks], 
    	isnull(migs.user_scans, 0) as [User scans], 
    	migs.avg_total_user_cost, 
    	migs.avg_user_impact
    from 
    	sys.dm_db_missing_index_details mid
    	inner join sys.dm_db_missing_index_groups mig on mid.index_handle = mig.index_handle
    	left join sys.dm_db_missing_index_group_stats migs on mig.index_group_handle = migs.group_handle

Compare it also with Books On-Line entries for sys.dm_db_missing_index_columns dynamic management function (there is a similar example to this one in this article) and sys.dm_db_missing_index_details dynamic management view for more information. And – as usual – please let me know if you have any comments or suggestions.

A script for index maintenance based on fragmentation

In my previous post regarding indexes I presented a method of inspecting all indexes in a database using sys.dm_db_index_physical_stats. Let us develop it a little bit more.

Suppose you want to loop through all indexes and rebuild those which fragmentation exceeds 30 percent, and reorganize those which fragmentation is between 5 and 30 percent. Rebuilding or reorganizing all indexes is not a problem – you can prepare a maintenance plan and incorporate a particular task. But it will not solve the problem – this way you can apply only action to all the indexes in a database. You can say there is no problem in rebuilding all indexes, but there is – there is no point rebuilding those which are barely fragmented since it is a waste of resources (disk space, CPU and I/O), extra log space is being used to record all the operations and it may not give any performance boost at all. On the other hand – reorganizing some indexes may give no performance bonus, especially at high level of fragmentation, when it’s easier (and better) to rebuild.

So the problem stated in the beginning may be solved this way:

  1. Extract information about all indexes to be maintained (note that the scope can be narrowed to a single table) – you need index name, table name and fragmentation.
  2. For each index perform rebuild or reorganization based on fragmentation.
  3. (optional) Place it in a maintenance plan (in Execute T-SQL Statement Task) or SQL Server Agent job to run it periodically.

First step is something already mentioned:

DECLARE @IndexName varchar(255)
DECLARE @TableName varchar(255)
declare @Frag float

DECLARE TableCursor CURSOR FOR
SELECT si.[name] as index_name,
	sdm.avg_fragmentation_in_percent,
	so.[name] as table_name
FROM sys.dm_db_index_physical_stats (NULL, NULL, NULL, NULL, NULL) sdm
	inner join sys.indexes si on sdm.object_id = si.object_id and si.index_id = sdm.index_id
	inner join sys.objects so on so.object_id = si.object_id

Notice variable declarations and a cursor for a future loop. Since I already described this part of the query, let’s fast forward to second step. Here’s the code:

OPEN TableCursor 

FETCH NEXT FROM TableCursor INTO @IndexName, @Frag , @TableName
WHILE @@FETCH_STATUS = 0
BEGIN
	print @TableName + ' - ' + @IndexName + '...'
	if @Frag < 30 and @Frag > 5
	begin
		print ' REORGANIZE '
		exec ('ALTER INDEX ' + @IndexName + ' ON [' + @TableName + '] REORGANIZE')
	end
	else if @Frag > 30
	begin
		print ' REBUILD '
		exec ('ALTER INDEX ' + @IndexName + ' ON [' + @TableName + '] REBUILD')
	end
	print 'done' + char(13)
	FETCH NEXT FROM TableCursor INTO @IndexName, @Frag, @TableName
END 

CLOSE TableCursor
DEALLOCATE TableCursor

Using a cursor, we can loop through all the indexes and then choose appropriate action based on value of @Frag variable. In addition, you will have a trace of actions in the script output.

Whole script is available for download here. As usual, I await your comments and suggestions.

How to check fragmentation of all indexes in a database

It has been told multiple times how index maintenance is important for a healthy database. Here’s one of my tricks – a quick script to check all indexes within a database. It uses DMF sys.dm_db_index_physical stats, which is available since SQL Server 2005. Quick glance into Books Online gives the basic syntax:

sys.dm_db_index_physical_stats 
(
	{ database_id | NULL | 0 | DEFAULT }, 
	{ object_id | NULL | 0 | DEFAULT }, 
	{ index_id | NULL | 0 | -1 | DEFAULT }, 
	{ partition_number | NULL | 0 | DEFAULT },
	{ mode | NULL | DEFAULT }
)

However, the output provided by this DMF is not easily readable (see for yourself Szeroki uśmiech), so it would be good just to make more user-friendly. Here’s my proposition:

SELECT 
	s.[name] AS [Schema], 
	o.[name] AS [Table], 
	ips.index_type_desc AS [Index Type], 
	i.[name] AS [Index Name], 
	i.is_primary_key AS [Primary Key], 
	i.is_unique AS [Unique], 
	ips.index_depth, 
	i.fill_factor AS [Fill factor], 
	ips.avg_fragmentation_in_percent, 
	ips.fragment_count, 
	ips.avg_fragment_size_in_pages, 
	ips.page_count, 
	o.create_date, 
	o.modify_date
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, NULL) ips
	INNER JOIN sys.objects o ON ips.object_id = o.object_id
	INNER JOIN sys.schemAS s ON o.schema_id = s.schema_id
	INNER JOIN sys.indexes i ON i.object_id = ips.object_id 
	AND i.index_id = ips.index_id

And a sample result you can get by running it against Adventure Works:

indexesAW

You can narrow it down by providing a table name as a second parameter to sys.dm_db_index_physical stats, but it must be preceded with a schema name (e.g. OBJECT_ID(N‘Person.Address’) ). Or, if you want to check several tables, and a WHERE clause and filter them. Finally, you can also use it in your index maintenance routine to select which indexes should be rebuilt, which should be reorganized and which don’t need a maintenance; instead of simply rebuilding them all.