Pradeep Chauhan: Working with Statistics

What are statistics?

Simply put - Statistics are a form of dynamic metadata that assists the query optimizer in making better decisions and creating the best possible execution plan. These statistics represent the distribution of the data within a column, or columns. The Query Optimizer uses them to estimate how many rows will be returned from a query plan.
Query Optimizer is an intelligent program; it does not necessarily use Statistics all the time.
Suppose there’s a table with very few records, let’s say – 50 rows. In this case it is better off performing a table scan instead of wasting time to perform multiple seeks.

When are statistics created?

a) Statistics are automatically created when there’s a clustered index on the column.

b) When there’s non-clustered index on one or more columns.

c) SQL Server can also build statistics on non-indexed columns. When the non-indexed columns are used in join or filter criteria, SQL Server automatically builds statistics on such columns if it believes such information valuable for creating a better execution plan.

d)Statistics can also be created by create statistics statement. For Ex:
CREATE STATISTICS <statistic-name> ON TableName(ColumnName(s))

Note: By default AUTO_CREATE_STATISTICS set to ON. In case if it is set to OFF on the database the statistics won’t be created by default except for the indexes.

The drawback of the having AUTO_UPDATE ON is that the statistics update will be a part of a optimization phase when the load on the Database is high i.e., lot of DMLs happening, the optimization phase will wait till the statistics get updated for those inserts and updates which results in slower queries.

To be sure whether statistics are being used or not, check out your Actual Number of Rows and Estimated Number of Rows.
If these numbers are (consistently) fairly close, then most likely your statistics are up-to-date and used by the optimizer for the query. If not, time for you to update your statistics.

How Statistics work?

Data is measured two different ways within a single set of statistics, by Density and by Data distribution.

Density:

Density describes the uniqueness of data in a particular column.

Density can be Calculated as 1/distinct values and it lies between 0 and 1
0 refers to no duplicate records
1 refers to 100% duplicate records.

Data Distribution:

The data distribution represents a statistical analysis of the kind of data that is in the first column available for statistics.

Data distribution is measured by histogram which keeps track of number of occurrences of data that fall into number of categories, and in distribution statistics these categories are chosen so as to represent the distribution of the data. It’s this information that the optimizer can use to estimate the number of rows returned by a given distribution value.

SQL collects and maintains a sampling of 200 values of the index data in RANGE_HI_KEY and uses this histogram to determine an estimated number of matches to the criteria.

To understand how exactly Statistics work, let’s take help of DBCC SHOW_STATISTICS

use AdventureWorks2008

DBCC SHOW_STATISTICS ('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID')

Result would be something like this.

It Returns 3 Resultsets.

1) First Resultset focusses on the columns included in the index, details of number of rows, the number of steps in the tree, length and density. You can also check when the statistics were last updated. If it is too old then you need to update the statistics.

The optimizer uses this information to see if it is better to just do a scan than to waste the effort on multiple seeks due to the number of rows and steps.

2) Second Resultset guides you about the All density of frequent values. It also contains information of the columns included and width of the columns.

3) Third resultset is actually the most interesting section. This is where most people think about stats but really don't understand what's going on

SQL collects and maintains a sampling of 200 values of the index data

By using this histogram, SQL Server determines the expected no. of rows to be returned for a given query. For example: in the above if you see the highlighted section, you will noticed that in the 80th step we have

RANGE_HI_KEY=741

RANGE_ROWS=167

EQ_ROWS= 94

Firstly, it means that for range of ProductID>738 and ProductID<741 we will get 167 rows. This can be verified by executing the following query:

Select count(*) from Sales.SalesOrderDetail where ProductID>738 andProductID<741

Result is 167, which confirms the RANGE_ROWS statistics value

Secondly, it also means that for ProductID=741 we will get 94 rows. This can be verified by executing the following query:

Select count(*) from Sales.SalesOrderDetail where ProductID=741
Result is 94, which confirms the EQ_ROWS statistics value.

Conclusion:

Statistics play a very vital role in deciding the best execution plan for the query optimizer and can enhance the performance of queries to a great extent. Inefficient maintenance of statistics may result in all sorts of uninvited performance issues.

Sources:
https://www.simple-talk.com/sql/learn-sql-server/statistics-in-sql-server/

http://blog.namwarrizvi.com/?p=193

http://technet.microsoft.com/en-us/library/dd535534(v=sql.100).aspx

Pradeep Chauhan

Sunday, February 23, 2014

Working with Statistics