Optimizer Histograms

At the recent OUG Ireland conference I had the privilege of participating in a panel discussion on the Oracle Database. During the course of the session the topic of Optimizer histograms came up. As always, a heated discussion ensued among the members of the panel, as we each had very different views on the subject.

Why so many different opinions when it comes to histograms?

The problem arises from the fact that some folks have been burnt by histograms in the past. In Oracle Database 9i and 10g, histograms in combination with bind peeking led to some unpredictable performance problems, which is explained in detail in this post on the Optimizer blog. This has resulted in a number of folks becoming histogram shy. In fact, I reckon if you were to put 3 Oracle experts on a panel, you would get at least 5 different opinions on when and how you should gather histograms!

So I thought it would be a good idea to explain some of the common misconceptions that surround histograms and the impact of acting on them.

This is a long post, so you might want to grab a coffee before you get into it!

Before we get into the misconceptions, let me quickly recap what a histogram is.

Histograms tell the Optimizer about the distribution of data within a column. By default (without a histogram), the Optimizer assumes a uniform distribution of rows across the distinct values in a column. Therefore, the Optimizer calculates the cardinality (number of rows returned) for an equality predicate by dividing the total number of rows in the table by the number of distinct values in the column used in the equality predicate.

If the data distribution in that column is not uniform (i.e., a data skew exists) then the cardinality estimate will be incorrect. In order to accurately reflect a non-uniform data distribution, a histogram is required on the column. The presence of a histogram changes the formula used by the Optimizer to estimate the cardinality, and allows it to generate a more accurate execution plan.
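As a concrete sketch (assuming statistics have already been gathered, and borrowing the CUSTOMERS table and ZIPCODE column used in the examples below), you can reproduce the Optimizer's default equality-predicate estimate directly from the dictionary:

```sql
-- Default cardinality estimate for an equality predicate with no histogram:
--   estimate = NUM_ROWS / NUM_DISTINCT
SELECT t.num_rows,
       c.num_distinct,
       ROUND(t.num_rows / c.num_distinct) AS estimated_cardinality
FROM   user_tables t
JOIN   user_tab_col_statistics c
       ON c.table_name = t.table_name
WHERE  t.table_name  = 'CUSTOMERS'
AND    c.column_name = 'ZIPCODE';
```

With 40,004 rows and 4 distinct zipcodes, this works out to roughly 10,001 rows, which is exactly the estimate that causes trouble in the skewed example later in this post.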

The creation of histograms is controlled by the METHOD_OPT parameter in the DBMS_STATS gathering procedures.

BEGIN
  dbms_stats.gather_table_stats(ownname => 'SH', tabname => 'SALES',
                                method_opt => 'FOR ALL COLUMNS SIZE AUTO');
END;
/

Oracle offers four types of histogram; more details on when and how each is created are available in the Optimizer Statistics White Paper.

Misconception 1: It’s always best to disable the creation of histograms

method_opt => 'FOR ALL COLUMNS SIZE 1'

This misconception often stems from a belief that Oracle creates too many histograms, so you should prevent it from creating any.

So, how does Oracle determine when to create a histogram?

Oracle automatically determines the columns that need histograms based on the column usage information recorded in SYS.COL_USAGE$ (columns used in WHERE clause predicates), and the presence of a data skew. For example, Oracle will not automatically create a histogram on a unique column if it is only seen in equality predicates.
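If you want to see what column usage Oracle has recorded for a table, DBMS_STATS provides a report function in recent releases. A minimal sketch:

```sql
-- Report the column usage recorded in SYS.COL_USAGE$ for SH.CUSTOMERS
SET LONG 100000
SELECT dbms_stats.report_col_usage(ownname => 'SH', tabname => 'CUSTOMERS')
FROM   dual;
```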

What’s the down side to stopping Oracle from creating any histograms?

The Optimizer will assume an even distribution of values in all columns.

On first reading this, you would be forgiven if you thought it doesn’t sound like a down side at all! In fact, it sounds like a short cut to the holy grail of plan stability. But what if the plan you get when the Optimizer assumes an even distribution isn’t the one you want?!

Imagine we own two hardware stores. Most of our customers live in the same zipcode as one of our stores, with a couple of exceptions.

SELECT   zipcode, COUNT(*) 
FROM     customers 
GROUP BY zipcode;
 
  ZIPCODE   COUNT(*)
---------- ----------
     91140	20000
     94303	20000
     94070	    2
     94065	    2

Let’s now gather statistics on the CUSTOMERS table using the recommendation to disable histograms.

BEGIN
  dbms_stats.gather_table_stats('SH','CUSTOMERS', method_opt=>'FOR ALL COLUMNS SIZE 1');
END;
/
 
PL/SQL PROCEDURE successfully completed.
 
SELECT column_name, num_distinct, histogram 
FROM   user_tab_col_Statistics
WHERE  TABLE_NAME='CUSTOMERS';
 
COLUMN_NAME			    NUM_DISTINCT HISTOGRAM
----------------------------------- ------------ ---------------
CUST_FIRST_NAME 			     451 NONE
CUST_LAST_NAME				     401 NONE
CUST_GENDER				       2 NONE
ZIPCODE 				       4 NONE
CUST_EMAIL				     401 NONE

No histograms were created, so when a query comes in looking for the different genders of our customers in a given zipcode, the Optimizer calculates the cardinality estimate by dividing the total number of rows in the table (40,004) by the number of distinct values in the column (4). The Optimizer will therefore assume 10,001 rows will be returned from the CUSTOMERS table regardless of the zipcode provided, which will always result in a FULL TABLE SCAN. This works great for two of our zipcodes (91140 and 94303) but not so great for the other two (94070 and 94065), which would be better off using an INDEX RANGE SCAN.

EXPLAIN PLAN FOR
SELECT   cust_gender, COUNT(*) 
FROM     customers
WHERE    zipcode='94303' 
GROUP BY cust_gender;
 
Explained.
 
SELECT * FROM TABLE(dbms_xplan.display());
 
 
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash VALUE: 1577413243
 
--------------------------------------------------------------------------------
| Id  | Operation	   | Name      | ROWS  | Bytes | Cost (%CPU)| TIME     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	       |     2 |    14 |    72	 (3)| 00:00:01 |
|   1 |  HASH GROUP BY	   |	       |     2 |    14 |    72	 (3)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| CUSTOMERS | 10001 | 70007 |    70	 (0)| 00:00:01 |
--------------------------------------------------------------------------------
 
EXPLAIN PLAN FOR
SELECT   cust_gender, COUNT(*) 
FROM     customers
WHERE    zipcode='94065' 
GROUP BY cust_gender;
 
Explained.
 
SELECT * FROM TABLE(dbms_xplan.display());
 
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash VALUE: 1577413243
 
--------------------------------------------------------------------------------
| Id  | Operation	   | Name      | ROWS  | Bytes | Cost (%CPU)| TIME     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	       |     2 |    14 |    72	 (3)| 00:00:01 |
|   1 |  HASH GROUP BY	   |	       |     2 |    14 |    72	 (3)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| CUSTOMERS | 10001 | 70007 |    70	 (0)| 00:00:01 |
--------------------------------------------------------------------------------

Alternative Solution

If you really don’t want Oracle to create a histogram on a column or a set of columns you can control that by setting a table level preference, which Oracle will honor every time it gathers statistics on that table.

The following command prevents the Optimizer from creating a histogram on the PROD_ID column in the SALES table:

BEGIN
dbms_stats.Set_table_prefs('SH', 'SALES',
'METHOD_OPT'=>'FOR ALL COLUMNS SIZE AUTO, FOR COLUMNS SIZE 1 PROD_ID');
END;
/
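You can read a preference back with DBMS_STATS.GET_PREFS to confirm what the next statistics gather will actually use:

```sql
-- Confirm the METHOD_OPT preference currently in force for SH.SALES
SELECT dbms_stats.get_prefs('METHOD_OPT', 'SH', 'SALES') AS method_opt
FROM   dual;
```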

But what if you don’t know which columns you don’t want a histogram on up front?

No sweat. You can simply let Oracle collect all the histograms it wants, drop any histogram that causes you a problem, and then set the preference to tell Oracle not to gather it again.

The following command drops the histogram on the PROD_ID column:

BEGIN
dbms_stats.Delete_column_stats(ownname => 'SH', tabname => 'SALES',
colname => 'PROD_ID', col_stat_type => 'HISTOGRAM');
END;
/

Misconception 2: It’s always best to prevent Oracle from creating any new histograms

method_opt => 'FOR ALL COLUMNS SIZE REPEAT'

This misconception is based on a belief of “If it ain’t broke, don’t fix it”. In other words, the system is working now, so prevent Oracle from breaking it by creating more histograms.

What does setting SIZE REPEAT do?

Setting SIZE REPEAT ensures a histogram will only be created for a column that already has one. If the table is partitioned, REPEAT ensures a histogram will only be created for a column that already has one at the global level.

What’s the down side to doing this?

The current number of buckets used in each histogram becomes the limit on the maximum number of buckets used for any histogram created in the future. (Remember, a histogram sorts values into buckets.)
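Since the existing bucket counts become the ceiling, it is worth checking them before adopting SIZE REPEAT. The bucket count for each histogram is visible in USER_TAB_COL_STATISTICS:

```sql
-- Show the current bucket count for every column with a histogram;
-- under SIZE REPEAT these counts cap all future histograms on the table
SELECT column_name, histogram, num_buckets
FROM   user_tab_col_statistics
WHERE  table_name = 'CUSTOMERS'
AND    histogram <> 'NONE';
```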

Again, this may not seem like a bad idea until you realize what limiting the number of buckets can do as your data grows or changes. For example, let’s assume we own a coffee shop and initially have just one location. Most of our customers live in the same zipcode as our store, but a couple do not.

SELECT   zipcode, COUNT(*) 
FROM     customers 
GROUP BY zipcode;
 
   ZIPCODE   COUNT(*)
---------- ----------
     91140	20000
     94065	    2

When we gathered statistics initially, we used the default settings and Oracle chose to create a frequency histogram on the ZIPCODE column.

BEGIN 
dbms_stats.gather_table_stats('SH','CUSTOMERS');
END;
 /
 
SELECT column_name, histogram 
FROM   user_tab_col_Statistics
WHERE  TABLE_NAME='CUSTOMERS';
 
COLUMN_NAME			    HISTOGRAM
----------------------------------- ---------------
CUST_FIRST_NAME 		    NONE
CUST_LAST_NAME			    NONE
CUST_GENDER			    NONE
ZIPCODE 			    FREQUENCY
CUST_EMAIL			    NONE
 
SELECT column_name, endpoint_number, endpoint_value                       
FROM   user_histograms 
WHERE  TABLE_NAME='CUSTOMERS'
AND    column_name='ZIPCODE';
 
COLUMN_NAME			    ENDPOINT_NUMBER ENDPOINT_VALUE
----------------------------------- --------------- --------------
ZIPCODE 				      20000	     91140
ZIPCODE 				      20002	     94065

The frequency histogram has two buckets, one for each of the distinct values in the zipcode column. Since the Optimizer knows the exact distribution of both values in our zipcode column, it gets the right execution plan for each value.

EXPLAIN PLAN FOR
SELECT   cust_gender, COUNT(*) 
FROM     customers
WHERE    zipcode='94065' 
GROUP BY cust_gender;
 
Explained.
 
SELECT * FROM TABLE(dbms_xplan.display());
 
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash VALUE: 3279909665
 
-----------------------------------------------------------------------------------------------------
| Id  | Operation			     | Name	    | ROWS  | Bytes | Cost (%CPU)| TIME     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		     |		    |	  1 |	  7 |	  3  (34)| 00:00:01 |
|   1 |  HASH GROUP BY			     |		    |	  1 |	  7 |	  3  (34)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMERS    |	  2 |	 14 |	  2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN		     | CUST_ZIPCODE |	  2 |	    |	  1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------
 
EXPLAIN PLAN FOR
SELECT   cust_gender, COUNT(*) 
FROM     customers
WHERE    zipcode='91140' 
GROUP BY cust_gender;
 
Explained.
 
SELECT * FROM TABLE(dbms_xplan.display());
 
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash VALUE: 1577413243
 
--------------------------------------------------------------------------------
| Id  | Operation	   | Name      | ROWS  | Bytes | Cost (%CPU)| TIME     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	       |     2 |    14 |    38	 (6)| 00:00:01 |
|   1 |  HASH GROUP BY	   |	       |     2 |    14 |    38	 (6)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| CUSTOMERS | 20000 |   136K|    36	 (0)| 00:00:01 |
--------------------------------------------------------------------------------

So far, so good.

Now we take the advice and switch from the default, SIZE AUTO, to SIZE REPEAT. Initially it’s not a problem, but when we open our second store things start to fall apart. We now have a lot of new customers who live in the same zipcode as our second store.

SELECT   zipcode, COUNT(*) 
FROM     customers 
GROUP BY zipcode;
 
  ZIPCODE   COUNT(*)
---------- ----------
     91140	20000
     94303	20000
     94070	    2
     94065	    2

In fact, we have 4 distinct values in the zipcode column but when we gather statistics we are limiting the Optimizer to just 2 histogram buckets.

BEGIN
  dbms_stats.gather_table_stats('SH','CUSTOMERS', method_opt=>'FOR ALL COLUMNS SIZE REPEAT');
END;
/
 
PL/SQL PROCEDURE successfully completed.
 
SELECT column_name, histogram 
FROM   user_tab_col_Statistics
WHERE  TABLE_NAME='CUSTOMERS';
 
COLUMN_NAME			    HISTOGRAM
----------------------------------- ---------------
CUST_FIRST_NAME 		    NONE
CUST_LAST_NAME			    NONE
CUST_GENDER			    NONE
ZIPCODE 			    HYBRID
CUST_EMAIL			    NONE
 
SELECT column_name, endpoint_number, endpoint_value, endpoint_repeat_count
FROM   user_histograms
WHERE  TABLE_NAME='CUSTOMERS'
AND    column_name='ZIPCODE';
 
COLUMN_NAME			    ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_REPEAT_COUNT
----------------------------------- --------------- -------------- ---------------------
ZIPCODE 				       2815	     91140		    2815
ZIPCODE 				       5583	     94070		       1

Given the limitations we put on the Optimizer, it has chosen to switch to a HYBRID histogram, as we don’t have enough buckets available to create a frequency histogram (one bucket per distinct value).

Since only 2 buckets were allowed, and a histogram must always have the minimum and maximum values of the column as bucket endpoints, the Optimizer used the values 94070 (maximum value) and 91140 (minimum value), even though these are not the two most popular values in the column. So, what happens when we try our query again using the second most popular value, 94303?

EXPLAIN PLAN FOR
SELECT   cust_gender, COUNT(*)
FROM     customers
WHERE    zipcode='94303'
GROUP BY cust_gender;
 
Explained.
 
SELECT * FROM TABLE(dbms_xplan.display());
 
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash VALUE: 3279909665
 
-----------------------------------------------------------------------------------------------------
| Id  | Operation			     | Name	    | ROWS  | Bytes | Cost (%CPU)| TIME     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		     |		    |	  1 |	  7 |	  3  (34)| 00:00:01 |
|   1 |  HASH GROUP BY			     |		    |	  1 |	  7 |	  3  (34)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| CUSTOMERS    |	  2 |	 14 |	  2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN		     | CUST_ZIPCODE |	  2 |	    |	  1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------

Not surprisingly, we get the wrong plan: an INDEX RANGE SCAN instead of a FULL TABLE SCAN.

How can we fix it?

Simply use the default value for the METHOD_OPT parameter.

BEGIN 
dbms_stats.gather_table_stats('SH','CUSTOMERS');
END;
 /
 
SELECT column_name, histogram 
FROM   user_tab_col_Statistics
WHERE  TABLE_NAME='CUSTOMERS';
 
COLUMN_NAME			    HISTOGRAM
----------------------------------- ---------------
CUST_FIRST_NAME 		    NONE
CUST_LAST_NAME			    NONE
CUST_GENDER			    NONE
ZIPCODE 			    FREQUENCY
CUST_EMAIL			    NONE
 
SELECT column_name, endpoint_number, endpoint_value                       
FROM   user_histograms 
WHERE  TABLE_NAME='CUSTOMERS'
AND    column_name='ZIPCODE';
 
COLUMN_NAME			    ENDPOINT_NUMBER ENDPOINT_VALUE
----------------------------------- --------------- --------------
ZIPCODE 				      20000	     91140
ZIPCODE 				      40000	     94303
ZIPCODE 				      40002	     94065
ZIPCODE 				      40004	     94070
 
EXPLAIN PLAN FOR 
SELECT   cust_gender, COUNT(*) 
FROM     customers 
WHERE    zipcode='94303' 
GROUP BY cust_gender;
 
Explained.
 
SELECT * FROM TABLE(dbms_xplan.display());
 
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash VALUE: 1577413243
 
--------------------------------------------------------------------------------
| Id  | Operation	   | Name      | ROWS  | Bytes | Cost (%CPU)| TIME     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	       |     2 |    14 |    72	 (3)| 00:00:01 |
|   1 |  HASH GROUP BY	   |	       |     2 |    14 |    72	 (3)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| CUSTOMERS | 20000 |   136K|    70	 (0)| 00:00:01 |
--------------------------------------------------------------------------------

Alternative Solution

If the default value of the METHOD_OPT parameter doesn’t work in your particular environment, because you only want to allow Oracle to create a histogram on one or more specific columns in a table, then use table preferences to control it rather than SIZE REPEAT.

Let’s assume we only want to create a histogram on the CUST_ID column of the SALES table. We can use the following command to prevent the Optimizer from creating histograms on any other column in the SALES table.

BEGIN
dbms_stats.Set_table_prefs('SH', 'SALES', 
'METHOD_OPT'=>'FOR ALL COLUMNS SIZE 1, FOR COLUMNS SIZE AUTO CUST_ID');
END;
/

What would I recommend?

Always use the default value for METHOD_OPT. This is based on my belief that you should always start with the default.  Yes, there will be exceptions to this rule but it should work for at least 80% of systems, 80% of the time.

For situations where that’s not the case please consider solving the problem at the scope of the problem.

If you have a problem with just one column, in one table, then use DBMS_STATS.SET_TABLE_PREFS to address it, as described above. Or if you are having problems with one particular schema use DBMS_STATS.SET_SCHEMA_PREFS but don’t make system level changes to fix a problem with just one column.
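For example, a schema-level version of the earlier table preference might look like this (a sketch; substitute your own METHOD_OPT value):

```sql
-- Set a METHOD_OPT preference for every table in the SH schema
BEGIN
  dbms_stats.set_schema_prefs(
    ownname => 'SH',
    pname   => 'METHOD_OPT',
    pvalue  => 'FOR ALL COLUMNS SIZE AUTO');
END;
/
```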

I’ll get off my soapbox now!

14 thoughts on “Optimizer Histograms”

  1. Hi Maria,
    What do you recommend when a database has just been imported, for a migration for example, but the application has not run yet? SIZE AUTO is not good there because no (or not enough) column usage has been tracked yet, so no histograms are created. If old stats have been imported, we can use REPEAT. If not, then SKEWONLY, but that risks creating a lot more histograms. What is your recommended approach there?
    Thanks,
    Franck.

    1. Hi Franck,

      I would recommend ensuring you have a good, representative set of statistics before you begin the migration. Remember statistics should be representative (accurate NDV, MIN, MAX etc) and not necessarily up to the minute.

      When you import the database into the new environment, those representative statistics will also be imported. I recommend you continue to use those representative statistics until the column usage information has been generated before gathering statistics using SIZE AUTO.

      This approach is far safer than suddenly using a new approach to gather statistics (REPEAT or SKEWONLY), which could result in a completely different set of statistics being generated that have the potential to impact existing execution plans.
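      One way to generate that column usage information proactively (a sketch worth testing in your own environment) is DBMS_STATS.SEED_COL_USAGE, which records column usage for statements hard parsed during a monitoring window:

```sql
-- Record column usage for all statements hard parsed in the next hour.
-- The first two arguments are NULL to monitor the live workload rather
-- than a specific SQL tuning set.
BEGIN
  dbms_stats.seed_col_usage(NULL, NULL, 3600);
END;
/
```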

      Thanks,
      Maria

      1. Hi Maria,
        Thanks. The problem with keeping the same stats is that table BLOCKS and index CLUSTERING_FACTOR will change with migration. We may prefer to gather the new stats before validating the migration and giving the go, rather than risking lots of plan changes the day after.

        1. Hi Franck,

          If you want to gather statistics on the migrated system to account for changes in the table BLOCKS and index CLUSTERING_FACTOR then I suggest that you also export the column usage information as part of the migration.

          Once the col_usage$ information is imported into the new system, you can gather statistics using the defaults.

          Alternatively you could use SIZE REPEAT just for the initial gather, as the number of buckets required for each histogram shouldn’t have changed. However, it is not recommended to use the SIZE REPEAT going forward.

          Thanks,
          Maria

  2. Maria – looking at your twitter handle saw that you are presenting the keynote on what’s new in Database 12.2.

    Can we have access to those slides please?

  3. Hi Maria,
    What do think about the following idea:
    Oracle should make a “better” relation between histograms and indexes; actually we need some kind of new index, an index that does not include rows for values with bad selectivity, ignoring those values just like it does with NULL values. The index would be “smaller and faster”.

    Thanks,
    Andjelko Miovcic

  4. First, slight typo, please add the missing “o” to “to” near the beginning where it should be “too.” (I know, auto#&!correct, but we’ve got to get this right for the kids.)

    Second, (and this may well be where histograms are most important), would you explain for the masses in your clear, concise style why using synopses to quickly get new global stats (where the calculations are on the order of the number of partitions rather than the number of rows) requires that you figure out and dictate precisely the same statistics settings on all partitions of a table, and possibly manually adjust all the partitions (once per need for change), rather than letting a new texture in a new partition cause global statistics calculations to revert to evaluating the entire table.
    Third, is there an easy way (I haven’t found one, but it could be staring me in the face) to tell global statistics collection to just use selected list of partitions? Where most queries are on data spanning the last week or two, for example, and queries on older data are within a single data range partition), it would be useful to have the global stats “lie” and just be calculated on the last week or two. Commonly for life cycle management some of the biggest tables I see have time based primary partitioning, and this would be quite useful. (Eventually I suppose there might be an easy way to have multiple sets of statistics, where the choice is richer than “use partition stats for a single partition” versus “use global stats” if there is more than one partition. So the CBO (if partition selection is observable) would check whether there existed one or more sets of stats that included all the referenced partitions and choose the stat set containing the fewest partitions that satisfied the requirement.) Whew. I hope my 2nd and 3rd are clear, happy to clarify if you have questions.

  5. Hi Maria ,

    can you help us with high_value and low_value used in dba_part_col_statastics
    if the value(high and low) is similar for most of the partition (here I am taking reference of range partition table on a column having datatype number), when stats are gathered,one of column having similar high and low value for most of the partition if my understanding is correct it should not show similar value, it should show high and low end of values , can you please share some light on this ?

  6. Hi Maria,

    Thanks a lot for this amazing blog post.

    I found an issue after gathering stats , that endpoint_repeat_count in dba_histograms was not correct based on the number of values present in the table.

    ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_REPEAT_COUNT
    --------------- -------------- ---------------------
               4352       71063970                     2
               4353       71065050                     1
               4354       71065557                     1

    There are 308 rows with the value 71065050; however, DBA_HISTOGRAMS shows a repeat count of 1.
    Would you please let me know what could be the issue here?

    Thanks
    Tina B

    1. Hi Tina,

      Frequency and top frequency histograms are created from a full table scan starting in Oracle Database 12c onwards. However, a hybrid histogram uses a sampling of column data.

      When Oracle creates a height-balanced or hybrid histogram it builds it on a sample that has approximately 5,500 non-null values for the column. Typically all of the histograms required for a table are built from the same 5,500-row sample.

      However, if one or more of the columns that require a histogram has null values, then the sample size may be increased in order to achieve a sample of 5,500 non-null values for those columns. In addition, if the number of nulls varies greatly between columns, multiple samples may be created: one for the columns with a low number of null values and one for the columns with a high number of null values.

      You can see the exact sample size used by querying the SAMPLE_SIZE column in the USER_TAB_COL_STATISTICS dictionary view.
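      For example:

```sql
-- Check the sample size used for each column's statistics
SELECT column_name, sample_size, histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'SALES';
```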

      Thanks,
      Maria
