Jan 272014

In the database design quiz on the PL/SQL Challenge, a player made a comment referring to a 5% rule for indexes – i.e. that an index will (only) be used when accessing 5% or less of the rows in a table. This is something I’ve heard others refer to as well, so I thought I’d put together a post explaining how indexes work via the medium of chocolate!

Imagine you have 100 packets of M&Ms. You also have a document listing the colour of each M&M and the bag it’s stored in. This is ordered by colour, so we have all the blue sweets first, then the brown, green and so on.

You’ve been asked to find all the red M&Ms. There’s a couple of basic ways you could approach this task:

Method 1: Get your document listing the colour and location of each M&M. Go to the top of the “red” section. Lookup the location of the first red M&M, pickup the bag it states, and retrieve the sweet. Go back to your document, and repeat the process for the second red sweet. Keep going back-and-forth between your document and the bags until you’ve found all the red sweets.

Method 2: Pick up a number of bags at a time (e.g. 10), empty their contents out, pick out the red sweets and return the others (to their original bag).

Which approach will be quicker?

Initutively, the second approach appears to be the faster approach. You only pickup each bag once and then do some filtering of the items inside. Whereas with the first approach, you have do a lot of back-and-forth between the document and the bags.

We can be a bit more rigorous than this though. Let’s calculate how many operations we need to do to get all the red chocolates in each case.

When going between the document and the bags (method 1), each time you lookup the location of a new sweet and fetch that bag, that’s a new operation. You have 100 bags, with ~55 of sweets in each. This means you’re doing roughly 920 (100 bags x 55 sweets / 6 colours) operations (plus some work to find the red section in your document).

When you’re picking up several bags, you collect 10 in one step. This means you do 100/10 = 10 operations (plus some filtering once you’ve got them).

Looking at it this way (10 ops vs 920), method 2 is the clear winner.

Let’s imagine another scenario. Mars have started doing a promotion where around 1 in 100 bags contain a silver M&M. If you get the silver sweet, you win a prize. You need to find the silver sweet!

In this case, using method 1, you go to the document to find the location of the single sweet. Then you go to that bag and retrieve the sweet. One operation (well two, if you include finding the silver listing in the document).

With method 2, you still need to pickup every single bag (and do some filtering) just to find one sweet. Clearly method 1 is far superior in this case.

That’s all very nice, but what has it got to do with indexes?

When Oracle stores a record to the database, it is placed in a block. Just like there are many M&Ms in a bag, (normally) there are many rows in a block. When accessing a particular row, Oracle fetches the whole block and retrieves the requested row from within it. This is analogous to us picking up the bag of M&Ms and then picking a single one out.

When doing an index-range scan, Oracle will search the index (your document) to find the first value matching your where clause. It then goes back-and-forth between the index and the table blocks, fetching the records from location pointed to by the index. This is similar to method 1, where you continually switch between the document and the M&M bags.

When doing a full table scan (FTS), Oracle will fetch several blocks at once in a multi-block read. The information is then filtered so that only rows matching your where clause are returned (the red M&Ms) – the rest are discarded. Just like in method 2.

When fetching a “high” percentage of the rows from a table, it becomes far more efficient to get several blocks at once and do some filtering than it is to visit a single block multiple times.

So what counts as “high”? When does an index scan become more efficient than a FTS and vice-versa?

In our M&M example, the “full-table scan” fetches all 100 bags in 10 operations. So the cross-over point will be roughly when you’re searching for a total of 10 M&Ms. When looking for just 10 chocolates You visit 10 bags to fetch all of these via an index and 10 via a FTS. Mars places around 55 M&Ms in each bag, so as a percentage of the “table” that’s just under ( 10/5500*100 = ) 0.2%!

If Mars releases some “giant” M&Ms with only 10 sweets in a bag, the denominator in the equation above decreases. So the crossover is at approximately when accessing ( 10/1000*100 = ) 1%. A higher percentage, but still small in real terms. Equally if they released “mini” M&Ms where you have 200 in a bag, the denominator would increase, meaning that the crossover is at an even smaller percentage of the table!

So as you increase the space required to store a row, an index will become relatively more effective than a full-table scan.

There’s a big assumption I’ve made in the above reasoning however. It’s that there’s no correlation between the order M&Ms are listed in the document and which bag they are in. So, for example, the first red M&M (in the document) may be in bag 1, the second in bag 56, the third in bag 20, etc.

Let’s make a different assumption – that the order of red chocolates in the document corresponds to the order they appear in the bags. So the first 9 red sweets are in bag 1, the next 9 in bag 2 etc. While you still have to visit all 100 bags, you can keep the last bag accessed in your hand, only switching bags every 9 or so sweets. This reduces your overheads, making the index approach more efficient.

We can take this further still. What if Mars changes the bagging process so that only one colour appears in each bag?

Now, instead of having to visit every single bag to get all the red sweets, you only have to visit 100/6 ~ 16 bags. If the sweets are also placed in the bags in the same order they are listed in the document (so M&Ms 1-55 are all blue and in bag 1, bag 2 has the blue M&Ms 56-100, and so on until bag 100, which has yellow M&Ms 496-550) you get the benefits of not switching bags compounded with the effect of having fewer bags to fetch.

This principle – how closely the order of records in a table matches the order they’re listed in a corresponding index – is referred to as the clustering factor in Oracle. This figure is lower when the rows appear in the same physical order in the table that they do in the index (all sweets in a bag are the same colour) and higher when there’s no correlation.

The closer the clustering factor is to the number of blocks in a table, the more likely it is that the index will be used (it is assigned a lower cost). The closer it is to the number of rows in a table, the more likely it is a FTS will be chosen (the index access is given a higher cost).

To sum up, we can see the cost-based optimizer’s decision on whether to use an index or a FTS is determined by:

  • What percentage of the rows in the table it expects a query to return (the selectivity)
  • How many blocks it expects to access using a FTS
  • How many blocks it expects to access via an index (the clustering factor, which in turn depends upon how many rows there are in a block and how well the order of those values in the index matches the order they physically are in the database)

This is just an analogy – if you want to see the formulas the optimizer uses have a read of Wolfgang Breitling’s “Fallacies of the Cost Based Optimizer” paper or read Jonathan Lewis’ Cost-Based Oracle Fundamentals book. I believe it works well for understand how the optimizer decides whether to select an index or FTS though. The blogs of Jonathan Lewis or Richard Foote also contain many posts going into this subject in more detail.

  One Response to “Finding All the Red M&Ms: A Story of Indexes and Full-Table Scans”

  1. […] Finding All the Red M&Ms: A Story of Indexes and Full-Table Scans […]

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>