Mar 31, 2014
 

Continuing the series on joins, I’m going to look at denormalization. This process reduces the number of joins necessary to return results for a schema.

One of the big arguments against normalizing data is “for performance”. The process of normalization creates new tables as relations are decomposed according to their functional dependencies. This means (more) joins are necessary to return the same results.

A Google search for “database normalization performance” turns up several articles like this, this and this, all advocating denormalizing your data to improve performance. There's no concrete discussion or test cases showing why you should denormalize, just hand-wavy arguments about joins being bad.

I wanted to test this to see if normalizing really makes performance worse. If you’ve been preaching “you must denormalize for performance”, my conclusions may surprise you.

For my first test case, I’m going to borrow a question I did for the database design quiz on the PL/SQL Challenge.

The setup is a pricing schema for hotel rooms. In the quiz, room prices are determined by their type (single, twin, double or suite). The 2NF version of the schema is:

create table hotel_rooms_2nf (
  hotel_room_id    integer not null primary key,
  hotel_name       varchar2(100) not null,
  room_number      integer not null,
  room_type        varchar2(10) not null,
  price_per_night  number(5,2) not null,
  constraint plch_horo2_hotel_room_num_u
    unique (hotel_name, room_number)
);

Because we have the functional dependency ROOM_TYPE -> PRICE_PER_NIGHT, this isn't in 3NF (ROOM_TYPE is not a candidate key for this table). To meet 3NF we need to decompose it into two tables as follows:

create table room_types (
  room_type        varchar2(10) not null primary key,
  price_per_night  number(5,2) not null
);

create table hotel_rooms_3nf (
  hotel_room_id  integer not null primary key,
  hotel_name     varchar2(100) not null,
  room_number    integer not null,
  room_type      varchar2(10) not null
                 references room_types (room_type),
  constraint plch_horo3_hotel_room_num_u
    unique (hotel_name, room_number)
);

We now have two tables. This means that in order to return all the prices for all the rooms, we need to join them together. If the “denormalization is good for performance” hypothesis is true, then we should see queries against the 2NF schema perform better.
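For example, to return every room with its price from the 3NF schema we now need a query along these lines (the same join shape we'll benchmark below):

select r.*, t.price_per_night
from   hotel_rooms_3nf r
join   room_types t
on     r.room_type = t.room_type;

Whether that extra join costs anything meaningful is exactly what we're about to measure.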

Let’s try this out. First up, let’s insert some test data into the tables. I’m going to insert 100 hotels, each with 100 rooms giving 10,000 rows in total for the HOTEL_ROOMS* tables:

insert into room_types (room_type, price_per_night)
values ('SINGLE', 40);
insert into room_types (room_type, price_per_night)
values ('TWIN', 50);
insert into room_types (room_type, price_per_night)
values ('DOUBLE', 60);
insert into room_types (room_type, price_per_night)
values ('SUITE', 100);
commit;

exec dbms_random.seed(1);

insert into hotel_rooms_2nf
with rooms as (
  select rownum room_num,
         case when rownum < 20 then 'SINGLE'
              when rownum < 40 then 'TWIN'
              when rownum < 90 then 'DOUBLE'
              else 'SUITE'
         end room_type
  from   dual
  connect by level <= 100
), hotels as (
  select dbms_random.string('a', 20) hotel_name
  from   dual
  connect by level <= 100
)
select rownum, hotel_name, room_num, room_type,
       case room_type
         when 'SINGLE' then 40
         when 'TWIN'   then 50
         when 'DOUBLE' then 60
         when 'SUITE'  then 100
       end
from   rooms, hotels;

exec dbms_random.seed(1);

insert into hotel_rooms_3nf
with rooms as (
  select rownum room_num,
         case when rownum < 20 then 'SINGLE'
              when rownum < 40 then 'TWIN'
              when rownum < 90 then 'DOUBLE'
              else 'SUITE'
         end room_type
  from   dual
  connect by level <= 100
), hotels as (
  select dbms_random.string('a', 20) hotel_name
  from   dual
  connect by level <= 100
)
select rownum, hotel_name, room_num, room_type
from   rooms, hotels;

commit;

exec dbms_stats.gather_table_stats(user, 'hotel_rooms_3nf');
exec dbms_stats.gather_table_stats(user, 'hotel_rooms_2nf');
exec dbms_stats.gather_table_stats(user, 'room_types');

Let's stop here for a moment and look at what we have:

  • The 2NF version has 10,000 rows in total
  • The 3NF version has 10,000 rows in the HOTEL_ROOMS_3NF table + 4 in ROOM_TYPES, giving 10,004 rows in total
  • The PRICE_PER_NIGHT column is repeated 10,000 times in our 2NF schema, but only 4 times in our 3NF schema
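If you want to verify those counts for yourself, a quick sanity check does the job:

-- expect 10,000 rows in each rooms table and 4 room types
select count(*) from hotel_rooms_2nf;
select count(*) from hotel_rooms_3nf;
select count(*) from room_types;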

What difference does this make?

If we look at the storage requirements, we can see this repeating of the PRICE_PER_NIGHT column leads to a greater space consumption for the 2NF schema:

select table_name, blocks, avg_row_len
from   user_tables
where  table_name like '%ROOM%'
order  by 1;

TABLE_NAME                         BLOCKS AVG_ROW_LEN
------------------------------ ---------- -----------
HOTEL_ROOMS_2NF                        65          37
HOTEL_ROOMS_3NF                        58          34
ROOM_TYPES                              5           9

A slightly higher average row length (by three bytes) means the 2NF schema requires more storage blocks. Currently the difference is only two blocks: 65 for the 2NF schema vs 63 (58 + 5) for the 3NF schema. The average row length for our 3NF rooms table is smaller though, so as more rooms are added the space savings will grow (provided no more room types are created; we'd need to add many more before the ROOM_TYPES table requires additional blocks).
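As an aside, if you'd rather check the space actually allocated than the block counts from the optimizer statistics, you can query USER_SEGMENTS (just a sanity check; this wasn't part of the original test):

select segment_name, blocks, bytes
from   user_segments
where  segment_name like '%ROOM%'
order  by 1;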

Now let's move onto some queries. First up, what's the difference in work performed (consistent gets) when looking up a single row by primary key?

Here we go:

CHRIS@virtual > select *
2 from hotel_rooms_2nf
3 where hotel_room_id = 1;

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
3 consistent gets
0 physical reads
0 redo size
608 bytes sent via SQL*Net to client
407 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

CHRIS@virtual >
CHRIS@virtual > select r.*, t.price_per_night
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 where hotel_room_id = 1;

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
5 consistent gets
0 physical reads
0 redo size
693 bytes sent via SQL*Net to client
425 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

That's three consistent gets for the 2NF schema and five for the 3NF schema. So far we have a slight advantage to the denormalized approach. Our 3NF schema has to do an extra unique index scan and table access, so two extra consistent gets is the minimum penalty you could get away with.

Primary key lookup is just one use case however. Just as important is search: finding all the rooms at a given price. In this case we'll look for just the £40 rooms (the singles):

CHRIS@>select *
2 from hotel_rooms_2nf
3 where price_per_night = 40;

1900 rows selected.

Statistics
----------------------------------------------------
0 recursive calls
0 db block gets
193 consistent gets
0 physical reads
0 redo size
74017 bytes sent via SQL*Net to client
1762 bytes received via SQL*Net from client
128 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1900 rows processed

CHRIS@>
CHRIS@>select r.*, t.price_per_night
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 where t.price_per_night = 40;

1900 rows selected.

Statistics
----------------------------------------------------
0 recursive calls
0 db block gets
192 consistent gets
0 physical reads
0 redo size
81672 bytes sent via SQL*Net to client
1762 bytes received via SQL*Net from client
128 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1900 rows processed

Now the difference has gone the other way: 193 consistent gets (2NF) vs 192 (3NF). A small advantage to the normalized schema, though the overall cost is higher than for our primary key lookups. What if we (more likely) want to find all the rooms less than or equal to a given price? Let's also increase our maximum price to £50:

CHRIS@>select *
2 from hotel_rooms_2nf
3 where price_per_night <= 50;

3900 rows selected.

Statistics
-----------------------------------------------------
0 recursive calls
0 db block gets
327 consistent gets
0 physical reads
0 redo size
151461 bytes sent via SQL*Net to client
3225 bytes received via SQL*Net from client
261 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3900 rows processed

CHRIS@>
CHRIS@>select r.*, t.price_per_night
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 where t.price_per_night <= 50;

3900 rows selected.

Statistics
-----------------------------------------------------
0 recursive calls
0 db block gets
325 consistent gets
0 physical reads
0 redo size
163171 bytes sent via SQL*Net to client
3225 bytes received via SQL*Net from client
261 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3900 rows processed

As we're fetching more data, the gap is starting to grow: 327 consistent gets (2NF) vs 325 (3NF). The difference (two consistent gets) is the same size as for the primary key lookup, just in the other direction, so "normalized" vs. "denormalized" is basically a tie at this point.

The observant among you may point out that I've not created any indexes yet. We're querying on the PRICE_PER_NIGHT column(s), so let's index these columns and re-run the less-than-fifty-quid queries:

create index horo2_price_i on hotel_rooms_2nf ( price_per_night );
create index roty_price_i on room_types ( price_per_night );

CHRIS@>select *
2 from hotel_rooms_2nf
3 where price_per_night <= 50;

3900 rows selected.

Statistics
----------------------------------------------------
0 recursive calls
0 db block gets
327 consistent gets
0 physical reads
0 redo size
151461 bytes sent via SQL*Net to client
3225 bytes received via SQL*Net from client
261 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3900 rows processed

CHRIS@>
CHRIS@>select r.*, t.price_per_night
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 where t.price_per_night <= 50;

3900 rows selected.

Statistics
----------------------------------------------------
0 recursive calls
0 db block gets
321 consistent gets
0 physical reads
0 redo size
163171 bytes sent via SQL*Net to client
3225 bytes received via SQL*Net from client
261 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3900 rows processed

The indexes haven't made a difference to the 2NF schema, but have saved us another 4 consistent gets in our 3NF schema. The advantage for normalization grows!

Why hasn't the index helped our 2NF schema? I won't discuss it in detail here; hopefully this post will help you understand why.
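If you want to investigate yourself, the execution plan is the place to start. A standard explain plan/DBMS_XPLAN call like the following shows whether Oracle used the new index at all:

explain plan for
select * from hotel_rooms_2nf
where  price_per_night <= 50;

select * from table(dbms_xplan.display);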

So far we've just looked at a third normal form violation. We can extend this example to include a breach of second normal form and see what effect this has on performance.

To do this we'll include details about the hotels: the city and star rating. Placing these directly on our 2NF table leaves it in first normal form only (they're functionally dependent on HOTEL_NAME, which is just part of the (HOTEL_NAME, ROOM_NUMBER) key), giving us a HOTEL_ROOMS_1NF table:

create table hotel_rooms_1nf (
  hotel_room_id      integer not null primary key,
  hotel_name         varchar2(100) not null,
  hotel_city         varchar2(100) not null,
  hotel_star_rating  integer not null,
  room_number        integer not null,
  room_type          varchar2(10) not null,
  price_per_night    number(5,2) not null,
  constraint horo1_hotel_room_num_u
    unique (hotel_name, room_number)
);

For the 3NF schema we'll create a new HOTELS table to store these attributes. I'll also replace the HOTEL_NAME in HOTEL_ROOMS_3NF with a surrogate key for the new table:

drop table hotel_rooms_3nf;

create table hotels (
  hotel_id     integer not null primary key,
  hotel_name   varchar2(100) not null,
  hotel_city   varchar2(100) not null,
  star_rating  integer not null
);

create table hotel_rooms_3nf (
  hotel_room_id  integer not null primary key,
  hotel_id       integer not null
                 references hotels (hotel_id),
  room_number    integer not null,
  room_type      varchar2(10) not null
                 references room_types (room_type),
  constraint horo3_hotel_room_num_u
    unique (hotel_id, room_number)
);

Let's insert the same 10,000 hotel rooms we had for the 2NF schema with the new fields populated:

exec dbms_random.seed(1);

insert into hotel_rooms_1nf
with rooms as (
  select rownum room_num,
         case when rownum < 20 then 'SINGLE'
              when rownum < 40 then 'TWIN'
              when rownum < 90 then 'DOUBLE'
              else 'SUITE'
         end room_type
  from   dual
  connect by level <= 100
), hotels as (
  select dbms_random.string('a', 20) hotel_name,
         case when rownum <= 40 then 'LONDON'
              when rownum <= 60 then 'BIRMINGHAM'
              when rownum <= 70 then 'MANCHESTER'
              when rownum <= 80 then 'EDINBURGH'
              when rownum <= 90 then 'NEWCASTLE'
              else 'BRISTOL'
         end city,
         mod(rownum, 3) + 3 star
  from   dual
  connect by level <= 100
)
select rownum, hotel_name, city, star, room_num, room_type,
       case room_type
         when 'SINGLE' then 40
         when 'TWIN'   then 50
         when 'DOUBLE' then 60
         when 'SUITE'  then 100
       end
from   rooms, hotels;

exec dbms_random.seed(1);

insert into hotels
select rownum,
       dbms_random.string('a', 20) hotel_name,
       case when rownum <= 40 then 'LONDON'
            when rownum <= 60 then 'BIRMINGHAM'
            when rownum <= 70 then 'MANCHESTER'
            when rownum <= 80 then 'EDINBURGH'
            when rownum <= 90 then 'NEWCASTLE'
            else 'BRISTOL'
       end city,
       mod(rownum, 3) + 3 star
from   dual
connect by level <= 100;

insert into hotel_rooms_3nf
with rooms as (
  select rownum room_num,
         case when rownum < 20 then 'SINGLE'
              when rownum < 40 then 'TWIN'
              when rownum < 90 then 'DOUBLE'
              else 'SUITE'
         end room_type
  from   dual
  connect by level <= 100
)
select rownum, hotel_id, room_num, room_type
from   rooms, hotels;

commit;

exec dbms_stats.gather_table_stats(user, 'hotel_rooms_3nf');
exec dbms_stats.gather_table_stats(user, 'hotel_rooms_1nf');
exec dbms_stats.gather_table_stats(user, 'hotels');

What's the difference in size between these tables now? Let's take a look:

select table_name, blocks, avg_row_len
from   user_tables
where  table_name like '%ROOM%'
or     table_name like '%HOTEL%'
order  by 1;

TABLE_NAME                         BLOCKS AVG_ROW_LEN
------------------------------ ---------- -----------
HOTELS                                  5          36
HOTEL_ROOMS_1NF                        80          49
HOTEL_ROOMS_2NF                        65          37
HOTEL_ROOMS_3NF                        35          16
ROOM_TYPES                              5           9

Wow, moving the HOTEL_NAME from the 3NF rooms table to its own HOTELS table has nearly halved its size! Despite adding two more attributes to our schema, we've actually managed to reduce the total blocks it requires by nearly half.

Let's now repeat the primary key lookup of a room we did originally, including the extra information we're now storing:

CHRIS@virtual > select *
2 from hotel_rooms_1nf
3 where hotel_room_id = 1;

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3 consistent gets
0 physical reads
0 redo size
745 bytes sent via SQL*Net to client
407 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

CHRIS@virtual >
CHRIS@virtual > select r.*, t.price_per_night, h.hotel_city, h.star_rating
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 join hotels h
6 on r.hotel_id = h.hotel_id
7 where hotel_room_id = 1;

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
7 consistent gets
0 physical reads
0 redo size
804 bytes sent via SQL*Net to client
425 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

The primary key lookup on the 1NF table still does the same amount of work the 2NF version did: three consistent gets. We've now added another two consistent gets to the 3NF schema, bringing its total to seven. It appears joins are making our queries more expensive!

Let's look at our search example again. This time we'll include hotel details in our query. We only want to stay in hotels in London with a star rating of four or more, costing no more than £50/night (I doubt we'd find anything matching this in the real world ;):

CHRIS@virtual > select *
2 from hotel_rooms_1nf
3 where price_per_night <= 50
4 and hotel_city = 'LONDON'
5 and hotel_star_rating >= 4;

1053 rows selected.

Statistics
-----------------------------------------------------
0 recursive calls
0 db block gets
153 consistent gets
0 physical reads
0 redo size
59064 bytes sent via SQL*Net to client
1685 bytes received via SQL*Net from client
72 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1053 rows processed

CHRIS@virtual >
CHRIS@virtual > select r.*, t.price_per_night
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 join hotels h
6 on h.hotel_id = r.hotel_id
7 where t.price_per_night <= 50
8 and h.hotel_city = 'LONDON'
9 and h.star_rating >= 4;

1053 rows selected.

Statistics
-----------------------------------------------------
0 recursive calls
0 db block gets
116 consistent gets
0 physical reads
0 redo size
29451 bytes sent via SQL*Net to client
1685 bytes received via SQL*Net from client
72 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1053 rows processed

Now the advantage for the third normal form schema is really starting to stack up: 116 vs 153 consistent gets, around 25% less work. Again, we don't have any indexes on the columns we're searching. Let's add indexes on (HOTEL_CITY, STAR_RATING) and see what effect they have:

create index horo1_city_rating_i on hotel_rooms_1nf (hotel_city, hotel_star_rating);
create index hote_city_rating_i on hotels (hotel_city, star_rating);

CHRIS@virtual > select hotel_room_id, hotel_name, hotel_city, hotel_star_rating,
2 room_number, room_type, price_per_night
3 from hotel_rooms_1nf
4 where price_per_night <= 50
5 and hotel_city = 'LONDON'
6 and hotel_star_rating >= 4;

1053 rows selected.

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
153 consistent gets
0 physical reads
0 redo size
59064 bytes sent via SQL*Net to client
1685 bytes received via SQL*Net from client
72 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1053 rows processed

CHRIS@virtual >
CHRIS@virtual > select r.hotel_room_id, h.hotel_name, h.hotel_city, h.star_rating,
2 r.room_number, r.room_type, t.price_per_night
3 from hotel_rooms_3nf r
4 join room_types t
5 on r.room_type = t.room_type
6 join hotels h
7 on h.hotel_id = r.hotel_id
8 where t.price_per_night <= 50
9 and h.hotel_city = 'LONDON'
10 and h.star_rating >= 4;

1053 rows selected.

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
112 consistent gets
0 physical reads
0 redo size
59058 bytes sent via SQL*Net to client
1685 bytes received via SQL*Net from client
72 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1053 rows processed

Again the indexes have made no impact on our denormalized (1NF) table, but they've saved another four consistent gets in our 3NF schema.

What else can we look at? What if we just want the names of all the four star hotels in London, without any price information?

The first thing to notice is that we have to apply DISTINCT to the results from the 1NF schema. In the 3NF schema, all those joins we were so concerned about are gone entirely; we just query the HOTELS table.

How does this affect performance? Let's see:

CHRIS@virtual > select distinct hotel_name, hotel_city, hotel_star_rating
2 from hotel_rooms_1nf
3 where hotel_city = 'LONDON'
4 and hotel_star_rating >= 4;

27 rows selected.

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
83 consistent gets
0 physical reads
0 redo size
1531 bytes sent via SQL*Net to client
443 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
27 rows processed

CHRIS@virtual >
CHRIS@virtual > select h.hotel_name, h.hotel_city, h.star_rating
2 from hotels h
3 where h.hotel_city = 'LONDON'
4 and h.star_rating >= 4;

27 rows selected.

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
6 consistent gets
0 physical reads
0 redo size
1525 bytes sent via SQL*Net to client
443 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
27 rows processed

That's a slam dunk win for our normalized schema - a mere 6 consistent gets vs 83 for our denormalized table.

Let's recap what we've learned so far:

  • The denormalized schema consumed more storage. As the level of normalization dropped, the storage requirements increased. Conversely, the storage for our third normal form schema actually decreased
  • Lookups by primary key were slightly more efficient for the denormalized schema
  • "Search" queries were more efficient for the normalized schema. When searches could be limited to just some of the tables in the schema this advantage was spectacular

So is denormalization for performance a bad idea?

The correct answer is of course "it depends". If the vast majority of your application's queries are primary key lookups (or can be engineered to be so), then the performance gains may be worth it. If you have "complex" search queries this is less likely to be the case and the denormalized schema may perform substantially worse. It'll also depend upon the nature of your data; we've just worked through one example, which will differ from whatever you're working on in many ways. You need to test for yourself.

As Martin Preiss pointed out in the comments of my previous post, the "joins are bad" brigade may have come from a MySQL background. MySQL only supports nested loop joins. The later "search" queries we ran used hash joins; when forced to use nested loops, the 3NF schema performed spectacularly worse (I'll leave proving this as an exercise for the reader ;). So the environment you use may also determine your normalization strategy (or convince you to use a better database! ;)
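If you'd like to try that experiment, hinting the 3NF search query along these lines forces nested loop joins (just a sketch; the join order you force will affect the figures you get):

select /*+ leading(t r h) use_nl(r) use_nl(h) */
       r.hotel_room_id, h.hotel_name, h.hotel_city, h.star_rating,
       r.room_number, r.room_type, t.price_per_night
from   hotel_rooms_3nf r
join   room_types t
on     r.room_type = t.room_type
join   hotels h
on     h.hotel_id = r.hotel_id
where  t.price_per_night <= 50
and    h.hotel_city = 'LONDON'
and    h.star_rating >= 4;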

This is the real point I want to make: denormalizing may help performance in your situation. You need to verify this for your schema and environment, however, rather than relying on generic arguments such as "joins are bad, therefore we must denormalize"!

The observant among you may have noticed that we've missed an important use case - data modification. How does denormalization affect update performance? I'll discuss this in my next article.

A footnote on my environment, for those who want to replicate my results:

All tests were run on Oracle Enterprise Edition, 11.2.0.2. Tables were created with an 8k blocksize. The tests were run in SQL*Plus with arraysize set to 15 and "set autotrace trace" to get the performance figures - the second execution of each query was reported to ensure everything was appropriately parsed, cached, etc.
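For reference, that setup looks like this in SQL*Plus (the statistics-only output shown throughout corresponds to the trace statistics option of autotrace):

set arraysize 15
set autotrace trace statistics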
