Apr 11 2014
 

Over the past few articles we’ve looked at database joins. It started out with me noticing that joins appear to be getting bad press recently and wondering whether they really are the root of all evil that some people seem to think they are.

We’ve seen that denormalizing removes joins, giving performance benefits to primary key lookups. This benefit can come at the cost of non-primary key lookup queries though – often the extra work for these “search” queries outweighs the gains made for the primary key lookups.

Query performance (particularly for “small” systems) is just part of the story though. The real cost of removing joins is the impact on data modification. Aside from (potentially) vastly increasing the number of records we must update to keep data consistent, denormalization can introduce waiting issues for concurrent updates. In the worst case we may introduce application bugs in the form of deadlocks.

We’ve seen there are some cases where joins do result in inefficient queries. These are usually the result of poor design decisions or limitations of the optimizer. Better design and use of Oracle features can overcome the worst of many of these however.

I hope you’ve found this series (and blog as a whole!) useful and informative. If there’s anything else you’d like to see on the subject of joins, get in touch or say so in the comments.

I’m going to take a break from regular blogging for a while to work on other projects. Thanks to those of you who’ve been following. If you’d like to receive posts when I start again, just enter your email address in the form below!

Apr 07 2014
 

So far in the joins series we’ve looked at the effect removing joins (via denormalization) has on performance. We’ve seen that joins can cause primary key lookups to do more work. Lowering the normalization level to remove these can negatively impact “search” style queries though. More importantly, we’ve seen the real cost of denormalizing to remove joins is when updating records, potentially leading to concurrency waits and application bugs.

So are joins always “good”?

The fastest way to do anything is to not do it at all. If joins aren’t necessary to answer your queries, including them will add some overhead. Also, like any tool, there are situations where adding a join may substantially slow your query down.

Here are some examples where joins may be “expensive”, along with strategies for coping with them.

Missing Indexes

If there are no indexes on the column(s) joined in a query then this is going to lead to a full table scan of at least one of the tables. If both tables are “large”, this is going to be expensive.

I’m not going to go into much more detail as it should be fairly obvious that a full-table scan of a million row plus table is going to take longer than an index lookup (assuming the join is expected to fetch just a few rows).

Fortunately it’s easily fixed – create those indexes!
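For example, if a query joins a (hypothetical) ORDERS table to CUSTOMERS on CUSTOMER_ID, the primary key index on CUSTOMERS usually already covers one side; it's the foreign key column on ORDERS that tends to need an explicit index. A minimal sketch:

-- Hypothetical tables; CUSTOMERS.CUSTOMER_ID is assumed to be its primary key.
-- Indexing the join column on the child table lets the join use index lookups
-- instead of full scanning ORDERS.
create index orde_customer_id_i on orders (customer_id);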

Multi-column Joins

Often joins involve two or more columns. If there’s a correlation between the values in the join columns (e.g. birthday and star sign) the optimizer can get the estimates horribly wrong.

For a stark example, I’ll create two tables. These will store the values 1-100 one hundred times each. For each record in each table, both columns will have the same number:

create table tab1 (
col1 integer, col2 integer, filler varchar2(100)
);
create table tab2 (
col3 integer, col4 integer, filler varchar2(100)
);

insert into tab1
with rws as (select rownum r from dual connect by level <= 100)
select r1.*, r1.*, dbms_random.string('a', 20)
from rws r1, rws r2;

insert into tab2
with rws as (select rownum r from dual connect by level <= 100)
select r1.*, r1.*, dbms_random.string('a', 20)
from rws r1, rws r2;

create index i1 on tab1 (col1, col2);
create index i2 on tab2 (col3, col4);

select col1, col2 from tab1;

COL1 COL2
---- ----
1 1
1 1
1 1
...
100 100
100 100

So if we now join TAB1 to TAB2 on COL1 = COL3 and COL2 = COL4, we'll get 100 rows when restricting to a particular value (e.g. COL1 = 1).

What does the optimizer think though?

explain plan for
select *
from tab1
join tab2
on col1 = col3
and col2 = col4
where col1 = 1;

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-----------------------------------------------------
| Id | Operation | Name | Rows |
-----------------------------------------------------
| 0 | SELECT STATEMENT | | 156 |
| 1 | HASH JOIN | | 156 |
| 2 | TABLE ACCESS BY INDEX ROWID| TAB1 | 100 |
| 3 | INDEX RANGE SCAN | I1 | 100 |
| 4 | TABLE ACCESS BY INDEX ROWID| TAB2 | 100 |
| 5 | INDEX RANGE SCAN | I2 | 100 |
-----------------------------------------------------

156 rows, reasonably close. It's a big enough margin of error that a sub-optimal plan could be chosen in some circumstances though.

What if we add another condition, such that our query won't return any rows?

We know that the values in COL1 are always the same as those in COL2. If we add a predicate stating that COL2 = 2 to the query above no rows will be returned.

What does the optimizer think?

explain plan for
select *
from tab1
join tab2
on col1 = col3
and col2 = col4
where col1 = 1
and col2 = 2;

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-----------------------------------------------------
| Id | Operation | Name | Rows |
-----------------------------------------------------
| 0 | SELECT STATEMENT | | 10000 |
| 1 | HASH JOIN | | 10000 |
| 2 | TABLE ACCESS BY INDEX ROWID| TAB1 | 100 |
| 3 | INDEX RANGE SCAN | I1 | 100 |
| 4 | TABLE ACCESS BY INDEX ROWID| TAB2 | 100 |
| 5 | INDEX RANGE SCAN | I2 | 100 |
-----------------------------------------------------

Yikes! An estimate of 10,000 rows for a query that won't return anything! This could lead to some unnecessary full table scans.

Fortunately, as of 11g, there's a way around this. We can create multi-column stats, allowing Oracle to identify the correlation between the values in the columns.

Let's create these and see what effect it has on the plan estimates:

exec dbms_stats.gather_table_stats(user, 'tab1', method_opt => 'for columns (col1, col2)');
exec dbms_stats.gather_table_stats(user, 'tab2', method_opt => 'for columns (col3, col4)');
exec dbms_stats.gather_table_stats(user, 'tab1');
exec dbms_stats.gather_table_stats(user, 'tab2');

explain plan for
select *
from tab1
join tab2
on col1 = col3
and col2 = col4
where col1 = 1;

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-----------------------------------------------------
| Id | Operation | Name | Rows |
-----------------------------------------------------
| 0 | SELECT STATEMENT | | 100 |
| 1 | HASH JOIN | | 100 |
| 2 | TABLE ACCESS BY INDEX ROWID| TAB1 | 100 |
| 3 | INDEX RANGE SCAN | I1 | 100 |
| 4 | TABLE ACCESS BY INDEX ROWID| TAB2 | 100 |
| 5 | INDEX RANGE SCAN | I2 | 100 |
-----------------------------------------------------

explain plan for
select *
from tab1
join tab2
on col1 = col3
and col2 = col4
where col1 = 1
and col2 = 2;

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

------------------------------------------------------
| Id | Operation | Name | Rows |
------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | MERGE JOIN CARTESIAN | | 1 |
| 2 | TABLE ACCESS BY INDEX ROWID | TAB1 | 1 |
| 3 | INDEX RANGE SCAN | I1 | 1 |
| 4 | BUFFER SORT | | 1 |
| 5 | TABLE ACCESS BY INDEX ROWID| TAB2 | 1 |
| 6 | INDEX RANGE SCAN | I2 | 1 |
------------------------------------------------------

A big improvement - the optimizer is bang on for the first query (100 rows) and only off-by-one for the second. This simple change could make a big difference to the execution of some joins.
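If you prefer, the same column groups can be created explicitly with dbms_stats.create_extended_stats rather than via the method_opt calls above – a sketch that should give an equivalent result:

-- Create the column group extensions explicitly (the function returns the
-- generated extension name), then gather stats as normal.
select dbms_stats.create_extended_stats(user, 'TAB1', '(COL1, COL2)') from dual;
select dbms_stats.create_extended_stats(user, 'TAB2', '(COL3, COL4)') from dual;

exec dbms_stats.gather_table_stats(user, 'tab1');
exec dbms_stats.gather_table_stats(user, 'tab2');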

For further reading on multi-column stats, have a read of Maria Colgan's article on extended stats.

Inexact Joins

Often tables are created with effective from/to dates to enable users to view data as of a point-in-time in the past (e.g. slowly changing dimensions, history tables).

It can be necessary to join another table with a timestamp to a point-in-time table by stating that the timestamp (in T1) falls between the effective dates (in T2). Unfortunately in this case the optimizer can get the cardinality estimates horribly wrong.

Let's look at an example. TAB1 has a record for each day for 500 days. TAB2 is a related table with effective from/to dates covering the same 500 days. Records in TAB2 span a duration of 10 days, giving 50 rows:

drop table tab1;
drop table tab2;
create table tab1 (
  t1_id      integer not null,
  time_stamp date not null
);

create table tab2 (
  t2_id     integer not null,
  from_date date not null,
  to_date   date not null
);

insert into tab1
select rownum,
       date'2013-01-01'+rownum-1
from   dual
connect by level <= 500;

insert into tab2
select rownum,
       date'2013-01-01'+(rownum-1)*10,
       date'2013-01-01'+(rownum)*10
from   dual
connect by level <= 50;

commit;

exec dbms_stats.gather_table_stats(user, 'tab1');
exec dbms_stats.gather_table_stats(user, 'tab2');

If we want to find the record in TAB2 effective on a particular day and all the associated records in TAB1, we can put together a query like this:

select *
from tab1 t1
join tab2 t2
on t2.from_date <= t1.time_stamp
and t2.to_date > t1.time_stamp
where t2.from_date <= date'2014-01-01'
and t2.to_date > date'2014-01-01';

This returns 10 rows as expected. What does the optimizer think?

explain plan for
select *
from tab1 t1
join tab2 t2
on t2.from_date <= t1.time_stamp
and t2.to_date > t1.time_stamp
where t2.from_date <= date'2014-01-01'
and t2.to_date > date'2014-01-01';

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

---------------------------------------------
| Id | Operation | Name | Rows |
---------------------------------------------
| 0 | SELECT STATEMENT | | 2893 |
| 1 | MERGE JOIN | | 2893 |
| 2 | SORT JOIN | | 11 |
| 3 | TABLE ACCESS FULL | TAB2 | 11 |
| 4 | FILTER | | |
| 5 | SORT JOIN | | 500 |
| 6 | TABLE ACCESS FULL| TAB1 | 500 |
---------------------------------------------

2,893 rows, ouch! Unfortunately, multi-column stats on (from_date, to_date) don't help us here:

exec dbms_stats.gather_table_stats(user, 'tab2', method_opt => 'for columns (from_date, to_date)');
exec dbms_stats.gather_table_stats(user, 'tab2');

explain plan for
select *
from tab1 t1
join tab2 t2
on t2.from_date <= t1.time_stamp
and t2.to_date > t1.time_stamp
where t2.from_date <= date'2014-01-01'
and t2.to_date > date'2014-01-01';

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

---------------------------------------------
| Id | Operation | Name | Rows |
---------------------------------------------
| 0 | SELECT STATEMENT | | 2893 |
| 1 | MERGE JOIN | | 2893 |
| 2 | SORT JOIN | | 11 |
| 3 | TABLE ACCESS FULL | TAB2 | 11 |
| 4 | FILTER | | |
| 5 | SORT JOIN | | 500 |
| 6 | TABLE ACCESS FULL| TAB1 | 500 |
---------------------------------------------

The best way to resolve this is to convert the join to use equalities rather than inequalities. As there's a 1:M link from TAB2:TAB1, we could add the associated TAB2 id to TAB1:

alter table tab1 add (t2_id integer);

update tab1
set    t2_id = (select t2_id
                from   tab2
                where  time_stamp >= from_date
                and    time_stamp < to_date);

commit;

exec dbms_stats.gather_table_stats(user, 'tab1');

explain plan for
select *
from tab1 t1
join tab2 t2
on t1.t2_id = t2.t2_id
where t2.from_date <= date'2014-01-01'
and t2.to_date > date'2014-01-01';

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-------------------------------------------
| Id | Operation | Name | Rows |
-------------------------------------------
| 0 | SELECT STATEMENT | | 105 |
| 1 | HASH JOIN | | 105 |
| 2 | TABLE ACCESS FULL| TAB2 | 11 |
| 3 | TABLE ACCESS FULL| TAB1 | 500 |
-------------------------------------------

The cardinality estimate is still out by a factor of 10 (105 estimated vs. 10 actual), but it's significantly better than our original estimate of 2,893. This could be enough of a difference to enable the optimizer to pick a better execution plan.

Conclusion

We've looked at some examples where joins could be "bad". These have all been caused by poor design or deficiencies in the optimizer, leading to sub-optimal execution plans being chosen.

By creating appropriate indexes, giving the optimizer the best information available and making joins use equalities (=) rather than inequalities (<, >=, etc.), we can overcome many of the issues that cause poor join performance.

These are just a few examples of limitations of joins - if you have any others then please share them in the comments!

Apr 04 2014
 

In the previous article in the joins series we compared query performance between a third normal form schema and the same schema denormalized to second normal form. We then extended the example so our denormalized schema was in just first normal form.

The normalized approach performed better overall. The differences were small though – generally just a few consistent gets, and all the queries executed in under a second. As Jeff Atwood points out, hardware is powerful enough that for most systems the performance difference between more normalized and less normalized schemas won’t matter much to your queries. I disagree slightly with one of his conclusions though – that normalization “doesn’t matter”.

To see why, let’s look at a different use-case: data modification. While this isn’t to do with joins directly, it is a very important consideration when deciding to denormalize to “remove joins”.

I’m not going to get into the details of the update performance – there’s a much bigger problem waiting for us. Can you spot it in the examples below?

Let’s imagine there are two users of our hotel application, one managing the hotel details (name, star rating, etc.) and the other managing prices. Using the schemas we built in the previous article, these operations are both single row updates in the 3NF version:

update hotels
set star_rating = 5
where hotel_name = 'rOqKNXFchjvblsjYomFl';

1 rows updated.

update room_types
set price_per_night = 60
where room_type = 'DOUBLE';

1 rows updated.

Just one row updated in each case. Let’s have a look at the equivalent updates in our 1NF schema:

update hotel_rooms_1nf
set hotel_star_rating = 5
where hotel_name = 'rOqKNXFchjvblsjYomFl';

100 rows updated.

update hotel_rooms_1nf
set price_per_night = 60
where room_type = 'DOUBLE';

5,000 rows updated.

First thing to notice is that we’ve updated a lot more rows (5,100 vs just 2). I’m not going to get into the differences in runtime or consistent gets – there’s a bigger problem waiting for us.

What happens if these updates are done concurrently? Let’s update the hotel details in session one and the prices in session two:

--session 1
update hotel_rooms_1nf
set hotel_star_rating = 5
where hotel_name = 'rOqKNXFchjvblsjYomFl';

-- session 2 - this will be stuck waiting
update hotel_rooms_1nf
set price_per_night = 60
where room_type = 'DOUBLE';

We’ve now got concurrency waits! Our second session is blocked waiting for the first to commit. Denormalizing has introduced something which can genuinely slow our application down. If there’s a large number of concurrent updates then this could bring your application grinding to a halt.

It gets worse.

What if, back in session one, we now update the star rating for another hotel?

-- back in session 1
update hotel_rooms_1nf
set hotel_star_rating = 4
where hotel_name = 'rgfKtpWGpAQmjTsTFXIq';

-- in session 2 again:
SQL Error: ORA-00060: deadlock detected while waiting for resource

Deadlock! We’ve now got a bug in our application :(

So what are the ways around this issue? Here’s a few I can think of:

  • Live with it
  • Accept inconsistent data
  • Introduce session locks to prevent concurrent data modification
  • Normalize your schema

Let’s inspect these in more detail:

Live With It

Perhaps the application is insert-only (e.g. it’s a data warehouse) or the rate of (concurrent) updates is very low. In these cases we may be able to get away with the schema as-is and put this down as a “known bug”.

Personally I don’t think this is an acceptable argument for an OLTP application. If users make frequent (concurrent) updates, I doubt they will either.

Accept Inconsistent Data

This is a variation of “live with it”. Rather than do nothing to address the issue, the idea here is to “code around” it – e.g. by making all updates modify only one row at a time and commit immediately.

The big issue with this is that inconsistent data is visible to clients while an update loop is in progress. Users are likely to get confused if a hotel is showing as 4 star for some rooms but 5 star for others.

Perhaps we can find a way around this, e.g. by including a last_updated timestamp and using that to resolve conflicts. This makes the application code complex and won’t necessarily fix the deadlock issue!

This is possibly the worst approach to take.

Introduce Session Locking

Rather than rely on Oracle’s built-in locking, we could roll our own serialization using lock table, dbms_lock or application level locks. This would take out an application level lock before executing any update against the HOTEL_ROOMS_1NF table.
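For instance, the lock table flavour could look something like this (just a sketch – it serializes every updater of the table until the transaction ends):

-- Take an exclusive table lock before the "wide" update; other sessions
-- attempting the same will queue here rather than deadlocking later.
lock table hotel_rooms_1nf in exclusive mode;

update hotel_rooms_1nf
set    hotel_star_rating = 5
where  hotel_name = 'rOqKNXFchjvblsjYomFl';

commit; -- the commit (or a rollback) releases the table lock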

While this may enable us to avoid the deadlock issue, it means that we won’t have any concurrent updates happening anywhere on the table. At least with “live with it” we’d only be blocking/deadlocking users whose updates modified the same rows.

This solution will avoid the deadlock problem, however. If the rate of concurrent updates is low to non-existent, the increased level of blocking may be acceptable.

This still complicates our code though. If using an approach other than LOCK TABLE, updates bypassing the application (e.g. release scripts, batch jobs) may still cause us issues as well.

Better than the previous two options. Still not acceptable in my opinion.

Normalize Your Schema

The deadlock situation described above is impossible in the 3NF version of this schema. The updates happen on completely separate tables so we’ve got no concurrency issues either.

You may point out that deadlock is still possible in our 3NF schema (e.g. by having session 1 update hotels then prices while session 2 updates prices then hotels). Fixing this should be a matter of making central APIs that always process updates in the same order – far easier than the solutions proposed in the other cases.
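As a sketch of that idea (a hypothetical procedure, not part of the original scripts): any code path that needs to change both hotel details and prices goes through an API that always touches HOTELS before ROOM_TYPES, so two sessions can never take the row locks in opposite orders.

create or replace procedure set_rating_and_price (
  p_hotel_name  hotels.hotel_name%type,
  p_star_rating hotels.star_rating%type,
  p_room_type   room_types.room_type%type,
  p_price       room_types.price_per_night%type
) as
begin
  -- always HOTELS first...
  update hotels
  set    star_rating = p_star_rating
  where  hotel_name = p_hotel_name;

  -- ...then ROOM_TYPES; transaction control stays with the caller
  update room_types
  set    price_per_night = p_price
  where  room_type = p_room_type;
end set_rating_and_price;
/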

I believe this to be by far the best solution :)

If we’re building the application from scratch then the difference in effort between creating the 1NF and 3NF schemas is trivial.

If this is an existing application with hundreds of thousands of records populated it is a different matter. We’ll have to create a migration script, possibly incurring downtime while the migration is done. Best to avoid this situation and just normalize to start with!

So to recap:

  • Denormalized schemas generally perform (slightly) better when doing primary key lookups
  • Normalized schemas generally perform better for other types of queries – sometimes spectacularly so
  • Normalized schemas generally require less storage
  • Normalized schemas affect far fewer records when updating data
  • Concurrency and deadlock issues that could mean the death of a denormalized schema simply aren’t possible in normalized schemas.

So joins are always good?

Not so fast – in the next article we’ll look at some cases where joins are “expensive” and see what we can do about them.

Mar 31 2014
 

Continuing the series on joins, I’m going to look at denormalization. This process reduces the number of joins necessary to return results for a schema.

One of the big arguments against normalizing data is “for performance”. The process of normalization creates new tables as relations are decomposed according to their functional dependencies. This means (more) joins are necessary to return the same results.

A Google of “database normalization performance” turns up several articles like this, this and this, all advocating denormalizing your data to improve performance. There’s no concrete discussion or test cases showing why you should denormalize, just hand-wavy arguments about joins being bad.

I wanted to test this to see if normalizing really makes performance worse. If you’ve been preaching “you must denormalize for performance”, my conclusions may surprise you.

For my first test case, I’m going to borrow a question I did for the database design quiz on the PL/SQL Challenge.

The setup is a pricing schema for hotel rooms. In the quiz room prices are determined by their type (single, twin, double or suite). The 2NF version of the schema is:

create table hotel_rooms_2nf (
  hotel_room_id   integer not null primary key,
  hotel_name      varchar2(100) not null,
  room_number     integer not null,
  room_type       varchar2(10) not null,
  price_per_night number(5,2) not null,
  constraint plch_horo2_hotel_room_num_u
    unique (hotel_name, room_number)
);

Because we have the functional dependency ROOM_TYPE -> PRICE_PER_NIGHT and ROOM_TYPE is not a candidate key for this table, this isn’t in 3NF. To meet 3NF we need to decompose this into two tables as follows:

create table room_types (
  room_type       varchar2(10) not null primary key,
  price_per_night number(5, 2) not null
);

create table hotel_rooms_3nf (
  hotel_room_id integer not null primary key,
  hotel_name    varchar2(100) not null,
  room_number   integer not null,
  room_type     varchar2(10) not null
    references room_types (room_type),
  constraint plch_horo3_hotel_room_num_u
    unique (hotel_name, room_number)
);

We now have two tables. This means that in order to return all the prices for all the rooms, we need to join them together. If the “denormalization is good for performance” hypothesis is true, then we should see queries against the 2NF schema perform better.
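For example, listing every room together with its price now needs a join along these lines (a sketch of the kind of query we'll be timing below):

select r.hotel_name, r.room_number, r.room_type, t.price_per_night
from   hotel_rooms_3nf r
join   room_types t
on     r.room_type = t.room_type;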

Let’s try this out. First up, let’s insert some test data into the tables. I’m going to insert 100 hotels, each with 100 rooms giving 10,000 rows in total for the HOTEL_ROOMS* tables:

insert into room_types (room_type, price_per_night)
values ('SINGLE', 40);
insert into room_types (room_type, price_per_night)
values ('TWIN', 50);
insert into room_types (room_type, price_per_night)
values ('DOUBLE', 60);
insert into room_types (room_type, price_per_night)
values ('SUITE', 100);
commit;

exec dbms_random.seed(1);

insert into hotel_rooms_2nf
with rooms as (select rownum room_num,
case when rownum < 20 then 'SINGLE'
when rownum < 40 then 'TWIN'
when rownum < 90 then 'DOUBLE'
else 'SUITE'
end room_type
from dual
connect by level <= 100),
hotels as (select dbms_random.string('a', 20) hotel_name
from dual
connect by level <= 100)
SELECT rownum, hotel_name, room_num , room_type,
case room_type
when 'SINGLE' then 40
when 'TWIN' then 50
when 'DOUBLE' then 60
when 'SUITE' then 100
end
FROM rooms, hotels;

exec dbms_random.seed(1);

insert into hotel_rooms_3nf
with rooms as (select rownum room_num,
case when rownum < 20 then 'SINGLE'
when rownum < 40 then 'TWIN'
when rownum < 90 then 'DOUBLE'
else 'SUITE'
end room_type
from dual
connect by level <= 100),
hotels as (select dbms_random.string('a', 20) hotel_name
from dual
connect by level <= 100)
SELECT rownum, hotel_name, room_num , room_type
FROM rooms, hotels;

commit;

exec dbms_stats.gather_table_stats(user, 'hotel_rooms_3nf');
exec dbms_stats.gather_table_stats(user, 'hotel_rooms_2nf');
exec dbms_stats.gather_table_stats(user, 'room_types');

Let's stop here for a moment and look at what we have:

  • The 2NF version has 10,000 rows in total
  • The 3NF version has 10,000 rows in the HOTEL_ROOMS_3NF table + 4 in ROOM_TYPES, giving 10,004 rows in total
  • The PRICE_PER_NIGHT column is repeated 10,000 times in our 2NF schema, but only 4 times in our 3NF schema

What difference does this make?

If we look at the storage requirements, we can see this repeating of the PRICE_PER_NIGHT column leads to a greater space consumption for the 2NF schema:

select table_name, blocks, avg_row_len from user_tables
where table_name like '%ROOM%'
order by 1;

TABLE_NAME BLOCKS AVG_ROW_LEN
------------------------------ ---------- -----------
HOTEL_ROOMS_2NF 65 37
HOTEL_ROOMS_3NF 58 34
ROOM_TYPES 5 9

A slightly higher average row length (by three bytes) means the 2NF schema requires more storage blocks. Currently there are only 2 blocks in it between the 2NF and 3NF schemas (65 vs 58 + 5 = 63). The average row length for our 3NF schema is smaller, so as more rooms are added the space savings will grow (providing no more room types are created – we'd need to create many more before the ROOM_TYPES table requires additional database blocks however).

Now let's move on to some queries. First up, what's the difference in work performed (consistent gets) when looking up a single row by primary key?

Here we go:

CHRIS@virtual > select *
2 from hotel_rooms_2nf
3 where hotel_room_id = 1;

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
3 consistent gets
0 physical reads
0 redo size
608 bytes sent via SQL*Net to client
407 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

CHRIS@virtual >
CHRIS@virtual > select r.*, t.price_per_night
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 where hotel_room_id = 1;

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
5 consistent gets
0 physical reads
0 redo size
693 bytes sent via SQL*Net to client
425 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

That's three consistent gets for the 2NF schema and five for the 3NF schema. So far we have a slight advantage to the denormalized approach. Our 3NF schema has to do an extra unique index scan and table access, so two extra consistent gets is the minimum you could get away with.

Primary key lookup is just one use case however. Just as important is search - finding all the rooms at a given price. In this case we'll look for just the £40 rooms (the singles):

CHRIS@>select *
2 from hotel_rooms_2nf
3 where price_per_night = 40;

1900 rows selected.

Statistics
----------------------------------------------------
0 recursive calls
0 db block gets
193 consistent gets
0 physical reads
0 redo size
74017 bytes sent via SQL*Net to client
1762 bytes received via SQL*Net from client
128 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1900 rows processed

CHRIS@>
CHRIS@>select r.*, t.price_per_night
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 where t.price_per_night = 40;

1900 rows selected.

Statistics
----------------------------------------------------
0 recursive calls
0 db block gets
192 consistent gets
0 physical reads
0 redo size
81672 bytes sent via SQL*Net to client
1762 bytes received via SQL*Net from client
128 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1900 rows processed

Now the difference has gone the other way - 193 (2NF) vs 192 (3NF). A small advantage - the overall cost is higher than our primary key lookup though. What if we (more likely) want to find all the rooms less than or equal to a given price? Let's also increase our maximum price to £50:

CHRIS@>select *
2 from hotel_rooms_2nf
3 where price_per_night <= 50;

3900 rows selected.

Statistics
-----------------------------------------------------
0 recursive calls
0 db block gets
327 consistent gets
0 physical reads
0 redo size
151461 bytes sent via SQL*Net to client
3225 bytes received via SQL*Net from client
261 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3900 rows processed

CHRIS@>
CHRIS@>select r.*, t.price_per_night
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 where t.price_per_night <= 50;

3900 rows selected.

Statistics
-----------------------------------------------------
0 recursive calls
0 db block gets
325 consistent gets
0 physical reads
0 redo size
163171 bytes sent via SQL*Net to client
3225 bytes received via SQL*Net from client
261 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3900 rows processed

As we're fetching more data the normalized schema's advantage is starting to grow. The difference (2 consistent gets) is the same size as the denormalized schema's advantage for primary key lookups, so "normalized" vs. "denormalized" is basically a tie at this point.

The observant among you may point out that I've not created any indexes yet. We're querying on the PRICE_PER_NIGHT column(s), so let's index these columns and re-run the less-than-fifty-quid queries:

create index horo2_price_i on hotel_rooms_2nf ( price_per_night );
create index roty_price_i on room_types ( price_per_night );

CHRIS@>select *
2 from hotel_rooms_2nf
3 where price_per_night <= 50;

3900 rows selected.

Statistics
----------------------------------------------------
0 recursive calls
0 db block gets
327 consistent gets
0 physical reads
0 redo size
151461 bytes sent via SQL*Net to client
3225 bytes received via SQL*Net from client
261 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3900 rows processed

CHRIS@>
CHRIS@>select r.*, t.price_per_night
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 where t.price_per_night <= 50;

3900 rows selected.

Statistics
----------------------------------------------------
0 recursive calls
0 db block gets
321 consistent gets
0 physical reads
0 redo size
163171 bytes sent via SQL*Net to client
3225 bytes received via SQL*Net from client
261 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3900 rows processed

The indexes haven't made a difference to the 2NF schema, but have saved us another 4 consistent gets in our 3NF schema. The advantage for normalization grows!

Why hasn't the index helped our 2NF schema? I won't discuss it here; hopefully this post will help you understand why.
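If you want to check what's happening for yourself, looking at the plan is a reasonable start (just a sketch):

explain plan for
select *
from   hotel_rooms_2nf
where  price_per_night <= 50;

select * from table(dbms_xplan.display(null, null, 'BASIC'));

-- With 3,900 of the 10,000 rows matching, visiting them one at a time via the
-- index is typically costlier than a full table scan, so the index is ignored.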

So far we've just looked at a third normal form violation. We can extend this example to include a breach of second normal form and see what effect this has on performance.

To do this we'll include details about the hotels - the city and star rating. Placing these directly on our 2NF table leaves it in just first normal form (they are functionally dependent on part of the (HOTEL_NAME, ROOM_NUMBER) key), giving us a HOTEL_ROOMS_1NF table:

create table hotel_rooms_1nf (
  hotel_room_id     integer not null primary key,
  hotel_name        varchar2(100) not null,
  hotel_city        varchar2(100) not null,
  hotel_star_rating integer not null,
  room_number       integer not null,
  room_type         varchar2(10) not null,
  price_per_night   number(5,2) not null,
  constraint horo1_hotel_room_num_u
    unique (hotel_name, room_number)
);

For the 3NF schema we'll create a new HOTELS table to store these attributes. I'll also replace the HOTEL_NAME in HOTEL_ROOMS_3NF with a surrogate key for the new table:

drop table hotel_rooms_3nf;

create table hotels (
  hotel_id    integer not null primary key,
  hotel_name  varchar2(100) not null,
  hotel_city  varchar2(100) not null,
  star_rating integer not null
);

create table hotel_rooms_3nf (
  hotel_room_id integer not null primary key,
  hotel_id      integer not null
    references hotels (hotel_id),
  room_number   integer not null,
  room_type     varchar2(10) not null
    references room_types (room_type),
  constraint horo3_hotel_room_num_u
    unique (hotel_id, room_number)
);

Let's insert the same 10,000 hotel rooms we had for the 2NF schema with the new fields populated:

exec dbms_random.seed(1);

insert into hotel_rooms_1nf
with rooms as (select rownum room_num,
case when rownum < 20 then 'SINGLE'
when rownum < 40 then 'TWIN'
when rownum < 90 then 'DOUBLE'
else 'SUITE'
end room_type
FROM dual
connect by level <= 100),
hotels as (select dbms_random.string('a', 20) hotel_name,
case
when rownum <= 40 then 'LONDON'
when rownum <= 60 then 'BIRMINGHAM'
when rownum <= 70 then 'MANCHESTER'
when rownum <= 80 then 'EDINBURGH'
when rownum <= 90 then 'NEWCASTLE'
else 'BRISTOL'
end city,
mod(rownum, 3)+3 star
from dual
connect by level <= 100)
SELECT rownum, hotel_name, city, star, room_num , room_type,
case room_type
when 'SINGLE' then 40
when 'TWIN' then 50
when 'DOUBLE' then 60
when 'SUITE' then 100
end
FROM rooms, hotels;

exec dbms_random.seed(1);

insert into hotels
select rownum,
dbms_random.string('a', 20) hotel_name,
case
when rownum <= 40 then 'LONDON'
when rownum <= 60 then 'BIRMINGHAM'
when rownum <= 70 then 'MANCHESTER'
when rownum <= 80 then 'EDINBURGH'
when rownum <= 90 then 'NEWCASTLE'
else 'BRISTOL'
end city,
mod(rownum, 3)+3 star
from dual
connect by level <= 100;

insert into hotel_rooms_3nf
with rooms as (select rownum room_num,
case when rownum < 20 then 'SINGLE'
when rownum < 40 then 'TWIN'
when rownum < 90 then 'DOUBLE'
else 'SUITE'
end room_type
FROM dual
connect by level <= 100)
SELECT rownum, hotel_id, room_num , room_type
FROM rooms, hotels;

commit;

exec dbms_stats.gather_table_stats(user, 'hotel_rooms_3nf');
exec dbms_stats.gather_table_stats(user, 'hotel_rooms_1nf');
exec dbms_stats.gather_table_stats(user, 'hotels');

What's the difference in size between these tables now? Let's take a look:

select table_name, blocks, avg_row_len
from user_tables
where table_name like '%ROOM%' or
table_name like '%HOTEL%'
order by 1;

TABLE_NAME BLOCKS AVG_ROW_LEN
------------------------------ ---------- -----------
HOTELS 5 36
HOTEL_ROOMS_1NF 80 49
HOTEL_ROOMS_2NF 65 37
HOTEL_ROOMS_3NF 35 16
ROOM_TYPES 5 9

Wow, moving the HOTEL_NAME from the 3NF rooms table to its own HOTELS table has nearly halved its size! Despite adding two more attributes to our schema, we've actually managed to reduce the total number of blocks the 3NF version requires.

Let's now repeat the primary key lookup of a room we did originally, including the extra information we're now storing:

CHRIS@virtual > select *
2 from hotel_rooms_1nf
3 where hotel_room_id = 1;

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3 consistent gets
0 physical reads
0 redo size
745 bytes sent via SQL*Net to client
407 bytes received via SQL*Net from client
1 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

CHRIS@virtual >
CHRIS@virtual > select r.*, t.price_per_night, h.hotel_city, h.star_rating
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 join hotels h
6 on r.hotel_id = h.hotel_id
7 where hotel_room_id = 1;

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
7 consistent gets
0 physical reads
0 redo size
804 bytes sent via SQL*Net to client
425 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

The primary key lookup on the 1NF table still does the same amount of work the 2NF version did - three consistent gets. We've now added another two consistent gets to the 3NF schema, totalling seven. It appears joins are making our queries more expensive!

Let's look at our search example again. This time we'll include hotel details in our query. We only want to stay in hotels in London with a star rating of four or more, costing no more than £50/night (I doubt we'd find anything matching this in the real world ;):

CHRIS@virtual > select *
2 from hotel_rooms_1nf
3 where price_per_night <= 50
4 and hotel_city = 'LONDON'
5 and hotel_star_rating >= 4;

1053 rows selected.

Statistics
-----------------------------------------------------
0 recursive calls
0 db block gets
153 consistent gets
0 physical reads
0 redo size
59064 bytes sent via SQL*Net to client
1685 bytes received via SQL*Net from client
72 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1053 rows processed

CHRIS@virtual >
CHRIS@virtual > select r.*, t.price_per_night
2 from hotel_rooms_3nf r
3 join room_types t
4 on r.room_type = t.room_type
5 join hotels h
6 on h.hotel_id = r.hotel_id
7 where t.price_per_night <= 50
8 and h.hotel_city = 'LONDON'
9 and h.star_rating >= 4;

1053 rows selected.

Statistics
-----------------------------------------------------
0 recursive calls
0 db block gets
116 consistent gets
0 physical reads
0 redo size
29451 bytes sent via SQL*Net to client
1685 bytes received via SQL*Net from client
72 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1053 rows processed

Now the advantage for the third normal form schema is really starting to stack up: 116 vs 153 consistent gets - around 25% less work. Again, we don't have any indexes on either table. Let's add indexes on (HOTEL_CITY, STAR_RATING) and see what effect it has:

create index horo1_city_rating_i on hotel_rooms_1nf (hotel_city, hotel_star_rating);
create index hote_city_rating_i on hotels (hotel_city, star_rating);

CHRIS@virtual > select hotel_room_id, hotel_name, hotel_city, hotel_star_rating,
2 room_number, room_type, price_per_night
3 from hotel_rooms_1nf
4 where price_per_night <= 50
5 and hotel_city = 'LONDON'
6 and hotel_star_rating >= 4;

1053 rows selected.

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
153 consistent gets
0 physical reads
0 redo size
59064 bytes sent via SQL*Net to client
1685 bytes received via SQL*Net from client
72 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1053 rows processed

CHRIS@virtual >
CHRIS@virtual > select r.hotel_room_id, h.hotel_name, h.hotel_city, h.star_rating,
2 r.room_number, r.room_type, t.price_per_night
3 from hotel_rooms_3nf r
4 join room_types t
5 on r.room_type = t.room_type
6 join hotels h
7 on h.hotel_id = r.hotel_id
8 where t.price_per_night <= 50
9 and h.hotel_city = 'LONDON'
10 and h.star_rating >= 4;

1053 rows selected.

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
112 consistent gets
0 physical reads
0 redo size
59058 bytes sent via SQL*Net to client
1685 bytes received via SQL*Net from client
72 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1053 rows processed

Again we've made no impact on our denormalized (1NF) table. We've made a saving of four consistent gets in our 3NF schema though.

What else can we look at? What if we just want to get the names of all the four star hotels in London without any costs?

First thing to notice is that we'll have to do a distinct on the results from the 1NF schema. In our 3NF schema, all those joins that we were so concerned about are gone entirely - we just look at the HOTELS table.

How does this affect performance? Let's see:

CHRIS@virtual > select distinct hotel_name, hotel_city, hotel_star_rating
2 from hotel_rooms_1nf
3 where hotel_city = 'LONDON'
4 and hotel_star_rating >= 4;

27 rows selected.

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
83 consistent gets
0 physical reads
0 redo size
1531 bytes sent via SQL*Net to client
443 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
27 rows processed

CHRIS@virtual >
CHRIS@virtual > select h.hotel_name, h.hotel_city, h.star_rating
2 from hotels h
3 where h.hotel_city = 'LONDON'
4 and h.star_rating >= 4;

27 rows selected.

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
6 consistent gets
0 physical reads
0 redo size
1525 bytes sent via SQL*Net to client
443 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
27 rows processed

That's a slam dunk win for our normalized schema - a mere 6 consistent gets vs 83 for our denormalized table.

Let's recap what we've learned so far:

  • The denormalized schema consumed more storage. As our normalization level lowered, the storage requirements increased. Conversely, the storage for our third normal form schema actually decreased
  • Lookups by primary key were slightly more efficient for the denormalized schema
  • "Search" queries were more efficient for the normalized schema. When searches could be limited to just some of the tables in the schema this advantage was spectacular

So is denormalization for performance a bad idea?

The correct answer is of course "it depends". If the vast majority of your application's queries are primary key lookups (or can be engineered to be so), then the performance gains may be worth it. If you have "complex" search queries this is less likely to be the case and the denormalized schema may perform substantially worse. It'll also depend upon the nature of your data - we've just worked through one example, which will differ from whatever you're working on in many ways. You need to test for yourself.

As Martin Preiss pointed out in the comments of my previous post, the "joins are bad" brigade may have come from a MySQL background. MySQL only supports nested loop joins. The later "search" queries we did used hash joins - when forcing these to use nested loops the 3NF schema performed spectacularly worse (I'll leave proving this as an exercise for the reader ;). So the environment you use may also determine your normalization strategy (or convince you to use a better database! ;)
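If you'd like to try that exercise, one way (a sketch, not from the original post) is to force the join method with a hint and compare the autotrace figures against the hash join version above:

-- Force nested loops for the 3NF "search" query; everything else as before.
select /*+ leading(t) use_nl(r) */ r.*, t.price_per_night
from   hotel_rooms_3nf r
join   room_types t
on     r.room_type = t.room_type
where  t.price_per_night <= 50;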

This is the real point I want to make: denormalizing may help performance in your situation. You need to check this for your schema and environment however, rather than relying on generic arguments such as "joins are bad, therefore we must denormalize"!

The observant among you may have noticed that we've missed an important use case - data modification. How does denormalization affect update performance? I'll discuss this in my next article.

Footnote on my environment for those that want to replicate my results:

All tests were run on Oracle Enterprise Edition, 11.2.0.2. Tables were created with an 8k blocksize. The tests were run in SQL*Plus with arraysize set to 15 and "set autotrace trace" to get the performance figures - the second execution of each query was reported to ensure everything was appropriately parsed, cached, etc.

Mar 28 2014
 

Joins appear (to me) to be getting a lot of bad press recently. In discussions I’ve had and articles I’ve read, many take the position that joins are somehow inherently bad and to be avoided at all costs.

I’ve never been entirely convinced by the “joins are bad” arguments however. Partly this is because there are few concrete cases actually demonstrating the (additional) cost of joins. Instead discussions tend to be hand-wavy arguments around extra CPU cycles used, or something to that effect.

So over the next few posts I’m going to do something a bit different on this blog and discuss joins in more detail. We’ll ask questions like:

  • Does removing joins (denormalizing) really help performance? If so, what are the catches?
  • When are joins bad? What can be done in these cases?

If you’ve got any other questions around joins let me know – I’ll add them to the list and address them in this series.

Mar 24 2014
 

The following update forms the basis of a DIY connection pool. It is intended to allocate an unused session to whoever requests it:

declare

  sess connection_pool.session_name%type;
  
begin
  
  update connection_pool
  set    in_use = 'Y',
         last_used = systimestamp
  where  rowid in (
      select rowid 
      from   (
          select rowid
          from   connection_pool
          where  in_use = 'N'
          order  by last_used
      )
      where  rownum = 1
  )
  returning session_name 
  into      sess;
  
  dbms_output.put_line ( sess );
  
end;
/

There’s a serious flaw with this approach however – can you spot it?

Answers in the comments!

Mar 21 2014
 

Back on the films today, here’s another six to get:

select talk
from   monarch;

select * 
from   (select city_name
        from   cities
        where  state = 'IL'
        order  by population desc)
where  rownum = 1;

select * 
from   nations n
where  not exists (
         select null
         from   people p
         where  status = 'retired'
         and    sex = 'male'
         and    p.location = n.location
);

sqlplus -s young_ovis_aries

select key
from   injury;

select *
from   human_property
where  duration between date '2000-01-01'
                and date '2012-12-31';

Finally, there’s something that links all the above films. Can you get it?

Put your answers in the comments!

Mar 17 2014
 
create table from_12c_onwards (
  you_can_have_your_pk number 
    generated always as identity,
  so_you_dont_have_to_type_it varchar2(10)
);

insert into from_12c_onwards (
  so_you_dont_have_to_type_it
) values (
  'yes'
);

insert into from_12c_onwards (
  so_you_dont_have_to_type_it
) values (
  'Yes'
);

insert into from_12c_onwards (
  so_you_dont_have_to_type_it
) values (
  'YES!'
);

select * from from_12c_onwards;

YOU_CAN_HAVE_YOUR_PK SO_YOU_DON
-------------------- ----------
                   1 yes
                   2 Yes
                   3 YES!

If you’d like more details, there’s a nice article on oracle-base discussing how identity columns work. If you prefer, you can assign a sequence to be a column default instead, as discussed here.
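As a quick sketch of the sequence-as-default alternative (also 12c onwards; the table and sequence names here are made up):

create sequence t_seq;

create table seq_default_example (
  pk_col  number default t_seq.nextval primary key,
  val_col varchar2(10)
);

insert into seq_default_example ( val_col ) values ( 'also yes' );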

Mar 14 2014
 

Somehow I ended up with two rows in my table both with the same primary key value:

select the_pk 
from   dodgy_data;

    THE_PK
----------
         1
         1

The primary key does exist and THE_PK is the only column in it, so we can’t insert another row with THE_PK = 1:

select constraint_name, constraint_type
from   user_constraints
where  table_name = 'DODGY_DATA';

CONSTRAINT_NAME      CONSTRAINT_TYPE
-------------------- ---------------
TAB_PK               P              

select column_name 
from   user_cons_columns
where  constraint_name = 'TAB_PK';

COLUMN_NAME
-----------
THE_PK     

insert into dodgy_data ( the_pk )
values ( 1 );

SQL Error: ORA-00001: 
  unique constraint (CHRIS.TAB_PK) violated

How? Answers in the comments!

Mar 10 2014
 
alter table from_11g add 
  column_with_a varchar2(100) default 
    'is instant provided you set it' not null;

Prior to 11g the above statement could take a very long time if the table was “big”. I was reminded of this by a recent discussion on Oracle-L. Tom Kyte also discusses this in more detail on asktom.

For a quick and dirty demo, keep reading Continue reading »