Apr 07 2014
 

So far in the joins series we’ve looked at the effect that removing joins (via denormalization) has on performance. We’ve seen that joins can cause primary key lookups to do more work, and that lowering the normalization level to remove them can negatively impact “search” style queries. More importantly, we’ve seen that the real cost of denormalizing to remove joins comes when updating records, potentially leading to concurrency waits and application bugs.

So are joins always “good”?

The fastest way to do anything is to not do it at all. If joins aren’t necessary to answer your queries, including them will add some overhead. And, like any tool, there are situations where adding a join may substantially slow your query down.
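As a quick (hypothetical) illustration of the first point, assume ORDERS.CUSTOMER_ID is a not-null foreign key to CUSTOMERS; the table and column names here are made up for this sketch. If we select nothing from CUSTOMERS, the join below only adds work:

-- orders.customer_id is a not-null foreign key to customers,
-- so this join can't change which rows come back - it's pure overhead
select o.order_id, o.order_date
from   orders o
join   customers c
on     c.customer_id = o.customer_id;

-- the same rows, without the join
select o.order_id, o.order_date
from   orders o;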

Here are some examples of where joins may be “expensive”, along with strategies for coping with them.

Missing Indexes

If there are no indexes on the column(s) joined in a query, the join is going to lead to a full table scan of at least one of the tables. If both of these tables are “large”, this is going to be expensive.

I’m not going to go into much more detail, as it should be fairly obvious that a full table scan of a million-row-plus table is going to take longer than an index lookup (assuming the join is expected to fetch just a few rows).

Fortunately it’s easily fixed – create those indexes!
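For example, sticking with the hypothetical ORDERS/CUSTOMERS tables above, indexing the foreign key column used in the join is usually the first step (the primary key on CUSTOMERS will already be backed by an index):

-- index the join column on the "many" side of the relationship
create index orde_customer_id_i on orders (customer_id);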

Multi-column Joins

Often joins involve two or more columns. If there’s a correlation between the values in the join columns (e.g. birthday and star sign), the optimizer can get the estimates horribly wrong.

For a stark example, I’ll create two tables. These will store the values 1-100 one hundred times each. For each record in each table, both columns will have the same number:

create table tab1 (
  col1   integer,
  col2   integer,
  filler varchar2(100)
);

create table tab2 (
  col3   integer,
  col4   integer,
  filler varchar2(100)
);

-- each of the values 1-100 appears 100 times; col1 always equals col2
insert into tab1
with rws as (select rownum r from dual connect by level <= 100)
select r1.r, r1.r, dbms_random.string('a', 20)
from   rws r1, rws r2;

-- same data shape for the second table; col3 always equals col4
insert into tab2
with rws as (select rownum r from dual connect by level <= 100)
select r1.r, r1.r, dbms_random.string('a', 20)
from   rws r1, rws r2;

create index i1 on tab1 (col1, col2);
create index i2 on tab2 (col3, col4);

select col1, col2 from tab1;

COL1 COL2
---- ----
1 1
1 1
1 1
...
100 100
100 100

So if we now join TAB1 to TAB2 on COL1 = COL3 and COL2 = COL4, we'll get 100 rows when restricting to a particular value (e.g. COL1 = 1).

What does the optimizer think though?

explain plan for
select *
from tab1
join tab2
on col1 = col3
and col2 = col4
where col1 = 1;

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-----------------------------------------------------
| Id  | Operation                    | Name | Rows  |
-----------------------------------------------------
|   0 | SELECT STATEMENT             |      |   156 |
|   1 |  HASH JOIN                   |      |   156 |
|   2 |   TABLE ACCESS BY INDEX ROWID| TAB1 |   100 |
|   3 |    INDEX RANGE SCAN          | I1   |   100 |
|   4 |   TABLE ACCESS BY INDEX ROWID| TAB2 |   100 |
|   5 |    INDEX RANGE SCAN          | I2   |   100 |
-----------------------------------------------------

156 rows, reasonably close. It's a big enough margin of error that a sub-optimal plan could be chosen in some circumstances though.

What if we add another condition, such that our query won't return any rows?

We know that the values in COL1 are always the same as those in COL2, so if we add a predicate stating that COL2 = 2 to the query above, no rows will be returned.

What does the optimizer think?

explain plan for
select *
from tab1
join tab2
on col1 = col3
and col2 = col4
where col1 = 1
and col2 = 2;

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-----------------------------------------------------
| Id  | Operation                    | Name | Rows  |
-----------------------------------------------------
|   0 | SELECT STATEMENT             |      | 10000 |
|   1 |  HASH JOIN                   |      | 10000 |
|   2 |   TABLE ACCESS BY INDEX ROWID| TAB1 |   100 |
|   3 |    INDEX RANGE SCAN          | I1   |   100 |
|   4 |   TABLE ACCESS BY INDEX ROWID| TAB2 |   100 |
|   5 |    INDEX RANGE SCAN          | I2   |   100 |
-----------------------------------------------------

Yikes! An estimate of 10,000 rows for a query that won't return anything! This could lead to some unnecessary full table scans.

Fortunately, as of 11g, there's a way around this. We can create multi-column stats, allowing Oracle to identify the correlation between the values in the columns.

Let's create these and see what effect it has on the plan estimates:

exec dbms_stats.gather_table_stats(user, 'tab1', method_opt => 'for columns (col1, col2)');
exec dbms_stats.gather_table_stats(user, 'tab2', method_opt => 'for columns (col3, col4)');
exec dbms_stats.gather_table_stats(user, 'tab1');
exec dbms_stats.gather_table_stats(user, 'tab2');
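As a side note, an equivalent approach (a sketch to use instead of the method_opt calls above, not in addition to them) is to create the column groups explicitly with dbms_stats.create_extended_stats and then gather stats as normal:

-- returns the system-generated name of each extension
select dbms_stats.create_extended_stats(user, 'TAB1', '(col1,col2)') from dual;
select dbms_stats.create_extended_stats(user, 'TAB2', '(col3,col4)') from dual;

exec dbms_stats.gather_table_stats(user, 'tab1');
exec dbms_stats.gather_table_stats(user, 'tab2');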

explain plan for
select *
from tab1
join tab2
on col1 = col3
and col2 = col4
where col1 = 1;

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-----------------------------------------------------
| Id  | Operation                    | Name | Rows  |
-----------------------------------------------------
|   0 | SELECT STATEMENT             |      |   100 |
|   1 |  HASH JOIN                   |      |   100 |
|   2 |   TABLE ACCESS BY INDEX ROWID| TAB1 |   100 |
|   3 |    INDEX RANGE SCAN          | I1   |   100 |
|   4 |   TABLE ACCESS BY INDEX ROWID| TAB2 |   100 |
|   5 |    INDEX RANGE SCAN          | I2   |   100 |
-----------------------------------------------------

explain plan for
select *
from tab1
join tab2
on col1 = col3
and col2 = col4
where col1 = 1
and col2 = 2;

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

------------------------------------------------------
| Id  | Operation                     | Name | Rows  |
------------------------------------------------------
|   0 | SELECT STATEMENT              |      |     1 |
|   1 |  MERGE JOIN CARTESIAN         |      |     1 |
|   2 |   TABLE ACCESS BY INDEX ROWID | TAB1 |     1 |
|   3 |    INDEX RANGE SCAN           | I1   |     1 |
|   4 |   BUFFER SORT                 |      |     1 |
|   5 |    TABLE ACCESS BY INDEX ROWID| TAB2 |     1 |
|   6 |     INDEX RANGE SCAN          | I2   |     1 |
------------------------------------------------------

A big improvement - the optimizer is bang on for the first query (100 rows) and only off-by-one for the second. This simple change could make a big difference to the execution of some joins.
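If you want to check that the column groups were actually created, you can query the standard data dictionary view USER_STAT_EXTENSIONS (not shown in the original steps, but handy for verification):

select table_name, extension_name, extension
from   user_stat_extensions
where  table_name in ('TAB1', 'TAB2');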

For further reading on multi-column stats, have a read of Maria Colgan's article on extended stats.

Inexact Joins

Often tables are created with effective from/to dates to enable users to view data as of a point in time in the past (e.g. slowly changing dimensions, history tables).

It can be necessary to join another table holding a timestamp (T1) to such a point-in-time table (T2) by checking that the timestamp (in T1) falls between the effective dates (in T2). Unfortunately, in this case the optimizer can get the cardinality estimates horribly wrong.

Let's look at an example. TAB1 has a record for each day for 500 days. TAB2 is a related table with effective from/to dates covering the same 500 days. Records in TAB2 span a duration of 10 days, giving 50 rows:

drop table tab1;
drop table tab2;
create table tab1 (
  t1_id      integer not null,
  time_stamp date    not null
);

create table tab2 (
  t2_id     integer not null,
  from_date date    not null,
  to_date   date    not null
);

-- one row per day for 500 days, starting 2013-01-01
insert into tab1
select rownum,
       date'2013-01-01' + rownum - 1
from   dual
connect by level <= 500;

-- 50 ten-day effective date ranges covering the same 500 days
insert into tab2
select rownum,
       date'2013-01-01' + (rownum-1)*10,
       date'2013-01-01' + rownum*10
from   dual
connect by level <= 50;

commit;

exec dbms_stats.gather_table_stats(user, 'tab1');
exec dbms_stats.gather_table_stats(user, 'tab2');

If we want to find the record in TAB2 effective on a particular day and all the associated records in TAB1, we can put together a query like this:

select *
from tab1 t1
join tab2 t2
on t2.from_date <= t1.time_stamp
and t2.to_date > t1.time_stamp
where t2.from_date <= date'2014-01-01'
and t2.to_date > date'2014-01-01';

This returns 10 rows as expected. What does the optimizer think?

explain plan for
select *
from tab1 t1
join tab2 t2
on t2.from_date <= t1.time_stamp
and t2.to_date > t1.time_stamp
where t2.from_date <= date'2014-01-01'
and t2.to_date > date'2014-01-01';

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-----------------------------------------------
| Id  | Operation             | Name | Rows |
-----------------------------------------------
|   0 | SELECT STATEMENT      |      | 2893 |
|   1 |  MERGE JOIN           |      | 2893 |
|   2 |   SORT JOIN           |      |   11 |
|   3 |    TABLE ACCESS FULL  | TAB2 |   11 |
|   4 |   FILTER              |      |      |
|   5 |    SORT JOIN          |      |  500 |
|   6 |     TABLE ACCESS FULL | TAB1 |  500 |
-----------------------------------------------

2,893 rows, ouch! Unfortunately, multi-column stats on (from_date, to_date) don't help us here:

exec dbms_stats.gather_table_stats(user, 'tab2', method_opt => 'for columns (from_date, to_date)');
exec dbms_stats.gather_table_stats(user, 'tab2');

explain plan for
select *
from tab1 t1
join tab2 t2
on t2.from_date <= t1.time_stamp
and t2.to_date > t1.time_stamp
where t2.from_date <= date'2014-01-01'
and t2.to_date > date'2014-01-01';

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-----------------------------------------------
| Id  | Operation             | Name | Rows |
-----------------------------------------------
|   0 | SELECT STATEMENT      |      | 2893 |
|   1 |  MERGE JOIN           |      | 2893 |
|   2 |   SORT JOIN           |      |   11 |
|   3 |    TABLE ACCESS FULL  | TAB2 |   11 |
|   4 |   FILTER              |      |      |
|   5 |    SORT JOIN          |      |  500 |
|   6 |     TABLE ACCESS FULL | TAB1 |  500 |
-----------------------------------------------

The best way to resolve this is to convert the join to use equalities rather than inequalities. As there's a one-to-many link from TAB2 to TAB1, we could add the associated TAB2 id to TAB1:

alter table tab1 add (t2_id integer);

-- point each TAB1 row at the TAB2 range its timestamp falls within
update tab1
set    t2_id = (select t2_id
                from   tab2
                where  time_stamp >= from_date
                and    time_stamp <  to_date);

commit;

exec dbms_stats.gather_table_stats(user, 'tab1');

explain plan for
select *
from tab1 t1
join tab2 t2
on t1.t2_id = t2.t2_id
where t2.from_date <= date'2014-01-01'
and t2.to_date > date'2014-01-01';

select *
from table(dbms_xplan.display(null, null, 'BASIC +ROWS'));

-------------------------------------------
| Id  | Operation           | Name | Rows |
-------------------------------------------
|   0 | SELECT STATEMENT    |      |  105 |
|   1 |  HASH JOIN          |      |  105 |
|   2 |   TABLE ACCESS FULL | TAB2 |   11 |
|   3 |   TABLE ACCESS FULL | TAB1 |  500 |
-------------------------------------------

The cardinality estimate is still out by a factor of 10 (105 estimated vs. 10 actual), but it's significantly better than our original estimate of 2,893. This could be enough of a difference to enable the optimizer to pick a better execution plan.
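If you want to compare estimated and actual row counts for a real execution (a standard technique, not something specific to this example), one option is the gather_plan_statistics hint combined with dbms_xplan.display_cursor; the E-Rows and A-Rows columns then show estimated and actual cardinalities side by side:

select /*+ gather_plan_statistics */ *
from   tab1 t1
join   tab2 t2
on     t1.t2_id = t2.t2_id
where  t2.from_date <= date'2014-01-01'
and    t2.to_date > date'2014-01-01';

select *
from   table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));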

Conclusion

We've looked at some examples where joins could be "bad". These have all been caused by poor design or deficiencies in the optimizer, leading to sub-optimal execution plans being chosen.

By creating appropriate indexes, giving the optimizer the best information available, and making joins use equalities (=) rather than inequalities (<, >=, etc.), we can overcome many of the issues that cause poor join performance.

These are just a few examples of limitations of joins - if you have any others then please share them in the comments!
