Apr 042014

In the previous article in the joins series we compared query performance between a third normal form schema and the same schema denormalized to second normal form. We then extended it the example so our denormalized schema was in just first normal form.

The normalized approach performed better overall. The differences were small though – generally just a few consistent gets and all the queries executed in under a second. As Jeff Atwood points out, hardware is powerful enough that for most systems the performance between more normalized and less normalized schemas won’t make much difference to your queries. I disagree slightly with one of his conclusions though – that normalziation “doesn’t matter”.

To see why, let’s look at a different use-case: data modification. While this isn’t to do with joins directly, it is a very important consideration when deciding to denormalize to “remove joins”.

I’m not going to get into the details of the performance updates – there’s a much bigger problem waiting for us. Can you see what it is?

Let’s imagine there’s two users of our hotel application, one managing the hotel details (name, star rating, etc.) and the other managing prices. Using the schemas we built in the previous article, these operations are both single row updates in the 3NF version:

update hotels
set star_rating = 5
where hotel_name = 'rOqKNXFchjvblsjYomFl';

1 rows updated.

update room_types
set price_per_night = 60
where room_type = 'DOUBLE';

1 rows updated.

Just one row updated in each case. Let’s have a look at the equivalent updates in our 1NF schema:

update hotel_rooms_1nf
set hotel_star_rating = 5
where hotel_name = 'rOqKNXFchjvblsjYomFl';

100 rows updated.

update hotel_rooms_1nf
set price_per_night = 60
where room_type = 'DOUBLE';

5,000 rows updated.

First thing to notice is that we’ve updated a lot more rows (5,100 vs just 2). I’m not going to get into the differences in runtime or consistent gets – there’s a bigger problem waiting for us.

What happens if these updates are done concurrently? Let’s update the hotel details in session one and the prices in session two:

--session 1
update hotel_rooms_1nf
set hotel_star_rating = 5
where hotel_name = 'rOqKNXFchjvblsjYomFl';

-- session 2 - this will be stuck waiting
update hotel_rooms_1nf
set price_per_night = 60
where room_type = 'DOUBLE';

We’ve now got concurrency waits! Our second session is blocked waiting for the first to commit. Denormalizing has introduced something which can genuinely slow our application down. If there’s a large number of concurrent updates then this could bring your application grinding to a halt.

It gets worse.

What if, back in session one, we now update the star rating for another hotel?

-- back in session 1
update hotel_rooms_1nf
set hotel_star_rating = 4
where hotel_name = 'rgfKtpWGpAQmjTsTFXIq';

-- in session 2 again:
SQL Error: ORA-00060: deadlock detected while waiting for resource

Deadlock! We’ve now got a bug in our application :(

So what are the ways around this issue? Here’s a few I can think of:

  • Live with it
  • Accept inconsistent data
  • Introduce session locks to prevent concurrent data modification
  • Normalize your schema

Let’s inspect these in more detail:

Live With It

Perhaps the application is insert-only (e.g. it’s a data warehouse) or the rate of (concurrent) updates are very low. In these cases we may be able to get away with the schema as-is and put this down as a “known bug”.

Personally I don’t think this is an acceptable argument for an OLTP application. If users make frequent (concurrent) updates, I doubt they will either.

Accept Inconsistent Data

This ia variation of “live with it”. Rather than do nothing to address the issue the idea here is to “code around” it – e.g. by making all updates only modify a row at a time and commit immediately.

The big issue with this is inconsistent data are visible to clients while an update loop is in progress. Users are likely to get confused if a hotel is showing as 4 star for some rooms but 5 star for others.

Perhaps we can find a way around this e.g. by including a last_updated timestamp and using that to resolve conflicts. This makes the application code complex and won’t necessarily fix the deadlock issue!

This is possibly the worst approach to take.

Introduce Session Locking

Rather than rely on Oracle’s statement level concurrency management, we could roll our own by using lock table, dbms_lock or application level locks. This would take out an application level lock before executing any update against the HOTEL_ROOMS_1NF table.

While this may enable us to avoid the deadlock issue, it means that we won’t have any concurrent updates happening anywhere on the table. At least with “live with it” we’d only be blocking/deadlocking users who’s updates modified the same rows.

This solution will avoid the deadlock problem however. If the rate of concurrent updates is low to non-existent, the increasing level of blocking may be acceptable.

This still complicates our code though. If using an approach other than LOCK TABLE, updates bypassing the application (e.g. release scritps, batch jobs) may still cause us issues as well.

Better than the previous two options. Still not acceptable in my opinion.

Normalize Your Schema

The deadlock situation described above is impossible in the 3NF version of this schema. The updates happen on completely separate tables so we’ve got no concurrency issues either.

You may point out that deadlock is still possible in our 3NF schema (e.g. by having session 1 update hotels then prices while session 2 updates prices then hotels). Fixing this should be a matter of making central APIs that always process updates in the same order – far easier than the solutions proposed in the other cases.

I believe this to be by far the best solution :)

If we’re building the application from scratch then the difference in effort between creating the 1NF and 3NF schemas is trivial.

If this is an existing application with hundreds of thousands of records populated it is a different matter. We’ll have to create a migration script, possibly incurring downtime while the migration is done. Best to avoid this situation and just normalize to start with!

So to recap:

  • Denormalized schemas generally perform (slightly) better when doing primary key lookups
  • Normalized schemas generally perform better for other types of queries – sometimes spectacularly so
  • Normalized schemas generally require less storage
  • Normalized schemas affect far fewer records when updating records
  • Concurrency and deadlock issues that could mean the death of a denormalized schema simply aren’t possible in normalized schemas.

So joins are always good?

Not quite so fast, in the next article we’ll look at some cases where joins are “expensive” and see what we can do about them.

Want to know more about building databases, normalization and designing “for performance”? Enter your email address in the form below to receive the free SQLfail newsletter.

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>