Convert to Third Normal Form (3NF)
Page Navigation: [ Forms of normalization | Additional normalization tips ]Steve Litt is the author of the Universal Troubleshooting Process Courseware, which can be presented either by Steve or by your own trainers.He is also the author of Troubleshooting Techniques of the Successful Technologist, Rapid Learning: Secret Weapon of the Successful Technologist, and Samba Unleashed. Second Normal Form: Each column must depend on the *entire* primary key. As an example, the customer information could be put in the line item table (see above). The trouble with that is that the customer goes with the invoice, not with each line on the invoice. Putting customer information in the line item table will cause redundant data, with it's inherant overhead and difficult modifications. Second form normalization requires you place the customer information in the invoice table (see below). Third Normal Form: Each column must depend on *directly* on the primary key. As an example, the customer address could go in the invoice table (see above), but this would cause data redundancy if several invoices were for the same customer. It would also cause an update nightmare when the customer changes his address, and would require extensive programming to insert the address every time an existing customer gets a new invoice. Third form normalization requires the customer address go in a separate customer table with its own key (customer), with only the customer identifier in the invoice table (see below). [ Troubleshooters.Com | Top of page | Home page | Email Steve Litt ] Additional normalization tips Make a table for each list Do this right away. It will save a fortune in time. Go through the department or enterprise, ferreting out all lists. Document them. Each should be a table if their information is needed, and if practical. Use non-meaningful primary keys If employee numbers starting with C mean the person's stationed in Chicago, and the person moves to Los Angeles, what do you do with his employee number. Making primary keys non-meaningful means changes in environment or business rules can't render them ineffective. [ Troubleshooters.Com | Top of page | Home page | Email Steve Litt ] Copyright (C)1995, 1996 by Steve Litt. See the copyright notice for any restrictions on access and reuse of information provided in the Litt's Tips web pages.
Back to Table of Contents.We ended the last chapter with a simple set of tables and models that is seemingly impervious to invalid data. Through careful unit testing of both the data layer and the models in the application layer, we guaranteed that all references between tables will be valid, and that each individual column can contain only appropriate data.It’s tempting at this point to leave the realm of data modeling and begin writing a front-end for the theatre tickets website. We can imagine additional requirements for even the first version of our site, though, such as saving orders, or knowing how many seats there are for sale in a given auditorium.As secure as the physical layer we put together seems to be, the design itself is constricting. Features such as those just mentioned will be difficult to add in an elegant way. In this chapter, we will refactor the data model so that it is more open to future changes. First, the concept of third normal form (3NF) will be introduced. Applying 3NF will afford us the flexibility to add additional information to pieces of data that are bound currently, such as auditoriums and movie ratings. We will then add additional tables we know we will need relating to ticket orders; doing so presents additional opportunities for refactoring.In database theory, there are numerous normalization patterns, most of which are numbered ordinally: non-first normal form (NF2), first normal form (1NF), second normal form (2NF), and so on. The study and formalization of normalization patterns has a set of overriding themes. The first is to guarantee the correctness of data across a series of insertions, updates, and deletes. The second is to facilitate the process of querying tables. There’s also a positive by-product: the higher the level of normalization, the less repeated data there is in the data model. In addition to saving space, avoiding repeated data prevents data corruption because updating a value only requires changing it in single place, rather than meticulously updating it in many places. All of these properties of normalization tend to make our lives as programmers much easier.In this chapter, we will discuss third normal form, which is a sufficient target normalization to make our data model more open to future changes. The principle behind 3NF is that no columns in a table depend on any non-key columns.Consider the movies table from the previous chapter, shown in Figure 6-1. The id column is the primary key. Technically, this table is 3NF, because no columns depend on any other column that is not the primary key. However, if we decided to add additional information about the rating, we would be in trouble.Figure 6-2 shows the movies table extended to hold a description of the rating. With this schema, as soon as we have more than one movie with the same rating, we begin to have duplicated data in the table. Not only are we wasting space, but we now run the risk of having two movies that are rated R, but have different rating descriptions. If the descriptions happened to be different, how would we know which was the correct one?Figure 6-1. The movies table; 3NF not yet violatedFigure 6-2. The movies table: 3NF is violated by the rating_description columnThe solution is to normalize the table on the rating column by creating a ratings table, and referencing that table with a rating_id column in the movies table, as shown in Figure 6-3. Now there is a single place where the movie description is stored, and we 3NF is satisfied.Figure 6-3. The movies table refactored to 3NF by referencing a ratings tableIn general, good candidates for normalization are columns that fit the following criteria:We apply the same logic to the auditoriums table, which allows us to add a column to store the number of seats available in each auditorium. The resulting 3NF data model is shown in Figure 6-4.Figure 6-4. movie_showtimes table refactored to reference an auditoriums table and avoid 3NF violationNote that we didn’t stop at adding an auditorium_id column to movie_showtimes. We also removed the theatre_id column. That’s because theatre_id is actually functionally dependent on, or bound to, the auditorium_id. We can find the theatre for a showtime by first traversing the auditorium_id relationship to the auditorium table, and then traversing the theatre_id relationship to the theatres table.If we had instead kept the theatre_id column to make the association simpler, as in Figure 6-5, it would be possible to have a movie_showtimes record that referred directly to some theatre A, but also referred indirectly, through its auditorium, to some other theatre B. This would be anomalous and disastrous, so at this stage, we don’t give any more thought to keeping the theatre_id column around.Figure 6-5. The theatre_id column in movie_showtimes is functionality dependent on auditorium_id, and must be removed.But it does seem strange to not be able to tell which theatre a movie plays in within first examining an auditorium record. In fact, it is strange. We’ll come back to this problem in this chapter’s exercises, and again in Chapter 8.In any case, now our schema is worlds more flexible. We can easily extend information about ratings or auditoriums and we only need to worry about updating a single record to do so.Here’s the DDL for these new tables:Because we aren’t done with all of our refactoring yet, we won’t worry about models or tests here. Thinking about the unit tests necessary for these tables and what validations might be necessary are left as exercises for the reader.We could have created a separate ratings table from the start, when we defined our schema in the previous chapter, but we didn’t know then that we might want to store extra information about ratings. Stemming violations of 3NF is something we’d like to do before the violation occurs. The trick is to notice bound data—the rating was literally bound to the movie—early in the design phase of the data layer. Once lots of code is written atop a data model, making changes can be painful. Worse, if your production application has collected lots of data and you can’t start from scratch, the effort and time required to re-arrange data within a live database can be prohibitive. Spotting columns that are likely to need fully fledged tables in the next iteration of feature sets is a skill that requires experience as well as intuition about the problem at hand.Often, novice data modelers – even those who know about third normal form – skip these sorts of normalizations because the immediate benefits are not apparent. The short-term gain of not having to define a new table, model class, and unit tests seems irresistible. Skipping normalization for a quick gain is highly discouraged, though. While it is absolutely true that it takes more effort up front in order to achieve a defensively normalized schema, the time required to normalize after the fact can easily take 10 times the effort of doing so right away. At that point, it’s often tempting to pick a bad solution, such as adding a seats_available column to movie_showtimes rather than to an auditoriums table, where it belongs.Once a schema goes down a sloppy path, it eventually—and sooner than you might think—becomes unusable. The application layer becomes riddled with bugs because the data layer is too permissive. For example, if the application layer must always update the seats_available column in movie_showtimes, code can creep in that sets it incorrectly: perhaps to zero, or some random value. Or, possibly worse—those doing data entry might be expected to enter the value every time they add a new showtime. Suddenly the business is experiencing massive problems because the system oversold every show, or perhaps refused to sell any tickets at all.Since we know our website’s purpose is to sell movie tickets, it makes sense to add tables to hold sales data before we begin coding up the frontend. Figure 6-5 shows our first stab at adding tables to store orders and ticket purchases. The orders table holds each transaction, including the purchaser’s name, address, and credit card information. The purchased_tickets table is a line item table for each ticket purchased in the transaction. Rather than simply providing a column in the orders table to record the number of tickets purchased, we split the tables up, foreseeing the need to account for tickets at different prices, such as student or senior discounted tickets, matinees, and so on.In adding these additional tables, we find a glaring violation of the Don’t Repeat Yourself (DRY) principle. The violation of DRY occurs in the repetition of address data between the theatres table and our new orders table. If we kept the schema as is, with the data duplicated, we would also end up duplicating a lot of other code, including validation code and display logic for addresses.Figure 6-6. Theatre tickets data model with orders and purchased_tickets table, but with address schema repeated in two tablesInstead of duplicating this structure, we create a separate addresses table. If mimicking the previous examples, we would create address_id columns in the theatres and orders tables, and refer to individual addresses that are inserted in the addresses table.We won’t do that here, though. This data is not in violation of 3NF. It is dependent only on the primary key of the table it is in, not some other column in the table as in our other examples. The problem we are trying to solve is of repeated code, not repeated data. In code, we can solve this sort of problem with class inheritance (or in Ruby, with mixins). The same can be accomplished at the data layer with Postgres’s table inheritance mechanism. We’ll use this feature to solve our current problem. First, we create the addresses table, without a primary key – the tables that implement addresses functionality already have their own.The addresses table does not correspond directly to a Rails model class. Instead, its columns are sucked into tables that need it using the inherits keyword:Because Postgres allows multiple inheritance, we want to allow for the same flexibility within Rails. Ruby provides for multiple inheritance through the use of mixins, and Rails takes this one step further with plugins. Below is our plugin for address support, based on the template from Chapter 3. Example 6-1 shows the file acts_as_address.rb, in the plugin’s lib/ directory.Example 6-1. acts_as_address.rb, from our acts_as_address pluginExample 6-2 shows the init.rb file in the plugin’s top-level directory.Example 6-2. The init.rb file for our acts_as_address pluginNow we can simply say, within the theatre and order classes:Our new schema diagram, shown in Figure 6-7, is much cleaner now that we have normalized the repetitive address information into its own table. Each table that inherits addresses is much easier to understand as well. In addition, our models that implement the Addresses module stay clean. In both the database and our application code, if we need to make changes to the way addresses work, we need only do it in one place.Figure 6-7. Theatre tickets data model with address information refactored at the schema level, using database inheritanceAre there more opportunities for normalization here? Before reading on, examine the data model and brainstorm about the business. Put yourself in the shoes of the developer of this website. What are features you might want that would require dependent data?Drawing from my own experience, I would say that there has never been a time where a zip code was entered in a database that the site in question would not have benefited from knowing additional information about the zip code. For example, in our application, what theatres within 25 miles of 02139 are playing the latest Harry Potter? Certainly, we can solve this by adding latitude and longitude columns to the theatres table, but we risk duplicating data (and therefore creating conflicting data) if we have multiple theatres in the same area. Further, if we ever want to validate user input, such as verifying that a city, state, zip code combination is valid, we will need an auxiliary table of zip code data.We will leave the refactoring of the zip code table for the next chapter.Figure 6-8. A table not in third normal formFigure 6-9. The color and dependent data from Figure 6-7 extracted to its own tableWe can create the table shown above with the following SQL statement. Note that it is up to you to account for any data inconsistencies caused by improper normalization:Figure 6-10. The table from Figure 6-8, with a foreign key reference to the table in Figure 8