Database Normalization - 1NF & 2NF#

Objective#

This lecture aims to introduce the concepts of the First Normal Form (1NF) and the Second Normal Form (2NF) in database normalization. By understanding these foundational concepts, students will learn to design databases that reduce redundancy and enhance data integrity.

Understanding and applying the principles of 1NF and 2NF is crucial in designing efficient, reliable databases. By ensuring that each table adheres to these normalization forms, you can avoid common data integrity issues and build a solid foundation for your database systems.


Database Normalization#

What is Database Normalization?

  • Definition: A systematic approach of organizing data in a database to reduce redundancy and improve data integrity.

    • Purpose: Ensures that the database is free of unwanted anomalies, makes the database more efficient to query, and enhances the consistency of the dataset.

Why Normalize?

  • To avoid duplicate data.

  • To reduce the complexity of the database.

  • To ensure relationships between tables are consistent.

  • To make it easier to maintain and update the database.


First Normal Form (1NF)#

Definition of 1NF:

  • A table is in the First Normal Form if:

    1. All table cells contain atomic (indivisible) values.

    • See Kunal’s slides on ACID principles.

    1. Each column or attribute has a unique name.

    2. The order in which data is stored does not affect the database’s integrity.

    3. Each column contains values of a single type.

Why is 1NF Important?

  • Ensures that each piece of data is accessible and queryable.

  • Prevents multiple values from being stored in a single cell, reducing complexity.

Example:#

  • Non-1NF Table: A Contacts table where the Phone column contains multiple phone numbers for some contacts.

ContactID

Name

Phone

1

John Doe

555-1234, 555-5678

2

Jane Smith

555-8765

  • 1NF Table: The same Contacts table modified to adhere to 1NF by introducing a new table for phone numbers.

    Contacts Table

ContactID

Name

1

John Doe

2

Jane Smith

Phones Table

ContactID

Phone

1

555-1234

1

555-5678

2

555-8765


Second Normal Form (2NF)#

Definition of 2NF:

  • A table is in the Second Normal Form if:

    1. It is in the First Normal Form.

    2. All non-key attributes are fully functionally dependent on the primary key (i.e., no partial dependency).

Understanding Functional Dependency:

  • A column Y is functionally dependent on a column X if each X value is associated with precisely one Y value.

Partial Dependency:

  • Occurs when a column is functionally dependent on part of a composite primary key.

Why is 2NF Important?

  • Eliminates redundancy that occurs when data is stored in tables with composite primary keys but does not require all parts of the key.

  • Ensures that each piece of information is stored only once, reducing the potential for inconsistencies.

Example:#

  • Pre-2NF Table: An OrderDetails table with a composite primary key (OrderID, ProductID) where the ProductPrice is partially dependent on ProductID.

OrderID

ProductID

ProductPrice

Quantity

1

A

10

2

1

B

15

1

2

A

10

1

  • 2NF Tables: Splitting the OrderDetails table into two to remove partial dependencies.

    Orders Table

OrderID

ProductID

Quantity

1

A

2

1

B

1

2

A

1

Products Table

ProductID

ProductPrice

A

10

B

15


Recommendations and Best Practices for Spatial Data#

Focusing primarily on spatial data involves understanding the unique challenges and considerations that come with managing geospatial information in databases, especially when applying normalization principles.

Here’s a detailed look into how these recommendations apply to spatial data:

1. Start with a Thorough Data Analysis#

  • Understand Spatial Relationships: Analyze how spatial entities relate to each other. For instance, are there one-to-many relationships between geographic features, like multiple buildings within a single city block? Understanding these spatial relationships is crucial for designing your database schema.

    • Initial analysis can take place in QGIS, PostgreSQL (with the PostGIS extension), or through a GIS analytical framework of your choosing.

  • Identify Spatial Dependencies: Determine if spatial attributes depend on primary keys or if they exhibit partial dependencies. For example, a Landmarks table might have a Location column that’s functionally dependent on the landmark ID.

  • Consider Spatial Data Types: Be familiar with spatial data types provided by PostGIS, such as GEOMETRY and GEOGRAPHY, and understand when to use them based on your data’s requirements.

2. Use Normalization as Part of Your Database Design#

  • Apply Spatial Indexes Wisely: While normalization is about the logical structure of your database, it’s also important to consider physical aspects like indexing.

    • Considering which tables and columns will benefit most from them.

  • Normalization and Spatial Queries: Consider how normalization will affect your spatial queries. Sometimes, denormalizing spatial data slightly (e.g., duplicating some spatial columns in another table for faster access) can improve query performance for specific, frequently used operations.

3. Remember that Over-normalization Can Lead to Excessive Complexity#

  • Balancing Normalization and Performance: Spatial queries can be computationally intensive. Over-normalization can lead to multiple joins between tables, which might degrade performance for complex spatial queries. It is often a good practice to balance between normalization for data integrity and denormalization for query performance, especially in spatial databases where queries might involve calculations over large datasets.

  • Use Materialized Views for Complex Spatial Data: When dealing with complex spatial data and queries that span multiple normalized tables, consider using materialized views to store pre-computed joins or spatial operations. This can improve read performance at the cost of additional storage and the need to refresh the view after updates.

4. Additional Considerations for Spatial Data#

  • Data Integrity and Spatial Features: Use constraints and triggers to maintain data integrity, especially for spatial data. For example, triggers can automatically update spatial columns (like calculating the centroid of a polygon) when related non-spatial information changes.

  • Storage Considerations: Spatial data, especially when dealing with high-resolution geometries or large geographic areas, can be large. Consider the storage implications of your normalization decisions, particularly for how spatial data is stored and retrieved.

Practical Example:#

Consider a database that stores information about cities and their points of interest (POIs). A simple approach might store everything in a single table, including city names, POI names, and the POIs’ geographic locations.

Normalization would involve:

  1. Creating a separate Cities table with city names and identifiers.

  2. Creating a POIs table with references to the Cities table, POI names, and spatial data representing their locations.

  3. Ensuring that spatial queries are optimized by applying spatial indexes on the POIs table’s location column.

This structure reduces redundancy (e.g., city names repeated for each POI) and improves data integrity while allowing efficient spatial queries, such as finding all POIs within a specific city.


Exercise 1: Normalize to First Normal Form (1NF)#

Scenario#

You’re given a table CustomerOrders that tracks customer orders for products. However, the table violates the First Normal Form (1NF) because it contains repeating groups in the ProductIDs and Quantities columns.

CustomerOrders Table (Non-1NF):

OrderID

CustomerName

ProductIDs

Quantities

1

John Doe

101, 102

2, 1

2

Jane Smith

103

3

3

Mike Brown

101, 104, 105

1, 2, 1

Objective#

Normalize the CustomerOrders table to 1NF by ensuring that each row/column intersection contains only a single value.

Steps and SQL Code:#

  1. Create Two Separate Tables:

    • One for Orders to track each order.

    • Another for OrderDetails to track each product within an order.

Step 1: Create the Orders Table

CREATE TABLE Orders (
    OrderID SERIAL PRIMARY KEY,
    CustomerName VARCHAR(255)
);

Step 2: Create the OrderDetails Table

CREATE TABLE OrderDetails (
    DetailID SERIAL PRIMARY KEY,
    OrderID INT,
    ProductID INT,
    Quantity INT,
    FOREIGN KEY (OrderID) REFERENCES Orders(OrderID)
);
  1. Insert Data Into the Orders Table:

INSERT INTO Orders (OrderID, CustomerName) VALUES
(1, 'John Doe'),
(2, 'Jane Smith'),
(3, 'Mike Brown');
  1. Insert Data Into the OrderDetails Table:

For this step, you’ll manually split the ProductIDs and Quantities into separate rows. In a real-world scenario, this might require a script or manual data manipulation based on the original data format.

INSERT INTO OrderDetails (OrderID, ProductID, Quantity) VALUES
(1, 101, 2),
(1, 102, 1),
(2, 103, 3),
(3, 101, 1),
(3, 104, 2),
(3, 105, 1);

Result:#

Now, the data is in 1NF. Each row in OrderDetails represents a single product order, eliminating the repeating groups.


Exercise 2: Normalize to Second Normal Form (2NF)#

Scenario#

The OrderDetails table from Exercise 1 is in 1NF but not in 2NF because the Quantity field is only dependent on ProductID, not on the composite key (OrderID, ProductID).

Objective#

Normalize the data structure to 2NF by ensuring all non-key attributes are fully functionally dependent on the primary key.

Steps and SQL Code:#

  1. Identify Partial Dependencies: The partial dependency is that Quantity depends on ProductID but not on the composite key. However, this scenario might be a misunderstanding of 2NF principles, as Quantity actually depends on both OrderID and ProductID (how many of a product were ordered in each specific order). A more accurate example of partial dependency would involve product information like ProductName or ProductPrice depending on ProductID alone.

  2. Create a Separate Table for Products: Assuming a more fitting scenario for 2NF, where ProductPrice depends only on ProductID, not on the composite key of an order detail.

CREATE TABLE Products (
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(255),
    ProductPrice DECIMAL(10, 2)
);
  1. Remove the Partial Dependency from OrderDetails:

    • Adjust OrderDetails to reference Products.

    • Assume OrderDetails already exists without ProductPrice.

Step 1: Populate the Products Table

INSERT INTO Products (ProductID, ProductName, ProductPrice) VALUES
(101, 'Product A', 10.00),
(102, 'Product B', 15.50),
(103, 'Product C', 7.25),
(104, 'Product D', 12.00),
(105, 'Product E', 5.75);

Step 2: Adjust OrderDetails (if needed)

The OrderDetails table should reference Products without storing product-specific information like price, which eliminates the partial dependency.

-- Assuming we need to adjust the schema (if it previously included ProductPrice, for instance

ALTER TABLE OrderDetails DROP COLUMN ProductPrice; -- Only if ProductPrice was previously included.

Result:#

By ensuring that all attributes in OrderDetails are fully functionally dependent on the primary key and by separating product-specific information into its own table, the database schema is now in 2NF. Each non-key attribute in OrderDetails is fully functionally dependent on the primary key (DetailID), and product information is correctly normalized into the Products table.