Database Normalization - 1NF & 2NF#
Objective#
This lecture aims to introduce the concepts of the First Normal Form (1NF) and the Second Normal Form (2NF) in database normalization. By understanding these foundational concepts, students will learn to design databases that reduce redundancy and enhance data integrity.
Understanding and applying the principles of 1NF and 2NF is crucial in designing efficient, reliable databases. By ensuring that each table adheres to these normalization forms, you can avoid common data integrity issues and build a solid foundation for your database systems.
Database Normalization#
What is Database Normalization?
Definition: A systematic approach of organizing data in a database to reduce redundancy and improve data integrity.
Purpose: Ensures that the database is free of unwanted anomalies, makes the database more efficient to query, and enhances the consistency of the dataset.
Why Normalize?
To avoid duplicate data.
To reduce the complexity of the database.
To ensure relationships between tables are consistent.
To make it easier to maintain and update the database.
First Normal Form (1NF)#
Definition of 1NF:
A table is in the First Normal Form if:
All table cells contain atomic (indivisible) values.
See Kunal’s slides on ACID principles.
Each column or attribute has a unique name.
The order in which data is stored does not affect the database’s integrity.
Each column contains values of a single type.
Why is 1NF Important?
Ensures that each piece of data is accessible and queryable.
Prevents multiple values from being stored in a single cell, reducing complexity.
Example:#
Non-1NF Table: A
Contacts
table where thePhone
column contains multiple phone numbers for some contacts.
ContactID |
Name |
Phone |
---|---|---|
1 |
John Doe |
555-1234, 555-5678 |
2 |
Jane Smith |
555-8765 |
1NF Table: The same
Contacts
table modified to adhere to 1NF by introducing a new table for phone numbers.Contacts Table
ContactID |
Name |
---|---|
1 |
John Doe |
2 |
Jane Smith |
Phones Table
ContactID |
Phone |
---|---|
1 |
555-1234 |
1 |
555-5678 |
2 |
555-8765 |
Second Normal Form (2NF)#
Definition of 2NF:
A table is in the Second Normal Form if:
It is in the First Normal Form.
All non-key attributes are fully functionally dependent on the primary key (i.e., no partial dependency).
Understanding Functional Dependency:
A column Y is functionally dependent on a column X if each X value is associated with precisely one Y value.
Partial Dependency:
Occurs when a column is functionally dependent on part of a composite primary key.
Why is 2NF Important?
Eliminates redundancy that occurs when data is stored in tables with composite primary keys but does not require all parts of the key.
Ensures that each piece of information is stored only once, reducing the potential for inconsistencies.
Example:#
Pre-2NF Table: An
OrderDetails
table with a composite primary key (OrderID
,ProductID
) where theProductPrice
is partially dependent onProductID
.
OrderID |
ProductID |
ProductPrice |
Quantity |
---|---|---|---|
1 |
A |
10 |
2 |
1 |
B |
15 |
1 |
2 |
A |
10 |
1 |
2NF Tables: Splitting the
OrderDetails
table into two to remove partial dependencies.Orders Table
OrderID |
ProductID |
Quantity |
---|---|---|
1 |
A |
2 |
1 |
B |
1 |
2 |
A |
1 |
Products Table
ProductID |
ProductPrice |
---|---|
A |
10 |
B |
15 |
Recommendations and Best Practices for Spatial Data#
Focusing primarily on spatial data involves understanding the unique challenges and considerations that come with managing geospatial information in databases, especially when applying normalization principles.
Here’s a detailed look into how these recommendations apply to spatial data:
1. Start with a Thorough Data Analysis#
Understand Spatial Relationships: Analyze how spatial entities relate to each other. For instance, are there one-to-many relationships between geographic features, like multiple buildings within a single city block? Understanding these spatial relationships is crucial for designing your database schema.
Initial analysis can take place in QGIS, PostgreSQL (with the PostGIS extension), or through a GIS analytical framework of your choosing.
Identify Spatial Dependencies: Determine if spatial attributes depend on primary keys or if they exhibit partial dependencies. For example, a
Landmarks
table might have aLocation
column that’s functionally dependent on the landmark ID.Consider Spatial Data Types: Be familiar with spatial data types provided by PostGIS, such as
GEOMETRY
andGEOGRAPHY
, and understand when to use them based on your data’s requirements.
2. Use Normalization as Part of Your Database Design#
Apply Spatial Indexes Wisely: While normalization is about the logical structure of your database, it’s also important to consider physical aspects like indexing.
Considering which tables and columns will benefit most from them.
Normalization and Spatial Queries: Consider how normalization will affect your spatial queries. Sometimes, denormalizing spatial data slightly (e.g., duplicating some spatial columns in another table for faster access) can improve query performance for specific, frequently used operations.
3. Remember that Over-normalization Can Lead to Excessive Complexity#
Balancing Normalization and Performance: Spatial queries can be computationally intensive. Over-normalization can lead to multiple joins between tables, which might degrade performance for complex spatial queries. It is often a good practice to balance between normalization for data integrity and denormalization for query performance, especially in spatial databases where queries might involve calculations over large datasets.
Use Materialized Views for Complex Spatial Data: When dealing with complex spatial data and queries that span multiple normalized tables, consider using materialized views to store pre-computed joins or spatial operations. This can improve read performance at the cost of additional storage and the need to refresh the view after updates.
4. Additional Considerations for Spatial Data#
Data Integrity and Spatial Features: Use constraints and triggers to maintain data integrity, especially for spatial data. For example, triggers can automatically update spatial columns (like calculating the centroid of a polygon) when related non-spatial information changes.
Storage Considerations: Spatial data, especially when dealing with high-resolution geometries or large geographic areas, can be large. Consider the storage implications of your normalization decisions, particularly for how spatial data is stored and retrieved.
Practical Example:#
Consider a database that stores information about cities and their points of interest (POIs). A simple approach might store everything in a single table, including city names, POI names, and the POIs’ geographic locations.
Normalization would involve:
Creating a separate
Cities
table with city names and identifiers.Creating a
POIs
table with references to theCities
table, POI names, and spatial data representing their locations.Ensuring that spatial queries are optimized by applying spatial indexes on the
POIs
table’s location column.
This structure reduces redundancy (e.g., city names repeated for each POI) and improves data integrity while allowing efficient spatial queries, such as finding all POIs within a specific city.
Exercise 1: Normalize to First Normal Form (1NF)#
Scenario#
You’re given a table CustomerOrders
that tracks customer orders for products. However, the table violates the First Normal Form (1NF) because it contains repeating groups in the ProductIDs
and Quantities
columns.
CustomerOrders
Table (Non-1NF):
OrderID |
CustomerName |
ProductIDs |
Quantities |
---|---|---|---|
1 |
John Doe |
101, 102 |
2, 1 |
2 |
Jane Smith |
103 |
3 |
3 |
Mike Brown |
101, 104, 105 |
1, 2, 1 |
Objective#
Normalize the CustomerOrders
table to 1NF by ensuring that each row/column intersection contains only a single value.
Steps and SQL Code:#
Create Two Separate Tables:
One for
Orders
to track each order.Another for
OrderDetails
to track each product within an order.
Step 1: Create the Orders
Table
CREATE TABLE Orders (
OrderID SERIAL PRIMARY KEY,
CustomerName VARCHAR(255)
);
Step 2: Create the OrderDetails
Table
CREATE TABLE OrderDetails (
DetailID SERIAL PRIMARY KEY,
OrderID INT,
ProductID INT,
Quantity INT,
FOREIGN KEY (OrderID) REFERENCES Orders(OrderID)
);
Insert Data Into the
Orders
Table:
INSERT INTO Orders (OrderID, CustomerName) VALUES
(1, 'John Doe'),
(2, 'Jane Smith'),
(3, 'Mike Brown');
Insert Data Into the
OrderDetails
Table:
For this step, you’ll manually split the ProductIDs
and Quantities
into separate rows. In a real-world scenario, this might require a script or manual data manipulation based on the original data format.
INSERT INTO OrderDetails (OrderID, ProductID, Quantity) VALUES
(1, 101, 2),
(1, 102, 1),
(2, 103, 3),
(3, 101, 1),
(3, 104, 2),
(3, 105, 1);
Result:#
Now, the data is in 1NF. Each row in OrderDetails
represents a single product order, eliminating the repeating groups.
Exercise 2: Normalize to Second Normal Form (2NF)#
Scenario#
The OrderDetails
table from Exercise 1 is in 1NF but not in 2NF because the Quantity
field is only dependent on ProductID
, not on the composite key (OrderID
, ProductID
).
Objective#
Normalize the data structure to 2NF by ensuring all non-key attributes are fully functionally dependent on the primary key.
Steps and SQL Code:#
Identify Partial Dependencies: The partial dependency is that
Quantity
depends onProductID
but not on the composite key. However, this scenario might be a misunderstanding of 2NF principles, asQuantity
actually depends on bothOrderID
andProductID
(how many of a product were ordered in each specific order). A more accurate example of partial dependency would involve product information likeProductName
orProductPrice
depending onProductID
alone.Create a Separate Table for Products: Assuming a more fitting scenario for 2NF, where
ProductPrice
depends only onProductID
, not on the composite key of an order detail.
CREATE TABLE Products (
ProductID INT PRIMARY KEY,
ProductName VARCHAR(255),
ProductPrice DECIMAL(10, 2)
);
Remove the Partial Dependency from
OrderDetails
:Adjust
OrderDetails
to referenceProducts
.Assume
OrderDetails
already exists withoutProductPrice
.
Step 1: Populate the Products
Table
INSERT INTO Products (ProductID, ProductName, ProductPrice) VALUES
(101, 'Product A', 10.00),
(102, 'Product B', 15.50),
(103, 'Product C', 7.25),
(104, 'Product D', 12.00),
(105, 'Product E', 5.75);
Step 2: Adjust OrderDetails
(if needed)
The OrderDetails
table should reference Products
without storing product-specific information like price, which eliminates the partial dependency.
-- Assuming we need to adjust the schema (if it previously included ProductPrice, for instance
ALTER TABLE OrderDetails DROP COLUMN ProductPrice; -- Only if ProductPrice was previously included.
Result:#
By ensuring that all attributes in OrderDetails
are fully functionally dependent on the primary key and by separating product-specific information into its own table, the database schema is now in 2NF. Each non-key attribute in OrderDetails
is fully functionally dependent on the primary key (DetailID
), and product information is correctly normalized into the Products
table.