Database Normalization - 3NF & 4NF#

Third Normal Form (3NF)#

3NF is focused on eliminating transitive dependency from a database table. A table is in 3NF if it is in 2NF and all its attributes are not only fully functionally dependent on the primary key but also only on the primary key. In simpler terms, no non-primary key attribute should depend on another non-primary key attribute.

  • Transitive Dependency: This occurs when a non-primary key attribute depends on another non-primary key attribute.

  • Objective: To ensure that every non-primary key attribute is directly dependent on the primary key, promoting data integrity and reducing redundancy.

Example with Spatial Data:#

Imagine a database that stores information about properties for a real estate agency. Consider a table before applying 3NF:

Properties Table:
- PropertyID (Primary Key)
- Address
- City
- CityPopulation
- State
- Country

In this table, CityPopulation is a transitive dependency because it depends on City, not directly on PropertyID. To bring this table to 3NF, we would remove CityPopulation and place it in a separate table where City is the primary key.

Properties Table (3NF):
- PropertyID (Primary Key)
- Address
- City
- State
- Country

CityDemographics Table (3NF):
- City (Primary Key)
- CityPopulation

Now, all non-primary key attributes in the Properties table directly depend on the primary key, and CityPopulation is stored in a separate table, eliminating the transitive dependency.

Fourth Normal Form (4NF)#

4NF takes normalization a step further by addressing multi-valued dependencies. A table is in 4NF if it is in 3NF and does not have any multi-valued dependencies unless they are trivial.

  • Multi-valued Dependency: This occurs when two attributes in a table are independent of each other but both depend on a third attribute (the primary key).

  • Objective: To separate attributes into different tables such that each table contains no multi-valued dependencies, thereby reducing redundancy and promoting data integrity.

###Example with Spatial Data: Let’s consider a scenario in our real estate database where a property can have multiple types of spatial data associated with it, like zoning types and available utilities.

PropertyZoningUtilities Table:
- PropertyID (Primary Key)
- ZoningType
- Utility

A property could be zoned for both residential and commercial use and could have utilities like electricity and water. This setup introduces multi-valued dependencies because ZoningType and Utility are independent of each other but both depend on PropertyID.

To normalize to 4NF, we split this into two tables:

PropertyZoning Table (4NF):
- PropertyID (Primary Key)
- ZoningType

PropertyUtilities Table (4NF):
- PropertyID (Primary Key)
- Utility

Now, each table contains no multi-valued dependencies, and the database structure more accurately reflects the reality that zoning types and utilities are independent aspects of a property.

Why this is important.#

By implementing 3NF, we ensure that all attributes in a table are only dependent on the primary key, removing any transitive dependencies. Advancing to 4NF further refines our database design by separating attributes with multi-valued dependencies into their tables, thereby reducing redundancy and promoting a cleaner, more efficient database structure.

These principles of normalization help in designing databases that are logical, maintainable, and scalable, especially important when dealing with complex data types like spatial data.


Normalizing Spatial Data#

Let’s go over another example involving spatial data, focusing on a Geographic Information Systems (GIS) for managing city infrastructure. This example will cover designing and normalizing a database that records information about public parks within a city, including their locations, amenities, and maintenance schedules.

Scenario: City Parks GIS Database#

The city’s Parks and Recreation Department requires a database to manage information about city parks, including park locations, available amenities (e.g., playgrounds, restrooms, picnic areas), and maintenance schedules for these amenities.

Part 1: Define the Initial Table Structure#

Start by creating an initial table that combines all the information into one structure. This table will not be normalized and will serve as the starting point for the exercise:

CREATE TABLE CityParks (
    ParkID SERIAL PRIMARY KEY,
    ParkName VARCHAR(255),
    ParkLocation GEOMETRY(Point, 4326), -- Using PostGIS for spatial data
    AmenityType VARCHAR(100), -- e.g., Playground, Restroom
    MaintenanceSchedule TEXT, -- e.g., "Every Monday"
    MaintenanceContact VARCHAR(255) -- Contact information for maintenance crew
);

Part 2: Normalize to 1NF#

Ensure that the table is in the First Normal Form (1NF) by eliminating repeating groups and ensuring atomicity. In the initial design, AmenityType, MaintenanceSchedule, and MaintenanceContact could potentially form repeating groups if a park has multiple amenities, which violates 1NF.

Solution:

  1. Separate amenities into a new table to ensure atomicity and eliminate repeating groups.

CREATE TABLE ParkAmenities (
    AmenityID SERIAL PRIMARY KEY,
    ParkID INT REFERENCES CityParks(ParkID),
    AmenityType VARCHAR(100),
    MaintenanceSchedule TEXT,
    MaintenanceContact VARCHAR(255)
);
  1. Remove AmenityType, MaintenanceSchedule, and MaintenanceContact from CityParks.

Part 3: Normalize to 2NF#

Ensure all non-key attributes in ParkAmenities are fully functionally dependent on the primary key.

Since ParkID is a foreign key and not part of a composite key, and all other attributes depend on AmenityID, the table is already in 2NF.

Part 4: Normalize to 3NF#

To achieve 3NF, ensure that there are no transitive dependencies.

If MaintenanceContact depends on AmenityType rather than AmenityID, it’s a transitive dependency.

Solution:

  1. Create a separate table for maintenance information related to each amenity type.

CREATE TABLE AmenityMaintenance (
    AmenityType VARCHAR(100) PRIMARY KEY,
    MaintenanceSchedule TEXT,
    MaintenanceContact VARCHAR(255)
);
  • What’s the difference between TEXT and VARCHAR(n)?

    • In SQL, the difference between VARCHAR(n) and TEXT is that VARCHAR(n) limits the maximum length of a column. For example, VARCHAR(255) will not allow a string longer than 255 characters.

    • VARCHAR uses less storage than TEXT for strings up to 255 characters.

    • In the past, many database management systems have set 255 characters as the maximum length for a VARCHAR. Even today, this limit often remains effective, especially when utilizing UTF-8 encoding and indexing columns due to restrictions on index length. It’s the largest number of characters that can be counted with an 8-bit number.

    • TEXT is deprecated in SQL, so best to use VARCHAR(n) and specificy length of n – for now, set to “255” for less storage useage.

  1. Update ParkAmenities to remove MaintenanceSchedule and MaintenanceContact, and adjust foreign key relationships as necessary.

Part 5: Normalize to 4NF#

Address any multi-valued dependencies.

If the database design has reached this point without multi-valued dependencies, then it is already in 4NF. For our scenario, assuming that each AmenityType has a unique MaintenanceSchedule and MaintenanceContact, the design should meet 4NF criteria.

Part 6: Querying Spatial Data#

  1. Inserting Spatial Data:

    INSERT INTO CityParks (ParkName, ParkLocation)
    VALUES ('Central Park', ST_GeomFromText('POINT(longitude latitude)', 4326));
    
  • latitude & longitude in this case would need to be actual coordinates.

  1. Finding Nearby Parks:

    SELECT ParkName
    FROM CityParks
    WHERE ST_DWithin(
        ParkLocation,
        ST_MakePoint(longitude, latitude)::geography,
        1000 -- Search within 1km (default is meters since data type is GEOGRAPHY)
        -- What would the units be if data type were GEOMETRY?
    );
    

In-class Exercise#

  • Normalize the provided initial table structure through 1NF to 4NF, ensuring efficient data organization and integrity.

  • Populate the tables with sample data, including at least three parks with multiple amenities.

  • Use spatial queries to demonstrate finding parks within a specified distance from a given point.

This example provides a practice scenario to normalize in a real-world context, enhancing our understanding of database design and spatial data management in GIS applications.