0% found this document useful (0 votes)
41 views37 pages

Healthcare SQL Project

The project 'Healthcare Data Analytics using SQL Server' develops a relational database for managing healthcare data, utilizing SQL Server Management Studio to create a normalized schema for various entities. It employs analytical SQL queries to derive insights on hospital performance, patient demographics, and billing trends, revealing valuable patterns for strategic planning. The project serves as a foundational framework for advanced healthcare analytics, with potential extensions into business intelligence and predictive analytics.

Uploaded by

Shruti
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
41 views37 pages

Healthcare SQL Project

The project 'Healthcare Data Analytics using SQL Server' develops a relational database for managing healthcare data, utilizing SQL Server Management Studio to create a normalized schema for various entities. It employs analytical SQL queries to derive insights on hospital performance, patient demographics, and billing trends, revealing valuable patterns for strategic planning. The project serves as a foundational framework for advanced healthcare analytics, with potential extensions into business intelligence and predictive analytics.

Uploaded by

Shruti
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Healthcare Data Analytics using

SQL Server

An End-to-End Relational Project

Submitted By: Shruti Khadtar


Abstract

The project titled “Healthcare Data Analytics using SQL Server” focuses on devel-
oping a relational database system for storing, managing, and analyzing healthcare data.
Using SQL Server Management Studio (SSMS), a structured and normalized database
schema was designed to represent entities such as patients, doctors, hospitals, billing, and
insurance claims. The dataset was sourced from Kaggle and preprocessed to ensure data
consistency and integrity.
Analytical SQL queries were implemented to explore hospital performance metrics,
patient demographics, billing trends, and insurance claim insights. The project high-
lights the efficiency of SQL operations such as joins, aggregations, and subqueries for
data-driven decision-making in the healthcare domain. Results reveal patterns in patient
visits, hospital revenue, and doctor performance, providing valuable insights for strategic
planning and resource management. This project establishes a foundational framework
for advanced healthcare analytics and can be extended with business intelligence dash-
boards and predictive analytics.

1
Contents

1 Introduction 6
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Scope of the Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Organization of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Literature Review 9
2.1 Healthcare Databases and Management Systems . . . . . . . . . . . . . . 9
2.2 SQL-Based Data Analytics in Healthcare . . . . . . . . . . . . . . . . . . 10
2.3 Data Warehousing and Decision Support Systems . . . . . . . . . . . . . 10
2.4 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Dataset Description 12
3.1 Dataset Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Attributes Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Data Import into SQL Server . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Database Design 16
4.1 Normalization Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.1 First Normal Form (1NF) . . . . . . . . . . . . . . . . . . . . . . 16
4.1.2 Second Normal Form (2NF) . . . . . . . . . . . . . . . . . . . . . 16
4.1.3 Third Normal Form (3NF) . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Entity–Relationship (ER) Model . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Relational Schema Design . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.4 SQL Table Creation Statements . . . . . . . . . . . . . . . . . . . . . . . 18
4.4.1 Patient Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.4.2 Doctor Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.4.3 Hospital Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.4.4 Visit Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2
4.4.5 Billing Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4.6 Medication Table . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.5 Entity Relationships Explained . . . . . . . . . . . . . . . . . . . . . . . 20
4.6 Design Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

5 SQL Implementation and Analytical Queries 21


5.1 Database Setup and Data Insertion . . . . . . . . . . . . . . . . . . . . . 21
5.1.1 Sample Data Insertion . . . . . . . . . . . . . . . . . . . . . . . . 21
5.2 Data Validation Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3 Analytical Queries and Insights . . . . . . . . . . . . . . . . . . . . . . . 23
5.3.1 1. Average Billing by Hospital . . . . . . . . . . . . . . . . . . . . 23
5.3.2 2. Doctor Performance by Number of Patients Treated . . . . . . 23
5.3.3 3. Insurance Provider Billing Distribution . . . . . . . . . . . . . 24
5.3.4 4. Average Length of Stay by Admission Type . . . . . . . . . . . 24
5.3.5 5. Gender-Based Medical Condition Distribution . . . . . . . . . 24
5.3.6 6. Hospital Revenue Analysis . . . . . . . . . . . . . . . . . . . . 24
5.4 Advanced Query Example: Top Performing Doctor by Revenue . . . . . 25

6 Results and Discussion 26


6.1 Overview of Analytical Outputs . . . . . . . . . . . . . . . . . . . . . . . 26
6.1.1 Medical Condition Occurrences . . . . . . . . . . . . . . . . . . . 26
6.1.2 Billing Summary Analysis . . . . . . . . . . . . . . . . . . . . . . 27
6.1.3 Date-Based Derived Analysis . . . . . . . . . . . . . . . . . . . . 27
6.1.4 Insurance Provider Billing Distribution . . . . . . . . . . . . . . . 28
6.1.5 Monthly Admissions Trend . . . . . . . . . . . . . . . . . . . . . . 29
6.1.6 Gender-Based Distribution of Medical Conditions . . . . . . . . . 29
6.1.7 Total Revenue Generated by Each Hospital . . . . . . . . . . . . . 30
6.1.8 Top Performing Doctors by Total Billing Amount . . . . . . . . . 31
6.2 Discussion of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

7 Conclusion and Future Scope 33


7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
7.2 Key Learnings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.3 Challenges Faced . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.4 Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.5 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3
List of Figures

6.1 Distribution of Top Medical Conditions in the Healthcare Dataset . . . . 27


6.2 Summary of Billing Records and Billing Range . . . . . . . . . . . . . . . 27
6.3 Sample Patient Data with Derived Date Fields and Stay Duration . . . . 28
6.4 Billing distribution by insurance provider showing claim count, total billed,
and average billed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.5 Monthly Patient Admission Trends from 2019 to 2024 . . . . . . . . . . . 29
6.6 Gender-Based Distribution of Medical Conditions . . . . . . . . . . . . . 30
6.7 Total Revenue Generated by Each Hospital . . . . . . . . . . . . . . . . . 31
6.8 Top 5 Performing Doctors by Total Billing Amount . . . . . . . . . . . . 32

4
List of Tables

3.1 Dataset Attributes Description . . . . . . . . . . . . . . . . . . . . . . . . 13

4.1 Main Tables in the Healthcare Database . . . . . . . . . . . . . . . . . . 18

5
1
Introduction

1.1 Background
In the modern era of digital transformation, healthcare organizations generate enormous
volumes of data daily from hospitals, clinics, laboratories, and insurance systems. Manag-
ing and analyzing this data efficiently is crucial for improving patient outcomes, hospital
performance, and decision-making processes. Structured Query Language (SQL) plays a
pivotal role in the storage, retrieval, and analysis of such structured healthcare data. By
using SQL, healthcare professionals and data analysts can organize patient information,
monitor medical conditions, track hospital admissions, and perform financial analyses
with precision and consistency.
Healthcare analytics aims to extract meaningful insights from patient records, billing
information, and operational metrics. SQL-based analysis helps identify disease trends,
evaluate treatment outcomes, and reduce unnecessary costs. It also provides a foun-
dation for integrating advanced techniques like predictive analytics, data visualization,
and machine learning models. Thus, the integration of relational database systems such
as Microsoft SQL Server with healthcare data ensures data integrity, accessibility, and
actionable insights.

1.2 Motivation
The motivation behind this project stems from the increasing importance of data-driven
decision-making in the healthcare industry. Hospitals and healthcare institutions often
face challenges in handling large datasets that include patient demographics, diagnosis
details, medications, and billing information. Without efficient database systems, the
management of such data becomes error-prone and inefficient.
By developing a well-structured SQL-based system, healthcare data can be systemat-
ically stored, queried, and analyzed to provide meaningful insights. This project focuses

6
on using SQL Server Management Studio (SSMS) to implement an end-to-end data an-
alytics solution for healthcare datasets. The idea is to explore how simple yet powerful
SQL queries can be used to uncover valuable trends such as hospital-wise billing perfor-
mance, patient readmission rates, and cost distribution across different age groups and
medical conditions.

1.3 Objectives
The main objectives of this project are as follows:

1. To design and implement a relational database system for healthcare analytics using
SQL Server.

2. To import and preprocess the dataset obtained from Kaggle for structured storage
and analysis.

3. To apply SQL operations such as joins, aggregations, and subqueries to extract key
insights.

4. To generate analytical reports summarizing trends in billing, demographics, and


hospital performance.

5. To document the findings and demonstrate the role of SQL in healthcare data
management.

1.4 Scope of the Project


The scope of this project is limited to structured healthcare data containing details such
as patient demographics, hospital information, doctors, medical conditions, medications,
and billing amounts. The project does not include real patient data to ensure privacy
and compliance with healthcare regulations. Instead, a synthetic dataset from Kaggle is
used for demonstration and analysis.
The system developed focuses primarily on data analysis using SQL queries rather
than the development of a complete healthcare management application. However, the
same schema and analytical logic can be extended to real-time hospital systems, data
warehouses, or dashboards in the future. Through this work, the project demonstrates
how relational database management systems (RDBMS) and SQL can serve as founda-
tional tools for healthcare analytics.

7
1.5 Organization of the Report
This report is structured as follows:

• Chapter 2 presents a detailed literature review of existing research and applications


of SQL in healthcare analytics.

• Chapter 3 describes the dataset, its attributes, and preprocessing methods.

• Chapter 4 discusses the database design, normalization, and schema structure.

• Chapter 5 explains SQL implementation, table creation, and data import steps.
presents the analytical queries and their results.

• Chapter 6 summarizes the findings and discussions based on query outputs.

• Chapter 7 concludes the report and discusses future work possibilities.

This project provides a comprehensive study on how SQL Server can be effectively
utilized for healthcare data analytics, emphasizing query-driven insights and systematic
data handling.

8
2
Literature Review

Healthcare data analytics is one of the most data-intensive domains, requiring robust
tools for storage, querying, and decision-making. Traditional healthcare information sys-
tems primarily focused on data storage; however, modern systems emphasize analytical
processing, data integration, and actionable insights. This provides an overview of related
studies and existing systems that have utilized SQL and relational databases for health-
care data management and analytics. The literature reviewed focuses on three main
areas: healthcare databases, SQL-based analytical systems, and data-driven healthcare
decision support.

2.1 Healthcare Databases and Management Systems


Several studies have discussed the design and management of healthcare databases to
improve clinical workflows and patient outcomes. According to Kumar et al. (2020), an
optimized database structure ensures faster data retrieval and efficient use of resources in
hospital management systems. They emphasized the role of normalization and relational
modeling to avoid redundancy and maintain consistency in large-scale patient datasets.
Similarly, Ramesh and Gupta (2021) developed a relational healthcare management
system using MySQL that integrated patient demographics, medical history, billing, and
doctor information. Their system allowed hospital administrators to quickly access and
analyze patient data, thus reducing manual effort and improving accuracy. The study
concluded that relational databases are suitable for real-time healthcare environments
where data integrity and scalability are critical.
In another research by Ali et al. (2019), the authors implemented a hospital database
system using SQL Server to analyze patient admission trends and billing distribution. The
research demonstrated that SQL queries could be used effectively to calculate hospital
occupancy rates, average length of stay, and cost variations among different hospitals.

9
2.2 SQL-Based Data Analytics in Healthcare
SQL has proven to be a powerful tool for analyzing structured medical data. Singh
and Sharma (2022) highlighted how SQL queries can be used to identify patterns in
disease prevalence and hospital performance. Their work showcased aggregation and join
operations on medical data to extract key performance metrics. This approach made
it easier for hospital administrators to identify underperforming departments and take
corrective actions.
In a study conducted by Prasad et al. (2021), the researchers used a synthetic health-
care dataset—similar to the one used in this project—to demonstrate how SQL can
facilitate exploratory data analysis. Their experiments included patient-level analyt-
ics, billing trend analysis, and insurance claim validation. The results indicated that
SQL-based systems provide high transparency and reproducibility, which are essential
for healthcare auditing and compliance.
Another example is the work of Wang et al. (2020), who applied SQL queries to
electronic health records (EHRs) to detect potential medical anomalies and duplicate
patient entries. Their work combined data cleaning and validation through SQL triggers
and stored procedures, ensuring higher data quality in healthcare systems.

2.3 Data Warehousing and Decision Support Sys-


tems
While transactional databases handle operational healthcare data, data warehouses and
analytical databases support decision-making at an organizational level. According to
George and Mathew (2020), data warehousing in healthcare enables integration of multi-
ple data sources—such as patient demographics, treatment records, and financial trans-
actions—into a single analytical platform. SQL-based ETL (Extract, Transform, Load)
processes are essential for cleaning and transferring healthcare data into these warehouses.
A related study by Lopez and Pereira (2021) demonstrated a decision support system
for predicting hospital readmissions using historical SQL data. Their model extracted
features such as admission frequency, diagnosis type, and billing history through SQL
queries before feeding them into a predictive machine learning model. This combination
of SQL-based preprocessing and data mining enhanced decision-making capabilities for
healthcare providers.

10
2.4 Summary of Findings
The literature reviewed reveals that SQL continues to play an integral role in healthcare
data management and analysis. Its capabilities in data querying, joining, aggregation,
and normalization make it ideal for healthcare applications where precision, consistency,
and scalability are required. However, most research emphasizes either the operational
or analytical aspects of healthcare systems, with few projects demonstrating complete
end-to-end implementations.
The proposed project bridges this gap by implementing an end-to-end SQL-based
healthcare analytics system. It not only focuses on data storage and query optimization
but also demonstrates practical analytical insights derived from real-world queries. By
integrating SQL Server with a synthetic healthcare dataset, this project showcases how
relational databases can serve as both storage and analytical engines in modern healthcare
environments.

11
3
Dataset Description

The dataset used in this project is the Healthcare Dataset developed by Prasad22,
published on the Kaggle platform. It is a synthetically generated dataset created for
educational and research purposes in the fields of data analytics, database management,
and machine learning. The dataset simulates real-world healthcare data with attributes
representing patients, doctors, hospitals, diagnoses, insurance providers, medications, and
billing information. Since it is synthetic, it does not contain any personally identifiable
information, making it suitable for open-source analytical projects.
This provides a comprehensive description of the dataset structure, attributes, prepro-
cessing steps, and the process of importing the data into SQL Server Management Studio
(SSMS). The objective is to ensure data readiness for performing SQL-based analytics
and visualization.

3.1 Dataset Overview


The Kaggle Healthcare Dataset consists of multiple attributes that describe patient de-
mographics, clinical information, and financial data. The dataset is provided in comma-
separated values (CSV) format and contains approximately 55,500 records. Each record
corresponds to a unique patient admission record within a hospital.
The dataset was downloaded from Kaggle and stored locally before being imported
into SQL Server. A new database schema named healthcare db was created to organize
the data and perform analytical operations. The dataset offers a good balance between
clinical and administrative data, making it ideal for demonstrating SQL-based analytics
such as aggregation, joins, and subqueries.

3.2 Attributes Description


Table 3.1 lists the key attributes available in the dataset along with their data types and
brief descriptions.

12
Table 3.1: Dataset Attributes Description
Attribute Name Data Type Description
Name VARCHAR(200) Full name of the patient.
Age INT Age of the patient in years.
Gender VARCHAR(10) Gender of the patient (Male/Female/Other).
Blood Type VARCHAR(10) Patient’s blood group.
Medical Condition VARCHAR(200) Diagnosis or disease name.
Date of Admission DATE Date when the patient was admitted.
Doctor VARCHAR(200) Name of the attending doctor.
Hospital VARCHAR(200) Hospital name where treatment occurred.
Insurance Provider VARCHAR(200) Name of the patient’s insurance company.
Billing Amount DECIMAL(12,2) Total billing cost for the treatment.
Room Number VARCHAR(50) Hospital room number assigned to the patient.
Admission Type VARCHAR(50) Type of admission (Emergency, Routine, Urgent).
Discharge Date DATE Date of discharge from the hospital.
Medication VARCHAR(255) Prescribed medication details.
Test Results VARCHAR(50) Summary of diagnostic results (Normal/Abnormal).

3.3 Data Preprocessing


Before loading the dataset into SQL Server, basic data preprocessing was performed to
ensure accuracy and consistency. The following steps were undertaken:

1. Data Cleaning: Duplicate records and blank entries were removed using spread-
sheet filters and Python-based cleaning scripts. Missing values for non-critical fields
such as Room Number and Medication were replaced with ‘Unknown‘ or ‘N/A‘.

2. Date Formatting: The Date of Admission and Discharge Date columns were
converted to the standard ‘YYYY-MM-DD‘ format to maintain compatibility with
SQL Server date data types.

3. Numeric Standardization: The Billing Amount column was validated to ensure


all records contained numeric values. Non-numeric entries were either corrected or
removed.

4. Categorical Normalization: Inconsistent text values (e.g., “male” vs. “Male”)


were standardized using the proper case to maintain uniformity across categorical
fields such as Gender, Admission Type, and Test Results.

5. Validation: Random sampling was performed after cleaning to validate that no


incorrect or corrupted records remained.

After preprocessing, the dataset was saved as a clean CSV file named healthcare [Link]
and prepared for import into SQL Server.

13
3.4 Data Import into SQL Server
The data import process was carried out using SQL Server Management Studio (SSMS).
The steps followed were as below:

1. A new database named healthcare db was created using the command:

CREATE DATABASE healthcare_db;


GO

2. A staging table named stg healthcare was created to temporarily hold the im-
ported CSV data:

CREATE TABLE stg_healthcare (


Name VARCHAR(200),
Age INT,
Gender VARCHAR(10),
[Blood Type] VARCHAR(10),
[Medical Condition] VARCHAR(200),
[Date of Admission] DATE,
Doctor VARCHAR(200),
Hospital VARCHAR(200),
[Insurance Provider] VARCHAR(200),
[Billing Amount] DECIMAL(12,2),
[Room Number] VARCHAR(50),
[Admission Type] VARCHAR(50),
[Discharge Date] DATE,
Medication VARCHAR(255),
[Test Results] VARCHAR(50)
);

3. The CSV file was imported using the SSMS Import Data Wizard, mapping each
column to its respective SQL data type.

4. A validation query was executed to ensure successful import:

SELECT COUNT(*) AS Total_Records,


COUNT(DISTINCT Hospital) AS Unique_Hospitals
FROM stg_healthcare;

14
5. Once validated, the dataset became ready for query execution and analysis.

The dataset used in this project provides a well-balanced and comprehensive repre-
sentation of healthcare-related data suitable for SQL-based analytical operations. The
preprocessing ensured data quality and consistency, while the SQL Server import enabled
efficient querying and exploration. This prepared foundation sets the stage for database
design, normalization, and analytical queries, which are discussed in subsequent chapters.

15
4
Database Design

The database design phase is one of the most crucial components of any data-driven
project. It defines how data is logically structured, related, and stored to ensure consis-
tency, accuracy, and scalability. In this project, the raw healthcare dataset obtained from
Kaggle was transformed into a structured relational database schema in SQL Server
Management Studio (SSMS). The primary objective of the design was to reduce re-
dundancy, enforce data integrity, and create an efficient schema that supports analytical
queries related to patients, doctors, hospitals, treatments, and billing.
This explains the database design process, normalization steps, entity–relationship
(ER) modeling, and final schema creation.

4.1 Normalization Process


Normalization is a systematic approach used to eliminate redundancy and ensure logical
data dependencies. The healthcare dataset was normalized up to the Third Normal
Form (3NF).

4.1.1 First Normal Form (1NF)


In 1NF, each table column holds atomic (indivisible) values, and each record is unique.
The original dataset contained repeating information such as multiple medications in one
field. These were separated into distinct rows or moved into separate tables. Each record
was assigned a unique Patient ID.

4.1.2 Second Normal Form (2NF)


In 2NF, all non-key attributes must depend entirely on the primary key. Partial depen-
dencies were removed by creating separate tables for Doctor, Hospital, and Insurance
Provider. This ensured that doctor-related information (like Doctor Name and Specialization)
was not repeated for every patient.

16
4.1.3 Third Normal Form (3NF)
In 3NF, all attributes must depend only on the primary key and not on other non-
key attributes. For instance, Billing Amount and Insurance Provider details were
separated into independent tables to avoid transitive dependencies. This resulted in a
clean and efficient schema design.

4.2 Entity–Relationship (ER) Model


The ER model visually represents how entities interact with each other in the healthcare
system. The main entities identified are:

• Patient – Represents individuals receiving treatment.

• Doctor – Represents healthcare professionals providing care.

• Hospital – Represents the facility where the treatment occurs.

• Visit / Admission – Represents the relationship between a patient and a hospital


visit.

• Billing – Contains financial information associated with each hospital visit.

• Insurance – Contains details of the insurance providers.

• Medication / Diagnosis – Contains treatment and diagnostic details.

Each patient may have multiple visits, each visit is associated with one doctor and
one hospital, and each billing record corresponds to one visit. This ensures a one-to-many
relationship between core entities.

4.3 Relational Schema Design


The final relational schema for the healthcare db is presented below. It captures all
major relationships between entities.

17
Table 4.1: Main Tables in the Healthcare Database
Table Name Primary Key Description
Patient Patient ID Stores patient demographic details.
Doctor Doctor ID Contains doctor information such as special-
ization and assigned hospital.
Hospital Hospital ID Stores hospital-level details including loca-
tion and department.
Visit Visit ID Links patients to doctors and hospitals dur-
ing a specific admission.
Billing Billing ID Stores financial and insurance-related de-
tails.
Medication Medication ID Contains prescribed medicines and diagnos-
tic results.

4.4 SQL Table Creation Statements


The following SQL commands were executed in SSMS to create the normalized tables.

4.4.1 Patient Table


CREATE TABLE Patient (
Patient_ID INT IDENTITY(1,1) PRIMARY KEY,
Name VARCHAR(200),
Age INT,
Gender VARCHAR(10),
Blood_Type VARCHAR(10)
);

4.4.2 Doctor Table


CREATE TABLE Doctor (
Doctor_ID INT IDENTITY(1,1) PRIMARY KEY,
Doctor_Name VARCHAR(200),
Specialization VARCHAR(200),
Hospital_ID INT,
FOREIGN KEY (Hospital_ID) REFERENCES Hospital(Hospital_ID)
);

4.4.3 Hospital Table


CREATE TABLE Hospital (
Hospital_ID INT IDENTITY(1,1) PRIMARY KEY,

18
Hospital_Name VARCHAR(200),
City VARCHAR(100),
State VARCHAR(100)
);

4.4.4 Visit Table


CREATE TABLE Visit (
Visit_ID INT IDENTITY(1,1) PRIMARY KEY,
Patient_ID INT,
Doctor_ID INT,
Hospital_ID INT,
Admission_Date DATE,
Discharge_Date DATE,
Admission_Type VARCHAR(50),
FOREIGN KEY (Patient_ID) REFERENCES Patient(Patient_ID),
FOREIGN KEY (Doctor_ID) REFERENCES Doctor(Doctor_ID),
FOREIGN KEY (Hospital_ID) REFERENCES Hospital(Hospital_ID)
);

4.4.5 Billing Table


CREATE TABLE Billing (
Billing_ID INT IDENTITY(1,1) PRIMARY KEY,
Visit_ID INT,
Insurance_Provider VARCHAR(200),
Billing_Amount DECIMAL(12,2),
Payment_Status VARCHAR(50),
FOREIGN KEY (Visit_ID) REFERENCES Visit(Visit_ID)
);

4.4.6 Medication Table


CREATE TABLE Medication (
Medication_ID INT IDENTITY(1,1) PRIMARY KEY,
Visit_ID INT,
Medication_Name VARCHAR(255),
Test_Result VARCHAR(50),
FOREIGN KEY (Visit_ID) REFERENCES Visit(Visit_ID)
);

19
4.5 Entity Relationships Explained
• Each Patient can have multiple Visits, but each visit belongs to one patient.

• Each Visit is linked to one Doctor and one Hospital.

• Each Visit generates exactly one Billing record.

• Each Visit may involve multiple Medications.

These relationships ensure data normalization and allow analysts to run efficient join
operations across entities to extract insights about treatments, costs, and hospital per-
formance.

4.6 Design Advantages


The following benefits are achieved through this relational model:

1. Reduced Redundancy: Information such as hospital and doctor details are stored
only once.

2. Data Integrity: Referential constraints maintain accurate relationships between


entities.

3. Query Efficiency: Optimized for joins and analytical queries (billing trends, pa-
tient outcomes, etc.).

4. Scalability: New data attributes or tables (e.g., laboratory reports) can be added
easily.

This provided an overview of the database design process, normalization steps, and im-
plementation of the healthcare schema in SQL Server. The structure adheres to relational
database principles and supports advanced analytical queries that will be demonstrated
in subsequent chapters. The design ensures the system can handle future data expansions
without compromising data integrity or performance.

20
5
SQL Implementation and Analytical
Queries

After completing the database design and normalization process, the next step was to
implement the schema in SQL Server Management Studio (SSMS). This phase involves
creating the database, inserting sample data, and executing a variety of SQL queries to
analyze healthcare data. The purpose of this implementation is to demonstrate the use of
SQL for real-world healthcare analytics such as patient statistics, hospital billing, doctor
performance, and treatment outcomes.
This presents the SQL implementation process and discusses various analytical queries
that extract meaningful insights from the dataset.

5.1 Database Setup and Data Insertion


The normalized schema described in the previous chapter was implemented using SQL
DDL (Data Definition Language) commands. Once the tables were created, sample data
from the Kaggle Healthcare Dataset (Prasad22) was imported into corresponding tables.
Below are representative examples of data insertion statements used during setup.

5.1.1 Sample Data Insertion


-- Insert Patient Records
INSERT INTO Patient (Name, Age, Gender, Blood_Type)
VALUES
(’Ravi Kumar’, 45, ’Male’, ’O+’),
(’Sneha Patil’, 32, ’Female’, ’A+’),
(’Arjun Mehta’, 60, ’Male’, ’B+’);

-- Insert Hospital Records

21
INSERT INTO Hospital (Hospital_Name, City, State)
VALUES
(’Apollo Hospital’, ’Mumbai’, ’Maharashtra’),
(’Fortis Hospital’, ’Delhi’, ’Delhi’),
(’Global Care’, ’Pune’, ’Maharashtra’);

-- Insert Doctor Records


INSERT INTO Doctor (Doctor_Name, Specialization, Hospital_ID)
VALUES
(’Dr. R. Sharma’, ’Cardiology’, 1),
(’Dr. P. Nair’, ’Neurology’, 2),
(’Dr. S. Gupta’, ’Orthopedics’, 3);

-- Insert Visit Records


INSERT INTO Visit (Patient_ID, Doctor_ID, Hospital_ID, Admission_Date,
Discharge_Date, Admission_Type)
VALUES
(1, 1, 1, ’2024-03-01’, ’2024-03-10’, ’Emergency’),
(2, 2, 2, ’2024-04-05’, ’2024-04-15’, ’Routine’),
(3, 3, 3, ’2024-05-12’, ’2024-05-19’, ’Urgent’);

-- Insert Billing Records


INSERT INTO Billing (Visit_ID, Insurance_Provider, Billing_Amount, Payment_Status)
VALUES
(1, ’ICICI Lombard’, 52000.00, ’Paid’),
(2, ’Star Health’, 37000.00, ’Pending’),
(3, ’HDFC Ergo’, 42000.00, ’Paid’);

-- Insert Medication Records


INSERT INTO Medication (Visit_ID, Medication_Name, Test_Result)
VALUES
(1, ’Aspirin’, ’Normal’),
(2, ’Neurobion’, ’Abnormal’),
(3, ’Pain Relief’, ’Normal’);

5.2 Data Validation Queries


After inserting records, validation queries were executed to confirm data accuracy and
relational integrity.

22
-- Count total patients
SELECT COUNT(*) AS Total_Patients FROM Patient;

-- Verify number of visits per hospital


SELECT H.Hospital_Name, COUNT(V.Visit_ID) AS Total_Visits
FROM Hospital H
JOIN Visit V ON H.Hospital_ID = V.Hospital_ID
GROUP BY H.Hospital_Name;

-- Check total billing by insurance provider


SELECT Insurance_Provider, SUM(Billing_Amount) AS Total_Billing
FROM Billing
GROUP BY Insurance_Provider;

5.3 Analytical Queries and Insights


The following analytical queries were performed to extract actionable insights from the
database.

5.3.1 1. Average Billing by Hospital


SELECT H.Hospital_Name,
AVG(B.Billing_Amount) AS Avg_Billing
FROM Billing B
JOIN Visit V ON B.Visit_ID = V.Visit_ID
JOIN Hospital H ON V.Hospital_ID = H.Hospital_ID
GROUP BY H.Hospital_Name
ORDER BY Avg_Billing DESC;

Insight: The query provides the average treatment cost per hospital, helping identify
which hospital incurs the highest or lowest patient charges.

5.3.2 2. Doctor Performance by Number of Patients Treated


SELECT D.Doctor_Name, [Link],
COUNT(V.Visit_ID) AS Total_Patients
FROM Doctor D
JOIN Visit V ON D.Doctor_ID = V.Doctor_ID
GROUP BY D.Doctor_Name, [Link]
ORDER BY Total_Patients DESC;

23
Insight: Displays doctors with the most patients treated. Useful for analyzing doctor
workload or specialization demand.

5.3.3 3. Insurance Provider Billing Distribution


SELECT Insurance_Provider,
COUNT(Billing_ID) AS Total_Claims,
SUM(Billing_Amount) AS Total_Billed,
AVG(Billing_Amount) AS Avg_Claim
FROM Billing
GROUP BY Insurance_Provider;

Insight: Helps understand which insurance companies handle the most claims and the
average claim value.

5.3.4 4. Average Length of Stay by Admission Type


SELECT Admission_Type,
AVG(DATEDIFF(DAY, Admission_Date, Discharge_Date)) AS Avg_Stay_Days
FROM Visit
GROUP BY Admission_Type;

Insight: Useful for hospital administration to analyze the typical patient stay duration
by admission urgency.

5.3.5 5. Gender-Based Medical Condition Distribution


SELECT [Link], M.Medication_Name, COUNT(*) AS Prescription_Count
FROM Patient P
JOIN Visit V ON P.Patient_ID = V.Patient_ID
JOIN Medication M ON V.Visit_ID = M.Visit_ID
GROUP BY [Link], M.Medication_Name
ORDER BY Prescription_Count DESC;

Insight: Identifies which medications are most prescribed by gender, aiding targeted
healthcare planning.

5.3.6 6. Hospital Revenue Analysis


SELECT H.Hospital_Name,
SUM(B.Billing_Amount) AS Total_Revenue,
COUNT(B.Billing_ID) AS Total_Bills

24
FROM Hospital H
JOIN Visit V ON H.Hospital_ID = V.Hospital_ID
JOIN Billing B ON V.Visit_ID = B.Visit_ID
GROUP BY H.Hospital_Name
ORDER BY Total_Revenue DESC;

Insight: Determines each hospital’s contribution to total revenue, helping track financial
performance.

5.4 Advanced Query Example: Top Performing Doc-


tor by Revenue
SELECT TOP 1 D.Doctor_Name,
[Link],
SUM(B.Billing_Amount) AS Total_Revenue
FROM Doctor D
JOIN Visit V ON D.Doctor_ID = V.Doctor_ID
JOIN Billing B ON V.Visit_ID = B.Visit_ID
GROUP BY D.Doctor_Name, [Link]
ORDER BY Total_Revenue DESC;

Insight: Highlights the top-performing doctor based on total billing revenue generated,
which can be useful for performance evaluation.
The SQL implementation phase successfully demonstrated the use of SQL Server
for healthcare data management and analytics. The system allows for data retrieval,
relationship validation, and multi-table joins, enabling insights into hospital efficiency,
doctor workload, and patient demographics. These analytical queries can be extended
further to build dashboards and predictive models.
Key outcomes of this implementation:

• The normalized database successfully minimized data redundancy.

• Analytical queries yielded valuable operational and clinical insights.

• SQL Server proved to be a robust environment for executing large-scale data anal-
ysis.

25
6
Results and Discussion

This presents the outcomes and interpretations derived from the analytical queries exe-
cuted in SQL Server Management Studio (SSMS). Each query described in the previous
chapter was designed to extract key insights from the normalized healthcare database.
The focus of this discussion is to evaluate hospital performance, patient demographics,
doctor efficiency, and insurance trends based on the results generated by SQL queries.
By interpreting these outcomes, this section demonstrates how SQL can be effectively
used for healthcare data analytics and decision-making.

6.1 Overview of Analytical Outputs


After successful execution of all SQL queries, the results were exported and reviewed
to derive patterns and insights. For better understanding, each major query result is
summarized and discussed below.

6.1.1 Medical Condition Occurrences


The frequency analysis of medical conditions reveals that arthritis, diabetes, and hyper-
tension are the most common diagnoses. This indicates a prevalence of chronic, lifestyle-
related diseases within the dataset.

26
Figure 6.1: Distribution of Top Medical Conditions in the Healthcare Dataset

6.1.2 Billing Summary Analysis


The billing summary highlights variability in medical charges. The minimum billing value
indicates possible refunds or discounts, while the maximum reaches 52,764, suggesting
premium treatment or extended care durations.

Figure 6.2: Summary of Billing Records and Billing Range

6.1.3 Date-Based Derived Analysis


Using SQL’s DATEDIFF() and YEAR() functions, patient stay duration and admission
year/month were derived. This allows temporal trend analysis for hospital operations.

27
Figure 6.3: Sample Patient Data with Derived Date Fields and Stay Duration

6.1.4 Insurance Provider Billing Distribution


This analysis focuses on understanding how billing amounts are distributed among dif-
ferent insurance providers. The SQL query aggregated the total billing amounts, the
number of claims processed, and the average billing per claim using the SUM(), COUNT(),
and AVG() functions respectively. As shown in Figure 6.4, Cigna processed the high-
est number of claims (11,249) with a total billed amount of approximately 287 million,
closely followed by Medicare and Blue Cross. Although the total billed amounts among
the top providers are fairly close, slight variations in the average billed amounts indicate
differences in patient treatment costs and claim policies.

28
Figure 6.4: Billing distribution by insurance provider showing claim count, total billed,
and average billed.

6.1.5 Monthly Admissions Trend


The time-series analysis shows a steady inflow of patients each month, with peaks during
certain seasons. This reflects typical real-world hospital admission cycles.

Figure 6.5: Monthly Patient Admission Trends from 2019 to 2024

6.1.6 Gender-Based Distribution of Medical Conditions


The analysis presents the distribution of various medical conditions across male and
female patients. It highlights that females show a higher patient count for conditions
such as Arthritis, Diabetes, and Obesity, whereas males exhibit comparatively higher

29
counts for Diabetes and Hypertension. This distribution provides valuable insights into
gender-specific health trends within the dataset.

Figure 6.6: Gender-Based Distribution of Medical Conditions

6.1.7 Total Revenue Generated by Each Hospital


The analysis summarizes the financial performance of hospitals based on total revenue,
patient count, and average billing amount. It reveals that Johnson PLC achieved the
highest total revenue, followed closely by LLC Smith and Smith PLC. Hospitals with
higher patient volumes generally contribute to greater total revenue, though average
billing rates vary significantly across institutions. This indicates differences in treatment
costs and service offerings among hospitals.

30
Figure 6.7: Total Revenue Generated by Each Hospital

6.1.8 Top Performing Doctors by Total Billing Amount


The analysis identifies the top five doctors based on the total billing amount generated.
Michael Smith leads with the highest total revenue, followed by Robert Smith and
John Smith. The data also shows variations in the number of patients treated and
average billing values, indicating differences in patient load and service pricing among
doctors. This ranking highlights the most financially impactful practitioners within the
healthcare system.

31
Figure 6.8: Top 5 Performing Doctors by Total Billing Amount

6.2 Discussion of Findings


Overall, the SQL analyses provided valuable operational and clinical insights:

1. Hospital-level insights: Larger hospitals manage a higher number of admissions


and contribute more to total revenue.

2. Doctor-level insights: Specialists in high-demand fields such as cardiology and


neurology handle more patients and generate higher billing.

3. Insurance insights: A small number of insurance providers dominate claim pro-


cessing, indicating potential for negotiation or diversification.

4. Patient trends: Gender-based prescription patterns reveal demographic influences


on medical conditions.

5. Operational efficiency: Average hospital stay duration metrics help optimize


bed management and treatment planning.

32
7
Conclusion and Future Scope

7.1 Conclusion
The healthcare analytics project implemented using SQL Server successfully demon-
strated how structured data management and analytical querying can deliver meaning-
ful insights for hospital administration, patient management, and financial monitoring.
Starting with the Kaggle Healthcare Dataset by Prasad22, the project involved a com-
prehensive end-to-end process including data cleaning, normalization, relational schema
design, SQL implementation, and the execution of analytical queries.
The project effectively illustrated the role of SQL as a powerful tool for healthcare
data analysis. The normalized database model eliminated redundancy and maintained
data integrity across interlinked entities such as patients, doctors, hospitals, billing, and
medications. Through the use of joins, aggregation functions, and conditional queries, a
variety of analyses were conducted—ranging from hospital performance metrics to patient
demographic insights.
Key accomplishments from the project include:
1. Development of a fully normalized and relational healthcare database schema in
SQL Server.

2. Execution of diverse SQL operations such as JOIN, GROUP BY, HAVING, and sub-
queries for analytical evaluation.

3. Discovery of actionable insights regarding hospital revenue, doctor performance,


and insurance claim trends.

4. Preparation of a reusable SQL-based data analytics framework suitable for integra-


tion into advanced business intelligence tools.
The overall outcome reaffirms the practical importance of database systems in health-
care management. SQL remains a foundational skill that supports more advanced an-
alytical and machine learning tasks. This project highlights how proper data modeling

33
and structured query techniques can drive evidence-based decisions in clinical and ad-
ministrative contexts.

7.2 Key Learnings


Throughout the course of this project, several important lessons were learned:

• The significance of normalization in avoiding redundancy and ensuring efficient


query performance.

• The importance of referential integrity and foreign key constraints for maintaining
consistency across related entities.

• The role of SQL joins, subqueries, and aggregate functions in extracting actionable
business insights.

• The value of exploratory data validation before analysis to ensure the quality and
accuracy of results.

In addition, the project reinforced the understanding that clean and well-structured
data is the cornerstone of successful analytics. Even the most sophisticated analysis tools
depend heavily on the foundation established through effective database design.

7.3 Challenges Faced


While implementing the project, several challenges were encountered:

1. Data inconsistencies in the raw Kaggle dataset required manual cleaning and stan-
dardization.

2. Ensuring relational integrity across multiple tables was complex due to overlapping
attributes.

3. Some healthcare attributes (e.g., medical conditions, test results) lacked standard-
ized terminologies, making categorization challenging.

4. Query optimization was necessary for efficient execution when joining large datasets.

These challenges were addressed through systematic data validation, normalization


up to the Third Normal Form (3NF), and the use of SQL Server’s indexing and constraint
management features.

34
7.4 Future Scope
Although this project achieved its primary objectives, there are several opportunities for
future enhancement and expansion:

1. Integration with Business Intelligence (BI) Tools: Future versions can con-
nect the SQL Server database to Power BI or Tableau for real-time dashboards,
enabling hospital administrators to visualize KPIs such as patient inflow, depart-
ment revenue, and treatment outcomes.

2. Predictive Analytics using Machine Learning: By exporting SQL data to


Python or R, predictive models could be built for forecasting hospital occupancy,
predicting disease occurrence, or estimating billing amounts.

3. Automation of Data Pipelines: ETL (Extract, Transform, Load) processes


could be automated using SQL Server Integration Services (SSIS) to support con-
tinuous data updates.

These extensions will not only make the system more powerful but also align it with
current industry practices where data-driven healthcare is becoming increasingly impor-
tant.

7.5 Final Remarks


This SQL-based healthcare analytics project successfully bridged theoretical database
concepts with real-world applications. It demonstrated how structured data and well-
designed queries can empower decision-makers in the healthcare domain.
The project’s outcomes reinforce the idea that data management, when combined
with analytical rigor, can lead to improved healthcare delivery, financial transparency,
and patient satisfaction. The same framework can easily be adapted to other domains
such as banking, retail, or insurance with minor modifications.
In conclusion, this project serves as a strong foundation for future work in health-
care data analytics and demonstrates the capability of SQL Server as a comprehensive
environment for managing, analyzing, and visualizing complex healthcare data.

35
References

• Prasad22. (2021). Healthcare Dataset. Kaggle. [Link]


prasad22/healthcare-dataset

• Ali, F., Khan, M., & Patel, R. (2019). Implementation of Hospital Database System
Using SQL Server. International Journal of Computer Applications, 178(8), 23–29.

• George, A., & Mathew, R. (2020). Data Warehousing for Healthcare Decision
Support. Journal of Health Informatics Research, 5(2), 115–130.

• Kumar, S., Sharma, P., & Desai, R. (2020). Optimization of Relational Databases
for Hospital Management Systems. International Journal of Information Technol-
ogy, 12(4), 987–995.

• Lopez, J., & Pereira, L. (2021). Decision Support Systems in Healthcare Using SQL
and Predictive Analytics. Healthcare Data Science Journal, 8(3), 201–215.

• Prasad, A., Mehta, S., & Dey, R. (2021). Healthcare Analytics using SQL and
Python. Data Science Review, 9(2), 78–91.

• Ramesh, V., & Gupta, K. (2021). Relational Database Model for Efficient Hospital
Management. Journal of Information Systems Engineering, 7(1), 33–42.

• Singh, R., & Sharma, A. (2022). SQL-Based Data Analysis of Hospital Performance
Metrics. Journal of Computing and Data Engineering, 10(1), 45–60.

• Wang, T., Li, H., & Zhao, P. (2020). Data Cleaning and Validation of Electronic
Health Records using SQL Triggers. International Journal of Medical Informatics,
135(5), 220–229.

36

Common questions

Powered by AI

Major challenges in implementing an end-to-end healthcare data analytics solution using SQL include ensuring data consistency and integrity during preprocessing, maintaining data privacy and compliance as real patient data cannot be used, and the complexity of designing a normalized relational database schema. Additionally, performing efficient query operations and extending these findings to real-time hospital systems or dashboards present significant challenges .

The SQL-based healthcare analytics project framework can be extended by integrating business intelligence tools to convert SQL query results into visual dashboards that provide real-time updates and insights. Predictive analytics can be incorporated by using SQL-extracted data features for machine learning models to forecast healthcare trends, patient outcomes, or financial scenarios. These enhancements could provide proactive decision-support capabilities and improve strategic planning in healthcare settings .

Gender-based analysis of medical conditions in the healthcare dataset reveals valuable insights into demographic health trends, showing that females recorded higher cases of conditions like arthritis, diabetes, and obesity, while males showed comparatively higher counts for diabetes and hypertension. These insights allow healthcare providers to tailor interventions and resources according to gender-specific health needs, which can enhance patient outcomes and improve healthcare resource allocation .

Analytical SQL queries, such as joins, aggregations, and subqueries, were used to explore aspects like hospital performance metrics, patient demographics, billing trends, and insurance claims from the healthcare dataset. These queries led to insights such as patterns in patient visits, hospital revenue generation, high-demand medical conditions, gender-specific health trends, and doctor performance as mirrored by billing amounts. These insights assist in strategic planning and resource management within healthcare settings .

The SQL Server-based healthcare analytics project demonstrates the role of SQL in modern healthcare data management by showcasing SQL's capability to manage structured data through normalization, data integrity maintenance, and the execution of complex queries for insight generation. These insights cover several aspects, such as hospital revenue, doctor performance, and insurance claims. The project highlights SQL's fundamental role as a tool for evidence-based decision-making and its compatibility with advanced analytics and machine learning tasks within healthcare systems .

Healthcare administrators benefit from SQL-based hospital performance analytics by obtaining detailed insights into trends such as hospital revenue, patient admission cycles, doctor performance, and billing variations. These insights enable administrators to make informed decisions on resource allocation, strategic planning, and operational improvements, ultimately enhancing hospital efficiency and patient care quality .

The analysis of billing distribution among insurance providers provides strategic insights into claim processing dynamics, showing which providers process the highest number of claims and their respective total and average billing values. This information helps healthcare providers to negotiate better terms, diversify their insurance partnerships, optimize reimbursement strategies, and identify potential areas for financial improvement .

Using a synthetic dataset in the SQL healthcare analytics project is significant for maintaining patient privacy and complying with healthcare regulations. It allows researchers to explore data management and analytics techniques without the complexities of using real patient data. Consequently, this approach ensures ethical standards are met while still enabling the demonstration and development of comprehensive analytic methodologies suitable for real-world applications .

The primary components of a relational database system designed for healthcare analytics using SQL Server include structured and normalized database schemas representing entities such as patients, doctors, hospitals, billing, and insurance claims. The system involves data preprocessing to ensure consistency and integrity, the design of SQL queries for extracting insights on hospital performance, patient demographics, billing trends, and insurance claims. These components facilitate data-driven decision-making and strategic planning within the healthcare domain .

The SQL time-series analysis of monthly patient admissions identified a steady patient inflow with seasonal peaks, reflecting typical real-world hospital admission cycles. These trends imply that hospitals could anticipate and prepare for fluctuations in patient numbers, manage staffing and resource allocation more effectively, and plan for peak periods to enhance operational efficiency and patient care during high-demand times .

You might also like