Orina Dennis
Silas Wachira
Peter Mwangi
Humphrey Okaka
2. Given the challenges with missing or inconsistent data in medical datasets, how would you
design a more robust data preprocessing pipeline that can handle complex medical variables,
considering both imputation and feature engineering techniques?
1. Data Cleaning and Validation
Automated Data Validation Rules:
o Flag implausible values (e.g., BMI < 10 or > 80 kg/m², blood pressure < 60 or > 250 mmHg).
o Check logical consistency (e.g., no insulin data without a diabetes diagnosis).
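The two rules above can be sketched as a small rule-based validator. This is a minimal illustration, not a definitive implementation: the field names (`bmi`, `blood_pressure`, `insulin`, `has_diabetes`) and the record layout are assumptions, and the ranges are simply the example thresholds listed above.

```python
# Hypothetical field names; ranges are the example thresholds from the text.
PLAUSIBLE_RANGES = {
    "bmi": (10, 80),
    "blood_pressure": (60, 250),
}

def validate_record(record):
    """Return a list of human-readable flags for one patient record."""
    flags = []
    # Range rule: flag present values that fall outside the plausible range.
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            flags.append(f"{field}={value} outside [{low}, {high}]")
    # Consistency rule: insulin data should not appear without a diabetes diagnosis.
    if record.get("insulin") is not None and not record.get("has_diabetes", False):
        flags.append("insulin recorded without diabetes diagnosis")
    return flags

records = [
    {"bmi": 24.5, "blood_pressure": 120, "insulin": None, "has_diabetes": False},
    {"bmi": 95.0, "blood_pressure": 40, "insulin": 85.0, "has_diabetes": False},
]
flagged = {i: validate_record(r) for i, r in enumerate(records)}
```

Flagging rather than deleting keeps the decision about each suspicious value with a reviewer or a later pipeline step.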
Outlier Detection:
o Use IQR or Z-score methods for numeric features.
o Apply domain knowledge thresholds (e.g., clinically meaningful ranges for
glucose or HbA1c).
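As a rough sketch of these three checks using only the Python standard library: the glucose values are made up for illustration, and the 40 to 300 "clinical" bounds are placeholders, not clinical guidance.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [v for v in values if abs(v - mean) / sd > threshold]

def outside_clinical_range(values, low, high):
    """Values outside a domain-knowledge range supplied by the caller."""
    return [v for v in values if not (low <= v <= high)]

# Illustrative (made-up) glucose readings with one extreme value.
glucose = [85, 90, 92, 95, 99, 101, 104, 110, 118, 400]

iqr_flags = iqr_outliers(glucose)
z_flags = zscore_outliers(glucose)
range_flags = outside_clinical_range(glucose, low=40, high=300)  # placeholder bounds
```

In this small sample the IQR rule and the range check both flag 400, while the z-score rule at a threshold of 3 flags nothing, because the extreme value inflates the standard deviation. That is one reason to combine statistical rules with domain thresholds rather than rely on a single method.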
2. Advanced Missing Value Imputation
Rather than relying on basic median/mode imputation, consider:
A. Context-Aware Imputation
Use Multiple Imputation by Chained Equations (MICE) for numeric and categorical data,
which models each variable conditionally based on the others.
Incorporate temporal information: if time-series data are available (e.g., repeated patient
check-ups), carry forward the last known value or interpolate between visits.
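The temporal part of this step can be sketched in plain Python. The example covers only last-value carry-forward and linear interpolation for a single patient; it does not implement MICE, for which a library implementation would normally be used. The visit days and HbA1c values are invented for illustration, and visits are assumed to be sorted by time.

```python
def forward_fill(values):
    """Last observation carried forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def interpolate_linear(times, values):
    """Fill gaps that lie between two known values by linear interpolation.

    Gaps before the first or after the last known value are left as None.
    Assumes `times` is sorted in ascending order.
    """
    known = [(t, v) for t, v in zip(times, values) if v is not None]
    result = list(values)
    for i, (t, v) in enumerate(zip(times, values)):
        if v is not None:
            continue
        before = [(kt, kv) for kt, kv in known if kt < t]
        after = [(kt, kv) for kt, kv in known if kt > t]
        if before and after:
            (t0, v0), (t1, v1) = before[-1], after[0]
            result[i] = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return result

# Hypothetical check-up schedule (days since first visit) and HbA1c readings.
visit_day = [0, 30, 60, 90, 120]
hba1c = [6.0, None, None, 6.6, None]

locf = forward_fill(hba1c)
interp = interpolate_linear(visit_day, hba1c)
```

Carry-forward fills the final visit with the last known value, whereas interpolation leaves it empty because there is no later reading to interpolate toward; which behaviour is appropriate depends on the variable.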
B. Model-Based Imputation
Train a regression or classification model to predict missing values using complete cases.
o E.g., predict cholesterol from age, BMI, blood pressure, and glucose.
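A deliberately simplified sketch of this idea follows. It fits an ordinary least-squares line with a single predictor (BMI) on the complete cases and uses it to fill missing cholesterol values. The single predictor, the field names, and the toy data are assumptions made to keep the example short; a real pipeline would use several predictors and a library model.

```python
import statistics

def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def impute_with_regression(rows, predictor, target):
    """Fit on complete cases, then predict the target where it is missing."""
    complete = [r for r in rows
                if r[predictor] is not None and r[target] is not None]
    slope, intercept = fit_simple_regression(
        [r[predictor] for r in complete],
        [r[target] for r in complete],
    )
    imputed = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not modified
        if r[target] is None and r[predictor] is not None:
            r[target] = intercept + slope * r[predictor]
        imputed.append(r)
    return imputed

# Toy data: cholesterol is missing for the last patient.
patients = [
    {"bmi": 20.0, "cholesterol": 160.0},
    {"bmi": 25.0, "cholesterol": 185.0},
    {"bmi": 30.0, "cholesterol": 210.0},
    {"bmi": 35.0, "cholesterol": None},
]
completed = impute_with_regression(patients, "bmi", "cholesterol")
```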
C. Missingness Indicators
Add binary flags to indicate where data is missing. This helps the model learn patterns of
missingness (which can be clinically meaningful).
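A minimal sketch of such flags, assuming records are dictionaries with `None` marking a missing value (the field names are hypothetical):

```python
def add_missing_flags(rows, fields):
    """Add a 0/1 `<field>_missing` column for each listed field."""
    flagged_rows = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not modified
        for f in fields:
            r[f"{f}_missing"] = int(r.get(f) is None)
        flagged_rows.append(r)
    return flagged_rows

rows = [
    {"hba1c": 6.1, "cholesterol": None},
    {"hba1c": None, "cholesterol": 190.0},
]
with_flags = add_missing_flags(rows, ["hba1c", "cholesterol"])
```

The flags should be created before imputation, since the information about which values were originally missing is lost once the gaps are filled.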
3. Feature Engineering for Medical Context
A. Derived Clinical Indices
BMI categories (Underweight, Normal, Overweight, Obese).
Metabolic Syndrome Score: Combine waist circumference, glucose, HDL, triglycerides,
and blood pressure.
Diabetes Risk Scores: Use established tools like the FINDRISC score as features.
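As an example of the simplest of these derived features, the BMI categories can be computed as below. The cut-offs used are the standard WHO adult thresholds (18.5, 25, 30); the composite scores are not sketched here because their component thresholds depend on the guideline chosen.

```python
def bmi_category(bmi):
    """Map a BMI value (kg/m^2) to a WHO adult category; None stays None."""
    if bmi is None:
        return None
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"

categories = [bmi_category(b) for b in [17.0, 22.4, 27.9, 33.1, None]]
```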
B. Interaction Terms
Create features like:
o Age × BMI (to account for aging-related weight effects).
o HbA1c × Glucose (to detect potential prediabetes).
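These two products can be sketched as follows, assuming dictionary records with hypothetical field names. An interaction is left as `None` when either input is missing, so it can be imputed or flagged like any other feature.

```python
INTERACTION_PAIRS = [("age", "bmi"), ("hba1c", "glucose")]

def add_interactions(row, pairs=INTERACTION_PAIRS):
    """Add `<a>_x_<b>` product features; None if either input is missing."""
    row = dict(row)  # copy so the input row is not modified
    for a, b in pairs:
        va, vb = row.get(a), row.get(b)
        row[f"{a}_x_{b}"] = va * vb if va is not None and vb is not None else None
    return row

patient = {"age": 50, "bmi": 30.0, "hba1c": 6.0, "glucose": 110}
with_interactions = add_interactions(patient)
```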
C. Categorical Groupings
Group rare categories in ethnicity or occupation into “Other”.
Encode ordinal relationships (e.g., exercise frequency: Never < Occasionally < Regularly).
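Both groupings can be sketched with the standard library. The `min_count` cut-off and the sample occupations are illustrative assumptions; the exercise ordering is the one given above.

```python
from collections import Counter

def group_rare(values, min_count=2, other="Other"):
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Ordinal encoding that preserves Never < Occasionally < Regularly.
EXERCISE_ORDER = {"Never": 0, "Occasionally": 1, "Regularly": 2}

occupation = ["Teacher", "Farmer", "Teacher", "Pilot", "Farmer", "Sculptor"]
grouped = group_rare(occupation)

exercise = ["Never", "Regularly", "Occasionally"]
exercise_codes = [EXERCISE_ORDER[v] for v in exercise]
```

In practice the set of rare categories would be determined on the training data and then reused unchanged on validation and test data.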
4. Data Normalization & Encoding
Apply StandardScaler or MinMaxScaler to numerical features.
Use target or frequency encoding for high-cardinality categorical variables (such as zip code
or occupation).
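To make the transformations concrete, the sketch below computes by hand what standard scaling, min-max scaling, and frequency encoding do. It is a standard-library illustration of the arithmetic, not a replacement for the scikit-learn scalers named above, and the sample values are invented. Target encoding is omitted because it needs the outcome variable and careful handling to avoid leakage.

```python
import statistics
from collections import Counter

def standardize(values):
    """Zero mean, unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Rescale to the [0, 1] interval."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def frequency_encode(values):
    """Replace each category with its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

z = standardize([1.0, 2.0, 3.0])
scaled = min_max([10, 20, 30])
zip_freq = frequency_encode(["00100", "00200", "00100", "00100"])
```

In a real pipeline the mean, standard deviation, minimum, maximum, and category frequencies would be estimated on the training set only and then applied to new data.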
5. Feature Selection & Dimensionality Reduction
Use SHAP values or Recursive Feature Elimination (RFE) to identify high-impact
variables.
Consider PCA or autoencoders to reduce noise in high-dimensional clinical data.
6. Real-Time Data Consistency Checks (for Deployment)
Implement backend validation scripts during EMR data entry to ensure:
o Required fields are populated.
o Realistic value ranges are enforced in real time.
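A minimal sketch of such an entry-time check is shown below. The required fields, the field names, and the age range are hypothetical; the BMI range reuses the example threshold from the validation rules. A real EMR integration would depend on the system's own form and API conventions.

```python
# Hypothetical configuration for an entry form.
REQUIRED_FIELDS = ("patient_id", "age", "bmi")
ALLOWED_RANGES = {"age": (0, 120), "bmi": (10, 80)}

def check_entry(entry):
    """Return a list of errors; an empty list means the entry can be saved."""
    errors = []
    # Required fields must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if entry.get(field) in (None, ""):
            errors.append(f"missing required field: {field}")
    # Numeric fields must fall inside the allowed range.
    for field, (low, high) in ALLOWED_RANGES.items():
        value = entry.get(field)
        if isinstance(value, (int, float)) and not (low <= value <= high):
            errors.append(f"{field}={value} outside [{low}, {high}]")
    return errors

ok_errors = check_entry({"patient_id": "P001", "age": 45, "bmi": 27.3})
bad_errors = check_entry({"patient_id": "", "age": 45, "bmi": 120})
```

Returning the full list of errors, rather than stopping at the first one, lets the entry form show every problem to the user at once.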