The Ultimate Guide to Handling Missing Values in Data Preprocessing for Machine Learning


Missing values are a common issue in machine learning. A value is missing when a variable lacks a data point for an observation, leaving the record incomplete and potentially harming the accuracy and reliability of your models. Addressing missing values effectively is essential for robust, unbiased results. In this article, we will see how to handle missing values in datasets for machine learning.

What is a Missing Value?

Missing values are data points that are absent for a specific variable in a dataset. They can be represented in various ways, such as blank cells, null values, or special symbols like “NA” or “unknown.” These missing data points pose a significant challenge in data analysis and can lead to inaccurate or biased results.


Missing values are one of the most common challenges in data preprocessing for machine learning. They can distort analysis, reduce model accuracy, and lead to biased predictions if not handled properly. In this comprehensive guide, we’ll explore:

  • What missing values are and why they occur

  • Different types of missing data (MCAR, MAR, MNAR)

  • Methods to detect missing values in datasets

  • Best techniques to handle missing data (deletion, imputation, interpolation)

  • Practical Python examples using Pandas

  • Impact of missing values on ML models

1. What Are Missing Values?

Missing values occur when some data points in a dataset are absent. They can appear as:

  • Blank cells (NaN, None)

  • Placeholder values (NA, -999, Unknown)

  • Empty strings or zeros (if improperly encoded)

Why Do Missing Values Occur?

  • Data collection errors (sensor failures, human entry mistakes)

  • Privacy concerns (intentionally omitted sensitive data)

  • Non-response in surveys (participants skip questions)

  • Structural issues (merging datasets with mismatched fields)


2. Types of Missing Data

Understanding why data is missing helps determine the best handling strategy:

  • MCAR (Missing Completely at Random): the missingness is random and unrelated to any variable. Example: a sensor randomly fails.

  • MAR (Missing at Random): the missingness depends on other observed variables. Example: women are less likely to disclose their age in a survey.

  • MNAR (Missing Not at Random): the missingness depends on the missing value itself. Example: high-income earners refusing to report their salary.

MNAR is the hardest to handle since the missingness itself carries information.


3. Detecting Missing Values in Python

Pandas provides powerful tools to identify missing data:
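The original code sample did not survive extraction, so here is a minimal sketch using pandas; the toy DataFrame and its column names are illustrative, not from the article:

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame with deliberately missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 31, 45],
    "salary": [50000, 62000, np.nan, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

print(df.isnull().sum())        # per-column missing counts: age 1, salary 2, city 1
print(df.isnull().sum().sum())  # total missing cells: 4
df.info()                       # dtypes plus non-null counts per column
```

`.isnull()` and `.isna()` are aliases; combining either with `.sum()` gives a quick per-column audit before choosing a handling strategy.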



4. Handling Missing Values: Techniques & Python Code

A. Dropping Missing Values

When to use: If missing data is minimal and random (MCAR).

✅ Pros: Simple, maintains data integrity
❌ Cons: Reduces dataset size, may introduce bias
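A minimal sketch of the deletion options pandas offers (the toy data is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 31, 45],
    "salary": [50000, 62000, np.nan, 58000],
})

dropped_rows = df.dropna()               # drop any row containing a missing value
dropped_cols = df.dropna(axis=1)         # drop any column containing a missing value
at_least_one = df.dropna(thresh=1)       # keep rows with at least 1 non-null value
```

`thresh` is the practical middle ground: it lets you discard only rows that are mostly empty instead of rows with a single gap.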

B. Imputation (Filling Missing Values)

1. Mean/Median/Mode Imputation
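A minimal sketch with `fillna`, using median for a numeric column and mode for a categorical one (column names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 31, 45],
    "gender": ["F", "M", None, "M"],
})

df["age"] = df["age"].fillna(df["age"].median())            # numeric -> median (robust to outliers)
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])  # categorical -> most frequent value
```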

2. Forward Fill / Backward Fill
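Forward and backward fill propagate neighboring observations, which suits ordered data such as time series; a minimal sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()   # carry the last valid value forward: 1, 1, 1, 4
backward = s.bfill()  # pull the next valid value backward: 1, 4, 4, 4
```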

3. Advanced Techniques

K-Nearest Neighbors (KNN) Imputation
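A minimal sketch with scikit-learn's `KNNImputer`, which fills each gap from the nearest complete-ish rows (the toy matrix is illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

# Each missing cell is filled with the mean of that feature
# across the 2 nearest neighbors (nan-aware Euclidean distance).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```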

 

Multivariate Imputation (MICE)
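scikit-learn's MICE-style implementation is `IterativeImputer`, which models each feature with missing values as a function of the others; a minimal sketch (note the experimental-import line it still requires):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [np.nan, 40.0]])

# Iteratively regresses each column on the others to fill gaps
mice = IterativeImputer(max_iter=10, random_state=0)
X_filled = mice.fit_transform(X)
```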

 


5. Impact of Missing Values on Machine Learning

  • Deletion: may shrink the dataset substantially, raising overfitting risk

  • Mean imputation: can underestimate variance, biasing predictions

  • KNN/MICE imputation: preserves relationships between variables; better for complex models

Best Practices:
✔ Always visualize missing data (missingno library)
✔ Test multiple imputation methods
✔ Avoid mean imputation for MNAR data


Conclusion

Handling missing values is a critical step in data preprocessing. The best approach depends on:

  • Type of missingness (MCAR, MAR, MNAR)

  • Amount of missing data

  • Model requirements

Key Takeaways:

  • For small MCAR data: Drop missing values

  • For MAR data: Use advanced imputation (KNN, MICE)

  • For MNAR data: Consider domain-specific strategies

By applying these techniques, you’ll build more robust and accurate machine learning models!

Quick Reference: Pandas Functions for Missing Data

  • .isnull(): identifies missing values in a Series or DataFrame; returns True for missing values, False otherwise.

  • .notnull(): checks for non-missing values; returns True where data is present and False where it is missing.

  • .info(): displays a concise summary of a DataFrame, including data types, number of non-null entries, and memory usage.

  • .isna(): equivalent to .isnull(); returns True for missing values and False for non-missing values.

  • .dropna(): drops rows or columns containing missing values, with customizable options (e.g., axis, thresh).

  • .fillna(): fills missing values using a specified method or value (e.g., mean, median, constant).

  • .replace(): replaces specific values (e.g., incorrect or placeholder values) with new ones; useful for data cleaning and standardization.

  • .drop_duplicates(): removes duplicate rows from a DataFrame based on specified columns.

  • .unique(): returns an array of unique values from a Series (or from a single column of a DataFrame).

Real-World Case Study: Handling Missing Values in Healthcare Data

The Problem: Predicting Patient Readmission with Incomplete Records

A hospital wants to predict patient readmission risk using historical EHR (Electronic Health Record) data. However:

  • 12 percent of lab test results are missing

  • 8 percent of patient demographics (age, gender) are incomplete

  • 5 percent of medication history is unrecorded

Challenge: Should we drop patients with missing data? Or impute values?


Step 1: Analyzing Missing Data

We visualize missingness using missingno:
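The original snippet was lost in extraction; as a sketch, the per-column missing percentages can be computed with plain pandas (the EHR-style frame and column names below are hypothetical), with the missingno call shown as a comment since it needs a plotting backend:

```python
import pandas as pd
import numpy as np

# Hypothetical EHR-style frame; columns are illustrative
ehr = pd.DataFrame({
    "age": [65, np.nan, 54, 71, 48],
    "lab_glucose": [np.nan, 110.0, np.nan, 145.0, np.nan],
    "medication": ["statin", None, "insulin", "statin", "statin"],
})

missing_pct = ehr.isnull().mean() * 100  # percent missing per column
print(missing_pct)

# With missingno installed, a nullity matrix is one line:
# import missingno as msno; msno.matrix(ehr)
```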

Findings:

  • Lab results are MNAR (missing because tests weren’t ordered for healthy patients).

  • Age/gender are MCAR (random clerical errors).


Step 2: Handling Missing Values

A. Demographics (MCAR)

Since only 8 percent of demographic records are incomplete and the missingness is random (MCAR), we impute rather than drop patients: median for age, mode for gender.
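With only a small, random fraction of demographics missing, simple imputation is a reasonable sketch (toy values are illustrative):

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({
    "age": [65, np.nan, 54, 71],
    "gender": ["F", "M", None, "M"],
})

demo["age"] = demo["age"].fillna(demo["age"].median())            # median of 65, 54, 71 -> 65
demo["gender"] = demo["gender"].fillna(demo["gender"].mode()[0])  # most frequent -> "M"
```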

B. Lab Results (MNAR)

Missing lab data itself signals healthier patients (MNAR). Instead of imputing a typical value, we encode the missingness as its own feature and fill the gap with a clinically neutral default:
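A minimal sketch of this missingness-indicator approach; the feature name and the default value of 100.0 are illustrative assumptions, not clinical guidance:

```python
import pandas as pd
import numpy as np

labs = pd.DataFrame({"lab_glucose": [np.nan, 110.0, np.nan, 145.0]})

# Encode the missingness itself: was a test ordered at all?
labs["glucose_measured"] = labs["lab_glucose"].notnull().astype(int)

# Fill gaps with an assumed 'normal' default (hypothetical value)
labs["lab_glucose"] = labs["lab_glucose"].fillna(100.0)
```

The model can then learn that an unordered test correlates with healthier patients, which plain imputation would erase.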

C. Medication History (MAR)

Missingness depends on treatment type (e.g., surgery patients lack Rx records). We use MICE imputation:
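As a sketch of that MICE step with scikit-learn's `IterativeImputer` (column names and values are hypothetical; the observed treatment-type flag drives the imputation):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer

# 'is_surgery' is fully observed; 'rx_dose' is missing mostly for surgery patients (MAR)
meds = pd.DataFrame({
    "is_surgery": [0, 0, 1, 1, 0, 1],
    "rx_dose":    [10.0, 12.0, np.nan, np.nan, 11.0, 5.0],
})

mice = IterativeImputer(random_state=0)
meds[["is_surgery", "rx_dose"]] = mice.fit_transform(meds[["is_surgery", "rx_dose"]])
```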


Step 3: Model Performance Comparison

We trained a Random Forest classifier under three scenarios:

  • Drop all missing rows: 78 percent accuracy, 72 percent precision, 65 percent recall

  • Mean imputation: 82 percent accuracy, 75 percent precision, 70 percent recall

  • MNAR-aware imputation: 89 percent accuracy, 84 percent precision, 81 percent recall

Key Insight:

  • Dropping data reduced dataset size, hurting recall.

  • MNAR-aware handling improved precision by 12 percentage points (from 72 percent to 84 percent).


Lessons Learned

  1. Never assume missingness is random—analyze first!

  2. MNAR data requires domain knowledge (e.g., defaulting lab values).

  3. Advanced imputation (MICE/KNN) outperforms mean imputation for MAR data.

Try This Yourself:

  • Download a real dataset (e.g., MIMIC-III)

  • Compare imputation methods using sklearn.impute.
