Introduction

Exception handling is central to building reliable, maintainable software. In data-science pipelines — where you ingest, transform, model, and persist data — errors can arise at many layers: missing configurations, malformed input, download failures, model pickle errors, merge mismatches, or unexpected model types. Relying solely on built-in exceptions often leads to opaque tracebacks and tangled try/except trees that hide the true source of a problem.

Custom exception classes give you:

  • Clarity. Each error carries a clear, descriptive name that pinpoints the failure mode.
  • Context. You can embed relevant metadata (user ID, model key, table name).
  • Hierarchy. You can group related errors under a common base, catching broad categories or drilling down to specifics.
  • Observability. Logs and monitoring tools can filter by exception class.
  • Testability. Unit tests can assert that a specific input raises the correct exception.

In this article, we'll explore how to design a robust exception hierarchy, where and how to raise these errors in a typical data-science pipeline, and best practices for catching, logging, and testing them.

1. Why Custom Exceptions Matter

Imagine you have a function that downloads a CSV from S3, loads it into a pandas DataFrame, merges it with stats from DynamoDB, trains a model, and then persists alerts. If any step fails, you want:

  1. A clear exception name.
  2. A message that includes the resource you were operating on.
  3. The original stack trace for debugging.

Consider this naive code:

try:
    raw_csv = s3_query.download_object(bucket, key)
    df = pd.read_csv(BytesIO(raw_csv))
    stats = dynamo_query.get_data("stats", cid, start, window)
    merged = pd.merge(df, stats, on="cid")
    model = pickle.load(open(model_path, "rb"))
    alert = model.predict(merged)
    iceberg_query.insert_data(alert)
except Exception as e:
    logger.error("Pipeline failed: %s", e)
    return False

Here you lose specificity. A download failure, parse error, merge key error, pickle failure, or database error all collapse into Exception. Your logs say "Pipeline failed: …" but not why.

By contrast, custom exceptions let you write:

try:
    raw_csv = s3_query.download_object(bucket, key)
    if not raw_csv:
        raise DataDownloadError(bucket, key)
    df = pd.read_csv(BytesIO(raw_csv))
except DataDownloadError as dde:
    logger.error("Could not fetch CSV: %s", dde)
    raise  # or handle
except pd.errors.ParserError as pe:
    raise DataValidationError(f"CSV parsing failed for key={key}") from pe

Now each failure mode is self-documenting.

2. Designing an Exception Hierarchy

A sensible hierarchy groups related errors under a common base. For example:

BaseDataPipelineError (inherits Exception)
├─ ConfigurationError
├─ DataLayerError
│  ├─ DataDownloadError
│  ├─ EmptyDataError
│  └─ MergeDataError
├─ ModelError
│  ├─ ModelLoadingError
│  ├─ SerializationError
│  ├─ UnexpectedModelTypeError
│  └─ PredictionError
└─ ProcessingError
   ├─ DataValidationError
   └─ ParameterError

Catch at different levels:

try:
    pipeline()
except ModelError as me:
    alert_ops.notify_dev_team(me)
except DataLayerError as dle:
    logger.warning("Data layer failure, retrying: %s", dle)
except BaseDataPipelineError as bpe:
    logger.error("Pipeline error: %s", bpe)
    raise

This gives you flexibility: broad recovery for data-layer failures, escalation for model failures, and a catch-all at the base.
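
To see the hierarchy in action, here is a minimal self-contained sketch (re-declaring three of the classes so it runs standalone) showing that a handler for the intermediate DataLayerError base also catches its DataDownloadError subclass:

```python
# Minimal sketch of the hierarchy: a handler for the intermediate base
# (DataLayerError) also catches its subclasses.
class BaseDataPipelineError(Exception):
    """Root of all pipeline errors."""

class DataLayerError(BaseDataPipelineError):
    """Base for all data access failures."""

class DataDownloadError(DataLayerError):
    def __init__(self, bucket: str, key: str) -> None:
        super().__init__(f"Failed to download object (bucket='{bucket}', key='{key}').")
        self.bucket = bucket
        self.key = key

try:
    raise DataDownloadError("my-bucket", "data/events.csv")
except DataLayerError as e:
    # The broad category catches the specific subclass.
    caught = type(e).__name__
```

Because every class ultimately inherits from BaseDataPipelineError, a single catch-all at the base is always available as a last resort.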

3. Defining Custom Exception Classes

Create a module errors.py:

# errors.py

from typing import Optional, Any


class BaseDataPipelineError(Exception):
    """Root of all pipeline errors."""
    pass


class ConfigurationError(BaseDataPipelineError):
    """Environment configuration missing or invalid."""
    def __init__(self, message: Optional[str] = None) -> None:
        default = "Missing or invalid environment configuration."
        super().__init__(message or default)


class DataLayerError(BaseDataPipelineError):
    """Base for all data access failures."""
    pass


class DataDownloadError(DataLayerError):
    """
    Raised when an object cannot be downloaded from external storage.
    
    Attributes:
        bucket: S3 bucket name.
        key: Object key.
    """
    def __init__(self, bucket: str, key: str, message: Optional[str] = None) -> None:
        context = f"bucket='{bucket}', key='{key}'"
        default = f"Failed to download object from S3 ({context})."
        super().__init__(message or default)
        self.bucket = bucket
        self.key = key


class EmptyDataError(DataLayerError):
    """Raised when a data source returns no records but at least one was expected."""
    def __init__(self, source: str, message: Optional[str] = None) -> None:
        default = f"No data returned from source: '{source}'."
        super().__init__(message or default)
        self.source = source


class MergeDataError(DataLayerError):
    """
    Raised when merging pandas DataFrames does not produce expected keys.
    
    Attributes:
        left_keys: keys from left DataFrame.
        right_keys: keys from right DataFrame.
    """
    def __init__(self, left_keys: Any, right_keys: Any, message: Optional[str] = None) -> None:
        default = f"Merge failed. Left keys: {left_keys}, Right keys: {right_keys}."
        super().__init__(message or default)
        self.left_keys = left_keys
        self.right_keys = right_keys


class ModelError(BaseDataPipelineError):
    """Base for all model-related failures."""
    pass


class ModelLoadingError(ModelError):
    """Raised when a model file cannot be loaded (e.g. bad pickle)."""
    def __init__(self, model_key: str, message: Optional[str] = None) -> None:
        default = f"Failed to load model from key: '{model_key}'."
        super().__init__(message or default)
        self.model_key = model_key


class SerializationError(ModelError):
    """Raised when serialization or deserialization fails."""
    def __init__(self, obj_desc: str, message: Optional[str] = None) -> None:
        default = f"Serialization error for object: {obj_desc}."
        super().__init__(message or default)
        self.obj_desc = obj_desc


class UnexpectedModelTypeError(ModelError):
    """Raised when encountering a model type that is not supported."""
    def __init__(self, model_type: Any, message: Optional[str] = None) -> None:
        default = f"Unsupported model type: '{model_type}'."
        super().__init__(message or default)
        self.model_type = model_type


class PredictionError(ModelError):
    """Raised when model prediction fails unexpectedly."""
    def __init__(self, model_type: str, user_cid: str, message: Optional[str] = None) -> None:
        default = f"Prediction failed for user '{user_cid}' using model '{model_type}'."
        super().__init__(message or default)
        self.model_type = model_type
        self.user_cid = user_cid


class ProcessingError(BaseDataPipelineError):
    """Base for all data-processing failures."""
    pass


class DataValidationError(ProcessingError):
    """Raised when input data fails validation checks (missing columns, wrong dtypes)."""
    def __init__(self, message: Optional[str] = None) -> None:
        default = "Input data validation failed."
        super().__init__(message or default)


class ParameterError(ProcessingError):
    """Raised when a function receives an argument it cannot handle."""
    def __init__(self, param_name: str, param_value: Any, message: Optional[str] = None) -> None:
        default = f"Invalid parameter '{param_name}': {param_value!r}."
        super().__init__(message or default)
        self.param_name = param_name
        self.param_value = param_value


class TimeZoneError(BaseDataPipelineError):
    """Raised when updating to an invalid or unsupported time zone."""
    def __init__(self, time_zone: str, message: Optional[str] = None) -> None:
        default = f"Invalid time zone provided: '{time_zone}'."
        super().__init__(message or default)
        self.time_zone = time_zone

Each class:

  • Inherits from a logical base.
  • Carries context as attributes.
  • Has a clear docstring.
  • Provides a default message but allows overrides.

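As a quick sanity check of those properties, here is a standalone sketch (re-declaring DataDownloadError so the snippet runs on its own) demonstrating the default message, the caller-supplied override, and the context attributes:

```python
# Standalone re-declaration of DataDownloadError from errors.py.
from typing import Optional

class DataDownloadError(Exception):
    def __init__(self, bucket: str, key: str, message: Optional[str] = None) -> None:
        default = f"Failed to download object from S3 (bucket='{bucket}', key='{key}')."
        super().__init__(message or default)
        self.bucket = bucket
        self.key = key

err = DataDownloadError("ml-bucket", "models/u1.pkl")
assert "ml-bucket" in str(err)        # default message embeds the context
assert err.key == "models/u1.pkl"     # attributes available to handlers

custom = DataDownloadError("ml-bucket", "models/u1.pkl", message="S3 timed out")
assert str(custom) == "S3 timed out"  # caller-supplied message wins
```
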
4. Where to Raise Custom Errors

Below are common failure points in a data pipeline and sample code showing how to incorporate custom exceptions.

4.1 Configuration Loading

# config.py

import os
from errors import ConfigurationError

def get_env_variable(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise ConfigurationError(f"{name} environment variable is required")
    return value

def load_config() -> dict:
    config = {}
    config['REGION'] = get_env_variable("REGION")
    config['STAGE'] = get_env_variable("STAGE")
    # ...
    return config

Fail early if anything is missing rather than letting downstream code break with a KeyError or an AttributeError on None.
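
The fail-early behavior is easy to verify in isolation; this self-contained sketch mirrors the get_env_variable/ConfigurationError shapes above (the variable name used is arbitrary):

```python
# Self-contained sketch of the fail-early configuration pattern.
import os

class ConfigurationError(Exception):
    def __init__(self, message=None):
        super().__init__(message or "Missing or invalid environment configuration.")

def get_env_variable(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise ConfigurationError(f"{name} environment variable is required")
    return value

os.environ["REGION"] = "us-east-1"
assert get_env_variable("REGION") == "us-east-1"

try:
    get_env_variable("DEFINITELY_NOT_SET")
except ConfigurationError as ce:
    error_message = str(ce)  # names the missing variable explicitly
```
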

4.2 Data Download

# in predict_no_activity_probabilities

from errors import DataDownloadError

model_results_csv = s3_query.download_object(bucket, key)
if not model_results_csv:
    raise DataDownloadError(bucket, key)

4.3 Empty Data

# in preprocess_event_data

from errors import EmptyDataError, DataValidationError

if not raw_data:
    raise EmptyDataError(source="raw_event_data")

data = pd.DataFrame(raw_data)
required_cols = {'ts', 'type', 'cid'}
missing = required_cols - set(data.columns)
if missing:
    raise DataValidationError(f"Missing columns in raw_data: {missing}")

4.4 Model Loading

# in predict_no_activity_probabilities

import pickle
from io import BytesIO
from errors import SerializationError

try:
    loaded_model = pickle.load(BytesIO(model_data))
except (pickle.UnpicklingError, EOFError) as e:
    raise SerializationError(obj_desc=f"model for user {user_cid}") from e

4.5 Unexpected Model Type

# determine which prediction routine to run
if best_model_type not in {'Hurdle', 'Tie', 'ZIP', 'Poisson'}:
    raise UnexpectedModelTypeError(best_model_type)

4.6 Prediction

try:
    if best_model_type in ['Hurdle', 'Tie']:
        prob = 1 - loaded_model.predict(user_data)
    elif best_model_type == 'ZIP':
        prob = calculate_zero_probability_zip_model(loaded_model, user_data)
    else:
        prob = calculate_zero_probability_poisson_model(loaded_model, user_data)
except Exception as e:
    # wrap any unexpected failure in PredictionError
    raise PredictionError(best_model_type, user_cid) from e
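
The `raise ... from e` chaining used here is worth seeing end to end: the original error stays reachable via `__cause__`, so the real traceback is never lost. A standalone sketch (with a deliberate ZeroDivisionError standing in for a failing model call):

```python
# Wrap-and-chain pattern: the original exception is preserved as __cause__.
class PredictionError(Exception):
    def __init__(self, model_type: str, user_cid: str) -> None:
        super().__init__(f"Prediction failed for user '{user_cid}' using model '{model_type}'.")
        self.model_type = model_type
        self.user_cid = user_cid

def predict_for_user(user_cid: str) -> float:
    try:
        return 1 / 0  # stand-in for a failing model call
    except Exception as e:
        raise PredictionError("Poisson", user_cid) from e

try:
    predict_for_user("user-42")
except PredictionError as pe:
    cause_name = type(pe.__cause__).__name__  # the original ZeroDivisionError
```
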

4.7 Merging DataFrames

from errors import MergeDataError

merged = pd.merge(left, right, on="cid", how="left", indicator=True)
if (merged['_merge'] == 'left_only').all():
    raise MergeDataError(left_keys=left['cid'].unique(), right_keys=right['cid'].unique())
merged = merged.drop(columns='_merge')

4.8 Time Zone Updates

import os
from errors import TimeZoneError
from dateutil.tz import gettz

tz = gettz(time_zone)
if tz is None:
    raise TimeZoneError(time_zone)
os.environ['TZ'] = time_zone
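
If you prefer the standard library over dateutil, zoneinfo (Python 3.9+, assuming system tz data is installed) supports the same check; TimeZoneError is re-declared here so the sketch runs standalone:

```python
# Stdlib-only variant of the time-zone validation above.
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

class TimeZoneError(Exception):
    def __init__(self, time_zone: str) -> None:
        super().__init__(f"Invalid time zone provided: '{time_zone}'.")
        self.time_zone = time_zone

def validate_time_zone(name: str) -> str:
    try:
        ZoneInfo(name)  # raises if the zone name is unknown or malformed
    except (ZoneInfoNotFoundError, ValueError):
        raise TimeZoneError(name)
    return name
```
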

5. Catching and Handling Exceptions

Organize exception handling by level:

def handler(event, context):
    try:
        main(event)
    except ConfigurationError as ce:
        logger.error("Config issue: %s", ce)
        raise  # cannot proceed
    except DataDownloadError as dde:
        logger.warning("Transient download failure, retrying: %s", dde)
        retry_download(dde.bucket, dde.key)
    except ModelError as me:
        error_monitor.notify(me)
        # maybe skip this user
    except BaseDataPipelineError as bpe:
        logger.error("Pipeline error: %s", bpe)
        raise

  • ConfigurationError: fatal, re-raise.
  • DataDownloadError: transient, retry logic.
  • ModelError: alert devs, skip or fallback.
  • BaseDataPipelineError: catch-all for logging and cleanup.

6. Logging and Observability

When an exception arises, include context in logs:

try:
    do_something()
except DataDownloadError as dde:
    logger.error("Download failed for user=%s, bucket=%s, key=%s",
                 user_cid, dde.bucket, dde.key, exc_info=True)
    raise

  • Use exc_info=True to attach full traceback.
  • Log structured data (user ID, table name, timestamps).
  • Monitor specific exception classes in your observability stack (e.g. Sentry, CloudWatch).
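
The traceback attachment can be verified in-memory by pointing a handler at a StringIO buffer; DataDownloadError is re-declared so this sketch runs standalone:

```python
# Capture log output in-memory to confirm exc_info attaches the traceback.
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("pipeline-demo")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)

class DataDownloadError(Exception):
    def __init__(self, bucket: str, key: str) -> None:
        super().__init__(f"Failed to download (bucket='{bucket}', key='{key}').")
        self.bucket = bucket
        self.key = key

try:
    raise DataDownloadError("ml-bucket", "data/events.csv")
except DataDownloadError as dde:
    # exc_info=True appends the full traceback to the log record
    logger.error("Download failed bucket=%s key=%s", dde.bucket, dde.key, exc_info=True)

log_output = stream.getvalue()  # contains both the message and the traceback
```
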

7. Testing Custom Exceptions

Write unit tests that assert both the exception class and message:

import pytest
from errors import EmptyDataError, DataValidationError

def test_empty_raw_data_raises():
    with pytest.raises(EmptyDataError) as exc:
        preprocess_event_data([], stats_data)
    assert "raw_event_data" in str(exc.value)

def test_missing_columns_raises():
    raw = [{"ts": 1234567890}]  # missing 'type' and 'cid'
    with pytest.raises(DataValidationError) as exc:
        preprocess_event_data(raw, stats_data)
    assert "Missing columns" in str(exc.value)

  • Use pytest.raises to catch the specific exception.
  • Inspect exc.value attributes for contextual fields.
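
The same attribute inspection works without pytest, which can be handy in a quick script; here is a standalone sketch re-declaring the relevant class and a minimal preprocess function:

```python
# pytest-free check: catch the exception directly and inspect its attributes.
class EmptyDataError(Exception):
    def __init__(self, source: str) -> None:
        super().__init__(f"No data returned from source: '{source}'.")
        self.source = source

def preprocess_event_data(raw_data: list) -> list:
    if not raw_data:
        raise EmptyDataError(source="raw_event_data")
    return raw_data

try:
    preprocess_event_data([])
except EmptyDataError as exc:
    captured = exc

assert captured.source == "raw_event_data"
assert "raw_event_data" in str(captured)
```
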

8. Best Practices and Pitfalls

  1. Don't over-customize. Only create an exception when it adds clarity.
  2. Keep hierarchy shallow. 3–4 levels are usually enough.
  3. Include context. Store attributes (e.g. bucket, model_key) for logging and downstream handling.
  4. Avoid catching bare Exception. Catch only what you expect.
  5. Propagate the original error. Use raise ... from to preserve traceback.
  6. Document your exceptions. In docstrings and in a central errors.py.

9. Real-World Example: No-Activity Alert Pipeline

Putting it all together in a snippet:

# handler.py

import os
import pickle
import pandas as pd
from io import BytesIO

from config import load_config
from errors import (
    BaseDataPipelineError, ConfigurationError, DataDownloadError,
    DataValidationError, EmptyDataError, SerializationError,
    MergeDataError, PredictionError
)

CONFIG = load_config()

def preprocess(raw_data: list, stats_data: list) -> pd.DataFrame:
    if not raw_data:
        raise EmptyDataError(source="raw_event_data")
    df = pd.DataFrame(raw_data)
    required = {"ts", "cid", "type"}
    missing = required - set(df.columns)
    if missing:
        raise DataValidationError(f"Missing columns: {missing}")
    # … further processing …
    return df

def predict(df: pd.DataFrame, cid: str) -> float:
    key = f"models/{cid}.pkl"
    raw = s3_query.download_object(CONFIG['ML_BUCKET'], key)
    if not raw:
        raise DataDownloadError(CONFIG['ML_BUCKET'], key)
    try:
        model = pickle.load(BytesIO(raw))
    except Exception as e:
        raise SerializationError(obj_desc=f"model for {cid}") from e

    try:
        return model.predict(df)[0]
    except Exception as e:
        raise PredictionError(model_type=type(model).__name__, user_cid=cid) from e

def handler(event, context):
    for record in event['Records']:
        user = record['body']
        try:
            raw = get_raw_events(user)
            stats = get_stats(user)
            df = preprocess(raw, stats)
            prob = predict(df, user)
            if prob < 0.1:
                send_alert(user, prob)
        except EmptyDataError:
            continue  # nothing to do
        except ConfigurationError:
            raise
        except DataDownloadError as dde:
            retry_later(dde.bucket, dde.key)
        except BaseDataPipelineError as bpe:
            logger.error("Pipeline error for user=%s: %s", user, bpe)
            notify_dev(bpe)

Conclusion

Custom exceptions transform messy, opaque error flows into a self-documenting, testable, and observable architecture. By:

  1. Designing a clear hierarchy.
  2. Defining classes with context.
  3. Raising them at well-defined failure points.
  4. Catching and handling them with intent.
  5. Logging structured, detailed messages.
  6. Testing for specific error types.

you'll build data-science pipelines that are easier to debug, extend, and operate. Next time you encounter a silent failure or tangled traceback, reach for a well-named custom exception instead. Your future self (and your team) will thank you.

Happy coding!