A feature platform is a critical component of machine learning (ML) systems: it provides feature registration, online feature serving, offline feature serving, and a feature store to manage and serve feature data efficiently. Using the real-time lakehouse stack of Apache Flink, Apache Paimon, and Apache Iceberg, together with high-performance query engines such as Apache Doris, StarRocks, or Presto, we can build a robust feature platform that supports both real-time and batch ML workloads.

Features of this setup:

  • Streaming-native with batch fallback
  • Unified compute path, multiple materializations
  • Fresh, fast features online with deep, consistent history offline
  • Zero vendor lock-in and high flexibility

Components

  • Feature Registration — A system to define, version, and manage feature definitions.
  • Feature Store — A centralized repository to store and manage feature data, supporting both online and offline use cases.
  • Online Feature Serving — Low-latency feature retrieval for real-time ML inference (e.g., recommendation systems).
  • Offline Feature Serving — High-throughput feature retrieval for training ML models or batch inference.

Feature Registration (Central Registry)

  • Metadata about features (name, data type, freshness, owner, tags, lineage)
  • Tracks what features are materialized where (Paimon, Iceberg, etc.)
  • Can be built as a lightweight service (metadata DB + REST API)
  • Or integrated with an open-source framework like Feast (Tecton is a managed, commercial alternative), backed by a metadata DB
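The "metadata DB + REST API" option can be sketched as a minimal in-process registry. This is an illustrative assumption, not Feast's or Tecton's actual API; the field names mirror the metadata listed above (name, type, freshness, owner, tags, materialization targets):

```python
from dataclasses import dataclass, field

# Hypothetical feature-registry sketch (names are illustrative, not a product API).
@dataclass
class FeatureDefinition:
    name: str
    dtype: str
    owner: str
    freshness_sla: str                                   # e.g. "5m" for near-real-time
    tags: list = field(default_factory=list)
    materialized_in: list = field(default_factory=list)  # e.g. ["paimon", "iceberg"]

class FeatureRegistry:
    def __init__(self):
        self._defs = {}

    def register(self, fd: FeatureDefinition, version: int = 1) -> str:
        key = f"{fd.name}:v{version}"
        self._defs[key] = fd
        return key

    def lookup(self, name: str, version: int = 1) -> FeatureDefinition:
        return self._defs[f"{name}:v{version}"]

registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="user_click_rate", dtype="float", owner="ranking-team",
    freshness_sla="5m", tags=["user"], materialized_in=["paimon", "iceberg"]))
print(registry.lookup("user_click_rate").materialized_in)  # ['paimon', 'iceberg']
```

In a real deployment the dict would be a database table and `register`/`lookup` would sit behind REST endpoints, with versioning enforcing immutable feature definitions.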

Feature Store

A centralized repository to store and manage feature data, split into an online store (Paimon) and an offline store (Iceberg) to cover both use cases.

Online Feature Store (Paimon)

Why Paimon? Paimon supports streaming writes and low-latency reads, making it ideal for online feature serving. It handles incremental updates and changelog-based processing, perfect for real-time feature updates.

  • Write feature vectors as Flink changelogs
  • Can integrate with Flink SQL or Table API

Supports:

  • Primary-key-based upserts
  • Time-versioned lookups
  • Low-latency serving via key-based retrieval
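The semantics of primary-key upserts with time-versioned lookups can be illustrated with a small in-memory model. This is a sketch of the behavior, not Paimon's actual storage engine:

```python
import bisect

# Illustrative model of primary-key upserts + time-versioned lookups (not Paimon internals).
class VersionedKVStore:
    def __init__(self):
        self._rows = {}  # key -> list of (ts, value), kept sorted by ts

    def upsert(self, key, value, ts):
        self._rows.setdefault(key, []).append((ts, value))
        self._rows[key].sort(key=lambda p: p[0])

    def get_latest(self, key):
        return self._rows[key][-1][1]

    def get_as_of(self, key, ts):
        # Latest version at or before ts (time-versioned lookup).
        versions = self._rows[key]
        ts_list = [t for t, _ in versions]
        i = bisect.bisect_right(ts_list, ts)
        return versions[i - 1][1] if i else None

store = VersionedKVStore()
store.upsert("user:42", {"click_rate": 0.10}, ts=100)
store.upsert("user:42", {"click_rate": 0.12}, ts=200)
print(store.get_latest("user:42"))      # {'click_rate': 0.12}
print(store.get_as_of("user:42", 150))  # {'click_rate': 0.1}
```

In the real system, Flink changelog writes play the role of `upsert`, and the serving layer performs the key-based `get_latest` reads.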

Offline Feature Store (Iceberg)

Why Iceberg? Iceberg's ACID transactions, schema evolution, and partitioning make it ideal for large-scale, historical feature data used in model training.

Store snapshots of features for:

  • Model training
  • Batch scoring
  • Backfills & re-computation

Supports:

  • Time-travel (for training/inference consistency)
  • Schema evolution
  • Large-scale batch analytics

Pipeline:

  • Flink or Spark aggregates Paimon data (e.g., daily or hourly aggregates) and writes to Iceberg.
  • Iceberg tables are partitioned by date or user_id for efficient querying.
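The aggregation step above can be mimicked in plain Python: roll raw events up into daily per-user aggregates and group the rows by date partition, the way an Iceberg writer would lay them out. This is a sketch of the dataflow with made-up data, not real Flink or Iceberg code:

```python
from collections import defaultdict
from datetime import datetime, timezone

# Raw click events as they might arrive from a stream (illustrative data).
events = [
    {"user_id": "u1", "ts": 0,     "clicks": 3},
    {"user_id": "u1", "ts": 3600,  "clicks": 2},
    {"user_id": "u2", "ts": 90000, "clicks": 7},
]

def daily_aggregates(events):
    """Aggregate clicks per (date, user_id) -- the shape of a daily feature snapshot."""
    agg = defaultdict(int)
    for e in events:
        date = datetime.fromtimestamp(e["ts"], tz=timezone.utc).date().isoformat()
        agg[(date, e["user_id"])] += e["clicks"]
    # Group rows by date, mirroring Iceberg's date partitioning.
    partitions = defaultdict(list)
    for (date, user_id), clicks in agg.items():
        partitions[date].append({"user_id": user_id, "daily_clicks": clicks})
    return dict(partitions)

print(daily_aggregates(events))
```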

Online Feature Serving

For online feature serving, low-latency access is critical. Paimon serves as the primary store, with Redis as an optional in-memory cache for ultra-low-latency use cases (e.g., <10 ms).

Paimon for Online Serving

  • Paimon tables store precomputed features (e.g., user_click_rate, last_purchase_amount).
  • Flink continuously updates Paimon tables with fresh data from event streams.
  • A serving layer (e.g., a REST API or gRPC service) queries Paimon for features using key-value lookups.

Redis for Ultra-Low Latency (Optional)

  • For ultra-low-latency use cases, Redis can cache hot features in memory.
  • Flink writes to Redis for ephemeral features with a TTL, which are then flushed to Paimon for persistence.
  • Example: A bidding system retrieves user_bid_score from Redis.
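The Redis-with-TTL pattern can be sketched with a tiny in-process cache. A real deployment would use an actual Redis client (e.g. `SET` with an expiry), but the expiry logic is the same; the feature key below is illustrative:

```python
import time

# Sketch of a TTL cache for hot/ephemeral features (stand-in for Redis with expiry).
class TTLFeatureCache:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: a real system falls back to Paimon here
            return default
        return value

# Deterministic demo using a fake clock instead of wall time.
now = [0.0]
cache = TTLFeatureCache(clock=lambda: now[0])
cache.set("user_bid_score:u42", 0.87, ttl_seconds=30)
print(cache.get("user_bid_score:u42"))  # 0.87
now[0] = 31.0
print(cache.get("user_bid_score:u42"))  # None (expired)
```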

Implementation:

  • Deploy a Flink job to compute and update features in Paimon (and optionally Redis).
  • Use a lightweight API server (e.g., FastAPI, gRPC) to serve features from Paimon/Redis to ML models.
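The serving layer can be prototyped with the standard library alone; FastAPI or gRPC would replace this in production, and the Paimon lookup is faked here with an in-memory dict. The `/features/<entity_id>` path shape is an assumption for this sketch:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Stand-in for key-value feature lookups against Paimon (illustrative data).
ONLINE_FEATURES = {
    "u42": {"user_click_rate": 0.12, "last_purchase_amount": 59.90},
}

class FeatureHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Assumed path shape for this sketch: /features/<entity_id>
        _, _, entity_id = self.path.rpartition("/")
        features = ONLINE_FEATURES.get(entity_id)
        self.send_response(200 if features is not None else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(features or {}).encode())

    def log_message(self, *args):  # silence request logging for the demo
        pass

def start_server():
    # Port 0 lets the OS pick a free port.
    server = HTTPServer(("127.0.0.1", 0), FeatureHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def fetch(port, entity_id):
    with urlopen(f"http://127.0.0.1:{port}/features/{entity_id}") as resp:
        return json.loads(resp.read())

server = start_server()
print(fetch(server.server_address[1], "u42"))
server.shutdown()
```

Swapping the dict for a Paimon key-based read (and adding Redis as a read-through cache) turns this into the serving path described above.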

Fast Offline Serving: Doris / StarRocks / Presto

  • Query feature tables via external table connectors (Iceberg/Paimon)
  • Vectorized OLAP for: Batch inference, Monitoring dashboards, Training set exploration
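In production the query below would run in Doris/StarRocks/Presto against an Iceberg external table; sqlite3 stands in here only to show the shape of a typical offline feature query (table and column names are illustrative):

```python
import sqlite3

# In-memory table playing the role of a date-partitioned Iceberg feature table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_features_daily (
        dt TEXT, user_id TEXT, daily_clicks INTEGER, click_rate REAL
    )
""")
conn.executemany(
    "INSERT INTO user_features_daily VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01", "u1", 5, 0.10),
        ("2024-01-01", "u2", 7, 0.20),
        ("2024-01-02", "u1", 3, 0.08),
    ],
)

# Typical offline query: training-set exploration over a date range.
rows = conn.execute("""
    SELECT user_id, SUM(daily_clicks) AS clicks, AVG(click_rate) AS avg_rate
    FROM user_features_daily
    WHERE dt BETWEEN '2024-01-01' AND '2024-01-02'
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()
print(rows)
```

The date-range predicate is what partition pruning accelerates in the real stack: an engine reading Iceberg skips every partition outside the `dt` range.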

Materialization Paths

  • Online path: Flink computes features from event streams and upserts them into Paimon for key-based serving.
  • Offline path: Flink or Spark aggregates Paimon data and writes partitioned snapshots to Iceberg.
  • Analytics path: Doris/StarRocks/Presto query Paimon and Iceberg via external table connectors.

Feature Consistency Strategy

  • Unified compute path: the same Flink logic materializes features to both Paimon (online) and Iceberg (offline), reducing train/serve skew.
  • Point-in-time correctness: Iceberg time-travel lets training jobs read features as they existed at event time.
  • Freshness tracking: the central registry records where each feature is materialized and how fresh it is.
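Point-in-time correctness for training can be made concrete with a small as-of join: each training event is matched only to the latest feature value at or before its timestamp, which is what Iceberg time-travel enables at scale. A sketch with illustrative data:

```python
import bisect

# Feature history per user as (ts, value) pairs sorted by ts -- think Iceberg snapshots.
feature_history = {
    "u1": [(100, 0.10), (200, 0.15), (300, 0.20)],
}

def as_of(history, ts):
    """Latest feature value at or before ts; None if the feature did not exist yet."""
    ts_list = [t for t, _ in history]
    i = bisect.bisect_right(ts_list, ts)
    return history[i - 1][1] if i else None

# Training events must see only features available at event time (no leakage).
training_events = [("u1", 150), ("u1", 300), ("u1", 50)]
rows = [(uid, ts, as_of(feature_history[uid], ts)) for uid, ts in training_events]
print(rows)  # [('u1', 150, 0.1), ('u1', 300, 0.2), ('u1', 50, None)]
```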

Optional Enhancements

  • Feature Lineage UI → Show data flows and freshness across stores
  • Feature Testing Framework → Alert on drift/null spikes/value range anomalies
  • Monitoring Dashboard → Doris/StarRocks over Iceberg for model observability