Sizing every microservice that makes up your data mesh can be a pain. Here are a few tips to help you in this challenging area.
Critical parameters
- CPU Limit
This one is mainly challenged at startup time. On our project, for example, we use Spring Boot as a quality-of-life framework, and that convenience comes at a cost in CPU while the application starts. Thankfully, the CPU limit is a shared resource, not a reserved one.
- CPU Request
For this one, we usually rely on the pre-production environment and measure the average CPU usage over a relevant business period (usually one day). This lets us tailor the request as efficiently as possible.
- Heap Memory (Xms/Xmx)
The maximum heap usage can be anticipated quite well:
Number of sub-topologies in your stream * Number of partitions * Reserved buffer for producers (defaults to 32 MB) + Reserved buffer for consumers (defaults to 50 MB) + Your other usages (the Spring Boot "footprint" in our project is around 50 MB)
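As a rough illustration, the heap formula above can be turned into a few lines of Java. This is only a sketch: the class and method names are hypothetical, the formula is read with the usual operator precedence (the multiplication applies to the producer buffer only), and the example values (2 sub-topologies, 6 partitions) are made up.

```java
// Hypothetical helper, not part of Kafka Streams: it only applies the rule of thumb above.
final class HeapSizing {
    // Defaults quoted in the text, in MB.
    static final long PRODUCER_BUFFER_MB = 32;
    static final long CONSUMER_BUFFER_MB = 50;

    // subTopologies * partitions * producer buffer + consumer buffer + other usages.
    public static long estimateMaxHeapMb(int subTopologies, int partitions, long otherUsagesMb) {
        return (long) subTopologies * partitions * PRODUCER_BUFFER_MB
                + CONSUMER_BUFFER_MB
                + otherUsagesMb;
    }

    public static void main(String[] args) {
        // Assumed example: 2 sub-topologies, 6 partitions, 50 MB of Spring Boot footprint.
        long heapMb = estimateMaxHeapMb(2, 6, 50);
        System.out.println("Estimated max heap: " + heapMb + " MB"); // 2*6*32 + 50 + 50 = 484
    }
}
```

The result is a starting point for Xmx, to be checked against what you actually observe in pre-production.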
- Non-Heap Memory
If your Streams application is fully stateless, there is almost no non-heap usage.
As soon as you use joins, KTable, GlobalKTable or transform with state stores, you must reserve Number of partitions * Number of state stores * RocksDB buffer (defaults to ~100 MB) of memory outside the heap.
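The same kind of back-of-the-envelope calculation works for the off-heap reservation. Again a minimal sketch with hypothetical names, taking the ~100 MB per store and per partition quoted above as a fixed constant, and made-up example values.

```java
// Hypothetical helper: applies partitions * state stores * RocksDB buffer.
final class NonHeapSizing {
    // Approximate RocksDB buffer per state store and per partition, in MB, as quoted in the text.
    static final long ROCKSDB_BUFFER_MB = 100;

    public static long estimateNonHeapMb(int partitions, int stateStores) {
        return (long) partitions * stateStores * ROCKSDB_BUFFER_MB;
    }

    public static void main(String[] args) {
        // Assumed example: 6 partitions and 3 state stores.
        long nonHeapMb = estimateNonHeapMb(6, 3);
        System.out.println("Estimated non-heap memory: " + nonHeapMb + " MB"); // 6*3*100 = 1800
    }
}
```

A fully stateless application has zero state stores, so the estimate drops to 0, which matches the "almost no non-heap usage" case.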
- Disk
Last but not least, disk usage is the most complex one. It ranges from none for a stateless stream to an amount that fully depends on the design of your topology for stateful ones: retention of your windowed joins, size of your KTables and GlobalKTables, purging strategies of your handcrafted state stores.
There are many strategies to reduce this footprint to a minimum, but they depend mainly on your business logic.
That is why it is one of the most important parameters to monitor in a production environment.
External resources
Some references to help you identify the number of state stores and the number of sub-topologies in your topology:
- A topology visualizer, working from a topology description (see below):
https://zz85.github.io/kafka-streams-viz/
- My work-in-progress tool for extracting input, outputs and state store names from a topology description (see below) :
How to get a Topology Description
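Call describe() on your built Topology and print the result. For a simple topology that reads Topic1, applies a map and a filter, and writes to Topic2, the output will be: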
Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [Topic1])
      --> KSTREAM-MAP-0000000001
    Processor: KSTREAM-MAP-0000000001 (stores: [])
      --> KSTREAM-FILTER-0000000002
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-FILTER-0000000002 (stores: [])
      --> KSTREAM-SINK-0000000003
      <-- KSTREAM-MAP-0000000001
    Sink: KSTREAM-SINK-0000000003 (topic: Topic2)
      <-- KSTREAM-FILTER-0000000002
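The number of sub-topologies and state stores used in the sizing formulas can be read off such a description. As a rough sketch, the plain-Java snippet below counts them with two regular expressions. The class and method names are hypothetical and this is not the tool mentioned above. It assumes the textual format shown here and counts every "Sub-topology: N" header, so any special entry sharing that prefix would be counted too.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that scans the text of a topology description.
final class TopologyDescriptionStats {
    private static final Pattern SUB_TOPOLOGY = Pattern.compile("Sub-topology: \\d+");
    private static final Pattern STORES = Pattern.compile("\\(stores: \\[([^\\]]*)\\]\\)");

    // Counts the "Sub-topology: N" headers found in the description.
    public static int countSubTopologies(String description) {
        Matcher matcher = SUB_TOPOLOGY.matcher(description);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Collects the distinct store names listed in "(stores: [...])" sections.
    public static Set<String> stateStoreNames(String description) {
        Set<String> names = new LinkedHashSet<>();
        Matcher matcher = STORES.matcher(description);
        while (matcher.find()) {
            for (String name : matcher.group(1).split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) {
                    names.add(trimmed);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String description = String.join("\n",
                "Topologies:",
                "   Sub-topology: 0",
                "    Source: KSTREAM-SOURCE-0000000000 (topics: [Topic1])",
                "      --> KSTREAM-MAP-0000000001",
                "    Processor: KSTREAM-MAP-0000000001 (stores: [])",
                "      --> KSTREAM-FILTER-0000000002",
                "      <-- KSTREAM-SOURCE-0000000000");
        System.out.println("Sub-topologies: " + countSubTopologies(description));
        System.out.println("State stores: " + stateStoreNames(description));
    }
}
```

For the stateless example above, this reports one sub-topology and no state store, so only the heap formula applies.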