Lattice-based cryptography is widely regarded as the leading approach to protecting digital devices, processes, and data in the post-quantum era, when Cryptographically Relevant Quantum Computers (CRQCs) are expected to break traditional public-key cryptosystems. Compared with traditional elliptic curve cryptography, lattice schemes such as ML-KEM (Kyber) and ML-DSA (Dilithium) use much larger keys and heavier mathematical computations.
Most modern lattice-based schemes perform their operations in polynomial rings, and the bottleneck in these systems is polynomial multiplication. The number theoretic transform (NTT), which reduces the cost of multiplying two degree-n polynomials from O(n²) to O(n log n), is the most critical performance optimisation. We have published a blog post explaining the origins, evolution, and emerging frontiers of NTT.
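To make the idea concrete, here is a minimal Python sketch of NTT-based multiplication in the ring Z_q[x]/(x^n + 1). The parameters (q = 17, n = 4, root of unity 2) are toy values chosen for readability, not those of any standardised scheme, and the function names are illustrative. The schoolbook routine is included only as a cross-check.

```python
Q = 17    # toy modulus (assumption; real schemes use larger primes such as 3329)
N = 4     # toy polynomial degree (assumption)
PSI = 2   # primitive 2N-th root of unity mod Q, since 2**4 % 17 == 16 == -1

def ntt(a, omega):
    """Recursive radix-2 transform: evaluates a at the powers of omega mod Q."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % Q)
    odd = ntt(a[1::2], omega * omega % Q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % Q
        out[k] = (even[k] + t) % Q           # butterfly, upper half
        out[k + n // 2] = (even[k] - t) % Q  # butterfly, lower half
        w = w * omega % Q
    return out

def negacyclic_mul(a, b):
    """Multiply a and b modulo (x^N + 1, Q) using the NTT."""
    omega = PSI * PSI % Q
    # Twist by powers of PSI so a cyclic convolution yields the negacyclic one.
    at = [x * pow(PSI, i, Q) % Q for i, x in enumerate(a)]
    bt = [x * pow(PSI, i, Q) % Q for i, x in enumerate(b)]
    fc = [x * y % Q for x, y in zip(ntt(at, omega), ntt(bt, omega))]
    # Inverse transform: use omega^-1, then scale by N^-1 and untwist.
    ct = ntt(fc, pow(omega, -1, Q))
    n_inv, psi_inv = pow(N, -1, Q), pow(PSI, -1, Q)
    return [x * n_inv * pow(psi_inv, i, Q) % Q for i, x in enumerate(ct)]

def schoolbook_mul(a, b):
    """O(n^2) reference multiplication modulo (x^N + 1, Q)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q  # x^N wraps to -1
    return c
```

Once both operands are in the transform domain, the product costs only n pointwise multiplications; the transforms themselves take on the order of n log n operations.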
An important technique is lazy reduction, which skips the modular reduction after each individual polynomial addition or multiplication. Instead, intermediate results are accumulated in a wider register and reduced once at the end, which saves CPU cycles as long as the accumulator cannot overflow. For ultra-constrained cores such as the ARM Cortex-M0, hybrid modular reduction techniques are also an option. These methods help compensate for the lack of fast multiplication instructions on such cores.
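The arithmetic behind lazy reduction can be sketched in Python as follows. This is an illustration only: Python integers never overflow, so the 32-bit accumulator is modelled with an explicit bound check, and the function names are hypothetical. The modulus 3329 is the ML-KEM value; inputs are assumed to lie in [0, Q).

```python
Q = 3329       # ML-KEM modulus
ACC_BITS = 32  # assumed width of the accumulator register

def dot_eager(a, b):
    """Reduce after every multiply-accumulate: one reduction per term."""
    acc = 0
    for x, y in zip(a, b):
        acc = (acc + x * y) % Q
    return acc

def dot_lazy(a, b):
    """Accumulate unreduced products and reduce once at the end."""
    # Each product is at most (Q - 1)^2, so this many terms fit without overflow.
    max_terms = (2**ACC_BITS - 1) // ((Q - 1) ** 2)
    assert len(a) <= max_terms, "accumulator could overflow; reduce more often"
    acc = 0
    for x, y in zip(a, b):
        acc += x * y           # no reduction inside the loop
    return acc % Q             # single reduction

a = [Q - 1] * 256              # worst-case coefficients
assert dot_lazy(a, a) == dot_eager(a, a)
```

On a real microcontroller the same bound analysis determines how many terms can be summed before a reduction becomes mandatory.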
Another important measure is seed-based key generation. Instead of storing the full expanded private key, which runs to kilobytes (1,632 bytes for ML-KEM-512 and 2,560 bytes for ML-DSA-44), we can store a short cryptographic seed (32 bytes for ML-DSA, 64 bytes for ML-KEM) and derive the expanded private key only when it is needed for signing or decryption. It is also advisable to avoid dynamic memory allocation: embedded systems should use statically pre-allocated buffers to prevent heap fragmentation and keep timing deterministic.
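The store-a-seed, expand-on-demand pattern might look like the sketch below. It is not the ML-KEM or ML-DSA key generation procedure: the expansion is a plain SHAKE-256 call with a made-up domain label, and `expand_key_into` and `wipe` are hypothetical helpers. It only illustrates keeping 32 bytes at rest and filling a single pre-allocated buffer when the key is needed.

```python
import hashlib

SEED_BYTES = 32
EXPANDED_KEY_BYTES = 2560  # ML-DSA-44 private key size, used here only as a buffer size

# One buffer allocated up front and reused for every operation.
KEY_BUF = bytearray(EXPANDED_KEY_BYTES)

def expand_key_into(seed: bytes, buf: bytearray) -> None:
    """Deterministically fill buf from seed (illustrative expansion, not a standard KDF)."""
    if len(seed) != SEED_BYTES:
        raise ValueError("seed must be 32 bytes")
    buf[:] = hashlib.shake_256(b"demo-key-expansion" + seed).digest(len(buf))

def wipe(buf: bytearray) -> None:
    """Overwrite the expanded key once the operation has finished."""
    for i in range(len(buf)):
        buf[i] = 0

seed = bytes(range(SEED_BYTES))  # placeholder; a real seed comes from a TRNG
expand_key_into(seed, KEY_BUF)   # expand just before signing or decryption
# ... use KEY_BUF here ...
wipe(KEY_BUF)                    # only the 32-byte seed remains in storage
```

The trade-off is extra computation on every use in exchange for a much smaller persistent footprint.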
Another crucial aspect is protection against side-channel attacks. Because IoT devices are often physically accessible, they are more exposed to power and electromagnetic analysis. Every operation that touches secret data, particularly the rejection sampling used in signature schemes, should execute in constant time. It is also advisable to apply masking, that is, to split secret variables into multiple randomised shares during the polynomial multiplication steps.
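Splitting a secret into randomised shares works for polynomial multiplication because multiplying by a public polynomial is linear. The Python sketch below shows two additive shares modulo q; the degree N = 8 and all function names are assumptions for illustration. It demonstrates the arithmetic only, since Python itself offers no constant-time or leakage guarantees.

```python
import secrets

Q = 3329  # ML-KEM modulus
N = 8     # toy degree (assumption)

def poly_mul(a, b):
    """Schoolbook multiplication modulo (x^n + 1, Q)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            sign = 1 if k < n else -1  # x^n wraps around to -1
            c[k % n] = (c[k % n] + sign * a[i] * b[j]) % Q
    return c

def split(secret):
    """Split secret into two shares with secret = share0 + share1 (mod Q)."""
    share0 = [secrets.randbelow(Q) for _ in secret]
    share1 = [(s - r) % Q for s, r in zip(secret, share0)]
    return share0, share1

def masked_mul(public, shares):
    """Multiply each share separately; the unmasked secret is never formed."""
    return tuple(poly_mul(public, share) for share in shares)

def recombine(shares):
    return [(x + y) % Q for x, y in zip(*shares)]

secret = [1, Q - 1, 0, 2, 0, 0, Q - 2, 1]  # small secret coefficients
public = [secrets.randbelow(Q) for _ in range(N)]
assert recombine(masked_mul(public, split(secret))) == poly_mul(public, secret)
```

Each share on its own is uniformly random, so a single power or electromagnetic trace of one multiplication reveals nothing about the secret; fresh shares are drawn for every operation.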
In practical deployments, a lattice public key (800 bytes for ML-KEM-512, 1,312 bytes for ML-DSA-44) far exceeds the 127-byte maximum frame size of IEEE 802.15.4, on which Zigbee is built, as well as the small payloads of BLE. A robust packet fragmentation and reassembly implementation is therefore crucial. Lattice schemes also require high-quality randomness for error vector generation, so it is recommended to use a hardware-based True Random Number Generator (TRNG) rather than relying on software-only pseudo-randomness.
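A fragmentation layer for large keys can be sketched as below. The 100-byte payload budget and the 2-byte header (fragment index, fragment count) are assumptions for illustration and do not correspond to any Zigbee or BLE profile; a production design would also need timeouts, retransmission, and integrity checks.

```python
FRAME_PAYLOAD = 100  # assumed usable bytes per frame after link-layer headers
HEADER_BYTES = 2     # hypothetical header: [fragment index, fragment count]

def fragment(data: bytes, payload: int = FRAME_PAYLOAD) -> list:
    """Split data into frames of at most `payload` bytes, header included."""
    if not data:
        raise ValueError("nothing to send")
    chunk = payload - HEADER_BYTES
    total = -(-len(data) // chunk)  # ceiling division
    if total > 255:
        raise ValueError("too many fragments for a 1-byte counter")
    return [bytes([i, total]) + data[i * chunk:(i + 1) * chunk]
            for i in range(total)]

def reassemble(frames: list) -> bytes:
    """Rebuild the original data from frames received in any order."""
    total = frames[0][1]
    slots = [None] * total
    for frame in frames:
        index, count = frame[0], frame[1]
        if count != total or index >= total:
            raise ValueError("inconsistent fragment header")
        slots[index] = frame[HEADER_BYTES:]
    if any(slot is None for slot in slots):
        raise ValueError("missing fragment")
    return b"".join(slots)

public_key = bytes(i % 256 for i in range(1312))  # stand-in for an ML-DSA-44 public key
frames = fragment(public_key)
assert all(len(f) <= FRAME_PAYLOAD for f in frames)
assert reassemble(list(reversed(frames))) == public_key
```

Carrying the index and count in every frame lets the receiver detect loss and tolerate reordering without extra state.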