Imagine you're building a system that analyzes sensor data streaming in at 100GB per hour. Traditional methods might buckle under the pressure, but there's a hidden gem named Bloom Filters that can help you process data in real time with surprising efficiency. In this article, we'll explore how to use Bloom Filters in Redis to tackle this high-load challenge.
Why Bloom Filters?
Bloom Filters are probabilistic data structures that offer space-efficient membership testing: you can check whether an item might be present in a dataset without storing the dataset itself. The trade-off is a small chance of false positives (reporting an item as present when it isn't); false negatives are impossible. For many high-throughput scenarios, that's a worthwhile compromise.
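To make the space savings concrete, here is a quick calculation using the standard Bloom Filter sizing formulas (this sketch is my own illustration, not part of RedisBloom; the function name bloomSize is made up for the example):

```javascript
// Standard Bloom Filter sizing formulas:
//   bits needed:     m = -n * ln(p) / (ln 2)^2
//   hash functions:  k = (m / n) * ln 2
// where n = expected number of items and p = target false positive rate.
function bloomSize(n, p) {
  const m = Math.ceil((-n * Math.log(p)) / Math.log(2) ** 2); // total bits
  const k = Math.round((m / n) * Math.log(2));                // hash count
  return { bits: m, bytes: Math.ceil(m / 8), hashes: k };
}

// 1 billion IDs at a 1% false positive rate:
const { bytes, hashes } = bloomSize(1e9, 0.01);
console.log((bytes / 1024 ** 3).toFixed(2), 'GiB,', hashes, 'hashes');
// → 1.12 GiB, 7 hashes
```

Roughly a gigabyte for a billion entries, regardless of how long the IDs themselves are; storing the raw IDs would take an order of magnitude more.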
Our Scenario:
Suppose you have a network of sensors sending real-time readings, generating 100GB of data hourly. You need to identify unique sensor IDs quickly to trigger certain actions. Storing all IDs in memory is expensive and slow. Bloom Filters to the rescue!
Using RedisBloom:
RedisBloom is an extension that adds Bloom Filter functionality to Redis, a popular in-memory data store. Here's how we can use it:
1. Install and start the Redis database via Docker:
docker run -p 6379:6379 -it --rm redis/redis-stack-server:latest

2. Establish a connection to your Redis server using the node-redis npm package, and create a Bloom Filter named sensor_ids with a capacity suitable for your expected number of unique IDs.
import { createClient } from 'redis';
const client = createClient();
await client.connect();
const bloomFilterName = 'sensor_ids';
const filterCapacity = 1000000000; // 1 billion
await client.bf.reserve(bloomFilterName, 0.01, filterCapacity);

3. As sensor data arrives, add each unique ID to the Bloom Filter using the bf.add command. For large datasets, consider batching or splitting the work into smaller chunks. Here we simulate a large input of 1 billion rows. The capacity you reserve is important for accuracy: the larger the capacity, the more memory the filter occupies, but the lower its false positive rate once it fills up.
let promisesArr = [];
for (let i = 0; i < filterCapacity; i++) {
promisesArr.push(client.bf.add(bloomFilterName, 'id' + i));
if (i % 100000 === 0) {
await Promise.all(promisesArr);
promisesArr = [];
}
}
// Flush any adds still pending after the last full batch.
await Promise.all(promisesArr);

4. When you need to verify whether an ID is new, use the bf.exists command. It returns true or false: if false, the ID is definitely new; if true, it has probably been seen before (with a small chance of a false positive).
const existValue = 'id1';
const notExistValue = 'id1000000001';
const isExists = await client.bf.exists(bloomFilterName, existValue);
const notExists = await client.bf.exists(bloomFilterName, notExistValue);

5. Keep an eye on your Bloom Filter's size and false positive rate using the bf.info command. You might need to adjust its parameters or consider alternative approaches for extremely large datasets.
const info = await client.bf.info(bloomFilterName);
// info looks something like this (example values from a small filter):
//
// {
// capacity: 1000,
// size: 1531,
// numberOfFilters: 1,
// numberOfInsertedItems: 12,
// expansionRate: 2
// }

Remember:
- Bloom Filters offer a trade-off: space efficiency for occasional false positives. Evaluate if this suits your tolerance for inaccuracies.
- Consider alternative approaches (e.g., partitioned Bloom Filters) for extremely large datasets or stricter accuracy requirements.
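To see the false positive trade-off in action without a Redis server, here is a toy in-memory Bloom Filter (an illustrative sketch of the technique, not RedisBloom's actual implementation; the class name and the FNV-1a/djb2 double-hashing scheme are my own choices for brevity):

```javascript
// A toy Bloom Filter: a bit array plus k index functions derived from
// two base hashes via double hashing (index_s = (h1 + s * h2) mod m).
class ToyBloomFilter {
  constructor(bits, hashes) {
    this.bits = bits;
    this.hashes = hashes;
    this.bytes = new Uint8Array(Math.ceil(bits / 8));
  }
  static fnv1a(s) {
    let h = 2166136261;
    for (let i = 0; i < s.length; i++) {
      h ^= s.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return h >>> 0;
  }
  static djb2(s) {
    let h = 5381;
    for (let i = 0; i < s.length; i++) {
      h = (Math.imul(h, 33) + s.charCodeAt(i)) | 0;
    }
    return h >>> 0;
  }
  indexes(item) {
    const h1 = ToyBloomFilter.fnv1a(item);
    const h2 = ToyBloomFilter.djb2(item) | 1; // force odd so indexes vary
    const out = [];
    for (let s = 0; s < this.hashes; s++) out.push((h1 + s * h2) % this.bits);
    return out;
  }
  add(item) {
    for (const i of this.indexes(item)) this.bytes[i >> 3] |= 1 << (i & 7);
  }
  mightContain(item) {
    return this.indexes(item).every(i => this.bytes[i >> 3] & (1 << (i & 7)));
  }
}

// ~9.6 bits per item and 7 hashes target a ~1% false positive rate.
const filter = new ToyBloomFilter(96000, 7);
for (let i = 0; i < 10000; i++) filter.add('id' + i);

// No false negatives: every inserted ID is always found.
console.log(filter.mightContain('id42')); // true

// Never-inserted IDs are usually rejected, but a few false positives slip through.
let falsePositives = 0;
for (let i = 10000; i < 20000; i++) {
  if (filter.mightContain('id' + i)) falsePositives++;
}
console.log('observed false positive rate:', falsePositives / 10000);
```

Running this shows the observed false positive rate landing near the ~1% the sizing targets, which is the same behavior the error-rate argument to bf.reserve controls in RedisBloom.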
The full source code:
// Run the Redis database:
// docker run -p 6379:6379 -it --rm redis/redis-stack-server:latest
import { createClient } from 'redis';
const client = createClient();
await client.connect();
const bloomFilterName = 'sensor_ids';
const filterCapacity = 1000000000; // 1 billion
// Delete any pre-existing Bloom Filter.
await client.del(bloomFilterName);
try {
await client.bf.reserve(bloomFilterName, 0.01, filterCapacity);
console.log('Reserved Bloom Filter.');
} catch (e) {
if (e.message.endsWith('item exists')) {
console.log('Bloom Filter already reserved.');
} else {
console.log('Error (is RedisBloom installed?):');
console.log(e);
}
}
let promisesArr = [];
for (let i = 0; i < filterCapacity; i++) {
promisesArr.push(client.bf.add(bloomFilterName, 'item' + i));
if (i % 100000 === 0) {
await Promise.all(promisesArr);
promisesArr = [];
}
if (i > 0 && i % 10000000 === 0) {
console.log(`Added ${i} members to Bloom Filter.`);
}
}
// Flush any adds still pending after the last full batch.
await Promise.all(promisesArr);
const existValue = 'item1';
const notExistValue = 'item1000000001';
const mayExist = await client.bf.exists(bloomFilterName, existValue);
const notExists = await client.bf.exists(bloomFilterName, notExistValue);
console.log(`${existValue} ${mayExist ? 'may' : 'does NOT'} exist`);
console.log(`${notExistValue} ${notExists ? 'may' : 'does NOT'} exist`);
const info = await client.bf.info(bloomFilterName);
// info looks something like this (example values from a small filter):
//
// {
// capacity: 1000,
// size: 1531,
// numberOfFilters: 1,
// numberOfInsertedItems: 12,
// expansionRate: 2
// }
await client.quit();

By understanding Bloom Filters and using RedisBloom, you can empower your high-load projects to handle massive data streams efficiently and in real time. And there we have it. You can find the full source code on my GitHub page. I strongly recommend looking at the official documentation for additional information.
Thank you for reading. If you have any questions you can find me on LinkedIn. Feel free to leave comments. You can also read my previous articles: