Lessons learnt ingesting historical data into Azure Data Explorer and backdating the creation time to respect caching and retention policies.
What is Azure Data Explorer?
Also known as "ADX" or "Kusto", Azure Data Explorer is a fully managed, high-performance, big data analytics platform that makes it easy to analyze high volumes of data in near real time over structured, semi-structured, and unstructured data. The Azure Data Explorer toolbox gives you an end-to-end solution for data ingestion, query, visualization, and management.
Why should we treat historical data differently?
First off, dealing with historical data is common, but because ADX is designed as an append-only database, we need to take care to maintain healthy cache utilization. It is therefore important to understand what happens when data is ingested.
Upon ingesting data (indicated in blue blocks below) the creation time is captured and depending on the caching policy, this data is retained in the ADX hot data cache (shown in the pink box below) for a specified number of days, weeks, months, years or, in fact, an unlimited time.
After this time, the data is evicted from the fast SSDs of the hot data cache and moves into the cold data cache, where it remains queryable until the end of the total data retention period (shown in the grey and pink boxes combined).

Once the ingested data's creation time is older than the retention period, it falls off (i.e. is "evicted" from) the database entirely and is thus no longer queryable, meaning it must be re-ingested to be queried again. This is useful for age-based data (think: diagnostic events) that may become uninteresting after one month.
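For example, the caching and retention windows described above can be inspected and adjusted per table with Kusto management commands. The sketch below assumes a hypothetical table named MeterReadings and illustrative time spans; adjust both to your own needs:

```kusto
// Show the policies currently in effect for a table
.show table MeterReadings policy caching
.show table MeterReadings policy retention

// Keep data on the fast SSDs of the hot data cache for 31 days...
.alter table MeterReadings policy caching hot = 31d

// ...and retain it (hot + cold combined) for a total of 365 days
.alter-merge table MeterReadings policy retention softdelete = 365d
```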
Considerations: Backfilling Historical Data
In short: We can make our data appear as if it was ingested last year by setting the CreationTime at ingestion. This data will then be evicted from the cache relative to its creation time, in accordance with the caching and retention policies set on the table or database.
Example 1: With a total data retention period of 7 years, and a hot cache of 5 years, a new record ingested today resides in the hot cache for 5 years before it is evicted into the cold store, where it remains for a further 2 years (5 years hot + 2 years cold = 7 years total retention).
Example 2: Now, let's think about ingesting a record that was created 2 years ago (in 2020). If we ingest it today, we want to ensure it is backdated to fall off the hot cache in 3 years, not in 5 (given the data is already 2 years old).
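To make Example 2 concrete, here is a small KQL sketch (the dates are illustrative): because eviction is computed from the creation time, a backdated record created in mid-2020 spends only the remaining ~3 years of a 5-year hot cache window on SSD:

```kusto
// Hot cache and retention windows are measured from the extent's creation time.
print createdOn   = datetime(2020-07-01),
      hotUntil    = datetime(2020-07-01) + 5 * 365d,  // leaves the hot cache ~2025
      retainUntil = datetime(2020-07-01) + 7 * 365d   // evicted from ADX ~2027
```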
However, when we query the ingested data and view its ingestion_time() using a KQL query like this:
[Table] | project column1, column2, ingestion_time()
We notice something strange: the ingestion time does not appear to have been altered in accordance with the CreationTime we defined. That is because:
⚠️ Ingestion Time is not the same as the Extent Creation Time!

What's the difference?
- The ingestion_time() is the UTC datetime at which the record was ingested, as measured by the ADX service. It can't be overridden, regardless of which client API or ingestion pipeline you're using.
- Overriding creationTime allows you to make sure the retention and caching policies (which are measured from the time of the data shard's creation, which you can override) are applied accordingly.
Source: Stack Overflow
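For instance, when ingesting directly with a KQL control command (a one-off sketch; the table name and blob URL are placeholders, and a private blob would additionally need a SAS token), the creationTime ingestion property is what backdates the resulting extent, while ingestion_time() will still report the actual moment of ingestion:

```kusto
.ingest into table MeterReadings (
    'https://storageaccount.blob.core.windows.net/meterdata/dailymeterdata20200701.json'
) with (
    format = 'multijson',
    creationTime = '2020-07-01T00:00:00Z'  // backdates the extent's creation time
)
```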
What You'll Need to Know

Extents (Data Shards)
Azure Data Explorer uses extents (aka "data shards"). Extents hold a collection of records that are physically arranged in columns, a technique known as columnar store, which enables efficient encoding and compression of the data, making it suitable for a huge number of records (rows) and large amounts of data.
Extents are immutable and can never be modified. They can only be queried, reassigned to a different node, or dropped out of the table. Data modification happens by creating one or more new extents and transactionally swapping old extents with new ones.
Each extent contains data and metadata. Among various optional tags, one of the enforced metadata items is the Extent Creation Time.
👉🏻 Pay attention to Extent Creation Time
Extent Creation Time is important for 3 scenarios:
- Retention — Extents that were created earlier will be dropped earlier
- Caching — Extents that were created recently will be kept in the hot cache
- Sampling — Recent extents are favored when using query operations such as take
In fact, Kusto tracks two datetime values per extent: MinCreatedOn and MaxCreatedOn. Initially, the two values are the same. When the extent is merged with other extents, the new values are the minimum and maximum, respectively, of the original values across the merged extents.
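You can inspect these values yourself with a management command (MeterReadings is a placeholder table name):

```kusto
// Lists the extents backing the table; the output includes the
// MinCreatedOn and MaxCreatedOn columns described above.
.show table MeterReadings extents
```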

Normally, an extent's creation time is set according to the time in which the data in the extent is ingested. However, you can optionally overwrite the extent's creation time, by providing an alternative creation time in the ingestion properties.
✨ Lesson: Overwriting is useful for retention purposes, for example if we want to re-ingest data without it appearing as if it arrived late.
This means the retention policy controls the mechanism that automatically removes data from tables and materialized views (configurable at the table or database level).
Source: Azure Data Explorer | Microsoft Docs
How-To: Ingesting Historical Data into ADX
This section focuses on a hypothetical (but common) scenario of historical data ingestion into Azure Data Explorer.
"My organization needs to ingest a backlog of electricity meter data into Azure Data Explorer. This will be a one-time ingestion of historical data located in an Azure Data Lake Storage container. Additionally, any newly generated data should be ingested in near real time into the cluster via Event Grid." — Data Engineer
✨ Learning: Use LightIngest to make it easy! Use Event Grid or other stream-like connectors for newly generated data.
For the purpose of this article, the focus will remain on dealing with existing data via LightIngest. However, it is possible (using the kustoCreationTime metadata field described below) to tag all files appropriately and drop them in a landing zone where they appear to Event Grid as newly generated data. As the data has been correctly tagged, ADX will then backdate it.
✨ Lesson: The downside of this approach is the lack of observability into the process. Therefore, LightIngest is preferable.
LightIngest
LightIngest is a command-line utility that works great for processing a huge backlog of historical data. It also provides a -listOnly:true flag that can be used to "smoke test" the ingestion without actually ingesting any data.

3 Options for Specifying the CreationTime
1️⃣ Azure Blob Metadata: You can override the creation time by setting the following properties on each blob you are ingesting with LightIngest:
Set Metadata: kustoCreationTime or kustoCreationTimeUtc on the blob. This is interpreted as a UTC timestamp. More info.

✨ Learning: As I discovered, the metadata key does not appear to be case-sensitive; both camel case (as per the documentation) and lowercase seem to work.
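As a sketch, assuming the Azure CLI and hypothetical storage account, container, and blob names, the metadata can be set per blob before ingestion like this:

```
# Tag a blob so ADX backdates it at ingestion time.
# The value is interpreted as a UTC timestamp.
az storage blob metadata update \
  --account-name storageaccount \
  --container-name meterdata \
  --name dailymeterdata20200701.json \
  --metadata kustoCreationTime="2020-07-01T00:00:00Z"
```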
2️⃣ Date in Filename: You can also override the creation time if the filenames of the blobs being ingested contain a date. Using LightIngest, we can then set:
-creationTimePattern:"'dailymeterdata'yyyyMMdd'.json'"
This approach is a little less flexible as it must adhere to three strict requirements (source):
- Constant text immediately preceding the timestamp format, enclosed in single quotes (prefix)
- The timestamp format, in standard .NET DateTime notation
- Constant text immediately following the timestamp (suffix).
✨ Lesson: These rule out the use of GUIDs in the filename!
3️⃣ Date in File Path: If your data is organized in a folder structure containing the date, e.g. https://storageaccount/meterdata/2022/07/01/reading.json, you can simply set the creation time from the folder structure by specifying:
-creationTimePattern:"'meterdata/'yyyy/MM/dd'/'"
✨ Learning: This approach enables the blob to be named independently of the date, as the date is read from the folder structure, not the filename.
LightIngest: Smoke Test
As the data ingestion volume could be very high, and therefore time-consuming, we can make use of the -listOnly:true argument. It enables us to "smoke test" our ingestion by discovering (but not posting) the data in our Data Lake for ingestion.


LightIngest discovered the CreationTime based on the kustoCreationTime metadata we set earlier. We have now verified that upon ingestion, data will be appropriately retained based on creation time, caching, and retention policies.
Example: This LightIngest command will list a maximum of 10 (specified by -limit:10) JSON files (specified by -pattern:"*.json"), allowing you to verify the proper CreationTime has been set (as per the screenshot above).
LightIngest.exe
"https://[cluster].[region].kusto.windows.net;Fed=True"
-database:"database"
-table:"table"
-source:"[blob-container-sas-url]"
-format:"multijson"
-ingestionMappingRef:"[mapping-name]"
-pattern:"*.json"
-limit:10
-dontWait:true
-listOnly:true
✨ Learning: Depending on the date range of the data, you may see an informational message; read about the Extents Merge Policy. It is important to consider the Lookback property, which defines the timespan during which extents are considered for rebuild/merge.
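The merge policy, including its Lookback, can be inspected and adjusted per table; the sketch below assumes a hypothetical MeterReadings table and an illustrative 5-year (1825-day) custom lookback so that older, backdated extents are still considered for merge:

```kusto
// Show the current extents merge policy, including the Lookback section
.show table MeterReadings policy merge

// Widen the lookback window (Kind may be Default, All, HotCache, or Custom)
.alter-merge table MeterReadings policy merge '{"Lookback": {"Kind": "Custom", "CustomPeriod": "1825.00:00:00"}}'
```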

Looking good? Ready to Ingest?
Once you are confident the ingestion query works, simply remove the smoke test -listOnly:true and -limit:10 arguments to begin ingesting and backdating all JSON records in the source container.
Thank you 👏🏼
Do you enjoy content like this? Drop me a comment, follow for similar articles or connect with me on LinkedIn to stay in touch.