Druid Data Cookbook: Deconstructing Nested JSON Objects
In the previous episode of the Druid Data Cookbook, I showed how to extract and process all elements out of a nested array of objects in Druid. But what if the elements you want to process are scattered over different levels of the object hierarchy? By extending the UNNEST
paradigm, we can handle these cases too!
A data sample
Our data looks like this:
{
"timestamp": "2024-09-01",
"id": "1",
"org": {
"name": "Team 1",
"members": [
{
"name": "Alice",
"gender": "F"
},
{
"name": "Bob",
"gender": "M"
},
{
"name": "Carol",
"gender": "F"
}
]
}
}
Note how there are name
fields at the first nesting level of org
, but also at the members
level.
Here’s the full dataset in jsonl format:
{ "timestamp": "2024-09-01", "id": "1", "org": { "name": "Team 1", "members": [ { "name": "Alice", "gender": "F" }, { "name": "Bob", "gender": "M" }, { "name": "Carol", "gender": "F" } ] } }
{ "timestamp": "2024-09-01", "id": "2", "org": { "name": "Team 2", "members": [ { "name": "Dan", "gender": "M" }, { "name": "Eve", "gender": "F" }, { "name": "Frank", "gender": "M" } ] } }
Load this data sample into Druid (version 30 or higher.) It should be easy to ingest using the wizard, or you can submit this query:
REPLACE INTO "teams_nested" OVERWRITE ALL
WITH "ext" AS (
SELECT *
FROM TABLE(
EXTERN(
'{"type":"inline","data":"{ \"timestamp\": \"2024-09-01\", \"id\": \"1\", \"org\": { \"name\": \"Team 1\", \"members\": [ { \"name\": \"Alice\", \"gender\": \"F\" }, { \"name\": \"Bob\", \"gender\": \"M\" }, { \"name\": \"Carol\", \"gender\": \"F\" } ] } }\n{ \"timestamp\": \"2024-09-01\", \"id\": \"2\", \"org\": { \"name\": \"Team 2\", \"members\": [ { \"name\": \"Dan\", \"gender\": \"M\" }, { \"name\": \"Eve\", \"gender\": \"F\" }, { \"name\": \"Frank\", \"gender\": \"M\" } ] } }"}',
'{"type":"json"}'
)
) EXTEND ("timestamp" VARCHAR, "id" VARCHAR, "org" TYPE('COMPLEX<json>'))
)
SELECT
TIME_PARSE(TRIM("timestamp")) AS "__time",
"id",
"org"
FROM "ext"
PARTITIONED BY DAY
Extracting the leaf paths
Now, because the data does not neatly come in an array, we cannot extract elements at a specified level. But we can do another trick: we can obtain an array of the JSONPaths of all leaf elements in an object by calling JSON_PATHS
. Run this query:
SELECT
t.__time,
t.id,
t.org,
x.leaf_path
FROM "teams_nested" t CROSS JOIN UNNEST(JSON_PATHS(org)) x("leaf_path")
It returns one row for each leaf element in the org
object.
In the next step, let’s feed these values back into JSON_VALUE
.
(This will really only work with Druid 30 or better. In earlier versions, JSON_PATHS
required a string literal as its second element, limiting its flexibility.)
Some query examples
We can use the above query as a CTE and do further processing in the main query.
Get all name
fields, regardless of the hierarchy level, using a LIKE
filter:
WITH cte AS (
SELECT
t.__time,
t.id,
t.org,
x.leaf_path
FROM "teams_nested" t CROSS JOIN UNNEST(JSON_PATHS(org)) x("leaf_path")
)
SELECT
__time,
id,
leaf_path,
JSON_VALUE(org, "leaf_path")
FROM cte
WHERE leaf_path LIKE '%.name'
Or put all those names back into an array per team with an array aggregator:
WITH cte AS (
SELECT
t.__time,
t.id,
t.org,
x.leaf_path
FROM "teams_nested" t CROSS JOIN UNNEST(JSON_PATHS(org)) x("leaf_path")
)
SELECT
__time,
id,
ARRAY_AGG(JSON_VALUE(org, "leaf_path"))
FROM cte
WHERE leaf_path LIKE '%.name'
GROUP BY __time, id
You can do arbitrary gymnastics on the JSONPath expressions using regular expression filters. Here I use this technique to emulate a JSONPath expression like "$.members[*].gender"
:
WITH cte AS (
SELECT
t.__time,
t.id,
t.org,
x.leaf_path
FROM "teams_nested" t CROSS JOIN UNNEST(JSON_PATHS(org)) x("leaf_path")
)
SELECT
__time,
id,
leaf_path,
JSON_VALUE(org, "leaf_path")
FROM cte
WHERE REGEXP_LIKE(leaf_path, 'members\[\d+\]\.gender$')
Conclusion
- By unnesting the result of
JSON_PATHS
, you get access to the entire structure of a nested (JSON) object in Druid. - In conjunction with regular expression filters, you get processing capabilities that are almost as powerful as
jq
.
“This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0 .