Running SQL on Nested JSON | Rockset
Megaphone icon
IMPORTANT UPDATE: OpenAI Acquires RocksetRead More ->
Close icon
Rockset Logo
  • Login

Running SQL on Nested JSON

December 7, 2018

,

When we surveyed the market, we saw the need for a solution that could perform fast SQL queries on fluid JSON data, including arrays and nested objects:

The Challenge of SQL on JSON

Some form of ETL to transform JSON to tables in SQL databases may be workable for basic JSON data with fixed fields that are known up front. However, JSON with nested objects or new fields that “can spring up every 2-4 weeks,” as the original Stack Overflow poster put it, is impossible to handle in such a rigid manner.

Relational databases offer alternative approaches to accommodate more complex JSON data. SQL Server stores JSON in varchar columns, while Postgres and MySQL have JSON data types. In these scenarios, users can ingest JSON data without conversion to SQL fields, but take a performance hit when querying the data because these columns support minimal indexing at best.

SQL on Nested JSON Using Rockset

With lots of fields that change, get added/removed, etc, it can be rather cumbersome to maintain ETL pipelines. Rockset was designed to help with this problem—by indexing all fields in JSON documents, along with all type information, and exposing a SQL API on top of it.

For example, with a Rockset collection named new_collection, I can start by adding a single document to an empty collection that looks like:

{
    "my-field": "doc1",
    "my-other-field": "some text"
}

... and then query it.

rockset> select "my-field", "my-other-field" 
         from new_collection;

+------------+------------------+
| my-field   | my-other-field   |
|------------+------------------|
| doc1       | some text        |
+------------+------------------+

Now, if a new JSON document comes in with some new fields - maybe with some arrays, nested JSON objects, etc, I can still query it with SQL.

{
    "my-field": "doc2",
    "my-other-field":[
        {
            "c1": "this",
            "c2": "field",
            "c3": "has",
            "c4": "changed"
        }
    ]
}

I add that to the same collection and can query it just as before.

rockset> select "my-field", "my-other-field" 
         from new_collection;

+------------+---------------------------------------------------------------+
| my-field   | my-other-field                                                |
|------------+---------------------------------------------------------------|
| doc1       | some text                                                     |
| doc2       | [{'c1': 'this', 'c2': 'field', 'c3': 'has', 'c4': 'changed'}] |
+------------+---------------------------------------------------------------+

I can further flatten nested JSON objects and array fields at query time and construct the table I want to get to - without having to do any transformations beforehand.

rockset> select mof.* 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+------+-------+------+---------+
| c1   | c2    | c3   | c4      |
|------+-------+------+---------|
| this | field | has  | changed |
+------+-------+------+---------+

In addition to this, there is strong type information stored, which means I won't get tripped up by having mixed types, etc. Adding a third document:

{
    "my-field": "doc3",
    "my-other-field":[
        {
            "c1": "unexpected",
            "c2": 99,
            "c3": 100,
            "c4": 101
        }
    ]
}

It still adds my document as expected.

rockset> select mof.* 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+------------+-------+------+---------+
| c1         | c2    | c3   | c4      |
|------------+-------+------+---------|
| unexpected | 99    | 100  | 101     |
| this       | field | has  | changed |
+------------+-------+------+---------+

... and the fields are strongly typed.

rockset> select typeof(mof.c2) 
         from new_collection, unnest(new_collection."my-other-field") as mof;

+-----------+
| ?typeof   |
|-----------|
| int       |
| string    |
+-----------+

If being able to run SQL on complex JSON, without any ETL, data pipelines, or fixed schema, sounds interesting to you, you should give Rockset a try.

heap.identify('Email')