Athena - Working with JSON

Photo by Patrick on Unsplash

Athena - Working with JSON

JSON Extraction in AWS Athena

Introduction

If we are using Athena and consume source data that we have no control over and which varies in it's schema (perhaps event data from multiple sources providing unique information), then we need to provide a table that is flexible enough to cope with this scenario.

One method we can use is to provide columns which map to Encoded JSON.

The following is a quick overview of working with JSON extraction in Athena.

How can we query the data?

So how can we effectively query the data?

If you are used to using SQL and have experience working with datasets storing encoded JSON, then you may already know the answer. We can 'extract' the JSON, such as using JSON_QUERY in SQL Server or JSON_VALUE in Oracle.

For Athena, the commands available to us are the following:

  json_extract(<column>, '$.<key_1>.<key_2>')
  json_array_get(<column>, '$.<array_key>'), 0) 
  json_extract_scalar(<column>, '$.<array_key>[0].<nested_array_object>')

Let's have quick look at some examples.

Imagine we have the following data schema:

{"CHANGE":0.68,"PRICE":99.68,"TICKER_SYMBOL":"NFLX","SECTOR":"TECHNOLOGY","NESTED":{"MORE_NESTED":{"NESTED_VALUE_KEY":"NESTED_VALUE"},"NESTED_ARRAY": [{"NESTED_ARRAY_OBJECT":"THE VALUE"},{"NESTED_ARRAY_OBJECT_2":"THE VALUE 2"}]}}

We can query the data in a number of ways depending on the scenario.

Extract a field using dot notation:

SELECT
json_extract(nested, '$.more_nested.nested_value_key') as nested_object_field
FROM
"nested_data_test"

Get an array field by index:

SELECT
json_array_get(json_extract(nested, '$.nested_array'), 0) AS nested_array_value
FROM
"nested_data_test"

Extract the value of an array :

SELECT
json_extract(json_array_get(json_extract(nested, '$.nested_array'), 0), '$.nested_array_object') as nested_array_value
FROM
"nested_data_test"

Extract a Scalar value:

SELECT
json_extract_scalar(nested, '$.nested_array[0].nested_array_object') AS nested_array_value_scalar
FROM
"nested_data_test"

The following query shows an example of each and the data output generated:

SELECT
change,
sector,
ticker_symbol,
json_extract(nested, '$.more_nested.nested_value_key') as nested_object_field,
json_extract(nested, '$.nested_array') as nested_array,
json_array_get(json_extract(nested, '$.nested_array'), 0) AS nested_array_value,
json_extract_scalar(nested, '$.nested_array[0].nested_array_object') AS nested_array_value_scalar,
json_extract(json_array_get(json_extract(nested, '$.nested_array'), 0), '$.nested_array_object') as nested_array_value_2
FROM
"nested_data_test"  
limit 10

Which produces:

#changesectorticker_symbolnested_object_fieldnested_arraynested_array_valuenested_array_value_scalarnested_array_value_2
10.68TECHNOLOGYNFLX"NESTED_VALUE"[{"nested_array_object":"THE VALUE"},{"nested_array_object_2":"THE VALUE 2"}]{"nested_array_object":"THE VALUE"}THE VALUE"THE VALUE"
2-1.73RETAILQXZ"NESTED_VALUE"[{"nested_array_object":"THE VALUE"},{"nested_array_object_2":"THE VALUE 2"}]{"nested_array_object":"THE VALUE"}THE VALUE"THE VALUE"

How can we handle nested arrays

If we have an array with hundreds of values, we don't really want to have to extract each value individually, this would be too painful to imagine. However, we can attempt to solve this by joining our data onto itself and presenting each value as it's own row in the result set.

SELECT nested, e
FROM "nested_data_test"  
CROSS JOIN UNNEST(CAST(json_extract(nested, '$.nested_array') AS array(json))) t(e)

The above query will join the nested arrays values onto the table and output something similar to:

#nestedarray_value
1{"more_nested":{"nested_value_key":"NESTED_VALUE"},"nested_array":[{"nested_array_object":"THE VALUE"},{"nested_array_object_2":"THE VALUE 2"}]}{"nested_array_object":"THE VALUE"}
2{"more_nested":{"nested_value_key":"NESTED_VALUE"},"nested_array":[{"nested_array_object":"THE VALUE"},{"nested_array_object_2":"THE VALUE 2"}]}{"nested_array_object_2":"THE VALUE 2"}

As you can see, the flexibility of the JSON functionality can allow for good options in reading the data. There may be performance costs associated with using the JSON methods, but I haven't dug into this currently, so it is not something I can comment on (perhaps this is something I will investigate in the future).

Are there any requirements on the table to accommodate columns storing encoded JSON?

Yes there are!

The following shows a basic DDL statement to define our table.

CREATE EXTERNAL TABLE IF NOT EXISTS `athena-example-glue-db`.`nested_data_test` (
  `change` double,
  `price` double,
  `ticker_symbol` string,
  `sector` string,
  `nested` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true',
'field.delim' = ''
)
LOCATION 's3://athena-example-bucket/'
TBLPROPERTIES ('has_encrypted_data'='false');

Make sure the column storing your encoded JSON is set to have the 'string' type.

The other important piece of configuration here is:

ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true',
'field.delim' = ''
),

When 'ignore.malformed.json' is set to true, any malformed JSON is returned as NULL.

'field.delim' defines the delimmiter character that separates the fields within the JSON.

'serialization.format' is representative of the delimmiter character in the resultant serialised JSON.

Conclusion

I hope the the above gives someone a good overview of how we can work with JSON encoded strings stored in Athena tables. The following links should provide more in depth information regarding Athena and JSON.

Extracting Data From JSON Searching Arrays Flattening Arrays