Working with large JSON files in Snowflake

Paul Horan · Snowflake · Feb 19, 2021

Introduction

One of Snowflake’s key differentiating features is our native support for semi-structured data formats, including JSON, XML, Parquet, ORC, and AVRO. Rather than serializing these formats into one long string and forcing the developer to parse them apart at query time, or requiring a complex pre-load transformation, Snowflake lets them be stored in their native form, directly in the database, using a special datatype called a VARIANT. As semi-structured data is loaded into a Snowflake variant column, the columnarized metadata for the document is captured and stored. Think of it as “schema on load” rather than “schema on read.”
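In practice, the landing spot is just a table with a single variant column. Something like this (the table name fda_products is only a placeholder for this article, and I'm running it through the SnowSQL CLI):

snowsql -q "CREATE TABLE IF NOT EXISTS fda_products (v VARIANT);"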

Executing queries against that semi-structured variant column is then extremely easy. A simple dot-notation syntax allows direct SQL access to any element value, at any level in the document. You can even reference elements in repeating arrays with a subscript notation.
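As a rough illustration (again with placeholder names, and assuming each product object was loaded as its own row), a query against that variant column might look like this:

snowsql -q "
SELECT v:product_ndc::string                  AS product_ndc,
       v:brand_name::string                   AS brand_name,
       v:active_ingredients[0].name::string   AS first_ingredient
FROM   fda_products
LIMIT  10;
"

The colon reaches into the first level of the document, the dot walks nested elements, and the [0] subscript pulls the first entry out of a repeating array.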

OK, define “large”…

But (there’s always a “but”…), Snowflake variant columns are currently limited to 16MB of gzip-compressed data. (I can’t discuss pending future enhancements, but suffice it to say that we’re exploring options here.) While most JSON documents easily fit into 16MB compressed, there are situations where this isn’t the case. I’ve been handed massive JSON files and asked to provide detailed analytics on their contents. For example, the FDA has published every reported adverse reaction to every pharmaceutical product, all the way back to 2004. All this data is available at open.fda.gov, provided as a set of zipped JSON files. When unzipped, some of these exceed 1GB in size, which prevents them from being loaded into Snowflake variant columns, even when using STRIP_OUTER_ARRAY.
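If you want to see the problem for yourself, a quick check (the temp path below is arbitrary) is to gzip the unzipped file and compare its size to that 16MB ceiling:

gzip -c drug-ndc-2021-01-29.json > /tmp/drug-ndc.json.gz
ls -lh /tmp/drug-ndc.json.gz      # still far larger than 16MB, even compressed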

(Screenshot: SnowSQL output)

jq to the rescue!!!

One solution is to split the file into smaller chunks prior to executing the PUT and COPY statements. The problem with splitting JSON is that it’s not line-based like CSV; a single JSON node may span dozens, or even hundreds, of lines in a file. Fortunately, there’s a way to extract entire chunks of JSON into separate files, using a utility called jq. jq is the Swiss Army knife of JSON utilities, and this article presents a command-line script that uses it to split JSON files into Snowflake-sized chunks.
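To give a flavor of what jq can do here: given a file whose repeating elements live under a top-level results array (like the openFDA file discussed below), a one-liner such as this extracts the first three elements as a valid, self-contained JSON array:

jq '.results[0:3]' drug-ndc-2021-01-29.json > sample.json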

Installing jq

This utility can run anywhere, so review the jq downloads page for your specific platform. I’m on a macOS machine, so I used Homebrew with the command brew install jq. What I’ve written is a bash shell script that reads in the larger file and spits it out in smaller chunks, keeping the integrity of the JSON structure intact.
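For reference, the install and a quick sanity check look like this (the apt-get line assumes a Debian/Ubuntu box):

brew install jq          # macOS, via Homebrew (what I used)
sudo apt-get install jq  # Debian/Ubuntu
jq --version             # confirm the install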

JSON refresher

The first step is to fully understand the JSON file you’ve been asked to work with. The whole point of this script is to identify the “repeating entities” in the JSON that can be safely extracted and written as a batch into a set of smaller files. For example, here’s an excerpt from one of those openFDA files I mentioned earlier:

{
  "results": [
    {
      "product_ndc": "0002-0800",
      "generic_name": "diluent",
      "labeler_name": "Eli Lilly and Company",
      "brand_name": "Sterile Diluent",
      "active_ingredients": [
        {
          "name": "WATER",
          "strength": "1 mL/mL"
        }
      ],
      "marketing_start_date": "19870710",
      "product_id": "0002-0800_662164fd-5ea0-4a08-bfd1-6b08bdd73342",
      "application_number": "BLA018781",
      "brand_name_base": "Sterile Diluent"
    },
    {
      "product_ndc": "0002-1200",
      "generic_name": "Florbetapir F 18",
      "labeler_name": "Eli Lilly and Company",
      "brand_name": "Amyvid",
      "active_ingredients": [
        {
          "name": "FLORBETAPIR F-18",
          "strength": "51 mCi/mL"
        }
      ],
      "marketing_start_date": "20120601",
      "product_id": "0002-1200_480fceef-6596-4478-97de-677c155506b3",
      "application_number": "NDA202008",
      "brand_name_base": "Amyvid"
    }
  ]
}

You can see the nested structure of the file here. There is an outermost “wrapper” array element called results. It’s an array of nested objects, each beginning with a product_ndc element. There are over 135,000 of these array elements in this one file, which itself is over 1GB. That makes it too big to be loaded with a single Snowflake COPY statement.
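You can confirm the name of that wrapper array and the element count with jq itself before splitting anything:

jq 'keys' drug-ndc-2021-01-29.json                 # lists the top-level keys
jq '.results | length' drug-ndc-2021-01-29.json    # prints the element count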

What my script does is loop through the file, writing out batches of individual array elements into separate files. These can then be uploaded into internal or external stages, and loaded into Snowflake using a COPY statement.

split_json.sh script

Copy this code block into a text file named split_json.sh. Drop it in the same folder with the large JSON file you need to split.

#!/bin/bash
# Split a large JSON file into batches of array elements, one batch per output file.
while getopts c:s:b:f:t: flag
do
    case "${flag}" in
        c) Counter=${OPTARG};;     # starting number for the output file names
        s) StartIndex=${OPTARG};;  # index of the first array element to extract
        b) BatchSize=${OPTARG};;   # number of array elements per output file
        f) FileName=${OPTARG};;    # name of the large JSON input file
        t) TagName=${OPTARG};;     # name of the outer "wrapper" array
    esac
done

JsonInput=$(cat "./$FileName")
echo "File Name: $FileName";
echo "Calculating ArrayLength...";
ArrayLength=$(echo "$JsonInput" | jq --arg foo "$TagName" '.[$foo] | length')
echo "ArrayLength: $ArrayLength";
echo "Size: $BatchSize";
echo "CounterStart: $Counter";

EndIndex=$BatchSize
while [[ $EndIndex -le $ArrayLength ]]
do
    echo "Starting batch #$Counter";
    # Slice the next batch of array elements out of the wrapper array
    # and write it as a valid JSON array in its own file.
    echo "$JsonInput" | jq ".${TagName}[${StartIndex}:${EndIndex}]" > ./output-$Counter.json
    if [[ $EndIndex -eq $ArrayLength ]]; then break; fi
    StartIndex=$(( StartIndex + BatchSize ))
    EndIndex=$(( EndIndex + BatchSize ))
    if [[ $EndIndex -gt $ArrayLength ]]; then EndIndex=$ArrayLength; fi
    Counter=$(( Counter + 1 ))
done
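Before running it the first time, make it executable:

chmod +x split_json.sh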

Invoking the script

The script takes five command-line arguments:

c: A simple counter used to generate unique output file names. Start at any number; it increments by 1 for each batch.
s: StartIndex. Usually 0, but this allows you to skip a certain number of nodes before writing them out.
b: BatchSize. The number of nodes written to each output file. Experiment with this to find the optimal file size; I’ve seen files of 300MB or more compress down to fit in a Snowflake variant.
f: The file name of the large JSON file.
t: The name of the outer array that contains the repeating nodes.

./split_json.sh -c 0 -s 0 -b 10000 -f drug-ndc-2021-01-29.json -t results

For this file, batches of 10,000 resulted in 13 files, named output-0.json through output-12.json. You can then PUT these into a stage, and execute SELECT or COPY statements against the entire folder.
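A hedged sketch of that last step, reusing the placeholder names from earlier (the wildcard in the PUT picks up all of the output files in the folder):

snowsql -q "PUT file:///tmp/json_split/output-*.json @fda_stage AUTO_COMPRESS=TRUE"
snowsql -q "COPY INTO fda_products FROM @fda_stage
            FILE_FORMAT = (TYPE = JSON STRIP_OUTER_ARRAY = TRUE)"

Since each output file is itself a JSON array, STRIP_OUTER_ARRAY turns every product element into its own row.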

Wrapup and opportunities for enhancement

This script works by reading the entire file into memory on your local machine and splitting it from there. For very large files, this can exceed the capacity of the machine, or significantly degrade performance. One possible enhancement would be to explore jq’s --stream option, which parses input in a streaming fashion, allowing jq programs to start processing large JSON texts immediately rather than waiting for the full parse to complete. If you have a single JSON text that is 1GB or larger, streaming lets processing start much more quickly. That exercise is “left to the user”…
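As a rough sketch of that direction (not a drop-in replacement for the script above, and the depth-2 truncation assumes the repeating elements live under a top-level results key), jq’s streaming parser can emit one element per line, which the standard split utility can then chop into batches:

jq -cn --stream \
  'fromstream(2 | truncate_stream(inputs | select(.[0][0] == "results")))' \
  drug-ndc-2021-01-29.json \
| split -l 10000 - output-

The output files (output-aa, output-ab, …) contain newline-delimited JSON rather than one big array, so the corresponding COPY would drop STRIP_OUTER_ARRAY.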

Let it Snow!

Paul Horan is a Principal Sales Engineer with Snowflake, US Public Sector.