Working with XML in Snowflake: Part IV

The saga continues…

Introduction

In Part I of this series, we met our initial cast of characters. The VARIANT datatype, the COPY command, and how to use the parameter when loading XML content into a Snowflake table.

In Part II, the heroes of the story emerged: the SQL functions and . We showed how to use these functions to write standard SQL queries against an XML document stored in a variant column, and even how to reach into deeply nested XML elements.

In Part III, we met the villain: the repeating XML element array! We brought out the big guns — the join and the function, and they quickly dispatched this baddie. But we had a feeling it would return…

In this final chapter, the repeating array is back — but it’s not alone! It’s brought a few of its friends to cause mayhem and confusion... How will we ever deal with multiple repeating arrays in a single XML document?

OK — that’s a little melodramatic, but I think we’ve set the stage appropriately. Let’s dive in.

Setup

Chapters II and III used an example from the Insurance industry, so we’ll continue with that document here, but I’ll need to add a second repeating array. The original document contains a node that contains all the nodes from the specific claim. That looks like this:

See the array starting on line 65 and ending on line 130? There are two elements in that array, and each of those has one node and two nodes. We’ll be showing how to access data from both the and the arrays. But first, we need to backtrack a bit and explore the basic difference between relational SQL and hierarchical XML.

The Hierarchical vs. Relational Impedance

On the surface, you’d think that mapping an XML document to a set of relational tables would be simple. An XML “entity” would correspond to a database “table”, and the subnodes or attributes of that entity would become the table’s columns. Subnodes that are repeating arrays would be represented as “child” tables in the database model, and would have a primary/foreign key relationship to the “parent” table.

It’s the definition of relationships between these entities that present the challenge. Relational databases are relational because you define “relationships” between two tables based on one or more common columns. These relationships can be used for JOIN operations when writing SQL. XML and other semi-structured formats are hierarchical in nature, meaning that the relationship between a “parent” node and its “child” nodes is defined purely by their positions within the document. An XML element is a child of another because it is physically located “within” the start and end tags of its parent entity. There are no common elements or attributes in the child to join on.

The answer is to replicate, or denormalize, attributes from the parent element down into the definition of the child “table”. In the example above, we’re going to create a table named that contains data from the array, and a table named that flattens the element. Each of those will carry identifying elements (e.g., keys) from their parent elements.

Avoiding Cartesian Products

We’ll build on the techniques using and joins that I presented in Part III. However, if we try to build a single query that flattens both arrays, we’ll get the dreaded Cartesian Product! In the example above, there are four “rows” (i.e., subnodes) in the array, and two rows in the array. So the following query returns a total of 8 rows.

xml4_fig2.sql results

That’s clearly not the result we’re after. Imagine if this document had more arrays in it! Our Cartesian Product problem would grow exponentially…

Views to the Rescue!

The easiest way around this is to construct a view of each “path” into the XML. There will be a view that represents the array and one for the array. But it’s a little more complex than just sticking a in front of each section of SQL. This is where we morph the hierarchical structure of XML into a relational model.

  1. Figure out what the identifiers are for the parent nodes. Which field(s) represent the “primary key” of each level of the hierarchy above each array? In this example, let’s assume that is the unique identifier of each element.
  2. If you’re working with deeply nested arrays (i.e., GrandParent -> Parent
    -> Child), you’ll need a unique key for the intermediate Parent level too.
  3. Define these fields into the SELECT list of each view definition. These will function as the “foreign keys” for the child tables.
  4. Feel free to add as many additional non-key fields as you like into the view definition. This is the standard “denormalization” technique, and they’ll save you a lot of extra calls in your downstream SQL.

Here’s the Options view:

It has the from the root node as the primary key. I also added the field from the node as an example of a non-key attribute in the SELECT list. I don’t think they’d be repairing two different cars on the same repair order…

Here’s the view definition for the element.

Couple of interesting things going on in here.

  1. Line 4: repeating the RepairOrderID from the root node.
  2. Lines 6–8: retrieving values from the nodes below by name. Nothing we haven’t seen before.
  3. Lines 10–12: the node is nested below so we have to nest the calls to retrieve those values.
  4. Lines 14–19: now it gets interesting. The nodes are a repeating group, but they’re not nicely grouped under a single parent node, like element. They’re siblings of every other element under . Bad XML hygiene, but that’s just my opinion… This is going to force us to play some tricks in the FROM and WHERE clause.
  5. Line 23: flattening the array, as expected.
  6. Line 24: here’s the trick. We can’t use here, since there’s no “parent” level element above the two nodes that we can flatten. Instead, we use a call to that returns a reference to an array that contains the immediate children of — all eight of them. But we only want the two elements!
  7. Line 25: the WHERE clause here filters out all the unwanted children of , leaving only the elements, which are referenced with calls on lines 14–19. (As an experiment, remove the WHERE clause and see what happens to the result).

If the XML had been written with a parent “array” node, say or something, that wrapped the repeating elements, then the SQL would have been able to that parent node, and wouldn’t have needed a filter in the WHERE clause.

Wrapping Up

When presented with an XML document that contains multiple unrelated repeating arrays, I recommend using Snowflake views to present them to your users as separate and distinct entities using Snowflake views. Use the standard techniques of walking down the element hierarchy with calls and using a lateral join to flatten the repeating array. These views can then be joined on their common columns just like any other relational table.

Support for XML in Snowflake is admittedly a little less developed than the other semi-structured formats like JSON and Parquet. I’ve outlined a couple techniques that I’ve found to help simplify , but don’t let those complexities

Let it Snow!

Principal Sales Engineer with Snowflake — US Public Sector.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store