Understanding the Challenges of Semi-Structured Data

Broadly speaking, Pancake is designed to convert semi-structured hierarchical data into relational data using a deep recursive scanning process, user configuration, and then the generating of SQL statements to create extracted and flattened out Dynamic Tables. Also known as, Pancaking your JSON data.

The following sections will give you a brief overview of the problems Pancake is designed to solve, as well as information about the problems themselves to help you understand them in more depth.

Here are some of the problems that Pancake directly addresses:

  • Polymorphic Data

  • Schema Discovery

  • Parsing & Querying Semi-Structured Data

  • Schema Evolution/Drift

Polymorphic Data

Polymorphic data refers to attributes in a semi-structured document that have different structures or data types. This can occur when documents are receiving raw data from multiple sources, or changes are made to the underlying schema. It can easily result in queries which miss large portions of raw data or impact downstream features and analytics.

The following are examples of polymorphic attributes found in a single JSON document, they are intended to provide a basic understanding of polymorphism in data but are by no means exhaustive:

Example #1:

address (string) property_type (string):

{
    "_id": "re-8594abcd-3217-4889-ef01-23b45cd6789f",
    "property_id": "PR-5689",
    "property_type": "Mixed-use",
    "address": "123 Main St"
}

address (list) object array property_type (object):

{
    "_id": "re-8594efgh-3217-4889-ef01-23b45cd6789f",
    "property_id": "PR-5690",
    "property_type": {"usage":"Commercial", "sq_ft":20000},
    "address": [{
        "street": "1234 High Tower Rd",
        "city": "Skyline",
        "state": "NY",
        "zip_code": "10101"
        }]
}

Example #2:

amenities (list) both primitive and object array:

"amenities": [
    "Gym",
    "Rooftop Deck",
    {
        "name": "Conference Rooms",
        "availability": "Reservable",
        "capacity": 20
    }
]

Issues like the examples above are very common with a flexible format like JSON. Pancake is designed to help you avoid the issues which result from that flexibility resulting in poor data quality.

Schema Discovery

When parsing JSON data, it is frequently difficult to know with certainty that the entire schema has been fully documented. Schemaless structures can change at any time, and hierarchical structures like JSON have no limit to the depth of nested objects and arrays. The same things which make JSON a desirable storage format also create challenges for organizations hoping to leverage that data for applications and analytics.

Parsing and Querying Semi-Structured Data

Snowflake SQL provides excellent support for parsing and processing data, but complex JSON structures require equally complex queries if users want to unpack, unnest, flatten, and relate the data found in them. That level of data complexity and SQL can be difficult to test and troubleshoot. If attributes are not notated precisely you will receive a null value, and can easily exclude data from your result set if you are not familiar enough with the underlying data to know the value should not be null.

Schema Evolution (or Schema Drift)

The structure or schema of JSON can change easily and often accidentally, making it difficult to track and process the changes to upstream data sources. Schema drift can create sudden problems as downstream pipelines are impacted or even broken. This causes delays to reporting, analytics and even product rollouts when features depend on data from the JSON source. Applications that are built on document databases which use Snowflake as their analytics platform and repository suffer as a result of these uncommunicated and often overlooked schema changes. Worse, if the schema has not changed but data has inadvertently become polymorphic due to data migration errors, these types of adverse impacts can occur without something as obvious as a broken dashboard to call attention to the issue.

Last updated