Understanding the Challenges of Semi-Structured Data

DataPancake is designed to convert semi-structured hierarchical data into relational data using a deep recursive scanning process, user configuration, and then the generating of SQL statements to create extracted and flattened out Dynamic Tables. Also known as, Pancaking your semi-structured data.

The following sections will give you a brief overview of the problems DataPancake is designed to solve, as well as information about the problems themselves to help you understand them in more depth.

Here are some of the problems that DataPancake directly addresses:

Polymorphic Data (attributes using multiple data types)
Stringified, Embedded or Escaped JSON Data (JSON Blob inside a string attribute)
Schema Discovery (Deeply nested objects and arrays)
Parsing & Querying Semi-Structured Data
Schema Evolution/Drift

1 - Polymorphic Data

Polymorphic data refers to attributes in a semi-structured document that have different structures which could consist of different attributes from one document to another or identical attributes that have different data types from one document to another. This can occur when documents are receiving raw data from multiple sources, or changes are made to the underlying schema and historical data is not updated. It can easily result in queries which miss large portions of raw data or impact downstream features and analytics.

The following are examples of polymorphic attributes found in a single JSON document. They are intended to provide a basic understanding of polymorphism in data, but are by no means exhaustive:

Example #1:

property_type (string) address (string)

{
    "_id": "re-8594abcd-3217-4889-ef01-23b45cd6789f",
    "property_id": "PR-5689",
    "property_type": "Mixed-use",
    "address": "123 Main St"
}

property_type (object) address (object array)

{
    "_id": "re-8594efgh-3217-4889-ef01-23b45cd6789f",
    "property_id": "PR-5690",
    "property_type": {"usage":"Commercial", "sq_ft":20000},
    "address": [{
        "street": "1234 High Tower Rd",
        "city": "Skyline",
        "state": "NY",
        "zip_code": "10101"
        }]
}

Example #2:

amenities (array that contains both primitive and object elements)

"amenities": [
    "Gym",
    "Rooftop Deck",
    {
        "name": "Conference Rooms",
        "availability": "Reservable",
        "capacity": 20
    }
]

Issues like the examples above are very common with a flexible format like JSON. DataPancake is designed to help you avoid the issues which result from that flexibility, resulting in poor data quality.

2 - Embedded/Stringified/Escaped JSON

JSON data can be stored as an escaped string inside an attribute with a string data type. This type of JSON is especially difficult to parse because the schema may be unknown and have its own level of depth and polymorphic state. There is no limit to the level of recursive escaped JSON contained inside a single string attribute.

Example #1:

{
"customerData": {
    "customers": [
    {
        "customerID": "C011",
        "loyaltyProgram": {
        "enrollmentDate": "{\"datetime_iso\": \"2021-06-01 00:00:00\"}",
        "membershipID": "M011",
        "membershipLevel": "Gold",
        "points": 5000,
        "rewards": "[{\"rewardID\": \"R017\", \"description\": \"10% off next purchase\", \"expirationDate\": \"{\\"datetime_iso\\": \\"2023-12-31 23:59:59\\"}\"}, 
                     {\"rewardID\": \"R018\", \"description\": \"Free shipping\", \"expirationDate\": \"{\\"datetime_us\\": \\"12/31/2023 11:59 PM\\"}\"}
                    ]"
        }

3 - Schema Discovery

When parsing JSON data, it is frequently difficult to know with certainty that the entire schema has been fully documented. Schemaless structures can change at any time, and hierarchical structures like JSON have no limit to the depth of nested objects and arrays. The same things which make JSON a desirable storage format also create challenges for organizations hoping to leverage that data for applications and analytics.

4 - Parsing and Querying Semi-Structured Data

Snowflake SQL provides excellent support for parsing and processing data, but complex JSON structures require equally complex queries if users want to unpack, unnest, flatten, and relate the data found in them. That level of data complexity and SQL can be difficult to test and troubleshoot. If attributes are not notated precisely, you will receive a null value, and can easily exclude data from your result set if you are not familiar enough with the underlying data to know the value should not be null.

5 - Schema Evolution (or Schema Drift)

The structure or schema of JSON can change easily and often accidentally, making it difficult to track and process the changes to upstream data sources. Schema drift can create sudden problems as downstream pipelines are impacted or even broken. This causes delays to reporting, analytics, and even product rollouts when features depend on data from the JSON source.

Applications that are built on document databases which use Snowflake as their analytics platform and repository, suffer as a result of these uncommunicated and often overlooked schema changes. Worse, if the schema has not changed but data has inadvertently become polymorphic due to data migration errors, these types of adverse impacts can occur without something as obvious as a broken dashboard to call attention to the issue.

Last updated 2 months ago