Data Sources

Pancake currently supports connecting data sources which are a valid JSON from VARIANT columns that store one object per row. It can be added from a table, external table, or view. That JSON object can contain any number of attributes of any type and any number of nested levels. Pancake’s support for external tables allows users to scan JSON data that is currently in cloud data storage such as an S3 bucket or Azure Blob without needing to move that data into Snowflake.

Data sources must be valid JSON to be added to Pancake. Documents containing invalid JSON will produce an error and not be added to the application.

Creating a 'Full Scan' Scan Configuration

Once you've completed the setup script, open up the Pancake app. From the home page, navigate to the Data Source Overview page. You should see a screen like the one below, with only the MEDICAL_DEVICE.UDI_JSON data source (including the database prefix) visible.

If you don't see the data source in Pancake, return to the worksheet you used to set up the app and ensure you didn't encounter any errors when you ran the script. If you do see the data source, UDI_JSON has been sucessfully added and the quick scan has been performed.

Manage Scan Configuration

Navigate to the Manage Scan Configuration page in the app by clicking the name in the left navigation bar, note that depending on the size of your browser window the data source filter may be blocking the words and you might need to scroll down.

Once on the Manage Scan Configuration page, select the UDI data source from the dropdown menu. This will default to a "New" Scan Configuration, if you wish to edit the existing quick scan, select it from the dropdown menu under "Select a scan configuration to edit".

Name your new Scan Configuration something like "FDA UDI Full Scan" so you can quickly identify details if you are looking at a view containing a large number of Scan Configurations.

Once you've named your Scan Configuration, scroll down on the page. From the Warehouse Settings dropdown menu, select PANCAKE_MEDIUM_OPTIMIZED_01 to ensure we have enough processing power to handle all ~4,300,000 records we're about to scan. Leave the Record Limit set to "0" and Pancake will default to the total number of records in the data source.

Click "Save" to save the scan configuration.

This should result in a success message at the top of the page: "The scan configuration settings have been updated successfully."

Information about Scan Configuration Advanced Settings can be found on the Scan Processing page.

Scan Data Source

Navigate to the Scan Data Source page from the left navigation bar. On the Scan Data Source page, select the UDI data source ( <your_database_name>.MEDICAL_DEVICE.UDI_JSON), and the FDA UDI Full Scan from their respective dropdown menus. Click the "Start Scan" button to initiate the scan.

Scanning tasks are asynchronous in Pancake, so feel free to navigate to a different page or away from the app entirely.

You can refresh the Scan Data Source page and re-select that Data Source/Scan Configuration combination to check that the scan is in progress, or return to the Data Source Overview page where you can see a table showing all scans currently in process.

A full scan of UDI_JSON should only take 4-5 minutes using PANCAKE_MEDIUM_OPTIMIZED_01, after which you can refresh the page and see the updated attribute information in the Data Source Overview table.

Schema Analysis

If this FDA data is useful and valuable for your actual work, you can upgrade your Pancake app to "Paid" from the Marketplace listing, and upgrade your FDA UDI data source to the Schema Analysis product tier, allowing you to access the full list of attributes.

By utilizing the setup script at the top of this page and walking through this guide, you will have already set all the global permissions needed for Pancake. You should now feel comfortable adding additional data sources and creating Scan Configurations.

For in-depth information about any individual feature or concept (such as setting a cron schedule to monitor a data source for schema changes), you can find it on the relevant page on this site.

Full Scan During Data Source Addition

The FDA Quick Start adds the UDI data as a data source using the default "quick scan" Scan Configuration. The default quick scan configuration limits the number of records to 150 so you can onboard data sources quickly and inexpensively while still getting some insight into the complexity of your data.

If you want to increase the number of records scanned when onboarding a data source for the first time, simply adjust the record_limit DEFAULT value before running the statement.

Note that the default quick scan uses the PANCAKE_X_SMALL_01 warehouse created in the first part of the setup script, and depending on the size and complexity of your data source an X_Small warehouse may not be sufficient to complete the scan. The scan may also fail entirely due to memory constraints. We have tested Pancake on data as large as ~18.5M records and have been able to acheive around 1M records per minute on a Medium Snowpark Optimized warehouse and recommend not attempting to push beyond 30-40M records when adding a data source. When you've familiarized yourself with Pancake, you will be able to configure more advanced Scan Configurations for larger data sources.

To add UDI_JSON as a data source with a complete scan, ensure you set warehouse_name VARCHAR DEFAULT to 'PANCAKE_MEDIUM_OPTIMIZED_01' and then set record_limit DEFAULT to any number greater than the total number of documents in the data source, in this case 4,275,758 or greater. Pancake will only process the total number of records available, so feel free to input a larger number if you are unsure of the exact number of records in the data source.

Adding Data Sources

Add Data Source During Setup or from Worksheet

Adding a data source during setup entails reusing the relevant sections of code found in the quickstart guides and editing details for each data source or sources you wish to add.

You must grant Pancake privileges for each database and schema which contain data you want to add. This must be done from a worksheet and will involve using statments that look like the following:

GRANT USAGE ON DATABASE <database_name> TO APPLICATION PANCAKE;
GRANT USAGE ON SCHEMA <database_name>.<schema_name> TO APPLICATION PANCAKE;

Then, for each object you would like you add as a data source, you must create a statement similar to the following:

DECLARE 
	datasource_name VARCHAR DEFAULT 'PANCAKE.SAMPLE_DATA.REALESTATE_JSON'; 
	object_type VARCHAR DEFAULT 'Table'; 
	database_name VARCHAR DEFAULT 'PANCAKE'; 
	schema_name VARCHAR DEFAULT 'SAMPLE_DATA'; 
	object_name VARCHAR DEFAULT 'REALESTATE_JSON'; 
	column_name VARCHAR DEFAULT 'JSON_DATA'; 
	datasource_tag VARCHAR DEFAULT ''; 
	warehouse_name VARCHAR DEFAULT 'PANCAKE_X_SMALL_01'; 
	record_limit DEFAULT 150; 
	initiate_scan DEFAULT TRUE; 
	call_result VARCHAR;
BEGIN 
	CALL PANCAKE.core.add_datasource_with_scan( 
:datasource_name,:object_type,:database_name,:schema_name,:object_name,:column_name, 
:warehouse_name, :record_limit, :initiate_scan, :datasource_tag) into
:call_result;

return call_result;

END;

There is a script included in the quick starts (also found in the script library) which can be customized to search a given database and schema for all tables which contain VARIANT type columns. If you utilize that script to identify data sources, one of the columns returned will have a statement formatted as above for each data source in that database/schema.

You can choose to initiate a scan or not with the initiate_scan DEFAULT TRUE; statement. You can also change the scope of the initial scan by changing the record_limit DEFAULT 150; to a larger number. If you want to add the data source with a full scan, simply set the limit number to something higher than the total number of records in the table. Note that if you choose to add a data source with a full scan, you should use a Medium Snowpark Optimized or larger warehouse to avoid memory issues which could cause the scan to fail.

If you want to scan a source with 10s or 100s of millions of records (or more) we strongly suggest adding the data soure with a small quick scan to understand the complexity before creating a scan configuration that will ensure a full scan completes successfully.

Add Data Source from App

To add a data source from the Pancake app interface, you will still need to GRANT USAGE on the database and schema from a worksheet.

Once the appropriate priviliges have been granted, you can navigate to the Manage Data Source screen in the app to add objects found in that database/schema.

Details on using the Manage Data Source page can be found in the Application Overview section.

Upgrading a Data Source

By default, data sources are added at the Schema Summary product tier. To upgrade a data source to the Schema Analysis or Dynamic Table Generation product tiers, navigate to the Manage Data Source page and click the Change Product Tier button.

Note that your organization must upgrade to the full, paid version of Pancake from the listing page in the Snowflake Marketplace for users to be able to upgrade individual data sources.

Pancake does not currently provide a way to batch upgrade data sources. If your organization needs to onboard and maintain a large number of data sources and manage them at a global level, please contact us directly about enterprise packages.

Last updated 1 year ago