FDA_UDI Data Quick Start

One of the core benefits Pancake provides is the ability to performe a quick, thorough, recursive scan of a data source to discover every attribute contained within it. Data sources with a large number of records are often prohibitive to manually inspect, and users rely on sampling to understand the schema.

This guide will provide you with the script you need to install and set up Pancake in your Snowflake environment and add the FDA's Medical Device - UDI Dataset so you can familiarize yourself with Pancake's capabilities using real world data.

At present, there are 89 total attributes, 1 polymorphic attribute, 8 arrays, 11 objects, with 4 levels deep of nesting. The quick scan will likely only discover 88 total attributes and no polymorphic data.

Because the only way to be entirely certain of your data source's schema is to check every single attribute in every record, this guide will also walk you through how to create a Scan Configuration in Panake that will perform a complete scan of the FDA UDI Data.

Below you will find the setup script which you can copy and paste into a worksheet in your Snowflake environment.

Note: This guide assumes that you have already installed Pancake and the FDA UDI Data, and you will need to do so before continuing with this guide.

FDA Quick Start Script

/* ********** SCRIPT #2 - FDA Medical Device Data (Marketplace) (UDI Dataset)

You can get access to the FDA dataset here: https://app.snowflake.com/marketplace/listing/GZT1Z1RGR5S/tdaa!-fda-medical-device-raw-data-%E2%80%93-udi

Script Placeholders:

(For convenience, you can use the built in find/replace feature in Snowflake for the following placeholders)

<database_name> (x3) - The name of the database created as part of the TDAA data product you installed from the link above.
<application_install_role> (x2) - The name of the role you used to install Pancake.

In order to run this script you will also need to have access to the ACCOUNTADMIN role.


*/


USE ROLE <application_install_role>;

GRANT APPLICATION ROLE PANCAKE.app_public TO ROLE ACCOUNTADMIN;

USE ROLE ACCOUNTADMIN;

GRANT EXECUTE TASK ON ACCOUNT TO APPLICATION PANCAKE;

GRANT IMPORTED PRIVILEGES ON DATABASE <database_name> TO APPLICATION PANCAKE;

USE ROLE <application_install_role>;

CREATE OR REPLACE WAREHOUSE PANCAKE_X_SMALL_01 WITH
  WAREHOUSE_SIZE = 'X-Small';

CREATE OR REPLACE WAREHOUSE PANCAKE_MEDIUM_OPTIMIZED_01 WITH
  WAREHOUSE_SIZE = 'Medium'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1;


GRANT USAGE on WAREHOUSE PANCAKE_MEDIUM_OPTIMIZED_01 to APPLICATION PANCAKE;
GRANT USAGE on WAREHOUSE PANCAKE_X_SMALL_01 to APPLICATION PANCAKE;

DECLARE
    warehouse_name VARCHAR DEFAULT 'PANCAKE_X_SMALL_01';
    warehouse_size VARCHAR DEFAULT 'X-Small';
    call_result VARCHAR;
BEGIN
    CALL PANCAKE.CORE.ADD_WAREHOUSE(:warehouse_name, :warehouse_size) into :call_result;
    return call_result;
END;

--Add the newly created virtual warehouse in Step #5b and #6b to PANCAKE
DECLARE
    warehouse_name VARCHAR DEFAULT 'PANCAKE_MEDIUM_OPTIMIZED_01';
    warehouse_size VARCHAR DEFAULT 'Medium';
    call_result VARCHAR;
BEGIN
    CALL PANCAKE.CORE.ADD_WAREHOUSE(:warehouse_name, :warehouse_size) into :call_result;
    return call_result;
END;

DECLARE 
	datasource_name VARCHAR DEFAULT '<database_name>.MEDICAL_DEVICE.UDI_JSON'; 
	object_type VARCHAR DEFAULT 'Table'; 
	database_name VARCHAR DEFAULT '<database_name>'; 
	schema_name VARCHAR DEFAULT 'MEDICAL_DEVICE'; 
	object_name VARCHAR DEFAULT 'UDI_JSON'; 
	column_name VARCHAR DEFAULT 'JSON_DATA'; 
	datasource_tag VARCHAR DEFAULT ''; 
	warehouse_name VARCHAR DEFAULT 'PANCAKE_X_SMALL_01'; 
	record_limit DEFAULT 150; 
	initiate_scan DEFAULT TRUE; 
	call_result VARCHAR;
BEGIN 
	CALL PANCAKE.core.add_datasource_with_scan( 
:datasource_name,:object_type,:database_name,:schema_name,:object_name,:column_name, 
:warehouse_name, :record_limit, :initiate_scan, :datasource_tag) into
:call_result;

return call_result;

END;

Creating a 'Full Scan' Scan Configuration

Once you've completed the setup script, open up the Pancake app. From the home page, navigate to the Data Source Overview page. You should see a screen like the one below, with only the MEDICAL_DEVICE.UDI_JSON data source (including the database prefix) visible.

If you don't see the data source in Pancake, return to the worksheet you used to set up the app and ensure you didn't encounter any errors when you ran the script. If you do see the data source, UDI_JSON has been sucessfully added and the quick scan has been performed.

Manage Scan Configuration

Navigate to the Manage Scan Configuration page in the app by clicking the name in the left navigation bar, note that depending on the size of your browser window the data source filter may be blocking the words and you might need to scroll down.

Once on the Manage Scan Configuration page, select the UDI data source from the dropdown menu. This will default to a "New" Scan Configuration, if you wish to edit the existing quick scan, select it from the dropdown menu under "Select a scan configuration to edit".

Name your new Scan Configuration something like "FDA UDI Full Scan" so you can quickly identify details if you are looking at a view containing a large number of Scan Configurations.

Once you've named your Scan Configuration, scroll down on the page. From the Warehouse Settings dropdown menu, select PANCAKE_MEDIUM_OPTIMIZED_01 to ensure we have enough processing power to handle all ~4,300,000 records we're about to scan. Leave the Record Limit set to "0" and Pancake will default to the total number of records in the data source.

Click "Save" to save the scan configuration.

This should result in a success message at the top of the page: "The scan configuration settings have been updated successfully."

Information about Scan Configuration Advanced Settings can be found on the Scan Processing page.

Scan Data Source

Navigate to the Scan Data Source page from the left navigation bar. On the Scan Data Source page, select the UDI data source ( <your_database_name>.MEDICAL_DEVICE.UDI_JSON), and the FDA UDI Full Scan from their respective dropdown menus. Click the "Start Scan" button to initiate the scan.

Scanning tasks are asynchronous in Pancake, so feel free to navigate to a different page or away from the app entirely.

You can refresh the Scan Data Source page and re-select that Data Source/Scan Configuration combination to check that the scan is in progress, or return to the Data Source Overview page where you can see a table showing all scans currently in process.

A full scan of UDI_JSON should only take 4-5 minutes using PANCAKE_MEDIUM_OPTIMIZED_01, after which you can refresh the page and see the updated attribute information in the Data Source Overview table.

Schema Analysis

If this FDA data is useful and valuable for your actual work, you can upgrade your Pancake app to "Paid" from the Marketplace listing, and upgrade your FDA UDI data source to the Schema Analysis product tier, allowing you to access the full list of attributes.

By utilizing the setup script at the top of this page and walking through this guide, you will have already set all the global permissions needed for Pancake. You should now feel comfortable adding additional data sources and creating Scan Configurations.

For in-depth information about any individual feature or concept (such as setting a cron schedule to monitor a data source for schema changes), you can find it on the relevant page on this site.

Full Scan During Data Source Addition

The FDA Quick Start adds the UDI data as a data source using the default "quick scan" Scan Configuration. The default quick scan configuration limits the number of records to 150 so you can onboard data sources quickly and inexpensively while still getting some insight into the complexity of your data.

If you want to increase the number of records scanned when onboarding a data source for the first time, simply adjust the record_limit DEFAULT value before running the statement.

Note that the default quick scan uses the PANCAKE_X_SMALL_01 warehouse created in the first part of the setup script, and depending on the size and complexity of your data source an X_Small warehouse may not be sufficient to complete the scan. The scan may also fail entirely due to memory constraints. We have tested Pancake on data as large as ~18.5M records and have been able to acheive around 1M records per minute on a Medium Snowpark Optimized warehouse and recommend not attempting to push beyond 30-40M records when adding a data source. When you've familiarized yourself with Pancake, you will be able to configure more advanced Scan Configurations for larger data sources.

To add UDI_JSON as a data source with a complete scan, ensure you set warehouse_name VARCHAR DEFAULT to 'PANCAKE_MEDIUM_OPTIMIZED_01' and then set record_limit DEFAULT to any number greater than the total number of documents in the data source, in this case 4,275,758 or greater. Pancake will only process the total number of records available, so feel free to input a larger number if you are unsure of the exact number of records in the data source.

Last updated 10 months ago