🏥FDA_UDI Data Quick Start
One of the core benefits Pancake provides is the ability to performe a quick, thorough, recursive scan of a data source to discover every attribute contained within it. Data sources with a large number of records are often prohibitive to manually inspect, and users rely on sampling to understand the schema.
This guide will provide you with the script you need to install and set up Pancake in your Snowflake environment and add the FDA's Medical Device - UDI Dataset so you can familiarize yourself with Pancake's capabilities using real world data.
At present, there are 89 total attributes, 1 polymorphic attribute, 8 arrays, 11 objects, with 4 levels deep of nesting. The quick scan will likely only discover 88 total attributes and no polymorphic data.
Because the only way to be entirely certain of your data source's schema is to check every single attribute in every record, this guide will also walk you through how to create a Scan Configuration in Panake that will perform a complete scan of the FDA UDI Data.
Below you will find the setup script which you can copy and paste into a worksheet in your Snowflake environment.
Note: This guide assumes that you have already installed Pancake and the FDA UDI Data, and you will need to do so before continuing with this guide.
Creating a 'Full Scan' Scan Configuration
Once you've completed the setup script, open up the Pancake app. From the home page, navigate to the Data Source Overview page. You should see a screen like the one below, with only the MEDICAL_DEVICE.UDI_JSON
data source (including the database prefix) visible.
If you don't see the data source in Pancake, return to the worksheet you used to set up the app and ensure you didn't encounter any errors when you ran the script. If you do see the data source, UDI_JSON
has been sucessfully added and the quick scan has been performed.
Manage Scan Configuration
Navigate to the Manage Scan Configuration page in the app by clicking the name in the left navigation bar, note that depending on the size of your browser window the data source filter may be blocking the words and you might need to scroll down.
Once on the Manage Scan Configuration page, select the UDI data source from the dropdown menu. This will default to a "New" Scan Configuration, if you wish to edit the existing quick scan, select it from the dropdown menu under "Select a scan configuration to edit".
Name your new Scan Configuration something like "FDA UDI Full Scan" so you can quickly identify details if you are looking at a view containing a large number of Scan Configurations.
Once you've named your Scan Configuration, scroll down on the page. From the Warehouse Settings dropdown menu, select PANCAKE_MEDIUM_OPTIMIZED_01
to ensure we have enough processing power to handle all ~4,300,000 records we're about to scan. Leave the Record Limit set to "0" and Pancake will default to the total number of records in the data source.
Click "Save" to save the scan configuration.
This should result in a success message at the top of the page: "The scan configuration settings have been updated successfully."
Information about Scan Configuration Advanced Settings can be found on the Scan Processing page.
Scan Data Source
Navigate to the Scan Data Source page from the left navigation bar. On the Scan Data Source page, select the UDI data source ( <your_database_name>.MEDICAL_DEVICE.UDI_JSON)
, and the FDA UDI Full Scan from their respective dropdown menus. Click the "Start Scan" button to initiate the scan.
Scanning tasks are asynchronous in Pancake, so feel free to navigate to a different page or away from the app entirely.
You can refresh the Scan Data Source page and re-select that Data Source/Scan Configuration combination to check that the scan is in progress, or return to the Data Source Overview page where you can see a table showing all scans currently in process.
A full scan of UDI_JSON
should only take 4-5 minutes using PANCAKE_MEDIUM_OPTIMIZED_01
, after which you can refresh the page and see the updated attribute information in the Data Source Overview table.
Schema Analysis
If this FDA data is useful and valuable for your actual work, you can upgrade your Pancake app to "Paid" from the Marketplace listing, and upgrade your FDA UDI data source to the Schema Analysis product tier, allowing you to access the full list of attributes.
By utilizing the setup script at the top of this page and walking through this guide, you will have already set all the global permissions needed for Pancake. You should now feel comfortable adding additional data sources and creating Scan Configurations.
For in-depth information about any individual feature or concept (such as setting a cron schedule to monitor a data source for schema changes), you can find it on the relevant page on this site.
Full Scan During Data Source Addition
The FDA Quick Start adds the UDI data as a data source using the default "quick scan" Scan Configuration. The default quick scan configuration limits the number of records to 150 so you can onboard data sources quickly and inexpensively while still getting some insight into the complexity of your data.
If you want to increase the number of records scanned when onboarding a data source for the first time, simply adjust the record_limit DEFAULT
value before running the statement.
Note that the default quick scan uses the PANCAKE_X_SMALL_01
warehouse created in the first part of the setup script, and depending on the size and complexity of your data source an X_Small warehouse may not be sufficient to complete the scan. The scan may also fail entirely due to memory constraints. We have tested Pancake on data as large as ~18.5M records and have been able to acheive around 1M records per minute on a Medium Snowpark Optimized warehouse and recommend not attempting to push beyond 30-40M records when adding a data source. When you've familiarized yourself with Pancake, you will be able to configure more advanced Scan Configurations for larger data sources.
To add UDI_JSON as a data source with a complete scan, ensure you set warehouse_name VARCHAR DEFAULT
to 'PANCAKE_MEDIUM_OPTIMIZED_01'
and then set record_limit DEFAULT
to any number greater than the total number of documents in the data source, in this case 4,275,758 or greater. Pancake will only process the total number of records available, so feel free to input a larger number if you are unsure of the exact number of records in the data source.
Last updated