Warehouses
Snowflake virtual warehouses configured for DataPancake scanning operations, with specific requirements for Snowpark-Optimized warehouses and concurrency settings.
Overview
Warehouses in DataPancake refer to Snowflake virtual warehouses that you make available to the application as compute resources for scanning operations. Each scan configuration must be assigned to a specific warehouse, and each warehouse can only support one scan at a time.
DataPancake-Specific Constraints:
One scan per warehouse - Each warehouse processes one scan at a time; multiple warehouses required for concurrent scanning
Snowpark-Optimized required - DataPancake uses Snowpark Python procedures; standard warehouses are not recommended
MAX_CONCURRENCY_LEVEL = 1 - Required for all Snowpark-Optimized warehouses to prevent memory contention
60-minute timeout - Snowflake native app limitation; scans exceeding 60 minutes are automatically terminated
Warehouse Configuration
Supported Warehouse Sizes
DataPancake supports Snowpark-Optimized warehouses. Thread count defaults are determined by AVAILABLE_CORES from the compute_resource_size table:
X-Small: 8 threads
Small: 16 threads
Medium Snowpark Optimized: 32 threads (recommended for most use cases)
Large Snowpark Optimized: 64 threads
X-Large Snowpark Optimized: 128 threads
2X-Large Snowpark Optimized: 256 threads
3X-Large through 6X-Large: 512-4096 threads
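The defaults above can be checked directly if you want to confirm them for your account. The following is a minimal sketch, assuming the compute_resource_size table is reachable under a pancake.admin schema path; the path and the size-name column are assumptions, so check your installation for the actual location.

```sql
-- Hypothetical lookup of default thread counts. The schema path (pancake.admin)
-- and the warehouse_size column are assumptions; only compute_resource_size and
-- AVAILABLE_CORES are named in the documentation above.
SELECT warehouse_size,
       available_cores AS default_thread_count
FROM pancake.admin.compute_resource_size
ORDER BY available_cores;
```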
Size Recommendations:
Medium Snowpark-Optimized - Best balance of performance and cost; achieves approximately 1,000,000 records per minute; recommended starting point
Multiple Medium Warehouses - More cost-effective than larger warehouses; enables concurrent scanning across multiple data sources
Large and 2X-Large - Use for exceptionally large or complex data sources; diminishing returns compared to medium warehouses
Small - Suitable for testing and small data sources; may struggle with complex or large data sources
Required Settings
Warehouse Type:
Snowpark-Optimized - Required for DataPancake's Snowpark-based scanning
Standard warehouses - Not recommended; may experience performance limitations
MAX_CONCURRENCY_LEVEL = 1:
Required for all Snowpark-Optimized warehouses
Prevents memory contention during Python procedure execution
Prevents scan failures due to memory constraints (a sketch for applying and verifying these settings follows this list)
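To apply these settings to an existing warehouse and verify them, a minimal sketch follows; the warehouse name is an example, and the warehouse may need to be suspended before its type can be changed.

```sql
-- Apply the required settings to an existing warehouse (name is an example);
-- the warehouse may need to be suspended before changing its type.
ALTER WAREHOUSE pancake_medium_snowpark_optimized_1 SET
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1;

-- Verify the concurrency setting.
SHOW PARAMETERS LIKE 'MAX_CONCURRENCY_LEVEL'
  IN WAREHOUSE pancake_medium_snowpark_optimized_1;
```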
Performance Characteristics
Scan Speed Benchmarks
Medium Snowpark-Optimized Warehouse:
Approximately 1,000,000 records per minute
Performance varies based on:
Data source complexity (nested structures, array depth)
Number of polymorphic variations
Embedded JSON content
Data source type (Internal Tables vs. External Tables)
External Tables:
Slightly slower than internal tables
Network latency and external storage access add overhead
Still achieve high throughput on medium warehouses
Scaling Considerations:
Diminishing returns as warehouse size increases
2X-Large warehouses don't provide significantly better performance than Medium
Multiple medium warehouses provide better overall throughput than single large warehouses
Cost-effectiveness favors multiple medium warehouses over fewer large ones
Memory Constraints
Real Memory Limitations:
Python procedures have significant memory requirements
Complex data sources with deep nesting consume more memory
Large arrays and polymorphic variations increase memory usage
Warehouse size directly impacts available memory
Vertical Scaling:
DataPancake uses vertical scaling across threads within a warehouse
Efficient job handling maximizes warehouse utilization
Memory constraints are the primary limiting factor, not CPU
Timeout Constraints
60-Minute Runtime Limit:
Snowpark procedures in Snowflake native apps have a 60-minute runtime limit
Scans exceeding 60 minutes are automatically terminated by Snowflake
This is a limitation of Snowflake native apps, not of DataPancake
Handling Large Data Sources:
Break large scans into multiple procedure calls using Scan Configuration settings
Set appropriate record limits per procedure call to stay within 60-minute limit
Monitor scan duration and adjust batch sizes accordingly (see the query sketch below)
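One way to monitor duration is through Snowflake's query history. The sketch below is a minimal example that reports elapsed time in minutes for recent queries on a given warehouse; the warehouse name is an example.

```sql
-- Check recent query durations on a DataPancake warehouse to confirm scans
-- stay under the 60-minute limit (warehouse name is an example).
SELECT query_id,
       start_time,
       total_elapsed_time / 60000 AS elapsed_minutes
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE(
       WAREHOUSE_NAME => 'PANCAKE_MEDIUM_SNOWPARK_OPTIMIZED_1',
       RESULT_LIMIT   => 100))
ORDER BY start_time DESC;
```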
Naming Conventions
Recommended Pattern: pancake_<warehouse_size>_<number>
Examples:
pancake_medium_snowpark_optimized_1
pancake_medium_snowpark_optimized_2
pancake_large_optimized_1
Notes:
Use descriptive names indicating size and type
Include sequential numbers for multiple warehouses of same size
Avoid special characters that might cause issues in SQL
Adding Warehouses
During Initial Setup
Warehouses must be created immediately after global permissions are granted. The setup script (available via the Readme button) includes warehouse creation statements.
Setup Script Example:
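The exact script is provided via the Readme button; the statements below are a minimal sketch of its warehouse-related portion, with example warehouse names.

```sql
-- Sketch of the warehouse statements in the setup script; warehouse names are
-- examples, and the Readme script is the authoritative version.
CREATE OR REPLACE WAREHOUSE pancake_medium_snowpark_optimized_1 WITH
  WAREHOUSE_SIZE = 'MEDIUM'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1;

GRANT USAGE ON WAREHOUSE pancake_medium_snowpark_optimized_1 TO APPLICATION PANCAKE;

-- The setup script also registers each warehouse with DataPancake through
-- ADD_WAREHOUSE procedure calls.
```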
Setup Process:
Grant global permissions to DataPancake application
Create warehouses using CREATE OR REPLACE WAREHOUSE statements
Warehouses are automatically registered during setup script execution via ADD_WAREHOUSE procedure calls
Required Settings:
WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
MAX_CONCURRENCY_LEVEL = 1
After Initial Setup
Post-Setup Process (3 steps, in order; SQL for all three steps is sketched after the list):
Create the Warehouse in Snowflake:
Grant Usage to DataPancake:
Register Warehouse in DataPancake:
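A minimal SQL sketch of the three steps follows. The warehouse name is an example, and the schema path and argument order of ADD_WAREHOUSE are assumptions; use the exact call shown in the Readme.

```sql
-- Step 1: create the warehouse in Snowflake (name is an example).
CREATE OR REPLACE WAREHOUSE pancake_large_snowpark_optimized_1 WITH
  WAREHOUSE_SIZE = 'LARGE'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1;

-- Step 2: grant usage to the DataPancake application.
GRANT USAGE ON WAREHOUSE pancake_large_snowpark_optimized_1 TO APPLICATION PANCAKE;

-- Step 3: register the warehouse with DataPancake. The schema path and
-- argument order are assumptions; the size argument is matched with LIKE,
-- so 'LARGE' would also match 'Large Snowpark Optimized'.
CALL PANCAKE.APP.ADD_WAREHOUSE('pancake_large_snowpark_optimized_1', 'Large Snowpark Optimized');
```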
Notes:
The ADD_WAREHOUSE procedure uses a LIKE match on warehouse size name (partial matches work, e.g., 'LARGE' matches 'Large Snowpark Optimized')
Warehouse name matching is case-insensitive
Duplicate warehouse names are rejected
Via DataPancake UI
UI Process (still requires SQL for step 2):
Create warehouse in Snowsight (Admin → Warehouses)
Grant usage via SQL:
GRANT USAGE ON WAREHOUSE <name> TO APPLICATION PANCAKE;
Register in DataPancake: Navigate to "Manage Warehouses", enter warehouse name (case-insensitive), select size, click "Save"
Notes:
Warehouse must exist in Snowflake before registration
Warehouse name matching in UI is case-insensitive
Common Scenarios
Single Data Source, Regular Scanning
Use a single medium Snowpark-Optimized warehouse
Set scan configuration with record limit of 0 (full scan)
Monitor scan duration to ensure it stays under 60 minutes
Multiple Large Data Sources
Provision multiple medium Snowpark-Optimized warehouses
Assign each data source to a dedicated warehouse
Use consistent naming convention
Very Large Data Source Exceeding 60 Minutes
Use multiple procedure calls in Scan Configuration
Set appropriate record count per procedure call
Use a medium or large warehouse depending on complexity
Development and Testing
Create small Snowpark-Optimized warehouses for testing
Use separate warehouses from production to avoid resource contention
Summary
Key Points:
One scan per warehouse - Each warehouse processes one scan at a time; multiple warehouses required for concurrent operations
Medium Snowpark-Optimized recommended - Best balance of performance and cost (32 threads, ~1M records/minute)
MAX_CONCURRENCY_LEVEL = 1 - Required for all Snowpark-Optimized warehouses
60-minute timeout - Snowflake native app limitation; use multiple procedure calls for large data sources
Thread counts - Determined by warehouse size (X-Small: 8, Small: 16, Medium: 32, Large: 64, X-Large: 128, 2X-Large: 256, etc.)
For Snowflake warehouse documentation, see Snowflake Virtual Warehouses.