Warehouses

Snowflake virtual warehouses configured for DataPancake scanning operations, with specific requirements for Snowpark-Optimized warehouses and concurrency settings.

Overview

Warehouses in DataPancake refer to Snowflake virtual warehouses that you make available to the application as compute resources for scanning operations. Each scan configuration must be assigned to a specific warehouse, and each warehouse can only support one scan at a time.

DataPancake-Specific Constraints:

One scan per warehouse - Each warehouse processes one scan at a time; multiple warehouses required for concurrent scanning
Snowpark-Optimized required - DataPancake uses Snowpark Python procedures; standard warehouses are not recommended
MAX_CONCURRENCY_LEVEL = 1 - Required for all Snowpark-Optimized warehouses to prevent memory contention
60-minute timeout - Snowflake native app limitation; scans exceeding 60 minutes are automatically terminated

Warehouse Configuration

Supported Warehouse Sizes

DataPancake supports Snowpark-Optimized warehouses. Thread count defaults are determined by AVAILABLE_CORES from the compute_resource_size table:

X-Small: 8 threads
Small: 16 threads
Medium Snowpark Optimized: 32 threads (recommended for most use cases)
Large Snowpark Optimized: 64 threads
X-Large Snowpark Optimized: 128 threads
2X-Large Snowpark Optimized: 256 threads
3X-Large through 6X-Large: 512-4096 threads

Size Recommendations:

Medium Snowpark-Optimized - Best balance of performance and cost; achieves approximately 1,000,000 records per minute; recommended starting point
Multiple Medium Warehouses - More cost-effective than larger warehouses; enables concurrent scanning across multiple data sources
Large and 2X-Large - Use for exceptionally large or complex data sources; diminishing returns compared to medium warehouses
Small - Suitable for testing and small data sources; may struggle with complex or large data sources

Required Settings

Warehouse Type:

Snowpark-Optimized - Required for DataPancake's Snowpark-based scanning
Standard warehouses - Not recommended; may experience performance limitations

MAX_CONCURRENCY_LEVEL = 1:

Required for all Snowpark-Optimized warehouses
Prevents memory contention during Python procedure execution
Prevents scan failures due to memory constraints

Performance Characteristics

Scan Speed Benchmarks

Medium Snowpark-Optimized Warehouse:

Approximately 1,000,000 records per minute
Performance varies based on:
- Data source complexity (nested structures, array depth)
- Number of polymorphic variations
- Embedded JSON content
- Data source type (Internal Tables vs. External Tables)

External Tables:

Slightly slower than internal tables
Network latency and external storage access add overhead
Still achieve high throughput on medium warehouses

Scaling Considerations:

Diminishing returns as warehouse size increases
2X-Large warehouses don't provide significantly better performance than Medium
Multiple medium warehouses provide better overall throughput than single large warehouses
Cost-effectiveness favors multiple medium warehouses over fewer large ones

Memory Constraints

Real Memory Limitations:

Python procedures have significant memory requirements
Complex data sources with deep nesting consume more memory
Large arrays and polymorphic variations increase memory usage
Warehouse size directly impacts available memory

Vertical Scaling:

DataPancake uses vertical scaling across threads within a warehouse
Efficient job handling maximizes warehouse utilization
Memory constraints are the primary limiting factor, not CPU

Timeout Constraints

60-Minute Runtime Limit:

Snowpark procedures in Snowflake native apps have a 60-minute runtime limit
Scans exceeding 60 minutes are automatically terminated by Snowflake
This is a Snowflake organizational-level limitation, not a DataPancake limitation

Handling Large Data Sources:

Break large scans into multiple procedure calls using Scan Configuration settings
Set appropriate record limits per procedure call to stay within 60-minute limit
Monitor scan duration and adjust batch sizes accordingly

Naming Conventions

Recommended Pattern: pancake_<warehouse_size>_<number>

Examples:

pancake_medium_snowpark_optimized_1
pancake_medium_snowpark_optimized_2
pancake_large_optimized_1

Notes:

Use descriptive names indicating size and type
Include sequential numbers for multiple warehouses of same size
Avoid special characters that might cause issues in SQL

Adding Warehouses

During Initial Setup

Warehouses must be created immediately following the granting of global permissions. The setup script (available via the Readme button) includes warehouse creation statements.

Setup Script Example:

CREATE OR REPLACE WAREHOUSE PANCAKE_MEDIUM_OPTIMIZED_01 WITH
  WAREHOUSE_SIZE = 'Medium'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1;

Setup Process:

Grant global permissions to DataPancake application
Create warehouses using CREATE OR REPLACE WAREHOUSE statements
Warehouses are automatically registered during setup script execution via ADD_WAREHOUSE procedure calls

Required Settings:

WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
MAX_CONCURRENCY_LEVEL = 1

After Initial Setup

Post-Setup Process (3 steps, in order):

Create the Warehouse in Snowflake:

CREATE OR REPLACE WAREHOUSE PANCAKE_LARGE_OPTIMIZED_03 WITH
  WAREHOUSE_SIZE = 'Large'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1;

Grant Usage to DataPancake:

GRANT USAGE on WAREHOUSE PANCAKE_LARGE_OPTIMIZED_03 to APPLICATION PANCAKE;

Register Warehouse in DataPancake:

CALL PANCAKE.CORE.ADD_WAREHOUSE('PANCAKE_LARGE_OPTIMIZED_03', 'LARGE');

Notes:

The ADD_WAREHOUSE procedure uses a LIKE match on warehouse size name (partial matches work, e.g., 'LARGE' matches 'Large Snowpark Optimized')
Warehouse name matching is case-insensitive
Duplicate warehouse names are rejected

Via DataPancake UI

UI Process (still requires SQL for step 2):

Create warehouse in Snowsight (Admin → Warehouses)
Grant usage via SQL: GRANT USAGE on WAREHOUSE <name> to APPLICATION PANCAKE;
Register in DataPancake: Navigate to "Manage Warehouses", enter warehouse name (case-insensitive), select size, click "Save"

Notes:

Warehouse must exist in Snowflake before registration
Warehouse name matching in UI is case-insensitive

Common Scenarios

Single Data Source, Regular Scanning

Use a single medium Snowpark-Optimized warehouse
Set scan configuration with record limit of 0 (full scan)
Monitor scan duration to ensure it stays under 60 minutes

Multiple Large Data Sources

Provision multiple medium Snowpark-Optimized warehouses
Assign each data source to a dedicated warehouse
Use consistent naming convention

Very Large Data Source Exceeding 60 Minutes

Use multiple procedure calls in Scan Configuration
Set appropriate record count per procedure call
Use a medium or large warehouse depending on complexity

Development and Testing

Create small Snowpark-Optimized warehouses for testing
Use separate warehouses from production to avoid resource contention

Summary

Key Points:

One scan per warehouse - Each warehouse processes one scan at a time; multiple warehouses required for concurrent operations
Medium Snowpark-Optimized recommended - Best balance of performance and cost (32 threads, ~1M records/minute)
MAX_CONCURRENCY_LEVEL = 1 - Required for all Snowpark-Optimized warehouses
60-minute timeout - Snowflake native app limitation; use multiple procedure calls for large data sources
Thread counts - Determined by warehouse size (X-Small: 8, Small: 16, Medium: 32, Large: 64, 2X-Large: 256, etc.)

For Snowflake warehouse documentation, see Snowflake Virtual Warehouses.

PreviousFAQ NextData Sources

Last updated 2 months ago

Was this helpful?

Good evening

hashtagOverview

hashtagWarehouse Configuration

hashtagSupported Warehouse Sizes

hashtagRequired Settings

hashtagPerformance Characteristics

hashtagScan Speed Benchmarks

hashtagMemory Constraints

hashtagTimeout Constraints

hashtagNaming Conventions

hashtagAdding Warehouses

hashtagDuring Initial Setup

hashtagAfter Initial Setup

hashtagVia DataPancake UI

hashtagCommon Scenarios

hashtagSingle Data Source, Regular Scanning

hashtagMultiple Large Data Sources

hashtagVery Large Data Source Exceeding 60 Minutes

hashtagDevelopment and Testing

hashtagSummary

Overview

Warehouse Configuration

Supported Warehouse Sizes

Required Settings

Performance Characteristics

Scan Speed Benchmarks

Memory Constraints

Timeout Constraints

Naming Conventions

Adding Warehouses

During Initial Setup

After Initial Setup

Via DataPancake UI

Common Scenarios

Single Data Source, Regular Scanning

Multiple Large Data Sources

Very Large Data Source Exceeding 60 Minutes

Development and Testing

Summary