Warehouses

Snowflake virtual warehouses configured for DataPancake scanning operations, with specific requirements for Snowpark-Optimized warehouses and concurrency settings.

Overview

Warehouses in DataPancake refer to Snowflake virtual warehouses that you make available to the application as compute resources for scanning operations. Each scan configuration must be assigned to a specific warehouse, and each warehouse can only support one scan at a time.

DataPancake-Specific Constraints:

  • One scan per warehouse - Each warehouse processes one scan at a time; multiple warehouses required for concurrent scanning

  • Snowpark-Optimized required - DataPancake uses Snowpark Python procedures; standard warehouses are not recommended

  • MAX_CONCURRENCY_LEVEL = 1 - Required for all Snowpark-Optimized warehouses to prevent memory contention

  • 60-minute timeout - Snowflake native app limitation; scans exceeding 60 minutes are automatically terminated


Warehouse Configuration

Supported Warehouse Sizes

DataPancake supports Snowpark-Optimized warehouses. Thread count defaults are determined by AVAILABLE_CORES from the compute_resource_size table:

  • X-Small: 8 threads

  • Small: 16 threads

  • Medium Snowpark Optimized: 32 threads (recommended for most use cases)

  • Large Snowpark Optimized: 64 threads

  • X-Large Snowpark Optimized: 128 threads

  • 2X-Large Snowpark Optimized: 256 threads

  • 3X-Large through 6X-Large Snowpark Optimized: 512 to 4,096 threads (doubling with each size)

Size Recommendations:

  • Medium Snowpark-Optimized - Best balance of performance and cost; achieves approximately 1,000,000 records per minute; recommended starting point

  • Multiple Medium Warehouses - More cost-effective than larger warehouses; enables concurrent scanning across multiple data sources

  • Large and 2X-Large - Use for exceptionally large or complex data sources; diminishing returns compared to medium warehouses

  • Small - Suitable for testing and small data sources; may struggle with complex or large data sources

Required Settings

Warehouse Type:

  • Snowpark-Optimized - Required for DataPancake's Snowpark-based scanning

  • Standard warehouses - Not recommended; may experience performance limitations

MAX_CONCURRENCY_LEVEL = 1:

  • Required for all Snowpark-Optimized warehouses

  • Prevents memory contention during Python procedure execution

  • Prevents scan failures due to memory constraints
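
If a warehouse already exists, the required concurrency setting can be applied and verified with standard Snowflake statements. A minimal sketch, assuming an illustrative warehouse name:

  -- Enforce single-scan concurrency on an existing warehouse (name is illustrative)
  ALTER WAREHOUSE pancake_medium_snowpark_optimized_1 SET MAX_CONCURRENCY_LEVEL = 1;

  -- Confirm the setting took effect
  SHOW PARAMETERS LIKE 'MAX_CONCURRENCY_LEVEL' IN WAREHOUSE pancake_medium_snowpark_optimized_1;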


Performance Characteristics

Scan Speed Benchmarks

Medium Snowpark-Optimized Warehouse:

  • Approximately 1,000,000 records per minute

  • Performance varies based on:

    • Data source complexity (nested structures, array depth)

    • Number of polymorphic variations

    • Embedded JSON content

    • Data source type (Internal Tables vs. External Tables)

External Tables:

  • Slightly slower than internal tables

  • Network latency and external storage access add overhead

  • Still achieve high throughput on medium warehouses

Scaling Considerations:

  • Diminishing returns as warehouse size increases

  • 2X-Large warehouses don't provide significantly better performance than Medium

  • Multiple medium warehouses provide better overall throughput than single large warehouses

  • Cost-effectiveness favors multiple medium warehouses over fewer large ones

Memory Constraints

Real Memory Limitations:

  • Python procedures have significant memory requirements

  • Complex data sources with deep nesting consume more memory

  • Large arrays and polymorphic variations increase memory usage

  • Warehouse size directly impacts available memory

Vertical Scaling:

  • DataPancake uses vertical scaling across threads within a warehouse

  • Efficient job handling maximizes warehouse utilization

  • Memory constraints are the primary limiting factor, not CPU

Timeout Constraints

60-Minute Runtime Limit:

  • Snowpark procedures in Snowflake native apps have a 60-minute runtime limit

  • Scans exceeding 60 minutes are automatically terminated by Snowflake

  • This limit is imposed by Snowflake on native app procedures, not by DataPancake

Handling Large Data Sources:

  • Break large scans into multiple procedure calls using Scan Configuration settings

  • Set appropriate record limits per procedure call to stay within 60-minute limit

  • Monitor scan duration and adjust batch sizes accordingly
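
As a rough sizing example (assuming the ~1,000,000 records per minute benchmark above on a Medium Snowpark-Optimized warehouse): a single procedure call can cover on the order of 60 million records before reaching the 60-minute limit, so a 200-million-record source needs at least four calls; setting the per-call record limit closer to 40-50 million leaves headroom for complex records.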


Naming Conventions

Recommended Pattern: pancake_<warehouse_size>_<number>

Examples:

  • pancake_medium_snowpark_optimized_1

  • pancake_medium_snowpark_optimized_2

  • pancake_large_snowpark_optimized_1

Notes:

  • Use descriptive names indicating size and type

  • Include sequential numbers for multiple warehouses of same size

  • Avoid special characters that might cause issues in SQL


Adding Warehouses

During Initial Setup

Warehouses must be created immediately following the granting of global permissions. The setup script (available via the Readme button) includes warehouse creation statements.

Setup Script Example:
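
The exact statements ship with the Readme setup script; a minimal sketch of the warehouse-creation portion, assuming illustrative warehouse names, looks like this:

  -- Illustrative: two Medium Snowpark-Optimized warehouses to allow concurrent scans
  CREATE OR REPLACE WAREHOUSE pancake_medium_snowpark_optimized_1
    WAREHOUSE_SIZE = 'MEDIUM'
    WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
    MAX_CONCURRENCY_LEVEL = 1;

  CREATE OR REPLACE WAREHOUSE pancake_medium_snowpark_optimized_2
    WAREHOUSE_SIZE = 'MEDIUM'
    WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
    MAX_CONCURRENCY_LEVEL = 1;

  -- The setup script also grants usage to the application and registers each
  -- warehouse automatically via ADD_WAREHOUSE procedure calls.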

Setup Process:

  1. Grant global permissions to DataPancake application

  2. Create warehouses using CREATE OR REPLACE WAREHOUSE statements

  3. Warehouses are automatically registered during setup script execution via ADD_WAREHOUSE procedure calls

Required Settings:

  • WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'

  • MAX_CONCURRENCY_LEVEL = 1


After Initial Setup

Post-Setup Process (3 steps, in order; see the combined SQL sketch below):

  1. Create the warehouse in Snowflake

  2. Grant usage to DataPancake

  3. Register the warehouse in DataPancake
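
A combined sketch of the three steps, using an illustrative warehouse name; the schema qualifier and argument list of the ADD_WAREHOUSE call are assumptions and should be confirmed against the Readme:

  -- Step 1: create a Snowpark-Optimized warehouse with the required settings
  CREATE OR REPLACE WAREHOUSE pancake_medium_snowpark_optimized_3
    WAREHOUSE_SIZE = 'MEDIUM'
    WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
    MAX_CONCURRENCY_LEVEL = 1;

  -- Step 2: grant usage to the DataPancake application
  GRANT USAGE ON WAREHOUSE pancake_medium_snowpark_optimized_3 TO APPLICATION PANCAKE;

  -- Step 3: register the warehouse with DataPancake (schema and arguments assumed;
  -- the size argument uses a LIKE match, so 'MEDIUM' matches 'Medium Snowpark Optimized')
  CALL PANCAKE.CONFIG.ADD_WAREHOUSE('pancake_medium_snowpark_optimized_3', 'MEDIUM');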

Notes:

  • The ADD_WAREHOUSE procedure uses a LIKE match on warehouse size name (partial matches work, e.g., 'LARGE' matches 'Large Snowpark Optimized')

  • Warehouse name matching is case-insensitive

  • Duplicate warehouse names are rejected


Via DataPancake UI

UI Process (still requires SQL for step 2):

  1. Create warehouse in Snowsight (Admin → Warehouses)

  2. Grant usage via SQL: GRANT USAGE ON WAREHOUSE <name> TO APPLICATION PANCAKE;

  3. Register in DataPancake: Navigate to "Manage Warehouses", enter warehouse name (case-insensitive), select size, click "Save"

Notes:

  • Warehouse must exist in Snowflake before registration

  • Warehouse name matching in UI is case-insensitive


Common Scenarios

Single Data Source, Regular Scanning

  • Use a single medium Snowpark-Optimized warehouse

  • Set scan configuration with record limit of 0 (full scan)

  • Monitor scan duration to ensure it stays under 60 minutes

Multiple Large Data Sources

  • Provision multiple medium Snowpark-Optimized warehouses

  • Assign each data source to a dedicated warehouse

  • Use consistent naming convention

Very Large Data Source Exceeding 60 Minutes

  • Use multiple procedure calls in Scan Configuration

  • Set appropriate record count per procedure call

  • Use a medium or large warehouse depending on complexity

Development and Testing

  • Create small Snowpark-Optimized warehouses for testing

  • Use separate warehouses from production to avoid resource contention


Summary

Key Points:

  • One scan per warehouse - Each warehouse processes one scan at a time; multiple warehouses required for concurrent operations

  • Medium Snowpark-Optimized recommended - Best balance of performance and cost (32 threads, ~1M records/minute)

  • MAX_CONCURRENCY_LEVEL = 1 - Required for all Snowpark-Optimized warehouses

  • 60-minute timeout - Snowflake native app limitation; use multiple procedure calls for large data sources

  • Thread counts - Determined by warehouse size (X-Small: 8, Small: 16, Medium: 32, Large: 64, X-Large: 128, 2X-Large: 256, etc.)

For Snowflake warehouse documentation, see Snowflake Virtual Warehouses.
