# Warehouses

### Overview

Warehouses in DataPancake refer to Snowflake virtual warehouses that you make available to the application as compute resources for scanning operations. Each scan configuration must be assigned to a specific warehouse, and each warehouse can only support one scan at a time.

**DataPancake-Specific Constraints:**

* **One scan per warehouse** - Each warehouse processes one scan at a time; multiple warehouses required for concurrent scanning
* **Snowpark-Optimized required** - DataPancake uses Snowpark Python procedures; standard warehouses are not recommended
* **MAX\_CONCURRENCY\_LEVEL = 1** - Required for all Snowpark-Optimized warehouses to prevent memory contention
* **60-minute timeout** - Snowflake native app limitation; scans exceeding 60 minutes are automatically terminated

***

### Warehouse Configuration

#### Supported Warehouse Sizes

DataPancake supports Snowpark-Optimized warehouses. Thread count defaults are determined by `AVAILABLE_CORES` from the `compute_resource_size` table:

* **X-Small**: 8 threads
* **Small**: 16 threads
* **Medium**: 32 threads (recommended for most use cases)
* **Large**: 64 threads
* **X-Large**: 128 threads
* **2X-Large**: 256 threads
* **3X-Large through 6X-Large**: 512-4096 threads

**Size Recommendations:**

* **Medium Snowpark-Optimized** - Best balance of performance and cost; achieves approximately 1,000,000 records per minute; recommended starting point
* **Multiple Medium Warehouses** - More cost-effective than larger warehouses; enables concurrent scanning across multiple data sources
* **Large and 2X-Large** - Use for exceptionally large or complex data sources; diminishing returns compared to medium warehouses
* **Small** - Suitable for testing and small data sources; may struggle with complex or large data sources

#### Required Settings

**Warehouse Type:**

* **Snowpark-Optimized** - Required for DataPancake's Snowpark-based scanning
* **Standard warehouses** - Not recommended; may experience performance limitations

**MAX\_CONCURRENCY\_LEVEL = 1:**

* Required for all Snowpark-Optimized warehouses
* Prevents memory contention during Python procedure execution
* Prevents scan failures due to memory constraints
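If a warehouse was created without this setting, it can be corrected and verified in place using standard Snowflake commands (the warehouse name below is illustrative):

```sql
-- Enforce single-query concurrency on an existing warehouse.
ALTER WAREHOUSE PANCAKE_MEDIUM_OPTIMIZED_01 SET MAX_CONCURRENCY_LEVEL = 1;

-- Confirm the setting took effect.
SHOW PARAMETERS LIKE 'MAX_CONCURRENCY_LEVEL' IN WAREHOUSE PANCAKE_MEDIUM_OPTIMIZED_01;
```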

***

### Performance Characteristics

#### Scan Speed Benchmarks

**Medium Snowpark-Optimized Warehouse:**

* Approximately 1,000,000 records per minute
* Performance varies based on:
  * Data source complexity (nested structures, array depth)
  * Number of polymorphic variations
  * Embedded JSON content
  * Data source type (Internal Tables vs. External Tables)

**External Tables:**

* Slightly slower than internal tables
* Network latency and external storage access add overhead
* Still achieve high throughput on medium warehouses

**Scaling Considerations:**

* Diminishing returns as warehouse size increases
* 2X-Large warehouses don't provide significantly better performance than Medium
* Multiple medium warehouses provide better overall throughput than single large warehouses
* Cost-effectiveness favors multiple medium warehouses over fewer large ones

#### Memory Constraints

**Real Memory Limitations:**

* Python procedures have significant memory requirements
* Complex data sources with deep nesting consume more memory
* Large arrays and polymorphic variations increase memory usage
* Warehouse size directly impacts available memory

**Vertical Scaling:**

* DataPancake scales within a single warehouse by parallelizing work across the warehouse's available threads
* Efficient job handling maximizes warehouse utilization
* Memory constraints, not CPU, are the primary limiting factor

#### Timeout Constraints

**60-Minute Runtime Limit:**

* Snowpark procedures in Snowflake native apps have a 60-minute runtime limit
* Scans exceeding 60 minutes are automatically terminated by Snowflake
* This limit is enforced by Snowflake at the platform level and cannot be changed by DataPancake

**Handling Large Data Sources:**

* Break large scans into multiple procedure calls using Scan Configuration settings
* Set appropriate record limits per procedure call to stay within 60-minute limit
* Monitor scan duration and adjust batch sizes accordingly
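One way to monitor scan duration is Snowflake's standard query history. For example, the following lists the longest recent procedure calls on a scan warehouse (warehouse name illustrative; `TOTAL_ELAPSED_TIME` is reported in milliseconds, and `INFORMATION_SCHEMA` functions require a database context):

```sql
-- Longest-running recent queries on a scan warehouse, to spot scans
-- approaching the 60-minute limit.
SELECT query_id,
       start_time,
       total_elapsed_time / 60000.0 AS elapsed_minutes
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE(
       WAREHOUSE_NAME => 'PANCAKE_MEDIUM_OPTIMIZED_01'))
WHERE query_type = 'CALL'
ORDER BY total_elapsed_time DESC
LIMIT 10;
```

If elapsed minutes trend toward 60, reduce the record limit per procedure call in the Scan Configuration.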

***

### Naming Conventions

**Recommended Pattern:** `pancake_<warehouse_size>_<number>`

**Examples:**

* `pancake_medium_snowpark_optimized_1`
* `pancake_medium_snowpark_optimized_2`
* `pancake_large_optimized_1`

**Notes:**

* Use descriptive names indicating size and type
* Include sequential numbers for multiple warehouses of same size
* Avoid special characters that might cause issues in SQL
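A consistent prefix also makes all DataPancake warehouses easy to list with a standard Snowflake command:

```sql
-- List every warehouse created under the naming convention.
SHOW WAREHOUSES LIKE 'pancake_%';
```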

***

### Adding Warehouses

#### During Initial Setup

Warehouses must be created immediately after global permissions are granted. The setup script (available via the Readme button) includes the warehouse creation statements.

**Setup Script Example:**

```sql
CREATE OR REPLACE WAREHOUSE PANCAKE_MEDIUM_OPTIMIZED_01 WITH
  WAREHOUSE_SIZE = 'Medium'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1;
```

**Setup Process:**

1. Grant global permissions to DataPancake application
2. Create warehouses using `CREATE OR REPLACE WAREHOUSE` statements
3. Warehouses are automatically registered during setup script execution via `ADD_WAREHOUSE` procedure calls

**Required Settings:**

* `WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'`
* `MAX_CONCURRENCY_LEVEL = 1`

***

#### After Initial Setup

**Post-Setup Process (3 steps, in order):**

1. **Create the Warehouse in Snowflake:**

```sql
CREATE OR REPLACE WAREHOUSE PANCAKE_LARGE_OPTIMIZED_03 WITH
  WAREHOUSE_SIZE = 'Large'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1;
```

2. **Grant Usage to DataPancake:**

```sql
GRANT USAGE ON WAREHOUSE PANCAKE_LARGE_OPTIMIZED_03 TO APPLICATION PANCAKE;
```

3. **Register Warehouse in DataPancake:**

```sql
CALL PANCAKE.CORE.ADD_WAREHOUSE('PANCAKE_LARGE_OPTIMIZED_03', 'LARGE');
```

**Notes:**

* The `ADD_WAREHOUSE` procedure uses a LIKE match on warehouse size name (partial matches work, e.g., 'LARGE' matches 'Large Snowpark Optimized')
* Warehouse name matching is case-insensitive
* Duplicate warehouse names are rejected
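Before registering, you can confirm the grant from step 2 by inspecting the warehouse's grants:

```sql
-- Verify that the DataPancake application has USAGE on the warehouse.
SHOW GRANTS ON WAREHOUSE PANCAKE_LARGE_OPTIMIZED_03;
```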

***

#### Via DataPancake UI

**UI Process (still requires SQL for step 2):**

1. Create warehouse in Snowsight (Admin → Warehouses)
2. Grant usage via SQL: `GRANT USAGE ON WAREHOUSE <name> TO APPLICATION PANCAKE;`
3. Register in DataPancake: Navigate to "Manage Warehouses", enter warehouse name (case-insensitive), select size, click "Save"

**Notes:**

* Warehouse must exist in Snowflake before registration
* Warehouse name matching in UI is case-insensitive

***

### Common Scenarios

#### Single Data Source, Regular Scanning

* Use a single medium Snowpark-Optimized warehouse
* Set scan configuration with record limit of 0 (full scan)
* Monitor scan duration to ensure it stays under 60 minutes

#### Multiple Large Data Sources

* Provision multiple medium Snowpark-Optimized warehouses
* Assign each data source to a dedicated warehouse
* Use consistent naming convention

#### Very Large Data Source Exceeding 60 Minutes

* Use multiple procedure calls in Scan Configuration
* Set appropriate record count per procedure call
* Use a medium or large warehouse depending on complexity

#### Development and Testing

* Create small Snowpark-Optimized warehouses for testing
* Use separate warehouses from production to avoid resource contention
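A test warehouse can also be configured to suspend aggressively to limit cost. For example (warehouse name illustrative; the auto-suspend settings are optional cost controls, not DataPancake requirements):

```sql
CREATE OR REPLACE WAREHOUSE PANCAKE_SMALL_OPTIMIZED_TEST WITH
  WAREHOUSE_SIZE = 'Small'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1
  AUTO_SUSPEND = 60            -- suspend after 60 seconds idle
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;  -- no billing until first use
```

Grant usage and register the warehouse as described under "After Initial Setup".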

***

### Summary

**Key Points:**

* **One scan per warehouse** - Each warehouse processes one scan at a time; multiple warehouses required for concurrent operations
* **Medium Snowpark-Optimized recommended** - Best balance of performance and cost (32 threads, \~1M records/minute)
* **MAX\_CONCURRENCY\_LEVEL = 1** - Required for all Snowpark-Optimized warehouses
* **60-minute timeout** - Snowflake native app limitation; use multiple procedure calls for large data sources
* **Thread counts** - Determined by warehouse size (X-Small: 8, Small: 16, Medium: 32, Large: 64, 2X-Large: 256, etc.)

For Snowflake warehouse documentation, see [Snowflake Virtual Warehouses](https://docs.snowflake.com/en/user-guide/warehouses-overview).
