# Warehouses

### Overview

Warehouses in DataPancake refer to Snowflake virtual warehouses that you make available to the application as compute resources for scanning operations. Each scan configuration must be assigned to a specific warehouse, and each warehouse can only support one scan at a time.

**DataPancake-Specific Constraints:**

* **One scan per warehouse** - Each warehouse processes one scan at a time; multiple warehouses required for concurrent scanning
* **Snowpark-Optimized required** - DataPancake uses Snowpark Python procedures; standard warehouses are not recommended
* **MAX\_CONCURRENCY\_LEVEL = 1** - Required for all Snowpark-Optimized warehouses to prevent memory contention
* **60-minute timeout** - Snowflake native app limitation; scans exceeding 60 minutes are automatically terminated

***

### Warehouse Configuration

#### Supported Warehouse Sizes

DataPancake supports Snowpark-Optimized warehouses. Thread count defaults are determined by `AVAILABLE_CORES` from the `compute_resource_size` table:

* **X-Small**: 8 threads
* **Small**: 16 threads
* **Medium**: 32 threads (recommended for most use cases)
* **Large**: 64 threads
* **X-Large**: 128 threads
* **2X-Large**: 256 threads
* **3X-Large through 6X-Large**: 512-4096 threads

**Size Recommendations:**

* **Medium Snowpark-Optimized** - Best balance of performance and cost; achieves approximately 1,000,000 records per minute; recommended starting point
* **Multiple Medium Warehouses** - More cost-effective than larger warehouses; enables concurrent scanning across multiple data sources
* **Large and 2X-Large** - Use for exceptionally large or complex data sources; diminishing returns compared to medium warehouses
* **Small** - Suitable for testing and small data sources; may struggle with complex or large data sources

#### Required Settings

**Warehouse Type:**

* **Snowpark-Optimized** - Required for DataPancake's Snowpark-based scanning
* **Standard warehouses** - Not recommended; may experience performance limitations

**MAX\_CONCURRENCY\_LEVEL = 1:**

* Required for all Snowpark-Optimized warehouses
* Prevents memory contention during Python procedure execution
* Prevents scan failures due to memory constraints
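If a warehouse was created without this setting, it can be corrected and verified in place using standard Snowflake commands (the warehouse name below is illustrative):

```sql
-- Enforce single-query concurrency on an existing warehouse.
ALTER WAREHOUSE PANCAKE_MEDIUM_OPTIMIZED_01 SET MAX_CONCURRENCY_LEVEL = 1;

-- Confirm the setting took effect.
SHOW PARAMETERS LIKE 'MAX_CONCURRENCY_LEVEL' IN WAREHOUSE PANCAKE_MEDIUM_OPTIMIZED_01;
```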

***

### Performance Characteristics

#### Scan Speed Benchmarks

**Medium Snowpark-Optimized Warehouse:**

* Approximately 1,000,000 records per minute
* Performance varies based on:
  * Data source complexity (nested structures, array depth)
  * Number of polymorphic variations
  * Embedded JSON content
  * Data source type (Internal Tables vs. External Tables)

**External Tables:**

* Slightly slower than internal tables
* Network latency and external storage access add overhead
* Still achieve high throughput on medium warehouses

**Scaling Considerations:**

* Diminishing returns as warehouse size increases
* 2X-Large warehouses don't provide significantly better performance than Medium
* Multiple medium warehouses provide better overall throughput than single large warehouses
* Cost-effectiveness favors multiple medium warehouses over fewer large ones

#### Memory Constraints

**Real Memory Limitations:**

* Python procedures have significant memory requirements
* Complex data sources with deep nesting consume more memory
* Large arrays and polymorphic variations increase memory usage
* Warehouse size directly impacts available memory

**Vertical Scaling:**

* DataPancake scales within a single warehouse by parallelizing work across the warehouse's available threads
* Efficient job handling maximizes warehouse utilization
* Memory constraints, not CPU, are the primary limiting factor

#### Timeout Constraints

**60-Minute Runtime Limit:**

* Snowpark procedures in Snowflake native apps have a 60-minute runtime limit
* Scans exceeding 60 minutes are automatically terminated by Snowflake
* This limit is enforced by Snowflake at the platform level and cannot be changed by DataPancake

**Handling Large Data Sources:**

* Break large scans into multiple procedure calls using Scan Configuration settings
* Set appropriate record limits per procedure call to stay within 60-minute limit
* Monitor scan duration and adjust batch sizes accordingly
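One way to monitor scan duration is Snowflake's standard query history. For example, the following lists the longest recent procedure calls on a scan warehouse (warehouse name illustrative; `TOTAL_ELAPSED_TIME` is reported in milliseconds, and `INFORMATION_SCHEMA` functions require a database context):

```sql
-- Longest-running recent queries on a scan warehouse, to spot scans
-- approaching the 60-minute limit.
SELECT query_id,
       start_time,
       total_elapsed_time / 60000.0 AS elapsed_minutes
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE(
       WAREHOUSE_NAME => 'PANCAKE_MEDIUM_OPTIMIZED_01'))
WHERE query_type = 'CALL'
ORDER BY total_elapsed_time DESC
LIMIT 10;
```

If elapsed minutes trend toward 60, reduce the record limit per procedure call in the Scan Configuration.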

***

### Naming Conventions

**Recommended Pattern:** `pancake_<warehouse_size>_<number>`

**Examples:**

* `pancake_medium_snowpark_optimized_1`
* `pancake_medium_snowpark_optimized_2`
* `pancake_large_optimized_1`

**Notes:**

* Use descriptive names indicating size and type
* Include sequential numbers for multiple warehouses of same size
* Avoid special characters that might cause issues in SQL
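A consistent prefix also makes all DataPancake warehouses easy to list with a standard Snowflake command:

```sql
-- List every warehouse created under the naming convention.
SHOW WAREHOUSES LIKE 'pancake_%';
```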

***

### Adding Warehouses

#### During Initial Setup

Warehouses must be created immediately after global permissions are granted. The setup script (available via the Readme button) includes the warehouse creation statements.

**Setup Script Example:**

```sql
CREATE OR REPLACE WAREHOUSE PANCAKE_MEDIUM_OPTIMIZED_01 WITH
  WAREHOUSE_SIZE = 'Medium'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1;
```

**Setup Process:**

1. Grant global permissions to DataPancake application
2. Create warehouses using `CREATE OR REPLACE WAREHOUSE` statements
3. Warehouses are automatically registered during setup script execution via `ADD_WAREHOUSE` procedure calls

**Required Settings:**

* `WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'`
* `MAX_CONCURRENCY_LEVEL = 1`

***

#### After Initial Setup

**Post-Setup Process (3 steps, in order):**

1. **Create the Warehouse in Snowflake:**

```sql
CREATE OR REPLACE WAREHOUSE PANCAKE_LARGE_OPTIMIZED_03 WITH
  WAREHOUSE_SIZE = 'Large'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1;
```

2. **Grant Usage to DataPancake:**

```sql
GRANT USAGE ON WAREHOUSE PANCAKE_LARGE_OPTIMIZED_03 TO APPLICATION PANCAKE;
```

3. **Register Warehouse in DataPancake:**

```sql
CALL PANCAKE.CORE.ADD_WAREHOUSE('PANCAKE_LARGE_OPTIMIZED_03', 'LARGE');
```

**Notes:**

* The `ADD_WAREHOUSE` procedure uses a LIKE match on warehouse size name (partial matches work, e.g., 'LARGE' matches 'Large Snowpark Optimized')
* Warehouse name matching is case-insensitive
* Duplicate warehouse names are rejected
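Before registering, you can confirm the grant from step 2 by inspecting the warehouse's grants:

```sql
-- Verify that the DataPancake application has USAGE on the warehouse.
SHOW GRANTS ON WAREHOUSE PANCAKE_LARGE_OPTIMIZED_03;
```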

***

#### Via DataPancake UI

**UI Process (still requires SQL for step 2):**

1. Create warehouse in Snowsight (Admin → Warehouses)
2. Grant usage via SQL: `GRANT USAGE ON WAREHOUSE <name> TO APPLICATION PANCAKE;`
3. Register in DataPancake: Navigate to "Manage Warehouses", enter warehouse name (case-insensitive), select size, click "Save"

**Notes:**

* Warehouse must exist in Snowflake before registration
* Warehouse name matching in UI is case-insensitive

***

### Common Scenarios

#### Single Data Source, Regular Scanning

* Use a single medium Snowpark-Optimized warehouse
* Set scan configuration with record limit of 0 (full scan)
* Monitor scan duration to ensure it stays under 60 minutes

#### Multiple Large Data Sources

* Provision multiple medium Snowpark-Optimized warehouses
* Assign each data source to a dedicated warehouse
* Use consistent naming convention

#### Very Large Data Source Exceeding 60 Minutes

* Use multiple procedure calls in Scan Configuration
* Set appropriate record count per procedure call
* Use a medium or large warehouse depending on complexity

#### Development and Testing

* Create small Snowpark-Optimized warehouses for testing
* Use separate warehouses from production to avoid resource contention
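A test warehouse can also be configured to suspend aggressively to limit cost. For example (warehouse name illustrative; the auto-suspend settings are optional cost controls, not DataPancake requirements):

```sql
CREATE OR REPLACE WAREHOUSE PANCAKE_SMALL_OPTIMIZED_TEST WITH
  WAREHOUSE_SIZE = 'Small'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  MAX_CONCURRENCY_LEVEL = 1
  AUTO_SUSPEND = 60            -- suspend after 60 seconds idle
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;  -- no billing until first use
```

Grant usage and register the warehouse as described under "After Initial Setup".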

***

### Summary

**Key Points:**

* **One scan per warehouse** - Each warehouse processes one scan at a time; multiple warehouses required for concurrent operations
* **Medium Snowpark-Optimized recommended** - Best balance of performance and cost (32 threads, \~1M records/minute)
* **MAX\_CONCURRENCY\_LEVEL = 1** - Required for all Snowpark-Optimized warehouses
* **60-minute timeout** - Snowflake native app limitation; use multiple procedure calls for large data sources
* **Thread counts** - Determined by warehouse size (X-Small: 8, Small: 16, Medium: 32, Large: 64, 2X-Large: 256, etc.)

For Snowflake warehouse documentation, see [Snowflake Virtual Warehouses](https://docs.snowflake.com/en/user-guide/warehouses-overview).
