# Vertical Scale Settings

Vertical scale settings control how DataPancake utilizes compute resources within a single warehouse. These settings are only available for semi-structured data sources.

***

## Number of Threads

Number of parallel threads used during scanning. Defaults to maximum threads available for chosen warehouse size.

**Default behavior:**

* Automatically set to warehouse maximum
* Small: 1-2 threads, Medium: 4-8 threads, Large: 8-16 threads, 2X-Large: 16+ threads

**Manual override:**

* Can be reduced for memory-constrained scenarios
* Lower thread count = less memory usage, slower processing
* Higher thread count = more memory usage, faster processing (up to warehouse max)

**Note:** Use default (maximum) for most scenarios. Reduce only if experiencing memory errors.

***

## Number of Procedure Calls (`PROCEDURE_INSTANCE_COUNT`)

Total number of procedure calls required to process entire dataset. Use multiple calls when a single call cannot complete within 60 minutes.

**When to use multiple calls:**

* Data source takes longer than 60 minutes to scan
* Very large datasets (millions of records)
* Complex nested structures requiring more processing time

**Configuration:**

* Set to `1` for single procedure call (default)
* Set to `2` or more to split scan across multiple calls
* Each call must complete within 60-minute timeout
* Calls execute sequentially via Snowflake Tasks

**Important constraints:**

* **WHERE clause not available** - Cannot use `SCAN_WHERE_CLAUSE` when `PROCEDURE_INSTANCE_COUNT > 1`
* **ORDER BY required** - Must specify `SCAN_ORDER_BY` when `PROCEDURE_INSTANCE_COUNT > 1`
* **Record count required** - Must set `PROCEDURE_INSTANCE_ROW_COUNT` when `PROCEDURE_INSTANCE_COUNT > 1`

**Note:** Start with single call (`1`) and monitor scan duration. Use multiple calls only if scans exceed 60 minutes.

***

## Record Count Per Procedure Call (`PROCEDURE_INSTANCE_ROW_COUNT`)

Number of records processed per procedure call. Required when `PROCEDURE_INSTANCE_COUNT > 1`.

**Calculation:**

* Total records ÷ Number of procedure calls = Records per call
* Must be large enough to process all rows
* Cannot be so large that some calls have zero rows

**Example:**

* Dataset: 2,000,000 records, 2 procedure calls
* Valid: 1,000,000 records per call (2,000,000 ÷ 2 = 1,000,000)
* Invalid: 500,000 records per call (would only process 1,000,000 total)
* Invalid: 2,500,000 records per call (exceeds total, would leave empty calls)

**Note:** Calculate: Total records ÷ Number of calls. Add small buffer (5-10%) to account for data growth.

***

## Order By (`SCAN_ORDER_BY`)

One or more attributes used to order the data source.

**When required:**

* `PROCEDURE_INSTANCE_COUNT > 1` - Ensures consistent record partitioning across calls, prevents duplicate or missed records

**When optional:**

* `PROCEDURE_INSTANCE_COUNT = 1` and `SCAN_RECORD_LIMIT > 0` - Can improve performance for limited scans

**When not required:**

* `PROCEDURE_INSTANCE_COUNT = 1` and `SCAN_RECORD_LIMIT = 0` - Full table scan in single call

**Format examples:**

* Single attribute: `json_data:"_id"` or `customer_id`
* Multiple attributes: `customer_id, json_data:"_id"`

**Note:** Use unique or nearly-unique attributes. Prefer indexed columns for performance.

***

## Record Count Per Thread Worker Process (`THREAD_PROCESS_RECORD_COUNT`)

Number of records passed to each thread worker process during a scan procedure call. Controls how much data is loaded into memory at one time.

**Memory calculation:**

```
Memory Used = Avg document size (bytes) × 2 × Thread Count × Record Count Per Thread
```

**Default behavior:**

* Automatically set based on warehouse size and data complexity
* Typically defaults to 1,000 records per thread
* Can be adjusted if memory errors occur

**Optimization:**

* Reduce if experiencing memory errors
* Increase if memory allows and you want faster processing
* Balance between memory usage and processing efficiency

**Note:** Start with default value. Reduce if memory errors occur during scanning.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.datapancake.com/core-concepts/scan-configurations/vertical-scale-settings.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
