Vertical Scale Settings

Control thread count, procedure calls, and memory management for optimal scan performance.

Vertical scale settings control how DataPancake uses compute resources within a single warehouse. These settings are available only for semi-structured data sources.


Number of Threads

Number of parallel threads used during scanning. Defaults to the maximum number of threads available for the chosen warehouse size.

Default behavior:

  • Automatically set to warehouse maximum

  • Small: 1-2 threads, Medium: 4-8 threads, Large: 8-16 threads, 2X-Large: 16+ threads

Manual override:

  • Can be reduced for memory-constrained scenarios

  • Lower thread count = less memory usage, slower processing

  • Higher thread count = more memory usage, faster processing (up to warehouse max)

Note: Use default (maximum) for most scenarios. Reduce only if experiencing memory errors.


Number of Procedure Calls (PROCEDURE_INSTANCE_COUNT)

Total number of procedure calls required to process the entire dataset. Use multiple calls when a single call cannot complete within the 60-minute timeout.

When to use multiple calls:

  • Data source takes longer than 60 minutes to scan

  • Very large datasets (millions of records)

  • Complex nested structures requiring more processing time

Configuration:

  • Set to 1 for single procedure call (default)

  • Set to 2 or more to split scan across multiple calls

  • Each call must complete within the 60-minute timeout

  • Calls execute sequentially via Snowflake Tasks

Important constraints:

  • WHERE clause not available - Cannot use SCAN_WHERE_CLAUSE when PROCEDURE_INSTANCE_COUNT > 1

  • ORDER BY required - Must specify SCAN_ORDER_BY when PROCEDURE_INSTANCE_COUNT > 1

  • Record count required - Must set PROCEDURE_INSTANCE_ROW_COUNT when PROCEDURE_INSTANCE_COUNT > 1

Note: Start with single call (1) and monitor scan duration. Use multiple calls only if scans exceed 60 minutes.
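The constraints above can be sketched as a small validation step. This is an illustrative helper, not part of DataPancake; the parameter names simply mirror the settings described on this page.

```python
# Illustrative sketch (not a DataPancake API): check the documented
# rules that apply when PROCEDURE_INSTANCE_COUNT > 1.

def validate_scale_settings(procedure_instance_count,
                            procedure_instance_row_count=None,
                            scan_order_by=None,
                            scan_where_clause=None):
    """Raise ValueError if the settings violate the documented constraints."""
    if procedure_instance_count < 1:
        raise ValueError("PROCEDURE_INSTANCE_COUNT must be at least 1")
    if procedure_instance_count > 1:
        # WHERE clause not available with multiple calls
        if scan_where_clause:
            raise ValueError("SCAN_WHERE_CLAUSE cannot be used when "
                             "PROCEDURE_INSTANCE_COUNT > 1")
        # ORDER BY required for consistent partitioning across calls
        if not scan_order_by:
            raise ValueError("SCAN_ORDER_BY is required when "
                             "PROCEDURE_INSTANCE_COUNT > 1")
        # Record count per call required to split the dataset
        if not procedure_instance_row_count:
            raise ValueError("PROCEDURE_INSTANCE_ROW_COUNT is required when "
                             "PROCEDURE_INSTANCE_COUNT > 1")
    return True
```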


Record Count Per Procedure Call (PROCEDURE_INSTANCE_ROW_COUNT)

Number of records processed per procedure call. Required when PROCEDURE_INSTANCE_COUNT > 1.

Calculation:

  • Total records ÷ Number of procedure calls = Records per call

  • Must be large enough to process all rows

  • Cannot be so large that some calls have zero rows

Example:

  • Dataset: 2,000,000 records, 2 procedure calls

  • Valid: 1,000,000 records per call (2,000,000 ÷ 2 = 1,000,000)

  • Invalid: 500,000 records per call (would only process 1,000,000 total)

  • Invalid: 2,500,000 records per call (exceeds total, would leave empty calls)

Note: Divide total records by the number of calls, then add a small buffer (5-10%) to account for data growth.
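The calculation above can be written out as a short sketch. This is not a DataPancake function, just the arithmetic from this section with the suggested growth buffer applied.

```python
import math

# Illustrative sketch: derive PROCEDURE_INSTANCE_ROW_COUNT from the
# total record count and the number of procedure calls, adding the
# 5-10% buffer suggested above for data growth.

def records_per_call(total_records, procedure_calls, buffer=0.05):
    # Round up so every row is covered across the calls
    base = math.ceil(total_records / procedure_calls)
    # Add headroom so the setting stays valid as the dataset grows
    return math.ceil(base * (1 + buffer))

# 2,000,000 records split across 2 calls with a 5% buffer:
print(records_per_call(2_000_000, 2))  # 1050000
```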


Order By (SCAN_ORDER_BY)

One or more attributes used to order the data source.

When required:

  • PROCEDURE_INSTANCE_COUNT > 1 - Ensures consistent record partitioning across calls and prevents duplicate or missed records

When optional:

  • PROCEDURE_INSTANCE_COUNT = 1 and SCAN_RECORD_LIMIT > 0 - Can improve performance for limited scans

When not required:

  • PROCEDURE_INSTANCE_COUNT = 1 and SCAN_RECORD_LIMIT = 0 - Full table scan in single call

Format examples:

  • Single attribute: json_data:"_id" or customer_id

  • Multiple attributes: customer_id, json_data:"_id"

Note: Use unique or nearly-unique attributes. Prefer indexed columns for performance.


Record Count Per Thread Worker Process (THREAD_PROCESS_RECORD_COUNT)

Number of records passed to each thread worker process during a scan procedure call. Controls how much data is loaded into memory at one time.

Memory calculation:

  • Records held in memory at one time ≈ Number of Threads × THREAD_PROCESS_RECORD_COUNT

Default behavior:

  • Automatically set based on warehouse size and data complexity

  • Typically defaults to 1,000 records per thread

  • Can be adjusted if memory errors occur

Optimization:

  • Reduce if experiencing memory errors

  • Increase if memory allows and you want faster processing

  • Balance between memory usage and processing efficiency

Note: Start with default value. Reduce if memory errors occur during scanning.
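Since each thread holds THREAD_PROCESS_RECORD_COUNT records at a time, the memory footprint can be roughed out as below. This is an illustrative estimate, not a DataPancake API, and avg_record_bytes is a hypothetical input you would estimate from your own data.

```python
# Illustrative sketch: rough memory estimate for a scan, based on
# threads x records-per-thread x average record size.
# avg_record_bytes is a hypothetical value estimated from your data.

def estimated_scan_memory_mb(threads, thread_process_record_count,
                             avg_record_bytes):
    # Records resident in memory at one time across all threads
    records_in_memory = threads * thread_process_record_count
    return records_in_memory * avg_record_bytes / (1024 * 1024)

# 8 threads x 1,000 records/thread x 4 KB records ≈ 31 MB resident:
print(round(estimated_scan_memory_mb(8, 1_000, 4_096)))  # 31
```

Halving either the thread count or the per-thread record count halves this estimate, which is why the sections above suggest reducing them when memory errors occur.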
