Scan Configurations
This section provides detailed instructions for configuring and managing scan operations in DataPancake. Each subsection outlines specific configuration settings, SQL diagnostic tools, and scheduling options.
Configuration Calculator
Use the Configuration Calculator to estimate and optimize scan performance based on the data source’s size, structure, and workload. The following SQL statements can help validate and debug your configuration (a generic illustration of the overall pattern follows the list):
Diagnostic SQL Statement #1 – Select Variant Column to Create Query History Statistics
Diagnostic SQL Statement #2 – Get Last Query ID Used to Obtain Query History Statistics
Diagnostic SQL Statement #3 – Select from Query Statistics to Retrieve Calculator Variable Values. Note: Due to Snowflake’s latency in updating the query history view, this may not return data for up to 45 minutes.
Diagnostic SQL Statement #4 – Records Scanned Per Minute
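The diagnostic statements themselves are provided by DataPancake. As a rough illustration only, the overall pattern resembles the following Snowflake SQL, where json_data and my_source_table are placeholder names:

-- Step 1: query the variant column so Snowflake records query statistics for it.
SELECT json_data FROM my_source_table;

-- Step 2: capture the ID of the query that just ran.
SELECT LAST_QUERY_ID();

-- Step 3: retrieve statistics for that query. SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
-- can lag by up to 45 minutes, which is the latency noted above.
SELECT query_id, rows_produced, total_elapsed_time / 1000 AS elapsed_seconds
FROM snowflake.account_usage.query_history
WHERE query_id = '<query id from step 2>';

-- Step 4: records scanned per minute can then be derived as
-- rows_produced / (elapsed_seconds / 60).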
Configuration Settings
Auto Code Generate
Automatically generate dynamic table SQL code if the data source’s schema or polymorphic state changes.
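DataPancake generates the dynamic table code itself; the snippet below is only a generic Snowflake example of what dynamic table SQL over a variant column looks like, using placeholder object names (customer_flattened, my_wh, my_source_table, json_data):

-- Generic example of a dynamic table that flattens attributes out of a variant column.
CREATE OR REPLACE DYNAMIC TABLE customer_flattened
  TARGET_LAG = '1 hour'
  WAREHOUSE = my_wh
AS
SELECT
  json_data:"_id"::string          AS id,
  json_data:"customer_id"::number  AS customer_id
FROM my_source_table;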
Scan Schedule Settings
Cron Schedule
A cron schedule is required if you enable scheduling.
Cron Time Zone
A cron time zone is required in addition to the cron schedule if you enable scheduling.
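For example, assuming standard five-field cron syntax, a cron schedule of 0 2 * * * paired with the cron time zone America/Los_Angeles runs the scan every day at 2:00 AM Pacific time.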
Data Source Settings
Where Clause
The WHERE clause can only be used with a single procedure call.
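As a purely illustrative example with a placeholder column name, a filter such as WHERE json_data:"region"::string = 'EMEA' limits the scan to rows matching that condition.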
Vertical Scale Settings
Number of Threads
Defaults to the maximum number of threads available to the chosen virtual warehouse.
Number of Procedure Calls
Number of scan procedure calls required to process the entire dataset. Use multiple procedure calls when a single call cannot be completed in under sixty minutes (the default timeout for the Snowpark Python Sandbox). The WHERE clause parameter is not available if the number of procedure calls is greater than 1.
Record Count Per Procedure Call
Required when the number of procedure calls is greater than 1. The record count must be large enough to process all rows but not so large that procedure calls have zero rows to process.
Example:
• A dataset with 2,000,000 rows can successfully be divided into two calls with 1,000,000 rows per call.
• It cannot be divided into two calls with 500,000 rows per call or two calls with 2,500,000 rows per call.
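As a derived rule of thumb (not stated verbatim above): the record count per call multiplied by the number of calls must be at least the total row count, while the record count per call multiplied by one less than the number of calls must be smaller than the total row count. In the example, 1,000,000 × 2 = 2,000,000 covers every row and 1,000,000 × 1 still leaves rows for the second call; 500,000 × 2 = 1,000,000 misses half the rows, and 2,500,000 × 1 already covers everything, leaving the second call with zero rows to process.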
Record Count Per Thread Worker Process
The number of records passed to each thread worker process during a scan process procedure call. This setting controls how much data is loaded into memory at one time.
Memory used = (average document size in bytes) × 2 × (number of threads) × (record count per thread worker process)
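For example, with hypothetical values of a 10,000-byte average document, 8 threads, and 5,000 records per thread worker process, the estimate is 10,000 × 2 × 8 × 5,000 = 800,000,000 bytes, or roughly 800 MB of memory in use at one time.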
Order By
One or more attributes are required if the number of procedure calls is greater than 1.
Example 1: json_data:"_id"
Example 2: customer_id
Example 3: customer_id, json_data:"_id"
If the number of procedure calls is 1, Order By is optional, whether or not a record limit is set.
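Order By presumably exists so that each procedure call receives a stable, repeatable slice of the data. The snippet below is only a rough sketch of ordered slicing in Snowflake SQL with placeholder names; it is not DataPancake’s actual generated scan code:

-- Second call of a two-call scan over 2,000,000 rows (1,000,000 rows per call).
SELECT json_data
FROM my_source_table
ORDER BY customer_id, json_data:"_id"
LIMIT 1000000 OFFSET 1000000;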
Source Stream Settings
Last Scanned Timestamp
The last timestamp scanned. This value is used as part of the WHERE clause when scanning data from this data source. To scan the entire data source, remove the timestamp if one exists.
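As an illustration with a placeholder column name, a stored value of 2025-01-15 08:30:00 would translate into a filter along the lines of WHERE last_modified_ts > '2025-01-15 08:30:00', so only records changed since the previous scan are read.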