Scan Configurations

This section provides detailed instructions for configuring and managing scan operations in DataPancake. Each subsection outlines specific configuration settings, SQL diagnostic tools, and scheduling options.

Configuration Calculator

Use the Configuration Calculator to estimate and optimize scan performance based on the data source’s size, structure, and workload. The following SQL statements can help validate and debug your configuration:

Diagnostic SQL Statement #1 – Select Variant Column to Create Query History Statistics

SELECT <VARIANT_COLUMN_NAME> FROM <DATABASE_NAME>.<SCHEMA_NAME>.<OBJECT_NAME>;

Diagnostic SQL Statement #2 – Get Last Query ID Used to Obtain Query History Statistics

SELECT LAST_QUERY_ID();

Diagnostic SQL Statement #3 – Select from Query History to Retrieve Calculator Variable Values. Note: Due to Snowflake’s latency in updating the query history view, this may not return data for up to 45 minutes.

SELECT
    QUERY_TEXT,
    USER_NAME,
    WAREHOUSE_NAME,
    ROWS_PRODUCED AS TOTAL_RECORDS,
    BYTES_WRITTEN_TO_RESULT AS RESULT_BYTES
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_ID = '<QUERY_ID_FROM_QUERY_#2>';
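For example, suppose the variant column JSON_DATA lives in RAW_DB.SALES.ORDERS (all names here are hypothetical and used only for illustration). Run statements #1 and #2 in the same session so that LAST_QUERY_ID() returns the ID of the query from statement #1, then substitute that ID into statement #3:

SELECT JSON_DATA FROM RAW_DB.SALES.ORDERS;

SELECT LAST_QUERY_ID();

SELECT
    QUERY_TEXT,
    USER_NAME,
    WAREHOUSE_NAME,
    ROWS_PRODUCED AS TOTAL_RECORDS,
    BYTES_WRITTEN_TO_RESULT AS RESULT_BYTES
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_ID = '01b2c3d4-0000-0000-0000-000000000000';  -- placeholder for the ID returned by LAST_QUERY_ID()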

Diagnostic SQL Statement #4 – Records Scanned Per Minute

SELECT
    ds.scan_datetime_begin AS scan_start_time,
    dse.error_message AS diagnostic_information
FROM PANCAKE.CORE.VW_DATASOURCE_SCAN_ERRORS dse
INNER JOIN PANCAKE.CORE.VW_DATASOURCE_SCANS ds
    ON dse.datasource_scan_id = ds.datasource_scan_id
INNER JOIN PANCAKE.CORE.VW_DATASOURCES d
    ON ds.datasource_id = d.datasource_id
WHERE dse.unique_identifier = 'Diagnostics'
    AND d.datasource_name = '<datasource_name>'
    AND ds.configuration_name = '<scan_configuration_name>'
ORDER BY ds.scan_datetime_begin DESC;
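Substituting hypothetical values (a data source named orders_mongo and a scan configuration named nightly_full_scan), the filled-in query looks like this:

SELECT
    ds.scan_datetime_begin AS scan_start_time,
    dse.error_message AS diagnostic_information
FROM PANCAKE.CORE.VW_DATASOURCE_SCAN_ERRORS dse
INNER JOIN PANCAKE.CORE.VW_DATASOURCE_SCANS ds
    ON dse.datasource_scan_id = ds.datasource_scan_id
INNER JOIN PANCAKE.CORE.VW_DATASOURCES d
    ON ds.datasource_id = d.datasource_id
WHERE dse.unique_identifier = 'Diagnostics'
    AND d.datasource_name = 'orders_mongo'
    AND ds.configuration_name = 'nightly_full_scan'
ORDER BY ds.scan_datetime_begin DESC;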

Configuration Settings

Auto Code Generate

Automatically generate dynamic table SQL code if the data source’s schema or polymorphic state changes.
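The generated code is specific to each data source, so it is not reproduced in this guide. As a rough, hypothetical sketch of the kind of dynamic table SQL involved (the table, column, and warehouse names below are invented for illustration):

CREATE OR REPLACE DYNAMIC TABLE RAW_DB.SALES.ORDERS_FLATTENED
    TARGET_LAG = '1 hour'
    WAREHOUSE = TRANSFORM_WH
AS
SELECT
    JSON_DATA:"_id"::STRING AS order_id,            -- flatten variant attributes into typed columns
    JSON_DATA:"customer_id"::NUMBER AS customer_id,
    JSON_DATA:"total"::FLOAT AS order_total
FROM RAW_DB.SALES.ORDERS;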

Scan Schedule Settings

Cron Schedule

A cron schedule is required if you enable scheduling.

Cron Time Zone

A cron time zone is required in addition to the cron schedule if you enable scheduling.
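For example, to run a scan every day at 2:00 AM Pacific time (assuming standard five-field cron syntax; the values are illustrative):

Cron Schedule: 0 2 * * *
Cron Time Zone: America/Los_Angeles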

Data Source Settings

Where Clause

A WHERE clause can be used only when the scan runs as a single procedure call (see Number of Procedure Calls below).
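For example, to restrict a scan to recently created records, the clause might look like the following (the column name and date are hypothetical, and the exact form the setting expects follows the DataPancake UI):

created_at >= '2024-01-01'::DATE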

Vertical Scale Settings

Number of Threads

Defaults to the maximum number of threads available to the chosen virtual warehouse.

Number of Procedure Calls

Number of scan procedure calls required to process the entire dataset. Use multiple procedure calls when a single call cannot be completed in under sixty minutes (the default timeout for the Snowpark Python Sandbox). The WHERE clause parameter is not available if the number of procedure calls is greater than 1.

Record Count Per Procedure Call

Required when the number of procedure calls is greater than 1. The record count must be large enough to process all rows but not so large that procedure calls have zero rows to process.

Example
• A dataset with 2,000,000 rows can successfully be divided into two calls with 1,000,000 rows per call: 2 × 1,000,000 covers every row, and both calls have rows to process.
• It cannot be divided into two calls with 500,000 rows per call (2 × 500,000 = 1,000,000, leaving half the dataset unprocessed) or two calls with 2,500,000 rows per call (the first call would process all 2,000,000 rows, leaving the second call with zero rows).

Record Count Per Thread Worker Process

The number of records passed to each thread worker process during a scan process procedure call. This setting controls how much data is loaded into memory at one time.

Memory used = average document size in bytes * 2 * number of threads * record count per thread worker process
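For example, assuming an average document size of 10,000 bytes, 8 threads, and 1,000 records per thread worker process (all values illustrative):

10,000 * 2 * 8 * 1,000 = 160,000,000 bytes ≈ 160 MB in memory at one time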

Order By

One or more attributes are required if the number of procedure calls is greater than 1.

Example 1: json_data:"_id"
Example 2: customer_id
Example 3: customer_id, json_data:"_id"

Order By is optional if the number of procedure calls is 1 and the record limit is greater than 0; it is not required when the number of procedure calls is 1 and no record limit is used.

Source Stream Settings

Last Scanned Timestamp

The last timestamp scanned. This value is used as part of the WHERE clause when scanning data from this data source. To scan the entire data source, remove the timestamp if one exists.
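Conceptually, an incremental scan applies this timestamp as a filter; a rough sketch of the resulting query shape (the table and timestamp column are hypothetical, and the actual SQL is generated by DataPancake):

SELECT JSON_DATA
FROM RAW_DB.SALES.ORDERS
WHERE LAST_MODIFIED_TS > '2024-06-01 00:00:00'::TIMESTAMP_NTZ;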
