Common Patterns & Best Practices

Pre-configured patterns and optimization recommendations for common scan scenarios.

Common Configuration Patterns

Quick Onboarding Scan:

ATTRIBUTE_CREATE_TYPE: 'Schema' (if schema sample available) or 'Discover'
SCAN_RECORD_LIMIT: 10,000 (or small sample)
PROCEDURE_INSTANCE_COUNT: 1
Number of Threads: Default (maximum)
CODE_GENERATE_ON_VERSION_CHANGE: FALSE (review first)
MONITOR_ENABLED: FALSE
Use case: Initial data source setup, testing, quick schema validation

Full Production Scan:

ATTRIBUTE_CREATE_TYPE: 'Discover'
SCAN_RECORD_LIMIT: 0 (unlimited)
PROCEDURE_INSTANCE_COUNT: 1 (or multiple if needed)
Number of Threads: Default (maximum)
CODE_GENERATE_ON_VERSION_CHANGE: TRUE
MONITOR_ENABLED: TRUE (daily/weekly as needed)
Use case: Production data sources, complete attribute discovery, automated pipeline updates

Incremental Scan:

ATTRIBUTE_CREATE_TYPE: 'Discover'
SCAN_RECORD_LIMIT: 0 (unlimited)
SCAN_WHERE_CLAUSE: created_date > '<last_scanned_timestamp>'
SOURCE_STREAM_LAST_SCANNED_TIMESTAMP: Auto-updated after scans
PROCEDURE_INSTANCE_COUNT: 1
Number of Threads: Default (maximum)
CODE_GENERATE_ON_VERSION_CHANGE: TRUE
MONITOR_ENABLED: TRUE (hourly/daily)
Use case: Regularly updated data sources, delta scanning workflows, schema monitoring

Large Dataset Scan:

ATTRIBUTE_CREATE_TYPE: 'Discover'
SCAN_RECORD_LIMIT: 0 (unlimited)
PROCEDURE_INSTANCE_COUNT: 4 (or as needed)
PROCEDURE_INSTANCE_ROW_COUNT: 1,000,000 (or calculated)
SCAN_ORDER_BY: unique_id (or appropriate attribute)
Number of Threads: Default (maximum)
CODE_GENERATE_ON_VERSION_CHANGE: TRUE
MONITOR_ENABLED: TRUE (as needed)
Use case: Data sources exceeding 60-minute scan time, very large datasets (millions+ records), complex nested structures

Memory-Constrained Scan:

ATTRIBUTE_CREATE_TYPE: 'Discover'
SCAN_RECORD_LIMIT: 0 (unlimited) or limited
PROCEDURE_INSTANCE_COUNT: 1 or multiple
Number of Threads: Reduced (e.g., 2-4 instead of max)
THREAD_PROCESS_RECORD_COUNT: Reduced
CODE_GENERATE_ON_VERSION_CHANGE: TRUE
MONITOR_ENABLED: FALSE or TRUE
Use case: Memory errors during scanning, very complex nested structures, large arrays and polymorphic variations, smaller warehouse sizes

Configuration management:

Use descriptive, consistent names (e.g., customer_events_daily_full)
Create separate configurations for different use cases (quick scan, full scan, incremental scan)
Keep only active configurations visible; use Inactive status for configurations you may reuse

Performance optimization:

Use medium warehouses for most use cases; reserve larger warehouses for complex data sources
Use default (maximum) threads unless memory constrained
Start with single procedure call (PROCEDURE_INSTANCE_COUNT = 1); use multiple calls only if exceeding 60 minutes

Scheduling strategy:

Schedule during low-usage periods
Avoid overlapping scans on same warehouse
Use SCAN_WHERE_CLAUSE with timestamp filters for incremental scanning
Leverage SOURCE_STREAM_LAST_SCANNED_TIMESTAMP; clear timestamp for full re-scans

Error prevention:

Monitor scan duration; use multiple procedure calls if needed
Start with default settings; reduce thread count or THREAD_PROCESS_RECORD_COUNT if memory errors occur
Validate SCAN_WHERE_CLAUSE and test SCAN_ORDER_BY attributes before use
Ensure PROCEDURE_INSTANCE_ROW_COUNT is accurate when using multiple procedure calls

Last updated 2 months ago

Was this helpful?