Common Patterns & Best Practices

Pre-configured patterns and optimization recommendations for common scan scenarios.

Common Configuration Patterns

Quick Onboarding Scan:

  • ATTRIBUTE_CREATE_TYPE: 'Schema' (if schema sample available) or 'Discover'

  • SCAN_RECORD_LIMIT: 10,000 (or small sample)

  • PROCEDURE_INSTANCE_COUNT: 1

  • Number of Threads: Default (maximum)

  • CODE_GENERATE_ON_VERSION_CHANGE: FALSE (review first)

  • MONITOR_ENABLED: FALSE

  • Use case: Initial data source setup, testing, quick schema validation

Full Production Scan:

  • ATTRIBUTE_CREATE_TYPE: 'Discover'

  • SCAN_RECORD_LIMIT: 0 (unlimited)

  • PROCEDURE_INSTANCE_COUNT: 1 (or multiple if needed)

  • Number of Threads: Default (maximum)

  • CODE_GENERATE_ON_VERSION_CHANGE: TRUE

  • MONITOR_ENABLED: TRUE (daily/weekly as needed)

  • Use case: Production data sources, complete attribute discovery, automated pipeline updates

Incremental Scan:

  • ATTRIBUTE_CREATE_TYPE: 'Discover'

  • SCAN_RECORD_LIMIT: 0 (unlimited)

  • SCAN_WHERE_CLAUSE: created_date > '<last_scanned_timestamp>'

  • SOURCE_STREAM_LAST_SCANNED_TIMESTAMP: Auto-updated after scans

  • PROCEDURE_INSTANCE_COUNT: 1

  • Number of Threads: Default (maximum)

  • CODE_GENERATE_ON_VERSION_CHANGE: TRUE

  • MONITOR_ENABLED: TRUE (hourly/daily)

  • Use case: Regularly updated data sources, delta scanning workflows, schema monitoring

Large Dataset Scan:

  • ATTRIBUTE_CREATE_TYPE: 'Discover'

  • SCAN_RECORD_LIMIT: 0 (unlimited)

  • PROCEDURE_INSTANCE_COUNT: 4 (or as needed)

  • PROCEDURE_INSTANCE_ROW_COUNT: 1,000,000 (or calculated)

  • SCAN_ORDER_BY: unique_id (or appropriate attribute)

  • Number of Threads: Default (maximum)

  • CODE_GENERATE_ON_VERSION_CHANGE: TRUE

  • MONITOR_ENABLED: TRUE (as needed)

  • Use case: Data sources exceeding 60-minute scan time, very large datasets (millions+ records), complex nested structures

Memory-Constrained Scan:

  • ATTRIBUTE_CREATE_TYPE: 'Discover'

  • SCAN_RECORD_LIMIT: 0 (unlimited) or limited

  • PROCEDURE_INSTANCE_COUNT: 1 or multiple

  • Number of Threads: Reduced (e.g., 2-4 instead of max)

  • THREAD_PROCESS_RECORD_COUNT: Reduced

  • CODE_GENERATE_ON_VERSION_CHANGE: TRUE

  • MONITOR_ENABLED: FALSE or TRUE

  • Use case: Memory errors during scanning, very complex nested structures, large arrays and polymorphic variations, smaller warehouse sizes


Best Practices

Configuration management:

  • Use descriptive, consistent names (e.g., customer_events_daily_full)

  • Create separate configurations for different use cases (quick scan, full scan, incremental scan)

  • Keep only active configurations visible; use Inactive status for configurations you may reuse

Performance optimization:

  • Use medium warehouses for most use cases; reserve larger warehouses for complex data sources

  • Use default (maximum) threads unless memory constrained

  • Start with single procedure call (PROCEDURE_INSTANCE_COUNT = 1); use multiple calls only if exceeding 60 minutes

Scheduling strategy:

  • Schedule during low-usage periods

  • Avoid overlapping scans on same warehouse

  • Use SCAN_WHERE_CLAUSE with timestamp filters for incremental scanning

  • Leverage SOURCE_STREAM_LAST_SCANNED_TIMESTAMP; clear timestamp for full re-scans

Error prevention:

  • Monitor scan duration; use multiple procedure calls if needed

  • Start with default settings; reduce thread count or THREAD_PROCESS_RECORD_COUNT if memory errors occur

  • Validate SCAN_WHERE_CLAUSE and test SCAN_ORDER_BY attributes before use

  • Ensure PROCEDURE_INSTANCE_ROW_COUNT is accurate when using multiple procedure calls

Last updated

Was this helpful?