Data Sources

This page introduces data sources and serves as the navigation hub for all data source documentation.

Overview

Data sources connect to Snowflake database objects and define what data to scan, how to process it, and which materialized objects to generate. A data source is the entry point for all DataPancake operations, from schema discovery through code generation and materialization.

Key Concepts:

  1. Source Object Connection - Links to a specific Snowflake database object and column

  2. Data Type Classification - Semi-Structured (VARIANT/String columns) or Structured (relational)

  3. Product Tier Management - Enable specific features per data source

  4. Materialization Configuration - Define output objects (dynamic tables, views, etc.)

  5. Schema Transformation - Apply consolidations and filters during processing

  6. Baseline Performance - Track scan performance metrics for estimation


Quick Reference

Essential Settings:

  • Data Source Name - Unique identifier (uniqueness is checked case-insensitively)

  • Data Source Type - Semi-Structured or Structured

  • Source Object - Database, schema, object, and column reference

  • Status - Active, Inactive, or Deleted

For Semi-Structured Data Sources:

  • Format Type - JSON, Avro, Parquet, ORC, or XML

  • Column Data Type - VARIANT or String

  • Column Name - Name of the VARIANT/String column (can include parsing expressions for String type)
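For a String column, the column name can embed a parsing expression so the scanner receives semi-structured data rather than raw text. A minimal sketch using Snowflake's built-in PARSE_JSON function (the database, table, and column names are illustrative):

```sql
-- VARIANT column: reference the column directly
SELECT raw_payload
FROM my_db.my_schema.events;

-- String column: wrap the column in a parsing expression so the
-- JSON text is interpreted as semi-structured data during the scan
SELECT PARSE_JSON(json_text)
FROM my_db.my_schema.events_raw;
```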

Product Tiers:

  • Schema Discovery - Always enabled (free tier)

  • Pipeline Designer - Foundation for all paid features (enabled automatically when any other paid feature is enabled)

  • SQL Code Generation - Enables materialization code generation (semi-structured only)

  • Additional Features - Data Dictionary Builder, Security Policy Integration, Semantic Model Generator

Materialization (Semi-Structured Only):

  • Output Object Type - Dynamic Table or Table

  • Root Table Name - Prefix for all generated objects (required when SQL Code Generation enabled)

  • Deployment Location - Database and schema for output objects (defaults to source location if not specified)

  • Dynamic Table Settings - Warehouse, target lag, optional parameters (required for Dynamic Table type)
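The settings above map onto standard Snowflake dynamic table DDL: the root table name prefixes the object name, the deployment location supplies the database and schema, and the warehouse and target lag become refresh parameters. A sketch of what a deployed object might look like (object and column names are illustrative, not DataPancake's actual generated output):

```sql
CREATE OR REPLACE DYNAMIC TABLE analytics_db.marts.root_orders
  TARGET_LAG = '15 minutes'  -- maximum staleness before a refresh
  WAREHOUSE  = transform_wh  -- warehouse that runs the refresh
AS
SELECT
  src.raw_payload:order_id::NUMBER AS order_id,  -- VARIANT path traversal
  src.raw_payload:status::STRING   AS status
FROM source_db.raw.orders AS src;
```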


Data Source Lifecycle

  1. Creation - Add data source via application interface or SQL worksheet (core.add_datasource_with_scan)

  2. Configuration - Set product tiers, materialization settings, and transformations

  3. Initial Scan - Perform quick scan (default 150 records) to discover schema (optional)

  4. Full Scan - Execute comprehensive scan using scan configurations

  5. Code Generation - Generate SQL code for materialized objects (requires SQL Code Generation feature)

  6. Deployment - Execute generated SQL to create output objects

  7. Maintenance - Update settings, add transformations, monitor performance
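The creation step can also be scripted from a SQL worksheet via the core.add_datasource_with_scan procedure named above. Its parameters are not documented on this page, so the argument shown below is a hypothetical placeholder; consult the procedure's actual signature before use:

```sql
-- Hypothetical invocation; the argument list is a placeholder,
-- not the procedure's documented signature.
CALL core.add_datasource_with_scan('orders_events' /* , ... */);
```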

For detailed information on scanning data sources, see the Scan Configurations documentation. For information on generated attributes, see the Attribute Metadata documentation.
