# Adding Data Sources

Methods and requirements for adding data sources to DataPancake.

## Overview

Data sources can be added to DataPancake through two methods: the application interface or a SQL file. Both methods require appropriate Snowflake privileges to be granted first.

***

## Prerequisites

### Required Privileges

Before adding a data source, you must grant DataPancake the necessary privileges to access the source object.

For Standard Databases:

```sql
GRANT USAGE ON DATABASE <database_name> TO APPLICATION PANCAKE;
GRANT USAGE ON SCHEMA <database_name>.<schema_name> TO APPLICATION PANCAKE;
GRANT REFERENCES, SELECT ON <object_type> <database_name>.<schema_name>.<object_name> TO APPLICATION PANCAKE;
```

For Shared Databases:

```sql
GRANT USAGE ON DATABASE <database_name> TO APPLICATION PANCAKE;
GRANT USAGE ON SCHEMA <database_name>.<schema_name> TO APPLICATION PANCAKE;
GRANT IMPORTED PRIVILEGES ON DATABASE <database_name> TO APPLICATION PANCAKE;
```

Object Types:

* `TABLE` for tables
* `VIEW` for views
* `MATERIALIZED VIEW` for materialized views
* `EXTERNAL TABLE` for external tables
* `DYNAMIC TABLE` for dynamic tables
* `ICEBERG TABLE` for Iceberg tables

### Privilege Verification

The system validates privileges when:

* Creating a new data source
* Saving data source settings
* Attempting to scan the data source

If privileges are missing, you'll receive an error message with the exact SQL statements needed to grant access.
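
You can also inspect what has already been granted before the system validates it. One way in Snowflake (assuming the application is installed under the name `PANCAKE`, as in the grant statements above) is:

```sql
-- List all privileges currently granted to the application
SHOW GRANTS TO APPLICATION PANCAKE;
```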

***

## Adding from Application Interface

The application interface provides a guided experience for adding data sources with validation and error handling.

{% stepper %}
{% step %}

#### Navigate to Data Sources

* Open the DataPancake application
* Navigate to "Data Source Management" → "Data Sources"
  {% endstep %}

{% step %}

#### Select "New" Action

* Use the sidebar action selector
* Choose "New" to create a new data source
  {% endstep %}

{% step %}

#### Configure Basic Settings

* Enter Data Source Name (required, unique)
* Select Data Source Type: Semi-Structured or Structured
* Set Status: Active, Inactive, or Deleted
  {% endstep %}

{% step %}

#### Configure Source Object (Semi-Structured)

* Select Object Type (Table, View, External Table, etc.)
* Select Format Type (JSON, Avro, Parquet, ORC, XML)
* Select Column Data Type (VARIANT or String)
* Select Database
* Select Schema
* Select Object Name
* Enter Column Name (VARIANT/String column name)
  {% endstep %}

{% step %}

#### Configure Source Object (Structured)

* Select Object Type
* Select Database
* Select Schema
* Select Object Name
  {% endstep %}

{% step %}

#### Optional Configuration

* Add Data Source Tags
* Configure Product Tiers/Features
* Add Schema Sample (for Semi-Structured)
* Configure Materialization Settings (if SQL Code Generation enabled)
  {% endstep %}

{% step %}

#### Save Data Source

* Click "Save Data Source" button
* System validates connection and required fields
* Success message confirms data source creation
  {% endstep %}
  {% endstepper %}

### Validation

The application validates:

* **Required Fields** - All required fields must be provided
* **Unique Name** - Data source name must be unique
* **Connection** - System attempts to connect to source object
* **Privileges** - Verifies required privileges are granted

### Error Handling

Connection Failures:

* System displays specific error message
* Provides SQL statements to grant required privileges
* Allows correction and retry

Validation Errors:

* Missing required fields are highlighted
* Error messages explain what's needed
* Sidebar shows list of missing fields

Duplicate Names:

* Error message indicates duplicate name
* Must choose a different name

***

## Adding via SQL File

Adding data sources via a SQL file enables programmatic creation using stored procedures. This method is well suited to bulk operations, automation, and integration with other systems.

### Stored Procedure: `core.add_datasource_with_scan`

The primary stored procedure for adding data sources, with an optional initial scan.

Signature:

```sql
CALL core.add_datasource_with_scan(
    DATASOURCE_NAME string,
    DATASOURCE_TYPE string,
    DATASOURCE_FORMAT string,
    ATTRIBUTE_CREATE_TYPE string,
    DATASOURCE_SCHEMA_SAMPLE string,
    SOURCE_DATA_PLATFORM string,
    OBJECT_TYPE string,
    OBJECT_DATABASE_NAME string,
    OBJECT_SCHEMA_NAME string,
    OBJECT_NAME string,
    OBJECT_COLUMN_DATA_TYPE string,
    OBJECT_COLUMN_NAME string,
    INITIATE_SCAN boolean,
    SCAN_WAREHOUSE_NAME string,
    SCAN_RECORD_LIMIT int,
    BASE_SCAN_WAREHOUSE_THREAD_COUNT int,
    BASE_SCAN_RECORDS_PER_MINUTE int,
    DATASOURCE_TAG string
)
```

### Parameters

Required Parameters:

* `DATASOURCE_NAME` - Unique name for the data source
* `DATASOURCE_TYPE` - `'Semi-Structured'` or `'Structured'`
* `DATASOURCE_FORMAT` - Format type: `'JSON'`, `'Avro'`, `'Parquet'`, `'ORC'`, or `'XML'` (Semi-Structured only)
* `ATTRIBUTE_CREATE_TYPE` - `'Discover'` (scan data) or `'Schema'` (use schema sample)
* `OBJECT_TYPE` - `'Table'`, `'View'`, `'External Table'`, `'Dynamic Table'`, `'Materialized View'`, or `'Iceberg Table'`
* `OBJECT_DATABASE_NAME` - Database name containing the source object
* `OBJECT_SCHEMA_NAME` - Schema name containing the source object
* `OBJECT_NAME` - Name of the source object
* `OBJECT_COLUMN_DATA_TYPE` - `'Variant'` or `'String'` (required for Semi-Structured)
* `OBJECT_COLUMN_NAME` - Column name (required for Semi-Structured)

Optional Parameters:

* `DATASOURCE_SCHEMA_SAMPLE` - JSON/XML schema sample (for Schema attribute creation type)
* `SOURCE_DATA_PLATFORM` - `'Kafka'` or `'DynamoDB'` (for streaming platforms)
* `INITIATE_SCAN` - `TRUE` to perform initial scan, `FALSE` to skip
* `SCAN_WAREHOUSE_NAME` - Warehouse name for initial scan
* `SCAN_RECORD_LIMIT` - Number of records to scan (0 = unlimited, default 150 for quick scan)
* `BASE_SCAN_WAREHOUSE_THREAD_COUNT` - Thread count for baseline (default 8)
* `BASE_SCAN_RECORDS_PER_MINUTE` - Records per minute for baseline (default 10000)
* `DATASOURCE_TAG` - Optional tags for organization

### Example: Basic Semi-Structured Data Source

```sql
DECLARE 
    datasource_name VARCHAR DEFAULT 'PRODUCTION.MEDICAL_DEVICE.UDI_JSON'; 
    object_type VARCHAR DEFAULT 'Table'; 
    database_name VARCHAR DEFAULT 'PRODUCTION'; 
    schema_name VARCHAR DEFAULT 'MEDICAL_DEVICE'; 
    object_name VARCHAR DEFAULT 'UDI_JSON'; 
    column_name VARCHAR DEFAULT 'JSON_DATA'; 
    datasource_tag VARCHAR DEFAULT ''; 
    warehouse_name VARCHAR DEFAULT 'PANCAKE_X_SMALL_01'; 
    record_limit INT DEFAULT 150; 
    initiate_scan BOOLEAN DEFAULT TRUE; 
    call_result VARCHAR;
BEGIN 
    CALL PANCAKE.core.add_datasource_with_scan( 
        :datasource_name,
        'Semi-Structured',
        'JSON',
        'Discover',
        '',
        '',
        :object_type,
        :database_name,
        :schema_name,
        :object_name,
        'Variant',
        :column_name,
        :initiate_scan,
        :warehouse_name,
        :record_limit,
        8,
        10000,
        :datasource_tag
    ) INTO :call_result;
    
    RETURN :call_result;
END;
```

### Example: Full Scan on Creation

```sql
DECLARE 
    datasource_name VARCHAR DEFAULT 'PRODUCTION.MEDICAL_DEVICE.UDI_JSON'; 
    warehouse_name VARCHAR DEFAULT 'PANCAKE_MEDIUM_OPTIMIZED_01'; 
    record_limit INT DEFAULT 0; -- 0 = unlimited
    call_result VARCHAR;
BEGIN 
    CALL PANCAKE.core.add_datasource_with_scan( 
        :datasource_name,
        'Semi-Structured',
        'JSON',
        'Discover',
        '',
        '',
        'Table',
        'PRODUCTION',
        'MEDICAL_DEVICE',
        'UDI_JSON',
        'Variant',
        'JSON_DATA',
        TRUE,
        :warehouse_name,
        :record_limit,
        8,
        10000,
        ''
    ) INTO :call_result;
    
    RETURN :call_result;
END;
```

### Example: Using Schema Sample

```sql
DECLARE 
    schema_sample VARCHAR DEFAULT '{"id": "string", "name": "string", "items": [{"item_id": "string"}]}';
    call_result VARCHAR;
BEGIN 
    CALL PANCAKE.core.add_datasource_with_scan( 
        'PRODUCTION.MEDICAL_DEVICE.UDI_JSON',
        'Semi-Structured',
        'JSON',
        'Schema', -- Use schema sample
        :schema_sample,
        '',
        'Table',
        'PRODUCTION',
        'MEDICAL_DEVICE',
        'UDI_JSON',
        'Variant',
        'JSON_DATA',
        FALSE, -- Don't initiate scan
        '',
        0,
        8,
        10000,
        ''
    ) INTO :call_result;
    
    RETURN :call_result;
END;
```

### Example: Structured Data Source

```sql
DECLARE 
    call_result VARCHAR;
BEGIN 
    CALL PANCAKE.core.add_datasource_with_scan( 
        'PRODUCTION.CUSTOMER_DATA.ORDERS_TABLE',
        'Structured',
        '', -- No format for structured
        'Discover',
        '',
        '',
        'Table',
        'PRODUCTION',
        'CUSTOMER_DATA',
        'ORDERS_TABLE',
        '', -- No column data type for structured
        '', -- No column name for structured
        TRUE,
        'PANCAKE_X_SMALL_01',
        150,
        8,
        10000,
        ''
    ) INTO :call_result;
    
    RETURN :call_result;
END;
```

### Return Values

The stored procedure returns a string indicating success or failure:

* Success: Returns success message
* Error: Returns error message with details
* Connection Failed: Returns connection error with privilege grant statements
* Duplicate Name: Returns error indicating duplicate data source name
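
Because the result is a plain string, callers that need to branch on the outcome must inspect the text. A minimal sketch follows; the exact wording of the messages is not specified here, so the `'error'` check is an assumption you should adjust to the actual message text:

```sql
DECLARE
    call_result VARCHAR;
BEGIN
    CALL PANCAKE.core.add_datasource_with_scan(
        'PRODUCTION.MEDICAL_DEVICE.UDI_JSON', 'Semi-Structured', 'JSON', 'Discover',
        '', '', 'Table', 'PRODUCTION', 'MEDICAL_DEVICE', 'UDI_JSON',
        'Variant', 'JSON_DATA', FALSE, '', 0, 8, 10000, ''
    ) INTO :call_result;

    -- Assumption: failure messages contain the word 'error'
    IF (CONTAINS(LOWER(:call_result), 'error')) THEN
        RETURN 'Add failed: ' || :call_result;
    END IF;
    RETURN :call_result;
END;
```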

### Bulk Data Source Creation

You can use SQL scripts to find and create data sources in bulk:

```sql
-- Example: Find all tables with VARIANT columns and generate add statements
SELECT 
    'CALL PANCAKE.core.add_datasource_with_scan(' ||
    '''' || table_catalog || '.' || table_schema || '.' || table_name || ''',' ||
    '''Semi-Structured'',' ||
    '''JSON'',' ||
    '''Discover'',' ||
    '''',' ||
    '''',' ||
    '''Table'',' ||
    '''' || table_catalog || ''',' ||
    '''' || table_schema || ''',' ||
    '''' || table_name || ''',' ||
    '''Variant'',' ||
    '''' || column_name || ''',' ||
    'TRUE,' ||
    '''PANCAKE_X_SMALL_01'',' ||
    '150,' ||
    '8,' ||
    '10000,' ||
    ''''');' AS add_statement
FROM information_schema.columns  -- note: covers only the current database
WHERE data_type = 'VARIANT'
AND table_schema NOT LIKE 'PANCAKE%';
```
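
The query above only generates the `CALL` statements; to run them in a single pass, you can loop over the generated text with a Snowflake Scripting cursor and `EXECUTE IMMEDIATE`. A sketch, reusing the generation query and the same warehouse and default assumptions as above:

```sql
DECLARE
    stmt VARCHAR;
    c1 CURSOR FOR
        SELECT 'CALL PANCAKE.core.add_datasource_with_scan(''' ||
               table_catalog || '.' || table_schema || '.' || table_name ||
               ''',''Semi-Structured'',''JSON'',''Discover'','''','''',''Table'',''' ||
               table_catalog || ''',''' || table_schema || ''',''' || table_name ||
               ''',''Variant'',''' || column_name ||
               ''',TRUE,''PANCAKE_X_SMALL_01'',150,8,10000,'''')' AS add_statement
        FROM information_schema.columns  -- covers only the current database
        WHERE data_type = 'VARIANT'
        AND table_schema NOT LIKE 'PANCAKE%';
BEGIN
    FOR rec IN c1 DO
        stmt := rec.add_statement;
        EXECUTE IMMEDIATE :stmt;  -- one CALL per discovered VARIANT column
    END FOR;
    RETURN 'Bulk add complete';
END;
```

Each iteration performs a quick scan (150 records), so the loop's cost grows linearly with the number of VARIANT columns found; set the scan flag to `FALSE` in the generated text if you prefer to scan later.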

***

## Initial Quick Scan

When adding a data source, you can optionally perform an initial "quick scan" to discover the schema immediately.

### Quick Scan Configuration

Default Settings:

* Record Limit: 150 records
* Warehouse: `PANCAKE_X_SMALL_01` (or a specified warehouse)
* Attribute Create Type: Discover (scans actual data)

### Quick Scan Benefits

* Immediate Schema Discovery — See schema structure right away
* Fast Onboarding — Quick validation of data source setup
* Cost Effective — Small record limit keeps costs low
* Baseline Setup — Establishes initial attribute metadata

### When to Skip Initial Scan

Skip the initial scan when:

* You have a schema sample and want to use Schema attribute creation type
* You want to configure materialization settings first
* You're creating data sources in bulk and will scan later
* You want to review configuration before scanning

### Full Scan During Creation

To perform a full scan when creating a data source:

* Set `SCAN_RECORD_LIMIT` to `0` (unlimited) or a number greater than total records
* Use a Medium or larger warehouse (e.g., `PANCAKE_MEDIUM_OPTIMIZED_01`)
* Ensure warehouse has sufficient resources for the data size

Note: Full scans during creation may take significant time and cost. Consider using quick scan first, then performing full scans via scan configurations.

***

## Troubleshooting

<details>

<summary>Connection Failed</summary>

* Verify privileges are granted correctly
* Check object exists and is accessible
* For shared databases, use `GRANT IMPORTED PRIVILEGES`
* Verify database and schema names are correct

</details>

<details>

<summary>Duplicate Name</summary>

* Choose a different, unique name
* Check existing data sources for similar names
* Use more specific naming with database/schema context

</details>

<details>

<summary>Invalid Format</summary>

* Verify format type matches actual data format
* Check column data type (VARIANT vs String)
* For String columns, ensure parsing expression is correct

</details>

<details>

<summary>Scan Failures</summary>

* Verify warehouse exists and is accessible
* Check warehouse size is appropriate for data volume
* Review scan error messages for specific issues
* Start with smaller record limits

</details>

***

## Summary

Methods:

* Application Interface (guided experience)
* SQL file (programmatic/bulk)

Requirements:

* Appropriate Snowflake privileges
* Valid source object
* Unique data source name

Initial Scan:

* Optional quick scan (150 records default)
* Can skip and scan later
* Can perform full scan on creation

**Note:** Start with quick scan (default 150 records), then configure and perform full scans via scan configurations.

For information on data source types, see [Data Source Types](https://docs.datapancake.com/core-concepts/data-sources/data-source-types). For basic configuration, see [Basic Configuration Settings](https://docs.datapancake.com/core-concepts/data-sources/basic-configuration-settings).
