# Adding Data Sources

Methods and requirements for adding data sources to DataPancake.

## Overview

Data sources can be added to DataPancake through two methods: the application interface or a SQL file. Both methods require appropriate Snowflake privileges to be granted first.

***

## Prerequisites

### Required Privileges

Before adding a data source, you must grant DataPancake the necessary privileges to access the source object.

For Standard Databases:

```sql
GRANT USAGE ON DATABASE <database_name> TO APPLICATION PANCAKE;
GRANT USAGE ON SCHEMA <database_name>.<schema_name> TO APPLICATION PANCAKE;
GRANT REFERENCES, SELECT ON <object_type> <database_name>.<schema_name>.<object_name> TO APPLICATION PANCAKE;
```

For Shared Databases:

```sql
GRANT USAGE ON DATABASE <database_name> TO APPLICATION PANCAKE;
GRANT USAGE ON SCHEMA <database_name>.<schema_name> TO APPLICATION PANCAKE;
GRANT IMPORTED PRIVILEGES ON DATABASE <database_name> TO APPLICATION PANCAKE;
```

Object Types:

* `TABLE` for tables
* `VIEW` for views
* `MATERIALIZED VIEW` for materialized views
* `EXTERNAL TABLE` for external tables
* `DYNAMIC TABLE` for dynamic tables
* `ICEBERG TABLE` for Iceberg tables

### Privilege Verification

The system validates privileges when:

* Creating a new data source
* Saving data source settings
* Attempting to scan the data source

If privileges are missing, you'll receive an error message with the exact SQL statements needed to grant access.

***

## Adding from Application Interface

The application interface provides a guided experience for adding data sources with validation and error handling.

{% stepper %}
{% step %}

#### Navigate to Data Sources

* Open the DataPancake application
* Navigate to "Data Source Management" → "Data Sources"
  {% endstep %}

{% step %}

#### Select "New" Action

* Use the sidebar action selector
* Choose "New" to create a new data source
  {% endstep %}

{% step %}

#### Configure Basic Settings

* Enter Data Source Name (required, unique)
* Select Data Source Type: Semi-Structured or Structured
* Set Status: Active, Inactive, or Deleted
  {% endstep %}

{% step %}

#### Configure Source Object (Semi-Structured)

* Select Object Type (Table, View, External Table, etc.)
* Select Format Type (JSON, Avro, Parquet, ORC, XML)
* Select Column Data Type (VARIANT or String)
* Select Database
* Select Schema
* Select Object Name
* Enter Column Name (VARIANT/String column name)
  {% endstep %}

{% step %}

#### Configure Source Object (Structured)

* Select Object Type
* Select Database
* Select Schema
* Select Object Name
  {% endstep %}

{% step %}

#### Optional Configuration

* Add Data Source Tags
* Configure Product Tiers/Features
* Add Schema Sample (for Semi-Structured)
* Configure Materialization Settings (if SQL Code Generation enabled)
  {% endstep %}

{% step %}

#### Save Data Source

* Click "Save Data Source" button
* System validates connection and required fields
* Success message confirms data source creation
  {% endstep %}
  {% endstepper %}

### Validation

The application validates:

* **Required Fields** - All required fields must be provided
* **Unique Name** - Data source name must be unique
* **Connection** - System attempts to connect to source object
* **Privileges** - Verifies required privileges are granted

### Error Handling

Connection Failures:

* System displays specific error message
* Provides SQL statements to grant required privileges
* Allows correction and retry

Validation Errors:

* Missing required fields are highlighted
* Error messages explain what's needed
* Sidebar shows list of missing fields

Duplicate Names:

* Error message indicates duplicate name
* Must choose a different name

***

## Adding via SQL file

SQL file addition enables programmatic data source creation using stored procedures. This method is ideal for bulk operations, automation, and integration with other systems.

### Stored Procedure: core.add\_datasource\_with\_scan

The primary stored procedure for adding data sources with optional initial scan.

Signature:

```sql
CALL core.add_datasource_with_scan(
    DATASOURCE_NAME string,
    DATASOURCE_TYPE string,
    DATASOURCE_FORMAT string,
    ATTRIBUTE_CREATE_TYPE string,
    DATASOURCE_SCHEMA_SAMPLE string,
    SOURCE_DATA_PLATFORM string,
    OBJECT_TYPE string,
    OBJECT_DATABASE_NAME string,
    OBJECT_SCHEMA_NAME string,
    OBJECT_NAME string,
    OBJECT_COLUMN_DATA_TYPE string,
    OBJECT_COLUMN_NAME string,
    INITIATE_SCAN boolean,
    SCAN_WAREHOUSE_NAME string,
    SCAN_RECORD_LIMIT int,
    BASE_SCAN_WAREHOUSE_THREAD_COUNT int,
    BASE_SCAN_RECORDS_PER_MINUTE int,
    DATASOURCE_TAG string
)
```

### Parameters

Required Parameters:

* `DATASOURCE_NAME` - Unique name for the data source
* `DATASOURCE_TYPE` - `'Semi-Structured'` or `'Structured'`
* `DATASOURCE_FORMAT` - Format type: `'JSON'`, `'Avro'`, `'Parquet'`, `'ORC'`, or `'XML'` (Semi-Structured only)
* `ATTRIBUTE_CREATE_TYPE` - `'Discover'` (scan data) or `'Schema'` (use schema sample)
* `OBJECT_TYPE` - `'Table'`, `'View'`, `'External Table'`, `'Dynamic Table'`, `'Materialized View'`, or `'Iceberg Table'`
* `OBJECT_DATABASE_NAME` - Database name containing the source object
* `OBJECT_SCHEMA_NAME` - Schema name containing the source object
* `OBJECT_NAME` - Name of the source object
* `OBJECT_COLUMN_NAME` - Column name (required for Semi-Structured)
* `OBJECT_COLUMN_DATA_TYPE` - `'Variant'` or `'String'` (required for Semi-Structured)

Optional Parameters:

* `DATASOURCE_SCHEMA_SAMPLE` - JSON/XML schema sample (for Schema attribute creation type)
* `SOURCE_DATA_PLATFORM` - `'Kafka'` or `'DynamoDB'` (for streaming platforms)
* `INITIATE_SCAN` - `TRUE` to perform initial scan, `FALSE` to skip
* `SCAN_WAREHOUSE_NAME` - Warehouse name for initial scan
* `SCAN_RECORD_LIMIT` - Number of records to scan (0 = unlimited, default 150 for quick scan)
* `BASE_SCAN_WAREHOUSE_THREAD_COUNT` - Thread count for baseline (default 8)
* `BASE_SCAN_RECORDS_PER_MINUTE` - Records per minute for baseline (default 10000)
* `DATASOURCE_TAG` - Optional tags for organization

### Example: Basic Semi-Structured Data Source

```sql
DECLARE 
    datasource_name VARCHAR DEFAULT 'PRODUCTION.MEDICAL_DEVICE.UDI_JSON'; 
    object_type VARCHAR DEFAULT 'Table'; 
    database_name VARCHAR DEFAULT 'PRODUCTION'; 
    schema_name VARCHAR DEFAULT 'MEDICAL_DEVICE'; 
    object_name VARCHAR DEFAULT 'UDI_JSON'; 
    column_name VARCHAR DEFAULT 'JSON_DATA'; 
    datasource_tag VARCHAR DEFAULT ''; 
    warehouse_name VARCHAR DEFAULT 'PANCAKE_X_SMALL_01'; 
    record_limit INT DEFAULT 150; 
    initiate_scan BOOLEAN DEFAULT TRUE; 
    call_result VARCHAR;
BEGIN 
    CALL PANCAKE.core.add_datasource_with_scan( 
        :datasource_name,
        'Semi-Structured',
        'JSON',
        'Discover',
        '',
        '',
        :object_type,
        :database_name,
        :schema_name,
        :object_name,
        'Variant',
        :column_name,
        :initiate_scan,
        :warehouse_name,
        :record_limit,
        8,
        10000,
        :datasource_tag
    ) INTO :call_result;
    
    RETURN :call_result;
END;
```

### Example: Full Scan on Creation

```sql
DECLARE 
    datasource_name VARCHAR DEFAULT 'PRODUCTION.MEDICAL_DEVICE.UDI_JSON'; 
    warehouse_name VARCHAR DEFAULT 'PANCAKE_MEDIUM_OPTIMIZED_01'; 
    record_limit INT DEFAULT 0; -- 0 = unlimited
    call_result VARCHAR;
BEGIN 
    CALL PANCAKE.core.add_datasource_with_scan( 
        :datasource_name,
        'Semi-Structured',
        'JSON',
        'Discover',
        '',
        '',
        'Table',
        'PRODUCTION',
        'MEDICAL_DEVICE',
        'UDI_JSON',
        'Variant',
        'JSON_DATA',
        TRUE,
        :warehouse_name,
        :record_limit,
        8,
        10000,
        ''
    ) INTO :call_result;
    
    RETURN :call_result;
END;
```

### Example: Using Schema Sample

```sql
DECLARE 
    schema_sample VARCHAR DEFAULT '{"id": "string", "name": "string", "items": [{"item_id": "string"}]}';
    call_result VARCHAR;
BEGIN 
    CALL PANCAKE.core.add_datasource_with_scan( 
        'PRODUCTION.MEDICAL_DEVICE.UDI_JSON',
        'Semi-Structured',
        'JSON',
        'Schema', -- Use schema sample
        :schema_sample,
        '',
        'Table',
        'PRODUCTION',
        'MEDICAL_DEVICE',
        'UDI_JSON',
        'Variant',
        'JSON_DATA',
        FALSE, -- Don't initiate scan
        '',
        0,
        8,
        10000,
        ''
    ) INTO :call_result;
    
    RETURN :call_result;
END;
```

### Example: Structured Data Source

```sql
DECLARE 
    call_result VARCHAR;
BEGIN 
    CALL PANCAKE.core.add_datasource_with_scan( 
        'PRODUCTION.CUSTOMER_DATA.ORDERS_TABLE',
        'Structured',
        '', -- No format for structured
        'Discover',
        '',
        '',
        'Table',
        'PRODUCTION',
        'CUSTOMER_DATA',
        'ORDERS_TABLE',
        '', -- No column for structured
        '', -- No column name for structured
        TRUE,
        'PANCAKE_X_SMALL_01',
        150,
        8,
        10000,
        ''
    ) INTO :call_result;
    
    RETURN :call_result;
END;
```

### Return Values

The stored procedure returns a string indicating success or failure:

* Success: Returns success message
* Error: Returns error message with details
* Connection Failed: Returns connection error with privilege grant statements
* Duplicate Name: Returns error indicating duplicate data source name

### Bulk Data Source Creation

You can use SQL scripts to find and create data sources in bulk:

```sql
-- Example: Find all tables with VARIANT columns and generate add statements
SELECT 
    'CALL PANCAKE.core.add_datasource_with_scan(' ||
    '''' || table_catalog || '.' || table_schema || '.' || table_name || ''',' ||
    '''Semi-Structured'',' ||
    '''JSON'',' ||
    '''Discover'',' ||
    '''',' ||
    '''',' ||
    '''Table'',' ||
    '''' || table_catalog || ''',' ||
    '''' || table_schema || ''',' ||
    '''' || table_name || ''',' ||
    '''Variant'',' ||
    '''' || column_name || ''',' ||
    'TRUE,' ||
    '''PANCAKE_X_SMALL_01'',' ||
    '150,' ||
    '8,' ||
    '10000,' ||
    ''''');' AS add_statement
FROM information_schema.columns
WHERE data_type = 'VARIANT'
AND table_schema NOT LIKE 'PANCAKE%';
```

***

## Initial Quick Scan

When adding a data source, you can optionally perform an initial "quick scan" to discover the schema immediately.

### Quick Scan Configuration

Default Settings:

* Record Limit: 150 records
* Warehouse: PANCAKE\_X\_SMALL\_01 (or specified warehouse)
* Attribute Create Type: Discover (scans actual data)

### Quick Scan Benefits

* Immediate Schema Discovery — See schema structure right away
* Fast Onboarding — Quick validation of data source setup
* Cost Effective — Small record limit keeps costs low
* Baseline Setup — Establishes initial attribute metadata

### When to Skip Initial Scan

Skip the initial scan when:

* You have a schema sample and want to use Schema attribute creation type
* You want to configure materialization settings first
* You're creating data sources in bulk and will scan later
* You want to review configuration before scanning

### Full Scan During Creation

To perform a full scan when creating a data source:

* Set `SCAN_RECORD_LIMIT` to `0` (unlimited) or a number greater than total records
* Use a Medium or larger warehouse (e.g., `PANCAKE_MEDIUM_OPTIMIZED_01`)
* Ensure warehouse has sufficient resources for the data size

Note: Full scans during creation may take significant time and cost. Consider using quick scan first, then performing full scans via scan configurations.

***

***

## Troubleshooting

<details>

<summary>Connection Failed</summary>

* Verify privileges are granted correctly
* Check object exists and is accessible
* For shared databases, use `GRANT IMPORTED PRIVILEGES`
* Verify database and schema names are correct

</details>

<details>

<summary>Duplicate Name</summary>

* Choose a different, unique name
* Check existing data sources for similar names
* Use more specific naming with database/schema context

</details>

<details>

<summary>Invalid Format</summary>

* Verify format type matches actual data format
* Check column data type (VARIANT vs String)
* For String columns, ensure parsing expression is correct

</details>

<details>

<summary>Scan Failures</summary>

* Verify warehouse exists and is accessible
* Check warehouse size is appropriate for data volume
* Review scan error messages for specific issues
* Start with smaller record limits

</details>

***

## Summary

Methods:

* Application Interface (guided experience)
* SQL file (programmatic/bulk)

Requirements:

* Appropriate Snowflake privileges
* Valid source object
* Unique data source name

Initial Scan:

* Optional quick scan (150 records default)
* Can skip and scan later
* Can perform full scan on creation

**Note:** Start with quick scan (default 150 records), then configure and perform full scans via scan configurations.

For information on data source types, see [Data Source Types](/core-concepts/data-sources/data-source-types.md). For basic configuration, see [Basic Configuration Settings](/core-concepts/data-sources/basic-configuration-settings.md).


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.datapancake.com/core-concepts/data-sources/adding-data-sources.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
