# Source Object Settings

Configuration for the Snowflake database object that serves as the data source.

## Overview

Source Object Settings define the connection to the actual Snowflake object containing your data. These settings specify the database, schema, object, and column (for semi-structured data) that DataPancake will scan and process. Additional settings configure platform-specific options for streaming data sources.

***

## Object Type Selection

The type of Snowflake object that contains your data. DataPancake uses this to construct proper SQL queries and grant statements.

### Supported Object Types

**For Semi-Structured Data Sources:**

* **Dynamic Table** - VARIANT columns only
* **External Table** - JSON format only
* **Iceberg Table** - VARIANT or STRING columns, JSON format only (STRING type supports JSON only, not XML)
* **Materialized View** - VARIANT columns
* **Table** - VARIANT columns
* **View** - VARIANT columns

**For Structured Data Sources:**

* **Materialized View**
* **Table**
* **View**

***

## Database, Schema, and Object Selection

The three-part identifier for locating your source object in Snowflake.

### Database Selection

* Dropdown populated from databases accessible to your user (requires USAGE privilege)
* Case-sensitive names are automatically quoted
* Selection required before schema selection

### Schema Selection

* Dropdown populated after database selection
* Shows schemas in selected database (requires USAGE privilege)
* Refreshes when database changes
* Case-sensitive names are automatically quoted

### Object Name Selection

* Dropdown populated after schema selection
* Shows objects of selected type in selected schema
* Only objects matching selected Object Type are shown
* Refreshes when schema or object type changes
* Requires SELECT privilege on object (for semi-structured) or appropriate privileges
* Case-sensitive names are automatically quoted

***

## Column Configuration (Semi-Structured Only)

For semi-structured data sources, you must specify the column containing the VARIANT or String data.

### Column Name

Requirements:

* Column must exist in the source object
* Column must be of type VARIANT or String (VARCHAR/TEXT)
* Required field for semi-structured data sources

Input:

* Text field for column name
* Case-sensitive if column name is quoted in Snowflake
* Can include parsing expressions for String columns

Examples:

* Simple column: `JSON_DATA`
* Quoted column: `"JsonData"`
* Parsing expression: `parse_json(RECORD_CONTENT:"fullDocument")`

### Column Data Type

* **VARIANT** (Recommended)
  * Native Snowflake semi-structured type
  * Automatically parses JSON, Avro, Parquet, ORC, XML
  * Best performance
  * No parsing expressions needed
* **String**
  * Text column containing JSON/XML strings
  * Requires parsing functions in Column Name field
  * Useful for streaming platforms
  * For Iceberg tables: STRING type supports JSON only (not XML)

**Parsing Expressions (String columns only):**

Enter parsing expressions in the Column Name field when using String column type. Example for Kafka/MongoDB:

```sql
parse_json(RECORD_CONTENT:"fullDocument")
```

***

## Format Type (Semi-Structured Only)

The data format stored in the VARIANT or String column. Format type must match actual data format—incorrect format selection may cause scan failures.

### Supported Formats

* **JSON** - Standard JSON format
* **Avro** - Apache Avro binary format
* **Parquet** - Columnar storage format
* **ORC** - Optimized Row Columnar format
* **XML** - XML document format

**Format Restrictions:**

* External Tables: JSON format only
* Iceberg Tables: JSON format only (STRING type supports JSON only, not XML)

***

## Schema Sample (Semi-Structured Only)

Optional JSON or XML document representing the complete schema of your data source.

**Purpose:**

* Use "Schema" attribute create type in scan configurations (skips data scanning)
* Faster initial setup for prototyping and testing
* Can be updated when full scans are performed

**Requirements:**

* Valid JSON or XML document
* Should represent complete schema structure with all attributes and nested structures
* Can be a single representative record

**Usage:** When Attribute Create Type is set to "Schema" in scan configurations, the system uses the schema sample instead of scanning data.

***

## Platform/Application Settings (Semi-Structured Only)

Configuration for data sources from streaming platforms or document databases.

### Source Platform/Application

**Supported Platforms:**

* Kafka — Apache Kafka streaming platform
* DynamoDB — Amazon DynamoDB document database
* Other — Custom platforms can be configured

**Purpose:**

* Identifies data source platform for specialized handling
* Enables platform-specific features (e.g., stream message metadata inclusion)

**Selection:**

* Dropdown of available platforms
* Optional field (can be left blank)
* Only relevant for streaming/document database sources

### Data Deduplication

**Toggle:** Data Deduplication (By Primary Key)

**When Enabled:**

* Creates two dynamic tables:
  1. First table: Flattens root attributes
  2. Second table: Filters to most recent record per primary key

**Requirements:**

* SQL Code Generation feature must be enabled
* Deduplication SQL Expression must be provided

**Use Cases:**

* Streaming data sources with duplicate records
* Need to maintain only latest record per key

### Deduplication SQL Expression

**Format:**

* QUALIFY clause with ROW\_NUMBER() window function
* PARTITION BY primary key
* ORDER BY sort field (typically timestamp or offset)

**Example — Kafka (with Stream Metadata):**

```sql
QUALIFY ROW_NUMBER() OVER (PARTITION BY message_key ORDER BY message_offset DESC) = 1
```

**Example — Timestamp-based:**

```sql
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY event_timestamp DESC) = 1
```

**Requirements:**

* Required when Data Deduplication is enabled
* Must use valid Snowflake SQL syntax
* Should reference attributes available in flattened data

***

## Source Object Validation

The system validates source object connection when creating new data sources, saving settings, or before initiating scans.

**Validation Checks:**

* Object exists and is accessible
* Required privileges are granted
* Column exists (for semi-structured)
* Object type matches selection

**Error Handling:**

* Connection Failed: System displays specific error message and provides SQL statements to grant privileges
* Object Not Found: Error indicates object doesn't exist—verify database, schema, and object names
* Column Not Found: Error indicates column doesn't exist—verify column name spelling and data type

***

## Summary

**Required Settings:**

* Object Type
* Database
* Schema
* Object Name
* Column Name (semi-structured)
* Format Type (semi-structured)
* Column Data Type (semi-structured)

**Optional Settings:**

* Schema Sample
* Source Platform/Application
* Data Deduplication
* Deduplication SQL Expression

For information on data source types, see [Data Source Types](/core-concepts/data-sources/data-source-types.md). For materialization settings, see [Materialization Settings](/core-concepts/data-sources/materialization-settings.md).


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.datapancake.com/core-concepts/data-sources/source-object-settings.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
