# Source Object Settings

Configuration for the Snowflake database object that serves as the data source.

## Overview

Source Object Settings define the connection to the actual Snowflake object containing your data. These settings specify the database, schema, object, and column (for semi-structured data) that DataPancake will scan and process. Additional settings configure platform-specific options for streaming data sources.

***

## Object Type Selection

The type of Snowflake object that contains your data. DataPancake uses this to construct proper SQL queries and grant statements.

### Supported Object Types

**For Semi-Structured Data Sources:**

* **Dynamic Table** - VARIANT columns only
* **External Table** - JSON format only
* **Iceberg Table** - VARIANT or STRING columns, JSON format only
* **Materialized View** - VARIANT columns
* **Table** - VARIANT columns
* **View** - VARIANT columns

**For Structured Data Sources:**

* **Materialized View**
* **Table**
* **View**

***

## Database, Schema, and Object Selection

The three-part identifier for locating your source object in Snowflake.

### Database Selection

* Dropdown populated from databases accessible to your user (requires USAGE privilege)
* Case-sensitive names are automatically quoted
* Selection required before schema selection

### Schema Selection

* Dropdown populated after database selection
* Shows schemas in selected database (requires USAGE privilege)
* Refreshes when database changes
* Case-sensitive names are automatically quoted

### Object Name Selection

* Dropdown populated after schema selection
* Shows objects of selected type in selected schema
* Only objects matching selected Object Type are shown
* Refreshes when schema or object type changes
* Requires SELECT privilege on the object
* Case-sensitive names are automatically quoted
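The privilege requirements above can be granted with standard Snowflake GRANT statements. A minimal sketch, assuming a hypothetical role `DATAPANCAKE_ROLE` and a source table `MY_DB.RAW.EVENTS` (all names are placeholders):

```sql
-- Role, database, schema, and table names are illustrative placeholders
GRANT USAGE ON DATABASE MY_DB TO ROLE DATAPANCAKE_ROLE;
GRANT USAGE ON SCHEMA MY_DB.RAW TO ROLE DATAPANCAKE_ROLE;
GRANT SELECT ON TABLE MY_DB.RAW.EVENTS TO ROLE DATAPANCAKE_ROLE;
```

If the object is a view or materialized view, grant SELECT on that object type instead of TABLE.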

***

## Column Configuration (Semi-Structured Only)

For semi-structured data sources, you must specify the column containing the VARIANT or String data.

### Column Name

Requirements:

* Column must exist in the source object
* Column must be of type VARIANT or String (VARCHAR/TEXT)
* Required field for semi-structured data sources

Input:

* Text field for column name
* Case-sensitive if column name is quoted in Snowflake
* Can include parsing expressions for String columns

Examples:

* Simple column: `JSON_DATA`
* Quoted column: `"JsonData"`
* Parsing expression: `parse_json(RECORD_CONTENT:"fullDocument")`

### Column Data Type

* **VARIANT** (Recommended)
  * Native Snowflake semi-structured type
  * Automatically parses JSON, Avro, Parquet, ORC, XML
  * Best performance
  * No parsing expressions needed
* **String**
  * Text column containing JSON/XML strings
  * Requires parsing functions in Column Name field
  * Useful for streaming platforms
  * For Iceberg tables: STRING type supports JSON only (not XML)

**Parsing Expressions (String columns only):**

Enter parsing expressions in the Column Name field when using String column type. Example for Kafka/MongoDB:

```sql
parse_json(RECORD_CONTENT:"fullDocument")
```
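For XML payloads stored in a String column (supported on non-Iceberg object types), the analogous expression would use Snowflake's `PARSE_XML` function. Snowflake also provides `TRY_PARSE_JSON`, which returns NULL instead of raising an error on malformed rows; the column name `RECORD_CONTENT` below is illustrative:

```sql
-- XML stored as text (non-Iceberg sources)
parse_xml(RECORD_CONTENT)

-- Tolerant JSON parsing: malformed rows yield NULL rather than a scan error
try_parse_json(RECORD_CONTENT)
```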

***

## Format Type (Semi-Structured Only)

The data format stored in the VARIANT or String column. Format type must match actual data format—incorrect format selection may cause scan failures.

### Supported Formats

* **JSON** - Standard JSON format
* **Avro** - Apache Avro binary format
* **Parquet** - Columnar storage format
* **ORC** - Optimized Row Columnar format
* **XML** - XML document format

**Format Restrictions:**

* External Tables: JSON format only
* Iceberg Tables: JSON format only, for both VARIANT and STRING columns

***

## Schema Sample (Semi-Structured Only)

Optional JSON or XML document representing the complete schema of your data source.

**Purpose:**

* Enables the "Schema" Attribute Create Type in scan configurations (skips data scanning)
* Faster initial setup for prototyping and testing
* Can be updated when full scans are performed

**Requirements:**

* Valid JSON or XML document
* Should represent complete schema structure with all attributes and nested structures
* Can be a single representative record

**Usage:** When Attribute Create Type is set to "Schema" in scan configurations, the system uses the schema sample instead of scanning data.
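A schema sample can be a single representative record, as long as it exercises every attribute and nesting level present in the data. A hypothetical example for a customer-event source:

```json
{
  "customer_id": "C-1001",
  "event_timestamp": "2024-01-15T09:30:00Z",
  "order": {
    "order_id": 555,
    "items": [
      { "sku": "ABC-123", "quantity": 2, "price": 19.99 }
    ]
  },
  "tags": ["new", "priority"]
}
```

Attributes omitted from the sample will not be discovered until a full data scan is performed.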

***

## Platform/Application Settings (Semi-Structured Only)

Configuration for data sources from streaming platforms or document databases.

### Source Platform/Application

**Supported Platforms:**

* Kafka — Apache Kafka streaming platform
* DynamoDB — Amazon DynamoDB document database
* Other — Custom platforms can be configured

**Purpose:**

* Identifies data source platform for specialized handling
* Enables platform-specific features (e.g., stream message metadata inclusion)

**Selection:**

* Dropdown of available platforms
* Optional field (can be left blank)
* Only relevant for streaming/document database sources

### Data Deduplication

**Toggle:** Data Deduplication (By Primary Key)

**When Enabled:**

* Creates two dynamic tables:
  1. First table: Flattens root attributes
  2. Second table: Filters to most recent record per primary key

**Requirements:**

* SQL Code Generation feature must be enabled
* Deduplication SQL Expression must be provided

**Use Cases:**

* Streaming data sources with duplicate records
* Need to maintain only latest record per key
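The two generated dynamic tables can be sketched roughly as follows. This is an illustration only, not DataPancake's actual generated SQL; the table names, source column (`JSON_DATA`), warehouse, and lag values are all hypothetical:

```sql
-- Sketch only: names, TARGET_LAG, and warehouse are placeholders
-- Table 1: flatten root attributes from the source column
CREATE OR REPLACE DYNAMIC TABLE EVENTS_FLATTENED
  TARGET_LAG = '5 minutes' WAREHOUSE = MY_WH AS
SELECT
  JSON_DATA:"customer_id"::STRING             AS customer_id,
  JSON_DATA:"event_timestamp"::TIMESTAMP_NTZ  AS event_timestamp,
  JSON_DATA                                   AS raw_record
FROM MY_DB.RAW.EVENTS;

-- Table 2: keep only the most recent record per primary key
CREATE OR REPLACE DYNAMIC TABLE EVENTS_DEDUPED
  TARGET_LAG = '5 minutes' WAREHOUSE = MY_WH AS
SELECT *
FROM EVENTS_FLATTENED
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY event_timestamp DESC) = 1;
```

The QUALIFY clause in the second table is where the Deduplication SQL Expression (described below) is applied.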

### Deduplication SQL Expression

**Format:**

* QUALIFY clause with ROW\_NUMBER() window function
* PARTITION BY primary key
* ORDER BY sort field (typically timestamp or offset)

**Example — Kafka (with Stream Metadata):**

```sql
QUALIFY ROW_NUMBER() OVER (PARTITION BY message_key ORDER BY message_offset DESC) = 1
```

**Example — Timestamp-based:**

```sql
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY event_timestamp DESC) = 1
```

**Requirements:**

* Required when Data Deduplication is enabled
* Must use valid Snowflake SQL syntax
* Should reference attributes available in flattened data

***

## Source Object Validation

The system validates source object connection when creating new data sources, saving settings, or before initiating scans.

**Validation Checks:**

* Object exists and is accessible
* Required privileges are granted
* Column exists (for semi-structured)
* Object type matches selection

**Error Handling:**

* Connection Failed: System displays specific error message and provides SQL statements to grant privileges
* Object Not Found: Error indicates object doesn't exist—verify database, schema, and object names
* Column Not Found: Error indicates column doesn't exist—verify column name spelling and data type
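When troubleshooting these errors, standard Snowflake metadata commands can confirm that the object and column exist as named (the database, schema, and table names below are placeholders):

```sql
-- Confirm the object exists under the expected database and schema
SHOW TABLES LIKE 'EVENTS' IN SCHEMA MY_DB.RAW;

-- Confirm the column name and data type (case matters for quoted identifiers)
DESCRIBE TABLE MY_DB.RAW.EVENTS;
```

Note that an unquoted identifier is stored uppercase in Snowflake, so a column created as `"JsonData"` will not match a reference to `JSONDATA`.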

***

## Summary

**Required Settings:**

* Object Type
* Database
* Schema
* Object Name
* Column Name (semi-structured)
* Format Type (semi-structured)
* Column Data Type (semi-structured)

**Optional Settings:**

* Schema Sample
* Source Platform/Application
* Data Deduplication
* Deduplication SQL Expression

For information on data source types, see [Data Source Types](https://docs.datapancake.com/core-concepts/data-sources/data-source-types). For materialization settings, see [Materialization Settings](https://docs.datapancake.com/core-concepts/data-sources/materialization-settings).
