Automated daily data ingestion from a third-party call-tracking platform into the enterprise data warehouse, eliminating manual data pulls and enabling near-real-time call analytics for sales and operations teams.
The sales and operations team relied on manual CSV exports from a third-party call-tracking platform, causing a 24–48 hour data lag and frequent reconciliation errors. Leadership had no reliable view of call volume trends, agent performance, or campaign attribution.
The source API returned paginated JSON with inconsistent field naming across versions. The existing ad-hoc Python scripts had no retry logic, no idempotency guarantees, and no observability. A historic backfill of 18 months of data was required at launch, with ongoing incremental loads thereafter.
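One way to tame the inconsistent field naming across API versions is a canonical field map applied to every raw record before load. This is an illustrative sketch, not the production code: the field names (`callId`, `startTime`, `trackingNumber`) and the `normalize_record` helper are hypothetical stand-ins for whatever the real API emits.

```python
# Hedged sketch: illustrative field names only. Maps inconsistent API
# field names (e.g. camelCase in one version, snake_case in another)
# onto a single canonical snake_case schema.
CANONICAL_FIELDS = {
    "callId": "call_id",
    "call_id": "call_id",
    "startTime": "start_time",
    "start_time": "start_time",
    "trackingNumber": "tracking_number",
    "tracking_number": "tracking_number",
}

def normalize_record(record: dict) -> dict:
    """Rename a raw API record's fields to the canonical schema,
    dropping any field not in the canonical map."""
    return {
        CANONICAL_FIELDS[key]: value
        for key, value in record.items()
        if key in CANONICAL_FIELDS
    }
```

Centralizing the mapping in one dictionary means a new API version only requires adding entries, not touching transform logic.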
Paginated REST API with OAuth2 authentication. Supports date-range queries for historic and incremental modes.
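The pagination-plus-retry loop can be sketched as a generator over an injected page fetcher. This is a minimal sketch under assumptions: the response shape (`{"records": [...], "has_more": bool}`) and the `iter_pages` / `fetch_page` names are hypothetical, and authentication is assumed to happen inside the injected fetcher.

```python
import time

def iter_pages(fetch_page, start_date, end_date, max_retries=3, base_delay=1.0):
    """Stream records from a paginated, date-range API.

    `fetch_page(start_date, end_date, page)` is an injected callable;
    the assumed response shape is {"records": [...], "has_more": bool}.
    Transient errors are retried with exponential backoff:
    base_delay, then doubling per attempt.
    """
    page = 1
    while True:
        for attempt in range(max_retries + 1):
            try:
                body = fetch_page(start_date, end_date, page)
                break
            except (ConnectionError, TimeoutError):
                if attempt == max_retries:
                    raise  # retries exhausted: surface the failure
                time.sleep(base_delay * (2 ** attempt))
        yield from body["records"]
        if not body.get("has_more"):
            return
        page += 1
```

Injecting the fetcher keeps the pagination and backoff logic testable without network access, and lets historic and incremental modes reuse the same loop with different date ranges.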
The pipeline reads a job configuration JSON that specifies the load mode, date range, and target tables. The orchestration layer selects the appropriate PySpark job and passes parameters. This eliminates code duplication and makes backfills trivial.
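The config-driven dispatch described above can be sketched as a mode-to-job lookup. The config keys, mode names, and job functions here are hypothetical placeholders; the real schema of the `jobs/*.json` files is not reproduced.

```python
import json

# Hypothetical config mirroring the pattern described above; not the
# real schema of the jobs/*.json files.
EXAMPLE_CONFIG = """{
  "load_mode": "incremental",
  "start_date": "2024-01-01",
  "end_date": "2024-01-02",
  "target_table": "ANALYTICS.CALLS"
}"""

def run_historic(cfg):
    return f"historic load into {cfg['target_table']}"

def run_incremental(cfg):
    return f"incremental load into {cfg['target_table']}"

def run_monthly_rehydration(cfg):
    return f"monthly rehydration of {cfg['target_table']}"

LOAD_MODES = {
    "historic": run_historic,
    "incremental": run_incremental,
    "monthly_rehydration": run_monthly_rehydration,
}

def dispatch(config_text):
    """Parse the job config and hand off to the matching load job."""
    cfg = json.loads(config_text)
    mode = cfg.get("load_mode")
    job = LOAD_MODES.get(mode)
    if job is None:
        raise ValueError(f"unknown load_mode: {mode!r}")
    return job(cfg)
```

A backfill then becomes a config change — switch `load_mode` and widen the date range — with no code edits.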
MERGE statements with hash-key deduplication ensure safe re-runs at any stage.
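A minimal sketch of the hash-key MERGE pattern, rendered as a SQL-building helper. The table and column names are illustrative, not the production schema, and `build_merge_sql` / `hash_key` are hypothetical names for this sketch.

```python
def hash_key(alias, business_keys):
    # MD5 over the pipe-joined business key; the separator guards
    # against ambiguous concatenations of adjacent key parts.
    return "MD5(" + " || '|' || ".join(f"{alias}.{k}" for k in business_keys) + ")"

def build_merge_sql(target, staging, business_keys, columns):
    """Render an idempotent Snowflake-style MERGE keyed on the
    business-key hash: a re-run updates existing rows instead of
    inserting duplicates. Names here are illustrative only."""
    on = f"{hash_key('t', business_keys)} = {hash_key('s', business_keys)}"
    set_clause = ", ".join(f"{c} = s.{c}" for c in columns)
    cols = ", ".join(columns)
    src = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {target} t USING {staging} s ON {on}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({src})"
    )
```

Because the match condition depends only on the business key, replaying any failed stage converges to the same target state.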
Databricks Workflows configured with 3 retry attempts and exponential backoff on transient failures.
Structured logging to Snowflake ETL_LOG table with row counts, durations, and error payloads per stage.
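The stage logging described above can be sketched as a decorator. This is a simplified stand-in: the in-memory `ETL_LOG` list substitutes for the Snowflake ETL_LOG table (in production each entry would become an INSERT through the Snowflake connector), and the assumption that a stage returns a sized collection of rows is illustrative, not the production contract.

```python
import functools
import time

# Stand-in for the Snowflake ETL_LOG table (sketch only).
ETL_LOG = []

def log_stage(stage_name):
    """Wrap a pipeline stage with structured logging of row count,
    duration, and error payload per stage."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                rows = fn(*args, **kwargs)
            except Exception as exc:
                ETL_LOG.append({
                    "stage": stage_name,
                    "row_count": 0,
                    "duration_s": time.monotonic() - start,
                    "error": repr(exc),
                })
                raise  # log, then let orchestration handle the failure
            ETL_LOG.append({
                "stage": stage_name,
                "row_count": len(rows),
                "duration_s": time.monotonic() - start,
                "error": None,
            })
            return rows
        return inner
    return wrap
```

Wrapping stages rather than instrumenting them inline keeps logging uniform across all three load modes, and landing the log in a queryable table is what lets the team check pipeline health without opening Databricks.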
Databricks service principal with least-privilege ADLS and Snowflake roles. No user credentials in code.
Azure DevOps pipeline deploys Databricks job JSON configs and Python wheels on merge to main.
Dev → Staging → Prod promotion via Azure DevOps pipeline stages with manual approval gates.
CTM_Pipeline/pipeline_orchestration.py: Main orchestration class managing load mode selection, parameter passing, and stage execution.
jobs/CTM_Daily_Incremental_Load.json: Databricks Workflow JSON defining the incremental load job, cluster config, and retry policy.
azure-pipelines.yml: CI/CD pipeline for deploying Databricks assets across Dev, Staging, and Production.
"Let me walk you through the CTM pipeline — this is a production Databricks solution I built to ingest call-tracking data from a REST API into Snowflake. The business problem was straightforward but painful: the sales team was manually exporting CSVs every morning, which meant a 24-hour data lag and constant reconciliation headaches. Leadership had no reliable view of call trends. The technical challenge was more interesting. The API was paginated, had inconsistent field naming across versions, and we needed to backfill 18 months of historical data at launch — without disrupting the daily incremental loads. Here's how I solved it: I built three load modes — historic, incremental, and monthly rehydration — all driven by a single JSON config file. The orchestration layer reads the config and selects the right PySpark job. This means backfills are trivial and there's zero code duplication. For reliability, every Snowflake load uses a MERGE statement keyed on a hash of the business key. That means the pipeline is fully idempotent — you can re-run it after any failure without creating duplicates. For observability, I wrapped every pipeline stage with a Python decorator that writes structured logs to a Snowflake table — row counts, durations, error payloads. The team can query pipeline health directly in Snowflake without touching Databricks. The CI/CD is Azure DevOps — job configs and Python wheels deploy automatically on merge to main, with manual approval gates for production. In production, this runs daily and processes about 50,000 records per run. The 18-month historic load processed roughly 2 million records in a single run."
"This is a Databricks pipeline I built for REST API ingestion into Snowflake. The key design decisions were: three load modes driven by a single config file, idempotent MERGE patterns for safe re-runs, and structured logging to Snowflake for observability. It eliminated 8 hours of manual work per week and reduced data lag from 24 hours to under 2 hours."
How this solution demonstrates enterprise data platform delivery values:
Designed, built, and operated the pipeline end-to-end — from API authentication to Snowflake load to CI/CD deployment.
The historic backfill required debugging API rate limits, schema inconsistencies, and ADLS permission issues across a 3-day sprint.
Introduced the schema evolution overflow pattern after researching Delta Lake best practices — not required, but made the system forward-compatible.
Documented every design decision in the README so the next engineer could understand and extend the system without asking me.