Automated daily data ingestion from a third-party call-tracking platform into the enterprise data warehouse, eliminating manual data pulls and enabling near-real-time call analytics for sales and operations teams.
The sales and operations team relied on manual CSV exports from a third-party call-tracking platform, causing a 24–48 hour data lag and frequent reconciliation errors. Leadership had no reliable view of call volume trends, agent performance, or campaign attribution.
The source API returned paginated JSON with inconsistent field naming across versions. The existing ad-hoc Python scripts had no retry logic, no idempotency guarantees, and no observability. A historic backfill of 18 months of data was required at launch, with ongoing incremental loads thereafter.
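One way to tame the inconsistent field naming across API versions is a canonical field map applied to every raw record before load. This is an illustrative sketch, not the production code: the field names (`callId`, `startTime`, `trackingNumber`) and the `normalize_record` helper are hypothetical stand-ins for whatever the real API emits.

```python
# Hedged sketch: illustrative field names only. Maps inconsistent API
# field names (e.g. camelCase in one version, snake_case in another)
# onto a single canonical snake_case schema.
CANONICAL_FIELDS = {
    "callId": "call_id",
    "call_id": "call_id",
    "startTime": "start_time",
    "start_time": "start_time",
    "trackingNumber": "tracking_number",
    "tracking_number": "tracking_number",
}

def normalize_record(record: dict) -> dict:
    """Rename a raw API record's fields to the canonical schema,
    dropping any field not in the canonical map."""
    return {
        CANONICAL_FIELDS[key]: value
        for key, value in record.items()
        if key in CANONICAL_FIELDS
    }
```

Centralizing the mapping in one dictionary means a new API version only requires adding entries, not touching transform logic.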
Paginated REST API with OAuth2 authentication. Supports date-range queries for historic and incremental modes.
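The pagination-plus-retry loop can be sketched as a generator over an injected page fetcher. This is a minimal sketch under assumptions: the response shape (`{"records": [...], "has_more": bool}`) and the `iter_pages` / `fetch_page` names are hypothetical, and authentication is assumed to happen inside the injected fetcher.

```python
import time

def iter_pages(fetch_page, start_date, end_date, max_retries=3, base_delay=1.0):
    """Stream records from a paginated, date-range API.

    `fetch_page(start_date, end_date, page)` is an injected callable;
    the assumed response shape is {"records": [...], "has_more": bool}.
    Transient errors are retried with exponential backoff:
    base_delay, then doubling per attempt.
    """
    page = 1
    while True:
        for attempt in range(max_retries + 1):
            try:
                body = fetch_page(start_date, end_date, page)
                break
            except (ConnectionError, TimeoutError):
                if attempt == max_retries:
                    raise  # retries exhausted: surface the failure
                time.sleep(base_delay * (2 ** attempt))
        yield from body["records"]
        if not body.get("has_more"):
            return
        page += 1
```

Injecting the fetcher keeps the pagination and backoff logic testable without network access, and lets historic and incremental modes reuse the same loop with different date ranges.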
The pipeline reads a job configuration JSON that specifies the load mode, date range, and target tables. The orchestration layer selects the appropriate PySpark job and passes parameters. This eliminates code duplication and makes backfills trivial.
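The config-driven dispatch described above can be sketched as a mode-to-job lookup. The config keys, mode names, and job functions here are hypothetical placeholders; the real schema of the `jobs/*.json` files is not reproduced.

```python
import json

# Hypothetical config mirroring the pattern described above; not the
# real schema of the jobs/*.json files.
EXAMPLE_CONFIG = """{
  "load_mode": "incremental",
  "start_date": "2024-01-01",
  "end_date": "2024-01-02",
  "target_table": "ANALYTICS.CALLS"
}"""

def run_historic(cfg):
    return f"historic load into {cfg['target_table']}"

def run_incremental(cfg):
    return f"incremental load into {cfg['target_table']}"

def run_monthly_rehydration(cfg):
    return f"monthly rehydration of {cfg['target_table']}"

LOAD_MODES = {
    "historic": run_historic,
    "incremental": run_incremental,
    "monthly_rehydration": run_monthly_rehydration,
}

def dispatch(config_text):
    """Parse the job config and hand off to the matching load job."""
    cfg = json.loads(config_text)
    mode = cfg.get("load_mode")
    job = LOAD_MODES.get(mode)
    if job is None:
        raise ValueError(f"unknown load_mode: {mode!r}")
    return job(cfg)
```

A backfill then becomes a config change — switch `load_mode` and widen the date range — with no code edits.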
MERGE statements with hash-key deduplication ensure safe re-runs at any stage.
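A minimal sketch of the hash-key MERGE pattern, rendered as a SQL-building helper. The table and column names are illustrative, not the production schema, and `build_merge_sql` / `hash_key` are hypothetical names for this sketch.

```python
def hash_key(alias, business_keys):
    # MD5 over the pipe-joined business key; the separator guards
    # against ambiguous concatenations of adjacent key parts.
    return "MD5(" + " || '|' || ".join(f"{alias}.{k}" for k in business_keys) + ")"

def build_merge_sql(target, staging, business_keys, columns):
    """Render an idempotent Snowflake-style MERGE keyed on the
    business-key hash: a re-run updates existing rows instead of
    inserting duplicates. Names here are illustrative only."""
    on = f"{hash_key('t', business_keys)} = {hash_key('s', business_keys)}"
    set_clause = ", ".join(f"{c} = s.{c}" for c in columns)
    cols = ", ".join(columns)
    src = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {target} t USING {staging} s ON {on}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({src})"
    )
```

Because the match condition depends only on the business key, replaying any failed stage converges to the same target state.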
Databricks Workflows configured with 3 retry attempts and exponential backoff on transient failures.
Structured logging to Snowflake ETL_LOG table with row counts, durations, and error payloads per stage.
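The stage logging described above can be sketched as a decorator. This is a simplified stand-in: the in-memory `ETL_LOG` list substitutes for the Snowflake ETL_LOG table (in production each entry would become an INSERT through the Snowflake connector), and the assumption that a stage returns a sized collection of rows is illustrative, not the production contract.

```python
import functools
import time

# Stand-in for the Snowflake ETL_LOG table (sketch only).
ETL_LOG = []

def log_stage(stage_name):
    """Wrap a pipeline stage with structured logging of row count,
    duration, and error payload per stage."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                rows = fn(*args, **kwargs)
            except Exception as exc:
                ETL_LOG.append({
                    "stage": stage_name,
                    "row_count": 0,
                    "duration_s": time.monotonic() - start,
                    "error": repr(exc),
                })
                raise  # log, then let orchestration handle the failure
            ETL_LOG.append({
                "stage": stage_name,
                "row_count": len(rows),
                "duration_s": time.monotonic() - start,
                "error": None,
            })
            return rows
        return inner
    return wrap
```

Wrapping stages rather than instrumenting them inline keeps logging uniform across all three load modes, and landing the log in a queryable table is what lets the team check pipeline health without opening Databricks.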
Databricks service principal with least-privilege ADLS and Snowflake roles. No user credentials in code.
Azure DevOps pipeline deploys Databricks job JSON configs and Python wheels on merge to main.
Dev → Staging → Prod promotion via Azure DevOps pipeline stages with manual approval gates.
CTM_Pipeline/pipeline_orchestration.py: Main orchestration class managing load mode selection, parameter passing, and stage execution.
jobs/CTM_Daily_Incremental_Load.json: Databricks Workflow JSON defining the incremental load job, cluster config, and retry policy.
azure-pipelines.yml: CI/CD pipeline for deploying Databricks assets across Dev, Staging, and Production.
"Let me walk you through the CTM pipeline — this is a production Databricks solution I built to ingest call-tracking data from a REST API into Snowflake. The business problem was straightforward but painful: the sales team was manually exporting CSVs every morning, which meant a 24-hour data lag and constant reconciliation headaches. Leadership had no reliable view of call trends. The technical challenge was more interesting. The API was paginated, had inconsistent field naming across versions, and we needed to backfill 18 months of historical data at launch — without disrupting the daily incremental loads. Here's how I solved it: I built three load modes — historic, incremental, and monthly rehydration — all driven by a single JSON config file. The orchestration layer reads the config and selects the right PySpark job. This means backfills are trivial and there's zero code duplication. For reliability, every Snowflake load uses a MERGE statement keyed on a hash of the business key. That means the pipeline is fully idempotent — you can re-run it after any failure without creating duplicates. For observability, I wrapped every pipeline stage with a Python decorator that writes structured logs to a Snowflake table — row counts, durations, error payloads. The team can query pipeline health directly in Snowflake without touching Databricks. The CI/CD is Azure DevOps — job configs and Python wheels deploy automatically on merge to main, with manual approval gates for production. In production, this runs daily and processes about 50,000 records per run. The 18-month historic load processed roughly 2 million records in a single run."
"This is a Databricks pipeline I built for REST API ingestion into Snowflake. The key design decisions were: three load modes driven by a single config file, idempotent MERGE patterns for safe re-runs, and structured logging to Snowflake for observability. It eliminated 8 hours of manual work per week and reduced data lag from 24 hours to under 2 hours."
How this solution demonstrates enterprise data platform delivery values:
Designed, built, and operated the pipeline end-to-end — from API authentication to Snowflake load to CI/CD deployment.
The historic backfill required debugging API rate limits, schema inconsistencies, and ADLS permission issues across a 3-day sprint.
Introduced the schema evolution overflow pattern after researching Delta Lake best practices — not required, but made the system forward-compatible.
Documented every design decision in the README so the next engineer could understand and extend the system without asking me.