Systems

PySpark Data Pipelines - Big Data Ingestion at Scale

January 2023

PySpark, Databricks, Apache NiFi, Python, AWS Kinesis, Structured Streaming

Overview

At JPMorgan Chase, I built and optimized large-scale data ingestion and transformation pipelines on Databricks that moved 50+ GB of data per day into a centralized data lake. The work spanned batch ingestion, Spark Structured Streaming, and real-time pipeline validation.

What I Built

Databricks PySpark Pipelines

Designed modular ingestion pipelines using PySpark DataFrames and Spark SQL, processing structured and semi-structured data from multiple upstream sources
Applied partition pruning, broadcast joins, and adaptive query execution to reduce pipeline runtime by 30%
Implemented schema evolution handling and data quality checkpoints at each transformation stage
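
The three optimizations named above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the production code: the paths, the `event_date` partition column, the `account_id` join key, and the function names are all hypothetical. The configuration keys are standard Spark SQL settings.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # type-only import, so the module loads without a Spark runtime
    from pyspark.sql import DataFrame, SparkSession

# Standard Spark SQL settings that control adaptive query execution (AQE).
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_aqe(spark: SparkSession) -> None:
    """Enable AQE on an existing session."""
    for key, value in AQE_CONF.items():
        spark.conf.set(key, value)


def enrich_events(
    spark: SparkSession, events_path: str, accounts_path: str, run_date: str
) -> DataFrame:
    """Join one day of events to a small dimension table."""
    from pyspark.sql import functions as F

    # Assumption: events are stored partitioned by event_date, so filtering on
    # that column lets Spark skip every other partition (partition pruning).
    events = spark.read.parquet(events_path).where(F.col("event_date") == run_date)

    # Assumption: the accounts table is small enough to fit in executor memory.
    # The broadcast hint copies it to each executor so the large side of the
    # join is not shuffled.
    accounts = spark.read.parquet(accounts_path)
    return events.join(F.broadcast(accounts), on="account_id", how="left")
```

Calling `apply_aqe(spark)` once per session and then `enrich_events(...)` per run date would exercise all three techniques. Whether each one helps depends on table sizes and partition layout, so the 30% figure should not be read as a property of this sketch.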

Apache NiFi Test Utilities

Developed a test utility framework for Apache NiFi data flows that handle 30 to 40 GB in a single load
Enabled reliable validation of processor chains, backpressure behavior, and error routing without needing a full production environment
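
One kind of check such a framework might run is a routing reconciliation: every record sent into a flow should come out exactly once, on either the success or the failure relationship. The sketch below is purely illustrative. The write-up does not describe how the actual framework works, so `RoutingReport`, `check_routing`, and the idea of tracking records by ID are all assumptions, and the code deliberately does not call any NiFi API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RoutingReport:
    """Result of comparing what went into a flow with what came out."""

    missing: set = field(default_factory=set)     # sent in, never came out
    duplicated: set = field(default_factory=set)  # came out more than once
    unexpected: set = field(default_factory=set)  # came out, never sent in

    @property
    def ok(self) -> bool:
        return not (self.missing or self.duplicated or self.unexpected)


def check_routing(sent_ids, success_ids, failure_ids) -> RoutingReport:
    """Verify each sent record ID appears exactly once across both outputs."""
    seen = Counter(success_ids) + Counter(failure_ids)
    sent = set(sent_ids)
    return RoutingReport(
        missing=sent - set(seen),
        duplicated={record_id for record_id, n in seen.items() if n > 1},
        unexpected=set(seen) - sent,
    )
```

For example, `check_routing(["a", "b"], ["a"], ["b"]).ok` is `True`, while a record that never reaches either output shows up in `missing`.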

Real-Time Streaming

Built real-time ingestion paths using AWS Kinesis, supporting ~300 concurrent producers and sustaining 50+ MB/s aggregate throughput under bursty workloads
Used Spark Structured Streaming for continuous processing with watermarking and late-data handling
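
Watermarking and late-data handling in Structured Streaming can be sketched as below. This is a minimal example under assumptions rather than the pipeline itself: the `event_time` and `producer_id` column names, the 10-minute lateness bound, and the 5-minute window are hypothetical, and the function takes any streaming DataFrame instead of wiring up a Kinesis source.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # type-only import, so the module loads without a Spark runtime
    from pyspark.sql import DataFrame

# Assumed values for illustration only.
WATERMARK_DELAY = "10 minutes"  # how late an event may arrive and still count
WINDOW_SIZE = "5 minutes"       # tumbling window length


def windowed_counts(events: DataFrame) -> DataFrame:
    """Count events per producer in tumbling event-time windows."""
    from pyspark.sql import functions as F

    return (
        events
        # Tell Spark how far event time may lag behind the latest timestamp
        # seen; state for windows older than that bound can be dropped.
        .withWatermark("event_time", WATERMARK_DELAY)
        # Group by event-time window and the (hypothetical) producer column.
        .groupBy(F.window("event_time", WINDOW_SIZE), "producer_id")
        .count()
    )
```

To try this without Kinesis, Spark's built-in `rate` source emits `timestamp` and `value` columns that can be renamed to `event_time` and `producer_id`. In append output mode a window is emitted only after the watermark passes its end, whereas update mode emits revised counts as late events arrive.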

Key Outcomes

30% reduction in batch pipeline runtime through query optimization
Reliable test coverage for NiFi flows handling tens of GBs per load
Scalable streaming infrastructure supporting hundreds of concurrent data producers