← All Projects
Systems

PySpark Data Pipelines - Big Data Ingestion at Scale

January 2023
PySparkDatabricksApache NiFiPythonAWS KinesisStructured Streaming

Overview

At JP Morgan Chase, I built and optimized large-scale data ingestion and transformation pipelines running on Databricks, handling 50+ GB of daily data flowing into a centralized data lake. The work spanned batch ingestion, structured streaming, and real-time pipeline validation.

What I Built

Databricks PySpark Pipelines

  • Designed modular ingestion pipelines using PySpark DataFrames and Spark SQL, processing structured and semi-structured data from multiple upstream sources
  • Applied partition pruning, broadcast joins, and adaptive query execution to reduce pipeline runtime by 30%
  • Implemented schema evolution handling and data quality checkpoints at each transformation stage

Apache NiFi Test Utilities

  • Developed a test utility framework for Apache NiFi data flows handling 30-40 GB per single load
  • Enabled reliable validation of processor chains, backpressure behavior, and error routing without needing a full production environment

Real-Time Streaming

  • Built real-time ingestion paths using AWS Kinesis, supporting ~300 concurrent producers and sustaining 50+ MB/s aggregate throughput under bursty workloads
  • Used Spark Structured Streaming for continuous processing with watermarking and late-data handling

Key Outcomes

  • 30% reduction in batch pipeline runtime through query optimization
  • Reliable test coverage for NiFi flows handling tens of GBs per load
  • Scalable streaming infrastructure supporting hundreds of concurrent data producers