12 posts tagged with "data-engineering"


Automating Unit Tests and Deploying AWS Glue & Lambda Python Jobs with CI/CD

March 21, 2025 · 5 min read

Data Engineer

In this blog, we’ll explore how to set up a complete CI/CD pipeline using Jenkins, pytest, and Terraform to automate unit testing and deployment for AWS Glue and Lambda jobs. You’ll also learn how to manage Python dependencies using uv and pyproject.toml, use JFrog Artifactory to store and retrieve build artifacts, and enforce code quality with Ruff.

The Life Cycle of a Spark Application (Outside)

February 7, 2025 · 3 min read

Vibhavari Bellutagi

Data Engineer

In this blog, we will go in-depth on the overall life cycle of Spark Applications from outside the actual Spark code. Before going ahead, I recommend reading the Execution Modes of the Spark application.

Spark Execution Modes

February 6, 2025 · 4 min read

Vibhavari Bellutagi

Data Engineer

In this post, we will discuss the different execution modes available in Apache Spark. Apache Spark provides three execution modes to run Spark applications. These execution modes are: Cluster mode, Client mode, and Local mode. Each of these modes has its own use case and is suitable for different scenarios.

Under the hood of a Spark job

January 21, 2025 · 5 min read

Vibhavari Bellutagi

Data Engineer

Understanding the internal execution flow of a Spark application is key to optimizing performance and debugging. This blog dives into the details of Spark jobs, stages, and tasks, providing a thorough exploration of how Spark handles distributed execution.

Handling Nulls in Spark

January 13, 2025 · 11 min read

Vibhavari Bellutagi

Data Engineer

In SQL, null (or NULL) is a special marker used to indicate that a data value does not exist in the database. A null should not be confused with a value of 0: a null indicates the absence of a value, which is not the same as a value of zero.

For example: Consider the question "How many books does Krishna own?" The answer may be zero (we know that he owns none) or null (we do not know how many he owns).
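The books example can be sketched in a few lines of Python with the standard-library `sqlite3` module. This is only an illustration of general SQL null semantics, not Spark code: the table `book_owners`, its columns, and the extra rows for Arjun and Radha are hypothetical and not taken from the post.

```python
import sqlite3

# Hypothetical table and rows, used only to illustrate NULL vs. 0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book_owners (name TEXT, books_owned INTEGER)")
conn.executemany(
    "INSERT INTO book_owners VALUES (?, ?)",
    [
        ("Krishna", None),  # unknown: stored as NULL
        ("Arjun", 0),       # known: owns no books
        ("Radha", 3),
    ],
)

# Comparing with 0 matches only the row where zero is a known value.
zero_owners = [row[0] for row in conn.execute(
    "SELECT name FROM book_owners WHERE books_owned = 0")]

# NULL has to be tested with IS NULL.
unknown_owners = [row[0] for row in conn.execute(
    "SELECT name FROM book_owners WHERE books_owned IS NULL")]

# '= NULL' evaluates to NULL (not true), so no row is selected.
equals_null = conn.execute(
    "SELECT COUNT(*) FROM book_owners WHERE books_owned = NULL").fetchone()[0]

# Aggregates such as COUNT(column) and SUM(column) skip NULL values.
counted, total = conn.execute(
    "SELECT COUNT(books_owned), SUM(books_owned) FROM book_owners").fetchone()

print(zero_owners, unknown_owners, equals_null, counted, total)
```

Filtering on `books_owned = 0` and on `books_owned IS NULL` answers two different questions, which is why the two cases need to be handled separately.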

Let's deep dive into handling nulls in Spark.

Columns and Expressions

January 10, 2025 · 4 min read

Vibhavari Bellutagi

Data Engineer

Columns and expressions in Apache Spark play a big role in making your pipeline more efficient. In this blog, we will look into ALL the possible ways to select columns, use built-in functions, and perform calculations with column objects and expressions in PySpark. So, whether you are building an ETL pipeline or doing exploratory data analysis, these techniques will come in handy.

Introduction to Apache Spark

January 1, 2025 · 5 min read

Vibhavari Bellutagi

Data Engineer

Welcome to my Apache Spark series! I’ll dive deep into Apache Spark, from basics to advanced concepts. This series is about learning, exploring, and sharing: documenting my journey to mastering Apache Spark (again) while passing on insights, challenges, and tips.

In this first post, we’ll cover the fundamentals of Apache Spark, its history, and why it’s a game-changer in data engineering.

Find all the blogs in the series here.

Data Modelling - Fact vs Dimension

December 19, 2024 · 3 min read

Vibhavari Bellutagi

Data Engineer

I'm sharing my learnings from the Data Engineering Bootcamp, where we are currently focusing on Fact vs Dimension.

Resource	Link
DataExpert.io	DataExpert.io
Zach Wilson on LinkedIn	LinkedIn

Thank you, Zach, for your invaluable guidance and this comprehensive bootcamp!

Week-2, Day-2: Fact vs Dimension

Data Modelling - Fact Modelling

December 14, 2024 · 6 min read

Vibhavari Bellutagi

Data Engineer

I'm sharing my learnings from the Data Engineering Bootcamp. Today we are learning about Fact Modelling.

I would like to extend my gratitude to Zach Wilson, the founder of DataExpert.io, for his invaluable guidance and the comprehensive Data Engineering Bootcamp. Connect with Zach Wilson on LinkedIn.

Thank you, Zach, for this amazing, intense bootcamp on data engineering!

Week-2, Day-1: Fact Data Modeling

Data Modelling - Graph Databases and Additive Dimensions

December 5, 2024 · 5 min read

Vibhavari Bellutagi

Data Engineer

I'm sharing my learnings from the Data Engineering Bootcamp. Today we are learning about Data Modelling - Graph Databases.

I would like to extend my gratitude to Zach Wilson, the founder of DataExpert.io, for his invaluable guidance and the comprehensive Data Engineering Bootcamp. Connect with Zach Wilson on LinkedIn.

Thank you, Zach, for this amazing, intense bootcamp on data engineering!

Day - 3: Data Modeling: Graph Databases