12 posts tagged with "data-engineering"

Automating Unit Tests and Deploying AWS Glue & Lambda Python Jobs with CI/CD

· 5 min read
Vibhavari Bellutagi
Data Engineer

In this blog, we’ll explore how to set up a complete CI/CD pipeline using Jenkins, pytest, and Terraform to automate unit testing and deployment for AWS Glue and Lambda jobs. You’ll also learn how to manage Python dependencies using uv and pyproject.toml, use JFrog Artifactory to store and retrieve build artifacts, and enforce code quality with Ruff.

Spark Execution Modes

· 4 min read
Vibhavari Bellutagi
Data Engineer

In this post, we will discuss the three execution modes Apache Spark provides for running Spark applications: cluster mode, client mode, and local mode. Each mode has its own use case and suits different scenarios.
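
As a rough illustration of how the three modes differ at submit time, here is a minimal sketch that builds a `spark-submit` command line for each one. It assumes YARN as the cluster manager, and the helper name `spark_submit_args` and the script name `my_job.py` are hypothetical; only the `--master` and `--deploy-mode` flags come from Spark itself.

```python
def spark_submit_args(mode: str, app: str = "my_job.py") -> list[str]:
    """Build a spark-submit command line for one of the three modes.

    Assumption: the cluster manager is YARN; `app` is a placeholder script name.
    """
    if mode == "local":
        # Local mode: driver and executors run as threads in one JVM on this machine.
        return ["spark-submit", "--master", "local[*]", app]
    if mode in ("client", "cluster"):
        # Client mode: the driver runs on the submitting machine.
        # Cluster mode: the driver runs inside the cluster.
        return ["spark-submit", "--master", "yarn", "--deploy-mode", mode, app]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("local", "client", "cluster"):
        print(" ".join(spark_submit_args(mode)))
```

Local mode is typically the one to reach for during development and testing, since it needs no cluster at all.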

Handling Nulls in Spark

· 11 min read
Vibhavari Bellutagi
Data Engineer

In SQL, null (or NULL) is a special marker used to indicate that a data value does not exist in the database. A null should not be confused with a value of 0: a null indicates the lack of a value, which is not the same as a zero value.

For example: Consider the question "How many books does Krishna own?" The answer may be zero (we know that he owns none) or null (we do not know how many he owns).

Let's deep dive into handling nulls in Spark.
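
Before turning to Spark, the zero-versus-null distinction can be checked in plain SQL. The sketch below uses Python's built-in `sqlite3` module rather than Spark; the `owners` table and every name other than Krishna are made up for illustration.

```python
import sqlite3

# Hypothetical table: how many books each person owns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE owners (name TEXT, books INTEGER)")
conn.executemany(
    "INSERT INTO owners VALUES (?, ?)",
    [("Krishna", None), ("Radha", 0), ("Arjun", 3)],  # None is stored as NULL
)

# NULL is not equal to 0: only the row with a known zero matches.
zero_owners = [r[0] for r in conn.execute("SELECT name FROM owners WHERE books = 0")]

# NULL has to be tested with IS NULL; a comparison like `books = NULL` never matches.
unknown_owners = [r[0] for r in conn.execute("SELECT name FROM owners WHERE books IS NULL")]
equals_null = conn.execute("SELECT COUNT(*) FROM owners WHERE books = NULL").fetchone()[0]

# Aggregates skip NULLs: COUNT(books) counts known values, COUNT(*) counts rows.
count_books, count_rows = conn.execute(
    "SELECT COUNT(books), COUNT(*) FROM owners"
).fetchone()

print(zero_owners, unknown_owners, equals_null, count_books, count_rows)
```

Here Radha is known to own no books, while Krishna's count is simply unknown, so the two rows behave differently in filters and aggregates.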

Columns and Expressions

· 4 min read
Vibhavari Bellutagi
Data Engineer

Apache Spark's Column and Expression objects play a big role in making your pipeline more efficient. In this blog we will look into ALL the possible ways to select columns, use built-in functions, and perform calculations with column objects and expressions in PySpark. So, whether you are building an ETL pipeline or doing exploratory data analysis, these techniques will come in handy.

Introduction to Apache Spark

· 5 min read
Vibhavari Bellutagi
Data Engineer

Welcome to my Apache Spark series! I'll dive deep into Apache Spark, from basics to advanced concepts. This series is about learning, exploring, and sharing: documenting my journey to mastering Apache Spark (again) while passing along insights, challenges, and tips.

In this first post, we’ll cover the fundamentals of Apache Spark, its history, and why it’s a game-changer in data engineering.

Find all the blogs in the series here.

Data Modelling - Fact Modelling

· 6 min read
Vibhavari Bellutagi
Data Engineer

I'm sharing what I learned in the Data Engineering Bootcamp. Today's topic is Fact Modelling.

I would like to extend my gratitude to Zach Wilson, the founder of DataExpert.io, for his invaluable guidance and the comprehensive Data Engineering Bootcamp. Connect with Zach Wilson on LinkedIn.

Thank you, Zach, for this amazing, intense bootcamp on data engineering!


Week-2, Day-1: Fact Data Modeling

Data Modelling - Graph Databases and Additive Dimensions

· 5 min read
Vibhavari Bellutagi
Data Engineer

I'm sharing what I learned in the Data Engineering Bootcamp. Today's topic is Data Modelling - Graph Databases.

I would like to extend my gratitude to Zach Wilson, the founder of DataExpert.io, for his invaluable guidance and the comprehensive Data Engineering Bootcamp. Connect with Zach Wilson on LinkedIn.

Thank you, Zach, for this amazing, intense bootcamp on data engineering!


Day - 3: Data Modeling: Graph Databases