ETL with PySpark – Intro

Data transformation is an essential step in the data processing pipeline, especially when working with big data frameworks like Apache Spark. In this article, we’ll explore the different types of data transformations you can perform using PySpark (Spark’s Python API), complete with easy-to-understand code examples.

What is Data Transformation?

Data transformation involves converting data from one format or structure into another. This process is crucial for preparing your data for analytics or machine learning models. In PySpark, you can perform data transformations using Resilient Distributed Datasets (RDDs) or DataFrames.

Setting Up PySpark

Before diving into the code, make sure you have PySpark installed. If not, you can install it using pip:

pip install pyspark

Now, let’s initialize a PySpark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Data Transformation in PySpark") \
    .getOrCreate()

Data Transformation Types

1. Normalization and Standardization

What is it?

Normalization scales the data into a specific range, usually [0, 1]. This is useful when your features have different units or different scales.

Normalized Value = (Value − Min) / (Max − Min)


Standardization, on the other hand, centers the data around zero and scales it based on the standard deviation. This is useful when the data follows a Gaussian distribution.

Standardized Value = (Value − Mean) / Standard Deviation

Code Example 1: Normalization

Let’s start by creating a sample DataFrame and then normalize the “Age” column.

from pyspark.sql.functions import col, min, max

# Sample DataFrame
data = [("Alice", 34), ("Bob", 45), ("Catherine", 29)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Calculate Min and Max for Age
age_min = df.select(min(col("Age"))).collect()[0][0]
age_max = df.select(max(col("Age"))).collect()[0][0]

# Perform Normalization
df_normalized = df.withColumn("Age_Normalized", (col("Age") - age_min) / (age_max - age_min))

In this example, the “Age” column is normalized into a [0, 1] range.
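As a quick sanity check of the arithmetic, outside Spark, the same min-max formula applied to the three sample ages in plain Python gives:

```python
# Plain-Python check of the min-max formula on the sample ages
ages = [34, 45, 29]  # Alice, Bob, Catherine
age_min, age_max = min(ages), max(ages)

normalized = [(a - age_min) / (age_max - age_min) for a in ages]
print(normalized)  # [0.3125, 1.0, 0.0]
```

Bob (the maximum) maps to 1.0 and Catherine (the minimum) maps to 0.0, which is exactly what the `Age_Normalized` column above will contain.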

Code Example 2: Standardization

Now, let’s standardize the “Age” column.

from pyspark.sql.functions import mean, stddev

# Calculate Mean and Standard Deviation for Age
age_mean = df.select(mean(col("Age"))).collect()[0][0]
age_stddev = df.select(stddev(col("Age"))).collect()[0][0]

# Perform Standardization
df_standardized = df.withColumn("Age_Standardized", (col("Age") - age_mean) / age_stddev)
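The same kind of check works for standardization. Note that Spark’s `stddev` computes the sample standard deviation (`stddev_samp`), which matches Python’s `statistics.stdev`:

```python
import statistics

ages = [34, 45, 29]  # Alice, Bob, Catherine
age_mean = statistics.mean(ages)     # 36
age_stddev = statistics.stdev(ages)  # sample standard deviation, ~8.185

standardized = [(a - age_mean) / age_stddev for a in ages]
print([round(v, 3) for v in standardized])  # approximately [-0.244, 1.1, -0.855]
```

The standardized values sum to zero, since the column has been centered around its mean.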

2. Encoding Categorical Data

What is it?

Encoding transforms categorical data into a format that can be used by machine learning algorithms, such as one-hot encoding or label encoding.

Sample DataFrame

Let’s create a sample DataFrame with a categorical column “Fruit”:

# Sample DataFrame
data = [("Apple",), ("Banana",), ("Cherry",), ("Apple",), ("Banana",), ("Apple",)]
columns = ["Fruit"]

df = spark.createDataFrame(data, columns)

Label Encoding

Label encoding assigns a unique integer to each category in the categorical column. Here’s how to do it:

from pyspark.ml.feature import StringIndexer

# Initialize StringIndexer
indexer = StringIndexer(inputCol="Fruit", outputCol="Fruit_Index")

# Fit and Transform the DataFrame
df_indexed = indexer.fit(df).transform(df)

One-Hot Encoding

One-hot encoding converts the integer-mapped column into a binary vector. Each category is represented as an array where only one element is “1,” and the rest are “0.”

from pyspark.ml.feature import OneHotEncoder

# Initialize OneHotEncoder
encoder = OneHotEncoder(inputCol="Fruit_Index", outputCol="Fruit_OneHot")

# Fit and Transform the DataFrame (OneHotEncoder is an estimator in Spark 3.x)
df_encoded = encoder.fit(df_indexed).transform(df_indexed)

Full Code

Here’s the full code combining all the steps:

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Sample DataFrame
data = [("Apple",), ("Banana",), ("Cherry",), ("Apple",), ("Banana",), ("Apple",)]
columns = ["Fruit"]
df = spark.createDataFrame(data, columns)

# Label Encoding
indexer = StringIndexer(inputCol="Fruit", outputCol="Fruit_Index")
df_indexed = indexer.fit(df).transform(df)

# One-Hot Encoding
encoder = OneHotEncoder(inputCol="Fruit_Index", outputCol="Fruit_OneHot")
df_encoded = encoder.fit(df_indexed).transform(df_indexed)

# Show the final DataFrame
df_encoded.show(truncate=False)

By following this example, you can easily encode categorical data in PySpark, making your dataset ready for machine learning algorithms.
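To see what the encoders produce without spinning up Spark, here is a minimal pure-Python sketch of the same two steps. Two caveats, both assumptions about Spark’s defaults worth knowing: `StringIndexer` orders labels by descending frequency (most frequent label gets index 0), and `OneHotEncoder` drops the last category by default (`dropLast=True`) and returns sparse vectors; the dense lists below are just for illustration.

```python
from collections import Counter

fruits = ["Apple", "Banana", "Cherry", "Apple", "Banana", "Apple"]

# Label encoding: most frequent label gets index 0, mimicking StringIndexer's default ordering
ordered = [label for label, _ in Counter(fruits).most_common()]
index = {label: i for i, label in enumerate(ordered)}
labels = [index[f] for f in fruits]
print(labels)  # [0, 1, 2, 0, 1, 0] -- Apple -> 0, Banana -> 1, Cherry -> 2

# One-hot encoding with the last category dropped, mimicking dropLast=True
size = len(ordered) - 1
one_hot = [[1.0 if i == index[f] else 0.0 for i in range(size)] for f in fruits]
print(one_hot[2])  # [0.0, 0.0] -- "Cherry", the dropped category, is the all-zeros vector
```

The dropped category avoids redundancy: with k categories, k − 1 binary columns are enough to identify each one.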

3. Feature Engineering

What is it?

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to machine learning algorithms. Well-designed features often matter more than the choice of model, which is why feature engineering is a central step in the data science pipeline. In this section, we’ll explore how to perform feature engineering using PySpark.

Sample DataFrame

Let’s create a sample DataFrame with numerical and categorical columns:

# Sample DataFrame
data = [("Alice", 34, "Female"), ("Bob", 45, "Male"), ("Catherine", 29, "Female")]
columns = ["Name", "Age", "Gender"]

df = spark.createDataFrame(data, columns)

Feature Engineering Techniques

1. Creating Polynomial Features

Polynomial features are created by raising an existing feature to a power, letting a model capture non-linear relationships.

Code Example

from pyspark.sql.functions import col

# Create a new feature Age_Squared = Age * Age
df_poly = df.withColumn("Age_Squared", col("Age") ** 2)
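The computation itself is just element-wise squaring; in plain Python (using the sample ages above) the new column’s values would be:

```python
# Plain-Python check of the squared feature on the sample ages
ages = [34, 45, 29]  # Alice, Bob, Catherine
age_squared = [a ** 2 for a in ages]
print(age_squared)  # [1156, 2025, 841]
```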

2. Interaction Features

Interaction features are created by combining two or more features.

Code Example

# Sample DataFrame
data = [(1, 10), (2, 20), (3, 30), (4, 40)]
columns = ["Units_Sold", "Price_Per_Unit"]

df = spark.createDataFrame(data, columns)

# Create an interaction feature: Total_Sales_Value = Units_Sold * Price_Per_Unit
df_interaction = df.withColumn("Total_Sales_Value", col("Units_Sold") * col("Price_Per_Unit"))
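Again, the underlying computation is a simple row-wise product; in plain Python the interaction column would contain:

```python
# Plain-Python check of the interaction feature on the sample rows
units = [1, 2, 3, 4]
price = [10, 20, 30, 40]
total = [u * p for u, p in zip(units, price)]
print(total)  # [10, 40, 90, 160]
```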

Stay tuned for more.