In data processing, when running ETL workloads against a data warehouse, some ETL activities can be handled with the SQL MERGE statement. The MERGE statement is a special type of query in SQL Server Read More …
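As a rough illustration of the kind of ETL step MERGE covers (update rows that already exist in the target, insert the ones that do not), here is a minimal sketch in Python. It is not SQL Server code: it uses the standard-library `sqlite3` module, and because SQLite has no MERGE statement it substitutes `INSERT ... ON CONFLICT DO UPDATE`, which needs SQLite 3.24 or newer. The `dim_customer` table, its columns, and the `upsert_customers` helper are hypothetical names chosen for the example.

```python
import sqlite3


def upsert_customers(conn, rows):
    """Insert new rows and update existing ones, matched on id.

    Stand-in for a MERGE: "matched" rows are updated, "not matched" rows
    are inserted. Uses SQLite's upsert syntax, not SQL Server's MERGE.
    """
    conn.executemany(
        """
        INSERT INTO dim_customer (id, name, city)
        VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            name = excluded.name,
            city = excluded.city
        """,
        rows,
    )
    conn.commit()


# Hypothetical warehouse table, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# Initial load.
upsert_customers(conn, [(1, "Ada", "London"), (2, "Linus", "Helsinki")])

# Incremental load: id 2 already exists (updated), id 3 is new (inserted).
upsert_customers(conn, [(2, "Linus", "Portland"), (3, "Grace", "New York")])

result = conn.execute(
    "SELECT id, name, city FROM dim_customer ORDER BY id"
).fetchall()
print(result)
```

After the second load the table holds three rows, with the city for id 2 overwritten by the incoming value. In SQL Server the same intent would typically be written as one MERGE with `WHEN MATCHED` and `WHEN NOT MATCHED` branches.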
Building a Real-Time Data Pipeline with Python, Docker, Airflow, Spark, Kafka, and Cassandra
In today’s data-driven world, the ability to efficiently collect, process, and analyze large volumes of data is paramount. This blog post will delve into a data engineering project that leverages a powerful combination of tools: Python, Docker, Airflow, Spark, Kafka, Read More …
Mastering Personalized Conversations with RAG
In today’s fast-evolving tech landscape, the demand for personalized, contextually aware conversational systems is higher than ever. Whether you’re developing customer support bots, virtual personal assistants, or interactive educational tools, leveraging advanced models like Retrieval-Augmented Generation (RAG) can transform your Read More …
Top 10 Core Concepts in Every Programming Language
While there are many programming languages, they all share some fundamental building blocks. Here’s a breakdown of the top 10 core concepts you’ll find in pretty much every coding language: Understanding these core concepts is like learning the alphabet of Read More …
Intro to Apache Spark
Apache Spark is a lightning-fast cluster computing technology. It is based on the Hadoop MapReduce model and extends that model to efficiently support more types of computation, including interactive queries and stream processing. Read More …
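To make the MapReduce model mentioned above concrete, here is a small single-process sketch in plain Python. It does not use Spark or Hadoop APIs; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are made up for the example, and a real cluster would distribute each phase across machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single result."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


lines = ["spark extends mapreduce", "spark is fast"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

The three phases are kept as separate functions so the data flow (emit pairs, group by key, combine per key) stays visible. Engines like Spark generalize this pattern so that multi-step and interactive workloads do not have to be forced into a single map and reduce pass.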
Spark SQL – Part I
The world of data is exploding. Businesses are generating massive datasets from various sources – customer transactions, sensor readings, social media feeds, and more. Analyzing this data is crucial for uncovering valuable insights, informing decisions, and gaining a competitive edge. Read More …
ETL with PySpark – Intro
Data transformation is an essential step in the data processing pipeline, especially when working with big data platforms like PySpark. In this article, we’ll explore the different types of data transformations you can perform using PySpark, complete with easy-to-understand code Read More …
Spark DataFrame Cheat Sheet
Core Concepts: DataFrame is simply a type alias of Dataset[Row].
Quick Reference:
val spark = SparkSession.builder().appName("Spark SQL basic example").master("local").getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
Creation: create Dataset from seq Read More …
Basic Linux Commands – Cheat sheet
This cheat sheet covers some of the most essential commands you’ll encounter in Linux. File Management: Basic Operations: Permissions and Ownership: Network: Process Management: Finding Things: Help and Information: This is a basic list. Many more commands exist for specific Read More …
Data Engineering Learning Path
This revamped curriculum outlines the key areas of focus and estimated timeframes for mastering data engineering skills. Foundational Knowledge (1-3 weeks) Data Modeling (2-4 weeks) Data Storage (3-5 weeks) Data Processing (2-4 weeks) Data Integration (4-8 weeks) Data Transformation (4-6 Read More …
Python 101: A Beginner’s Guide to Python Programming
Intro. Installing Python: Before diving into Python programming, you need to install Python on your computer. Follow these steps: For Windows: For macOS: For Linux: Most Linux distributions come with Python pre-installed. You can check the version by running python Read More …