While there are many programming languages, they all share some fundamental building blocks. Here’s a breakdown of the top 10 core concepts you’ll find in pretty much every coding language: Understanding these core concepts is like learning the alphabet of Read More …
Month: May 2024
Intro to Apache Spark
Apache Spark is a fast, general-purpose cluster computing engine. It builds on the ideas behind Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. Read More …
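As a quick, hedged illustration of the kind of interactive computation the article introduces, here is a minimal PySpark sketch; it assumes a local Spark installation with the pyspark package, and the column names and sample rows are invented for the example:

from pyspark.sql import SparkSession

# Start a local Spark session (assumes the pyspark package is installed)
spark = SparkSession.builder.appName("intro-example").master("local[*]").getOrCreate()

# A tiny in-memory dataset standing in for a real source
rows = [("alice", 34), ("bob", 45), ("carol", 29)]
df = spark.createDataFrame(rows, ["name", "age"])

# An interactive-style query: filter and aggregate without writing a MapReduce job
df.filter(df.age > 30).groupBy().avg("age").show()

spark.stop()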
Spark SQL – Part I
The world of data is exploding. Businesses are generating massive datasets from various sources – customer transactions, sensor readings, social media feeds, and more. Analyzing this data is crucial for uncovering valuable insights, informing decisions, and gaining a competitive edge. Read More …
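Purely as a hedged sketch of where the series is headed, the snippet below registers a small DataFrame as a temporary view and queries it with Spark SQL; the table and column names are invented for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-intro").master("local[*]").getOrCreate()

# Hypothetical transaction data standing in for a real business source
transactions = spark.createDataFrame(
    [("2024-05-01", "store_a", 120.0),
     ("2024-05-01", "store_b", 80.0),
     ("2024-05-02", "store_a", 60.0)],
    ["day", "store", "amount"],
)
transactions.createOrReplaceTempView("transactions")

# Plain SQL over the view: total revenue per store
spark.sql("SELECT store, SUM(amount) AS revenue FROM transactions GROUP BY store").show()

spark.stop()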
ETL with PySpark – Intro
Data transformation is an essential step in the data processing pipeline, especially when working with big data platforms like PySpark. In this article, we’ll explore the different types of data transformations you can perform using PySpark, complete with easy-to-understand code Read More …
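To make the idea of a transformation concrete before the full article, here is a minimal, hedged PySpark sketch showing a few common DataFrame transformations; the input columns and values are assumptions for the example, not the article's actual dataset:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-transform-sketch").master("local[*]").getOrCreate()

# Raw rows as they might arrive from an extract step (invented for the example)
raw = spark.createDataFrame(
    [("alice", "2024-05-01", "12.5"),
     ("bob", "2024-05-01", "7.0"),
     ("alice", "2024-05-02", "3.5")],
    ["user", "day", "amount"],
)

# Typical transformations: cast types, derive a column, filter, then aggregate
cleaned = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .withColumn("day", F.to_date("day"))
       .filter(F.col("amount") > 5.0)
)
cleaned.groupBy("user").agg(F.sum("amount").alias("total")).show()

spark.stop()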
Spark DataFrame Cheat Sheet
Core Concepts: a DataFrame is simply a type alias of Dataset[Row]. Quick Reference: val spark = SparkSession.builder().appName("Spark SQL basic example").master("local").getOrCreate(). For implicit conversions like converting RDDs to DataFrames: import spark.implicits._ Creation: create a Dataset from a Seq Read More …
Basic Linux Commands – Cheat sheet
This cheat sheet covers some of the most essential commands you’ll encounter in Linux, organized into sections: File Management, Basic Operations, Permissions and Ownership, Network, Process Management, Finding Things, and Help and Information. This is a basic list. Many more commands exist for specific Read More …
Data Engineering Learning Path
This revamped curriculum outlines the key areas of focus and estimated timeframes for mastering data engineering skills: Foundational Knowledge (1-3 weeks), Data Modeling (2-4 weeks), Data Storage (3-5 weeks), Data Processing (2-4 weeks), Data Integration (4-8 weeks), Data Transformation (4-6 Read More …