Data Processing

Setting Up a Data Lake from Scratch

Editor | 3 months ago | 2 months ago | 0 | 8 mins

What is a Data Lake? A data lake is a centralized repository that stores vast amounts of raw data in its native format. Unlike traditional data warehouses, which require predefined schemas and are optimized for structured data, data lakes store unprocessed data. This approach provides greater flexibility for advanced analytics, real-time data processing, and machine…

Apache Spark with Python 101—Quick Start to PySpark (2025)

Editor | 4 months ago | 2 months ago | 0 | 53 mins

Apache Spark is an open-source, distributed engine for large-scale data processing. It was developed at UC Berkeley’s AMPLab in 2009 (and released publicly in 2010), mainly to address the limitations of Hadoop MapReduce, particularly for iterative algorithms and interactive data analysis. Spark can run programs up to 100x faster than Hadoop MapReduce in certain workloads, primarily because of its in-memory processing. Plus,…

Let’s break down AI, Machine Learning (ML), and Neural Networks in a structured way

Editor | 3 years ago | 7 months ago | 0 | 14 mins

Let’s break down AI, Machine Learning (ML), and Neural Networks in a structured way, covering key concepts, types of ML, model architectures such as Transformers, and their applications.