High-Level Overview of Apache Spark
What is Spark? Let’s take a look under the hood

In my last post we introduced a problem (copious, never-ending streams of data) and its solution: Apache Spark. Here in part two, we’ll focus on Spark’s internal architecture and data structures.
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers — Grace Hopper
With the scale of data growing at a rapid and ominous pace, we needed a way to process potentially petabytes of data quickly, and no single computer could do that at a reasonable pace. The solution is to spread the work across a cluster of machines, but how do those machines work together to solve a common problem?
Meet Spark
Spark is a cluster computing framework for large-scale data processing. It offers a set of libraries in three languages (Java, Scala, Python) for its unified computing engine. What does this definition actually mean? Each piece is unpacked below, followed by a short code sketch.
Unified — with Spark, there is no need to piece together an application out of multiple APIs or systems. Spark provides you with enough built-in APIs to get the job done.
Computing Engine — Spark handles loading data from various file systems and running computations on it, but it does not permanently store any data itself. Spark keeps data in memory during computation wherever possible, which is the main source of its speed advantage over disk-based engines.
Libraries — Spark comprises a series of libraries built for data science tasks. Spark includes libraries for SQL (Spark SQL), Machine Learning (MLlib), Stream Processing (Spark Streaming and Structured Streaming), and Graph Analytics (GraphX).
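To make this concrete, here is a minimal PySpark sketch of the unified engine in action: one SparkSession loads data from an external file system, and the same engine runs both a DataFrame transformation and a Spark SQL query over it. The input file events.json and the user_id column are hypothetical stand-ins for whatever data you actually have.

```python
# Minimal PySpark sketch of Spark's unified API.
# The file "events.json" and the "user_id" column are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The SparkSession is the entry point to Spark's unified computing engine.
spark = SparkSession.builder.appName("spark-overview-demo").getOrCreate()

# Spark loads data from an external file system (local disk, HDFS, S3, ...);
# it computes on the data but does not store it itself.
events = spark.read.json("events.json")

# DataFrame API: count events per user with built-in functions.
per_user = (
    events
    .groupBy("user_id")
    .agg(F.count("*").alias("event_count"))
)

# Spark SQL: the same computation expressed as SQL, run by the same engine.
events.createOrReplaceTempView("events")
per_user_sql = spark.sql(
    "SELECT user_id, COUNT(*) AS event_count FROM events GROUP BY user_id"
)

per_user.show()
per_user_sql.show()

spark.stop()
```

The point is not the specific query but that loading, DataFrame transformations, and SQL all flow through one engine and one set of built-in APIs, with no separate systems to glue together.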