A Deep Dive Into Spark Datasets and DataFrames Using Scala
A comprehensive guide to Spark datasets and DataFrames

Preliminary
Apache Spark is an open source distributed data processing engine that can be used for big data analysis. It has built-in libraries for streaming, graph processing, and machine learning, and data scientists can use Spark to rapidly analyze data at scale. Programming languages supported by Spark include Python, Java, Scala, and R.
Scala is a powerful programming language that combines functional and object-oriented programming. It is a statically typed language that runs on the JVM. Apache Spark is itself written in Scala, and because Scala scales well on the JVM, it is a popular choice for data developers working on Spark projects. In this article, I am going to show you how to use Spark Datasets and DataFrames using Scala.
Code listings
The code listings in this article have been tested on a Databricks Community Edition cluster (Runtime 8.2) with Spark 3.1.1 and Scala 2.12. Some of the code listings may not work with earlier versions of Spark. You can find a link to the source code for all the code listings at the end of this article.