A Deep Dive Into Spark Datasets and DataFrames Using Scala
A comprehensive guide to Spark datasets and DataFrames

Preliminary
Apache Spark is an open source distributed data processing engine that can be used for big data analysis. It has built-in libraries for streaming, graph processing, and machine learning, and data scientists can use Spark to rapidly analyze data at scale. Programming languages supported by Spark include Python, Java, Scala, and R.
Scala is a powerful programming language that combines functional and object-oriented programming. It is a statically typed language that runs on the JVM. Apache Spark is itself written in Scala, and because Scala scales well on the JVM, it is a popular choice for data developers working on Spark projects. In this article, I am going to show you how to use Spark Datasets and DataFrames using Scala.
Code listings
The code listings in this article have been tested on a Databricks Community Edition cluster (Runtime 8.2) with Spark 3.1.1 and Scala 2.12. Some of the code listings may not work with earlier versions of Spark. You can find a link to the source code for all the code listings at the end of this article.