After the release of Scala 3, one of the most common questions asked by developers was: “When will we be able to write Spark jobs using Scala 3?”. Until now, the answer was: “Not yet”, but everything changed with the release of Spark 3.2.0, which brought Scala 2.13 support. Scala 3 projects can depend on Scala 2.13 libraries, so can we finally write Spark jobs using Scala 3? The answer is: yes, in theory.
DISCLAIMER: Many things shown here are still under development or experimental and may not work properly in some cases.
In this post, we will show an example Spark project using Scala 3.
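For reference, this is roughly what depending on the Scala 2.13 build of Spark can look like from a Scala 3 project. The snippet below is only a sketch: the Scala 3 version is an arbitrary pick, and our example project itself is driven by scala-cli rather than sbt.

// build.sbt sketch: a Scala 3 project using the Scala 2.13 artifacts of Spark
scalaVersion := "3.1.0"

libraryDependencies += ("org.apache.spark" %% "spark-sql" % "3.2.0")
  .cross(CrossVersion.for3Use2_13) // resolve spark-sql_2.13 instead of spark-sql_3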
Blog posts like this one usually stick to simple, familiar examples, such as processing a small dataset provided as JSON files. Since we are on the subject of Scala 3, we decided to use Scala 3 code itself as our dataset. Scala 3 introduces the TASTy format, which defines serialized Typed Abstract Syntax Trees. TASTy files, in short, contain the view that the compiler has of your program, with all types inferred, implicits expanded, and so on. With Scala 3, TASTy files are shipped together with the bytecode inside jars as *.tasty files. In our example, we are going to analyze the code of the most popular libraries from the Scala 3 ecosystem.
Loading and parsing TASTy is out of scope for this blog post, so we will only briefly describe what that part does. A more detailed explanation can be found in the comments in the source code on GitHub.
In the first step, we pull the .jar files from Maven Central by simply sending an HTTP request for each library and extract the content of the .tasty files into a TastyFile case class:
case class Library(org: String, name: String, version: String)
case class TastyFile(lib: Library, path: String, content: Array[Byte])
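The actual downloading and extraction logic lives in the repository. A simplified sketch of the idea, assuming the standard Maven Central directory layout and a helper name of our own choosing, could look like this:

import java.net.URL
import java.util.zip.ZipInputStream

// hypothetical helper: download a library's jar and collect its .tasty entries
// (error handling and closing the stream are omitted for brevity)
def tastyFiles(lib: Library): List[TastyFile] =
  val jarUrl = "https://repo1.maven.org/maven2/" +
    s"${lib.org.replace('.', '/')}/${lib.name}/${lib.version}/${lib.name}-${lib.version}.jar"
  val zip = ZipInputStream(URL(jarUrl).openStream())
  Iterator
    .continually(zip.getNextEntry)
    .takeWhile(_ != null)
    .filter(_.getName.endsWith(".tasty"))
    .map(entry => TastyFile(lib, entry.getName, zip.readAllBytes()))
    .toList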
Then, using tasty-query, we extract information about the base type, kind, and position of each tree into another case class.
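The exact definition is in the repository; a rough sketch of its shape, in which every name except topLevelType (the column that the Spark job below groups by) is our assumption, might be:

// hypothetical shape of the per-tree information extracted with tasty-query
case class TreeInfo(
  lib: Library,         // library the tree comes from
  sourceFile: String,   // .tasty file the tree was read from
  treeKind: String,     // kind of the tree node
  topLevelType: String, // base type of the tree, or NoType if it has none
  position: Int         // position of the tree within the file
)

val NoType = "<no type>" // assumed sentinel, referenced by the filter in the Spark job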
Let’s try to run it on a single library, e.g. cats-core. With scala-cli, a brand new tool, we can run the code directly from our repository locally with just one command:
scala-cli run https://raw.githubusercontent.com/VirtusLab/types-usage-counter/master/tasty.scala -- org.typelevel cats-core_3 2.6.1
The most popular types in cats-core are Tuple2, Object and Function1. These types are very common in the codebase and quite general, so this result looks reasonable. It seems our approach works, so let’s run the same algorithm using Spark.
The next step was to create a Spark application to load, process, and analyze the data, preferably on a cluster. The processing part is quite simple: we are looking for the 10 most popular types. We ended up with the following code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import scala3encoders.given

@main def spark(args: String*) =
  val csvPath = args.headOption.getOrElse("libs.csv")

  val spark = SparkSession.builder().master("local").getOrCreate()
  import spark.implicits._

  // read the list of libraries and extract TreeInfo rows from their tasty files
  val libs = spark.read.option("header", true).csv(csvPath).as[Library]
  val treeInfos = processLibraries(libs)

  // count the most frequently used top-level types
  treeInfos
    .filter(col("topLevelType").=!=(NoType))
    .groupBy("topLevelType")
    .count()
    .sort(col("count").desc)
    .show(10, false)

  spark.stop()
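One line in the snippet above deserves a closer look: import scala3encoders.given. Spark needs an Encoder instance for every case class used in a Dataset, and its built-in derivation relies on Scala 2 implicit machinery (runtime reflection and TypeTags) that is not available when compiling with Scala 3. Without an encoder in scope the code simply does not compile, and the error message is not particularly helpful:

value toDF is not a member of … — did you mean libs.coll?

Fortunately, the spark-scala3 library provides derivation of Spark encoders for Scala 3 case classes, and the single given import at the top of the file is enough to make the code compile.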
Our input consists of over 500 artifacts compiled with Scala 3, obtained from the Scaladex API. Our cluster has one master and one worker node, running Docker containers from big-data-europe tweaked to use Spark 3.2 and Scala 2.13.
The most popular types are:
topLevelType                               count
scala.Any                                  75264
scala.Int                                  49041
java.lang.Object                           42730
scala.Boolean                              23670
scala.Predef                               21112
java.lang.String                           21124
scala.Tuple2$                              19279
scala.runtime.ModuleSerializationProxy     18783
java.lang.IndexOutOfBoundsException        17262
scala.package                              16605
Our method for processing the TASTy data itself is very naive, so the results may be a bit skewed. Nevertheless, the high usage of Int, Boolean, Tuple2$ (Tuple2’s companion object), Function1 and Option indicates that our data makes some sense.
You can run the code on your own machine and experiment with other data using scala-cli, either by cloning our git repository and then running
scala-cli . --main-class spark -- libs.csv
or by running it directly from the multi-file gist that we have prepared.
The release of Spark 3.2.0 for Scala 2.13 opens up the possibility of writing Apache Spark jobs in Scala 3. However, it is an uphill path, and there are many challenges ahead before this can be done confidently in production.
We are going to keep you posted on progress in this area. Follow us on social media and stay tuned for new blog posts.