After the release of Scala 3, one of the most common questions asked by developers was: “When will we be able to write Spark jobs using Scala 3?”. Until now, the answer was: “Not yet”, but everything changed with the release of Spark 3.2.0, which brought Scala 2.13 support. Scala 3 projects can depend on Scala 2.13 libraries, so can we finally write Spark jobs using Scala 3? The answer is: yes, in theory.
DISCLAIMER: Many things shown here are still under development or experimental and may not work properly in some cases.
In this post, we will show an example Spark project using Scala 3.
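For reference, this is roughly what depending on the Scala 2.13 build of Spark can look like from a Scala 3 project. The snippet below is only a sketch: the Scala 3 version is an arbitrary pick, and our example project itself is driven by scala-cli rather than sbt.

// build.sbt sketch: a Scala 3 project using the Scala 2.13 artifacts of Spark
scalaVersion := "3.1.0"

libraryDependencies += ("org.apache.spark" %% "spark-sql" % "3.2.0")
  .cross(CrossVersion.for3Use2_13) // resolve spark-sql_2.13 instead of spark-sql_3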
Blog posts like this one usually stick to simple, familiar examples, such as processing a small dataset provided as JSON files. Since we are on the subject of Scala 3, we decided to use Scala 3 code itself as our dataset. Scala 3 introduces the TASTy format, which defines serialized Typed Abstract Syntax Trees. TASTy files, in short, contain the view that the compiler has of your program, with all types inferred, implicits expanded, and so on. With Scala 3, TASTy files are shipped together with the bytecode inside jars as *.tasty files. In our example, we are going to analyze the code of the most popular libraries from the Scala 3 ecosystem.
Loading and parsing TASTy is out of scope for this blog post, so we will only briefly describe what that part does. A more detailed explanation can be found in the comments in the source code on GitHub.
In the first step, we pull the .jar files from Maven Central by simply sending an HTTP request for each library and extract the content of the .tasty files into a TastyFile case class:
case class Library(org: String, name: String, version: String)
case class TastyFile(lib: Library, path: String, content: Array[Byte])
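The actual downloading and extraction logic lives in the repository. A simplified sketch of the idea, assuming the standard Maven Central directory layout and a helper name of our own choosing, could look like this:

import java.net.URL
import java.util.zip.ZipInputStream

// hypothetical helper: download a library's jar and collect its .tasty entries
// (error handling and closing the stream are omitted for brevity)
def tastyFiles(lib: Library): List[TastyFile] =
  val jarUrl = "https://repo1.maven.org/maven2/" +
    s"${lib.org.replace('.', '/')}/${lib.name}/${lib.version}/${lib.name}-${lib.version}.jar"
  val zip = ZipInputStream(URL(jarUrl).openStream())
  Iterator
    .continually(zip.getNextEntry)
    .takeWhile(_ != null)
    .filter(_.getName.endsWith(".tasty"))
    .map(entry => TastyFile(lib, entry.getName, zip.readAllBytes()))
    .toList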
Then, using tasty-query, we extract information about the base type, kind, and position of each tree into another case class.
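The exact definition is in the repository; a rough sketch of its shape, in which every name except topLevelType (the column that the Spark job below groups by) is our assumption, might be:

// hypothetical shape of the per-tree information extracted with tasty-query
case class TreeInfo(
  lib: Library,         // library the tree comes from
  sourceFile: String,   // .tasty file the tree was read from
  treeKind: String,     // kind of the tree node
  topLevelType: String, // base type of the tree, or NoType if it has none
  position: Int         // position of the tree within the file
)

val NoType = "<no type>" // assumed sentinel, referenced by the filter in the Spark job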
Let’s try to run it on a single library, e.g. cats-core. With scala-cli, a brand new tool, we can run the code directly from our repository locally with just one command:
scala-cli run https://raw.githubusercontent.com/VirtusLab/types-usage-counter/master/tasty.scala -- org.typelevel cats-core_3 2.6.1
The most popular types in cats-core are Tuple2, Object and Function1. These types are very common in the codebase and quite general, so this result looks reasonable. It seems our approach works, so let’s run the same algorithm using Spark.
The next step was to create a Spark application to load, process, and analyze the data, preferably on a cluster. The processing part is quite simple: we are looking for the 10 most popular types. We ended up with the following code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import scala3encoders.given

@main def spark(args: String*) =
  val csvPath = args.headOption.getOrElse("libs.csv")

  val spark = SparkSession.builder().master("local").getOrCreate()
  import spark.implicits._

  // read the list of libraries and extract TreeInfo rows from their tasty files
  val libs = spark.read.option("header", true).csv(csvPath).as[Library]
  val treeInfos = processLibraries(libs)

  // count the most frequently used top-level types
  treeInfos
    .filter(col("topLevelType").=!=(NoType))
    .groupBy("topLevelType")
    .count()
    .sort(col("count").desc)
    .show(10, false)

  spark.stop()
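One line in the snippet above deserves a closer look: import scala3encoders.given. Spark needs an Encoder instance for every case class used in a Dataset, and its built-in derivation relies on Scala 2 implicit machinery (runtime reflection and TypeTags) that is not available when compiling with Scala 3. Without an encoder in scope the code simply does not compile, and the error message is not particularly helpful:

value toDF is not a member of … — did you mean libs.coll?

Fortunately, the spark-scala3 library provides derivation of Spark encoders for Scala 3 case classes, and the single given import at the top of the file is enough to make the code compile.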
Our input consists of over 500 artifacts compiled with Scala 3, obtained from the Scaladex API. Our cluster has one master and one worker node, running Docker containers from big-data-europe tweaked to use Spark 3.2 and Scala 2.13.
The most popular types are:
topLevelType                               count
scala.Any                                  75264
scala.Int                                  49041
java.lang.Object                           42730
scala.Boolean                              23670
scala.Predef                               21112
java.lang.String                           21124
scala.Tuple2$                              19279
scala.runtime.ModuleSerializationProxy     18783
java.lang.IndexOutOfBoundsException        17262
scala.package                              16605
Our method for processing the TASTy data itself is very naive, so the results may be a bit skewed. Nevertheless, the high usage of Int, Boolean, Tuple2$ (Tuple2’s companion object), Function1 and Option indicates that our data makes some sense.
You can run the code on your own machine and experiment with other data using scala-cli, either by cloning our git repository and then running
scala-cli . --main-class spark -- libs.csv
or by running it directly from the multi-file gist that we have prepared.
The release of Spark 3.2.0 for Scala 2.13 opens up the possibility of writing Apache Spark jobs in Scala 3. However, it is an uphill path, and there are many challenges ahead before this can be done confidently in production.
We are going to keep you posted on progress in this area. Follow us on social media and stay tuned for new blog posts.