



Learn how we used Scala 3’s decompiler to mine compiler metadata from TASTy files, streamlining data gathering in 150 lines of code.
Natural Language Processing and Machine Learning keep gaining popularity, so VirtusLab had to take a closer look at them from a Scala perspective. As the leader in Scala development, we decided to conduct small machine learning experiments on Scala 3.
The initial obstacle was how to gather training data in an efficient and scalable way. Scala 3, like many contemporary programming languages, lets us write compiler plugins that we could use for this. However, they are inconvenient, since they require us to:
A very tedious and nearly impossible task. Decompilers, however, allow parsing binaries without the need for the original project’s source code setup. Nevertheless, this approach has limitations in various languages as decompilation may produce similar but not identical source code.
Enter Scala 3, introducing a robust decompiler that grants access to nearly all compiler internals, mimicking the compilation process from source code. In this brief blog post, we share our experience of extracting a vast collection of Scala 3 compiler metadata and demonstrate how you can achieve the same in just 150 lines of code.
The key is to utilise TASTy files, which are included in target jars alongside binaries uploaded to Maven Central. TASTy files are binary files that contain the complete structure of the Scala source code in serialised form. They store abundant metadata, such as Scaladoc comments, for all declared classes, objects, methods, and more. Because these files hold serialised ASTs, we can navigate them as if we were compiling the source code.
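To see which TASTy files a given jar ships, it is enough to list its entries. The snippet below is a minimal sketch that uses only the JDK and the Scala standard library; the helper name `tastyEntries` is ours and is not part of the original scripts.

```scala
import java.util.jar.JarFile
import scala.jdk.CollectionConverters.*
import scala.util.Using

/** Returns the names of all `.tasty` entries inside the jar at `jarPath`.
  * `tastyEntries` is a hypothetical helper introduced for illustration only.
  */
def tastyEntries(jarPath: String): List[String] =
  Using.resource(new JarFile(jarPath)) { jar =>
    jar
      .entries()                      // java.util.Enumeration[JarEntry]
      .asScala                        // view it as a Scala Iterator
      .map(_.getName)
      .filter(_.endsWith(".tasty"))
      .toList                         // materialise before the jar is closed
  }
```

Calling it on any Scala 3 artifact downloaded from Maven Central should list one `.tasty` entry per top-level definition next to the usual `.class` files.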
We aim to utilise the TASTy format. But how can we extract valuable information from it? Let’s break it down into smaller challenges:
In reality, these questions are simpler to address than they appear. Let’s go through them in order.
Let’s turn to the Scaladex service, a comprehensive and user-friendly index of Scala libraries. Instead of relying on direct Maven integration or the internal Scaladex model, we opt to scrape the necessary data from dedicated Scaladex web pages. With the help of the JSoup dependency, we obtain the Maven coordinates of the newest Scala 3 libraries in just 40 lines of code.
The pipeline primarily involves a few transformations on the scraped HTML pages. Let’s have a look at the code itself:
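import org.jsoup.Jsoup
import org.jsoup.parser.Parser
import scala.collection.parallel.CollectionConverters.* // .par (scala-parallel-collections)
import scala.jdk.CollectionConverters.*                 // .asScala
import scala.util.Try

// Step 1: collect the project headers from the first 64 search result pages.
val elems = (1 to 64).par
  .flatMap { page =>
    Jsoup
      .connect(
        s"https://index.scala-lang.org/search?sort=stars&languages=3.x&q=*&page=$page"
      )
      .get()
      .select("h4")
      .eachText
      .asScala
  }
  // Step 2: for each project, read its latest version and the list of artifact names.
  .flatMap { header =>
    Try(
      Jsoup
        .connect(s"https://index.scala-lang.org/$header/artifacts/version")
        .get()
    ).toOption
      .map { page =>
        val version = page.select(".head-last-version").text.trim
        page.select("option").eachText.asScala.map((_, (header, version)))
      }
  }
  .flatten
  // Step 3: open each artifact page and turn its Maven snippet into "group:artifact:version".
  .flatMap { case (name, (header, version)) =>
    Try {
      val text = Jsoup
        .connect(
          s"https://index.scala-lang.org/$header/artifacts/$name/$version?binary-versions=_3"
        )
        .get()
        .select("#copy-maven")
        .text
      Jsoup.parse(text, "", Parser.xmlParser())
    }.toOption
      // Keep Scala 3 artifacts only.
      .filter(_.select("artifactId").text.endsWith("_3"))
      .map { doc =>
        doc.select("groupId").text + ":" +
          doc.select("artifactId").text + ":" +
          doc.select("version").text
      }
  }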
Once we have the library coordinates, we can download the libraries using coursier, a handy Scala tool that saves us time. Coursier is a library with a user-friendly interface for fetching Maven packages. It is commonly utilised by Scala build tools or as a standalone tool in a terminal. The API is easy to use, allowing us to obtain the desired jar file and its dependencies with a single call to Fetch():
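import coursier.*

// Resolves and downloads the artifact together with its transitive dependencies.
Fetch()
  .withRepositories(repositories)
  .withDependencies(
    Seq(
      Dependency(
        Module(Organization(organization), ModuleName(module)),
        version
      )
    )
  )
  .run()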
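The scraper produces coordinates as a single `group:artifact:version` string, while the call above takes the organization, module, and version separately. A minimal sketch of that split could look as follows; `Coordinates` and `parseCoordinates` are hypothetical names introduced here, and the sample coordinate is purely illustrative.

```scala
/** The three parts of a Maven coordinate (hypothetical helper type). */
final case class Coordinates(organization: String, module: String, version: String)

/** Splits "group:artifact:version" into its parts.
  * Returns None when the string does not have exactly three non-empty parts.
  */
def parseCoordinates(raw: String): Option[Coordinates] =
  raw.split(":") match
    case Array(org, module, version)
        if Seq(org, module, version).forall(_.nonEmpty) =>
      Some(Coordinates(org, module, version))
    case _ => None

// Illustrative input only:
val parsed = parseCoordinates("org.example:demo-lib_3:1.0.0")
```

Returning an `Option` keeps malformed entries from the scraped pages out of the download step without throwing.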
We will use TASTy Inspector, a Scala 3 decompiler tool, to read TASTy files. To collect methods and their Scaladoc comments, we need to implement a single method, inspect, which acts as a callback for the loaded TASTy files.
Let’s define our custom inspector:
import java.io.{BufferedWriter, File, FileWriter}
import scala.quoted.*
import scala.tasty.inspector.*

class MyInspector(fileOutputName: String, classpath: String) extends Inspector:
  val file = new File(fileOutputName)
  val bw = new BufferedWriter(new FileWriter(file))

  def inspect(using Quotes)(tastys: List[Tasty[quotes.type]]): Unit =
    import quotes.reflect.*

    // Accumulates every method definition (DefDef) found while walking a tree.
    object Traverser extends TreeAccumulator[List[DefDef]]:
      def foldTree(defdefs: List[DefDef], tree: Tree)(
          owner: Symbol
      ): List[DefDef] =
        val defdef = tree match
          case d: DefDef => List(d)
          case _         => Nil
        foldOverTree(defdefs ++ defdef, tree)(owner)
    end Traverser

    tastys
      .flatMap { tasty =>
        val tree = tasty.ast
        Traverser.foldTree(List.empty, tree)(tree.symbol)
      }
      // Keep only methods that carry a Scaladoc comment.
      .filter(_.symbol.docstring.nonEmpty)
      .flatMap { defdef =>
        val comment = Cleaner.clean(defdef.symbol.docstring.get).mkString(" ")
        // Skip abstract methods (no body) and comments that are empty after cleaning.
        Option.when(!comment.isBlank && defdef.rhs.isDefined)(
          s"${astCode(defdef)}␟${byteCode(defdef)}␟${sourceCode(defdef, true)}␟${sourceCode(defdef, false)}␟${comment}\n"
        )
      }
      .foreach(bw.write)
    bw.close()

  extension (s: String)
    // Replaces control characters, whitespace runs and escaped \t, \n, \r with spaces,
    // so that every record fits on one line.
    def removeNewLines: String =
      s.replaceAll("\\p{C}|\\s+|\\r$|\\\\t|\\\\n|\\\\r", " ")

  def astCode(using Quotes)(defdef: quotes.reflect.DefDef): String =
    Extractors.showTree(defdef).removeNewLines
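Once the inspector is defined, we can run it with a single call:
TastyInspector.inspectAllTastyFiles(
  Nil,                    // no standalone .tasty files
  List(classpath.head),   // the jar to inspect
  classpath.tail.toList   // its dependencies
)(
  // MyInspector expects the classpath as a single String; joining the entries with the
  // platform path separator is our assumption about how that string is built.
  new MyInspector(coordinates, classpath.mkString(java.io.File.pathSeparator))
)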
Note: Certain classes, like Cleaner for deobfuscating Scaladoc comments or custom Extractors, have been borrowed from the dotty repository and can be found in our repository.
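Each line written by the inspector holds five fields separated by the `␟` character: the AST, the bytecode, the source code with full names, the source code with short names, and the cleaned comment. A minimal sketch for reading such a line back might look like this; `MethodRecord` and `parseRecord` are hypothetical names that do not come from the original scripts.

```scala
/** One line of the inspector output, split into its five fields (hypothetical type). */
final case class MethodRecord(
    ast: String,
    byteCode: String,
    sourceFullNames: String,
    sourceShortNames: String,
    comment: String
)

/** Parses a single output line; returns None if it does not have exactly five fields. */
def parseRecord(line: String): Option[MethodRecord] =
  // A limit of -1 keeps trailing empty fields instead of dropping them.
  line.split("␟", -1) match
    case Array(ast, byteCode, full, short, comment) =>
      Some(MethodRecord(ast, byteCode, full, short, comment))
    case _ => None
```

This assumes that `␟` never occurs inside a field, which is the reason for choosing such an unusual separator in the first place.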
To obtain the source code, we can utilise a built-in compiler printer that converts a Tree into Scala code. It incurs minimal cost since the necessary files are already loaded. The result is cleaned code: names can be printed fully qualified, and comments are gone, since the scanner and parser phases have already removed them.
def sourceCode(using Quotes)(
    defdef: quotes.reflect.DefDef,
    fullNames: Boolean
): String =
  val sourceCode = Try(
    SourceCode
      .showTree(defdef)(SyntaxHighlight.plain, fullNames)
      .removeNewLines
  )
  sourceCode.toOption.getOrElse("NO_SOURCECODE")
If there were internal errors in recovering the source code, we just discard faulty Trees and return a placeholder “NO_SOURCECODE.”
For bytecode, we opted to use the Apache BCEL library for simplicity. As we have the classpath, which consists of the jars fetched by coursier, everything is readily available.
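import org.apache.bcel.classfile.Utility
import org.apache.bcel.util.{ClassPath, SyntheticRepository}

def byteCode(using Quotes)(defdef: quotes.reflect.DefDef): String =
  // Load the class that owns the method from the jars on the classpath.
  val methods = Try {
    SyntheticRepository
      .getInstance(ClassPath(classpath))
      // Map the owner's full name to the JVM class name: "$." becomes "$".
      .loadClass(defdef.symbol.owner.fullName.replaceAll("\\$\\.", "\\$"))
      .getMethods()
  }
  methods.toOption
    .flatMap {
      _.toList
        // The first method with a matching name wins; overloads are not distinguished.
        .find(_.getName == defdef.symbol.name)
        .map(_.getCode)
        .filter(_ != null) // abstract and native methods have no Code attribute
        .map(x =>
          Utility.codeToString(x.getCode, x.getConstantPool, 0, -1, true)
        )
        .map(_.toString.removeNewLines)
    }
    .getOrElse("NO_BYTECODE")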
Similar to source code recovery, in some cases, it was simpler to exclude unknown synthetic Scala classes and return “NO_BYTECODE” instead of searching for their correct names in bytecode class files.
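The only name translation applied before the class lookup is the replacement of `$.` with `$`. The snippet below isolates that step as a minimal sketch; the function name `jvmClassName` and the sample names are ours, and we assume owner names of the shape shown, which may not cover every synthetic class.

```scala
/** Rewrites "$." to "$" in an owner's full name, mirroring the lookup in byteCode.
  * In the replacement string, "\\$" yields a literal dollar sign, because an
  * unescaped "$" would be read as a group reference by replaceAll.
  */
def jvmClassName(ownerFullName: String): String =
  ownerFullName.replaceAll("\\$\\.", "\\$")

// Hypothetical names, for illustration only:
val nested = jvmClassName("demo.Outer$.Inner$") // "demo.Outer$Inner$"
val plain  = jvmClassName("demo.Plain")         // unchanged: "demo.Plain"
```

Names that still do not resolve after this rewrite end up as “NO_BYTECODE” through the `Try` in `byteCode`.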
By utilising these methods, you can automatically extract data from all Scala 3 libraries indexed by Scaladex.
We have had the opportunity to work with the Scala 3 decompiler mechanism and have found numerous benefits. For one project, we specifically required the ASTs of methods and their corresponding comments.
The same mechanism can, however, extract many other kinds of data about code structure for statistical analysis or machine learning tasks. Throughout our usage of TastyInspector, we encountered internal errors that prevented the parsing of certain libraries into the Scala AST model. To ensure that all produced TASTy files can be read correctly, we suggest including scripts like these as an additional step in the Scala Community Build.
The functional scripts can be found in our repository ScalaTastiesScrapper. Overall, our experience with the Scala 3 decompiler has provided us with valuable insights and expanded possibilities for its effective utilisation. Go try it out yourself.