How to mine Scala 3 compiler metadata with TASTy files

Natural language processing and machine learning keep growing in popularity, so VirtusLab had to take a closer look at what they mean for Scala. As the leader in Scala development, we decided to run a few small machine-learning experiments on Scala 3.

The initial obstacle was how to gather training data in an efficient and scalable way. Scala 3, like many contemporary programming languages, lets us write compiler plugins, which we could use to collect that data. However, they are cumbersome for this purpose, since they require us to:

gather source code from different sources
set up projects
compile them

At scale, this is a very tedious and nearly impossible task. Decompilers, however, allow parsing binaries without setting up the original project’s source code. In many languages this approach is limited, because decompilation may produce source code that is similar but not identical to the original.

Enter Scala 3, introducing a robust decompiler that grants access to nearly all compiler internals, mimicking the compilation process from source code. In this brief blog post, we share our experience of extracting a vast collection of Scala 3 compiler metadata and demonstrate how you can achieve the same in just 150 lines of code.

Explore TASTy files for comprehensive code analysis

The key is to utilize TASTy files, which are included in the jars uploaded to Maven Central alongside the class files. TASTy (Typed Abstract Syntax Trees) files are binary files that contain the complete, typed structure of the Scala source code. They store abundant metadata, such as Scaladoc comments, for all declared classes, objects, methods, and more. Because these files hold serialized ASTs, we can navigate them as if we were compiling the source code.
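To see this for yourself, you can list the TASTy entries of any Scala 3 jar. The following is a minimal sketch that uses only the JDK and the Scala standard library; the helper name `tastyEntries` is ours and purely illustrative.

```scala
import java.util.jar.JarFile
import scala.jdk.CollectionConverters.*
import scala.util.Using

// Hypothetical helper: returns the names of all .tasty entries in a jar.
def tastyEntries(jarPath: String): List[String] =
  Using.resource(new JarFile(jarPath)) { jar =>
    jar.entries().asScala      // Enumeration[JarEntry] viewed as an Iterator
      .map(_.getName)
      .filter(_.endsWith(".tasty"))
      .toList                  // materialize before the jar is closed
  }
```

Calling `tastyEntries` on a jar published for Scala 3 should return one `.tasty` path per top-level class, next to the usual `.class` files.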

TASTy files and the challenges

We aim to utilize the TASTy format. But how can we extract valuable information from it? Let’s break it down into smaller challenges:

How can we obtain the list of Scala 3 libraries and their Maven coordinates?
How can we efficiently fetch all of them, including their dependencies?
How do we parse and retrieve the desired data?

In reality, these questions are simpler to address than they appear. Let’s go through them in order.

1 Obtaining Scala 3 libraries and their Maven coordinates

Let’s turn to Scaladex, a comprehensive index of Scala libraries with a user-friendly browsing interface. Instead of relying on direct Maven integration or the internal Scaladex model, we opt to scrape the necessary data from a dedicated Scaladex web page. With the help of the jsoup library, we obtain the Maven coordinates of the newest Scala 3 libraries in just 40 lines of code.

The pipeline primarily involves a few transformations on the scraped HTML pages. Let’s have a look at the code itself:
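A minimal sketch of such a pipeline, not the original listing, could look as follows. It assumes that each coordinate appears as `group:artifact:version` text inside an element with the CSS class `maven-coordinate`; that selector, the `Coordinate` case class, and the jsoup version in the directive are illustrative assumptions, since the real Scaladex markup differs and changes over time.

```scala
//> using scala 3.3.1
//> using dep org.jsoup:jsoup:1.17.2

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import scala.jdk.CollectionConverters.*

// Hypothetical model of a Maven coordinate.
final case class Coordinate(group: String, artifact: String, version: String)

// Assumption: coordinates are rendered as "group:artifact:version"
// inside elements matching the (made-up) selector below.
def extractCoordinates(doc: Document): List[Coordinate] =
  doc.select(".maven-coordinate").asScala.toList
    .map(_.text().trim.split(":").toList)
    .collect { case List(g, a, v) => Coordinate(g, a, v) } // drop malformed rows
    .filter(_.artifact.endsWith("_3"))                     // keep Scala 3 artifacts
    .distinct
```

Against a live page, the document would come from `Jsoup.connect(url).get()` instead of `Jsoup.parse`, with `url` pointing at the page you decide to scrape.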

2 Fetching and downloading all libraries and dependencies

Once we have the library coordinates, we can download them using coursier, a handy Scala tool that saves us time. Coursier is a library with a user-friendly interface for fetching Maven packages. It is commonly used by Scala build tools or as a standalone command-line tool. Its API is easy to use, allowing us to obtain the desired jar file and its dependencies with a single call to Fetch():
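As a hedged illustration rather than the original listing, the call might be wrapped like this. The sketch assumes the coursier 2.1.x Scala API (`Fetch().addDependencies(...).run()`) used from Scala 3 through the 2.13 artifact; the helper names and the version in the directive are our assumptions.

```scala
//> using scala 3.3.1
//> using dep io.get-coursier:coursier_2.13:2.1.9

import coursier.Fetch
import coursier.core.{Dependency, Module, ModuleName, Organization}
import java.io.File

// Hypothetical helper: builds a coursier dependency from plain strings.
def toDependency(group: String, artifact: String, version: String): Dependency =
  Dependency(Module(Organization(group), ModuleName(artifact), Map.empty), version)

// Resolves and downloads the given dependencies together with their
// transitive dependencies. This performs network and disk access.
def fetchJars(deps: Seq[Dependency]): Seq[File] =
  Fetch().addDependencies(deps*).run()
```

For example, `fetchJars(Seq(toDependency("org.typelevel", "cats-core_3", "2.10.0")))` should return the library jar plus the jars it depends on, all stored in the local coursier cache.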

3 Parsing and retrieving data with the TASTy Inspector

We will use TASTy Inspector, a Scala 3 decompiler tool, to read TASTy files. To collect methods and their Scaladoc comments, we need to override a single method that serves as the callback for the TASTy files being read.

Let’s define our custom inspector:
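The sketch below shows one way such an inspector could look; it is not the code from our repository. It assumes the `scala3-tasty-inspector` artifact in a version matching the compiler, and the `MethodDoc` and `MethodDocInspector` names are illustrative.

```scala
//> using scala 3.3.1
//> using dep org.scala-lang::scala3-tasty-inspector:3.3.1

import scala.collection.mutable
import scala.quoted.*
import scala.tasty.inspector.*

// Hypothetical record: fully qualified method name plus its raw Scaladoc.
final case class MethodDoc(name: String, doc: String)

class MethodDocInspector extends Inspector:
  val collected = mutable.ListBuffer.empty[MethodDoc]

  def inspect(using Quotes)(tastys: List[Tasty[quotes.type]]): Unit =
    import quotes.reflect.*

    // Walks every tree and records method definitions that carry a docstring.
    object Collector extends TreeTraverser:
      override def traverseTree(tree: Tree)(owner: Symbol): Unit =
        tree match
          case method: DefDef =>
            method.symbol.docstring.foreach { doc =>
              collected += MethodDoc(method.symbol.fullName, doc)
            }
          case _ => ()
        super.traverseTree(tree)(owner) // keep descending into children

    for tasty <- tastys do
      Collector.traverseTree(tasty.ast)(tasty.ast.symbol)
```

Methods without a Scaladoc comment are skipped here; collecting them as well only requires handling the `None` case of `docstring`.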

Once the inspector is defined, we can easily run it in one line:
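A hedged sketch of that call, using `TastyInspector.inspectAllTastyFiles` with the target jar and a dependency classpath, is shown below. The `CountingInspector` is only a stand-in so that the block is self-contained, and `jarPaths` and `inspectJar` are hypothetical helper names.

```scala
//> using scala 3.3.1
//> using dep org.scala-lang::scala3-tasty-inspector:3.3.1

import java.io.File
import scala.quoted.*
import scala.tasty.inspector.*

// Stand-in inspector that only counts the TASTy files it receives.
class CountingInspector extends Inspector:
  var count = 0
  def inspect(using Quotes)(tastys: List[Tasty[quotes.type]]): Unit =
    count += tastys.size

// Keeps only jar files and converts them to absolute paths.
def jarPaths(files: Seq[File]): List[String] =
  files.map(_.getAbsolutePath).filter(_.endsWith(".jar")).toList

// Inspects all TASTy files in `target`, resolving symbols against `classpath`.
def inspectJar(target: File, classpath: Seq[File], inspector: Inspector): Boolean =
  TastyInspector.inspectAllTastyFiles(Nil, List(target.getAbsolutePath), jarPaths(classpath))(inspector)
```

The returned Boolean reports whether the run finished without errors, which makes it easy to log the libraries that fail to load.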

Note: Certain classes, like Cleaner for deobfuscating Scaladoc comments or custom Extractors, have been borrowed from the dotty repository and can be found in our repository.

How to extract source code and bytecode

To obtain the source code, we can utilize a built-in compiler printer that converts a Tree into Scala code. This incurs minimal cost, since the necessary files are already loaded. The result is cleaned code with fully qualified names resolved and with comments removed, as comments are already dropped by the scanner and parser phases.

If recovering the source code fails with an internal error, we simply discard the faulty Tree and return the placeholder “NO_SOURCECODE”.
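Source recovery with a fallback can be sketched as follows. We assume the reflection API's `show` method with `Printer.TreeCode` is the printer in question; `orPlaceholder` and `sourceOf` are illustrative names, not the ones used in our repository.

```scala
import scala.quoted.*
import scala.util.Try

// Evaluates `render` and falls back to `placeholder` on any non-fatal failure,
// including assertion errors raised inside the printer.
def orPlaceholder(placeholder: String)(render: => String): String =
  Try(render).getOrElse(placeholder)

// Renders a tree back to Scala code; meant to be called from inside
// an Inspector, where a Quotes instance is available.
def sourceOf(using Quotes)(tree: quotes.reflect.Tree): String =
  import quotes.reflect.*
  orPlaceholder("NO_SOURCECODE")(tree.show(using Printer.TreeCode))
```

Keeping the fallback in a separate function makes it reusable for other extraction steps that may fail on individual trees.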
For bytecode, we opted to use the Apache BCEL library for simplicity. As we have the classpath, which consists of the jars fetched by coursier, everything is readily available.

As with source code recovery, in some cases it was simpler to exclude unknown synthetic Scala classes and return “NO_BYTECODE” instead of searching for their correct names in bytecode class files.
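A minimal sketch of the bytecode lookup with Apache BCEL, under the assumption that classes are loaded through a `SyntheticRepository` built from the fetched classpath, might look like this. The function name and the BCEL version are our assumptions, and overloaded methods are not distinguished: the first method with a matching name wins.

```scala
//> using scala 3.3.1
//> using dep org.apache.bcel:bcel:6.8.0

import org.apache.bcel.util.{ClassPath, SyntheticRepository}
import scala.util.Try

// Returns the disassembled code of `methodName` in `className`,
// or the placeholder when the class or the method body cannot be found.
// `classpath` uses the platform path separator, like the JVM -cp option.
def bytecodeOf(classpath: String, className: String, methodName: String): String =
  Try {
    val repository = SyntheticRepository.getInstance(new ClassPath(classpath))
    val clazz = repository.loadClass(className) // throws ClassNotFoundException
    clazz.getMethods()
      .find(_.getName() == methodName)
      .flatMap(m => Option(m.getCode())) // abstract methods have no Code attribute
      .map(_.toString())
      .get
  }.getOrElse("NO_BYTECODE")
```

The classpath string can be built from the jars returned by coursier by joining their absolute paths with `java.io.File.pathSeparator`.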

By utilizing these methods, you can automatically extract data from all Scala 3 libraries indexed by Scaladex.

Conclusion

We have had the opportunity to work with the Scala 3 decompiler mechanism and have found numerous benefits. For one project, we specifically required the ASTs of methods and their corresponding comments.

The same mechanism, however, can be used to extract many other kinds of data about code structure for statistical analysis or machine learning tasks. While using TastyInspector, we encountered internal errors that prevented certain libraries from being parsed into the Scala AST model. To ensure that all produced TASTy files can be read correctly, we suggest considering the inclusion of these scripts as an additional step in the Scala Community Build.
The functional scripts can be found in our repository ScalaTastiesScrapper. Overall, our experience with the Scala 3 decompiler has provided us with valuable insights and expanded possibilities for its effective utilization. Go try it out yourself.

Curated by Sebastian Synowiec