Natural language processing and machine learning keep growing in popularity, so VirtusLab naturally wanted a closer look at them from a Scala perspective. As the leader in Scala development, we decided to run a few small machine-learning experiments on Scala 3.
The initial obstacle was how to gather training data in an efficient and scalable way. Scala 3, like many contemporary programming languages, lets us write compiler plugins, which we could use for this. However, they are inconvenient here: a plugin only runs during compilation, so for every library we would have to:
- gather source code from many different places
- set up projects
- compile them
At scale, this is very tedious and nearly impossible. Decompilers, by contrast, work directly on binaries, with no need to set up the original project’s source code. In many languages, however, this approach is limited, because decompilation may produce code that is similar but not identical to the original source.
Enter Scala 3, which introduces a robust decompiler that grants access to nearly all compiler internals and mimics compilation from source code. In this brief blog post, we share our experience of extracting a vast collection of Scala 3 compiler metadata and show how you can do the same in just 150 lines of code.
Explore TASTy files for comprehensive code analysis
The key is to use TASTy files, which are packaged in the published jars alongside the binaries uploaded to Maven Central. TASTy (Typed Abstract Syntax Trees) files are a binary serialization of the complete, typed structure of Scala source code. They store abundant metadata, such as Scaladoc comments, for all declared classes, objects, methods, and more. Because the content is a serialized AST, we can navigate it as if we were compiling the source code.
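To make this more concrete, here is a minimal sketch of what walking those trees can look like with the TASTy inspector API. It assumes the `scala3-tasty-inspector` library is on the classpath in the same version as the compiler, and that the jar's own dependencies do not need to be resolved for the definitions we visit. `DocCollector` and `collectDocs` are hypothetical names chosen for illustration.

```scala
// Assumes the "org.scala-lang" %% "scala3-tasty-inspector" dependency,
// in the same version as the Scala 3 compiler in use.
import scala.collection.mutable
import scala.quoted.*
import scala.tasty.inspector.*

// Hypothetical inspector: collects Scaladoc comments keyed by the
// fully qualified name of the definition they belong to.
class DocCollector extends Inspector:
  val docs: mutable.Map[String, String] = mutable.LinkedHashMap.empty

  def inspect(using Quotes)(tastys: List[Tasty[quotes.type]]): Unit =
    import quotes.reflect.*

    object Traverser extends TreeTraverser:
      override def traverseTree(tree: Tree)(owner: Symbol): Unit =
        tree match
          case definition: Definition =>
            // Record the comment, if the definition has one.
            definition.symbol.docstring.foreach { doc =>
              docs(definition.symbol.fullName) = doc
            }
          case _ => ()
        // Keep descending into nested definitions.
        super.traverseTree(tree)(owner)

    tastys.foreach(tasty => Traverser.traverseTree(tasty.ast)(Symbol.spliceOwner))

// Usage: pass the path to a Scala 3 jar as the only argument.
@main def collectDocs(jarPath: String): Unit =
  val collector = new DocCollector
  TastyInspector.inspectTastyFilesInJar(jarPath)(collector)
  collector.docs.foreach((name, doc) => println(s"$name\n$doc\n"))
```

For libraries whose trees refer to symbols from other artifacts, `TastyInspector.inspectAllTastyFiles` accepts an additional dependency classpath, which is likely the safer entry point.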
TASTy files and the challenges
We want to use the TASTy format, but how can we extract valuable information from it? Let’s break the problem down into smaller challenges:
- How can we obtain the list of Scala 3 libraries and their Maven coordinates?
- How can we efficiently fetch all of them, including their dependencies?
- How do we parse and retrieve the desired data?
In practice, these questions are easier to answer than they appear. Let’s go through them in order.
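One way to picture the three questions is as three small functions chained together. The sketch below only illustrates that shape: `MavenCoordinates`, `Pipeline`, and the method names are hypothetical, and the actual discovery, fetching, and inspection logic is left abstract.

```scala
import java.nio.file.Path

// Hypothetical model of one published artifact.
final case class MavenCoordinates(organization: String, artifact: String, version: String):
  override def toString: String = s"$organization:$artifact:$version"

object MavenCoordinates:
  // Parses "organization:artifact:version"; returns None for any other shape.
  def parse(raw: String): Option[MavenCoordinates] =
    raw.trim.split(":") match
      case Array(org, artifact, version)
          if org.nonEmpty && artifact.nonEmpty && version.nonEmpty =>
        Some(MavenCoordinates(org, artifact, version))
      case _ => None

// One abstract method per question; Record is whatever data we want to extract.
trait Pipeline[Record]:
  def listLibraries(): List[MavenCoordinates]      // 1. discover Scala 3 libraries
  def fetch(lib: MavenCoordinates): List[Path]     // 2. download the jar and its dependencies
  def inspect(classpath: List[Path]): List[Record] // 3. read TASTy and extract the data

  // Runs the three stages in order for every discovered library.
  def run(): List[Record] =
    listLibraries().flatMap(lib => inspect(fetch(lib)))
```

Keeping the stages separate means each one can be swapped or tested in isolation, for example by feeding `inspect` a jar that is already on disk.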
1 Obtaining Scala 3 libraries and their Maven coordinates
Let’s turn to Scaladex, a comprehensive, user-friendly index of Scala libraries. Instead of relying on direct Maven integration or the internal Scaladex model, we opt to scrape the necessary data from a dedicated Scaladex web page. With the help of the JSoup library, we obtain the Maven coordinates of the newest Scala 3 libraries in just 40 lines of code.
The pipeline primarily involves a few transformations on the scraped HTML pages. Let’s have a look at the code itself: