How to mine Scala 3 compiler metadata using TASTy files

Learn how we leveraged Scala 3’s decompiler to mine compiler metadata from TASTy files, streamlining data gathering in 150 lines of code.

Natural Language Processing and Machine Learning keep gaining popularity, so VirtusLab naturally wanted a closer look at what they could mean for Scala. As the leader in Scala development, we decided to run a few small machine learning experiments on Scala 3.

The initial obstacle was gathering training data in an efficient and scalable way. Scala 3, like many contemporary programming languages, lets us write compiler plugins, which we could have used. However, they are cumbersome for this purpose, since they require us to:

  • gather source code from many different places
  • set up the projects
  • compile them

Doing this for a large number of projects is tedious to the point of being nearly impossible. Decompilers, however, can parse binaries without the original project’s source code or build setup. In many languages this approach is limited, because decompilation may produce source code that is similar but not identical to the original.

Enter Scala 3, introducing a robust decompiler that grants access to nearly all compiler internals, mimicking the compilation process from source code. In this brief blog post, we share our experience of extracting a vast collection of Scala 3 compiler metadata and demonstrate how you can achieve the same in just 150 lines of code.

Explore TASTy files for comprehensive code analysis

The key is to use TASTy files, which are included alongside the binaries in the jars uploaded to Maven Central. A TASTy file is a binary, serialised representation of the complete structure of the Scala source code. It stores abundant metadata, such as Scaladoc comments, for all declared classes, objects, methods, and more. Because the content is a serialised abstract syntax tree (AST), we can navigate it as if we were compiling the source code.
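To check that a given jar actually ships TASTy files, it is enough to list its entries. The snippet below is a minimal sketch using only the Java standard library; `tastyEntries` is a hypothetical helper name, not something from our scripts.

```scala
import java.util.jar.JarFile
import scala.jdk.CollectionConverters.*
import scala.util.Using

// Hypothetical helper: returns the names of all .tasty entries found in a jar.
def tastyEntries(jarPath: String): List[String] =
  Using.resource(new JarFile(jarPath)) { jar =>
    jar
      .entries()
      .asScala
      .map(_.getName)
      .filter(_.endsWith(".tasty"))
      .toList // materialise before the jar is closed
  }

// Prints one entry name per line for the jar given on the command line.
@main def listTasty(jarPath: String): Unit =
  tastyEntries(jarPath).foreach(println)
```

A jar compiled with Scala 2 or Java should simply yield an empty list here.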

TASTy files and the challenges

We aim to utilise the TASTy format. But how can we extract valuable information from it? Let’s break it down into smaller challenges:

  1. How can we obtain the list of Scala 3 libraries and their Maven coordinates?
  2. How can we efficiently fetch all of them, including their dependencies?
  3. How do we parse and retrieve the desired data?

In practice, these questions are simpler to address than they appear. Let’s go through them in order.

1 Obtaining Scala 3 libraries and their Maven coordinates

Let’s turn to Scaladex, a comprehensive index of Scala libraries with a user-friendly way of browsing them. Instead of relying on direct Maven integration or the internal Scaladex model, we scrape the necessary data from the Scaladex web pages. With the help of the JSoup dependency, we obtain the Maven coordinates of the newest Scala 3 libraries in just 40 lines of code.

The pipeline primarily involves a few transformations on the scraped HTML pages. Let’s have a look at the code itself:

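import org.jsoup.Jsoup
import org.jsoup.parser.Parser
import scala.collection.parallel.CollectionConverters.* // .par (scala-parallel-collections)
import scala.jdk.CollectionConverters.* // .asScala
import scala.util.Try

// Walk the first 64 Scaladex search result pages in parallel.
val elems = (1 to 64).par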
 .flatMap { page =>
   Jsoup
     .connect(
       s"https://index.scala-lang.org/search?sort=stars&languages=3.x&q=*&page=$page"
     )
     .get()
     .select("h4")
     .eachText
     .asScala
 }
 .flatMap { header =>
   Try(
     Jsoup
       .connect(s"https://index.scala-lang.org/$header/artifacts/version")
       .get()
   ).toOption
     .map { page =>
       val version = page.select(".head-last-version").text.trim
       page.select("option").eachText.asScala.map((_, (header, version)))
     }
 }
 .flatten
 .flatMap { case (name, (header, version)) =>
   Try {
     val text = Jsoup
       .connect(
         s"https://index.scala-lang.org/$header/artifacts/$name/$version?binary-versions=_3"
       )
       .get()
       .select("#copy-maven")
       .text
     Jsoup.parse(text, "", Parser.xmlParser())
   }.toOption
     .filter(_.select("artifactId").text.endsWith("_3"))
     .map { doc =>
       doc.select("groupId").text + ":" + doc
         .select("artifactId")
         .text + ":" + doc.select("version").text
     }
 }

2 Fetching and downloading all libraries and dependencies

Once we have the library coordinates, we can download the libraries using coursier, a handy Scala tool that saves us time. Coursier is a library with a user-friendly interface for fetching Maven packages, commonly used by Scala build tools or as a standalone tool in a terminal. Its API is easy to use: a single Fetch() call chain gives us the desired jar file together with its dependencies:

import coursier.*

// Resolve and download the artifact together with its transitive dependencies.
Fetch()
 .withRepositories(repositories)
 .withDependencies(
   Seq(
     Dependency(
       Module(Organization(organization), ModuleName(module)),
       version
     )
   )
 )
 .run
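The scraper produces coordinates as plain strings, while the Fetch() call needs the organization, module and version separately. A minimal sketch of bridging the two is shown below; `Coordinate` and `parseCoordinate` are hypothetical names introduced for illustration, and the library in the usage comment is only an example.

```scala
// Hypothetical container for the three parts that the Fetch() call needs.
final case class Coordinate(organization: String, module: String, version: String)

// Splits "organization:module:version"; anything else is rejected with None.
def parseCoordinate(raw: String): Option[Coordinate] =
  raw.trim.split(":") match
    case Array(org, module, version)
        if org.nonEmpty && module.nonEmpty && version.nonEmpty =>
      Some(Coordinate(org, module, version))
    case _ => None

// Usage: parseCoordinate("org.typelevel:cats-core_3:2.9.0")
```

Returning an Option keeps malformed scrape results from failing the whole batch; they can be dropped with flatMap before any download starts.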

3 Parsing and retrieving data with the TASTy Inspector

We will use TASTy Inspector, a Scala 3 decompiler tool, to read TASTy files. To collect methods and their Scaladoc comments, we only need to implement a single method, inspect, which acts as a callback that receives the loaded TASTy files.

Let’s define our custom inspector:

import java.io.{BufferedWriter, File, FileWriter}
import scala.quoted.*
import scala.tasty.inspector.*

class MyInspector(fileOutputName: String, classpath: String) extends Inspector:
  val file = new File(fileOutputName)
  val bw = new BufferedWriter(new FileWriter(file))

  def inspect(using Quotes)(tastys: List[Tasty[quotes.type]]): Unit =
    import quotes.reflect.*

    // Collects every method definition (DefDef) found while walking a tree.
    object Traverser extends TreeAccumulator[List[DefDef]]:
      def foldTree(defdefs: List[DefDef], tree: Tree)(
          owner: Symbol
      ): List[DefDef] =
        val defdef = tree match
          case d: DefDef => List(d)
          case _         => Nil
        foldOverTree(defdefs ++ defdef, tree)(owner)
    end Traverser

    tastys
      .flatMap { tasty =>
        val tree = tasty.ast
        Traverser.foldTree(List.empty, tree)(tree.symbol)
      }
      // Keep only the methods that carry a Scaladoc comment.
      .filter(_.symbol.docstring.nonEmpty)
      .flatMap { defdef =>
        val comment = Cleaner.clean(defdef.symbol.docstring.get).mkString(" ")
        // Skip methods without a body or with an empty comment.
        // Each record is one line; its fields are separated by ␟.
        Option.when(!comment.isBlank && defdef.rhs.isDefined)(
          s"${astCode(defdef)}␟${byteCode(defdef)}␟${sourceCode(defdef, true)}␟${sourceCode(defdef, false)}␟${comment}\n"
        )
      }
      .foreach(bw.write)

    bw.close()

  extension (s: String)
    // Replaces control characters, whitespace runs and escaped \t, \n, \r with a space.
    def removeNewLines: String =
      s.replaceAll("\\p{C}|\\s+|\\r$|\\\\t|\\\\n|\\\\r", " ")

  def astCode(using Quotes)(defdef: quotes.reflect.DefDef): String =
    Extractors.showTree(defdef).removeNewLines

Once the inspector is defined, we can run it with a single call:

TastyInspector.inspectAllTastyFiles(
 Nil,
 List(classpath.head),
 classpath.tail.toList
)(
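 // MyInspector expects the classpath as one string, so join the entries.
 new MyInspector(coordinates, classpath.mkString(java.io.File.pathSeparator))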
)
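The call above relies on the first classpath entry being the library itself and the remaining entries being its dependencies. A minimal sketch of making that assumption explicit, and of handling an empty download result without an exception, could look like this; `InspectionInput` and `inspectionInput` are hypothetical names and not part of TASTy Inspector.

```scala
import java.io.File

// Hypothetical container: the jar to inspect plus the jars it depends on.
final case class InspectionInput(jar: String, dependencies: List[String])

// Assumes the first fetched file is the library itself and the rest are
// its dependencies, which is the convention the head/tail split relies on.
def inspectionInput(fetched: Seq[File]): Option[InspectionInput] =
  fetched.map(_.getPath).toList match
    case jar :: dependencies => Some(InspectionInput(jar, dependencies))
    case Nil                 => None
```

With such a helper, a library for which nothing was downloaded is skipped instead of failing on classpath.head.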

Note: Certain classes, like Cleaner for cleaning up Scaladoc comments or the custom Extractors, have been borrowed from the dotty repository and can be found in our repository.

How to extract source code and bytecode

To obtain the source code, we can use a built-in compiler printer that converts the Tree into Scala code. It incurs minimal cost, since the necessary files are already loaded. The result is cleaned-up code: names can be printed fully qualified, and comments are gone, because the scanner and parser phases have already removed them.

def sourceCode(using Quotes)(
   defdef: quotes.reflect.DefDef,
   fullNames: Boolean
): String =
 val sourceCode = Try(
   SourceCode
     .showTree(defdef)(SyntaxHighlight.plain, fullNames)
     .removeNewLines
 )
 sourceCode.toOption.getOrElse("NO_SOURCECODE")

If internal errors occur while recovering the source code, we simply discard the faulty Trees and return the placeholder “NO_SOURCECODE”.

For bytecode, we opted to use the Apache BCEL library for simplicity. Since the classpath consists of the jars fetched by coursier, everything is readily available.

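import org.apache.bcel.classfile.Utility
import org.apache.bcel.util.{ClassPath, SyntheticRepository}
import scala.util.Try

// Looks the method up by name in the owner's class file; the first match wins.
def byteCode(using Quotes)(defdef: quotes.reflect.DefDef): String =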
 val reader = Try {
   SyntheticRepository
     .getInstance(ClassPath(classpath))
     .loadClass(defdef.symbol.owner.fullName.replaceAll("\\$\\.", "\\$"))
     .getMethods()
 }
 reader.toOption
   .flatMap {
     _.toList
       .find(_.getName == defdef.symbol.name)
       .map(_.getCode)
       .filter(_ != null)
       .map(x =>
         Utility.codeToString(x.getCode, x.getConstantPool, 0, -1, true)
       )
       .map(_.toString.removeNewLines)
   }
   .getOrElse("NO_BYTECODE")

Similar to source code recovery, in some cases, it was simpler to exclude unknown synthetic Scala classes and return “NO_BYTECODE” instead of searching for their correct names in bytecode class files.
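The only name translation the byteCode function attempts is the replaceAll call on the owner’s full name. The sketch below isolates it under the assumption, inferred from that regular expression, that nested objects show up with `$.` between levels while class files use a plain `$`; `jvmClassName` is a hypothetical helper name.

```scala
// Hypothetical helper mirroring the replaceAll call used in byteCode:
// every "$." becomes "$", everything else is left untouched.
def jvmClassName(ownerFullName: String): String =
  ownerFullName.replaceAll("\\$\\.", "\\$")

// Usage: jvmClassName("foo.Outer$.Inner$") yields "foo.Outer$Inner$"
```

Owners that need any other translation are not handled by this rule, which is where the “NO_BYTECODE” placeholder comes in.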

By utilising these methods, you can automatically extract data from all Scala 3 libraries indexed by Scaladex.
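Each line written by the inspector holds five fields separated by ␟: the AST, the bytecode, the source code with fully-qualified names, the source code with short names, and the comment. A minimal sketch of reading such a line back, for example before feeding it to a training pipeline, is shown below. `MethodRecord`, `parseRecord` and `isComplete` are hypothetical names; the separator, the placeholders and the removeNewLines regular expression are the ones used above.

```scala
// Field separator used when the inspector writes a record: U+241F (␟).
val FieldSeparator = "\u241F"

// Same normalisation as the removeNewLines extension in the inspector.
// It is what keeps every field, and therefore every record, on one line.
extension (s: String)
  def removeNewLines: String =
    s.replaceAll("\\p{C}|\\s+|\\r$|\\\\t|\\\\n|\\\\r", " ")

// Hypothetical record type; the field order follows the written string.
final case class MethodRecord(
    ast: String,
    byteCode: String,
    sourceFullNames: String,
    sourceShortNames: String,
    comment: String
):
  // True when neither placeholder is present in the record.
  def isComplete: Boolean =
    byteCode != "NO_BYTECODE" &&
      sourceFullNames != "NO_SOURCECODE" &&
      sourceShortNames != "NO_SOURCECODE"

// Parses one line (without its trailing newline); malformed lines give None.
def parseRecord(line: String): Option[MethodRecord] =
  line.split(FieldSeparator, -1) match
    case Array(ast, byteCode, fullSrc, shortSrc, comment) =>
      Some(MethodRecord(ast, byteCode, fullSrc, shortSrc, comment))
    case _ => None
```

Filtering on isComplete is one way to drop records where source code or bytecode recovery fell back to a placeholder.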

Conclusion

We have had the opportunity to work with the Scala 3 decompiler mechanism and have found numerous benefits. For one project, we specifically required the ASTs of methods and their corresponding comments. 

The same mechanism can be used to extract other data about code structure for statistical analysis or machine learning tasks. While using TastyInspector, we did encounter internal errors that prevented some libraries from being parsed into the Scala AST model. To ensure that all produced TASTy files can be read correctly, we suggest considering the inclusion of these scripts as an additional step in the Scala Community Build.

The working scripts can be found in our repository, ScalaTastiesScrapper. Overall, our experience with the Scala 3 decompiler has provided us with valuable insights and expanded possibilities for its effective utilisation. Go try it out yourself.

Written by

Andrzej Ratajczak Jun 28, 2023