Large Language Models: How to use open source alternatives to ChatGPT for Scala documentation


Joanna Sendorek

Senior Data Scientist

Paulina Żak

Senior ML Engineer

15-minute read

Large Language Models have recently seen a significant rise in interest and usage, with models like ChatGPT gaining widespread attention for their impressive natural language processing capabilities. These models have been applied in a variety of settings, and their potential for innovation across domains is still being explored.

In this article, we focus on utilizing Large Language Models to assist programmers in answering Scala3-related questions. As programming languages continue to evolve and diversify, keeping up with the latest developments can be daunting, especially for those new to the field. Large Language Models can aid developers with quick and accurate responses to their inquiries, saving them valuable time and energy.

By using state-of-the-art Large Language Models to help programmers navigate the complexities of Scala3, we can potentially improve the efficiency and effectiveness of the programming process. 

Our exploration will delve into the realm of Large Language Models, beginning with GPT-4 and moving on to open-source alternatives such as flan-alpaca, BERT, and gpt4all. Our primary goal is to determine whether leveraging these models saves time for programmers seeking assistance with Scala3-related inquiries.

Let’s have a first glance at the models:

| Model name | Model size (params) | Accuracy | Free | License |
| --- | --- | --- | --- | --- |
| GPT-4 | n/a | ✩✩✩✩✩ | – | Commercial API |
| ChatGPT (GPT-3.5) | n/a | ✩✩✩✩ | – | Commercial API |
| davinci | 175 bn | ✩✩✩ | – | Commercial API |
| gpt4all | 7 bn | ✩✩ | ✓ | NC Research |
| flan-alpaca-xl | 3 bn | ✩✩ | ✓ | CC BY-NC 4.0 |
| BERT for QA | 340 m | – | ✓ | Apache 2.0 |
| flan-alpaca-large | 770 m | – | ✓ | CC BY-NC 4.0 |

To prepare our validation dataset, we compiled a roster of Scala-oriented questions along with relevant excerpts from official Scala documentation. We’ve selected these excerpts based on their ability to provide answers to the questions or, at the very least, contextual information that could assist in finding the answers.

As we begin our examination of the dataset, let us consider the following question from our list: “How can we convert a case class Person(name: String, surname: String) to Json?”

The Scala toolkit documentation on GitHub offers a response to this query.

GPT-4: The Large Language Model leader

In today’s world, any article that seeks to compare Natural Language Processing (NLP) solutions must consider the relevance of GPT-4. Evaluating its capabilities requires no additional setup; we can simply input the question into the prompt window without the need for coding at this stage. However, to maximize the potential of GPT-4, we must structure the prompt in a specific manner, including:

  1. the role: “You are a Q&A system”
  2. the context: “You will receive questions from a programmer seeking assistance with Scala3 documentation”
  3. precise instructions: “Answer truthfully, and if you are unsure of the answer, respond with ‘I don’t know’”
  4. precise constraints: “Be short and precise”

Of these steps, the third is particularly crucial, as it aims to lower the possibility of hallucinations, that is, confidently delivered but unreliable responses, when GPT-4 is unable to provide a suitable answer.
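For teams that prefer automation, the same structured prompt can also be sent programmatically. Below is a minimal sketch using the OpenAI Python client (pre-1.0 chat API); the system message simply concatenates the four elements above.

```python
# A minimal sketch of sending the structured prompt programmatically,
# assuming the pre-1.0 `openai` package and an OPENAI_API_KEY in the env.
import openai

system_prompt = (
    "You are a Q&A system. You will receive questions from a programmer "
    "seeking assistance with Scala3 documentation. Answer truthfully, and "
    "if you are unsure of the answer, respond with 'I don't know'. "
    "Be short and precise."
)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "How can we convert a case class "
                                    "Person(name: String, surname: String) to Json?"},
    ],
)
print(response.choices[0].message.content)
```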

Let’s have a look at the response generated by GPT-4:

[Screenshot: GPT-4's response to the example question]

GPT-4 delivered an impressive answer by correctly identifying the context of the Scala question and providing a concise solution using the Circe library. Even without any specific knowledge provided in the prompt, GPT-4 was able to provide a helpful code sample. 

However, it is important to consider the costs and limitations associated with the GPT-4 web app, as well as potential legal issues regarding uploading code samples:

  • Limits the message rate to 25 messages per 3 hours, which might be enough for one developer, yet restricts the efficiency of a whole team.
  • Poses legal issues with uploading code samples or adapting a model for specific closed-source libraries.
  • Confines information up until September 2021, which may render some answers outdated.

Open source: More Large Language Models

Now we want to explore open-source models to see if they deliver comparable results to GPT-4.

While GPT-4 has proven to have a good understanding of the Scala realm without context, other LLMs might deliver less detailed knowledge due to smaller training sets and fewer parameters (up to 20 billion for most models). However, we can try to improve the results of open-source models by supplying additional context with the query, such as a paragraph or a document, which we will delve into later.

The Large Language Model leader's answers are always abstractive, meaning it answers the question “in its own words”. The variety of open-source models allows us to choose between abstractive and extractive answering methods. The former produces more coherent text, while the latter points directly to the place in the context where the answer can be found.

Abstractive and extractive answering

Extractive answering involves the machine identifying and selecting a subset of text from the input document that directly answers the question. This approach results in more precise and specific answers. However, it struggles when the input text lacks a direct answer or when a rephrased response is more desirable. Abstractive answering avoids those limitations, as the model answers the question in its own words. This approach elicits a more coherent response. However, it will sometimes produce an answer outside the given context.

The history of Large Language Models – BERT model for Question Answering

In the task of Question Answering, there are two main approaches for answer generation:

  • Models fine-tuned for QA
  • Prompt-based generative models

Fine-tuning models for the extractive answering task was the dominant approach to Question Answering for a long time. It involves selecting the most relevant portions of the text to answer a given question. The leading models for this task come from the BERT family.

We will be using the bert-large-uncased-whole-word-masking-finetuned-squad model for our purposes. To test its capabilities, let's pose an example question using a context that consists of a copy-pasted page from the upickle serialization documentation.
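A minimal sketch of running this model with the Hugging Face transformers question-answering pipeline is shown below; the short context string is an illustrative stand-in for the full documentation page we pasted in.

```python
# Extractive QA with the Hugging Face `transformers` pipeline; the
# context below is an illustrative stand-in for the full docs page.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

context = (
    "uPickle can serialize Scala case classes to JSON. Given an implicit "
    "ReadWriter for the class, calling write(value) returns a JSON string."
)

result = qa(
    question="How to convert a case class Person(name: String, surname: String) to Json?",
    context=context,
)
print(result["answer"], result["score"])  # extracted span and its confidence
```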

[Screenshot: BERT's answer and confidence score for the example question]

Let’s have a closer look at the answer provided by BERT for our example question. While it relates well to the question and correctly recognizes its meaning, it has made an error. It suggests using a petOwner object instead of a Person object when invoking the write() method. In all fairness, the example we provided could have been better for BERT since the exact answer was not explicitly stated in the given passage and BERT can only work with extraction-based methods.

To explore BERT’s capabilities further, let’s try a more straightforward example question.

[Screenshot: BERT's answer to the simpler question]

BERT’s answer is indeed correct. However, the confidence score for this answer is slightly lower at 0.190 compared to the 0.197 we obtained previously. It’s worth noting that in our specific use case, we can’t solely rely on questions that have an exact answer within the provided documentation.

These examples illustrate the notable differences in answer quality between GPT-4 and the QA BERT model. They also highlight the significant value that generative models bring to the table. Now, let's examine generative models.

Large Language Model Meta AI (LLaMA) Models

We have a few options apart from the OpenAI models. Among these is the LLaMA model from Meta (Facebook), on top of which the Alpaca project and GPT4All were fine-tuned for chat-like behavior.

For users, the main issue with the LLaMA model is its licensing, which limits availability to research purposes and requires access to be granted on a case-by-case basis.

Text-to-Text Transfer Transformer or T5 models

T5 is a family of generative pre-trained language models developed by Google Research. T5 models are trained on a massive amount of text data and implement a full encoder-decoder architecture. They are pre-trained on multiple tasks, making them suitable for QA, summarization, translation, and many more applications.

These models come in different sizes, ranging from 60 million parameters in t5-small to 11 billion in t5-11b. Fine-tuning T5 on instructions produced the flan-t5 family of models, and extending that with Alpaca data produced the flan-alpaca models.

Below you can see the prompt window for the flan-alpaca-large model. Notably, the earlier structure, where the question and context were separated, is gone. The format is now a single prompt window into which you are free to enter any prompt for the model to process.

[Screenshot: the flan-alpaca-large prompt window]

When we enter the same prompt that we used earlier for GPT-4, we receive an honest answer, namely, “I don't know”. We expect this behavior, since flan-alpaca, unlike GPT-4, did not acquire knowledge of Scala during pre-training.
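The same experiment can also be reproduced locally. Here is a minimal sketch with the Hugging Face transformers pipeline; the prompt wording condenses the structure we used for GPT-4, and T5-based models use the text2text-generation task.

```python
# Running flan-alpaca-large locally; T5-based models are encoder-decoder,
# so they use the `text2text-generation` task in `transformers`.
from transformers import pipeline

generator = pipeline("text2text-generation", model="declare-lab/flan-alpaca-large")

prompt = (
    "You are a Q&A system for Scala3 documentation. Answer truthfully, and "
    "if you are unsure of the answer, respond with 'I don't know'. "
    "Question: How can we convert a case class "
    "Person(name: String, surname: String) to Json?"
)
print(generator(prompt, max_length=128)[0]["generated_text"])
```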

Large Language Models are still evolving

As of May 2023, the open-source modeling landscape is evolving swiftly, with the emergence of fresh models occurring daily. For instance, two new models, Databricks’ Dolly model and StabilityAI’s StableLM-Tuned-Alpha model, were introduced while we were writing this article. That is why it is essential to remain up-to-date with the newest developments in the field.

Steps to maximize the value from LLMs

As natural language processing (NLP) solutions such as GPT-4 and open-source models continue to revolutionize the way we interact with technology, the question of how to maximize the value of these models has become increasingly important.

We’ve compared the answer quality of these models and explored the differences in their approaches to generating answers. 

The real challenge lies in leveraging hundreds or thousands of documents and automatically detecting the relevant context. We suggest downloading the documentation and using the raw documents as a source to create a solution that performs three essential functions:

  1. Finding the most relevant documents as a context.
  2. Building a prompt based on the context and specific instructions.
  3. Generating the model’s answer to the question.
     

With these steps in place, we only need to supply the question, without specifying which part of the documentation answers it.

How to find relevant documents with advanced methods

When it comes to finding relevant documents for a specific question, relying solely on word overlap is ineffective, especially when synonyms are involved. Fortunately, more advanced methods are now available.

One such method involves using vector representations, or embeddings, to search for documents similar to the query. By converting each document into a vector and comparing it to the vector representation of the query, we can identify the most similar documents in the vector space.

The Sentence Transformers library, with models available on the Hugging Face hub, makes this process easier than ever. For example, we can use this approach to compare the benchmark sentence “How to convert case class Person(name: String, surname: String) into Json?” to various Scala documentation extracts, including ones on serialization, macros, and varargs.

Let's have a look at the msmarco-distilbert-base-v4 API:

[Screenshot: similarity scores from msmarco-distilbert-base-v4]

As expected, the model calculates similarity scores to identify the most relevant documentation fragments and rank them in order of importance.
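This ranking step can be reproduced with a few lines of code. Below is a minimal sketch using the sentence-transformers library; the document snippets are illustrative stand-ins for the real documentation extracts.

```python
# Rank documentation fragments by cosine similarity to the query using
# the `sentence-transformers` library.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("msmarco-distilbert-base-v4")

query = ("How to convert case class Person(name: String, surname: String) "
         "into Json?")
docs = [  # illustrative stand-ins for the serialization, macros and varargs pages
    "uPickle is a library for serializing Scala objects to and from JSON.",
    "Macros are metaprograms: programs that operate on other programs.",
    "Varargs allow a method to accept a variable number of arguments.",
]

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

scores = util.cos_sim(query_emb, doc_embs)[0]
ranked = sorted(zip(docs, scores), key=lambda p: float(p[1]), reverse=True)
for doc, score in ranked:
    print(f"{float(score):.3f}  {doc[:60]}")
```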

Once we run the model on all the documents, it produces well-ranked text pieces. These can then be used as a context for generative models, enabling us to refine our search results and generate more accurate and relevant content.

Building pipelines with LangChain

We can automate the entire process of indexing, prompt creation, and model interaction using LangChain, a Python library specifically designed for working with language models. This library comes with various features, including a rich API for interacting with models and automated text extraction for indexing.

To create a context for our retrieval process, we adapted the default prompt from LangChain by mentioning the Scala language at the beginning. This helps to focus the model's attention on relevant content and improves the accuracy of our results.

[Screenshot: the adapted LangChain prompt]

The modified prompt serves as input for the remaining models, enabling us to generate more targeted and effective search results.
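Putting the three steps together, a minimal LangChain pipeline might look like the sketch below (APIs as of spring 2023; the path, chunk size, and exact prompt wording are illustrative assumptions, not our exact setup).

```python
# A sketch of the end-to-end pipeline: index the docs, retrieve context,
# build the adapted prompt, and query the model (LangChain APIs circa 2023).
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# 1. Load and split the downloaded Scala documentation (path is illustrative).
docs = DirectoryLoader("scala-docs/", glob="**/*.md").load()
fragments = RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(docs)

# 2. Index the fragments with sentence embeddings for similarity search.
index = FAISS.from_documents(
    fragments,
    HuggingFaceEmbeddings(model_name="sentence-transformers/msmarco-distilbert-base-v4"),
)

# 3. Adapt the default QA prompt by mentioning Scala at the beginning.
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Use the following pieces of Scala 3 documentation to answer the "
        "question at the end. If you don't know the answer, just say that "
        "you don't know.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"
    ),
)

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=index.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)
print(qa.run("How to convert case class Person(name: String, surname: String) into Json?"))
```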

The results

If we provide the question from the beginning of the article to multiple models, we obtain these answers:

```

How to convert case class Person(name: String, surname: String) into Json?
--------------------------------------
ChatGPT
"You can use the ujson library's write method to convert a case class into JSON format. Here's an example:\n\n```scala\nimport ujson._\ncase class Person(name: String, surname: String)\nval person = Person(\"John\", \"Doe\")\nval json = write(person)\nprintln(json) // Prints out {\"name\":\"John\",\"surname\":\"Doe\"}\n```\n\nIn this example, we import the ujson library and define a case class called Person. We then create an instance of the Person class and use the write method to convert it into a JSON string. Finally, we print out the JSON string."
--------------------------------------
Openai da vinci
" You can use the ujson library to convert a case class into a JSON string. For example, you can use the ujson.write method to convert a Person instance into a JSON string:\n\nval person = Person(\"John\", \"Doe\")\nval jsonString = ujson.write(person)\nprintln(jsonString) // Prints out {\"name\":\"John\",\"surname\":\"Doe\"}"
--------------------------------------
flan t5 xl
"ujson.read(name) print(\"\"\"\"name\": \"Peter\", \"age\": 13\"\"\"\"\" )"
--------------------------------------
flan t5 xxl
"scala case class Person(name: String, surname: String) json"
--------------------------------------
gpt4all
You can define a Writer like this and use it in the help method. Then you need an implicit for Writes[Person]."
--------------------------------------
alpaca-t5-xl
"\"name\": \"Peter\", \"age\": 13\"\" "
--------------------------------------
declare-lab/flan-sharegpt-xl
""
--------------------------------------
```

The responses generated by ChatGPT and OpenAI Davinci are remarkably similar, both suggesting a specific library and providing a code example in the correct Scala syntax. This isn’t surprising, given that these models belong to the same family.

However, the other models' responses are less satisfactory. While the flan-t5-xl model correctly identifies the ujson library, it doesn't use it correctly. It's also puzzling that the larger flan-t5-xxl model fails to provide a reasonable answer. The remaining models' responses are incomplete or unsatisfactory as well.

These results highlight the importance of using the right language model for the task. 

How to fine-tune language models for Scala QA leveraging ChatGPT

While no out-of-the-box language model can match the performance of GPT-4 on Scala question-answering tasks, we can still fine-tune existing models for better results. The challenge, though, is obtaining a suitable training dataset.

One option is to manually create hundreds of Scala questions and answers based on documentation, but this approach is time-consuming and requires significant expertise in Scala documentation. 

Instead, we can leverage the power of ChatGPT to generate a custom QA dataset specifically tailored to our needs. By prompting the model to generate questions for each page of documentation, we can ensure that our final model is fine-tuned on a broad range of topics.

Below you can find a possible way to craft a prompt for ChatGPT:

```
You are going to create Questions and Answers based on fragments of Scala 3 documentation given in a context.
Requirements:
1. The documentation is in a markdown file, but in output, no markdown artefact should be presented.
2. The answer should come from the documentation fragment provided. Questions and answers should work standalone without previous information.
3. Be short and precise.
4. The questions should be strictly technical and should answer common developers' questions. Questions like "What is the title of the document the information is taken from?" are not allowed.
5. Replace demonstrative pronouns with <DOC NAME> in phrases like "this document".
6. Questions and answers should make sense without context, no phrases like "given fragment", "given document".
If the fragment is from a larger document, you will be given a summarization of the whole document, but use it only for the context. Do not create questions based on summarization.
If you cannot extract at least 5 relevant and technical questions with answers from the context fragment, return <NO QUESTIONS GENERATED>
Example:
Q: What is @main annotation?
A: Method with `@main` annotation is an entry point to a program.
Q: How to write a test using a temporary file?
A: Use `FunFixture` from munit. eg val usingTempFile: FunFixture[os.Path] = FunFixture(setup = _ => os.temp(prefix = "file-tests"), teardown = tempFile => os.remove(tempFile)); usingTempFile.test(<name>){{ <your test> }}
Q: How to change the jvm version in Scala CLI?
A: You can use the --jvm flag `scala-cli --jvm 8 Main.scala`
Document name: [DOC NAME]
{f'Summarization: {summarization}' if summarization else ''}
Context fragment:
{fragment.page_content}
Remember, if you cannot extract at least 5 relevant and technical answers from the context fragment, return fewer than 5 questions or the text <NO QUESTIONS GENERATED>.
Answers should come directly from the context fragment.
Create 5 questions with answers. Let's take it step by step.
```

 

The prompt may appear lengthy, but it includes the essential elements we used earlier with ChatGPT and GPT-4. These include a role and context, in which the model generates questions and answers based on a section of Scala3 documentation provided as context. It also includes precise instructions: the model produces at most five questions per document (or <NO QUESTIONS GENERATED>), with concise, technically focused answers sourced from the context.
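In practice, such a prompt can be driven over every documentation fragment programmatically. The sketch below assumes the pre-1.0 openai package; PROMPT_TEMPLATE abbreviates the full prompt shown above, and `fragments` stands for the chunks produced by the text splitter in our indexing step.

```python
# Generating the QA dataset by looping the prompt over documentation
# fragments via the ChatGPT API (pre-1.0 `openai` package assumed).
import openai

PROMPT_TEMPLATE = """You are going to create Questions and Answers based on
fragments of Scala 3 documentation given in a context.
[... full requirements and examples as shown above ...]
Document name: {doc_name}
Context fragment:
{fragment}
Create 5 questions with answers. Let's take it step by step."""

def generate_qa(fragment_text: str, doc_name: str) -> str:
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(doc_name=doc_name,
                                              fragment=fragment_text),
        }],
    )
    return completion.choices[0].message.content

qa_pairs = []
for fragment in fragments:  # chunks produced by the text splitter earlier
    text = generate_qa(fragment.page_content, fragment.metadata["source"])
    if "<NO QUESTIONS GENERATED>" not in text:
        qa_pairs.append(text)
```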

A sample of smartly generated QAs

The prompt features several examples to guide the model in following a specific pattern, enabling us to generate a total of 2368 question-and-answer pairs that we later used to fine-tune the best-performing models.

Q: Why is it not recommended to use null values or empty strings for optional fields?
A: It is not recommended to use null values or empty strings for optional fields because they are hacks to work around the main problem of declaring the field as optional upfront. This can lead to errors and unexpected behaviours in the code.

Q: What are some common ways to work with option values in Scala?
A: Common ways to work with option values are using `match` and `for` expressions. This allows for handling both `Some` and `None` values in a similar fashion.

Q: How can a client use the `onFailure` callback to handle a failed future computation?
A: A client can use the `onFailure` callback to handle a failed future computation by providing a partial function that will be applied to the value of a failed computation.
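With the generated pairs in hand, a seq2seq model such as flan-alpaca-xl can be fine-tuned using standard Hugging Face tooling. The sketch below is illustrative: the data file, field names, and hyperparameters are assumptions rather than our exact setup.

```python
# Illustrative fine-tuning of a seq2seq model on the generated QA pairs.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)
from datasets import load_dataset

model_name = "declare-lab/flan-alpaca-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical file: one {"question": ..., "answer": ...} object per line.
dataset = load_dataset("json", data_files="scala_qa.jsonl")["train"]

def tokenize(batch):
    inputs = tokenizer(batch["question"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["answer"], truncation=True, max_length=256)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="flan-alpaca-xl-scala-qa",
                                  per_device_train_batch_size=4,
                                  num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```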

The top-performing model after fine-tuning on the new dataset is flan-alpaca-xl, which produced the most accurate and relevant answers for our Scala QA task. Here are some of its answers:

Q: How to convert case class Person(name: String, surname: String) into Json?
A: The method ujson.read can parse a JSON string and make all of its fields available.

Q: How to pass implicit parameters?
A: To pass implicit parameters, you can use the regular application f(arg) to pass arguments to an implicit parameter.

Q: How to import dependency in Scala CLI project?
A: You can import dependencies on the command-line with --dependency: bash scala-cli compile Hello.scala --dependency org.scala-lang.modules::scala-parallel-collections:1.0.4 Note that --dependency is only meant as a convenience. You should favor adding dependencies in the source files themselves via [using directives](/docs/guides/configuration.md#special-imports). You can also add simple JAR files — those that don’t have transitive dependencies — as dependencies, with --jar: bash scala-cli compile Hello.scala --jar /path/to/library.

Q: How to test throwing an exception?
A: To test throwing an exception, you can use the intercept assertion to specify the expected exception and the code that should throw it. The intercept method returns the exception that is thrown and the intercept method can be used to check for other assertions on the intercept method.

Q: What is tail-recursion function and how to annotate it?
A: Tail-recursion function: a function that is used to return a value from a loop. It can be annotated as a loop function, and can be used to perform recursive operations on a list of values.

Explore the limits and potential of open-source Language Models for technical Q&A tasks

Although the fine-tuned open-source models generated better-quality answers, they still have limitations, such as a lack of technical domain knowledge and the potential for providing inaccurate information. However, these limitations can be mitigated by providing context and specific prompts and by fine-tuning the models on a domain-specific dataset. Our experiment also shows that ChatGPT can be a useful tool for creating custom QA datasets for research purposes to improve model performance.

Here are the key takeaways from our experiment:

  • Open-source language models can generate grammatically correct answers but lack the technical domain knowledge that GPT-4 acquired during pre-training.
  • Providing context and specific prompts can help overcome the limitations of open-source models.
  • Fine-tuning models on a domain-specific dataset can improve the style and form of answers, but it does not always improve their truthfulness.
  • ChatGPT can be used to create custom QA datasets for research purposes, which can be used to improve the performance of other models.

Overall, our results show that, with the help of ChatGPT, we were able to create a tailored dataset for Scala 3 documentation that improved the performance of the flan-alpaca-xl model. While open-source models are still in the early stages of development compared to commercially available ones, we are making progress toward improving their accuracy and domain knowledge.


However, if we want to optimize for accuracy, GPT-4 delivers the best results. When we supply GPT-4 with documents containing context from current documentation, it can accurately answer questions about libraries released after 2021. It is important to note that this approach is suitable for handling non-sensitive information only.

Conclusion

Large Language Models can potentially revolutionize how programmers seek assistance with their queries and concerns related to Scala3. With the advent of models like ChatGPT and open-source alternatives such as flan-alpaca, BERT, and gpt4all, programmers can access quick and accurate responses to their inquiries, saving them valuable time and energy. 

Our exploration of the dataset, built on the Scala toolkit documentation on GitHub, has demonstrated the ability of these models to provide both abstractive and extractive answers, making them useful for a variety of tasks.

As Large Language Models continue to evolve, businesses, organizations, and individuals can maximize their potential by leveraging advanced methods to find relevant documents and build pipelines with LangChain. By fine-tuning these models for Scala QA, we can explore the limits and potential of open-source Language Models for technical Q&A tasks.

Overall, large language models have the potential to be game-changers for the programming industry, and their utilization should be explored further to maximize their benefits.

Curated by

Sebastian Synowiec
