How to build an LLM chatbot for your company’s information
Zbigniew Skolicki
Principal AI Engineer & Advisor
Published: May 20, 2024 | 9 min read
As organizations grow, the volume of internal information swells rapidly. Employees can find themselves navigating an overwhelming sea of data, and because helpful information becomes harder to locate, clear communication often fails to reach every employee.
Nowadays, employees expect information to be easily accessible and adaptable. Traditional repositories, like company wikis, often struggle to meet these evolving needs. The gap widens when teams must manage a mix of documents across different office suites, like Google's and Microsoft's, while also keeping up with content on external websites.
LLM chatbots emerge as a possible remedy. Building one may have crossed your mind, along with concerns about the complexity of creating and fine-tuning it.
We went down this rabbit hole and discovered that it is actually relatively easy when leveraging the right technology with the right approach.
Join us as we venture into the narrative of our development process, the strategic considerations that guided our journey, and the results of our efforts.
While there are various ways to create an LLM chatbot, we will show you how we’ve used technology that is approachable to most organizations.
The result is that building your LLM chatbot to achieve tangible results can be straightforward, time-efficient, and cost-effective.
LLM chatbots are advanced AI implementations designed to understand and generate human language in a way that mimics how humans communicate.
They're trained on vast amounts of text data, enabling them to perform various language-related tasks, such as answering questions, providing recommendations, and simulating conversations.
LLM chatbots engage in more nuanced and complex conversations than earlier chatbot technologies. They're capable of understanding context, managing ambiguity, and even exhibiting a degree of creativity in their responses.
In short, an LLM chatbot does more than just aggregate information from diverse sources. It acts as a conversational interface that makes finding information both more natural and faster.
Employees might feel out of the loop on company information and news. This gap in communication does more than just interrupt information flow—it can weaken employee trust and harm your organization's culture.
Information overload is visible in companies of every size. Enterprise organizations especially struggle to keep their employees informed about day-to-day business.
Let's take a look at how a sizable organization like VirtusLab manages its information landscape. VirtusLab organizes its data across various platforms.
Slack for immediate communication
Google Workspace for collaborative documents and emails
Their own website for public-facing information
A company-specific wiki on Coda for internal knowledge sharing
A similar structure is common in many organizations and enterprises.
However, clear communication and effective information sharing become challenging with information so dispersed. Over time, as the wiki expanded, finding specific information started to feel like searching for a needle in a haystack.
We recognized the hurdles in accessing and sharing information across the broad landscape of tools and platforms. That is why we examined the potential of Large Language Models (LLMs) to streamline this process.
Given that a significant portion of our interactions took place on Slack, it became apparent that developing an LLM chatbot on Slack was the strategic move to take. This bot would address our employees' queries and proactively suggest information from the diverse sources maintained by our company.
This change would enhance accessibility and efficiency in our internal communication channels.
Scope of data involved
We wanted to create a controlled testing environment for our LLM chatbot. Since our experience showed that an iterative process is the best way to go, we purposely chose to index only documents available to all company employees. Therefore, we focused on our wiki pages on Coda as a starting point.
This way, we could monitor the accuracy and reliability of the bot following its initial deployment. Careful oversight would also ensure that sensitive information, meant for a select audience, remained secure.
A wide range of technologies can be used to build LLM chatbots. You might want to consider commercially available solutions or build a more customizable one from scratch.
Working with a limited time frame, budget, and team, we followed the Retrieval-Augmented Generation (RAG) paradigm. We focused on standard approaches, in particular general-purpose Large Language Models (LLMs) without fine-tuning. This way, we could achieve the benefits we were looking for.
An expansion of the knowledge base for future iterations
Source citations
A reduction in hallucinations
Since we wanted to create the best possible LLM chatbot for our employees, we evaluated both custom-built solutions and commercial platforms to determine the best fit for our needs.
Custom stack for the LLM chatbot
As advocates of open-source software, we opted for the following stack, based on our previous experience with these technologies.
LangChain
Faiss
Stable Beluga 7B
Our goal was to leverage the chatbot's customizability and the potential to finely tune its behavior to meet specific requirements.
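To give a sense of what this custom stack looks like in practice, here is a minimal RAG sketch built with LangChain, Faiss, and Stable Beluga 7B. The embedding model, sample documents, and parameters are illustrative rather than our exact configuration.

```python
# A minimal sketch of the custom RAG stack we evaluated (LangChain + FAISS +
# Stable Beluga 7B). Model identifiers and sample documents are illustrative.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Embed pre-chunked wiki pages and store them in a FAISS index.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
wiki_chunks = ["Our CEO is ...", "Parking spots can be reserved via ..."]  # placeholder content
index = FAISS.from_texts(wiki_chunks, embeddings)

# 2. Wrap a locally hosted Stable Beluga 7B pipeline as the generator.
llm = HuggingFacePipeline.from_model_id(
    model_id="stabilityai/StableBeluga-7B",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 256},
)

# 3. Retrieval-augmented QA: retrieve relevant chunks, then generate an answer.
qa = RetrievalQA.from_chain_type(llm=llm, retriever=index.as_retriever(search_kwargs={"k": 4}))
print(qa.invoke({"query": "Who's the CEO?"}))
```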
Commercial stack for the LLM chatbot
We focused on Google’s Vertex solutions, specifically targeting two of them.
Google Enterprise Search
Google DialogFlow
Our priority remained to maintain configurability and ensure we had the flexibility to integrate various sources.
When choosing the best technology for our purposes, we considered the following key aspects.
Quality of responses
Customizability and flexibility
Data handling
Scalability
Expenses
Simple implementation process
Shortest time to deliver a working solution
To identify the best technology for our needs, we gathered repetitive questions answered by key departments, including marketing, recruitment, people operations, onboarding, office management, security, finances, and travel.
Starting with queries like "Who's the CEO?" or "How can I reserve a parking spot?" we compiled a list of 100 questions. This collection helped us assess the accuracy and feasibility of the technology under consideration.
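To make the comparison repeatable, a question set like this can be replayed against each candidate stack. Below is a small sketch of that idea; the file names and the answer_fn callable are placeholders for whichever technology is under evaluation, not a tool we ship.

```python
# Replay a list of test questions through a candidate stack and save the
# answers for manual review. answer_fn is any callable taking a question
# string and returning an answer string (e.g. the RAG or DialogFlow helpers).
import csv

def evaluate(answer_fn, questions_path="test_questions.txt", out_path="answers.csv"):
    with open(questions_path) as f:
        questions = [line.strip() for line in f if line.strip()]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "answer"])
        for question in questions:
            writer.writerow([question, answer_fn(question)])
```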
Final decision on the LLM chatbot technology
After running the test questions, we rejected the custom stack: it produced poor-quality answers and sometimes altered the names of employees, which could lead to confusion and reduced trust in the chatbot’s responses.
When turning to commercial technologies, Google Enterprise Search stood out initially due to its powerful search capabilities, delivering top-notch answers right away. Despite its strengths, we ultimately opted for Google DialogFlow. Its responses were comparably effective to those from Google Enterprise Search but offered the added advantage of greater flexibility in configuration.
Moreover, DialogFlow provides a unique feature set, including the capability to ingest three distinct types of data: a collection of documents, website content, and a custom FAQ document, making it a more versatile choice for our needs.
We accepted the trade-off between customizability and quality of output. Google DialogFlow gave us flexibility, data handling capabilities, and the quality of its language model for the immediate needs of our project.
Still, this choice leaves room for future adjustments, ensuring that as our project's needs evolve, our chatbot solution can grow as well.
The configuration of Google DialogFlow
We embraced a philosophy of simplicity and clarity. To achieve this, we employed a minimalistic setup in DialogFlow, utilizing a single "page" and not relying on the context of past conversations.
This approach guarantees that each query is treated independently, ensuring clear and unbiased responses, and enhancing the user experience by providing straightforward answers to questions.
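To illustrate what treating each query independently means on the client side, here is a minimal sketch of a single, context-free query sent to a DialogFlow CX agent via the Python client. The project, location, and agent IDs are placeholders, and this helper simplifies our actual integration.

```python
# A minimal sketch: one context-free question per DialogFlow CX session.
import uuid
from google.cloud import dialogflowcx_v3

PROJECT, LOCATION, AGENT = "my-gcp-project", "global", "my-agent-id"  # placeholders

def ask(question: str) -> str:
    client = dialogflowcx_v3.SessionsClient()
    # A fresh session per question keeps queries independent of past conversations.
    session = client.session_path(PROJECT, LOCATION, AGENT, uuid.uuid4().hex)
    request = dialogflowcx_v3.DetectIntentRequest(
        session=session,
        query_input=dialogflowcx_v3.QueryInput(
            text=dialogflowcx_v3.TextInput(text=question),
            language_code="en",
        ),
    )
    response = client.detect_intent(request=request)
    # Concatenate the agent's text responses into a single answer.
    return "\n".join(
        " ".join(message.text.text)
        for message in response.query_result.response_messages
        if message.text.text
    )

print(ask("How can I reserve a parking spot?"))
```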
In the development process, we also paid attention to grounding: the chatbot's ability to stay aligned with the indexed data in its responses. After experimenting with various levels, we opted for low grounding.
This decision was crucial in preventing the LLM chatbot from producing hallucinatory or irrelevant responses. By choosing low grounding, we increased the chances that our chatbot remained versatile and capable of addressing a broad spectrum of inquiries without defaulting to refusal.
Selecting the right language model was another pivotal decision. We chose Gemini 1.0 Pro, a model that provides advanced linguistic capabilities but is still cost-effective.
This choice was driven by the anticipation of a relatively low volume of questions. Gemini 1.0 Pro enables our chatbot to understand and respond to queries with sophistication, making it an integral component of our solution's success.
Before releasing our LLM chatbot, we needed to ensure everything ran smoothly.
The initial integration with Slack was straightforward; in fact, we had a prototype ready in less than a day. We established production and development environments to fine-tune our LLM chatbot without causing disruptions and to experiment with changes efficiently.
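As a rough sketch of the Slack side, the prototype can be approximated with the slack_bolt library running in Socket Mode: a mention handler forwards the question to DialogFlow (the ask() helper from the earlier sketch) and replies in a thread. Token names and helpers are illustrative, not our exact setup.

```python
# A minimal Slack bot in Socket Mode that forwards mentions to DialogFlow.
# ask() is the DialogFlow helper sketched above; tokens come from the environment.
import os
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.event("app_mention")
def handle_mention(event, say):
    # Strip the leading bot mention and forward the remaining text to DialogFlow.
    question = event["text"].split(">", 1)[-1].strip()
    say(text=ask(question), thread_ts=event.get("ts"))

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```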
However, some agent settings were shared across all environments, complicating the testing process. We faced a choice of either risking disruption of the environment or cloning the entire setup.
We prioritized a minimalistic setup that enabled us to go through with a low risk of disruption and without cloning the setup.
Expense regulation of LLM chatbots
A significant concern was managing expenses, especially with the potential for extensive testing by our engineers. We aimed to restrict traffic to mitigate potential issues, such as integration failures triggering an excessive volume of inquiries, or the service unexpectedly gaining popularity.
We decided to implement a “circuit breaker”. It would automatically disconnect the Slack integration once a given expense limit was reached.
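One way to implement such a circuit breaker on GCP is a Cloud Function subscribed to a Billing budget Pub/Sub topic. The sketch below assumes that pattern; disable_slack_integration() is a hypothetical helper, not part of our actual implementation.

```python
# Hedged sketch of a budget-driven "circuit breaker": a Cloud Function that
# reacts to GCP Billing budget notifications and pauses the Slack integration
# once the configured expense limit is reached.
import base64
import json
import functions_framework

@functions_framework.cloud_event
def budget_circuit_breaker(cloud_event):
    # Budget notifications arrive as base64-encoded JSON in the Pub/Sub message.
    payload = json.loads(base64.b64decode(cloud_event.data["message"]["data"]))
    cost = payload.get("costAmount", 0.0)
    limit = payload.get("budgetAmount", 0.0)
    if limit and cost >= limit:
        disable_slack_integration()  # hypothetical helper, see below

def disable_slack_integration():
    ...  # e.g. set a "paused" flag the Slack handler checks before calling DialogFlow
```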
Data quality considerations
For the bot to be effective, it was essential to keep the data current and ensure the correct documents were indexed. We've created a detailed list of Google documents for indexing, presented in an editable spreadsheet.
We ensured that Google documents were appropriately shared with our service account and indexed all Coda pages and documents on the list. This process provided explicit control to avoid accidental sharing of sensitive information.
To streamline the parsing of Coda pages and documents, we’ve cleaned them of unnecessary HTML tags and attributes with some custom reformatting. We've included the titles of parent pages in the names of each document to retain some semblance of the original document hierarchy. While we've tried consolidating certain pages, these experiments yielded little improvement.
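The cleanup step can be approximated as follows; this is only a rough sketch with illustrative function and field names, not our exact reformatting code.

```python
# Strip exported Coda HTML down to plain text and keep a trace of the page
# hierarchy in the document name.
from bs4 import BeautifulSoup

def clean_page(html: str, title: str, parent_title: str) -> tuple[str, str]:
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripts, styles, and other non-content tags before extracting text.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    # "Parent page - Page title" keeps some semblance of the original hierarchy.
    document_name = f"{parent_title} - {title}" if parent_title else title
    return document_name, text
```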
In the future, we might explore capabilities for reading documents and presentations in their rich formats.
Response quality
We aimed for responses that were easy to understand, but we needed to set the right expectations for the LLM chatbot’s users.
We've highlighted this point in our release notes and labeled the bot as "beta" to manage expectations appropriately. We’ve also provided links to the source documents the answers were based on.
Starting out, we aimed to link directly to the original documents on Coda or Google Workspace. However, our LLM chatbot uses copies of source documents stored in a Google Cloud Platform (GCP) bucket. The reason is that DialogFlow requires a dedicated document bucket on GCP, and there isn't a straightforward method to access pages outside of GCP directly.
After our initial attempts to have the LLM itself perform these URL translations reliably failed, we implemented a straightforward webhook. This webhook converted the GCP bucket links back to their original forms, using a well-thought-out naming convention.
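Conceptually, the core of that webhook is a small translation function like the sketch below. The bucket name, naming convention, and URL patterns are illustrative; our actual convention differs in detail.

```python
# Map a GCS object URI back to a user-facing URL. The object name is assumed
# to encode the source system and original identifier ("coda__<id>.html",
# "gdoc__<id>.html"); this convention is illustrative.
GCS_PREFIX = "gs://company-chatbot-docs/"  # placeholder bucket name

def to_original_url(gcs_uri: str) -> str:
    name = gcs_uri.removeprefix(GCS_PREFIX).rsplit(".", 1)[0]
    source, _, doc_id = name.partition("__")
    if source == "coda":
        return f"https://coda.io/d/{doc_id}"
    if source == "gdoc":
        return f"https://docs.google.com/document/d/{doc_id}"
    return gcs_uri  # fall back to the bucket link if the convention doesn't match
```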
After we had created a PoC, we opted for a structured release sequence to evaluate the LLM chatbot’s effectiveness and reliability. Before any release, we rigorously tested the bot, confirming its ability to answer most queries correctly.
Not all answers were perfect; some provided valuable links rather than direct answers. Still, the LLM chatbot behaved consistently, especially when it couldn’t understand a query. This was enough to follow through with the release.
The release sequence was as follows.
Leadership release: We ensured no significant concerns existed that, for example, could release sensitive information.
Non-engineering staff release: As this group had provided most of the indexed information, incorporating its feedback helped us minimize the potential for providing misleading information or causing harm.
Company-wide release: We aimed to test whether the tool was genuinely helpful rather than to challenge its accuracy.
Following the bot's release, we noticed an expected spike in interest. Given that many of our internal users are software engineers, they pushed the limits of the bot’s capabilities with their queries.
Exploratory Queries: Users tried to influence the bot's future responses with specific prompting techniques and asked about its limitations, revealing curiosity about the bot's adaptability and self-awareness.
Data Currency and Accessibility Challenges: Users sought information more recent than what was contained in our internal documentation. Additionally, queries emerged for data available on the internet but not indexed by us, highlighting the need for the bot to access a broader and more up-to-date information base.
Structural and Analytical Inquiries: Users asked about organizational structure not detailed in the wiki, and requested complex analyses requiring the synthesis of information across multiple documents. These inquiries indicate a demand for comprehensive insights and the bot’s ability to navigate and integrate detailed organizational knowledge.
Language Flexibility: Users asked the bot to respond in languages beyond its designed capabilities, underscoring the importance of an LLM’s ability to respond in various languages for a global user base.
We identified legitimate gaps in our documentation based on the questions asked. Before proceeding with further updates, it’s clear we must enhance our data corpus—an outcome we anticipated.
Since the release, we've been closely monitoring usage, which has settled into a sporadic pattern. We observed and analyzed cloud logs and DialogFlow's conversation logs.
We’ve developed a specialized notebook to track queries, primarily to discern the most common types of questions and identify gaps in the data. As a side note, the release notes clearly communicated our intent to use data for these purposes.
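As an illustration of what the notebook does, the sketch below counts recurring terms in exported queries. The log file format and field names are assumptions for the example, not DialogFlow's exact export schema.

```python
# Surface frequently recurring query terms to spot gaps in the indexed data.
# The JSON-lines export format and "query" field are assumed here.
import json
from collections import Counter

with open("conversation_logs.jsonl") as f:
    queries = [json.loads(line)["query"] for line in f]

term_counts = Counter(
    word.strip("?.,!").lower()
    for query in queries
    for word in query.split()
    if len(word) > 4  # skip short, uninformative words
)
print(term_counts.most_common(20))
```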
DialogFlow offers a user feedback feature for upvoting or downvoting responses. However, given our use of Slack integration, we've provided a link for users to submit comments via a Google form, allowing us to collect and analyze user feedback effectively.
With this analysis and feedback, we intend to further optimize the LLM chatbot after our data source enhancement.
Possible approaches include modifying the agent's summarization prompt and prioritizing specific data sources.
We might also explore more sophisticated adjustments, such as developing a bespoke solution for greater control over embeddings, a vector database, custom prompts, and selecting among LLM models.
However, these advanced changes were beyond the scope of this initial proof of concept.
Building this internal LLM chatbot demonstrated the rapid development and deployment of an initial version. We learned that our data, while incomplete, remains useful. The project also revealed user expectations for interaction and the broad range of questions the bot should address.
Currently, we view it as an enhanced search tool, but as LLM technology advances, we anticipate it will integrate data from multiple sources more effectively. This could be achieved in part through longer prompts, model fine-tuning, or having an agent use the proper tools underneath.
Becoming familiar with these technologies and understanding their constraints and advantages proved beneficial.