Arthur C. Clarke famously quipped that any sufficiently advanced technology is indistinguishable from magic. AI has crossed that line with the introduction of Vision and Language (V&L) models and Large Language Models (LLMs). Projects like Promptbase essentially weave the right words in the correct sequence to conjure seemingly spontaneous outcomes. If “prompt engineering” doesn't meet the criteria of spell-casting, it's hard to say what does. Moreover, the quality of prompts matters: better "spells" lead to better results!
Nearly every company is keen on harnessing a share of this LLM magic. But it’s only magic if you can align the LLM to specific business needs, like summarizing information from your knowledge base.
Let's embark on an adventure, revealing the recipe for creating a potent potion—an LLM with domain-specific expertise. As a fun example, we'll develop an LLM proficient in Civilization 6, a concept that’s geeky enough to intrigue us, boasts a fantastic WikiFandom under a CC-BY-SA license, and isn't too complex so that even non-fans can follow our examples.
Step 1: Decipher the Documentation
The LLM may already possess some domain-specific knowledge, accessible with the right prompt. However, you probably have existing documents that store knowledge you want to utilize. Locate those documents and proceed to the next step.
Step 2: Segment Your Spells
To make your domain-specific knowledge accessible to the LLM, segment your documentation into smaller, digestible pieces. This segmentation improves comprehension and facilitates easier retrieval of relevant information. For us, this involves splitting the Fandom Wiki markdown files into sections. Different LLMs can process prompts of different lengths, so it makes sense to split your documents into pieces that are significantly shorter (say, 10% or less) than the maximum LLM input length.
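To make this concrete, here is a minimal chunking sketch in Python. It assumes the wiki pages are markdown files on disk; it splits them at headings and then caps each piece at a rough character budget (the file name and the `max_chars` value are illustrative assumptions, not recommendations):

```python
import re
from pathlib import Path


def split_markdown_into_sections(path: str, max_chars: int = 2000) -> list[str]:
    """Split a markdown file at headings, then cap each section's length."""
    text = Path(path).read_text(encoding="utf-8")
    # Split right before markdown headings (lines starting with '#').
    raw_sections = re.split(r"\n(?=#{1,6}\s)", text)
    sections = []
    for section in raw_sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            sections.append(section)
            continue
        # Overly long sections get split further, paragraph by paragraph,
        # so every chunk stays well below the LLM's maximum input length.
        chunk = ""
        for paragraph in section.split("\n\n"):
            if chunk and len(chunk) + len(paragraph) > max_chars:
                sections.append(chunk.strip())
                chunk = ""
            chunk += paragraph + "\n\n"
        if chunk.strip():
            sections.append(chunk.strip())
    return sections


# Example (hypothetical file name):
# chunks = split_markdown_into_sections("civ6_wiki/Districts.md")
```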
Step 3: Create Knowledge Elixirs and Brew Your Vector Database
Encode each text segment as an embedding vector, using, for instance, Sentence Transformers.
Store the resulting embeddings and the corresponding texts in a vector database. You could do it DIY-style with NumPy and scikit-learn's KNN, but seasoned practitioners often recommend dedicated vector databases.
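Here is a minimal sketch of that DIY route. It assumes the `chunks` list from the Step 2 sketch and an off-the-shelf Sentence Transformers model; the model name and neighbor count are reasonable defaults, not the only choices:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# Text sections produced in Step 2 (hypothetical path).
chunks = split_markdown_into_sections("civ6_wiki/Districts.md")

# Any Sentence Transformers model will do; this one is small and fast.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Normalized embeddings keep cosine distance well-behaved.
embeddings = encoder.encode(chunks, normalize_embeddings=True)

# A DIY "vector database": the raw texts plus a cosine KNN index over them.
index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(np.asarray(embeddings))
```

A dedicated vector database buys you persistence, filtering, and scale, but for a few thousand wiki sections an in-memory index like this is perfectly serviceable.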
Step 4: Craft Spellbinding Prompts
When a user asks the LLM something about Civilization 6, you can search the vector database for the elements whose embeddings most closely match the question's embedding, and weave the retrieved texts into the prompt you craft.
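Continuing the sketch above, retrieval plus prompt assembly might look like this; the prompt wording is an illustrative template, not a prescription:

```python
def retrieve(question: str, k: int = 5) -> list[str]:
    """Return the k stored sections whose embeddings best match the question."""
    question_embedding = encoder.encode([question], normalize_embeddings=True)
    _, neighbor_ids = index.kneighbors(question_embedding, n_neighbors=k)
    return [chunks[i] for i in neighbor_ids[0]]


def build_prompt(question: str, context_sections: list[str]) -> str:
    """Assemble a grounded QnA prompt from the retrieved wiki sections."""
    context = "\n\n".join(context_sections)
    return (
        "Answer the question about Civilization 6 using only the context below. "
        "If the context does not contain the answer, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```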
Step 5: Manage the Cauldron of Context
Let's get serious about spellbinding! You can add database elements to the prompt until you reach the maximum context length set for the prompt. Pay close attention to the size of your text sections from Step 2. There are usually significant trade-offs between the size of the embedded documents and how many you include in the prompt.
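One simple way to manage the cauldron is to pack retrieved sections greedily, best match first, until a token budget is spent. The sketch below uses a crude characters-to-tokens ratio as a stand-in; in practice you would count tokens with the tokenizer of whichever LLM you end up using:

```python
def pack_context(sections: list[str], max_context_tokens: int = 3000,
                 tokens_per_char: float = 0.3) -> list[str]:
    """Greedily keep sections (best match first) until the token budget is spent.

    The token estimate is a rough heuristic; swap in an exact tokenizer
    (e.g. tiktoken for OpenAI models) for production use.
    """
    packed, used = [], 0
    for section in sections:
        estimated_tokens = int(len(section) * tokens_per_char)
        if packed and used + estimated_tokens > max_context_tokens:
            break
        packed.append(section)
        used += estimated_tokens
    return packed
```

This is where the Step 2 trade-off bites: smaller sections let you pack more diverse evidence into the same budget, while larger ones preserve more local context per retrieved hit.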
Step 6: Choose Your Magic Ingredient
Regardless of the LLM chosen for your final solution, these steps apply. The LLM landscape is changing rapidly, so once your pipeline is ready, choose your success metric and run side-by-side comparisons of different models. For instance, we can compare Vicuna-13b and GPT-3.5-turbo.
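Keeping the model behind a thin, swappable interface makes those side-by-side runs painless. The sketch below shows one way to wire it up; the two backend functions are placeholders for whatever clients or endpoints you actually use for GPT-3.5-turbo and Vicuna-13b:

```python
from typing import Callable, Dict


def ask_gpt35(prompt: str) -> str:
    raise NotImplementedError("call your GPT-3.5-turbo client here")


def ask_vicuna(prompt: str) -> str:
    raise NotImplementedError("call your Vicuna-13b endpoint here")


# Prompt in, answer out: the rest of the pipeline never sees which model runs.
BACKENDS: Dict[str, Callable[[str], str]] = {
    "gpt-3.5-turbo": ask_gpt35,
    "vicuna-13b": ask_vicuna,
}


def run_side_by_side(question: str) -> Dict[str, str]:
    """Build one prompt from the retrieval sketches and query every backend."""
    prompt = build_prompt(question, pack_context(retrieve(question)))
    return {name: backend(prompt) for name, backend in BACKENDS.items()}
```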
Step 7: Test Your Potion
Testing if our "potion" works is the next step. Easier said than done, as there's no scientific consensus on evaluating LLMs. Some researchers develop new benchmarks like HELM or BIG-bench, while others advocate for human-in-the-loop assessments or assessing the output of domain-specific LLMs with a superior model. Each approach has pros and cons. For a problem involving domain-specific knowledge, you need to build an evaluation pipeline relevant to your business needs. Unfortunately, this usually involves starting from scratch.
Step 8: Unveil the Oracle and Conjure Answers and Evaluation
First, collect a set of questions to assess the domain-specific LLM's performance. This can be tedious, but in our Civilization example we leveraged Google Suggest: we used search queries like “Civilization 6 how to …” and took Google's suggestions as the questions for evaluating our solution. Then, with this set of domain-related questions in hand, run your QnA pipeline: form a prompt and generate an answer for each question.
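If you want to automate the harvesting, Google's suggest endpoint can be queried directly. Note that this endpoint is unofficial and undocumented, so treat the sketch below as a convenience that may break or get rate-limited rather than a stable API:

```python
import json
import urllib.parse
import urllib.request


def google_suggestions(prefix: str) -> list[str]:
    """Fetch autocomplete suggestions for a query prefix (unofficial endpoint)."""
    url = ("https://suggestqueries.google.com/complete/search?client=firefox&q="
           + urllib.parse.quote(prefix))
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        payload = json.loads(response.read().decode(charset, errors="replace"))
    # The response looks like ["<prefix>", ["suggestion 1", "suggestion 2", ...]].
    return payload[1]


questions = google_suggestions("Civilization 6 how to ")
# Each question then goes through the QnA pipeline from the earlier sketches:
# answers = {question: run_side_by_side(question) for question in questions}
```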
Step 9: Assess Quality Through the Seer's Lens
Once you have the answers and original queries, you must assess their alignment. Depending on your desired precision, you can compare your LLM's answers with a superior model or use a side-by-side comparison on Toloka. The second option has the advantage of direct human assessment, which, if done correctly, safeguards against implicit bias that a superior LLM might have (GPT-4, for example, tends to rate its responses higher than humans). This could be crucial for actual business implementation where such implicit bias could negatively impact your product. Since we're dealing with a toy example, we can follow the first path: comparing Vicuna-13b and GPT-3.5-turbo's answers with those of GPT-4.
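If you follow the superior-model path, the judging itself is just another prompt. Below is a sketch of a GPT-4-as-judge template that reuses the answer categories from the table in the next step; `ask_gpt4` is a stand-in for whatever callable wraps your GPT-4 access, and the template wording is an assumption rather than the exact prompt we used:

```python
from typing import Callable

JUDGE_TEMPLATE = """You are grading an AI assistant's answer about Civilization 6.

Question: {question}

Reference context from the wiki: {context}

Candidate answer: {answer}

Classify the candidate answer as exactly one of:
- answerable, correct answer
- answerable, wrong answer
- unanswerable, AI gave no answer
- unanswerable, AI gave some answer

Respond with the label only."""


def judge_answer(question: str, context: str, answer: str,
                 ask_gpt4: Callable[[str], str]) -> str:
    """Ask the superior model to assign one of the four evaluation labels."""
    verdict = ask_gpt4(JUDGE_TEMPLATE.format(
        question=question, context=context, answer=answer))
    return verdict.strip().lower()
```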
Step 10: Distill Quality Assessment
LLMs are often used in open setups, so ideally, you want an LLM that can distinguish questions it can answer from your vector database from those it cannot. Here is a side-by-side comparison of Vicuna-13b and GPT-3.5, as assessed by humans on Toloka (aka Tolokers) and by GPT-4.
| Method | Tolokers | Tolokers | GPT-4 |
| --- | --- | --- | --- |
| Model | vicuna-13b | GPT-3.5 | |
| Answerable, correct answer | 46.3% | 60.3% | 80.9% |
| Unanswerable, AI gave no answer | 20.9% | 11.8% | 17.7% |
| Answerable, wrong answer | 20.9% | 20.6% | 1.4% |
| Unanswerable, AI gave some answer | 11.9% | 7.3% | 0% |
Comparing the Tolokers' evaluation of Vicuna-13b (the first column) with GPT-4's evaluation shows how assessment by a superior model can diverge from human assessment. Several key takeaways emerge from this comparison. First, the discrepancies between GPT-4 and the Tolokers are noteworthy: they arise primarily when the domain-specific LLM appropriately refrains from responding, yet GPT-4 grades such non-responses as correct answers to answerable questions. This highlights the evaluation bias that can creep in when an LLM's judgments are never checked against human assessment.
Second, GPT-4 and the human assessors broadly agree on overall performance, measured as the sum of the first two rows (desirable behavior) versus the sum of the last two rows (errors). Therefore, comparing two domain-specific LLMs with a superior model can be an effective DIY approach to preliminary model assessment.
And there you have it! You have mastered spellbinding, and your domain-specific LLM pipeline is fully operational.
Ivan Yamshchikov is a professor of Semantic Data Processing and Cognitive Computing at the Center for AI and Robotics, Technical University of Applied Sciences Würzburg-Schweinfurt. He also leads the Data Advocates team at Toloka AI. His research interests include computational creativity, semantic data processing and generative models.