GitHub All-Stars #3: LangExtract

I’ve always claimed that there’s no better way to learn anything than to build something… and the second-best is to review someone else’s code 😁. A sobering thought struck us recently at VirtusLab—our “collection” of starred projects had grown to a gigantic size, bringing little real value either to us at VirtusLab or to the wider community. So we decided to change our approach: introduce a bit of regularity and become chroniclers of these open-source gems. That way, we’ll understand them better and discover the ones where we can genuinely help.

Every Wednesday, we’ll pick one trending repository from the previous week and give it attention, preparing a tutorial, article, or code review—learning from its creators along the way. We’ll focus on what piques our interest: it might be a tool, a library, or anything the community deems worth publishing. One simple rule applies—it has to be a new or little-known project, not the widely recognized ones that rack up tons of stars after a big update (because let’s be honest—who wants to hear about Angular’s architecture in 2025).

Today, we’re taking a fresh project from Google engineers to the bench: Google/langextract.

Why was LangExtract created, and what problem does it solve?

Two words are key here: reliable and transparent. While we cannot fight with models’ “hallucinations,” LangExtract ensures that the extracted data are not them but are precisely linked to a specific fragment of the source text - the so-called source grounding.

The project is the work of Google engineers, which immediately gives it a certain benefit of trust (sorry, that’s how it works). You can see the experience of a company that has been processing and indexing immense amounts of text for decades.

Patterns and techniques - the meat for an engineer

As from the start of this whole series, it’s time to show LangExtract on a realistic example from the Insurtech space - in particular, the underwriting (insurance risk assessment) process.

This is where the real fun begins. LangExtract is a trove of modern LLM-oriented programming patterns. We used it in our underwriting projects to automate the crucial yet labor-intensive process of analyzing medical documentation. Let’s see how it works in practice.

Underwriting 101 (quick refresher): Underwriting is the heart of every insurance company. It’s the process of thoroughly assessing the risk associated with a potential client. Analysts (underwriters) sift through piles of documents—from applications and medical history to financial reports—to assess the likelihood of a claim. Based on this, they decide whether the company can offer insurance and what the premium should be. This is a key but traditionally very manual and time-consuming stage - ideal for automation.

This is exactly where patterns from the LangExtract library come into play. Let’s see how we used them in one of our experiments to take on this challenge.

Pattern 1: Declarative programming via prompts

Instead of writing complicated regex-based rules to find conditions or drug dosages in the text, we simply declare what we want to achieve.

In underwriting, the goal is to extract key risk factors from hundreds of pages of documents. We defined such a task for LangExtract:

1import langextract as lx
2prompt_description = """\
3Extract all medical conditions, medications, and lifestyle risk factors from the patient's medical summary.
4
5- For medications, always include the dosage and frequency if mentioned.
6- For conditions, note if they are described as 'controlled', 'stable', or 'chronic'.
7- For lifestyle risks, extract the specific habit, e.g., 'smoking' or 'drinking'.
8- Accurately capture the exact text spans for each extracted entity.
9"""
10# Step 1: Define what we want to extract from the medical documentation

This simple, English-language description is the entire business logic of our extraction. The language model now knows it should look not only for keywords but also for their attributes, which is fundamental to risk assessment.

Pattern 2: Configuration via examples (few-shot learning)

Underwriting is a game of nuances. “Diabetes” is not the same as “type 2 diabetes under control.” To teach the model these subtleties, we provide a few precise examples, formalized using the ExampleData class.

1examples = [
2    lx.ExampleData(
3        source_text="Patient has a history of controlled hypertension, treated with Metformin 500mg twice daily.",
4        extractions=[
5            lx.Extraction(
6                class_name="medical_condition",
7                text_span="hypertension",
8                attributes={"status": "controlled"},
9            ),
10            lx.Extraction(
11                class_name="medication",
12                text_span="Metformin",
13                attributes={"dosage": "500mg", "frequency": "twice daily"},
14            ),
15        ],
16    ),
17    lx.ExampleData(
18        source_text="He is a non-smoker but admits to occasional social drinking. BP is stable at 120/80 mmHg.",
19        extractions=[
20            lx.Extraction(
21                class_name="lifestyle_risk",
22                text_span="non-smoker",
23                attributes={},
24            ),
25            lx.Extraction(
26                class_name="lifestyle_risk",
27                text_span="social drinking",
28                attributes={"frequency": "occasional"},
29            ),
30            lx.Extraction(
31                class_name="measurement",
32                text_span="BP is stable at 120/80 mmHg",
33                attributes={"type": "blood_pressure", "value": "120/80"},
34            )
35        ],
36    ),
37]
38# Step 2: Provide examples to teach the model patterns from the medical domain

These examples are, in practice, our unit tests for the LLM, which at the same time serve as its precise configuration. Brilliant in its simplicity.

Pattern 3: Reliability through abstraction and control

All the magic happens in a single function—lx.extract. It’s a beautiful example of the façade pattern that hides the enormous complexity of communicating with the model, parsing long documents, and aggregating results.

1# (let’s assume 'underwriting_document_text' contains the client’s full medical report)
2extraction_requests = [
3    lx.ExtractionRequest(
4        prompt_description=prompt_description,
5        examples=examples,
6        source_text=underwriting_document_text,
7    )
8]
9
10# Choose a model (could be Gemini, or even a local one via Ollama)
11model = lx.Llm(lx.Gemini(model_name="gemini-1.5-flash-latest"))
12
13# A single line to launch the entire analytical process!
14extraction_results = lx.extract(
15    requests=extraction_requests,
16    llm=model,
17    output_path="underwriting_extractions.jsonl",
18)
19# Step 3: Run extraction on the full document

Interestingly, under the hood lx.extract can intelligently split a multi-page report into smaller pieces and process them in parallel, and even apply a multi-pass strategy - first identifying all conditions, and in a second pass linking them to recorded medications. This improves the accuracy and completeness of the assessment.

Pattern 4: Feedback loop via visualization

How can an underwriter or developer quickly verify whether the model correctly interpreted a key fragment of the report? LangExtract provides a dedicated tool for this.

1lx.visualize_extraction_results(
2    extraction_path="underwriting_extractions.jsonl",
3    html_path="underwriting_visualization.html",
4    source_text=underwriting_document_text,
5)
6# Step 4: Create an interactive HTML report for verification

This generates an HTML file where you can browse the original text with highlighted risk factors, medications, and their attributes.

This fast feedback loop is an interesting building block for reliable systems in a regulated industry like insurance (and in any other, too).

What’s interesting to learn from it?

Thinking in LLM terms: Formulate problems in a way that’s understandable to language models, moving away from classic imperative coding.
The power of few-shot learning: Instead of gigantic training datasets, sometimes a few precise examples are enough to achieve astonishing results.
Reliability engineering in AI: Techniques like source grounding, controlled generation, and multi-pass extraction are a must if we want production-grade trust.
A complete tool: Great DX matters - simple APIs, multiple backends, built-in debugging, and visualization.

Summary

Google LangExtract is a great example of mature software engineering meeting the raw power of LLMs - it shows how it democratizes access to advanced information extraction, making it simpler, cheaper, and more reliable. This project suggests that the future of a large part of programming may lie not in writing complicated algorithms, but in the art of conducting a precise and effective “conversation” with a machine.

A well-deserved star on GitHub from me 😉