While API-based LLMs are great for rapid and easy development, they can be less secure and more costly in the long run for load-intensive applications. One solution is a Small Language Model (SLM), self-hosted and fine-tuned on the downstream task. This article presents a case study of Supervised Fine-Tuning (SFT) of an SLM on an invoice-processing task. It shows that while SLMs require a higher upfront investment, they are faster, cheaper, and more secure in the long term, especially for high-load applications.
Task
The task is to extract key fields from invoice and receipt images and digital invoice PDFs of diverse layouts and quality, and return them as a structured JSON object.
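The exact output schema is not reproduced here. Based on the three numeric keys evaluated later, a hypothetical target might look like the following; the field names "subtotal", "tax", and "total" are illustrative stand-ins, not the names used in the study:

```python
import json

# Hypothetical example of a model response: three numeric invoice fields.
# The actual field names from the study's prompt are not shown in this article.
response = '{"subtotal": 100.00, "tax": 8.25, "total": 108.25}'

record = json.loads(response)
assert set(record) == {"subtotal", "tax", "total"}
assert all(isinstance(v, (int, float)) for v in record.values())
print(record["total"])  # → 108.25
```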
Methodology
For key-value extraction from invoice images, a single-pass approach using a Vision-Language Model (VLM) was chosen: each image was sent to the model along with a prompt specifying the fields to extract and the expected output format.
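As a sketch, the single-pass request can be assembled as one chat turn containing the instruction text and the image inlined as a base64 data URL. The payload shape below follows the common OpenAI-style chat format; the actual prompt and client code used in the study are not shown:

```python
import base64

def build_payload(image_bytes: bytes, prompt: str, model: str) -> dict:
    """Assemble a single-pass VLM request: one user turn with the
    instruction text and the invoice image as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_payload(b"\x89PNG...", "Extract the key fields as JSON.", "gpt-4o-mini")
```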
User prompt:
The performance was evaluated against the following models:
- Qwen3-VL-2B-Instruct
- Qwen3-VL-2B-Instruct, SFT with LoRA (r = 4, alpha = 8)
- OpenAI GPT-4o-mini
- OpenAI GPT-5.2
The SLM (Qwen3-VL-2B-Instruct) was trained for 3 epochs on a train split of 1,622 samples with an effective batch size of 16. Training was performed in bfloat16 precision using LoRA (Low-Rank Adaptation) with rank 4 and alpha 8, resulting in 4,358,144 trainable parameters out of 2,131,890,176 total (~0.2044% of all model parameters).
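The reported trainable fraction can be checked directly from the parameter counts:

```python
# Trainable vs. total parameters reported for the LoRA run (r=4, alpha=8).
trainable = 4_358_144
total = 2_131_890_176

fraction_pct = trainable / total * 100
print(f"{fraction_pct:.4f}%")  # → 0.2044%
```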
Metrics
For evaluation purposes, the following metrics were used:
- Overall accuracy: the percentage of samples for which all 3 fields are within ±1% of their ground-truth values.
- Per-key accuracy: the accuracy percentage for each of the 3 keys individually. A value counts as correct if it is within ±1% of the ground-truth value.
- Mean absolute error per key:
sum(abs(p - g) for p, g in pairs) / len(pairs)
- Mean absolute percentage error: the error expressed as a percentage of the ground-truth value, computed over pairs with a nonzero ground truth.
sum(abs(p - g) / abs(g) for p, g in nonzero_pairs) / len(nonzero_pairs) * 100
Cost calculations
For the self-hosted Qwen3-VL-2B-Instruct cost calculations (inference hosting), average GPU prices across typical GCP and AWS regions were used. The following prices were used for the calculations:
| GPU | Base avg. GPU cost/hr | VM overhead/hr (+60% for vCPUs, RAM, storage; or $0.04/hr per vCPU + $0.004/hr per GB RAM) | Total/hr |
|---|---|---|---|
| H100 (80 GB) | $4.00 | $2.40 | $6.40 |
| A100 | $3.50 | $2.00 | $5.50 |
| L4 | $1.00 | $1.15 | $2.15 |
Table 1. GPU + VM costs estimation for self-hosted solutions in GCP/AWS cloud.
The Qwen3-VL-2B-Instruct was trained on an L4 GPU for ~3 hours; the total training compute cost was ~$8.
Data
The dataset used was a mix of the Fatura dataset (Creative Commons Attribution 4.0 International) and the "A labeled dataset of hand-captured images of restaurant receipts" dataset (Creative Commons Attribution 4.0 International). For fine-tuning, 1,662 training samples were used. The dataset consisted of a mix of receipt images, invoice scans, and digital invoice PDFs. Most documents were in English, but some were in other languages, such as French.

Image 1. Sample training dataset item, with ground truth and prediction results. Source: https://zenodo.org/records/13688441

Image 2. Sample training dataset item, with ground truth and prediction results. Source: https://zenodo.org/records/13688441
Out-of-distribution evaluation
The fine-tuned Qwen model was also evaluated on a completely separate dataset, unseen during fine-tuning: katanaml-org/invoices-donut-data-v1 from Hugging Face.

Image 3. Sample from the out of distribution dataset. Source: katanaml-org/invoices-donut-data-v1
Results
Accuracy

Figure 1. Evaluation results on the in-distribution dataset (the dataset that the model was finetuned on). Evaluated on 62 samples.

Figure 2. Evaluation results on the in-distribution dataset (the dataset that the model was finetuned on). Evaluated on 62 samples.

Figure 3. Evaluation results on the in-distribution dataset. Mean Absolute Error on a log scale (lower is better). Evaluated on 62 samples.

Figure 4. Evaluation results on the in-distribution dataset. Mean Absolute Percentage Error on a log scale (lower is better). Evaluated on 62 samples.

Figure 5. The LoRA fine-tuned Qwen3-VL-2B-Instruct was evaluated on a completely separate dataset to verify generalization to out-of-distribution samples.
| Batch size | Throughput (samples/s) | Wall time (s, 128 samples) | Peak VRAM (GB) |
|---|---|---|---|
| 1 | 0.54 | 237.0 | 4.44 |
| 2 | 0.97 | 132.1 | 4.60 |
| 4 | 1.64 | 78.0 | 4.92 |
| 8 | 2.47 | 51.9 | 5.57 |
| 16 | 3.39 | 37.8 | 6.85 |
| 32 | 4.12 | 31.1 | 9.43 |
| 64 | 4.33 | 29.6 | 14.57 |
Table 2. Throughput of the self-hosted, fine-tuned Qwen3-VL-2B-Instruct on L4 NVIDIA GPU 24GB VRAM.
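At the best measured throughput (batch size 64, 4.33 samples/s), the per-sample compute cost on the $2.15/hr L4 VM from Table 1 follows directly:

```python
throughput = 4.33        # samples/s at batch size 64 (Table 2)
vm_cost_per_hour = 2.15  # L4 GPU + VM, $/hr (Table 1)

cost_per_sample = vm_cost_per_hour / 3600 / throughput
print(f"${cost_per_sample:.6f} per sample")  # → $0.000138 per sample
```

At roughly $0.00014 per sample, a fully utilized L4 is more than an order of magnitude cheaper per request than the API prices in Table 3.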
Pricing
Cloud API
| Model | gpt-4o-mini | gpt-5.2 |
|---|---|---|
| Prompt tokens | 1,461,236 | 77,573 |
| Completion tokens | 1,613 | 1,385 |
| Total tokens | 1,462,849 | 78,958 |
| Input cost/1M | $0.15 | $1.75 |
| Output cost/1M | $0.60 | $14.00 |
| Input cost | $0.22 | $0.14 |
| Output cost | $0.00 | $0.02 |
| Total cost | $0.22 | $0.16 |
| Avg cost/request | $0.0036 | $0.0025 |
Table 3. Costs of running GPT-based invoice data extraction. With batching, the reported prices can be reduced by a further 50%, at the cost of waiting up to 24 hours for responses. Likewise, if results may be obtained asynchronously, the GPU VM can be scheduled to run only 2 hours a day, reducing the self-hosted solution's cost by about 90%.
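The two asynchronous-discount scenarios in the caption can be sanity-checked with a few lines:

```python
# API batching: 50% discount on the per-request price (gpt-5.2 row, Table 3).
api_batched = 0.0025 * 0.5  # $/request

# Self-hosted: run the L4 VM 2 hours/day instead of 24 (Table 1 pricing).
full_day = 2.15 * 24   # $/day, always-on
scheduled = 2.15 * 2   # $/day, scheduled
saving_pct = (1 - scheduled / full_day) * 100
print(f"{saving_pct:.1f}% saving")  # → 91.7% saving
```

The ~92% figure is consistent with the "about 90%" reduction claimed above.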
GPT-4o-mini can consume significantly more tokens (sometimes 20x+) for image inputs than GPT-5.2, which can make GPT-4o-mini more expensive per request than nominally pricier models.
The self-hosted solution on an L4-based VM would need to handle at least ~20,800 requests daily to match the GPT prices, assuming the GPU VM is maintained and paid for 24 hours a day. At a throughput of 4.33 samples/s, that daily volume is processed in about 80 minutes. This shows that a single, cost-efficient GPU can handle high traffic and be cheaper in the long term for high-load workloads: the break-even volume is processed in roughly 1.5 hours of compute, leaving capacity to process a much larger volume of documents.
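The break-even volume and the time needed to process it follow from Tables 1-3:

```python
vm_cost_per_day = 2.15 * 24    # always-on L4 VM, $/day (Table 1)
api_cost_per_request = 0.0025  # gpt-5.2 average cost/request (Table 3)
throughput = 4.33              # samples/s at batch size 64 (Table 2)

# Daily request count at which the always-on VM costs the same as the API.
break_even = vm_cost_per_day / api_cost_per_request
# Minutes of GPU time needed to process the stated ~20,800 daily requests.
minutes_needed = 20_800 / throughput / 60
print(round(break_even), round(minutes_needed))  # → 20640 80
```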
A self-hosted solution is a larger one-time investment: an ML engineer must be paid to train and deploy the model. However, excluding dataset preparation, the task can be completed in about a week, which makes self-hosting a budget-friendly approach for high-load agentic and LLM workloads.
Future improvements
The study evaluated the self-hosted model without inference optimization. Serving the Qwen model with vLLM, or optimizing it with TensorRT-LLM and deploying it with Triton Inference Server, could potentially boost inference speed by 4-8x, significantly increasing the self-hosted solution's throughput and profitability.
Study limitations
We do not know whether any of the base models were trained on the invoice/receipt datasets before fine-tuning. Results might therefore be biased in favor of one model or another.
Conclusions
A self-hosted solution has a greater upfront cost, as it requires dataset preparation, model fine-tuning, and hosting setup. However, once set up, it is highly cost-efficient compared to API-based solutions for high-load workflows. At 20,800 requests daily, the L4-hosted solution reaches cost parity with the API-based solution while using the VM and GPU for as little as 80 minutes a day, leaving ample headroom for higher-volume traffic.
Moreover, optimizing the model with TensorRT-LLM and serving it with Triton Inference Server can significantly increase throughput. The self-hosted solution also provides a level of privacy that no API-based solution can match. Fine-tuning with LoRA allows a single base model to serve multiple tasks: one model can back multiple agents, swapping only the adapter according to the task at hand.
Overall, API-based solutions are great for prototyping, while self-hosted solutions are the way to go for well-established, high-volume use cases.