While API-based LLMs are great for rapid and easy development, they can be less secure and more costly in the long run for load-intensive applications. One solution is a Small Language Model (SLM), self-hosted and fine-tuned on the downstream task. This article presents a case study of Supervised Fine-Tuning (SFT) of an SLM on an invoice-processing task. It shows that while SLMs require a higher upfront investment, they are faster, cheaper, and more secure in the long term, especially for high-load applications.
Task
The task is to extract key fields from invoice and receipt images and digital invoice PDFs of diverse layouts and quality, and return them as a structured JSON object.
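The exact output schema is not reproduced here. Based on the three numeric keys evaluated later, a hypothetical target might look like the following; the field names "subtotal", "tax", and "total" are illustrative stand-ins, not the names used in the study:

```python
import json

# Hypothetical example of a model response: three numeric invoice fields.
# The actual field names from the study's prompt are not shown in this article.
response = '{"subtotal": 100.00, "tax": 8.25, "total": 108.25}'

record = json.loads(response)
assert set(record) == {"subtotal", "tax", "total"}
assert all(isinstance(v, (int, float)) for v in record.values())
print(record["total"])  # → 108.25
```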
Methodology
For key-value extraction from invoice images, a single-pass approach using a Vision-Language Model (VLM) was chosen: each image was sent to the model along with a prompt specifying the fields to extract and the expected output format.
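As a sketch, the single-pass request can be assembled as one chat turn containing the instruction text and the image inlined as a base64 data URL. The payload shape below follows the common OpenAI-style chat format; the actual prompt and client code used in the study are not shown:

```python
import base64

def build_payload(image_bytes: bytes, prompt: str, model: str) -> dict:
    """Assemble a single-pass VLM request: one user turn with the
    instruction text and the invoice image as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_payload(b"\x89PNG...", "Extract the key fields as JSON.", "gpt-4o-mini")
```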
User prompt:
The performance was evaluated against the following models:
- Qwen3-VL-2B-Instruct
- Qwen3-VL-2B-Instruct, SFT with LoRA (r = 4, alpha = 8)
- OpenAI GPT-4o-mini
- OpenAI GPT-5.2
The SLM (Qwen3-VL-2B-Instruct) was trained for 3 epochs on a train split of 1,622 samples with an effective batch size of 16. Training was performed in bfloat16 precision using LoRA (Low-Rank Adaptation) with rank 4 and alpha 8, resulting in 4,358,144 trainable parameters out of 2,131,890,176 total (~0.2044% of all model parameters).
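The reported trainable fraction can be checked directly from the parameter counts:

```python
# Trainable vs. total parameters reported for the LoRA run (r=4, alpha=8).
trainable = 4_358_144
total = 2_131_890_176

fraction_pct = trainable / total * 100
print(f"{fraction_pct:.4f}%")  # → 0.2044%
```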
Metrics
For evaluation purposes, the following metrics were used:
- Overall accuracy: the percentage of samples for which all 3 fields are within ±1% of their ground-truth values.
- Per-key accuracy: the accuracy percentage for each of the 3 keys individually. A value counts as correct if it is within ±1% of the ground-truth value.
- Mean absolute error per key:
sum(abs(p - g) for p, g in pairs) / len(pairs)
- Mean absolute percentage error: the error expressed as a percentage of the ground-truth value, computed over pairs with a nonzero ground truth.
sum(abs(p - g) / abs(g) for p, g in nonzero_pairs) / len(nonzero_pairs) * 100
Cost calculations
For the self-hosted Qwen3-VL-2B-Instruct cost calculations (inference hosting), average GPU prices across typical GCP and AWS regions were used. The following prices were used for the calculations:
| GPU | Base avg. GPU cost/hr | VM overhead/hr (+60% for vCPUs, RAM, storage; or $0.04/hr per vCPU + $0.004/hr per GB RAM) | Total/hr |
|---|---|---|---|
| H100 (80 GB) | $4.00 | $2.40 | $6.40 |
| A100 | $3.50 | $2.00 | $5.50 |
| L4 | $1.00 | $1.15 | $2.15 |
Table 1. GPU + VM costs estimation for self-hosted solutions in GCP/AWS cloud.
The Qwen3-VL-2B-Instruct was trained on an L4 GPU for ~3 hours; the total training compute cost was ~$8.
Data
The dataset used was a mix of the Fatura dataset (Creative Commons Attribution 4.0 International) and the "A labeled dataset of hand-captured images of restaurant receipts" dataset (Creative Commons Attribution 4.0 International). For fine-tuning, 1,662 training samples were used. The dataset consisted of a mix of receipt images, invoice scans, and digital invoice PDFs. Most documents were in English, but some were in other languages, such as French.

Image 1. Sample training dataset item, with ground truth and prediction results. Source: https://zenodo.org/records/13688441

Image 2. Sample training dataset item, with ground truth and prediction results. Source: https://zenodo.org/records/13688441
Out-of-distribution evaluation
The fine-tuned Qwen model was also evaluated on a completely separate dataset, unseen during fine-tuning: katanaml-org/invoices-donut-data-v1 from Hugging Face.

Image 3. Sample from the out of distribution dataset. Source: katanaml-org/invoices-donut-data-v1
Results
Accuracy

Figure 1. Evaluation results on the in-distribution dataset (the dataset that the model was finetuned on). Evaluated on 62 samples.

Figure 2. Evaluation results on the in-distribution dataset (the dataset that the model was finetuned on). Evaluated on 62 samples.

Figure 3. Evaluation results on the in-distribution dataset. Mean Absolute Error on a log scale (lower is better). Evaluated on 62 samples.

Figure 4. Evaluation results on the in-distribution dataset. Mean Absolute Percentage Error on a log scale (lower is better). Evaluated on 62 samples.

Figure 5. The LoRA fine-tuned Qwen3-VL-2B-Instruct was evaluated on a completely separate dataset to verify generalization to out-of-distribution samples.
| Batch size | Throughput (samples/s) | Wall time (s, 128 samples) | Peak VRAM (GB) |
|---|---|---|---|
| 1 | 0.54 | 237.0 | 4.44 |
| 2 | 0.97 | 132.1 | 4.60 |
| 4 | 1.64 | 78.0 | 4.92 |
| 8 | 2.47 | 51.9 | 5.57 |
| 16 | 3.39 | 37.8 | 6.85 |
| 32 | 4.12 | 31.1 | 9.43 |
| 64 | 4.33 | 29.6 | 14.57 |
Table 2. Throughput of the self-hosted, fine-tuned Qwen3-VL-2B-Instruct on L4 NVIDIA GPU 24GB VRAM.
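At the best measured throughput (batch size 64, 4.33 samples/s), the per-sample compute cost on the $2.15/hr L4 VM from Table 1 follows directly:

```python
throughput = 4.33        # samples/s at batch size 64 (Table 2)
vm_cost_per_hour = 2.15  # L4 GPU + VM, $/hr (Table 1)

cost_per_sample = vm_cost_per_hour / 3600 / throughput
print(f"${cost_per_sample:.6f} per sample")  # → $0.000138 per sample
```

At roughly $0.00014 per sample, a fully utilized L4 is more than an order of magnitude cheaper per request than the API prices in Table 3.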
Pricing
Cloud API
| Model | gpt-4o-mini | gpt-5.2 |
|---|---|---|
| Prompt tokens | 1,461,236 | 77,573 |
| Completion tokens | 1,613 | 1,385 |
| Total tokens | 1,462,849 | 78,958 |
| Input cost/1M | $0.15 | $1.75 |
| Output cost/1M | $0.60 | $14.00 |
| Input cost | $0.22 | $0.14 |
| Output cost | $0.00 | $0.02 |
| Total cost | $0.22 | $0.16 |
| Avg cost/request | $0.0036 | $0.0025 |
Table 3. Costs of running GPT-based invoice data extraction. With batching, the reported prices can be reduced by a further 50%, at the cost of waiting up to 24 hours for responses. Likewise, if results may be obtained asynchronously, the GPU VM can be scheduled to run only 2 hours a day, reducing the self-hosted solution's cost by about 90%.
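The two asynchronous-discount scenarios in the caption can be sanity-checked with a few lines:

```python
# API batching: 50% discount on the per-request price (gpt-5.2 row, Table 3).
api_batched = 0.0025 * 0.5  # $/request

# Self-hosted: run the L4 VM 2 hours/day instead of 24 (Table 1 pricing).
full_day = 2.15 * 24   # $/day, always-on
scheduled = 2.15 * 2   # $/day, scheduled
saving_pct = (1 - scheduled / full_day) * 100
print(f"{saving_pct:.1f}% saving")  # → 91.7% saving
```

The ~92% figure is consistent with the "about 90%" reduction claimed above.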
GPT-4o-mini can consume significantly more tokens (sometimes 20x+) for image inputs than GPT-5.2, which can make GPT-4o-mini more expensive per request than nominally pricier models.
The self-hosted solution on an L4-based VM would need to handle at least ~20,800 requests daily to match the GPT prices, assuming the GPU VM is maintained and paid for 24 hours a day. At a throughput of 4.33 samples/s, that daily volume is processed in about 80 minutes. This shows that a single, cost-efficient GPU can handle high traffic and be cheaper in the long term for high-load workloads: the break-even volume is processed in roughly 1.5 hours of compute, leaving capacity to process a much larger volume of documents.
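The break-even volume and the time needed to process it follow from Tables 1-3:

```python
vm_cost_per_day = 2.15 * 24    # always-on L4 VM, $/day (Table 1)
api_cost_per_request = 0.0025  # gpt-5.2 average cost/request (Table 3)
throughput = 4.33              # samples/s at batch size 64 (Table 2)

# Daily request count at which the always-on VM costs the same as the API.
break_even = vm_cost_per_day / api_cost_per_request
# Minutes of GPU time needed to process the stated ~20,800 daily requests.
minutes_needed = 20_800 / throughput / 60
print(round(break_even), round(minutes_needed))  # → 20640 80
```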
A self-hosted solution is a larger one-time investment: an ML engineer must be paid to train and deploy the model. However, excluding dataset preparation, the task can be completed in about a week, which makes self-hosting a budget-friendly approach for high-load agentic and LLM workloads.
Future improvements
The study evaluated the self-hosted model without inference optimization. Serving the Qwen model with vLLM, or optimizing it with TensorRT-LLM and deploying it with Triton Inference Server, could potentially boost inference speed by 4-8x, significantly increasing the self-hosted solution's throughput and profitability.
Study limitations
We do not know whether any of the base models were trained on the invoice/receipt datasets before fine-tuning. Results might therefore be biased in favor of one model or another.
Conclusions
A self-hosted solution has a greater upfront cost, as it requires dataset preparation, model fine-tuning, and hosting setup. However, once set up, it is highly cost-efficient compared to API-based solutions for high-load workflows. At 20,800 requests daily, the L4-hosted solution reaches cost parity with the API-based solution while using the VM and GPU for as little as 80 minutes a day, leaving ample headroom for higher-volume traffic.
Moreover, optimizing the model with TensorRT-LLM and serving it with Triton Inference Server can significantly increase throughput. The self-hosted solution also provides a level of privacy that no API-based solution can match. Fine-tuning with LoRA allows a single base model to serve multiple tasks: one model can back multiple agents, swapping only the adapter according to the task at hand.
Overall, API-based solutions are great for prototyping, while self-hosted solutions are the way to go for well-established, high-volume use cases.