
SFT: Scaling Small Vision-Language Models for High-Load Invoice Processing

Kamil Rzechowski

ML Engineer
Apr 3, 2026 | 9 min read

Image 1. Sample training dataset item, with ground truth and prediction results. Source: https://zenodo.org/records/13688441

Image 2. Sample training dataset item, with ground truth and prediction results. Source: https://zenodo.org/records/13688441

Image 3. Sample from the out-of-distribution dataset. Source: katanaml-org/invoices-donut-data-v1

Figure 1. Evaluation results on the in-distribution dataset (the dataset the model was fine-tuned on). Evaluated on 62 samples.

Figure 2. Evaluation results on the in-distribution dataset (the dataset the model was fine-tuned on). Evaluated on 62 samples.

Figure 3. Evaluation results on the in-distribution dataset. Mean Absolute Error on a log scale (lower is better). Evaluated on 62 samples.

Figure 4. Evaluation results on the in-distribution dataset. Mean Absolute Percentage Error on a log scale (lower is better). Evaluated on 62 samples.
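
For readers unfamiliar with the two error metrics named in Figures 3 and 4, the following is a minimal sketch of their standard definitions. The function names and sample values are hypothetical and chosen only for illustration; the article does not show its evaluation code, so this is not necessarily how the reported numbers were computed.

```python
def mean_absolute_error(truth, predicted):
    """Average absolute difference between ground-truth and predicted values."""
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)


def mean_absolute_percentage_error(truth, predicted):
    """Average absolute error relative to the ground truth, in percent.

    Pairs whose ground truth is zero are skipped to avoid dividing by zero
    (an assumption of this sketch, not something stated in the article).
    """
    ratios = [abs(t - p) / abs(t) for t, p in zip(truth, predicted) if t != 0]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical invoice totals, not taken from the article's datasets.
truth = [100.0, 200.0, 50.0]
predicted = [110.0, 190.0, 50.0]

mae = mean_absolute_error(truth, predicted)
mape = mean_absolute_percentage_error(truth, predicted)
print(f"MAE: {mae:.2f}, MAPE: {mape:.2f}%")
```

MAE is expressed in the same units as the extracted field (for example, currency), while MAPE normalizes by the true value, which makes errors on small and large amounts comparable.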

Figure 5. The LoRA fine-tuned Qwen3-VL Instruct model was evaluated on a completely separate dataset to verify that it generalizes to out-of-distribution samples.