I fine-tuned a 0.8-billion-parameter model on financial sentiment data. The base model scored 27.4% accuracy before training. After three epochs it hit 77.9%. That gap is what this post is about.

Language models are trained on enormous amounts of internet text, which makes them good at a lot of things. Financial sentiment isn't one of them. Take this sentence: "The company's revenues declined less than expected." A model with no domain exposure reads "revenues declined" and calls it negative. Anyone who follows markets knows that beating expectations is often positive news - the qualifier "less than expected" flips the whole thing. That's the kind of context a base model doesn't have.

I deliberately used Qwen3.5-0.8B, a small model. The question I wanted to answer was how much domain knowledge you can inject into a tiny model through targeted training, rather than assuming you need something larger. Fine-tuning it on 8GB of VRAM meant using QLoRA: the base model is quantized to 4-bit, and only small adapter matrices are trained on top of the frozen weights. The base weights never change; the adapters learn what the base model lacks. At inference you load both together and that's your fine-tuned model.
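Here's roughly what that setup looks like with Hugging Face transformers, peft, and bitsandbytes. Treat it as a sketch: the hub id is my guess at the checkpoint name, and the LoRA hyperparameters are illustrative rather than the exact values from my run.

```python
# Minimal QLoRA setup: 4-bit quantized base model, trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "Qwen/Qwen3.5-0.8B"  # assumed hub id for the checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the usual QLoRA choice
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                   # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # base stays frozen; adapters train
model.print_trainable_parameters()
```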

Training data was FinGPT's financial sentiment dataset - tweets, headlines, and analyst commentary labelled as positive, negative, or neutral. In-domain data is the point here: it's not general English, it's financial news, the same kind of text the model would see in production.
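Each labelled example gets rendered into an instruction-style prompt before training. A sketch of the idea - the dataset id and column names below are assumptions about the FinGPT release on the Hugging Face hub, so check them against the actual dataset card:

```python
# Turn each labelled example into a prompt/completion pair for causal LM training.
from datasets import load_dataset

dataset = load_dataset("FinGPT/fingpt-sentiment-train", split="train")  # assumed id

def format_example(example):
    # "input" (the text) and "output" (the label) are assumed column names.
    prompt = (
        "What is the sentiment of this financial text? "
        "Answer with positive, negative, or neutral.\n"
        f"Text: {example['input']}\n"
        "Sentiment:"
    )
    return {"text": prompt + " " + example["output"]}

dataset = dataset.map(format_example)
```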

Before any training I ran the base model on 2,388 evaluation samples. Out of 1,566 neutral tweets, it correctly identified almost none. Neutral recall was essentially zero. The model defaulted to positive for most inputs, occasionally catching a negative. This isn't a problem with the model architecture. The base model simply has no concept of what financial neutrality looks like - it was never shown it - so it never predicts it.
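For reference, the baseline numbers came from a loop along these lines, reusing the model and tokenizer from the setup sketch above. It's an approximation of my script, not the script itself: `eval_samples` and the label parsing are placeholders.

```python
# Greedy-decode one label per evaluation sample, then score per class.
from sklearn.metrics import classification_report

def predict(text):
    prompt = (
        "What is the sentiment of this financial text? "
        "Answer with positive, negative, or neutral.\n"
        f"Text: {text}\nSentiment:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=4, do_sample=False)
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    for label in ("positive", "negative", "neutral"):
        if label in completion.lower():
            return label
    return "unparsed"  # counts as wrong for every class

# eval_samples: list of {"text": ..., "label": ...} dicts (assumed structure)
y_true = [s["label"] for s in eval_samples]
y_pred = [predict(s["text"]) for s in eval_samples]
print(classification_report(y_true, y_pred, digits=3))  # per-class precision/recall
```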

Training ran for three epochs, 14,397 steps. Loss went from 0.57 early on down to 0.045 by the end, decaying cleanly with the cosine schedule. Token accuracy reached 99.3% on the training set.
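The configuration was along these lines. The epoch count and cosine schedule match what I described; the rest (batch size, learning rate, warmup) is a plausible 8GB setup rather than my exact values.

```python
# Training loop sketch: tokenize the formatted prompts, train with cosine decay.
from transformers import (
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="qlora-finsentiment",
    num_train_epochs=3,
    per_device_train_batch_size=2,    # small batch to fit in 8GB of VRAM
    gradient_accumulation_steps=4,    # effective batch of 8 (illustrative)
    learning_rate=2e-4,               # typical QLoRA range; an assumption
    lr_scheduler_type="cosine",       # the cosine decay mentioned above
    warmup_ratio=0.03,
    logging_steps=50,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```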

The fine-tuned model landed at 77.9% accuracy and a macro F1 of 0.76. Recall on positive and negative classes was 92% and 94%. Neutral precision was 0.97, meaning when it calls something neutral it's almost always right. The weak spot is neutral recall at 70% - 468 neutral tweets got called positive or negative instead. The boundary between neutral and mildly positive or negative is genuinely fuzzy, and I suspect more training examples at that edge would move the number, though I haven't tested it.
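Re-running the evaluation loop against the fine-tuned model is just a matter of loading the adapters on top of the quantized base, as described earlier. Paths here are illustrative:

```python
# Load the trained adapters onto a fresh 4-bit base for inference.
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
# Final checkpoint of the 14,397-step run; the path is an assumption.
model = PeftModel.from_pretrained(base, "qlora-finsentiment/checkpoint-14397")
```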

27.4% to 77.9% from three epochs of domain-specific training. The base model already knew English, understood financial vocabulary, and could format a response. What it couldn't do was draw the line between positive, negative, and neutral in a financial context. That turns out to be a learnable thing, and it doesn't take much data to learn it.