What we learned building private SLMs for three different verticals

Finance, legal, and ops all have different failure modes. Here's what we found when we trained domain-specific models for each.

We've now trained private small language models for teams across three verticals: financial services, legal, and enterprise operations. Each had a different primary use case, a different failure mode, and required different approaches to data preparation and evaluation.

Here's what we learned.

Financial services: recall over precision

The use case was transaction anomaly triage - classifying flagged transactions as genuine anomalies, false positives, or uncertain, and generating a brief explanation for the analyst.

The failure mode that mattered was false negatives on genuine anomalies. Missing a real fraud event is far more costly than surfacing a false positive. The team could tolerate over-flagging but had near-zero tolerance for under-flagging.

This shaped the training data selection and the scoring function. We weighted the training set toward confirmed anomaly cases (oversampled 3:1 relative to their base rate) and used an asymmetric reward in the fine-tuning objective that penalized missed anomalies at 5× the weight of false positives.
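As a rough illustration of those two levers, here is a minimal sketch that assumes a class-weighted negative log-likelihood as a stand-in for the asymmetric objective. The label names, the `asymmetric_nll` and `oversample` helpers, and the dictionary-based example format are all hypothetical. The 5× and 3:1 constants come from the text, but this is not the training code we ran.

```python
import math
import random

# Assumption: errors on true anomalies cost 5x; all other errors cost 1x.
MISS_WEIGHT = 5.0


def asymmetric_nll(probs, true_label):
    """Negative log-likelihood, scaled up when the true label is an anomaly.

    `probs` maps each label to the model's predicted probability.
    """
    p = max(probs[true_label], 1e-12)  # guard against log(0)
    weight = MISS_WEIGHT if true_label == "anomaly" else 1.0
    return -weight * math.log(p)


def oversample(examples, label="anomaly", factor=3, seed=0):
    """Repeat examples of `label` so each appears `factor` times, then shuffle."""
    rng = random.Random(seed)
    out = []
    for ex in examples:
        copies = factor if ex["label"] == label else 1
        out.extend([ex] * copies)
    rng.shuffle(out)
    return out


probs = {"anomaly": 0.5, "false_positive": 0.3, "uncertain": 0.2}
loss_if_anomaly = asymmetric_nll(probs, "anomaly")        # 5 * -log(0.5)
loss_if_benign = asymmetric_nll(probs, "false_positive")  # 1 * -log(0.3)
```

With a weight like this, the same amount of uncertainty costs the model more when the true label is an anomaly, which pushes it toward flagging when in doubt.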

The result was a model that ran 12× cheaper than the frontier model it replaced, with a false negative rate that was actually *lower* than the frontier model's on their specific transaction types. The frontier model had better general anomaly detection capability, but the private model had better calibration for their particular data distribution.

The lesson: for high-stakes classification tasks, a well-calibrated narrow model often outperforms a capable general model. The general model doesn't know what's normal for your specific data.

Legal: consistency over accuracy

The use case was contract clause extraction - identifying and categorizing specific clause types across large volumes of contracts, and flagging non-standard language for attorney review.

The failure mode here was inconsistency. The attorneys could live with the model occasionally misclassifying a clause type. What they couldn't live with was the model classifying the same clause differently across two contracts, because inconsistency made the outputs untrustworthy for downstream processes.

We measured this explicitly during eval: we ran the same clause through the model 50 times with identical context and measured output variance. The frontier model had a variance rate (different classification on the same input) of around 4%. The initial fine-tuned model was similar.
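A minimal sketch of that measurement, assuming "variance rate" means the share of runs whose label differs from the most common label for that input. The `classify` callable is a hypothetical wrapper around whatever model is under test.

```python
from collections import Counter


def variance_rate(classify, clause, runs=50):
    """Fraction of runs whose label differs from the modal label."""
    labels = [classify(clause) for _ in range(runs)]
    modal_count = Counter(labels).most_common(1)[0][1]
    return 1 - modal_count / runs


def mean_variance_rate(classify, clauses, runs=50):
    """Average the per-clause variance rate over an eval set."""
    return sum(variance_rate(classify, c, runs) for c in clauses) / len(clauses)
```

Under this definition, a model that disagrees with its own most common answer on 2 of 50 runs scores 4%.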

Getting the variance rate below 1% required two things. First, temperature 0 inference - obvious in retrospect, but the team had been running at 0.3 for "more natural" outputs. Second, a consistency-weighted fine-tuning objective that explicitly penalized output variance across semantically identical inputs in the training batch.
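One way to picture a consistency-weighted objective is a standard cross-entropy term plus a penalty that grows when the model's predicted distributions disagree across a group of semantically identical inputs. The sketch below makes that assumption and uses the mean KL divergence to the group average as the penalty. The function names and the `lam` weight are hypothetical, and this is not necessarily the formulation used in the project.

```python
import math


def mean_distribution(dists):
    """Element-wise average of several probability distributions."""
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]


def kl(p, q):
    """KL divergence KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def consistency_penalty(dists):
    """Mean KL from each distribution to the group mean (zero when all agree)."""
    m = mean_distribution(dists)
    return sum(kl(d, m) for d in dists) / len(dists)


def group_loss(dists, true_index, lam=1.0):
    """Average cross-entropy over the group plus a weighted consistency penalty."""
    ce = sum(-math.log(max(d[true_index], 1e-12)) for d in dists) / len(dists)
    return ce + lam * consistency_penalty(dists)


# Two paraphrases of the same clause, scored over three clause types.
agreeing = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1]]
disagreeing = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]
```

The penalty is zero for `agreeing` and positive for `disagreeing`, so the optimizer is rewarded for making the model give the same answer to equivalent inputs.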

The lesson: for document processing workflows where outputs feed into structured systems, consistency is a first-class metric. Optimizing for it requires measuring it explicitly.

Enterprise operations: coverage over everything

The use case was support ticket routing - classifying inbound tickets across 40+ routing categories and generating a brief triage summary.

The failure mode was coverage gaps. The team had a long tail of ticket types that didn't fit neatly into any category, and the model was routing these to a catch-all bucket that was chronically understaffed. The actual problem wasn't misclassification of well-represented categories - the base model handled those fine. It was the 15% of tickets that were genuinely ambiguous or rare.

The fix required expanding the training data specifically for underrepresented categories. We pulled the last 6 months of tickets routed to the catch-all bucket, had the team label them against an expanded taxonomy, and added them to the training set with 2× weighting.

More importantly, we changed the output format. Instead of predicting a single category, the model now outputs a ranked list of the top 3 most likely categories with confidence scores. Tickets where the top confidence score is below a threshold go to a human review queue rather than being auto-routed. This reduced miscategorization on ambiguous tickets by 60% at the cost of a modest increase in human review volume.
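The routing rule described above can be sketched as follows. This assumes the model exposes one raw score per category and that confidence is a softmax over those scores. The 0.6 threshold, the category names, and the `human_review` queue name are placeholders, not the values used in production.

```python
import math


def softmax(scores):
    """Convert raw per-category scores into probabilities."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def route(scores, threshold=0.6, k=3):
    """Return the top-k categories and the queue the ticket should go to."""
    probs = softmax(scores)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    top_category, top_conf = ranked[0]
    # Low top confidence means the ticket goes to a person instead of auto-routing.
    queue = top_category if top_conf >= threshold else "human_review"
    return {"queue": queue, "candidates": ranked}


clear = route({"billing": 5.0, "login": 1.0, "shipping": 0.5, "other": 0.0})
ambiguous = route({"billing": 1.0, "login": 0.9, "shipping": 0.8, "other": 0.0})
```

Here `clear` is auto-routed to `billing`, while `ambiguous` lands in the review queue with its three candidates attached, so the reviewer starts from the model's best guesses.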

The lesson: for routing and triage tasks, an "I'm not sure" output is often more valuable than a wrong output. Design your output schema to make uncertainty expressible.

What generalizes

Across all three cases, the same principles held:

Your failure mode is your training signal. The most important thing you can do before starting a fine-tuning project is define what failure looks like and make sure your eval set measures it directly.
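To make "measures it directly" concrete, here is a small sketch using the financial services case, where the failure mode is a missed anomaly. It assumes that any prediction other than `anomaly` on a true anomaly counts as a miss. The label names and toy data are illustrative only.

```python
def accuracy(gold, pred):
    """Share of examples where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def false_negative_rate(gold, pred, positive="anomaly"):
    """Share of true positives that the model failed to flag."""
    positives = [(g, p) for g, p in zip(gold, pred) if g == positive]
    if not positives:
        return 0.0
    missed = sum(1 for _, p in positives if p != positive)
    return missed / len(positives)


# Toy eval set: 8 benign flags and 2 real anomalies.
gold = ["false_positive"] * 8 + ["anomaly"] * 2
# A model that never predicts "anomaly" still looks good on accuracy.
pred = ["false_positive"] * 10

acc = accuracy(gold, pred)
fnr = false_negative_rate(gold, pred)
```

On this toy set the model scores 80% accuracy while missing every real anomaly, which is why the headline eval metric should be the failure mode itself.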

Production data beats synthetic data. Synthetic training examples follow a distribution that rarely matches what your model will see in production. In our projects, real labeled traces from the actual workload were worth roughly 10× more per example.

Eval infrastructure is the bottleneck. In every case, the thing that determined how fast we could iterate was how quickly we could run a meaningful eval. Teams that invested in fast, trusted evals shipped better models faster.

The verticals were different. The discipline was the same.