How to use LLMs to improve specialized ML tools

The case of criminal offense classification

Dylan Pieper

The dilemma 😫

Are specialized ML tools and LLMs in competition, or can they work together?

Specialized tools:

  • Better for specific tasks
  • Validated, deterministic
  • Straightforward to retrain

LLMs:

  • Flexible for many tasks
  • Not validated, nondeterministic
  • Difficult to retrain

My use case 🔎

Goals 🎯

  1. Evaluation (sketch below):
    • Use the LLM to discover biases and limitations of the specialized tool through model disagreement and uncertainty
    • Use the specialized tool as a baseline to evaluate LLM performance through distribution overlap (PA offenses)
  2. Extension: Use LLMs to get data not otherwise available
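
A minimal sketch of the evaluation goal, with hypothetical column names and toy rows (not the talk's data): flag model disagreement and low confidence, and measure how much the two models' label distributions overlap.

```python
import pandas as pd

# Toy merged results, one row per offense; columns are hypothetical
df = pd.DataFrame({
    "offense_text": ["simple assault", "retail theft", "IDSI", "DUI"],
    "toc_type": ["violent", "property", "public_order", "public_order"],
    "toc_conf": [0.95, 0.90, 0.91, 0.94],
    "llm_type": ["violent", "property", "violent", "public_order"],
    "llm_conf": [0.88, 0.80, 0.95, 0.70],
})

df["disagree"] = df["toc_type"] != df["llm_type"]
df["uncertain"] = df[["toc_conf", "llm_conf"]].min(axis=1) < 0.75

# Distribution overlap: sum of the element-wise minimum of the two
# models' label proportions (1.0 = identical distributions)
toc_dist = df["toc_type"].value_counts(normalize=True)
llm_dist = df["llm_type"].value_counts(normalize=True)
overlap = pd.concat([toc_dist, llm_dist], axis=1).fillna(0).min(axis=1).sum()

print(df[df["disagree"] | df["uncertain"]])
print(f"Distribution overlap: {overlap:.2f}")
```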

Tools 🛠️

  • TOC:
    • Text-based Offense Classification (specialized ML tool; University of Michigan)
    • Matches text to one of 258 offense codes (e.g., “murder”) to assign a type (e.g., “violent”); a toy sketch follows
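
TOC's real interface isn't reproduced here; it is a trained classifier, not a keyword lookup. This toy sketch only illustrates the two-step mapping it performs, from free text to an offense code and from code to a broad type. The code list and matcher are hypothetical.

```python
# Hypothetical subset of the 258-code taxonomy
CODE_TO_TYPE = {
    "murder": "violent",
    "retail theft": "property",
    "dui": "public_order",
}

def classify(offense_text: str) -> tuple[str, str]:
    """Toy matcher: return the first (code, type) whose code appears in the text."""
    text = offense_text.lower()
    for code, offense_type in CODE_TO_TYPE.items():
        if code in text:
            return code, offense_type
    return "unknown", "unknown"

print(classify("Murder of the first degree"))  # ('murder', 'violent')
```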

TOC results

Data TOC was “good” at classifying:

Mean confidence: 92%

TOC results (cont’d)

Data TOC was “bad” at classifying:

IDSI = Involuntary Deviate Sexual Intercourse

Mean confidence: 92%

LLM results

Data LLM was “good” at classifying:

Mean confidence: 77%
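
One way to obtain LLM labels with self-reported confidence, assuming an OpenAI-style chat client; the model name, prompt, and output schema here are all assumptions, not the talk's actual setup.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_classify(text: str) -> dict:
    prompt = (
        "Classify this criminal offense as violent, property, "
        "public_order, or drug. Reply as JSON with keys 'type' and "
        "'confidence' (0 to 1).\n\nOffense: " + text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# llm_classify("Involuntary deviate sexual intercourse")
# -> {"type": "violent", "confidence": 0.95}  (illustrative output)
```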

Model agreement

Topics of disagreement

Public order disagreements

Use a comparison table to facilitate review:
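
A sketch of that table, on hypothetical merged results: a cross-tabulation makes disagreement clusters (such as public order vs. violent) stand out, and sorting the disagreeing rows puts the least confident cases first for review.

```python
import pandas as pd

# Hypothetical merged results, one row per offense; not the talk's data
df = pd.DataFrame({
    "offense_text": ["disorderly conduct", "simple assault", "IDSI"],
    "toc_type": ["public_order", "violent", "public_order"],
    "toc_conf": [0.93, 0.95, 0.91],
    "llm_type": ["violent", "violent", "violent"],
    "llm_conf": [0.60, 0.88, 0.95],
})

# Cross-tabulate the two models' labels to see where they diverge
print(pd.crosstab(df["toc_type"], df["llm_type"],
                  rownames=["TOC"], colnames=["LLM"]))

# Queue the disagreeing rows for human review, least confident first
review_queue = (
    df[df["toc_type"] != df["llm_type"]]
      .sort_values("toc_conf")
)
print(review_queue)
```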

Bias detection

I also discovered systematic biases (see the sketch below):

Rape misclassified (“IDSI”)

  • TOC classified 5 of 10 as violent
  • LLM classified 10 of 10 as violent

Animal cruelty misclassified (“Animal”)

  • TOC classified 2 of 11 as violent
  • LLM classified 11 of 11 as violent
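
A sketch of the check behind those counts, on toy rows: group offenses by keyword and compare each model's violent-classification rate. A large gap flags a candidate systematic bias.

```python
import pandas as pd

# Toy rows, not the talk's data; the pattern is what matters
df = pd.DataFrame({
    "offense_text": ["IDSI forcible compulsion", "IDSI person less than 16",
                     "cruelty to animals", "aggravated cruelty to animals"],
    "toc_violent": [True, False, False, False],
    "llm_violent": [True, True, True, True],
})

# Group by keyword and compare how often each model says "violent"
df["group"] = df["offense_text"].str.extract(r"(IDSI|animal)", expand=False)
print(df.groupby("group")[["toc_violent", "llm_violent"]].mean())
#         toc_violent  llm_violent
# IDSI            0.5          1.0   <- gap flags a candidate bias
# animal          0.0          1.0
```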

IDSI comparison

Hybrid framework

Not a binary decision

  1. Initial tool: Develop specialized ML tool as a baseline

  2. Auditing signal: Use LLMs to identify disagreements, uncertain cases, and systematic biases

  3. Human review: Use model comparison to focus review and model retraining (sketch below)
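
A minimal sketch of the loop, with stand-in classifiers where the real pipeline would call TOC and an LLM: agreement plus high confidence passes through automatically; everything else is routed to a human.

```python
# Stand-ins for the real classifiers; both return (type, confidence)
def toc_classify(text: str) -> tuple[str, float]:
    return "violent", 0.92   # placeholder for the specialized tool

def llm_classify(text: str) -> tuple[str, float]:
    return "violent", 0.85   # placeholder for the LLM call

def route(offense_text: str, threshold: float = 0.80) -> str:
    toc_type, toc_conf = toc_classify(offense_text)  # 1. baseline tool
    llm_type, llm_conf = llm_classify(offense_text)  # 2. auditing signal
    if toc_type != llm_type or min(toc_conf, llm_conf) < threshold:
        return "human_review"                        # 3. focused review
    return toc_type  # models agree with high confidence

print(route("simple assault"))
```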

Implications

The bias-detection capability alone justifies LLM costs for criminal justice applications

For ML tool users:

  • Audit your classifications
  • Flag cases needing expert review
  • Report biases and limitations
    • In your own work
    • To tool developers

Implications (cont’d)

For ML tool developers:

  • Audit your tools using model disagreement methods
  • Create bias elimination and retraining workflows
  • Leave behind audit trails and write reports
  • If you have money, post bias bounties

Thank you ☺️

Questions?

Contact: Dylan Pieper

Repo: github.com/dylanpieper/posit-conf-2025