How to use LLMs to improve specialized ML tools

The case of criminal offense classification

Dylan Pieper

The dilemma 😫

Are specialized ML tools and LLMs in competition, or can they work together?

Specialized tools:

  • Better for specific tasks
  • Validated, deterministic
  • Straightforward to retrain

LLMs:

  • Flexible for many tasks
  • Not validated, nondeterministic
  • Difficult to retrain

My use case 🔎

Goals 🎯

  1. Evaluation (sketch below):
    • Use the LLM to discover biases and limitations of the specialized tool through model disagreement and uncertainty
    • Use the specialized tool as a baseline to evaluate LLM performance through distribution overlap (PA offenses)
  2. Extension: Use LLMs to get data not otherwise available
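
A minimal sketch of the evaluation goal, with hypothetical column names and toy rows (not the talk's data): flag model disagreement and low confidence, and measure how much the two models' label distributions overlap.

```python
import pandas as pd

# Toy merged results, one row per offense; columns are hypothetical
df = pd.DataFrame({
    "offense_text": ["simple assault", "retail theft", "IDSI", "DUI"],
    "toc_type": ["violent", "property", "public_order", "public_order"],
    "toc_conf": [0.95, 0.90, 0.91, 0.94],
    "llm_type": ["violent", "property", "violent", "public_order"],
    "llm_conf": [0.88, 0.80, 0.95, 0.70],
})

df["disagree"] = df["toc_type"] != df["llm_type"]
df["uncertain"] = df[["toc_conf", "llm_conf"]].min(axis=1) < 0.75

# Distribution overlap: sum of the element-wise minimum of the two
# models' label proportions (1.0 = identical distributions)
toc_dist = df["toc_type"].value_counts(normalize=True)
llm_dist = df["llm_type"].value_counts(normalize=True)
overlap = pd.concat([toc_dist, llm_dist], axis=1).fillna(0).min(axis=1).sum()

print(df[df["disagree"] | df["uncertain"]])
print(f"Distribution overlap: {overlap:.2f}")
```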

Tools 🛠️

  • TOC:
    • Text-based Offense Classification (specialized ML tool; University of Michigan)
    • Matches text to one of 258 offense codes (e.g., “murder”) to assign a type (e.g., “violent”); a toy sketch follows
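
TOC's real interface isn't reproduced here; it is a trained classifier, not a keyword lookup. This toy sketch only illustrates the two-step mapping it performs, from free text to an offense code and from code to a broad type. The code list and matcher are hypothetical.

```python
# Hypothetical subset of the 258-code taxonomy
CODE_TO_TYPE = {
    "murder": "violent",
    "retail theft": "property",
    "dui": "public_order",
}

def classify(offense_text: str) -> tuple[str, str]:
    """Toy matcher: return the first (code, type) whose code appears in the text."""
    text = offense_text.lower()
    for code, offense_type in CODE_TO_TYPE.items():
        if code in text:
            return code, offense_type
    return "unknown", "unknown"

print(classify("Murder of the first degree"))  # ('murder', 'violent')
```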

TOC results

Data TOC was “good” at classifying:

Mean confidence: 92%

TOC results (cont’d)

Data TOC was “bad” at classifying:

IDSI = Involuntary Deviate Sexual Intercourse

Mean confidence: 92%

LLM results

Data LLM was “good” at classifying:

Mean confidence: 77%
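
One way to obtain LLM labels with self-reported confidence, assuming an OpenAI-style chat client; the model name, prompt, and output schema here are all assumptions, not the talk's actual setup.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_classify(text: str) -> dict:
    prompt = (
        "Classify this criminal offense as violent, property, "
        "public_order, or drug. Reply as JSON with keys 'type' and "
        "'confidence' (0 to 1).\n\nOffense: " + text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# llm_classify("Involuntary deviate sexual intercourse")
# -> {"type": "violent", "confidence": 0.95}  (illustrative output)
```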

Model agreement

Topics of disagreement

Public order disagreements

Use a comparison table to facilitate review:
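
A sketch of that table, on hypothetical merged results: a cross-tabulation makes disagreement clusters (such as public order vs. violent) stand out, and sorting the disagreeing rows puts the least confident cases first for review.

```python
import pandas as pd

# Hypothetical merged results, one row per offense; not the talk's data
df = pd.DataFrame({
    "offense_text": ["disorderly conduct", "simple assault", "IDSI"],
    "toc_type": ["public_order", "violent", "public_order"],
    "toc_conf": [0.93, 0.95, 0.91],
    "llm_type": ["violent", "violent", "violent"],
    "llm_conf": [0.60, 0.88, 0.95],
})

# Cross-tabulate the two models' labels to see where they diverge
print(pd.crosstab(df["toc_type"], df["llm_type"],
                  rownames=["TOC"], colnames=["LLM"]))

# Queue the disagreeing rows for human review, least confident first
review_queue = (
    df[df["toc_type"] != df["llm_type"]]
      .sort_values("toc_conf")
)
print(review_queue)
```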

Bias detection

I also discovered systematic biases (see the sketch below):

Rape misclassified (“IDSI”)

  • TOC classified 5 of 10 as violent
  • LLM classified 10 of 10 as violent

Animal cruelty misclassified (“Animal”)

  • TOC classified 2 of 11 as violent
  • LLM classified 11 of 11 as violent
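
A sketch of the check behind those counts, on toy rows: group offenses by keyword and compare each model's violent-classification rate. A large gap flags a candidate systematic bias.

```python
import pandas as pd

# Toy rows, not the talk's data; the pattern is what matters
df = pd.DataFrame({
    "offense_text": ["IDSI forcible compulsion", "IDSI person less than 16",
                     "cruelty to animals", "aggravated cruelty to animals"],
    "toc_violent": [True, False, False, False],
    "llm_violent": [True, True, True, True],
})

# Group by keyword and compare how often each model says "violent"
df["group"] = df["offense_text"].str.extract(r"(IDSI|animal)", expand=False)
print(df.groupby("group")[["toc_violent", "llm_violent"]].mean())
#         toc_violent  llm_violent
# IDSI            0.5          1.0   <- gap flags a candidate bias
# animal          0.0          1.0
```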

IDSI comparison

Hybrid framework

Not a binary decision

  1. Initial tool: Develop specialized ML tool as a baseline

  2. Auditing signal: Use LLMs to identify disagreements, uncertain cases, and systematic biases

  3. Human review: Use model comparison to focus review and model retraining (sketch below)
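
A minimal sketch of the loop, with stand-in classifiers where the real pipeline would call TOC and an LLM: agreement plus high confidence passes through automatically; everything else is routed to a human.

```python
# Stand-ins for the real classifiers; both return (type, confidence)
def toc_classify(text: str) -> tuple[str, float]:
    return "violent", 0.92   # placeholder for the specialized tool

def llm_classify(text: str) -> tuple[str, float]:
    return "violent", 0.85   # placeholder for the LLM call

def route(offense_text: str, threshold: float = 0.80) -> str:
    toc_type, toc_conf = toc_classify(offense_text)  # 1. baseline tool
    llm_type, llm_conf = llm_classify(offense_text)  # 2. auditing signal
    if toc_type != llm_type or min(toc_conf, llm_conf) < threshold:
        return "human_review"                        # 3. focused review
    return toc_type  # models agree with high confidence

print(route("simple assault"))
```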

Implications

The bias-detection capability alone justifies LLM costs for criminal justice applications

For ML tool users:

  • Audit your classifications
  • Flag cases needing expert review
  • Report biases and limitations
    • In your own work
    • To tool developers

Implications (cont’d)

For ML tool developers:

  • Audit your tools using model disagreement methods
  • Create bias elimination and retraining workflows
  • Leave behind audit trails and write reports
  • If you have money, post bias bounties

Thank you ☺️

Questions?

Contact: Dylan Pieper

Repo: github.com/dylanpieper/posit-conf-2025