From messy to meaningful data

LLM-powered classification in R 🪄

Dylan Pieper

University of Pittsburgh

Magic does exist

LLM Workflow

LLMs make unstructured data analysis accessible

One model to process thousands of texts, documents, and images

What we’ll cover today

Use library(ellmer) for robust classification, focusing on:

  • Structure
  • Prompting
  • Accuracy
  • Confidence
  • Validation

What is extraction?

Describe the distinct features of an object

Setosa

Image: Iris setosa flower (Wikipedia, public domain)

  • Smallest
  • Short petals
  • Wide sepals

Virginica

Image: Iris virginica flower (Wikipedia, by Eric Hunt, CC-BY-SA 4.0)

  • Largest
  • Long petals
  • Narrow structure

Versicolor

Image: Iris versicolor flower (Wikipedia, by D. Gordon E. Robertson, CC-BY-SA 3.0)

  • Medium sized
  • Most variable
  • Balanced proportions

What is classification?

Predict the distinct category an object belongs to

Setosa

Image: Iris setosa flower (Wikipedia, public domain)

  • .90 setosa
  • .05 virginica
  • .05 versicolor

Virginica

Image: Iris virginica flower (Wikipedia, by Eric Hunt, CC-BY-SA 4.0)

  • .05 setosa
  • .90 virginica
  • .05 versicolor

Versicolor

Image: Iris versicolor flower (Wikipedia, by D. Gordon E. Robertson, CC-BY-SA 3.0)

  • .05 setosa
  • .05 virginica
  • .90 versicolor

Specify type structure

Image: Iris versicolor

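library(ellmer)
# Create a chat object for the chosen provider and model
chat <- chat("anthropic/claude-sonnet-4-20250514")
# Describe the fields the model must return
type_flower <- type_object(
  genus    = type_string(),
  species  = type_string(),
  features = type_string("Focus on morphology")
)
# Extract structured data from the image and inspect the result
str(chat$chat_structured(
  content_image_file("images/versicolor.jpg"),
  type = type_flower
))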
List of 3
 $ genus   : chr "Iris"
 $ species : chr "versicolor"
 $ features: chr "Purple-blue flowers with three large falls (drooping petals) and three smaller standards (upright petals), yell"| __truncated__

Mimic traditional ML output

Image: Iris versicolor

prob <- "Probability (0.00-1.00) Iris is {{species}}"
type_flower <- type_object(
  species         = type_enum(c("setosa", "virginica", "versicolor")),
  prob_setosa     = type_number(interpolate(prob, species = "setosa")),
  prob_virginica  = type_number(interpolate(prob, species = "virginica")),
  prob_versicolor = type_number(interpolate(prob, species = "versicolor"))
)
str(chat$clone()$chat_structured(
  content_image_file("images/versicolor.jpg"), type = type_flower
))
List of 4
 $ species        : chr "versicolor"
 $ prob_setosa    : num 0.05
 $ prob_virginica : num 0.1
 $ prob_versicolor: num 0.85
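
Nothing in the schema forces the three scores to sum to 1, so it can help to check them before treating them like classifier probabilities. The following is a minimal sketch in base R; `normalize_probs()` is a hypothetical helper, not part of ellmer, and the second set of scores is made up for illustration.

```r
# Hypothetical helper: verify that per-class scores form a valid
# probability distribution and rescale them if they do not.
normalize_probs <- function(probs, tol = 1e-6) {
  stopifnot(is.numeric(probs), all(probs >= 0), sum(probs) > 0)
  total <- sum(probs)
  if (abs(total - 1) > tol) {
    warning(sprintf("Scores sum to %.2f; rescaling to 1.", total))
    probs <- probs / total
  }
  probs
}

# Scores copied from the structured output above (already sum to 1)
result <- c(setosa = 0.05, virginica = 0.10, versicolor = 0.85)
checked <- normalize_probs(result)

# Made-up scores that overshoot 1, rescaled before use
rescaled <- suppressWarnings(
  normalize_probs(c(setosa = 0.10, virginica = 0.20, versicolor = 0.90))
)
top_class <- names(which.max(rescaled))
```

Rescaling keeps the ranking of the classes intact, so the predicted label does not change; only the scores become comparable across responses.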

Evaluate task limitations

Images: Iris setosa, Iris virginica, Iris versicolor

parallel_chat_structured(
  chat = chat$clone(),
  prompts = list(
    content_image_file("images/setosa.jpg"),
    content_image_file("images/virginica.jpg"),
    content_image_file("images/versicolor.jpg")
  ),
  type = type_flower
)
     species prob_setosa prob_virginica prob_versicolor
1 versicolor        0.05           0.15            0.80
2 versicolor        0.10           0.15            0.75
3 versicolor        0.05           0.10            0.85
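
To make the limitation concrete, the predictions can be scored against the known species. This is a small base R sketch; the predictions are copied from the output above, and the ground truth is assumed from the image file names used in the prompts.

```r
# Predicted species copied from the output above
pred <- c("versicolor", "versicolor", "versicolor")

# Ground truth, assumed from the image file names in the prompts
truth <- c("setosa", "virginica", "versicolor")

# Confusion table with all three species as levels, so empty cells show as 0
species_levels <- c("setosa", "virginica", "versicolor")
confusion <- table(
  truth     = factor(truth, levels = species_levels),
  predicted = factor(pred, levels = species_levels)
)
confusion

# Share of images classified correctly
accuracy <- mean(pred == truth)
accuracy
```

Every image lands in the versicolor column, so only one of the three is correct even though each prediction came with a score of 0.75 or higher.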

Adjust your expectations

Images: Iris setosa, rose

type_flower <- type_object(
  genus = type_enum(c("iris", "rose"))
)
parallel_chat_structured(
  chat = chat$clone(),
  prompts = list(
    content_image_file("images/setosa.jpg"),
    content_image_file("images/rose.jpg")
  ),
  type = type_flower
)
  genus
1  iris
2  rose

Prompting philosophy

Less is more

Focus on

  • Clear structure
  • Task limitations

Add context for

  • Edge cases
  • Cases where multiple categories fit

Evidence

“There was no need to provide those carefully crafted 20,000 tokens of context and, worse than that, doing so actually hurt performance.”

Simon Couch (2025)

Disease symptoms

Psoriasis rash

Psoriasis

“The skin on my palms and soles is thickened and has deep cracks. These cracks are painful and bleed easily.”

Disease symptoms

Data source: Kaggle, public domain, n = 1,200

symptoms <- read.csv("cls_health/Symptom2Disease.csv")
type_health <- type_object(
  diagnosis   = type_enum(unique(symptoms$label)), # n = 24
  uncertainty = type_number("Err on the side of caution, and
                             provide a score (0.00-1.00) of
                             uncertainty in your diagnosis.")
)
unique(symptoms$label) |> head(12)
 [1] "Psoriasis"             "Varicose Veins"        "Typhoid"              
 [4] "Chicken pox"           "Impetigo"              "Dengue"               
 [7] "Fungal infection"      "Common Cold"           "Pneumonia"            
[10] "Dimorphic Hemorrhoids" "Arthritis"             "Acne"                 

Disease symptoms

oai_dat <- parallel_chat_structured(
  chat = chat("openai/gpt-5-mini"),
  prompts = as.list(symptoms$text),
  type = type_health,
  include_cost = TRUE
)
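accuracy <- mean(oai_dat$diagnosis == symptoms$label)  # share of exact label matches
confidence <- 1 - mean(oai_dat$uncertainty)            # complement of mean self-reported uncertainty
cost <- sum(oai_dat$cost)                              # total cost across all requests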
  • Accuracy: 62% (vs. 93% in tidymodels) 🫣
  • Confidence: 42%
  • Cost: $0.95
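
One way to judge whether the uncertainty score is informative is to compare accuracy across uncertainty bins. The sketch below uses base R and a made-up data frame standing in for the real results; the column names mirror `oai_dat` and `symptoms$label`, and the bin boundaries are an arbitrary assumption.

```r
# Made-up results for illustration only; in practice these columns would
# come from oai_dat (diagnosis, uncertainty) and symptoms$label (truth)
res <- data.frame(
  truth       = c("Acne", "Dengue", "Typhoid", "Acne", "Pneumonia",   "Dengue"),
  diagnosis   = c("Acne", "Dengue", "Dengue",  "Acne", "Common Cold", "Typhoid"),
  uncertainty = c(0.10,   0.20,     0.60,      0.15,   0.70,          0.55)
)
res$correct <- res$diagnosis == res$truth

# Bin the uncertainty scores (boundaries are an assumption)
res$bin <- cut(
  res$uncertainty,
  breaks = c(0, 0.33, 0.66, 1),
  labels = c("low", "medium", "high"),
  include.lowest = TRUE
)

# Accuracy within each uncertainty bin
by_bin <- aggregate(correct ~ bin, data = res, FUN = mean)
by_bin
```

If accuracy falls as uncertainty rises, the score is a usable signal for deciding which responses to send to human review.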

Crimes

Image: Pennsylvania State Police (Source: PA.gov)

Police reports

Descriptions of crimes

“Criminal Conspiracy Engaging Harassment - Comm. Lewd, Threatening, Etc. Language”

Crimes

Schema source: Uniform Crime Classification Standard

type_crime <- type_object(
  crime_type = type_enum(
    c("violent", "property", "drug", "dui offense", 
      "public order", "criminal traffic", "not known/missing"),
    "If violent and another type clearly applies, choose violent,
     but only if intent to harm or injure is clearly present.
     Threats, harassment, stalking, and similar are all violent."
  ),
  uncertainty = type_number(
    "Your uncertainty in the classification responses and scores,
     higher scores reflect unclear or difficult to classify descriptions,
     ranging from 0.0 to 1.0."
  )
)

Crimes

Data source: Pennsylvania Courts, n = 1,537

crimes <- read.csv("cls_offense/crimes.csv")
oai_dat <- parallel_chat_structured(
  chat = chat("openai/gpt-5-mini"),
  prompts = as.list(crimes$description),
  type = type_crime,
  include_cost = TRUE
)
  • Agreement with ML tool: 81% 🎯
  • Certainty: 81%
  • Cost: $0.89

Crimes

ML tool misclassifies animal cruelty as non-violent
Description                                           ML Tool       ML Cert.  LLM      LLM Cert.
Aggravated Cruelty to Animals - Causing SBI or Death  public order  1         violent  0.9
Cruelty to Animals                                    public order  1         violent  0.8

High certainty ≠ correct classification

Crimes

Description                                ML Tool  ML Cert.  LLM                LLM Cert.
Accidents Involving Death Or Personal In…  violent  1.00      not known/missing  0.10
BAC .02 or Higher - 4th Off or Sub Off     drug     0.43      dui offense        0.95
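
Disagreements like these can be pulled out and ordered for review. The sketch below is base R and uses only the four rows shown in the tables above; the column names and the ordering rule (lowest of the two certainty scores, highest first) are assumptions for illustration.

```r
# Rows copied from the comparison tables above
cmp <- data.frame(
  description = c(
    "Aggravated Cruelty to Animals - Causing SBI or Death",
    "Cruelty to Animals",
    "Accidents Involving Death Or Personal In…",
    "BAC .02 or Higher - 4th Off or Sub Off"
  ),
  ml_label  = c("public order", "public order", "violent", "drug"),
  ml_cert   = c(1, 1, 1.00, 0.43),
  llm_label = c("violent", "violent", "not known/missing", "dui offense"),
  llm_cert  = c(0.9, 0.8, 0.10, 0.95)
)

# Keep only the rows where the two classifiers disagree
disagree <- cmp[cmp$ml_label != cmp$llm_label, ]

# Review first the cases where both tools are confident yet disagree
disagree$min_cert <- pmin(disagree$ml_cert, disagree$llm_cert)
review_queue <- disagree[order(-disagree$min_cert), ]
review_queue[, c("description", "ml_label", "llm_label", "min_cert")]
```

Sorting by the lower of the two certainty scores puts the animal cruelty rows at the top, where both tools are confident and at least one of them must be wrong.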

Advanced workflows

Development

🤖 LLM: Classify

👥 Human: Review

ML: Train

Production

ML: Classify

🤖 LLM: Classify

👥 Human: Compare

ML: Train
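
The development loop above can be sketched as a simple routing step: confident LLM labels go straight into the training set, and the rest go to a human. This is a base R sketch with made-up descriptions and scores; the 0.30 threshold is an assumption that would need tuning against human-reviewed samples.

```r
# Made-up LLM output for illustration only
llm <- data.frame(
  description = c("Retail Theft", "Simple Assault",
                  "Disorderly Conduct", "Unknown Statute"),
  crime_type  = c("property", "violent",
                  "public order", "not known/missing"),
  uncertainty = c(0.05, 0.10, 0.40, 0.90)
)

# Assumed cut-off above which a human reviews the label
threshold <- 0.30
needs_review <- llm$uncertainty > threshold

# Confident labels become candidate training data for the ML tool
training_set <- llm[!needs_review, c("description", "crime_type")]

# Everything else is queued for human review, most uncertain first
review_set <- llm[needs_review, ]
review_set <- review_set[order(-review_set$uncertainty), ]
```

Labels corrected during review can then be appended to `training_set` before the ML model is retrained.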

What’s the catch?

  • Frontier models are a black box
    • Billions of parameters obscure behaviors
    • Unknown/proprietary training datasets
    • Unknown/secretive training and fine-tuning methods
  • Local, open-source models often perform poorly
  • LLMs are non-deterministic (not always consistent)

LLM reliability

First Run

chat <- chat("openai/gpt-5-mini")
chat$chat("Tell a joke")
Why don't scientists trust atoms? Because they make up everything.

Want another one?

Second Run

chat <- chat("openai/gpt-5-mini")
chat$chat("Tell a joke")
Why did the scarecrow win an award? Because he was outstanding in his field.

Want another?

Same input → Different outputs

Always validate and look at the responses
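
For classification, one way to quantify this inconsistency is to classify the same texts several times and measure how often the runs agree. The sketch below is base R with made-up labels standing in for three repeated runs; no model is called.

```r
# Made-up labels from three hypothetical runs over the same four texts
runs <- data.frame(
  run1 = c("violent", "property",     "drug",        "violent"),
  run2 = c("violent", "property",     "dui offense", "violent"),
  run3 = c("violent", "public order", "dui offense", "violent")
)

# Majority label for each text (row)
majority <- apply(runs, 1, function(x) names(which.max(table(x))))

# Share of runs that match the majority label for each text
agreement <- rowMeans(as.matrix(runs) == majority)

data.frame(majority, agreement)
```

Texts with agreement below 1 are the ones where the model's answer depends on the run, which makes them natural candidates for human review.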

Takeaways

  • LLMs are accessible tools for unstructured data analysis
  • Use minimal prompting that focuses on data structure and task boundaries or limitations
  • Use uncertainty scores and human review to evaluate LLM output
  • Use LLMs to develop and improve existing ML tools

Thank you!

Dylan Pieper

University of Pittsburgh

📧 djp119@pitt.edu
🦋 @dylanpieper.bsky.social
💻 github.com/dylanpieper

Questions?

Let’s discuss LLMs!