From messy to meaningful data

LLM-powered classification in R 🪄

Dylan Pieper

University of Pittsburgh

Magic does exist

LLM Workflow

LLMs make unstructured data analysis accessible

One model to process thousands of texts, documents, and images

What we’ll cover today

Use `library(ellmer)` for robust classification, focusing on:

Structure
Prompting
Accuracy
Confidence
Validation

What is extraction?

Describe the distinct features of an object

Setosa

Iris setosa flower Wikipedia, public domain

Smallest
Short petals
Wide sepals

Virginica

Iris virginica flower Wikipedia by Eric Hunt, CC-BY-SA 4.0

Largest
Long petals
Narrow structure

Versicolor

Iris versicolor flower Wikipedia by D. Gordon E. Robertson, CC-BY-SA 3.0

Medium sized
Most variable
Balanced proportions

What is classification?

Predict the distinct category an object belongs to

Setosa

Iris setosa flower Wikipedia, public domain

.90 setosa
.05 virginica
.05 versicolor

Virginica

Iris virginica flower Wikipedia by Eric Hunt, CC-BY-SA 4.0

.05 setosa
.90 virginica
.05 versicolor

Versicolor

Iris versicolor flower Wikipedia by D. Gordon E. Robertson, CC-BY-SA 3.0

.05 setosa
.05 virginica
.90 versicolor

Specify type structure

Image: Iris versicolor

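library(ellmer)

# Create a chat object from a "provider/model" string
chat <- chat("anthropic/claude-sonnet-4-20250514")

# Describe the fields the model must return
type_flower <- type_object(
  genus    = type_string(),
  species  = type_string(),
  features = type_string("Focus on morphology")
)

# Extract structured data from the image
str(chat$chat_structured(
  content_image_file("images/versicolor.jpg"),
  type = type_flower
))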

List of 3
 $ genus   : chr "Iris"
 $ species : chr "versicolor"
 $ features: chr "Purple-blue flowers with three large falls (drooping petals) and three smaller standards (upright petals), yell"| __truncated__

Mimic traditional ML output

Image: Iris versicolor

prob <- "Probability (0.00-1.00) Iris is {{species}}"
type_flower <- type_object(
  species         = type_enum(c("setosa", "virginica", "versicolor")),
  prob_setosa     = type_number(interpolate(prob, species = "setosa")),
  prob_virginica  = type_number(interpolate(prob, species = "virginica")),
  prob_versicolor = type_number(interpolate(prob, species = "versicolor"))
)
str(chat$clone()$chat_structured(
  content_image_file("images/versicolor.jpg"), type = type_flower
))

List of 4
 $ species        : chr "versicolor"
 $ prob_setosa    : num 0.05
 $ prob_virginica : num 0.1
 $ prob_versicolor: num 0.85
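
Self-reported probabilities are not guaranteed to stay in range, sum to 1, or agree with the chosen species, so a quick sanity check is worthwhile. The following is a minimal sketch in base R: `res` copies the output shown above, and `check_probs()` and its tolerance `tol` are hypothetical names introduced for illustration, not part of ellmer.

```r
# Result copied from the structured output above
res <- list(
  species         = "versicolor",
  prob_setosa     = 0.05,
  prob_virginica  = 0.10,
  prob_versicolor = 0.85
)

# Hypothetical helper: validate self-reported class probabilities
check_probs <- function(x, tol = 0.01) {
  # Collect every field whose name starts with "prob_"
  probs <- unlist(x[grepl("^prob_", names(x))])
  # Species with the highest reported probability
  top <- sub("^prob_", "", names(probs)[which.max(probs)])
  list(
    in_range  = all(probs >= 0 & probs <= 1),
    sums_to_1 = abs(sum(probs) - 1) <= tol,
    top_match = identical(top, x$species)
  )
}

checks <- check_probs(res)
str(checks)
```

Any `FALSE` in `checks` suggests the row deserves a closer look before it is used downstream.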

Evaluate task limitations

Images: Iris setosa, Iris virginica, Iris versicolor

parallel_chat_structured(
  chat = chat$clone(),
  prompts = list(
    content_image_file("images/setosa.jpg"),
    content_image_file("images/virginica.jpg"),
    content_image_file("images/versicolor.jpg")
  ),
  type = type_flower
)

     species prob_setosa prob_virginica prob_versicolor
1 versicolor        0.05           0.15            0.80
2 versicolor        0.10           0.15            0.75
3 versicolor        0.05           0.10            0.85
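
The limitation becomes obvious once predictions are compared with the truth. The sketch below uses base R only; the predictions are copied from the output above, and the true species are assumed from the image file names.

```r
# Predictions copied from the output above; truth follows the image file names
pred  <- c("versicolor", "versicolor", "versicolor")
truth <- c("setosa", "virginica", "versicolor")
species_levels <- c("setosa", "virginica", "versicolor")

# Confusion matrix: rows are true species, columns are predictions
conf_mat <- table(
  truth     = factor(truth, levels = species_levels),
  predicted = factor(pred,  levels = species_levels)
)
print(conf_mat)

# Overall accuracy: share of images on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Only one of the three images is classified correctly, despite reported probabilities of 0.75 or higher for every prediction.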

Adjust your expectations

Images: Iris setosa, rose

type_flower <- type_object(
  genus = type_enum(c("iris", "rose"))
)
parallel_chat_structured(
  chat = chat$clone(),
  prompts = list(
    content_image_file("images/setosa.jpg"),
    content_image_file("images/rose.jpg")
  ),
  type = type_flower
)

  genus
1  iris
2  rose

Prompting philosophy

Less is more

✅ Focus on

Clear structure
Task limitations

✅ Add context for

Edge cases
What if multiple categories fit?

Evidence

“There was no need to provide those carefully crafted 20,000 tokens of context and, worse than that, doing so actually hurt performance.”

Simon Couch (2025)

Disease symptoms

Psoriasis rash

Psoriasis

“The skin on my palms and soles is thickened and has deep cracks. These cracks are painful and bleed easily.”

Disease symptoms

Data source: Kaggle, public domain, n = 1,200

symptoms <- read.csv("cls_health/Symptom2Disease.csv")
type_health <- type_object(
  diagnosis   = type_enum(unique(symptoms$label)), # n = 24
  uncertainty = type_number("Err on the side of caution, and
                             provide a score (0.00-1.00) of
                             uncertainty in your diagnosis.")
)
unique(symptoms$label) |> head(12)

 [1] "Psoriasis"             "Varicose Veins"        "Typhoid"              
 [4] "Chicken pox"           "Impetigo"              "Dengue"               
 [7] "Fungal infection"      "Common Cold"           "Pneumonia"            
[10] "Dimorphic Hemorrhoids" "Arthritis"             "Acne"

Disease symptoms

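# Classify every symptom description in parallel
oai_dat <- parallel_chat_structured(
  chat = chat("openai/gpt-5-mini"),
  prompts = as.list(symptoms$text),
  type = type_health,
  include_cost = TRUE
)

# Share of diagnoses that match the dataset label
accuracy <- mean(oai_dat$diagnosis == symptoms$label)
# Mean self-reported confidence (1 - uncertainty)
confidence <- 1 - mean(oai_dat$uncertainty)
# Total API cost
cost <- sum(oai_dat$cost)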

Accuracy: 62% (vs. 93% in tidymodels) 🫣
Confidence: 42%
Cost: $0.95
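
Overall accuracy and mean confidence are single numbers; it can also help to ask whether the uncertainty score tracks the errors. The sketch below is base R on made-up toy data, not the Kaggle results: the data frame `toy` and the cutoff `threshold` are hypothetical, and the column names simply mirror `diagnosis` and `uncertainty` from the type specification.

```r
# Toy stand-in for the model output and the true labels (values are made up)
toy <- data.frame(
  label       = c("Acne", "Acne", "Dengue", "Typhoid", "Dengue", "Pneumonia"),
  diagnosis   = c("Acne", "Acne", "Typhoid", "Typhoid", "Dengue", "Common Cold"),
  uncertainty = c(0.10, 0.20, 0.70, 0.30, 0.40, 0.80)
)
toy$correct <- toy$diagnosis == toy$label

# Split at a hypothetical threshold and compare accuracy on each side
threshold <- 0.5
toy$band <- ifelse(toy$uncertainty < threshold, "low", "high")
acc_by_band <- tapply(toy$correct, toy$band, mean)
acc_by_band
```

If the uncertainty score is informative, accuracy in the low-uncertainty band should be clearly higher than in the high-uncertainty band; if the two are similar, the score is probably not a useful filter for that task.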

Disease symptoms

Crimes

Pennsylvania State Police Source: PA.gov

Police reports

Descriptions of crimes

“Criminal Conspiracy Engaging Harassment - Comm. Lewd, Threatening, Etc. Language”

Crimes

Schema source: Uniform Crime Classification Standard

type_crime <- type_object(
  crime_type = type_enum(
    c("violent", "property", "drug", "dui offense", 
      "public order", "criminal traffic", "not known/missing"),
    "If violent and another type clearly applies, choose violent,
     but only if intent to harm or injure is clearly present.
     Threats, harassment, stalking, and similar are all violent."
  ),
  uncertainty = type_number(
    "Your uncertainty in the classification responses and scores,
     higher scores reflect unclear or difficult to classify descriptions,
     ranging from 0.0 to 1.0."
  )
)

Crimes

Data source: Pennsylvania Courts, n = 1,537

crimes <- read.csv("cls_offense/crimes.csv")
oai_dat <- parallel_chat_structured(
  chat = chat("openai/gpt-5-mini"),
  prompts = as.list(crimes$description),
  type = type_crime,
  include_cost = TRUE
)

Agreement with ML tool: 81% 🎯
Certainty: 81%
Cost: $0.89

Crimes

ML tool misclassifies animal cruelty as non-violent
Description	ML Tool	ML Cert.	LLM	LLM Cert.
Aggravated Cruelty to Animals - Causing SBI or Death	public order	1	violent	0.9
Cruelty to Animals	public order	1	violent	0.8

High certainty ≠ correct classification

Crimes

Description	ML Tool	ML Cert.	LLM	LLM Cert.
Accidents Involving Death Or Personal In…	violent	1.00	not known/missing	0.10
BAC .02 or Higher - 4th Off or Sub Off	drug	0.43	dui offense	0.95
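
Disagreements like these can be pulled out programmatically and queued for review. A minimal sketch in base R, using only the four rows shown in the tables above; the data frame `cmp`, its column names, and the `min_cert` ordering rule are assumptions made for illustration.

```r
# Rows copied from the two comparison tables above
cmp <- data.frame(
  description = c(
    "Aggravated Cruelty to Animals - Causing SBI or Death",
    "Cruelty to Animals",
    "Accidents Involving Death Or Personal In…",
    "BAC .02 or Higher - 4th Off or Sub Off"
  ),
  ml       = c("public order", "public order", "violent", "drug"),
  ml_cert  = c(1, 1, 1.00, 0.43),
  llm      = c("violent", "violent", "not known/missing", "dui offense"),
  llm_cert = c(0.9, 0.8, 0.10, 0.95)
)

# Flag rows where the two classifiers disagree
cmp$agree <- cmp$ml == cmp$llm

# Lower of the two certainty scores: high values mean both tools are
# confident, so a disagreement there is the most informative to review
cmp$min_cert <- pmin(cmp$ml_cert, cmp$llm_cert)

review <- cmp[!cmp$agree, ]
review <- review[order(-review$min_cert), ]
review[, c("description", "ml", "llm", "min_cert")]
```

Sorting by the lower certainty puts the animal cruelty rows first, where both tools are confident yet disagree.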

Advanced workflows

Development

🤖 LLM: Classify

👥 Human: Review

⚡ ML: Train

Production

⚡ ML: Classify

🤖 LLM: Classify

👥 Human: Compare

⚡ ML: Train
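
One way to picture the development loop is as a routing step between the LLM and the human reviewer. This is a rough sketch in base R under stated assumptions: `llm_out` holds made-up labels, and `triage()` with its `max_uncertainty` cutoff of 0.3 is a hypothetical rule, not a recommendation.

```r
# Hypothetical LLM output: one label and one uncertainty score per text
llm_out <- data.frame(
  id          = 1:5,
  label       = c("violent", "property", "drug", "violent", "public order"),
  uncertainty = c(0.05, 0.60, 0.10, 0.35, 0.20)
)

# Hypothetical triage rule: send uncertain rows to a human reviewer
triage <- function(dat, max_uncertainty = 0.3) {
  dat$route <- ifelse(dat$uncertainty > max_uncertainty,
                      "human review", "accept")
  dat
}

routed <- triage(llm_out)
table(routed$route)

# Accepted rows (plus reviewed rows, once corrected) become ML training data
train_set <- routed[routed$route == "accept", c("id", "label")]
train_set
```

The cutoff trades reviewer time against label quality, so it is worth tuning against a small hand-checked sample.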

What’s the catch?

Frontier models are a black box
- Billions of parameters obscure behaviors
- Unknown/proprietary training datasets
- Unknown/secretive training and fine-tuning methods
Local, open-source models often perform poorly
LLMs are non-deterministic (not always consistent)

LLM reliability

First Run

chat <- chat("openai/gpt-5-mini")
chat$chat("Tell a joke")

Why don't scientists trust atoms? Because they make up everything.

Want another one?

Second Run

chat <- chat("openai/gpt-5-mini")
chat$chat("Tell a joke")

Why did the scarecrow win an award? Because he was outstanding in his field.

Want another?

Same input → Different outputs

Always validate and look at the responses
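
For classification, one way to quantify this is to run the same prompts several times and measure how often the labels agree. A minimal sketch in base R; the data frame `runs` holds made-up labels from three hypothetical repeated runs.

```r
# Hypothetical labels for 4 texts from 3 repeated runs of the same prompt
runs <- data.frame(
  run1 = c("violent", "drug", "property", "violent"),
  run2 = c("violent", "drug", "public order", "violent"),
  run3 = c("violent", "drug", "property", "property")
)

# Most frequent label per text, and the share of runs that agree with it
modal_label <- apply(runs, 1, function(x) names(which.max(table(x))))
agreement   <- apply(runs, 1, function(x) max(table(x)) / length(x))

data.frame(modal_label, agreement)
```

Texts with agreement below 1 are natural candidates for human review.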

Takeaways

LLMs are accessible tools for unstructured data analysis
Use minimal prompting that focuses on data structure and task boundaries or limitations
Use uncertainty scores and human review to evaluate LLM output
Use LLMs to develop and improve existing ML tools

Thank you!

Dylan Pieper

University of Pittsburgh

📧 djp119@pitt.edu
🦋 @dylanpieper.bsky.social
💻 github.com/dylanpieper

Questions?

Let’s discuss LLMs!
