Compare Similarity of Texts, Factors, or Numbers • samesies

Compare lists of texts, factors, or numerical values to measure their similarity. Use cases include comparing responses from multiple raters, evaluating model outputs, and assessing data consistency.

Installation

From CRAN:

# install.packages("pak")
pak::pak("samesies")

Development version:

pak::pak("dylanpieper/samesies")

Basic Usage

samesies provides three main functions for measuring similarity:

same_text()

Compare similarity between multiple lists of character strings:

library(samesies)

r1 <- list("R is a statistical computing software", 
           "R enables grammar of graphics using ggplot2", 
           "R supports advanced statistical models")
r2 <- list("R is a full-stack programming language",
           "R enables advanced data visualizations", 
           "R supports machine learning algorithms")

tex <- same_text(r1, r2)
#> ✔ Computed osa scores for "r1_r2" [mean: 0.43]
#> ✔ Computed lv scores for "r1_r2" [mean: 0.43]
#> ✔ Computed dl scores for "r1_r2" [mean: 0.43]
#> ✔ Computed hamming scores for "r1_r2" [mean: 0.123]
#> ✔ Computed lcs scores for "r1_r2" [mean: 0.061]
#> ✔ Computed qgram scores for "r1_r2" [mean: 0.682]
#> ✔ Computed cosine scores for "r1_r2" [mean: 0.771]
#> ✔ Computed jaccard scores for "r1_r2" [mean: 0.735]
#> ✔ Computed jw scores for "r1_r2" [mean: 0.818]
#> ✔ Computed soundex scores for "r1_r2" [mean: 0.667]

Methods available via stringdist (e.g., method = "osa"):

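Edit Distance Methods
- osa (optimal string alignment), lv (Levenshtein), dl (Damerau-Levenshtein)
Token-Based Similarity
- hamming, lcs (longest common substring), qgram, cosine, jaccard
Phonetic Methods
- jw (Jaro-Winkler), soundex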
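To give a feel for what a normalized edit-distance score looks like, here is a minimal sketch in base R. It does not call samesies or stringdist: `lv_similarity()` is a hypothetical helper, and the normalization (one minus the Levenshtein distance divided by the longer string's length) is an assumption about how an `lv`-style similarity is commonly scaled to the 0-1 range.

```r
# Hypothetical helper, not part of samesies: normalized Levenshtein similarity.
# Uses utils::adist(), which computes generalized Levenshtein distances.
lv_similarity <- function(a, b) {
  a <- unlist(a)
  b <- unlist(b)
  dist <- diag(adist(a, b))            # keep only the element-wise pairs
  1 - dist / pmax(nchar(a), nchar(b))  # scale by the longer string
}

sim <- lv_similarity(list("kitten", "flaw"), list("sitting", "lawn"))
round(sim, 3)
#> [1] 0.571 0.500
```

A score of 1 means the two strings are identical, and lower scores mean more edits are needed to turn one into the other.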

same_factor()

Compare similarity between multiple lists of categorical data:

f1 <- list("R", "R", "Python")
f2 <- list("R", "Python", "R")

fct <- same_factor(f1, f2)
#> ℹ Skipping 'order' method as factor levels are not ordered
#> ✔ Computed exact scores for "f1_f2" [mean: 0.333]

Compare similarity based on ordered factors:

of1 <- list("High School", "Bachelor's", "Master's", "PhD")
of2 <- list("Bachelor's", "High School", "PhD", "Master's")

edu_comparison <- same_factor(of1, of2, 
                              levels = c("High School", "Bachelor's", "Master's", "PhD"))
#> ✔ Computed exact scores for "of1_of2" [mean: 0]
#> ✔ Computed order scores for "of1_of2" [mean: 0.667]

fct_ordered <- average_similarity(edu_comparison)

Methods available (e.g., method = "exact"):

exact: Exact matching
order: Distances across ordered factor levels
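The two factor methods can be sketched in a few lines of base R. `exact_score()` and `order_score()` are hypothetical helpers, not samesies functions, and the order formula (one minus the rank difference divided by the number of levels minus one) is an assumption, chosen because it reproduces the means printed in the ordered example above.

```r
# Hypothetical helpers, not part of samesies.
exact_score <- function(a, b) {
  as.numeric(unlist(a) == unlist(b))   # 1 for a match, 0 otherwise
}

order_score <- function(a, b, levels) {
  ra <- match(unlist(a), levels)       # position of each value in the level order
  rb <- match(unlist(b), levels)
  1 - abs(ra - rb) / (length(levels) - 1)
}

lv  <- c("High School", "Bachelor's", "Master's", "PhD")
of1 <- list("High School", "Bachelor's", "Master's", "PhD")
of2 <- list("Bachelor's", "High School", "PhD", "Master's")

mean(exact_score(of1, of2))
#> [1] 0
round(mean(order_score(of1, of2, lv)), 3)
#> [1] 0.667
```

No pair matches exactly, but every pair is only one level apart out of a maximum of three, so each order score is 1 - 1/3.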

same_number()

Compare similarity between multiple lists of numeric values:

n1 <- list(1, 2, 3)
n2 <- list(1, 2.1, 3.2)

num <- same_number(n1, n2)
#> ✔ Computed exact scores for "n1_n2" [mean: 0.333]
#> ✔ Computed raw scores for "n1_n2" [mean: 0.1]
#> ✔ Computed exp scores for "n1_n2" [mean: 0.908]
#> ✔ Computed percent scores for "n1_n2" [mean: 0.963]
#> ✔ Computed normalized scores for "n1_n2" [mean: 0.955]
#> ✔ Computed fuzzy scores for "n1_n2" [mean: 0.977]

Methods available (e.g., method = "exact"):

exact: Exact matching
raw: Absolute difference
exp: Exponential decay on the absolute difference
percent: Percentage difference
normalized: Normalized difference
fuzzy: Threshold-based matching that adapts to the scale of the data:
- Uses two tolerance thresholds: absolute (calculated from the data) and relative (default 2%)
- The absolute threshold is the average of three components: variability (10% of the standard deviation), magnitude (0.5% of the mean absolute value), and range (1% of the value range)
- The effective threshold is the larger of the absolute and relative thresholds
- A pair scores a perfect match (1.0) when its difference is ≤ the effective threshold
- Scores decrease smoothly as the difference exceeds the threshold
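The numeric methods can be approximated in base R as a rough sketch. The formulas below are assumptions, not taken from the samesies source. The first five were chosen because they reproduce the means printed for n1 and n2 above (the fourth is labelled percent in that output). For the fuzzy threshold, pooling the statistics over both inputs and scaling the relative tolerance by the larger value of each pair are guesses. `fuzzy_threshold()` is a hypothetical helper, and the smooth decay beyond the threshold is not modelled.

```r
a <- c(1, 2, 3)
b <- c(1, 2.1, 3.2)
d <- abs(a - b)                                 # absolute differences

scores <- list(
  exact      = as.numeric(a == b),
  raw        = d,
  exp        = exp(-d),                         # exponential decay on |a - b|
  percent    = 1 - d / pmax(abs(a), abs(b)),    # relative to the larger value
  normalized = 1 - d / diff(range(c(a, b)))     # relative to the pooled range
)
means <- round(sapply(scores, mean), 3)
# exact 0.333, raw 0.1, exp 0.908, percent 0.963, normalized 0.955

# Hypothetical helper: effective tolerance for the fuzzy method.
fuzzy_threshold <- function(a, b, rel_tol = 0.02) {
  x <- c(a, b)
  abs_tol <- mean(c(0.10 * sd(x),               # variability
                    0.005 * mean(abs(x)),       # magnitude
                    0.01 * diff(range(x))))     # range
  pmax(abs_tol, rel_tol * pmax(abs(a), abs(b))) # larger of the two thresholds
}

within <- d <= fuzzy_threshold(a, b)
within
#> [1]  TRUE FALSE FALSE
```

Under this reading, only the first pair falls inside the tolerance and would score exactly 1.0, and the other two would score slightly below 1.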

List Support

Named Lists

For more control over results, pass the lists as named arguments to set custom labels for comparisons:

result_number <- same_number("baseline" = n1, "treatment" = n2, method = "fuzzy")
#> ✔ Computed fuzzy scores for "baseline_treatment" [mean: 0.978]

More Lists

When you input more than two lists, samesies computes pairwise comparisons across all lists:

r1 <- list("Statistical computing", "Data visualization", "Machine learning")
r2 <- list("Statistical software", "Data plotting", "ML algorithms") 
r3 <- list("Statistical analysis", "Data graphics", "AI models")

multi_text <- same_text(r1, r2, r3, method = "cosine")
#> ✔ Computed cosine scores for "r1_r2" [mean: 0.717]
#> ✔ Computed cosine scores for "r1_r3" [mean: 0.607]
#> ✔ Computed cosine scores for "r2_r3" [mean: 0.68]
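The pair labels in this output follow the usual "all unordered pairs" pattern. As an illustration only (this is not the package's internal code), base R's `combn()` enumerates the same pairs from the list names:

```r
list_names <- c("r1", "r2", "r3")

# Every unordered pair of names, joined with "_" to form a label
pairs <- combn(list_names, 2, FUN = paste, collapse = "_")
pairs
#> [1] "r1_r2" "r1_r3" "r2_r3"

# The number of comparisons for k lists is choose(k, 2)
choose(length(list_names), 2)
#> [1] 3
```

The number of comparisons grows quadratically: four lists produce 6 pairs and five lists produce 10.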

Nested Lists

Compare nested lists with identical structure and names:

nest1 <- list(
  group_a = list("Good", "Very Good", "Fair"),
  group_b = list("Excellent", "Good", "Poor")
)

nest2 <- list(
  group_a = list("Very Good", "Good", "Fair"), 
  group_b = list("Good", "Excellent", "Fair")  
)

nested_result <- same_text(nest1, nest2, method = "jw")
#> ✔ Computed jw scores for "nest1_nest2" [mean: 0.25]
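Because nested inputs need identical structure and names, it can help to check this before calling a comparison function. The sketch below uses a hypothetical helper, `same_structure()`, which is not part of samesies, and the package's own validation may differ.

```r
# Hypothetical helper, not part of samesies: TRUE when two nested lists
# have the same group names, in the same order, with the same group sizes.
same_structure <- function(x, y) {
  identical(names(x), names(y)) &&
    identical(lengths(x), lengths(y))
}

nest1 <- list(group_a = list("Good", "Very Good", "Fair"),
              group_b = list("Excellent", "Good", "Poor"))
nest2 <- list(group_a = list("Very Good", "Good", "Fair"),
              group_b = list("Good", "Excellent", "Fair"))

same_structure(nest1, nest2)
#> [1] TRUE
same_structure(nest1, nest2["group_a"])   # second list is missing group_b
#> [1] FALSE
```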

Methods

All three functions return similar objects that support the following methods:

result <- same_text(r1, r2, method = "cosine")

# Print detailed results
print(result)

# Summarize results  
summary(result)
#>  method  pair avg_score
#>  cosine r1_r2     0.771

# Get average similarity scores
average_similarity(result)
#> cosine 
#>  0.771

# Get pairwise averages
pair_averages(result)  
#>   method  pair avg_score
#> 1 cosine r1_r2     0.771

Accessing Object Data

The package uses S3 objects, allowing access to the underlying data using $:

result <- same_text(r1, r2, method = "cosine")

# Access similarity scores
result$scores
#> $cosine
#> $cosine$r1_r2
#>       R is a statistical computing software 
#>                                   0.7917448 
#> R enables grammar of graphics using ggplot2 
#>                                   0.7027498 
#>      R supports advanced statistical models 
#>                                   0.8197365

# Access methods used  
result$methods
#> [1] "cosine"

# Access list names
result$list_names  
#> [1] "r1" "r2"

Available components:

scores: A list of similarity scores for each method and comparison pair
summary: A list of statistical summaries for each method and comparison pair
methods: The similarity methods used in the analysis
list_names: Names of the input lists
raw_values: The original input values
digits: Number of decimal places for rounding results in output
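Since scores is a plain nested list (method, then pair, then one score per item), it can be summarized with ordinary base R. The sketch below uses a hand-built stand-in for `result$scores`, with values copied from the printed output above, so it runs without the package. With a real result object, the same `sapply()` call on `result$scores$cosine` should behave the same way, assuming the structure shown.

```r
# Stand-in for result$scores, copied from the printed cosine scores above
scores <- list(
  cosine = list(
    r1_r2 = c(0.7917448, 0.7027498, 0.8197365)
  )
)

# Mean score per comparison pair for one method
pair_means <- sapply(scores$cosine, mean)
round(pair_means, 3)
#> r1_r2
#> 0.771
```

The result matches the average reported by `average_similarity()` and `pair_averages()` for the same comparison.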

Credits

The Spider-Man image in the hex logo is fan art created by the Reddit user WistlerR15.