Compare Text Similarity Across Lists — same

Computes similarity scores between two or more lists of character strings. Implements multiple string distance algorithms including edit-based methods (OSA, Levenshtein, Damerau-Levenshtein, Hamming), sequence-based approaches (LCS), q-gram techniques (qgram, cosine, jaccard), phonetic matching (soundex), and hybrid methods (Jaro-Winkler).

Usage

same_text(
  ...,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  q = 1,
  p = NULL,
  bt = 0,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  digits = 3
)

Arguments

...: Two or more lists of character strings to compare. Arguments may be named to label the comparisons in the output.
method: Character vector of similarity methods from stringdist (default: c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")).
q: Size of q-gram for "qgram", "cosine", and "jaccard" methods (default: 1).
p: Winkler prefix scaling factor for the "jw" method (default: NULL, which is treated as 0.1).
bt: Winkler boost threshold for the "jw" method (default: 0).
weight: Named vector of weights for the edit operations deletion (d), insertion (i), substitution (s), and transposition (t), used by the "osa", "lv", and "dl" methods (default: c(d = 1, i = 1, s = 1, t = 1)).
digits: Number of digits to round results (default: 3).
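
To illustrate what the q argument controls, here is a minimal base R sketch of a q-gram Jaccard similarity. It is not the package's implementation: the helper names qgrams and jaccard_sim are hypothetical, and the sketch assumes the score is the number of shared unique q-grams divided by the number of distinct q-grams across both strings.

```r
# Hypothetical helper: split a string into its overlapping q-grams.
qgrams <- function(x, q = 1) {
  n <- nchar(x)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(x, starts, starts + q - 1)
}

# Hypothetical helper: Jaccard similarity over the sets of unique q-grams.
# Assumes at least one string yields a q-gram; otherwise the result is NaN.
jaccard_sim <- function(a, b, q = 1) {
  qa <- unique(qgrams(a, q))
  qb <- unique(qgrams(b, q))
  length(intersect(qa, qb)) / length(union(qa, qb))
}

jaccard_sim("world", "word", q = 1)  # 4 of 5 distinct characters are shared
jaccard_sim("world", "word", q = 2)  # 2 of 5 distinct bigrams are shared
```

Under these assumptions, a larger q makes the comparison stricter, because a single changed character affects every q-gram that overlaps it.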

Value

An S3 object of class "similar_text" containing:

scores: Numeric similarity scores by method and comparison
summary: Summary statistics by method and comparison
methods: Methods used for comparison
list_names: Names of compared lists

Examples

list1 <- list("hello", "world")
list2 <- list("helo", "word")

# Using unnamed lists
result1 <- same_text(list1, list2)
#> ✔ Computed osa scores for "list1_list2" [mean: 0.8]
#> ✔ Computed lv scores for "list1_list2" [mean: 0.8]
#> ✔ Computed dl scores for "list1_list2" [mean: 0.8]
#> ✔ Computed hamming scores for "list1_list2" [mean: 0]
#> ✔ Computed lcs scores for "list1_list2" [mean: 0.8]
#> ✔ Computed qgram scores for "list1_list2" [mean: 0.889]
#> ✔ Computed cosine scores for "list1_list2" [mean: 0.92]
#> ✔ Computed jaccard scores for "list1_list2" [mean: 0.9]
#> ✔ Computed jw scores for "list1_list2" [mean: 0.953]
#> ✔ Computed soundex scores for "list1_list2" [mean: 0.5]

# Using named lists for more control
result2 <- same_text("l1" = list1, "l2" = list2)
#> ✔ Computed osa scores for "l1_l2" [mean: 0.8]
#> ✔ Computed lv scores for "l1_l2" [mean: 0.8]
#> ✔ Computed dl scores for "l1_l2" [mean: 0.8]
#> ✔ Computed hamming scores for "l1_l2" [mean: 0]
#> ✔ Computed lcs scores for "l1_l2" [mean: 0.8]
#> ✔ Computed qgram scores for "l1_l2" [mean: 0.889]
#> ✔ Computed cosine scores for "l1_l2" [mean: 0.92]
#> ✔ Computed jaccard scores for "l1_l2" [mean: 0.9]
#> ✔ Computed jw scores for "l1_l2" [mean: 0.953]
#> ✔ Computed soundex scores for "l1_l2" [mean: 0.5]
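
The "lv" mean reported above can be checked by hand with base R alone. The sketch below assumes each score is the Levenshtein distance scaled by the length of the longer string, that is 1 - d / max(nchar(a), nchar(b)). The helper lv_sim is hypothetical and is not part of the package.

```r
# Hypothetical helper: Levenshtein similarity scaled to [0, 1].
# Assumes a and b are character vectors of equal length with no empty pairs.
lv_sim <- function(a, b) {
  # adist() returns the full distance matrix; its diagonal holds the
  # distances between strings at matching positions.
  d <- diag(utils::adist(a, b))
  1 - d / pmax(nchar(a), nchar(b))
}

list1 <- list("hello", "world")
list2 <- list("helo", "word")

scores <- lv_sim(unlist(list1), unlist(list2))
round(mean(scores), 3)  # 0.8, matching the "lv" mean above
```

Each pair differs by one deleted character out of five, so both scores are 1 - 1/5 = 0.8.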