Compare Text Similarity Across Lists
same_text.Rd
Computes similarity scores between two or more lists of character strings. Implements multiple string distance algorithms including edit-based methods (OSA, Levenshtein, Damerau-Levenshtein, Hamming), sequence-based approaches (LCS), q-gram techniques (qgram, cosine, jaccard), phonetic matching (soundex), and hybrid methods (Jaro-Winkler).
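To give a feel for the scores involved, the following minimal sketch calls `stringsim()` from the stringdist package directly on the two word pairs used in the examples further down. It assumes stringdist is installed. The variable names are illustrative, and it is not a reproduction of how same_text works internally.

```r
library(stringdist)

# Paired strings, taken from the examples on this page
a <- c("hello", "world")
b <- c("helo", "word")

# OSA similarity: 1 - (edit distance / length of the longer string)
osa_sim <- stringsim(a, b, method = "osa")

# Jaro-Winkler similarity with Winkler prefix scaling p = 0.1
jw_sim <- stringsim(a, b, method = "jw", p = 0.1)

round(mean(osa_sim), 3)  # 0.8
round(mean(jw_sim), 3)   # 0.953
```

Each pair differs by a single deleted character in a five-character word, so the OSA similarity is 1 - 1/5 = 0.8 for both pairs.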
Arguments
- ...
Lists of character strings to compare.
- method
Character vector of similarity methods from stringdist (default: c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")).
- q
Size of q-gram for "qgram", "cosine", and "jaccard" methods (default: 1).
- p
Winkler scaling factor for "jw" method (default: 0.1).
- bt
Winkler boost threshold for "jw" method (default: 0).
- weight
Vector of weights for operations: deletion (d), insertion (i), substitution (s), transposition (t) for "osa", "lv", and "dl" methods.
- digits
Number of decimal places to which results are rounded (default: 3).
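The effect of the tuning arguments can be explored with stringdist itself. The sketch below is an illustration under the assumption that q and weight are handed through to stringdist unchanged, which this page does not state explicitly. The variable names are hypothetical.

```r
library(stringdist)

# q controls the q-gram size: with q = 1 the two words share the same
# set of characters, with q = 2 "hello" has the extra bigram "ll"
jac_q1 <- stringsim("hello", "helo", method = "jaccard", q = 1)
jac_q2 <- stringsim("hello", "helo", method = "jaccard", q = 2)

jac_q1  # 1
jac_q2  # 0.75

# weight sets the cost of deletion, insertion, substitution and
# transposition (in that order); here substitutions cost half as much
d_default  <- stringdist("abc", "abd", method = "osa")
d_weighted <- stringdist("abc", "abd", method = "osa",
                         weight = c(d = 1, i = 1, s = 0.5, t = 1))
```

Lowering a weight can only keep a distance the same or reduce it, so `d_weighted` is never larger than `d_default`.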
Value
An S3 object of class "similar_text" containing:
scores: Numeric similarity scores by method and comparison
summary: Summary statistics by method and comparison
methods: Methods used for comparison
list_names: Names of compared lists
Examples
list1 <- list("hello", "world")
list2 <- list("helo", "word")
# Using unnamed lists
result1 <- same_text(list1, list2)
#> ✔ Computed osa scores for "list1_list2" [mean: 0.8]
#> ✔ Computed lv scores for "list1_list2" [mean: 0.8]
#> ✔ Computed dl scores for "list1_list2" [mean: 0.8]
#> ✔ Computed hamming scores for "list1_list2" [mean: 0]
#> ✔ Computed lcs scores for "list1_list2" [mean: 0.8]
#> ✔ Computed qgram scores for "list1_list2" [mean: 0.889]
#> ✔ Computed cosine scores for "list1_list2" [mean: 0.92]
#> ✔ Computed jaccard scores for "list1_list2" [mean: 0.9]
#> ✔ Computed jw scores for "list1_list2" [mean: 0.953]
#> ✔ Computed soundex scores for "list1_list2" [mean: 0.5]
# Using named lists for more control
result2 <- same_text("l1" = list1, "l2" = list2)
#> ✔ Computed osa scores for "l1_l2" [mean: 0.8]
#> ✔ Computed lv scores for "l1_l2" [mean: 0.8]
#> ✔ Computed dl scores for "l1_l2" [mean: 0.8]
#> ✔ Computed hamming scores for "l1_l2" [mean: 0]
#> ✔ Computed lcs scores for "l1_l2" [mean: 0.8]
#> ✔ Computed qgram scores for "l1_l2" [mean: 0.889]
#> ✔ Computed cosine scores for "l1_l2" [mean: 0.92]
#> ✔ Computed jaccard scores for "l1_l2" [mean: 0.9]
#> ✔ Computed jw scores for "l1_l2" [mean: 0.953]
#> ✔ Computed soundex scores for "l1_l2" [mean: 0.5]