Compare Text Similarity Across Lists
same_text.Rd
Computes similarity scores between two or more lists of character strings. Implements multiple string distance algorithms including edit-based methods (OSA, Levenshtein, Damerau-Levenshtein, Hamming), sequence-based approaches (LCS), q-gram techniques (qgram, cosine, jaccard), phonetic matching (soundex), and hybrid methods (Jaro-Winkler).
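To make the idea of a similarity score concrete, here is a minimal sketch of how an edit distance can be rescaled to the range 0 to 1. It uses base R's utils::adist() rather than stringdist, and the helper name lv_similarity is hypothetical. Dividing by the longer string's length is a common normalization and is assumed here; it is not confirmed to be what same_text() does internally.

```r
# Sketch: turn a Levenshtein edit distance into a similarity score in [0, 1].
# lv_similarity() is a hypothetical helper, not part of this package.
lv_similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]         # Levenshtein distance between a and b
  longest <- max(nchar(a), nchar(b))    # largest distance the pair could have
  if (longest == 0) return(1)           # two empty strings count as identical
  1 - d / longest                       # 1 = identical, 0 = nothing in common
}

lv_similarity("kitten", "sitting")
#> [1] 0.5714286
```

"kitten" and "sitting" are three edits apart and the longer string has seven characters, so the score is 1 - 3/7.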
Arguments
- ...
Lists of character strings to compare.
- method
Character vector of similarity methods from stringdist (default: c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")).
- q
Size of q-gram for "qgram", "cosine", and "jaccard" methods (default: 1).
- p
Winkler scaling factor for "jw" method (default: 0.1).
- bt
Winkler boost threshold for "jw" method (default: 0).
- weight
Vector of weights for operations: deletion (d), insertion (i), substitution (s), transposition (t) for "osa", "lv", and "dl" methods.
- digits
Number of decimal places to round results to (default: 3).
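The effect of q on the q-gram methods can be illustrated with a small base R sketch of Jaccard similarity over q-gram sets. The helpers qgrams and jaccard_similarity are hypothetical names written for this illustration; they follow the usual set-based definition and are not taken from this package or from stringdist.

```r
# Split a string into its set of unique q-grams (substrings of length q).
# qgrams() and jaccard_similarity() are hypothetical illustration helpers.
qgrams <- function(x, q = 1) {
  n <- nchar(x)
  if (n < q) return(character(0))       # string too short to hold a q-gram
  unique(substring(x, 1:(n - q + 1), q:n))
}

# Jaccard similarity: shared q-grams divided by all distinct q-grams.
jaccard_similarity <- function(a, b, q = 1) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  total <- length(union(ga, gb))
  if (total == 0) return(1)             # neither string has any q-grams
  length(intersect(ga, gb)) / total
}

jaccard_similarity("night", "thing", q = 1)
#> [1] 1
jaccard_similarity("night", "thing", q = 2)
#> [1] 0
```

With q = 1 the two anagrams share every character and look identical. With q = 2 character order matters and they share no bigrams, so a larger q makes the q-gram methods more sensitive to ordering.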
Value
An S3 object of class "similar_text" containing:
- scores: Numeric similarity scores by method and comparison
- summary: Summary statistics by method and comparison
- methods: Methods used for comparison
- list_names: Names of compared lists
Examples
r1 <- list("R is a statistical computing software",
           "R enables grammar of graphics using ggplot2")
r2 <- list("R is a full-stack programming language",
           "R enables advanced data visualizations")
result <- same_text(r1, r2)
#> ✔ Computed osa scores for "r1_r2" [mean: 0.421]
#> ✔ Computed lv scores for "r1_r2" [mean: 0.421]
#> ✔ Computed dl scores for "r1_r2" [mean: 0.421]
#> ✔ Computed hamming scores for "r1_r2" [mean: 0]
#> ✔ Computed lcs scores for "r1_r2" [mean: 0.039]
#> ✔ Computed qgram scores for "r1_r2" [mean: 0.655]
#> ✔ Computed cosine scores for "r1_r2" [mean: 0.747]
#> ✔ Computed jaccard scores for "r1_r2" [mean: 0.708]
#> ✔ Computed jw scores for "r1_r2" [mean: 0.801]
#> ✔ Computed soundex scores for "r1_r2" [mean: 0.5]