
Computes similarity scores between two or more lists of character strings. Implements multiple string distance algorithms including edit-based methods (OSA, Levenshtein, Damerau-Levenshtein, Hamming), sequence-based approaches (LCS), q-gram techniques (qgram, cosine, jaccard), phonetic matching (soundex), and hybrid methods (Jaro-Winkler).

Usage

same_text(
  ...,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  q = 1,
  p = NULL,
  bt = 0,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  digits = 3
)

Arguments

...

Two or more lists of character strings to compare.

method

Character vector of similarity methods from stringdist (default: c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")).

q

Size of q-gram for "qgram", "cosine", and "jaccard" methods (default: 1).

p

Winkler prefix scaling factor for "jw" method (default: NULL in the signature, documented as 0.1).

bt

Boost threshold for "jw" method (default: 0).

weight

Named vector of weights for the edit operations deletion (d), insertion (i), substitution (s), and transposition (t), used by the "osa", "lv", and "dl" methods (default: 1 for each).

digits

Number of decimal places to which results are rounded (default: 3).
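
To make the weight and q arguments more concrete, the base R sketch below shows one way a string distance can be turned into a similarity in [0, 1]. The helper names edit_similarity and qgram_jaccard are hypothetical and are not part of this package. utils::adist stands in for a Levenshtein distance and has no transposition cost, so the t weight is ignored here. The normalization is an assumption for illustration and is not necessarily what same_text does internally.

```r
# Hypothetical helper: weighted Levenshtein similarity for two single strings.
# The distance is divided by an upper bound on the distance, so the result
# lies in [0, 1]. The "t" weight is ignored because adist() has no
# transposition operation.
edit_similarity <- function(a, b, weight = c(d = 1, i = 1, s = 1, t = 1)) {
  w <- weight[c("d", "i", "s")]
  costs <- c(insertions = w[["i"]],
             deletions = w[["d"]],
             substitutions = w[["s"]])
  d <- drop(utils::adist(a, b, costs = costs))
  max_d <- max(nchar(a), nchar(b)) * max(w)
  if (max_d == 0) return(1)  # two empty strings are treated as identical
  1 - d / max_d
}

# Hypothetical helper: Jaccard similarity on the sets of unique q-grams.
qgram_jaccard <- function(a, b, q = 1) {
  grams <- function(s) {
    n <- nchar(s) - q + 1
    if (n < 1) return(character(0))  # string shorter than q has no q-grams
    starts <- seq_len(n)
    unique(substring(s, starts, starts + q - 1))
  }
  ga <- grams(a)
  gb <- grams(b)
  u <- union(ga, gb)
  if (length(u) == 0) return(1)
  length(intersect(ga, gb)) / length(u)
}

round(edit_similarity("kitten", "sitting"), 3)  # 0.571 (3 edits, longest string 7)
qgram_jaccard("abc", "abd", q = 1)              # 0.5   ({a, b} shared out of {a, b, c, d})
round(qgram_jaccard("abc", "abd", q = 2), 3)    # 0.333 ({ab} shared out of {ab, bc, bd})
```

Larger q makes the q-gram methods more sensitive to character order, because longer substrings must match exactly.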

Value

An S3 object of class "similar_text" containing:

  • scores: Numeric similarity scores by method and comparison

  • summary: Summary statistics by method and comparison

  • methods: Methods used for comparison

  • list_names: Names of compared lists

Examples

r1 <- list("R is a statistical computing software", 
           "R enables grammar of graphics using ggplot2")
r2 <- list("R is a full-stack programming language",
           "R enables advanced data visualizations")

result <- same_text(r1, r2)
#>  Computed osa scores for "r1_r2" [mean: 0.421]
#>  Computed lv scores for "r1_r2" [mean: 0.421]
#>  Computed dl scores for "r1_r2" [mean: 0.421]
#>  Computed hamming scores for "r1_r2" [mean: 0]
#>  Computed lcs scores for "r1_r2" [mean: 0.039]
#>  Computed qgram scores for "r1_r2" [mean: 0.655]
#>  Computed cosine scores for "r1_r2" [mean: 0.747]
#>  Computed jaccard scores for "r1_r2" [mean: 0.708]
#>  Computed jw scores for "r1_r2" [mean: 0.801]
#>  Computed soundex scores for "r1_r2" [mean: 0.5]
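
The hamming mean of 0 above is consistent with how Hamming distance is defined: it only applies to strings of equal length, and none of the compared pairs here have matching lengths. The base R sketch below illustrates this with a hypothetical helper, hamming_similarity, which is not part of this package. Returning 0 for unequal lengths is an assumption that mirrors an infinite distance, not a confirmed detail of same_text.

```r
# Hypothetical helper: Hamming similarity for two single strings.
# Unequal lengths have no finite Hamming distance, so they score 0 here.
hamming_similarity <- function(a, b) {
  if (nchar(a) != nchar(b)) return(0)
  n <- nchar(a)
  if (n == 0) return(1)  # two empty strings are treated as identical
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  1 - sum(ca != cb) / n  # share of positions that match
}

s1 <- "R is a statistical computing software"
s2 <- "R is a full-stack programming language"

nchar(s1)                  # 37
nchar(s2)                  # 38
hamming_similarity(s1, s2) # 0, because the lengths differ

hamming_similarity("karolin", "kathrin")  # 4 of 7 positions match
```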