
Computes similarity scores between two or more lists of character strings. Implements multiple string distance algorithms including edit-based methods (OSA, Levenshtein, Damerau-Levenshtein, Hamming), sequence-based approaches (LCS), q-gram techniques (qgram, cosine, jaccard), phonetic matching (soundex), and hybrid methods (Jaro-Winkler).
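As a rough illustration of how an edit distance becomes a similarity score, the sketch below uses only base R (`utils::adist`) and normalizes by the length of the longer string. The helper name `lv_similarity` is hypothetical, and both the normalization and the element-by-element pairing are assumptions about how `same_text` works, chosen because they agree with the "lv" mean shown in the Examples.

```r
# Hypothetical helper, not part of the package: normalized Levenshtein
# similarity, computed as 1 - distance / length of the longer string.
lv_similarity <- function(a, b) {
  # Pair elements positionally: a[1] with b[1], a[2] with b[2], ...
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b,
              USE.NAMES = FALSE)
  longest <- pmax(nchar(a), nchar(b))
  # Two empty strings are treated as identical.
  ifelse(longest == 0, 1, 1 - d / longest)
}

list1 <- c("hello", "world")
list2 <- c("helo", "word")

scores <- lv_similarity(list1, list2)
scores
round(mean(scores), 3)
```

Each pair differs by one deleted character out of five, so both scores are 0.8.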

Usage

same_text(
  ...,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  q = 1,
  p = NULL,
  bt = 0,
  weight = c(d = 1, i = 1, s = 1, t = 1),
  digits = 3
)

Arguments

...

Lists of character strings to compare.

method

Character vector of similarity methods, as implemented in the stringdist package. By default all ten are computed: c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex").

q

Size of q-gram for "qgram", "cosine", and "jaccard" methods (default: 1).

p

Winkler prefix scaling factor for the "jw" method. The signature default is NULL, in which case 0.1 is used.

bt

Winkler boost threshold for the "jw" method (default: 0).

weight

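Named vector of weights for the edit operations deletion (d), insertion (i), substitution (s), and transposition (t), used by the "osa", "lv", and "dl" methods (default: c(d = 1, i = 1, s = 1, t = 1)). Plain Levenshtein ("lv") has no transposition operation, so t affects only "osa" and "dl".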

digits

Number of digits to round results (default: 3).
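To make the roles of p and bt concrete, here is a minimal base R sketch of Jaro-Winkler similarity. The function names `jaro_sim` and `jw_sim` are hypothetical, and the rule for bt (apply the prefix adjustment only when the Jaro distance exceeds bt) follows the stringdist documentation; that `same_text` behaves the same way is an assumption.

```r
# Hypothetical helper: plain Jaro similarity between two strings.
jaro_sim <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x); m <- length(y)
  if (n == 0 && m == 0) return(1)
  if (n == 0 || m == 0) return(0)
  # Characters match only if they are equal and close enough in position.
  win <- max(0, floor(max(n, m) / 2) - 1)
  x_hit <- logical(n); y_hit <- logical(m)
  for (i in seq_len(n)) {
    lo <- max(1, i - win); hi <- min(m, i + win)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_hit[j] && x[i] == y[j]) {
        x_hit[i] <- TRUE; y_hit[j] <- TRUE
        break
      }
    }
  }
  k <- sum(x_hit)
  if (k == 0) return(0)
  # Matched characters that appear in a different order, counted in pairs.
  t <- sum(x[x_hit] != y[y_hit]) / 2
  (k / n + k / m + (k - t) / k) / 3
}

# Hypothetical helper: Winkler adjustment on top of Jaro.
jw_sim <- function(a, b, p = 0.1, bt = 0) {
  sj <- jaro_sim(a, b)
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # Length of the common prefix, capped at 4 characters.
  l <- 0
  for (i in seq_len(min(4, length(x), length(y)))) {
    if (x[i] == y[i]) l <- l + 1 else break
  }
  # Assumed rule: boost only when the Jaro distance (1 - sj) exceeds bt.
  if ((1 - sj) > bt) sj + l * p * (1 - sj) else sj
}

jw_scores <- c(jw_sim("hello", "helo"), jw_sim("world", "word"))
round(mean(jw_scores), 3)
```

With p = 0.1 both pairs share a three-character prefix, and the rounded mean agrees with the "jw" value printed in the Examples. Setting p = 0 reduces the score to plain Jaro similarity.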

Value

An S3 object of class "similar_text" containing:

  • scores: Numeric similarity scores by method and comparison

  • summary: Summary statistics by method and comparison

  • methods: Methods used for comparison

  • list_names: Names of compared lists

Examples

list1 <- list("hello", "world")
list2 <- list("helo", "word")

# Using unnamed lists
result1 <- same_text(list1, list2)
#>  Computed osa scores for "list1_list2" [mean: 0.8]
#>  Computed lv scores for "list1_list2" [mean: 0.8]
#>  Computed dl scores for "list1_list2" [mean: 0.8]
#>  Computed hamming scores for "list1_list2" [mean: 0]
#>  Computed lcs scores for "list1_list2" [mean: 0.8]
#>  Computed qgram scores for "list1_list2" [mean: 0.889]
#>  Computed cosine scores for "list1_list2" [mean: 0.92]
#>  Computed jaccard scores for "list1_list2" [mean: 0.9]
#>  Computed jw scores for "list1_list2" [mean: 0.953]
#>  Computed soundex scores for "list1_list2" [mean: 0.5]

# Using named arguments to control the comparison label
result2 <- same_text("l1" = list1, "l2" = list2)
#>  Computed osa scores for "l1_l2" [mean: 0.8]
#>  Computed lv scores for "l1_l2" [mean: 0.8]
#>  Computed dl scores for "l1_l2" [mean: 0.8]
#>  Computed hamming scores for "l1_l2" [mean: 0]
#>  Computed lcs scores for "l1_l2" [mean: 0.8]
#>  Computed qgram scores for "l1_l2" [mean: 0.889]
#>  Computed cosine scores for "l1_l2" [mean: 0.92]
#>  Computed jaccard scores for "l1_l2" [mean: 0.9]
#>  Computed jw scores for "l1_l2" [mean: 0.953]
#>  Computed soundex scores for "l1_l2" [mean: 0.5]
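
The q-gram based scores above can be checked by hand. The following base R sketch computes a Jaccard similarity over sets of q-grams; the helper names `qgrams` and `jaccard_sim` are hypothetical and are not part of the package, and the assumption is that `same_text` compares list elements pairwise and averages the results.

```r
# Hypothetical helper: split a string into its overlapping q-grams.
qgrams <- function(x, q = 1) {
  n <- nchar(x)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(x, starts, starts + q - 1)
}

# Hypothetical helper: Jaccard similarity between the q-gram sets
# of two strings (shared q-grams divided by all distinct q-grams).
jaccard_sim <- function(a, b, q = 1) {
  qa <- unique(qgrams(a, q))
  qb <- unique(qgrams(b, q))
  all_grams <- union(qa, qb)
  if (length(all_grams) == 0) return(1)  # no q-grams on either side
  length(intersect(qa, qb)) / length(all_grams)
}

jaccard_scores <- c(
  jaccard_sim("hello", "helo"),  # same unigram set {h, e, l, o}
  jaccard_sim("world", "word")   # 4 of 5 distinct unigrams shared
)
mean(jaccard_scores)

# A larger q is stricter: only "wo" and "or" are shared bigrams.
jaccard_sim("world", "word", q = 2)
```

With q = 1 the two pairs score 1 and 0.8, which averages to the 0.9 reported for "jaccard" above.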