Word similarity

with duckdb

duckdb

Author

Nick Twort

Published

27 May 2024

Suppose you have been asked with comparing or joining two datasets, but the datasets have been compiled manually or otherwise there are inconsistencies between them. How can you efficiently join without resorting to manual comparison?

One option is to use a string distance metric. A very easy way to implement this is using the duckdb package (which has so many other uses).

library(duckdb)


df_1 <- data.frame(
  x = c("New South Wales", "Victoria", "queenslnd", "Southern Australia", "Tasmania", "West Australia", "Australian Capital Teritory", "North Teritory"),
  y = runif(8)
)

df_2 <- data.frame(
  x = c("New South Wales", "Victoria", "Queensland", "South Australia", "Tasmania", "Western Australia", "Australian Capital Territory", "Northern Territory"),
  z = runif(8)
)

con <- dbConnect(duckdb())
dbWriteTable(con, "df_1", df_1)
dbWriteTable(con, "df_2", df_2)

dbGetQuery(
  con,
  "
  SELECT *, jaro_winkler_similarity(df_1.x, df_2.x) score
  from df_1, df_2
  WHERE score > 0.87
  "
)

Enjoy your new-found productivity!

Attribution

Thanks for the tip, Nicholas E.