Word similarity
with duckdb
R
duckdb
Suppose you have been asked with comparing or joining two datasets, but the datasets have been compiled manually or otherwise there are inconsistencies between them. How can you efficiently join without resorting to manual comparison?
One option is to use a string distance metric. A very easy way to implement this is using the duckdb package (which has so many other uses).
library(duckdb)
df_1 <- data.frame(
x = c("New South Wales", "Victoria", "queenslnd", "Southern Australia", "Tasmania", "West Australia", "Australian Capital Teritory", "North Teritory"),
y = runif(8)
)
df_2 <- data.frame(
x = c("New South Wales", "Victoria", "Queensland", "South Australia", "Tasmania", "Western Australia", "Australian Capital Territory", "Northern Territory"),
z = runif(8)
)
con <- dbConnect(duckdb())
dbWriteTable(con, "df_1", df_1)
dbWriteTable(con, "df_2", df_2)
dbGetQuery(
con,
"
SELECT *, jaro_winkler_similarity(df_1.x, df_2.x) score
from df_1, df_2
WHERE score > 0.87
"
)Enjoy your new-found productivity!