# Cocktails: Experimenting with Cosine String Distance


You suddenly discover that you are hosting a party tonight. The guests are expecting you to prepare cocktails. With limited time to prepare, you can only make one trip to the store. What ingredients should you buy in order to maximize your mixological palette? Assume your shopping cart holds `n` items.

In Part 1, we will introduce the `cocktails` dataset, which forms the inspiration for this problem. Although data cleaning for `cocktails` is mostly done, some touch-ups are needed. We will apply cosine string distance comparisons to eliminate redundant ingredients.

Part 2 focuses on the approach used to optimize ingredient selection.

Click here to see the Shiny app that illustrates the final results!

# Background

The `cocktails` dataset comes from TidyTuesday^{1}.

```
library(tidyverse)

# Get the data from the TidyTuesday repo
cocktails <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/cocktails.csv') %>%
  select(drink, ingredient_number, ingredient, measure)
```

`cocktails` contains 546 unique drinks. I never considered myself a proper connoisseur by any means, but this dataset serves as a humbling realization that I’ve barely scratched the surface of mixology: Gin Squirt, Godchild, Space Odyssey, to name a few.

The data is structured as shown below: each ingredient and its associated measure gets its own row.

The plot below shows the most frequently used ingredients in `cocktails`. At first glance, there are \(333\) distinct ingredients.

Vodka & gin take gold & silver, but I was surprised to see that whiskey doesn’t even make the top 8! Otherwise, the chart seems to make sense.

With just these ingredients, the only drink you’d be able to make to order is a Screwdriver (Vodka + Orange Juice). If you’re trying to impress company, this list is not comprehensive enough - your first guest who orders an Old-Fashioned will be disappointed.

Is the solution to simply buy *more* ingredients? That assumes working down the ingredient frequency list is the smartest way to solve this problem. Spoiler alert: it’s not (more in Part 2).

# The First Roadblock

Skimming through the distinct ingredients, we can observe duplicates due to small differences in capitalization and superfluous adjectives. This could add unwanted redundancy to our selection method and unnecessarily convolute our network plot.

After all, how many different types of lime juice do we need?

```
cocktails %>%
  distinct(ingredient) %>%
  filter(str_detect(ingredient, "(L|l)ime"))
```

```
## # A tibble: 9 x 1
## ingredient
## <chr>
## 1 Lime juice
## 2 Fresh Lime Juice
## 3 Lime
## 4 Lime juice cordial
## 5 Lime Juice
## 6 Lemon-lime soda
## 7 Lime vodka
## 8 Lime peel
## 9 Limeade
```

The first thing we can do is force everything to lowercase and remove punctuation.

```
cocktails <- cocktails %>%
  mutate(ingredient = str_to_lower(ingredient),
         ingredient = str_remove(ingredient, "'"))
```

Although this is a good start, we really need a systematic way to detect similar ingredients. Making \(333^2/2 \approx 55{,}445\) manual comparisons is unrealistic.

This is a perfect opportunity to use string distance comparisons. I highly recommend van der Loo’s paper^{2}, which provides a thorough introduction to the stringdist package. We will use the fuzzyjoin package, which harnesses the stringdist tools in a user-friendly interface; however, we need to be careful to use a targeted approach.

For example, `lime peel` and `lime juice` both contain `lime`, and by many string distance metrics they will be fairly similar. But we certainly don’t want to merge `lime juice` and `lime peel` into a single ingredient.
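A quick check with the stringdist package makes the danger concrete (a sketch assuming stringdist is installed):

```r
library(stringdist)

# cosine distance with q = 1 between two ingredients that should NOT merge
d <- stringdist("lime juice", "lime peel", method = "cosine", q = 1)
d  # well under a typical 0.3 merge threshold, despite being different things
```

The two strings share `lime `, so a naive global comparison would happily fuse them.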

We can perform more meaningful comparisons by first separating the ingredients into categories. This is an excellent opportunity to use the tokenizers package + new dplyr 1.0.0 functions!

I’m a huge fan of `with_groups()` since it combines grouping + action + ungrouping into one operation, making code chunks more concise.

```
library(tokenizers)

with_categories <- cocktails %>%
  distinct(ingredient) %>%
  with_groups(ingredient,
              summarise,
              category = tokenize_regex(ingredient,
                                        # essentially word tokenization
                                        pattern = " ",
                                        simplify = TRUE)) %>%
  add_count(category, name = "freq") %>%
  # within each ingredient, what are the most common words?
  with_groups(ingredient, slice_max, n = 1, order_by = freq) %>%
  with_groups(ingredient, filter, n() == 1) %>%
  filter(freq > 6)
```

This results in the categories below.^{3}

We can now apply string distance matching within each category. Consider the juice category as an example. We have the following juices:

```
## [1] "apple juice" "cranberry juice" "fresh lemon juice"
## [4] "fresh lime juice" "fruit juice" "grape juice"
## [7] "grapefruit juice" "guava juice" "lemon juice"
## [10] "lime juice" "lime juice cordial" "orange juice"
## [13] "passion fruit juice" "pineapple juice" "tomato juice"
```

We want to evaluate the string distance between each pair of these elements. Since these are all juices, we can omit “juice”; let’s also omit whitespace.

I opted to use q-gram cosine as the distance metric, with \(q = 1\). I experimented with edit-based methods, but q-gram seemed to perform best on this dataset.
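With the stringdist package, the pairwise distances for a few stemmed juices can be computed in one call. This is a minimal sketch (assuming stringdist and stringr are installed), not the full pipeline:

```r
library(stringdist)
library(stringr)

juices <- c("lime juice", "fresh lime juice", "lemon juice")
stems  <- str_remove_all(juices, "juice|\\s")  # "lime" "freshlime" "lemon"

# full pairwise q-gram cosine distance matrix (q = 1)
m <- stringdistmatrix(stems, stems, method = "cosine", q = 1)
dimnames(m) <- list(stems, stems)
round(m, 3)
```

As desired, `lime` sits closer to `freshlime` than to `lemon`.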

Suppose we are comparing `freshlime` and `lime`. With q-gram cosine, we first construct the set of all unique length-\(q\) substrings (here, single characters, since \(q = 1\)) across the two elements, yielding the following set:
\[set: [f, r, e, s, h, l, i, m]\]
Easy enough. Now, we assign a vector to each element that corresponds to the above set.
\[v(freshlime) = [1, 1, 2, 1, 1, 1, 1, 1]\\v(lime) = [0,0,1,0,0,1,1,1]\]
Now we take the cosine similarity of the two vectors - their dot product divided by the product of their norms - and subtract it from one.
\[1 - \frac{2 + 1 + 1 + 1}{\sqrt{11}\sqrt{4}} \approx 0.246\]
A value of \(0\) indicates an exact match (vectors are identical), whereas a value of \(1\) indicates no q-gram commonality (perpendicular vectors).
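The worked example above can be reproduced with a few lines of base R. This is a from-scratch sketch of the \(q = 1\) case, not the stringdist implementation itself:

```r
# q-gram cosine distance for q = 1 (character unigrams), from scratch
qgram_cosine1 <- function(a, b) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  chars <- union(ca, cb)                           # the combined q-gram set
  va <- sapply(chars, function(ch) sum(ca == ch))  # count vector for a
  vb <- sapply(chars, function(ch) sum(cb == ch))  # count vector for b
  1 - sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

round(qgram_cosine1("freshlime", "lime"), 3)  # matches the 0.246 above
```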

A q-gram cosine matrix for a few juices is shown below. `freshlime` and `lime` share the lowest value in this set - which is what we intended.

I set the string distance threshold to \(0.3\) - any pair scoring below this value will be held out for review.

The resulting table is below! Although there are quite a few false positives, we’ve effectively reduced a list of \(305 ^ 2 /2\) combinations to \(22\) rows for our review.
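One way to generate such a review table is with fuzzyjoin’s `stringdist_inner_join()`. This is a hedged sketch on a handful of juices, not the exact pipeline behind the table above:

```r
library(tidyverse)
library(fuzzyjoin)

juices <- tibble(ingredient = c("lime juice", "fresh lime juice",
                                "lemon juice", "fresh lemon juice",
                                "orange juice"))

candidates <- juices %>%
  # strip the shared word "juice" and whitespace before comparing
  mutate(stem = str_remove_all(ingredient, "juice|\\s")) %>%
  # fuzzy self-join on q-gram cosine distance, keeping pairs under 0.3
  stringdist_inner_join(., ., by = "stem", method = "cosine", q = 1,
                        max_dist = 0.3, distance_col = "dist") %>%
  # keep each unordered pair once and drop self-matches
  filter(stem.x < stem.y) %>%
  select(ingredient.x, ingredient.y, dist)
```

Even on this toy set, genuine duplicates (`fresh lime juice` vs. `lime juice`) surface alongside false positives (`fresh lime juice` vs. `fresh lemon juice`), which is why a manual review pass is still needed.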

(I’ve highlighted the rows I’ve chosen to merge)

# Conclusion

Altogether, the various procedures above have reduced the distinct ingredients from \(333\) to \(298\). Click here to see data transformation code from start to finish.^{4}

In Part 2, we will return to the original problem:

What ingredients should you buy in order to maximize your mixological palette?

1. `TidyTuesday` is a weekly project hosted by the `R4DS Online Learning Community` that encourages beginner & seasoned data scientists to apply their skills to a variety of datasets.↩︎
2. van der Loo M (2014). “The stringdist package for approximate string matching.” *The R Journal*, *6*, 111-122. <https://CRAN.R-project.org/package=stringdist>↩︎
3. Keywords with a global frequency of less than 6 remain unclassified.↩︎
4. If I had more time, I would like to run the string distance comparison on all uncategorized ingredients.↩︎