Thoughts on Lean Code

Jul 06, 2020 - R

Operational efficiency. Or, in other words: packing code into as few operations as possible.

I often follow this philosophy to a fault; for me, discovering new tools & methods to achieve the same result is one of the most rewarding aspects of programming. In that sense, “data science” is a playground.

Recently, I encountered a situation involving a nested (list) column within a grouped data frame. Consider the excerpt below from the coffee_ratings dataset.

df

## # A tibble: 10 x 3
##    species farm_name                    companies
##    <chr>   <chr>                        <list>   
##  1 Arabica alicia's farm                <chr [2]>
##  2 Arabica ampcg                        <chr [2]>
##  3 Arabica conquista / morito           <chr [4]>
##  4 Arabica doi tung development project <chr [2]>
##  5 Arabica el morito                    <chr [2]>
##  6 Arabica finca el morito              <chr [2]>
##  7 Arabica finca medina                 <chr [3]>
##  8 Arabica finca santa clara            <chr [2]>
##  9 Robusta sethuraman estate            <chr [2]>
## 10 Robusta sethuraman estates           <chr [2]>

Each element of the companies column is itself a character vector holding several company names. In one operation, how can we obtain a data frame with the number of unique companies and unique farms for each species (Arabica & Robusta)?

For the farm_name column, this is trivial with conventional dplyr syntax:

df %>% 
  group_by(species) %>% 
  summarise(n_farms = n_distinct(farm_name))

## # A tibble: 2 x 2
##   species n_farms
##   <chr>     <int>
## 1 Arabica       8
## 2 Robusta       2

However, if we try to extend this approach to the companies column, we get the following:

df %>% 
  group_by(species) %>% 
  summarise(n_companies = n_distinct(companies)) ## this is misleading!

## # A tibble: 2 x 2
##   species n_companies
##   <chr>         <int>
## 1 Arabica           8
## 2 Robusta           2

What’s wrong? The summarise operation returned an integer column, as we expected, but the values are not company counts. n_distinct counted the number of distinct list elements (whole character vectors) in the companies column, not the distinct company names inside them:

company_list <- df %>% 
  filter(species == "Arabica") %>% 
  pull(companies)

company_list # expanded form of 'companies' column

## [[1]]
## [1] "yunnan new century tech inc."   "yunnan louis herbs r& d center"
## 
## [[2]]
## [1] "taylor winch (t) ltd"             "volcafe/taylorwinch tanzania ltd"
## 
## [[3]]
## [1] "coffee resources inc."                   
## [2] "unex guatemala, s.a."                    
## [3] "asociación nacional del café - anacafe -"
## [4] "eduardo ambrocio"                        
## 
## [[4]]
## [1] "mae fah luang foundation"     "doi tung development project"
## 
## [[5]]
## [1] "armajaro guatemala, s. a." "unex guatemala, s.a."     
## 
## [[6]]
## [1] "unex guatemala, s.a." "finca el morito"     
## 
## [[7]]
## [1] "siembras vision, s.a."                     
## [2] "finca medina"                              
## [3] "siembras vision, s.a. / ing. jorge bolaños"
## 
## [[8]]
## [1] "exportcafe"    "café san blas"

This is not exactly what we want. We want the number of unique companies across all of these vectors.

Cue purrr::flatten_chr, which simplifies the hierarchical structure of its argument.¹

purrr::flatten_chr(company_list)

##  [1] "yunnan new century tech inc."              
##  [2] "yunnan louis herbs r& d center"            
##  [3] "taylor winch (t) ltd"                      
##  [4] "volcafe/taylorwinch tanzania ltd"          
##  [5] "coffee resources inc."                     
##  [6] "unex guatemala, s.a."                      
##  [7] "asociación nacional del café - anacafe -"  
##  [8] "eduardo ambrocio"                          
##  [9] "mae fah luang foundation"                  
## [10] "doi tung development project"              
## [11] "armajaro guatemala, s. a."                 
## [12] "unex guatemala, s.a."                      
## [13] "unex guatemala, s.a."                      
## [14] "finca el morito"                           
## [15] "siembras vision, s.a."                     
## [16] "finca medina"                              
## [17] "siembras vision, s.a. / ing. jorge bolaños"
## [18] "exportcafe"                                
## [19] "café san blas"

So much nicer! Now we can apply n_distinct to this result.

Method A

df %>% 
  group_by(species) %>% 
  summarise(n_farms = n_distinct(farm_name),
            n_companies = n_distinct(flatten_chr(companies))) ## achieves desired result

## # A tibble: 2 x 3
##   species n_farms n_companies
##   <chr>     <int>       <int>
## 1 Arabica       8          17
## 2 Robusta       2           4
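The difference between the two counts is easier to see on a tiny example. The sketch below uses a made-up tibble (toy_df and its values are hypothetical, not taken from coffee_ratings) and assumes dplyr 1.0.0 or later and purrr are installed.

```r
library(dplyr)
library(purrr)

# Made-up data for illustration only; not the coffee_ratings values
toy_df <- tibble(
  species   = c("Arabica", "Arabica", "Robusta"),
  farm_name = c("farm a", "farm b", "farm c"),
  companies = list(c("co 1", "co 2"), c("co 2", "co 3"), c("co 4", "co 4"))
)

toy_counts <- toy_df %>%
  group_by(species) %>%
  summarise(
    n_elements  = n_distinct(companies),              # distinct list elements (whole vectors)
    n_companies = n_distinct(flatten_chr(companies)), # distinct names after flattening
    .groups = "drop"
  )

toy_counts
```

Here "co 2" appears in two different Arabica vectors, so it contributes to two distinct list elements but only one distinct company name.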

For extra credit, we can use with_groups to accomplish everything in one operation.

Method B

df %>% 
  with_groups(species,
              summarise,
              tibble(
                n_farms = n_distinct(farm_name),
                n_companies = n_distinct(flatten_chr(companies))
              )
  )

## # A tibble: 2 x 3
##   species n_farms n_companies
##   <chr>     <int>       <int>
## 1 Arabica       8          17
## 2 Robusta       2           4

But is Method B really better? Sure, it’s one operation - and if the primary consideration is operational efficiency, then I daresay this is an excellent solution.

This prompts us to revisit our definition of operational efficiency: in terms of run time and memory allocation, is there any difference between Method A & B?
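The benchmark call itself is not shown in the post. One way such a comparison could be produced is with bench::mark, sketched here on a small made-up tibble. The names toy_df, method_a and method_b are hypothetical, and timings on toy data will not match the figures reported in this post.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for df; values are made up
toy_df <- tibble(
  species   = c("Arabica", "Arabica", "Robusta"),
  farm_name = c("farm a", "farm b", "farm c"),
  companies = list(c("co 1", "co 2"), c("co 2", "co 3"), c("co 4", "co 5"))
)

# Method A: group_by() followed by summarise()
method_a <- function(data) {
  data %>%
    group_by(species) %>%
    summarise(
      n_farms     = n_distinct(farm_name),
      n_companies = n_distinct(flatten_chr(companies)),
      .groups = "drop"
    )
}

# Method B: grouping and summarising inside a single with_groups() call
method_b <- function(data) {
  data %>%
    with_groups(
      species,
      summarise,
      tibble(
        n_farms     = n_distinct(farm_name),
        n_companies = n_distinct(flatten_chr(companies))
      )
    )
}

# check = FALSE skips bench's result-equality check; only time and memory are compared
results <- bench::mark(
  A = method_a(toy_df),
  B = method_b(toy_df),
  check = FALSE
)

results
```

Wrapping each method in a function keeps the two expressions short and makes it easy to swap in the real data frame.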

## # A tibble: 2 x 5
##   expression      min   median `itr/sec` mem_alloc
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>
## 1 A            2.69ms   2.98ms      316.    3.56KB
## 2 B             4.6ms    5.5ms      171.   10.38KB

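Actually, Method B is worse than Method A on both counts: its median run time is nearly double (5.5ms vs 2.98ms) and it allocates roughly three times the memory (10.38KB vs 3.56KB).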

Conclusion

Constantly being self-critical of one’s code is a dangerous trap, but I also believe this is what gives us the excitement to try new things. Ultimately, I believe the most “efficient” code takes intelligibility & computational performance into consideration while striving to be as concise as possible.

I am constantly reminding myself how important it is to be conscientious of these factors.

¹ This is similar to base::unlist, but only removes one layer of hierarchy at a time.
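A minimal sketch of the distinction drawn in this footnote, using a made-up nested list (the object names are hypothetical). It uses purrr::flatten, the untyped sibling of flatten_chr. As far as I know, the flatten family is marked as superseded from purrr 1.0.0 onward but remains available, which this sketch assumes.

```r
library(purrr)

# Made-up nested list: the last element sits one level deeper than the rest
nested <- list(list("a", "b"), list("c", list("d")))

one_level  <- flatten(nested)  # removes a single layer; "d" is still wrapped in a list
all_levels <- unlist(nested)   # removes every layer and returns a character vector

str(one_level)
all_levels
```

For a list column whose elements are plain character vectors, as in the companies column, one layer is all there is to remove, so either function yields the same flat vector of names.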
