Thoughts on Lean Code
- R
Operational efficiency. Or, in other words: packing code into as few operations as possible.
I often follow this philosophy to a fault; for me, discovering new tools & methods to achieve the same result is one of the most rewarding aspects of programming. In that sense, “data science” is a playground.
Recently, I encountered a situation involving nested columns within a grouped data frame. Consider the excerpt below from the coffee_ratings dataset.
df
## # A tibble: 10 x 3
## species farm_name companies
## <chr> <chr> <list>
## 1 Arabica alicia's farm <chr [2]>
## 2 Arabica ampcg <chr [2]>
## 3 Arabica conquista / morito <chr [4]>
## 4 Arabica doi tung development project <chr [2]>
## 5 Arabica el morito <chr [2]>
## 6 Arabica finca el morito <chr [2]>
## 7 Arabica finca medina <chr [3]>
## 8 Arabica finca santa clara <chr [2]>
## 9 Robusta sethuraman estate <chr [2]>
## 10 Robusta sethuraman estates <chr [2]>
Each element of the companies column is itself a character vector holding several company names. In one operation, how can we obtain a data frame that contains a unique count of companies + farms for both Arabica & Robusta?
For the farm_name column, this is trivial with conventional dplyr syntax:
df %>%
  group_by(species) %>%
  summarise(n_farms = n_distinct(farm_name))
## # A tibble: 2 x 2
## species n_farms
## <chr> <int>
## 1 Arabica 8
## 2 Robusta 2
However, if we try to extend this approach to the companies column, we get the following:
df %>%
  group_by(species) %>%
  summarise(n_companies = n_distinct(companies)) ## this is misleading!
## # A tibble: 2 x 2
## species n_companies
## <chr> <int>
## 1 Arabica 8
## 2 Robusta 2
What’s wrong? The summarise operation returned an integer column like we expected, but the counts are misleading. In fact, n_distinct counted the number of distinct list elements (whole character vectors) in the companies column, not the distinct company names inside them:
company_list <- df %>%
  filter(species == "Arabica") %>%
  pull(companies)
company_list # expanded form of 'companies' column
## [[1]]
## [1] "yunnan new century tech inc." "yunnan louis herbs r& d center"
##
## [[2]]
## [1] "taylor winch (t) ltd" "volcafe/taylorwinch tanzania ltd"
##
## [[3]]
## [1] "coffee resources inc."
## [2] "unex guatemala, s.a."
## [3] "asociación nacional del café - anacafe -"
## [4] "eduardo ambrocio"
##
## [[4]]
## [1] "mae fah luang foundation" "doi tung development project"
##
## [[5]]
## [1] "armajaro guatemala, s. a." "unex guatemala, s.a."
##
## [[6]]
## [1] "unex guatemala, s.a." "finca el morito"
##
## [[7]]
## [1] "siembras vision, s.a."
## [2] "finca medina"
## [3] "siembras vision, s.a. / ing. jorge bolaños"
##
## [[8]]
## [1] "exportcafe" "café san blas"
This is not exactly what we want. We want the number of unique company names across all of these vectors.
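To make the distinction concrete, here is a minimal sketch on a made-up list. It uses only base R so it runs without extra packages: `unique()` stands in for the element-wise comparison, and `toy_companies` is a hypothetical example, not data from coffee_ratings.

```r
# Hypothetical toy list standing in for the `companies` column
toy_companies <- list(
  c("a", "b"),
  c("b", "a"),  # same names as the first element, different order
  c("a", "b"),  # exact repeat of the first element
  c("c", "d")
)

# unique() on a list compares whole elements, so order matters
n_distinct_vectors <- length(unique(toy_companies))

# Tally each name individually with a plain loop
seen <- character(0)
for (vec in toy_companies) {
  seen <- union(seen, vec)
}
n_distinct_names <- length(seen)

n_distinct_vectors  # 3
n_distinct_names    # 4
```

The two counts disagree because the first treats c("a", "b") and c("b", "a") as different elements, while the second only cares about the names themselves.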
Cue purrr::flatten_chr, which simplifies the hierarchical structure of its argument.[1]
purrr::flatten_chr(company_list)
## [1] "yunnan new century tech inc."
## [2] "yunnan louis herbs r& d center"
## [3] "taylor winch (t) ltd"
## [4] "volcafe/taylorwinch tanzania ltd"
## [5] "coffee resources inc."
## [6] "unex guatemala, s.a."
## [7] "asociación nacional del café - anacafe -"
## [8] "eduardo ambrocio"
## [9] "mae fah luang foundation"
## [10] "doi tung development project"
## [11] "armajaro guatemala, s. a."
## [12] "unex guatemala, s.a."
## [13] "unex guatemala, s.a."
## [14] "finca el morito"
## [15] "siembras vision, s.a."
## [16] "finca medina"
## [17] "siembras vision, s.a. / ing. jorge bolaños"
## [18] "exportcafe"
## [19] "café san blas"
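So much nicer! The flattened vector has 19 entries, but "unex guatemala, s.a." appears three times, so there are 17 unique companies. Now we can apply n_distinct to this result.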
Method A
df %>%
  group_by(species) %>%
  summarise(n_farms = n_distinct(farm_name),
            n_companies = n_distinct(flatten_chr(companies))) ## achieves desired result
## # A tibble: 2 x 3
## species n_farms n_companies
## <chr> <int> <int>
## 1 Arabica 8 17
## 2 Robusta 2 4
For extra credit, we can use with_groups to fold the grouping and the summary into a single call.
Method B
df %>%
  with_groups(species,
              summarise,
              tibble(
                n_farms = n_distinct(farm_name),
                n_companies = n_distinct(flatten_chr(companies))
              )
  )
## # A tibble: 2 x 3
## species n_farms n_companies
## <chr> <int> <int>
## 1 Arabica 8 17
## 2 Robusta 2 4
But is Method B really better? Sure, it’s one operation - and if the primary consideration is operational efficiency, then I daresay this is an excellent solution.
This observation urges us to revisit our definition of operational efficiency: in terms of run time and memory, is there any computational difference between Method A & B?
## # A tibble: 2 x 5
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
## 1 A 2.69ms 2.98ms 316. 3.56KB
## 2 B 4.6ms 5.5ms 171. 10.38KB
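Actually, Method B is worse than Method A on both counts: its median run time is nearly double (5.5ms vs. 2.98ms) and it allocates roughly three times as much memory (10.38KB vs. 3.56KB).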
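The benchmark code itself is not shown. The column types in the table (`<bch:expr>`, `<bch:tm>`, `<bch:byt>`) are what the bench package prints, so a comparison of this kind could be set up roughly as follows. This is a sketch under assumptions: `toy` is a hypothetical stand-in for `df`, the wrappers `method_a` and `method_b` are names introduced here, and timings on a toy table will not match the numbers above.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the post's `df` (not the real coffee_ratings data)
toy <- tibble(
  species   = c("Arabica", "Arabica", "Robusta"),
  farm_name = c("farm a", "farm b", "farm c"),
  companies = list(c("co 1", "co 2"), c("co 2", "co 3"), c("co 4", "co 5"))
)

# Method A: explicit group_by() followed by summarise()
method_a <- function(data) {
  data %>%
    group_by(species) %>%
    summarise(
      n_farms     = n_distinct(farm_name),
      n_companies = n_distinct(flatten_chr(companies))
    )
}

# Method B: grouping and summary folded into one with_groups() call
method_b <- function(data) {
  data %>%
    with_groups(
      species,
      summarise,
      tibble(
        n_farms     = n_distinct(farm_name),
        n_companies = n_distinct(flatten_chr(companies))
      )
    )
}

# bench::mark() times each expression and tracks memory allocation;
# by default it also checks that the expressions return equal results
results <- bench::mark(
  A = method_a(toy),
  B = method_b(toy)
)
results[, c("expression", "min", "median", "itr/sec", "mem_alloc")]
```

Wrapping each method in a function keeps the benchmarked expressions short and makes it easy to rerun the comparison on a larger data frame.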
Conclusion
Constantly being self-critical of one’s code is a dangerous trap, but I also believe it is what gives us the excitement to try new things. Ultimately, I believe the most “efficient” code takes intelligibility & computational performance into consideration while striving to be as concise as possible.
I am constantly reminding myself how important it is to be conscientious of these factors.
[1] This is similar to base::unlist, but only removes one layer of hierarchy at a time.
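The difference between removing one layer and removing all of them can be sketched in base R alone, where `unlist(recursive = FALSE)` plays the role of a one-level flatten. This is an analogy on a hypothetical `nested` list, not the purrr implementation.

```r
# Hypothetical list with two levels of nesting
nested <- list(list("a", list("b", "c")), list("d"))

# Remove only the outermost layer: the inner list("b", "c") survives
one_level <- unlist(nested, recursive = FALSE)

# Remove every layer: the result is a plain character vector
all_levels <- unlist(nested)

length(one_level)   # 3
is.list(one_level)  # TRUE
all_levels          # "a" "b" "c" "d"
```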