chebur

Data newbie, fan of British sitcoms, learning, coding, translating, reading

R data.table variable assignment explained

This article will discuss three things:

  • variable assignment mechanism in base R
  • variable assignment mechanism in data.table in different scenarios
  • different ways to assign or update multiple columns in data.table, and how each affects variable assignment

The second and third points are discussed together, through examples.

Variable assignment in base R

Terms:

  • copy-by-value / deep copy / deep clone: copying a variable makes a new, independent copy of the original's value. The latter two terms are more common in languages such as Python or Java. (The related term pass-by-value describes the same behaviour for function arguments.)
  • copy-by-reference / shallow copy: copying a variable only creates a new reference that points to the original's value, so both names share the same data.
  • copy-on-write / copy-on-modify: usually applies to the copy-by-value case. R defers the actual copy until the first time one of the objects sharing the value is modified.

copy-by-reference vs. copy-by-value

R is known for copy-by-value semantics: when you modify a variable that was assigned from another variable, you do not need to worry about the original changing as well. This is not the whole story, however. R takes a mixed approach with several checks and steps (see the figure below); it is just that in normal use it is safe to treat it as copy-by-value.

[Figure: the actual steps R takes to assign a value]

The function tracemem() marks an object so that R reports any later copy of it, and returns the memory address where the object's value is stored. Here's a simple example:

> a <- 1 
> b <- a 
> tracemem(a) 
[1] "<0x7f9001d2b880>" 
> tracemem(b) 
[1] "<0x7f9001d2b880>" 

Notice that even though R is said to make a new copy of the value, the copy does not happen right away. Copying while the two variables are still identical would achieve nothing except using more memory, so R holds off until one of the variables first changes:

> b <- 3 
> a 
[1] 1 
> tracemem(b)
[1] "<0x7f90018a0e20>" 

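Here, after b changes, the address that b points to changes as well, while a keeps its value. (Strictly speaking, b <- 3 binds b to a brand-new object; a copy in the copy-on-modify sense is made when a shared object is modified in place, for example with b[1] <- 3.)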
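
To see copy-on-modify itself, the minimal sketch below modifies a shared vector in place instead of rebinding the name. It uses address() from data.table to compare addresses (an assumption made here for convenience; the original text uses tracemem()), and the variable names are illustrative only.

```r
library(data.table)  # only needed for address(); the vectors are plain base R

a <- c(1, 2, 3)
b <- a                                     # no copy yet: both names share one vector
shared_before <- address(a) == address(b)

b[1] <- 10                                 # first modification of b triggers the copy
shared_after <- address(a) == address(b)

print(shared_before)  # TRUE
print(shared_after)   # FALSE
print(a)              # 1 2 3, the original is untouched
```

The copy happens only at the line that modifies b; until then the assignment costs no extra memory.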

Variable assignment in data.table

The assignment mechanism is a bit different in data.table, especially when := is used. Let's look at some examples. They use three variables: m is assigned from a, and b is assigned from m. We will change m and check whether a and b change too.

scenario 1: data.table being passed as a whole

First we define the variables. All examples below start from these same initial values unless stated otherwise.

> a <- data.table::as.data.table(mtcars)
> a <- a[1:5, 1:4]
> m <- a
> b <- m
> m
    mpg cyl disp  hp
1: 21.0   6  160 110
2: 21.0   6  160 110
3: 22.8   4  108  93
4: 21.4   6  258 110
5: 18.7   8  360 175

> tracemem(a)
[1] "<0x7f8ffe95c400>"
> tracemem(b)
[1] "<0x7f8ffe95c400>"
> tracemem(m)
[1] "<0x7f8ffe95c400>"

At this point all three addresses are the same.

eg1: use .SD with by. Not in-place change

> m[, .SD * 2, by = mpg]    # returns a doubled table (output omitted); m itself is unchanged
> m
    mpg cyl disp  hp
1: 21.0   6  160 110
2: 21.0   6  160 110
3: 22.8   4  108  93
4: 21.4   6  258 110
5: 18.7   8  360 175
> m <- m[, .SD * 2, by = mpg]
> m
    mpg cyl disp  hp
1: 21.0  12  320 220
2: 21.0  12  320 220
3: 22.8   8  216 186
4: 21.4  12  516 220
5: 18.7  16  720 350

> a
    mpg cyl disp  hp
1: 21.0   6  160 110
2: 21.0   6  160 110
3: 22.8   4  108  93
4: 21.4   6  258 110
5: 18.7   8  360 175

b is still identical to a. This shows that an expression using .SD does not change values in place: it returns a new data.table as output and leaves m untouched. Assigning that result back to m does not affect a or b either. The assignment makes the name m point to a new address holding the new value, while a and b keep pointing to the original one. After that, any change to m has nothing to do with a or b. This is essentially the same as assignment in base R; the data.table quirk has not shown up yet.

eg2: use lapply with .SD inside data.table. Not in-place change

Sometimes we use lapply inside a data.table call, usually in the j position, to run a computation on several columns at once. For example, to multiply every column except mpg by 2:

m[, lapply(.SD, function(x) x*2), by = mpg]  # m is not changed 

Because m is not changed by this call, the result has to be assigned back to m. As in eg1, that assignment points m at a new object with a different address from a (b stays the same as a).
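
The reassignment step described for eg1 and eg2 can be checked with a short sketch. It assumes the same initial values as above and uses data.table's address() to compare objects; the print() calls are only there for illustration.

```r
library(data.table)

a <- as.data.table(mtcars)[1:5, 1:4]
m <- a
b <- m

# The expression returns a new table; assigning it only rebinds the name m
m <- m[, lapply(.SD, function(x) x * 2), by = mpg]

print(a$cyl)                     # 6 6 4 6 8 (a is unchanged)
print(m$cyl)                     # 12 12 8 12 16
print(address(m) == address(a))  # FALSE: m now refers to a different object
print(address(b) == address(a))  # TRUE: b still shares a's value
```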

eg3: change each column individually using :=. In-place change

If you only have a few columns to change, changing them one by one is not a bad idea. It is probably also what someone new to data.table would try first. However, be careful: := not only changes the table in place, it also changes it by reference, as this example shows:

> m    # m, a and b have been reset to their initial values
    mpg cyl disp  hp
1: 21.0   6  160 110
2: 21.0   6  160 110
3: 22.8   4  108  93
4: 21.4   6  258 110
5: 18.7   8  360 175

> tracemem(a)    # check for address assigned for a and m
[1] "<0x7f8ffe0aba00>"
> tracemem(m)
[1] "<0x7f8ffe0aba00>"
> m[, cyl := cyl * 2]  # changes m in place; fine for a small number of columns
> m
    mpg cyl disp  hp
1: 21.0  12  160 110
2: 21.0  12  160 110
3: 22.8   8  108  93
4: 21.4  12  258 110
5: 18.7  16  360 175
> a
    mpg cyl disp  hp
1: 21.0  12  160 110
2: 21.0  12  160 110
3: 22.8   8  108  93
4: 21.4  12  258 110
5: 18.7  16  360 175

> tracemem(m)
[1] "<0x7f8ffe0aba00>"
> tracemem(a)
[1] "<0x7f8ffe0aba00>"

Here, := changed m in place. It also changed the value of a and b. Recall that R does copy-on-write: until one of them is modified, variables assigned from one another all point to the same value at a single address. So when m, a, and b were initialised, three pointers were created from their names ("a", "m", "b") to one value. Unlike base R operations, := modifies that value directly and leaves the pointers alone: it does not allocate a new address and a new value for m to point to. The three pointers therefore still point to the same address, and since the value at that address has changed, all three variables change. (Notice that the addresses of m and a are not only equal to each other after the operation, they are also the same as before it.)
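
The same behaviour can be verified programmatically. The sketch below is a minimal check under the initial values used throughout; addr_before is a helper name introduced here for illustration.

```r
library(data.table)

a <- as.data.table(mtcars)[1:5, 1:4]
m <- a
b <- m
addr_before <- address(m)

m[, cyl := cyl * 2]   # update by reference, no reassignment needed

print(a$cyl)                      # 12 12 8 12 16: a changed although only m was updated
print(b$cyl)                      # 12 12 8 12 16
print(address(m) == addr_before)  # TRUE: same object as before the update
print(address(a) == address(m))   # TRUE
```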

[Figure: what happens to the value, the variables, and the pointers when using :=]

eg4: use := and lapply. In-place change

Suppose you want to change several columns at once, and you want the change to happen in place (by reference), for example inside an iteration. This is the approach to take:

> cols <- c("cyl", "disp", "hp")
> m[, (cols) := lapply(.SD, function(x) x * 2), by = mpg]
> m
    mpg cyl disp  hp
1: 21.0  12  320 220
2: 21.0  12  320 220
3: 22.8   8  216 186
4: 21.4  12  516 220
5: 18.7  16  720 350

You may wonder, especially after eg1 and eg2, whether (cols) can be replaced with .SD. The answer is no: .SD does not accept assignment, i.e. you cannot use = or := to assign values to .SD. In eg1, .SD stands for the columns and an expression on it yields changed values, but those values are only an output that is not assigned to any variable. Using = or := means assigning a value to a variable (creating a pointer from a variable name to a value), and that is not allowed with .SD.

eg5: use :=, lapply, .SD, and .SDcols for maximum flexibility in updating multiple columns. In-place change

The examples so far share a downside: they are not flexible enough. Unless we update columns one by one, we have to modify every column except the one named in the by clause, which is often not what we want. The .SDcols argument addresses this.

Suppose we now want to multiply only columns cyl and disp by two and leave all other columns alone. Any of the following will do. They are essentially the same approach with minor tweaks, so I list them together:

cols <- c("cyl", "disp")
# or, to refer to the columns by index: cols <- names(m)[2:3]. Works the same

m[, (cols) := lapply(.SD, function(x) x * 2), .SDcols = 2:3]
# changes in place. Similar to eg4, with 'by' replaced by '.SDcols'

or

m[, 2:3 := lapply(.SD, function(x) x * 2), .SDcols = 2:3]  
# same thing. Just changed '(cols)' to '2:3'

or

m[, (cols) := lapply(.SD, function(x) x * 2), .SDcols = cols]
# same thing, with '2:3' in '.SDcols' replaced by 'cols'
# gives more flexibility if you want to select non-contiguous columns
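
As a quick check of the .SDcols approach, the sketch below runs the last variant once from the initial values and confirms that only the selected columns change, and that a changes along with m. The print() calls are illustrative only.

```r
library(data.table)

a <- as.data.table(mtcars)[1:5, 1:4]
m <- a
cols <- c("cyl", "disp")

m[, (cols) := lapply(.SD, function(x) x * 2), .SDcols = cols]

print(m$cyl)   # 12 12 8 12 16
print(m$disp)  # 320 320 216 516 720
print(m$hp)    # 110 110 93 110 175 (not in cols, so unchanged)
print(a$disp)  # 320 320 216 516 720: a shares the value and changed too
```

Note that the three variants are alternatives: running them one after another would double the columns three times.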

eg6: use := and list() without lapply. In-place change

Without lapply, the j position in data.table needs to receive a list in order to modify column values. eg6 achieves a similar end result to eg5 (flexible column selection, in-place change) without using lapply.

Again, the variants below are essentially the same approach; any of them will do.

cols <- c("cyl", "disp", "hp")
m[, (cols) := list(cyl * 2, disp * 2, hp * 2)]
# assigns the updated values back to the existing columns; no new columns are created

or

m[, c("cyl_new", "disp_new", "hp_new") := list(cyl * 2, disp * 2, hp * 2)]
# creates new columns. Now m, a, and b all have 7 columns

or

ff <- function() list("v1", "v2", "v3")
m[, c("cyl_new", "disp_new", "hp_new") := ff()]
# define a function that returns a list of fixed values, then assign them to the columns
# this works well when the column values are given, but is not ideal for computing them
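
The claim that new columns also show up in the other variables can be checked with a small sketch. It assumes the usual initial values and adds only two new columns to keep the output short.

```r
library(data.table)

a <- as.data.table(mtcars)[1:5, 1:4]
m <- a

# New columns are added by reference to the object that m and a share
m[, c("cyl_new", "disp_new") := list(cyl * 2, disp * 2)]

print(names(a))    # mpg, cyl, disp, hp, cyl_new, disp_new
print(ncol(a))     # 6
print(a$cyl_new)   # 12 12 8 12 16
```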

Summary

  1. Base R is said to do copy-by-value, but the "copy" does not actually happen until a variable is first modified, hence "copy-on-write". In most cases, for simplicity, we can still treat it as copy-by-value and not worry about affecting the original variable.
  2. In data.table, among the operations covered here, only := changes a table in place. Every other expression leaves the variable untouched and merely outputs a result, so it behaves no differently from base R.
  3. When := is used, the table is changed in place and the pointers are left as they are: data.table makes no new copy of the value and keeps every reference the same. As a result, any variable that shares the value with the modified one, whether upstream or downstream, changes as well.
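
If the shared-reference behaviour in point 3 is not wanted, one option (not covered in the examples above) is data.table's copy() function, which makes an independent copy before := is applied. A minimal sketch, assuming the same initial values:

```r
library(data.table)

a <- as.data.table(mtcars)[1:5, 1:4]
m <- copy(a)          # independent copy instead of a second name for the same table

m[, cyl := cyl * 2]   # the by-reference update now touches only the copy

print(m$cyl)                     # 12 12 8 12 16
print(a$cyl)                     # 6 6 4 6 8: the original is protected
print(address(a) == address(m))  # FALSE
```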

In future posts I will continue this topic and extend it to another data type: a list of data.tables. Lists are very powerful because their elements can hold different data types. Combining lists with the apply family of functions can make your code much shorter and avoid explicit loops. This can be hard to understand and error-prone, so a few examples that walk through some scenarios are a must.


CC BY-NC-ND 2.0 license notice
