R and Python are both great languages for data analysis. While they are remarkably
similar in some aspects, they are drastically different in others. In this post,
I will focus on the similarities and differences between Pandas and data.table,
two of the most prominent data manipulation packages in Python/R.
There is already an excellent post by Fisseha Berhane that nicely compares basic data manipulations, which you can find on this website.
I want to focus on how more advanced data.table operations, such as in-place modifications, map to Pandas.
Let’s get started by setting up our environment. I will use the reticulate
package to work with both R and Python in tandem.
library(data.table)
library(reticulate)
We will use the following super simple dataset:
df_r = data.table(a = 1:6,
                  b = 7:12,
                  c = c('q', 'q', 'q', 'q', 'w', 'w')
                  )
setDT(df_r)
Assignment: copy vs in-place
R is a functional language, and data is therefore generally immutable, i.e. ‘changing’ a
dataframe in place is not possible. R will copy the original dataframe, apply your updates
and store the result as a new dataframe. While immutable data makes it easier to reason
about your code, it takes more time and memory to perform computations compared to
modifying data in-place. This is one of the primary reasons why data.table is
so blazing fast and memory efficient: it avoids unnecessary copies and directly
modifies your original data.frame (see my post “Using data.table deep copy” for more
info).
So how can we add a column by reference?
New column: in-place & copy
We will start by adding a new column that is simply a multiple of an existing one:
df_add_col = copy(df_r)
tracemem(df_add_col)
## [1] "<00000000198074A8>"
df_add_col[, new_col := a * 2]
df_add_col
## a b c new_col
## 1: 1 7 q 2
## 2: 2 8 q 4
## 3: 3 9 q 6
## 4: 4 10 q 8
## 5: 5 11 w 10
## 6: 6 12 w 12
tracemem(df_add_col)
## [1] "<00000000198074A8>"
import pandas as pd
df_py = r.df_r
print('Old memory location: '+str(id(df_py)),'\n')
## Old memory location: 517604352
df_py['new_col'] = df_py['a'] * 2
print(df_py,'\n')
## a b c new_col
## 0 1 7 q 2
## 1 2 8 q 4
## 2 3 9 q 6
## 3 4 10 q 8
## 4 5 11 w 10
## 5 6 12 w 12
print('New memory location: '+str(id(df_py)))
## New memory location: 517604352
In both cases, we can see that the memory location did not change.
If we want to return a new dataframe (and allocate new memory), we have to do it like this:
df_add_col = copy(df_r)
tracemem(df_add_col)
## [1] "<000000001F9A94C0>"
df_add_col = df_add_col[, .(a, b, c, new_col = a * 2)]
df_add_col
## a b c new_col
## 1: 1 7 q 2
## 2: 2 8 q 4
## 3: 3 9 q 6
## 4: 4 10 q 8
## 5: 5 11 w 10
## 6: 6 12 w 12
tracemem(df_add_col)
## [1] "<000000001FABF538>"
Likewise with Pandas we have to use:
import pandas as pd
df_py = r.df_r
print('Old memory location: '+str(id(df_py)),'\n')
## Old memory location: 525783280
df_py = df_py.assign(new_col = df_py['a']*2)
print(df_py,'\n')
## a b c new_col
## 0 1 7 q 2
## 1 2 8 q 4
## 2 3 9 q 6
## 3 4 10 q 8
## 4 5 11 w 10
## 5 6 12 w 12
print('New Memory location: '+str(id(df_py)))
## New Memory location: 525784288
Using assign
copies the entire dataframe and assigns it to a new memory location.
New column: grouped calculation
We often want to add additional columns based on groups in our dataframe. Let’s take a look:
df_add_group_col = copy(df_r)
tracemem(df_add_group_col)
## [1] "<000000001F9A94C0>"
df_add_group_col[, group_col := sum(a) / sum(b), by = c]
df_add_group_col
## a b c group_col
## 1: 1 7 q 0.2941176
## 2: 2 8 q 0.2941176
## 3: 3 9 q 0.2941176
## 4: 4 10 q 0.2941176
## 5: 5 11 w 0.4782609
## 6: 6 12 w 0.4782609
tracemem(df_add_group_col)
## [1] "<000000001F9A94C0>"
In Pandas we have to do the following:
df_py = r.df_r
print('Old memory location: '+str(id(df_py)),'\n')
## Old memory location: 525786640
df_gd = df_py.groupby('c')
df_py['group_col'] = df_gd['a'].transform(lambda x: sum(x))/df_gd['b'].transform('sum')
print(df_py,'\n')
## a b c group_col
## 0 1 7 q 0.294118
## 1 2 8 q 0.294118
## 2 3 9 q 0.294118
## 3 4 10 q 0.294118
## 4 5 11 w 0.478261
## 5 6 12 w 0.478261
print('New memory location: '+str(id(df_py)))
## New memory location: 525786640
In Pandas transform
is used to broadcast a groupby result back to the original
dataframe, i.e. transform
is similar to SQL window functions. I used a lambda
function in my code above, but you could simply write transform('sum')
in this
simple case as well.
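The shorter version without the lambda would simply be:
df_py['group_col'] = df_gd['a'].transform('sum') / df_gd['b'].transform('sum')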
Unfortunately, transform works on columns in sequence, i.e. first column a is
passed to our lambda function and then column b. So this code will fail:
df_py = r.df_r
# This will fail:
df_tmp = df_py.groupby('c')[['a', 'b']].transform(lambda x: sum(x['a']))
Alternatively, you could use apply:
df_py = r.df_r
# This works:
df_tmp = df_py.groupby('c')[['a', 'b']].apply(lambda x: sum(x['a'])/sum(x['b']))
print(df_tmp)
## c
## q 0.294118
## w 0.478261
## dtype: float64
Apply takes a dataframe as input, so it ‘sees’ all columns. What is left is to merge the results back to the original dataframe.
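A quick sketch of that last step could look like this (naming the result column group_col is my choice):
# Turn the per-group Series into a dataframe and left join it back on the group key
df_tmp = df_tmp.rename('group_col').reset_index()
df_py = df_py.merge(df_tmp, how='left', on='c')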
Merging
Merging in data.table can be done in two ways:
- Creating a new object containing the merged results
- Merging in-place
In data.table merging is quite easy. I will only show LEFT JOINs here. For more info, check out this link.
df_merge = copy(df_r)
tracemem(df_merge)
## [1] "<0000000018C1C1E8>"
df_merge = df_r[df_r, on = "a"]
tracemem(df_merge)
## [1] "<0000000019C069C0>"
df_merge
## a b c i.b i.c
## 1: 1 7 q 7 q
## 2: 2 8 q 8 q
## 3: 3 9 q 9 q
## 4: 4 10 q 10 q
## 5: 5 11 w 11 w
## 6: 6 12 w 12 w
As expected, a copy is created and new memory is assigned. Now let’s do the same by reference:
df_merge = copy(df_r)
tracemem(df_merge)
## [1] "<000000001FD4D538>"
## Get all column names from df for merging without merge key
key = "a"
cols_merge = setdiff(colnames(df_r), key)
cols_name = paste0(cols_merge,"_m")
df_merge[df_r, (cols_name) := mget(cols_merge), on = key]
df_merge
## a b c b_m c_m
## 1: 1 7 q 7 q
## 2: 2 8 q 8 q
## 3: 3 9 q 9 q
## 4: 4 10 q 10 q
## 5: 5 11 w 11 w
## 6: 6 12 w 12 w
tracemem(df_merge)
## [1] "<000000001FD4D538>"
How can we do that in Pandas? Let’s take a look.
In Pandas you will find DataFrame.merge and DataFrame.join. DataFrame.join is just a
convenience function that saves you some typing when you want to join on dataframe
indices. We will stick to merge.
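(Just for reference, the index-based join that DataFrame.join covers would look roughly like this; the _m suffix is my choice.)
df_py = r.df_r
# Join the toy table to itself on the 'a' column used as index
df_join = df_py.set_index('a').join(df_py.set_index('a'), rsuffix='_m')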
df_py = r.df_r
print(id(df_py),'\n')
## 452735048
df_merge = df_py.merge(right=df_py, how='left', on='a',copy=False)
print(df_merge,'\n')
## a b_x c_x b_y c_y
## 0 1 7 q 7 q
## 1 2 8 q 8 q
## 2 3 9 q 9 q
## 3 4 10 q 10 q
## 4 5 11 w 11 w
## 5 6 12 w 12 w
print(id(df_merge))
## 534060168
Pandas has an option copy=False to avoid unnecessary copies, but as far as I can tell
you still need to allocate new memory for the merged result.
Updating
Updating in data.table is easy:
df_update = copy(df_r)
df_update[a >= 3, b := 3]
df_update
## a b c
## 1: 1 7 q
## 2: 2 8 q
## 3: 3 3 q
## 4: 4 3 q
## 5: 5 3 w
## 6: 6 3 w
In Pandas it works quite similarly:
df_update = r.df_r
df_update.loc[df_update['a']>=3, 'b'] = 3
print(df_update, '\n')
## a b c
## 0 1 7 q
## 1 2 8 q
## 2 3 3 q
## 3 4 3 q
## 4 5 3 w
## 5 6 3 w
Melt/Cast
The functions melt and dcast (data.table) / melt and pivot (Pandas) are also handy to know. We can use
melt to bring wide data into long form and dcast for the reverse. I will only show
melting here:
df_melted = melt(data = df_r, id.vars = "c", measure.vars = c("a", "b"),
variable.name = "var", value.name = "value")
df_melted
## c var value
## 1: q a 1
## 2: q a 2
## 3: q a 3
## 4: q a 4
## 5: w a 5
## 6: w a 6
## 7: q b 7
## 8: q b 8
## 9: q b 9
## 10: q b 10
## 11: w b 11
## 12: w b 12
Pandas syntax is almost 100% equivalent:
df_py = r.df_r
df_melted = pd.melt(frame=df_py, id_vars="c", value_vars=["a", "b"],
var_name="var", value_name="value")
print(df_melted)
## c var value
## 0 q a 1
## 1 q a 2
## 2 q a 3
## 3 q a 4
## 4 w a 5
## 5 w a 6
## 6 q b 7
## 7 q b 8
## 8 q b 9
## 9 q b 10
## 10 w b 11
## 11 w b 12
Indexing
Indices help you speed up lookups. However, since indices take time to build, they are only worthwhile if you repeatedly query your dataframe based on the index.
Let’s start again by looking at data.table:
df_index = copy(df_r)
# set keys
setkey(df_index, a)
# get keys
key(df_index)
## [1] "a"
# If you want to use character vectors:
# setkeyv(df_r, 'a')
# Query data.table based on key
df_index[.(1:3), .(a, b, c)]
## a b c
## 1: 1 7 q
## 2: 2 8 q
## 3: 3 9 q
Setting a key in data.table physically reorders the rows by reference in increasing order
and sets a sorted attribute for the key columns. So setting a key is comparable to
creating a clustered index in SQL Server. You can also generate non-clustered
indices like so:
setindex(df_index, b)
indices(df_index)
## [1] "b"
# Fast subsetting
df_index[.(7:9), .(a, b, c), on = 'b', verbose = TRUE]
## on= matches existing index, using index
## Starting bmerge ...done in 0.000sec
## Detected that j uses these columns: a,b,c
## a b c
## 1: 1 7 q
## 2: 2 8 q
## 3: 3 9 q
For more details, see the data.table vignette for working with keys and the vignette for working with indices.
In Pandas we can set indices the following way:
df_index = r.df_r
# Set index on existing DataFrame:
# Alternatively, you can include an index when you create the DataFrame
df_index.set_index(keys=['a'], drop=True, inplace=True)
# Check if index is set correctly:
print("Check index: \n", df_index, '\n')
## Check index:
## b c
## a
## 1 7 q
## 2 8 q
## 3 9 q
## 4 10 q
## 5 11 w
## 6 12 w
# Filter with index
print("Lookup using index: \n", df_index.loc[[1,2,3]],'\n')
## Lookup using index:
## b c
## a
## 1 7 q
## 2 8 q
## 3 9 q
# Conditional filter with index:
filter_a = (df_index.index.get_level_values('a') >= 3)
print("Conditional lookup using index: \n", df_index.loc[filter_a], '\n')
## Conditional lookup using index:
## b c
## a
## 3 9 q
## 4 10 q
## 5 11 w
## 6 12 w
# Removing index
df_index.reset_index(inplace = True)
print("Index removed: \n", df_index)
## Index removed:
## a b c
## 0 1 7 q
## 1 2 8 q
## 2 3 9 q
## 3 4 10 q
## 4 5 11 w
## 5 6 12 w
Chaining
Chaining operations can make code more readable. If you are coming from the tidyverse, fear not: data.table also works with magrittr :)
library(magrittr)
df_r[,.(b, c)] %>%
.[b > 9]
## b c
## 1: 10 q
## 2: 11 w
## 3: 12 w
You can also chain method calls in Pandas by ending each line with a backslash \.
Alternatively, you can skip the backslashes and wrap the entire block in parentheses ().
df_res = df_py[['b', 'c']] \
.loc[df_py['b'] > 9]
print(df_res)
## b c
## 3 10 q
## 4 11 w
## 5 12 w
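The same chain with parentheses instead of a backslash looks like this:
df_res = (
    df_py[['b', 'c']]
    .loc[df_py['b'] > 9]
)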
Do by group
In this section I want to show you how you can conveniently run calculations per group. Imagine you want to fit a model per group. One way to do it is to loop over the entire data.table and filter each run by the respective group. More convenient in my opinion is the following structure:
df_list = df_r[,list(data = list(.SD)), by = 'c']
df_list
## c data
## 1: q <data.table>
## 2: w <data.table>
We created a new data.table where we collapsed the data per group into a list. This has a couple of advantages:
- It is trivial to parallelize
- It works well with the map pattern in functional programming
- No filter errors are possible after creation
A very simple example could look like this:
library(foreach)
library(iterators)
# create row iterator
iter_row = iter(df_list, by = 'row')
# Iterate over dataset
# Register parallel backend -> use %dopar%
foreach(i = iter_row) %do% {
lm(a ~ b, data = i$data[[1]])
}
## [[1]]
##
## Call:
## lm(formula = a ~ b, data = i$data[[1]])
##
## Coefficients:
## (Intercept) b
## -6 1
##
##
## [[2]]
##
## Call:
## lm(formula = a ~ b, data = i$data[[1]])
##
## Coefficients:
## (Intercept) b
## -6 1
In Pandas the code is also very concise, but requires a bit more work to parallelize:
df_py = r.df_r
for group, data in df_py.groupby(['c']):
    print("Group : ", group,"\n Data: \n", data)
## Group : q
## Data:
## a b c
## 0 1 7 q
## 1 2 8 q
## 2 3 9 q
## 3 4 10 q
## Group : w
## Data:
## a b c
## 4 5 11 w
## 5 6 12 w
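If you actually need the parallel part, a minimal sketch using the standard library could look like this (my choice of backend; fitting a line with numpy.polyfit stands in for the lm() call above):
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def fit_group(item):
    # item is a (group_key, sub_frame) pair coming from groupby
    group, data = item
    slope, intercept = np.polyfit(data['b'], data['a'], deg=1)
    return group, intercept, slope

# Note: as a standalone script this should run under `if __name__ == '__main__':`
with ProcessPoolExecutor() as pool:
    results = list(pool.map(fit_group, list(df_py.groupby(['c']))))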
Lagging/leading variables
Both data.table and Pandas have a shift
function/method that allows you to lag/lead
columns:
df_shift = copy(df_r)
# Order columns before shifting
setorderv(df_shift, cols = "a")
# Create column names
shift_col_names <- paste(rep(c("a", "b"), each = 2), "lag", 1:2, sep="_")
# Shift columns
df_shift[, (shift_col_names) := shift(.SD, n = 1:2, type = "lag"),
.SDcols = c("a", "b"), by = "c"]
df_shift
## a b c a_lag_1 a_lag_2 b_lag_1 b_lag_2
## 1: 1 7 q NA NA NA NA
## 2: 2 8 q 1 NA 7 NA
## 3: 3 9 q 2 1 8 7
## 4: 4 10 q 3 2 9 8
## 5: 5 11 w NA NA NA NA
## 6: 6 12 w 5 NA 11 NA
In Pandas you can do:
df_shift = r.df_r
for var in ['a', 'b']:
    for lag in range(1, 3):
        df_shift[var+'_lag_'+str(lag)] = df_shift.groupby(['c'])[var].shift(lag)
print(df_shift)
## a b c a_lag_1 a_lag_2 b_lag_1 b_lag_2
## 0 1 7 q NaN NaN NaN NaN
## 1 2 8 q 1.0 NaN 7.0 NaN
## 2 3 9 q 2.0 1.0 8.0 7.0
## 3 4 10 q 3.0 2.0 9.0 8.0
## 4 5 11 w NaN NaN NaN NaN
## 5 6 12 w 5.0 NaN 11.0 NaN
Rolling calculations
Rolling calculations are especially important when working with time-series data.
For data.table there are a couple of packages and approaches that allow rolling calculations:
- RcppRoll: optimized C++ code for max, mean, median, min, prod, sd, sum and var. Partial application has not been implemented yet.
- zoo::rollapply: allows arbitrary rolling functions, but is slower
- Combining Reduce() with shift(), e.g. Reduce(`+`, shift(x, 0:1)) / 2 for a two-period rolling mean
I will quickly show you how to work with zoo::rollapply():
df_rolling = copy(df_r)
df_rolling[, mean := shift(zoo::rollapply(a, width = 2, mean, align = "right",
                                          partial = F, fill = NA),
                           n = 1),
           by = c]
df_rolling
## a b c mean
## 1: 1 7 q NA
## 2: 2 8 q NA
## 3: 3 9 q 1.5
## 4: 4 10 q 2.5
## 5: 5 11 w NA
## 6: 6 12 w NA
You can calculate rolling statistics for multiple columns in one function call by
using .SD. Note that shift() and rollapply() take the ordering of your rows
as given!
In Pandas, it is not as straightforward, I am afraid, because .shift() and .rolling()
behave differently. While .shift() does not create a new index, .rolling() does. This
means that you need to be careful when you assign the results from .rolling() back to
your dataframe (check that the indices match!). To complicate things further, you cannot simply chain .shift(n).rolling(m),
because .shift() removes the grouping, so the .rolling() results can be incorrect.
We will therefore define a function to do the following:
- We perform the rolling calculation
- We remove the additional index level created by grouping (level 0)
- We assign the result to the original DataFrame based on the old index
- We shift the values
import numpy as np
from typing import List
df_rolling = r.df_r
# Calculate 2-day mean from day before
def calc_rolling(df: pd.DataFrame, fun, grouping: List[str],
                 window: int, y: str) -> None:
    df_gd = df.groupby(grouping)
    col_name = y+'_'+fun.__name__+'_roll'+str(window)
    # raw=False silences a warning, because apply currently works with numpy arrays,
    # but will be switched to Pandas Series
    df[col_name] = df_gd[y] \
        .rolling(window) \
        .apply(lambda x: fun(x), raw=False) \
        .reset_index(level=list(range(len(grouping))),
                     drop=True
                     )
    df[col_name] = df.groupby(grouping)[col_name].shift(1)
calc_rolling(df_rolling, fun=np.mean, grouping=['c'], window=2, y='a')
print(df_rolling)
## a b c a_mean_roll2
## 0 1 7 q NaN
## 1 2 8 q NaN
## 2 3 9 q 1.5
## 3 4 10 q 2.5
## 4 5 11 w NaN
## 5 6 12 w NaN
In Pandas the rolling method also supports a time period offset (only valid for datetime-like indices). By default, the Pandas rolling method aligns the window to the right, but centering is also possible. Partial application (i.e. using a smaller window than specified at the beginning) is not possible at the moment.
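For illustration, a time-based window on a small made-up datetime-indexed frame (not part of our toy data) would look roughly like this:
ts = pd.DataFrame({'x': [1.0, 2.0, 4.0, 8.0]},
                  index=pd.to_datetime(['2019-01-01', '2019-01-02',
                                        '2019-01-04', '2019-01-05']))
# '2D' looks back two calendar days from each row's timestamp
print(ts.rolling('2D').mean())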
Some final notes
Changing data in-place can offer substantial speedups and reduce memory usage considerably.
However, while data.table is very explicit about when assignment is made by reference
(:= and all set* functions), Pandas is less so. There is actually no guarantee that Pandas
performs operations in-place, even if you specify inplace=True; see for example this stackoverflow thread.
Also note that modifying data in-place can make your operations more difficult to understand/debug.
If you are coming from R, be careful when you assign a column from one Pandas DataFrame to another DataFrame: Pandas will align the values based on the index, not on position!
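A small made-up example of this gotcha:
src = pd.DataFrame({'x': [10, 20, 30]}, index=[2, 1, 0])
dst = pd.DataFrame({'y': [1, 2, 3]}, index=[0, 1, 2])
# Values are matched by index label, not by position, so dst gets 30, 20, 10
dst['x'] = src['x']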
I hope you found this post useful! If you find any errors, please open an issue on Github.