R and Python are both great languages for data analysis. While they are remarkably
similar in some aspects, they are drastically different in others. In this post,
I will focus on the similarities and differences between Pandas and data.table,
two of the most prominent data manipulation packages in Python/R.
There is already an excellent post by Fisseha Berhane that nicely compares basic data manipulations, which you can find on this website.
I want to focus on how more advanced data.table operations, such as in-place modifications, map to Pandas.
Let’s get started by setting up our environment. I will use the reticulate
package to work with both R and Python in tandem.
library(data.table)
library(reticulate)
We will use the following super simple dataset:
df_r = data.table(a = 1:6,
                  b = 7:12,
                  c = c('q', 'q', 'q', 'q', 'w', 'w')
                  )
setDT(df_r)
Assignment: copy vs in-place
R is a functional language, and data is therefore generally immutable, i.e. ‘changing’ a
dataframe in place is not possible. R will copy the original dataframe, apply your updates
and store the result as a new dataframe. While immutable data makes it easier to reason
about your code, it takes more time and memory to perform computations compared to
modifying data in-place. This is one of the primary reasons why data.table is
so blazing fast and memory efficient: it avoids unnecessary copies and directly
modifies your original data.frame (see my post “Using data.table deep copy” for more
info).
So how can we add a column by reference?
New column: in-place & copy
We will start by adding a new column that is simply a multiple of an existing one:
df_add_col = copy(df_r)
tracemem(df_add_col)
## [1] "<00000000198074A8>"
df_add_col[, new_col := a * 2]
df_add_col
## a b c new_col
## 1: 1 7 q 2
## 2: 2 8 q 4
## 3: 3 9 q 6
## 4: 4 10 q 8
## 5: 5 11 w 10
## 6: 6 12 w 12
tracemem(df_add_col)
## [1] "<00000000198074A8>"
import pandas as pd
df_py = r.df_r
print('Old memory location: '+str(id(df_py)),'\n')
## Old memory location: 517604352
df_py['new_col'] = df_py['a'] * 2
print(df_py,'\n')
## a b c new_col
## 0 1 7 q 2
## 1 2 8 q 4
## 2 3 9 q 6
## 3 4 10 q 8
## 4 5 11 w 10
## 5 6 12 w 12
print('New memory location: '+str(id(df_py)))
## New memory location: 517604352
In both cases, we can see that the memory location did not change.
If we want to return a new dataframe (and allocate new memory), we have to do it like this:
df_add_col = copy(df_r)
tracemem(df_add_col)
## [1] "<000000001F9A94C0>"
df_add_col = df_add_col[, .(a, b, c, new_col = a * 2)]
df_add_col
## a b c new_col
## 1: 1 7 q 2
## 2: 2 8 q 4
## 3: 3 9 q 6
## 4: 4 10 q 8
## 5: 5 11 w 10
## 6: 6 12 w 12
tracemem(df_add_col)
## [1] "<000000001FABF538>"
Likewise with Pandas we have to use:
import pandas as pd
df_py = r.df_r
print('Old memory location: '+str(id(df_py)),'\n')
## Old memory location: 525783280
df_py = df_py.assign(new_col = df_py['a']*2)
print(df_py,'\n')
## a b c new_col
## 0 1 7 q 2
## 1 2 8 q 4
## 2 3 9 q 6
## 3 4 10 q 8
## 4 5 11 w 10
## 5 6 12 w 12
print('New Memory location: '+str(id(df_py)))
## New Memory location: 525784288
Using assign
copies the entire dataframe and assigns it to a new memory location.
New column: grouped calculation
We often want to add additional columns based on groups in our dataframe. Let’s take a look:
df_add_group_col = copy(df_r)
tracemem(df_add_group_col)
## [1] "<000000001F9A94C0>"
df_add_group_col[, group_col := sum(a) / sum(b), by = c]
df_add_group_col
## a b c group_col
## 1: 1 7 q 0.2941176
## 2: 2 8 q 0.2941176
## 3: 3 9 q 0.2941176
## 4: 4 10 q 0.2941176
## 5: 5 11 w 0.4782609
## 6: 6 12 w 0.4782609
tracemem(df_add_group_col)
## [1] "<000000001F9A94C0>"
In Pandas we have to do the following:
df_py = r.df_r
print('Old memory location: '+str(id(df_py)),'\n')
## Old memory location: 525786640
df_gd = df_py.groupby('c')
df_py['group_col'] = df_gd['a'].transform(lambda x: sum(x))/df_gd['b'].transform('sum')
print(df_py,'\n')
## a b c group_col
## 0 1 7 q 0.294118
## 1 2 8 q 0.294118
## 2 3 9 q 0.294118
## 3 4 10 q 0.294118
## 4 5 11 w 0.478261
## 5 6 12 w 0.478261
print('New memory location: '+str(id(df_py)))
## New memory location: 525786640
In Pandas transform
is used to broadcast a groupby result back to the original
dataframe, i.e. transform
is similar to SQL window functions. I used a lambda
function in my code above, but you could simply write transform('sum')
in this
simple case as well.
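The shorter version without the lambda would simply be:
df_py['group_col'] = df_gd['a'].transform('sum') / df_gd['b'].transform('sum')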
Unfortunately, transform works on columns in sequence, i.e. first column a is
passed to our lambda function and then column b. So this code will fail:
df_py = r.df_r
# This will fail:
df_tmp = df_py.groupby('c')[['a', 'b']].transform(lambda x: sum(x['a']))
Alternatively, you could use apply:
df_py = r.df_r
# This works:
df_tmp = df_py.groupby('c')[['a', 'b']].apply(lambda x: sum(x['a'])/sum(x['b']))
print(df_tmp)
## c
## q 0.294118
## w 0.478261
## dtype: float64
Apply takes a dataframe as input, so it ‘sees’ all columns. What is left is to merge the results back to the original dataframe.
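A quick sketch of that last step could look like this (naming the result column group_col is my choice):
# Turn the per-group Series into a dataframe and left join it back on the group key
df_tmp = df_tmp.rename('group_col').reset_index()
df_py = df_py.merge(df_tmp, how='left', on='c')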
Merging
Merging in data.table can be done in two ways:
- Creating a new object containing the merged results
- Merging in-place
In data.table merging is quite easy. I will only show LEFT JOINs here. For more info, check out this link.
df_merge = copy(df_r)
tracemem(df_merge)
## [1] "<0000000018C1C1E8>"
df_merge = df_r[df_r, on = "a"]
tracemem(df_merge)
## [1] "<0000000019C069C0>"
df_merge
## a b c i.b i.c
## 1: 1 7 q 7 q
## 2: 2 8 q 8 q
## 3: 3 9 q 9 q
## 4: 4 10 q 10 q
## 5: 5 11 w 11 w
## 6: 6 12 w 12 w
As expected, a copy is created and new memory is assigned. Now let’s do the same by reference:
df_merge = copy(df_r)
tracemem(df_merge)
## [1] "<000000001FD4D538>"
## Get all column names from df for merging without merge key
key = "a"
cols_merge = setdiff(colnames(df_r), key)
cols_name = paste0(cols_merge,"_m")
df_merge[df_r, (cols_name) := mget(cols_merge), on = key]
df_merge
## a b c b_m c_m
## 1: 1 7 q 7 q
## 2: 2 8 q 8 q
## 3: 3 9 q 9 q
## 4: 4 10 q 10 q
## 5: 5 11 w 11 w
## 6: 6 12 w 12 w
tracemem(df_merge)
## [1] "<000000001FD4D538>"
How can we do that in Pandas? Let’s take a look.
In Pandas you will find DataFrame.merge and DataFrame.join. DataFrame.join is just a
convenience function that saves you some typing when you want to join on dataframe
indices. We will stick to merge.
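(Just for reference, the index-based join that DataFrame.join covers would look roughly like this; the _m suffix is my choice.)
df_py = r.df_r
# Join the toy table to itself on the 'a' column used as index
df_join = df_py.set_index('a').join(df_py.set_index('a'), rsuffix='_m')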
df_py = r.df_r
print(id(df_py),'\n')
## 452735048
df_merge = df_py.merge(right=df_py, how='left', on='a',copy=False)
print(df_merge,'\n')
## a b_x c_x b_y c_y
## 0 1 7 q 7 q
## 1 2 8 q 8 q
## 2 3 9 q 9 q
## 3 4 10 q 10 q
## 4 5 11 w 11 w
## 5 6 12 w 12 w
print(id(df_merge))
## 534060168
Pandas has an option copy=False to avoid unnecessary copies, but as far as I can tell
you still need to allocate new memory for the merged result.
Updating
Updating in data.table is easy:
df_update = copy(df_r)
df_update[a >= 3, b := 3]
df_update
## a b c
## 1: 1 7 q
## 2: 2 8 q
## 3: 3 3 q
## 4: 4 3 q
## 5: 5 3 w
## 6: 6 3 w
In Pandas it works quite similarly:
df_update = r.df_r
df_update.loc[df_update['a']>=3, 'b'] = 3
print(df_update, '\n')
## a b c
## 0 1 7 q
## 1 2 8 q
## 2 3 3 q
## 3 4 3 q
## 4 5 3 w
## 5 6 3 w
Melt/Cast
The functions melt and dcast (data.table) / melt and pivot (Pandas) are also handy to know. We can use
melt to bring wide data into long form and dcast for the reverse. I will only show
melting here:
df_melted = melt(data = df_r, id.vars = "c", measure.vars = c("a", "b"),
variable.name = "var", value.name = "value")
df_melted
## c var value
## 1: q a 1
## 2: q a 2
## 3: q a 3
## 4: q a 4
## 5: w a 5
## 6: w a 6
## 7: q b 7
## 8: q b 8
## 9: q b 9
## 10: q b 10
## 11: w b 11
## 12: w b 12
Pandas syntax is almost 100% equivalent:
df_py = r.df_r
df_melted = pd.melt(frame=df_py, id_vars="c", value_vars=["a", "b"],
var_name="var", value_name="value")
print(df_melted)
## c var value
## 0 q a 1
## 1 q a 2
## 2 q a 3
## 3 q a 4
## 4 w a 5
## 5 w a 6
## 6 q b 7
## 7 q b 8
## 8 q b 9
## 9 q b 10
## 10 w b 11
## 11 w b 12
Indexing
Indices help you speed up lookups. However, since indices take time to build, they are only worthwhile if you repeatedly query your dataframe based on the index.
Let’s start again by looking at data.table:
df_index = copy(df_r)
# set keys
setkey(df_index, a)
# get keys
key(df_index)
## [1] "a"
# If you want to use character vectors:
# setkeyv(df_r, 'a')
# Query data.table based on key
df_index[.(1:3), .(a, b, c)]
## a b c
## 1: 1 7 q
## 2: 2 8 q
## 3: 3 9 q
Setting a key in data.table physically reorders the rows by reference in increasing order
and sets a sorted attribute for the key columns. So setting a key is comparable to
creating a clustered index in SQL Server. You can also generate non-clustered
indices like so:
setindex(df_index, b)
indices(df_index)
## [1] "b"
# Fast subsetting
df_index[.(7:9), .(a, b, c), on = 'b', verbose = TRUE]
## on= matches existing index, using index
## Starting bmerge ...done in 0.000sec
## Detected that j uses these columns: a,b,c
## a b c
## 1: 1 7 q
## 2: 2 8 q
## 3: 3 9 q
For more details, see the data.table vignette for working with keys and the vignette for working with indices.
In Pandas we can set indices the following way:
df_index = r.df_r
# Set index on existing DataFrame:
# Alternatively, you can include an index when you create the DataFrame
df_index.set_index(keys=['a'], drop=True, inplace=True)
# Check if index is set correctly:
print("Check index: \n", df_index, '\n')
## Check index:
## b c
## a
## 1 7 q
## 2 8 q
## 3 9 q
## 4 10 q
## 5 11 w
## 6 12 w
# Filter with index
print("Lookup using index: \n", df_index.loc[[1,2,3]],'\n')
## Lookup using index:
## b c
## a
## 1 7 q
## 2 8 q
## 3 9 q
# Conditional filter with index:
filter_a = (df_index.index.get_level_values('a') >= 3)
print("Conditional lookup using index: \n", df_index.loc[filter_a], '\n')
## Conditional lookup using index:
## b c
## a
## 3 9 q
## 4 10 q
## 5 11 w
## 6 12 w
# Removing index
df_index.reset_index(inplace = True)
print("Index removed: \n", df_index)
## Index removed:
## a b c
## 0 1 7 q
## 1 2 8 q
## 2 3 9 q
## 3 4 10 q
## 4 5 11 w
## 5 6 12 w
Chaining
Chaining operations can make code more readable. If you are coming from the tidyverse, fear not: data.table also works with magrittr :)
library(magrittr)
df_r[,.(b, c)] %>%
.[b > 9]
## b c
## 1: 10 q
## 2: 11 w
## 3: 12 w
You can also chain method calls in Pandas by ending each line with a backslash \.
Alternatively, you can skip the backslashes and wrap the entire block in parentheses ().
df_res = df_py[['b', 'c']] \
.loc[df_py['b'] > 9]
print(df_res)
## b c
## 3 10 q
## 4 11 w
## 5 12 w
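The same chain with parentheses instead of a backslash looks like this:
df_res = (
    df_py[['b', 'c']]
    .loc[df_py['b'] > 9]
)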
Do by group
In this section I want to show you how you can conveniently run calculations per group. Imagine you want to fit a model per group. One way to do it is to loop over the entire data.table and filter each run by the respective group. More convenient in my opinion is the following structure:
df_list = df_r[,list(data = list(.SD)), by = 'c']
df_list
## c data
## 1: q <data.table>
## 2: w <data.table>
We created a new data.table where we collapsed the data per group into a list. This has a couple of advantages:
- It is trivial to parallelize
- It works well with the map pattern in functional programming
- No filter errors are possible after creation
A very simple example could look like this:
library(foreach)
library(iterators)
# create row iterator
iter_row = iter(df_list, by = 'row')
# Iterate over dataset
# Register parallel backend -> use %dopar%
foreach(i = iter_row) %do% {
lm(a ~ b, data = i$data[[1]])
}
## [[1]]
##
## Call:
## lm(formula = a ~ b, data = i$data[[1]])
##
## Coefficients:
## (Intercept) b
## -6 1
##
##
## [[2]]
##
## Call:
## lm(formula = a ~ b, data = i$data[[1]])
##
## Coefficients:
## (Intercept) b
## -6 1
In Pandas the code is also very concise, but requires a bit more work to parallelize:
df_py = r.df_r
for group, data in df_py.groupby(['c']):
    print("Group : ", group,"\n Data: \n", data)
## Group : q
## Data:
## a b c
## 0 1 7 q
## 1 2 8 q
## 2 3 9 q
## 3 4 10 q
## Group : w
## Data:
## a b c
## 4 5 11 w
## 5 6 12 w
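If you actually need the parallel part, a minimal sketch using the standard library could look like this (my choice of backend; fitting a line with numpy.polyfit stands in for the lm() call above):
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def fit_group(item):
    # item is a (group_key, sub_frame) pair coming from groupby
    group, data = item
    slope, intercept = np.polyfit(data['b'], data['a'], deg=1)
    return group, intercept, slope

# Note: as a standalone script this should run under `if __name__ == '__main__':`
with ProcessPoolExecutor() as pool:
    results = list(pool.map(fit_group, list(df_py.groupby(['c']))))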
Lagging/leading variables
Both data.table and Pandas have a shift
function/method that allows you to lag/lead
columns:
df_shift = copy(df_r)
# Order columns before shifting
setorderv(df_shift, cols = "a")
# Create column names
shift_col_names <- paste(rep(c("a", "b"), each = 2), "lag", 1:2, sep="_")
# Shift columns
df_shift[, (shift_col_names) := shift(.SD, n = 1:2, type = "lag"),
.SDcols = c("a", "b"), by = "c"]
df_shift
## a b c a_lag_1 a_lag_2 b_lag_1 b_lag_2
## 1: 1 7 q NA NA NA NA
## 2: 2 8 q 1 NA 7 NA
## 3: 3 9 q 2 1 8 7
## 4: 4 10 q 3 2 9 8
## 5: 5 11 w NA NA NA NA
## 6: 6 12 w 5 NA 11 NA
In Pandas you can do:
df_shift = r.df_r
for var in ['a', 'b']:
    for lag in range(1, 3):
        df_shift[var+'_lag_'+str(lag)] = df_shift.groupby(['c'])[var].shift(lag)
print(df_shift)
## a b c a_lag_1 a_lag_2 b_lag_1 b_lag_2
## 0 1 7 q NaN NaN NaN NaN
## 1 2 8 q 1.0 NaN 7.0 NaN
## 2 3 9 q 2.0 1.0 8.0 7.0
## 3 4 10 q 3.0 2.0 9.0 8.0
## 4 5 11 w NaN NaN NaN NaN
## 5 6 12 w 5.0 NaN 11.0 NaN
Rolling calculations
Rolling calculations are especially important when working with time-series data.
For data.table there are a couple of packages and approaches that allow rolling calculations:
- RcppRoll: optimized C++ code for max, mean, median, min, prod, sd, sum and var. Partial application has not been implemented yet.
- zoo::rollapply: allows arbitrary rolling functions, but is slower
- Combining Reduce() with shift(), e.g. Reduce(`+`, shift(x, 0:1)) / 2 for a two-period rolling mean
I will quickly show you how to work with zoo::rollapply():
df_rolling = copy(df_r)
df_rolling[, mean := shift(zoo::rollapply(a, width = 2, mean, align = "right",
                                          partial = F, fill = NA),
                           n = 1),
           by = c]
df_rolling
## a b c mean
## 1: 1 7 q NA
## 2: 2 8 q NA
## 3: 3 9 q 1.5
## 4: 4 10 q 2.5
## 5: 5 11 w NA
## 6: 6 12 w NA
You can calculate rolling statistics for multiple columns in one function call by
using .SD. Note that shift() and rollapply() take the ordering of your rows
as given!
In Pandas, it is not as straightforward, I am afraid, because .shift() and .rolling()
behave differently. While .shift() does not create a new index, .rolling() does. This
means that you need to be careful when you assign the results from .rolling() back to
your dataframe (check that the indices match!). To complicate things further, you cannot simply chain .shift(n).rolling(m),
because .shift() removes the grouping, so the .rolling() results can be incorrect.
We will therefore define a function to do the following:
- We perform the rolling calculation
- We remove the additional index level created by grouping (level 0)
- We assign the result to the original DataFrame based on the old index
- We shift the values
import numpy as np
from typing import List
df_rolling = r.df_r
# Calculate 2-day mean from day before
def calc_rolling(df: pd.DataFrame, fun, grouping: List[str],
                 window: int, y: str) -> None:
    df_gd = df.groupby(grouping)
    col_name = y+'_'+fun.__name__+'_roll'+str(window)
    # raw=False silences a warning, because apply currently works with numpy arrays,
    # but will be switched to Pandas Series
    df[col_name] = df_gd[y] \
        .rolling(window) \
        .apply(lambda x: fun(x), raw=False) \
        .reset_index(level=list(range(len(grouping))),
                     drop=True
                     )
    df[col_name] = df.groupby(grouping)[col_name].shift(1)
calc_rolling(df_rolling, fun=np.mean, grouping=['c'], window=2, y='a')
print(df_rolling)
## a b c a_mean_roll2
## 0 1 7 q NaN
## 1 2 8 q NaN
## 2 3 9 q 1.5
## 3 4 10 q 2.5
## 4 5 11 w NaN
## 5 6 12 w NaN
In Pandas the rolling method also supports a time period offset (only valid for datetime-like indices). By default, the Pandas rolling method aligns the window to the right, but centering is also possible. Partial application (i.e. using a smaller window than specified at the beginning) is not possible at the moment.
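For illustration, a time-based window on a small made-up datetime-indexed frame (not part of our toy data) would look roughly like this:
ts = pd.DataFrame({'x': [1.0, 2.0, 4.0, 8.0]},
                  index=pd.to_datetime(['2019-01-01', '2019-01-02',
                                        '2019-01-04', '2019-01-05']))
# '2D' looks back two calendar days from each row's timestamp
print(ts.rolling('2D').mean())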
Some final notes
Changing data in-place can offer substantial speedups and reduce memory usage considerably.
However, while data.table is very explicit about when assignment is made by reference
(:= and all set* functions), Pandas is less so. There is actually no guarantee that Pandas
performs operations in-place, even if you specify inplace=True; see for example this stackoverflow thread.
Also note that modifying data in-place can make your operations more difficult to understand/debug.
If you are coming from R, be careful when you assign a column from one Pandas DataFrame to another DataFrame: Pandas will align the values based on the index, not on position!
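A small made-up example of this gotcha:
src = pd.DataFrame({'x': [10, 20, 30]}, index=[2, 1, 0])
dst = pd.DataFrame({'y': [1, 2, 3]}, index=[0, 1, 2])
# Values are matched by index label, not by position, so dst gets 30, 20, 10
dst['x'] = src['x']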
I hope you found this post useful! If you find any errors, please open an issue on Github.