A pandas, dyplr, data.table, and pydatatable Repository

Motivation

         The content and formatting of this post is inspired by the following comprehensive post by Atrebas and the pydatable page. You could also find Atrebas on Twitter here. I use both R and Python extensively in my work everyday, especially the data wrangling packages and library data.table (pydatatable), dplyr and pandas. I find that, in my case, I often have to translate between similar but different syntax when switching from one another. In R, there are ways to accomplishing tasks and there are more Pythonic ways to do things in Python. This is an attempt to create a list of code snippets that can hopefully become a reference for myself.

         The outputs of code blocks are hidden by default, but they may be shown by clicking the button below:



Packages and Library

library(tibble)
library(tidyr)
library(dplyr)
library(data.table)
import pandas as pd
import numpy as np
from random import sample
from datatable import dt, f, by, g, join, sort, update, ifelse, first


DataFrame Creation

# Pandas
nums = sample(range(1, 20), 9)
PD = pd.DataFrame({
  'A': nums,
  'B': range(1, 10),
  'C': ["b"]*3 + ["a"]*3 + ["c"]*3,
  'D': ['startswith_this', np.NaN, 'c', 'd', 'e', np.NaN, 'this_endwith', 'k', 'z']
})
PD 
##     A  B  C                D
## 0  19  1  b  startswith_this
## 1   4  2  b              NaN
## 2   2  3  b                c
## 3   9  4  a                d
## 4  11  5  a                e
## 5  15  6  a              NaN
## 6  16  7  c     this_endwith
## 7   6  8  c                k
## 8  12  9  c                z
# Pydatatable
PY_DT = dt.Frame({
  'A': nums,
  'B': range(1, 10),
  'C': ["b"]*3 + ["a"]*3 + ["c"]*3,
  'D': ['startswith_this', None, 'c', 'd', 'e', None, 'this_endwith', 'k', 'z']
})
PY_DT
##    |     A      B  C      D              
##    | int32  int32  str32  str32          
## -- + -----  -----  -----  ---------------
##  0 |    19      1  b      startswith_this
##  1 |     4      2  b      NA             
##  2 |     2      3  b      c              
##  3 |     9      4  a      d              
##  4 |    11      5  a      e              
##  5 |    15      6  a      NA             
##  6 |    16      7  c      this_endwith   
##  7 |     6      8  c      k              
##  8 |    12      9  c      z              
## [9 rows x 4 columns]
# Dplyr
TB <- tibble(
  A = py$nums,
  B = 1:9,
  C = rep(c("b", "a", "c"), each = 3),
  D = c("startswith_this", NA_character_, "c", "d", "e", NA_character_, "this_endwith", "k", "z")
)
TB
## # A tibble: 9 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    19     1 b     startswith_this
## 2     4     2 b     <NA>           
## 3     2     3 b     c              
## 4     9     4 a     d              
## 5    11     5 a     e              
## 6    15     6 a     <NA>           
## 7    16     7 c     this_endwith   
## 8     6     8 c     k              
## 9    12     9 c     z
# Data.table
R_DT <- data.table(
  A = py$nums,
  B = 1:9,
  C = rep(c("b", "a", "c"), each = 3),
  D = c("startswith_this", NA_character_, "c", "d", "e", NA_character_, "this_endwith", "k", "z")
)
R_DT
##     A B C               D
## 1: 19 1 b startswith_this
## 2:  4 2 b            <NA>
## 3:  2 3 b               c
## 4:  9 4 a               d
## 5: 11 5 a               e
## 6: 15 6 a            <NA>
## 7: 16 7 c    this_endwith
## 8:  6 8 c               k
## 9: 12 9 c               z

Row Selection

Select a single row

# Pandas
PD.loc[2]
## A    2
## B    3
## C    b
## D    c
## Name: 2, dtype: object
# Pydatatable
PY_DT[2, :]
##    |     A      B  C      D    
##    | int32  int32  str32  str32
## -- + -----  -----  -----  -----
##  0 |     2      3  b      c    
## [1 row x 4 columns]
# Dplyr
TB[2, ]
# Or
slice(TB, 2)
## # A tibble: 1 × 4
##       A     B C     D    
##   <int> <int> <chr> <chr>
## 1     4     2 b     <NA> 
## # A tibble: 1 × 4
##       A     B C     D    
##   <int> <int> <chr> <chr>
## 1     4     2 b     <NA>
# Data.table
R_DT[2]
##    A B C    D
## 1: 4 2 b <NA>

Select several rows by their indices

Note that Python uses zero-based indexing.

# Pandas
PD.iloc[[2, 3, 4], :]
##     A  B  C  D
## 2   2  3  b  c
## 3   9  4  a  d
## 4  11  5  a  e
# Pydatatable
PY_DT[[2, 3, 4], :]
##    |     A      B  C      D    
##    | int32  int32  str32  str32
## -- + -----  -----  -----  -----
##  0 |     2      3  b      c    
##  1 |     9      4  a      d    
##  2 |    11      5  a      e    
## [3 rows x 4 columns]
# Dplyr
TB[c(3, 4, 5), ]
# Or
slice(TB, c(3, 4, 5))
## # A tibble: 3 × 4
##       A     B C     D    
##   <int> <int> <chr> <chr>
## 1     2     3 b     c    
## 2     9     4 a     d    
## 3    11     5 a     e    
## # A tibble: 3 × 4
##       A     B C     D    
##   <int> <int> <chr> <chr>
## 1     2     3 b     c    
## 2     9     4 a     d    
## 3    11     5 a     e
# Data.table
R_DT[c(3, 4, 5), ]
# OR
R_DT[c(3, 4, 5)] 
##     A B C D
## 1:  2 3 b c
## 2:  9 4 a d
## 3: 11 5 a e
##     A B C D
## 1:  2 3 b c
## 2:  9 4 a d
## 3: 11 5 a e

Select a slice of rows by position

# Pandas
PD.iloc[2:5]
# Or
PD.iloc[range(2, 5)]
##     A  B  C  D
## 2   2  3  b  c
## 3   9  4  a  d
## 4  11  5  a  e
##     A  B  C  D
## 2   2  3  b  c
## 3   9  4  a  d
## 4  11  5  a  e
# Pydatatable
PY_DT[2:5, :]
# Or
PY_DT[range(2, 5), :]
##    |     A      B  C      D    
##    | int32  int32  str32  str32
## -- + -----  -----  -----  -----
##  0 |     2      3  b      c    
##  1 |     9      4  a      d    
##  2 |    11      5  a      e    
## [3 rows x 4 columns]
##    |     A      B  C      D    
##    | int32  int32  str32  str32
## -- + -----  -----  -----  -----
##  0 |     2      3  b      c    
##  1 |     9      4  a      d    
##  2 |    11      5  a      e    
## [3 rows x 4 columns]
# Dplyr
TB[3:5, ]
# Or
slice(TB, 3:5)
## # A tibble: 3 × 4
##       A     B C     D    
##   <int> <int> <chr> <chr>
## 1     2     3 b     c    
## 2     9     4 a     d    
## 3    11     5 a     e    
## # A tibble: 3 × 4
##       A     B C     D    
##   <int> <int> <chr> <chr>
## 1     2     3 b     c    
## 2     9     4 a     d    
## 3    11     5 a     e
# Data.table
R_DT[3:5, ]
# Or
R_DT[3:5] 
##     A B C D
## 1:  2 3 b c
## 2:  9 4 a d
## 3: 11 5 a e
##     A B C D
## 1:  2 3 b c
## 2:  9 4 a d
## 3: 11 5 a e

Select every second row

# Pandas
PD.iloc[::2]
##     A  B  C                D
## 0  19  1  b  startswith_this
## 2   2  3  b                c
## 4  11  5  a                e
## 6  16  7  c     this_endwith
## 8  12  9  c                z
# Pydatatable
PY_DT[::2, :]
##    |     A      B  C      D              
##    | int32  int32  str32  str32          
## -- + -----  -----  -----  ---------------
##  0 |    19      1  b      startswith_this
##  1 |     2      3  b      c              
##  2 |    11      5  a      e              
##  3 |    16      7  c      this_endwith   
##  4 |    12      9  c      z              
## [5 rows x 4 columns]
# Dplyr
TB[seq.int(from = 1, to = nrow(TB), by = 2), ]
# Or
slice(TB, seq.int(from = 1, to = nrow(TB), by = 2))
## # A tibble: 5 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    19     1 b     startswith_this
## 2     2     3 b     c              
## 3    11     5 a     e              
## 4    16     7 c     this_endwith   
## 5    12     9 c     z              
## # A tibble: 5 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    19     1 b     startswith_this
## 2     2     3 b     c              
## 3    11     5 a     e              
## 4    16     7 c     this_endwith   
## 5    12     9 c     z
# Data.table
R_DT[seq.int(from = 1, to = nrow(TB), by = 2), ]
# Or
R_DT[seq.int(from = 1, to = nrow(TB), by = 2)] 
##     A B C               D
## 1: 19 1 b startswith_this
## 2:  2 3 b               c
## 3: 11 5 a               e
## 4: 16 7 c    this_endwith
## 5: 12 9 c               z
##     A B C               D
## 1: 19 1 b startswith_this
## 2:  2 3 b               c
## 3: 11 5 a               e
## 4: 16 7 c    this_endwith
## 5: 12 9 c               z

Select all rows other than rows 1, 2, 3, and 9

# Pandas
# Start at index 3 and stop at 8, and 1 step at a time
PD.iloc[slice(3, 8, None)]
##     A  B  C             D
## 3   9  4  a             d
## 4  11  5  a             e
## 5  15  6  a           NaN
## 6  16  7  c  this_endwith
## 7   6  8  c             k
# Pydatatable
# Start at index 3 and stop at 8, and 1 step at a time
PY_DT[slice(3, 8, None), :]
##    |     A      B  C      D           
##    | int32  int32  str32  str32       
## -- + -----  -----  -----  ------------
##  0 |     9      4  a      d           
##  1 |    11      5  a      e           
##  2 |    15      6  a      NA          
##  3 |    16      7  c      this_endwith
##  4 |     6      8  c      k           
## [5 rows x 4 columns]
# Dplyr
TB[-c(1:3, 9), ]
# Or
slice(TB, -c(1:3, 9))
## # A tibble: 5 × 4
##       A     B C     D           
##   <int> <int> <chr> <chr>       
## 1     9     4 a     d           
## 2    11     5 a     e           
## 3    15     6 a     <NA>        
## 4    16     7 c     this_endwith
## 5     6     8 c     k           
## # A tibble: 5 × 4
##       A     B C     D           
##   <int> <int> <chr> <chr>       
## 1     9     4 a     d           
## 2    11     5 a     e           
## 3    15     6 a     <NA>        
## 4    16     7 c     this_endwith
## 5     6     8 c     k
# Data.table
R_DT[!c(1:3, 9), ] 
# Or 
R_DT[-c(1:3, 9), ]
##     A B C            D
## 1:  9 4 a            d
## 2: 11 5 a            e
## 3: 15 6 a         <NA>
## 4: 16 7 c this_endwith
## 5:  6 8 c            k
##     A B C            D
## 1:  9 4 a            d
## 2: 11 5 a            e
## 3: 15 6 a         <NA>
## 4: 16 7 c this_endwith
## 5:  6 8 c            k

Select rows using a boolean mask

# Pandas
# Number of elements must match the number of rows
PD.iloc[[True, False, True] * 3]
##     A  B  C                D
## 0  19  1  b  startswith_this
## 2   2  3  b                c
## 3   9  4  a                d
## 5  15  6  a              NaN
## 6  16  7  c     this_endwith
## 8  12  9  c                z
# Pydatatable
PY_DT[[True, False, True] * 3, :]
##    |     A      B  C      D              
##    | int32  int32  str32  str32          
## -- + -----  -----  -----  ---------------
##  0 |    19      1  b      startswith_this
##  1 |     2      3  b      c              
##  2 |     9      4  a      d              
##  3 |    15      6  a      NA             
##  4 |    16      7  c      this_endwith   
##  5 |    12      9  c      z              
## [6 rows x 4 columns]
# Dplyr
TB[rep(c(TRUE, FALSE, TRUE), 3), ]
## # A tibble: 6 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    19     1 b     startswith_this
## 2     2     3 b     c              
## 3     9     4 a     d              
## 4    15     6 a     <NA>           
## 5    16     7 c     this_endwith   
## 6    12     9 c     z
# Data.table
R_DT[rep(c(TRUE, FALSE, TRUE), 3), ]
##     A B C               D
## 1: 19 1 b startswith_this
## 2:  2 3 b               c
## 3:  9 4 a               d
## 4: 15 6 a            <NA>
## 5: 16 7 c    this_endwith
## 6: 12 9 c               z

Select rows on a single condition

# Pandas
PD.loc[PD['A'] > 9]
##     A  B  C                D
## 0  19  1  b  startswith_this
## 4  11  5  a                e
## 5  15  6  a              NaN
## 6  16  7  c     this_endwith
## 8  12  9  c                z
# Pydatatable
PY_DT[f.A > 9, :]
##    |     A      B  C      D              
##    | int32  int32  str32  str32          
## -- + -----  -----  -----  ---------------
##  0 |    19      1  b      startswith_this
##  1 |    11      5  a      e              
##  2 |    15      6  a      NA             
##  3 |    16      7  c      this_endwith   
##  4 |    12      9  c      z              
## [5 rows x 4 columns]
# Dplyr
filter(TB, A > 9)
## # A tibble: 5 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    19     1 b     startswith_this
## 2    11     5 a     e              
## 3    15     6 a     <NA>           
## 4    16     7 c     this_endwith   
## 5    12     9 c     z
# Data.table
R_DT[A > 9]
# Or set key for faster binary search
setkeyv(R_DT, cols = "A")
R_DT[A > 9] 
# Remove key
setkeyv(R_DT, cols = NULL)
##     A B C               D
## 1: 19 1 b startswith_this
## 2: 11 5 a               e
## 3: 15 6 a            <NA>
## 4: 16 7 c    this_endwith
## 5: 12 9 c               z
##     A B C               D
## 1: 11 5 a               e
## 2: 12 9 c               z
## 3: 15 6 a            <NA>
## 4: 16 7 c    this_endwith
## 5: 19 1 b startswith_this

Select rows on multiple conditions, using OR

# Pandas
PD.loc[(PD['A'] > 9) | (PD['C'] == 'b')]
##     A  B  C                D
## 0  19  1  b  startswith_this
## 1   4  2  b              NaN
## 2   2  3  b                c
## 4  11  5  a                e
## 5  15  6  a              NaN
## 6  16  7  c     this_endwith
## 8  12  9  c                z
# Pydatatable
PY_DT[(f.A > 9) | (f.C == 'b'), :]
##    |     A      B  C      D              
##    | int32  int32  str32  str32          
## -- + -----  -----  -----  ---------------
##  0 |    19      1  b      startswith_this
##  1 |     4      2  b      NA             
##  2 |     2      3  b      c              
##  3 |    11      5  a      e              
##  4 |    15      6  a      NA             
##  5 |    16      7  c      this_endwith   
##  6 |    12      9  c      z              
## [7 rows x 4 columns]
# Dplyr
filter(TB, A > 9 | C == 'b')
## # A tibble: 7 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    19     1 b     startswith_this
## 2     4     2 b     <NA>           
## 3     2     3 b     c              
## 4    11     5 a     e              
## 5    15     6 a     <NA>           
## 6    16     7 c     this_endwith   
## 7    12     9 c     z
# Data.table
R_DT[A > 9 | C == 'b', ]
# Or set keys for faster binary search
setkeyv(R_DT, cols = c("A", "C"))
R_DT[A > 9 | C == 'b', ] 
# Remove key
setkeyv(R_DT, cols = NULL)
##     A B C               D
## 1:  2 3 b               c
## 2:  4 2 b            <NA>
## 3: 11 5 a               e
## 4: 12 9 c               z
## 5: 15 6 a            <NA>
## 6: 16 7 c    this_endwith
## 7: 19 1 b startswith_this
##     A B C               D
## 1:  2 3 b               c
## 2:  4 2 b            <NA>
## 3: 11 5 a               e
## 4: 12 9 c               z
## 5: 15 6 a            <NA>
## 6: 16 7 c    this_endwith
## 7: 19 1 b startswith_this

Select rows on multiple conditions, using AND

# Pandas
PD.loc[(PD['A'] > 9) & (PD['C'] == 'a')]
##     A  B  C    D
## 4  11  5  a    e
## 5  15  6  a  NaN
# Pydatatable
PY_DT[(f.A > 9) & (f.C == 'a'), :]
##    |     A      B  C      D    
##    | int32  int32  str32  str32
## -- + -----  -----  -----  -----
##  0 |    11      5  a      e    
##  1 |    15      6  a      NA   
## [2 rows x 4 columns]
# Dplyr
filter(TB, A > 9 & C == 'a')
## # A tibble: 2 × 4
##       A     B C     D    
##   <int> <int> <chr> <chr>
## 1    11     5 a     e    
## 2    15     6 a     <NA>
# Data.table
R_DT[A > 9 & C == 'a', ]
# Or set keys for faster binary search
setkeyv(R_DT, cols = c("A", "C"))
R_DT[A > 9 & C == 'a', ] 
# Remove key
setkeyv(R_DT, cols = NULL)
##     A B C    D
## 1: 11 5 a    e
## 2: 15 6 a <NA>
##     A B C    D
## 1: 11 5 a    e
## 2: 15 6 a <NA>

Select rows on conditions, using IN

# Pandas
PD.loc[PD.C.isin(('a', 'c'))]
##     A  B  C             D
## 3   9  4  a             d
## 4  11  5  a             e
## 5  15  6  a           NaN
## 6  16  7  c  this_endwith
## 7   6  8  c             k
## 8  12  9  c             z

The isin functionality is currently missing but may be implemented; the solution below involves the re.match function, which was added in 1.1.0.

# Pydatatable
items = ['a','c']
regex = f"{'|'.join(items)}"
regex
PY_DT[dt.re.match(column=f.C, pattern=regex), :]
# Dplyr
filter(TB, C %in% c("a", "c"))
## # A tibble: 6 × 4
##       A     B C     D           
##   <int> <int> <chr> <chr>       
## 1     9     4 a     d           
## 2    11     5 a     e           
## 3    15     6 a     <NA>        
## 4    16     7 c     this_endwith
## 5     6     8 c     k           
## 6    12     9 c     z
# Data.table
R_DT[C %chin% c("a", "c"), ]
##     A B C            D
## 1:  6 8 c            k
## 2:  9 4 a            d
## 3: 11 5 a            e
## 4: 12 9 c            z
## 5: 15 6 a         <NA>
## 6: 16 7 c this_endwith

Filter unique rows

# Pandas
PD.drop_duplicates(subset='C', keep='first', inplace=False)
##     A  B  C                D
## 0  19  1  b  startswith_this
## 3   9  4  a                d
## 6  16  7  c     this_endwith

The unique functionality from R is currently not implemented. The solution below groups by the column in which we wish to look for duplicates, in which case the duplicated rows will be grouped together. Then, we keep the first row of each group.

# Pydatatable
PY_DT[:, first(f[:]), by(f.C)]
##    | C          A      B  D              
##    | str32  int32  int32  str32          
## -- + -----  -----  -----  ---------------
##  0 | a          9      4  d              
##  1 | b         19      1  startswith_this
##  2 | c         16      7  this_endwith   
## [3 rows x 4 columns]
# Dplyr
distinct(TB, C, .keep_all = T)
## # A tibble: 3 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    19     1 b     startswith_this
## 2     9     4 a     d              
## 3    16     7 c     this_endwith

The duplicated generic has a method defined for data.table, which returns a logical vector indicating which rows of a data.table are duplicates of a row with smaller subscripts. Note that for data.table, the returned unique rows returned may be different from those returned by dplyr and pandas since the rows are sorted by “C” first and foremost.

# Data.table
duplicated(R_DT, by = c("C"))
# Duplicated rows removed by columns specified in 'by' argument
setkeyv(R_DT, cols = "C")
unique(R_DT, by = c("C"))
setkeyv(R_DT, cols = NULL)
## [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
##    A B C D
## 1: 9 4 a d
## 2: 2 3 b c
## 3: 6 8 c k

Remove missing rows

# Pandas
PD.dropna(axis=0, subset='D', inplace=False)
##     A  B  C                D
## 0  19  1  b  startswith_this
## 2   2  3  b                c
## 3   9  4  a                d
## 4  11  5  a                e
## 6  16  7  c     this_endwith
## 7   6  8  c                k
## 8  12  9  c                z
# Pydatatable
# Create a copy since 'del' modifies in place
copy_PY_DT = PY_DT.copy()
del copy_PY_DT[f.D == None, :]
copy_PY_DT
##    |     A      B  C      D              
##    | int32  int32  str32  str32          
## -- + -----  -----  -----  ---------------
##  0 |    19      1  b      startswith_this
##  1 |     2      3  b      c              
##  2 |     9      4  a      d              
##  3 |    11      5  a      e              
##  4 |    16      7  c      this_endwith   
##  5 |     6      8  c      k              
##  6 |    12      9  c      z              
## [7 rows x 4 columns]
# Dplyr
drop_na(TB, D)
## # A tibble: 7 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    19     1 b     startswith_this
## 2     2     3 b     c              
## 3     9     4 a     d              
## 4    11     5 a     e              
## 5    16     7 c     this_endwith   
## 6     6     8 c     k              
## 7    12     9 c     z
# Data.table
na.omit(R_DT, cols = "D")
##     A B C               D
## 1:  9 4 a               d
## 2: 11 5 a               e
## 3:  2 3 b               c
## 4: 19 1 b startswith_this
## 5:  6 8 c               k
## 6: 12 9 c               z
## 7: 16 7 c    this_endwith

Randomly sample rows

# Pandas
PD.sample(
  n=5, 
  frac=None, 
  # Allow or disallow sampling of the same row more than once
  replace=False, 
  weights=None, 
  random_state=None, 
  # Defaults to 0 (index) for Series and DataFrames
  axis=None, 
  ignore_index=False
)
PD.sample(
  n=None, 
  frac=0.25, 
  # Allow or disallow sampling of the same row more than once
  replace=False, 
  weights=None, 
  random_state=None, 
  # Defaults to 0 (index) for Series and DataFrames
  axis=None, 
  ignore_index=False
)
##     A  B  C             D
## 2   2  3  b             c
## 1   4  2  b           NaN
## 4  11  5  a             e
## 5  15  6  a           NaN
## 6  16  7  c  this_endwith
##     A  B  C                D
## 0  19  1  b  startswith_this
## 4  11  5  a                e
# Pydatatable
PY_DT[sample(range(0, PY_DT.nrows), 5), :]
PY_DT[sample(range(0, PY_DT.nrows), int(PY_DT.nrows * 0.25)), :]
##    |     A      B  C      D              
##    | int32  int32  str32  str32          
## -- + -----  -----  -----  ---------------
##  0 |     6      8  c      k              
##  1 |    11      5  a      e              
##  2 |     4      2  b      NA             
##  3 |     9      4  a      d              
##  4 |    19      1  b      startswith_this
## [5 rows x 4 columns]
##    |     A      B  C      D    
##    | int32  int32  str32  str32
## -- + -----  -----  -----  -----
##  0 |    12      9  c      z    
##  1 |     9      4  a      d    
## [2 rows x 4 columns]
# Dplyr
slice_sample(TB, n = 5)   
slice_sample(TB, prop = 0.25)
## # A tibble: 5 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    11     5 a     e              
## 2     4     2 b     <NA>           
## 3    19     1 b     startswith_this
## 4    15     6 a     <NA>           
## 5    12     9 c     z              
## # A tibble: 2 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    19     1 b     startswith_this
## 2     6     8 c     k
# Data.table
R_DT[sample(.N, 5)]
R_DT[sample(.N, .N * 0.25)]
##     A B C               D
## 1: 19 1 b startswith_this
## 2: 15 6 a            <NA>
## 3:  2 3 b               c
## 4: 16 7 c    this_endwith
## 5:  4 2 b            <NA>
##    A B C D
## 1: 6 8 c k
## 2: 2 3 b c

Select rows using regex

The argument na is the default value shown if element tested is not a string.

# Pandas
PD.loc[PD.D.str.startswith('startswith', na=False)]
PD.loc[PD.D.str.endswith('endwith', na=False)]
##     A  B  C                D
## 0  19  1  b  startswith_this
##     A  B  C             D
## 6  16  7  c  this_endwith
# Pydatatable
PY_DT[dt.re.match(column=f.D, pattern="^startswith"), :]
PY_DT[dt.re.match(column=f.D, pattern="endwith$"), :]
# Dplyr
filter(TB, grepl("^startswith", D))
filter(TB, grepl("endwith$", D))
## # A tibble: 1 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    19     1 b     startswith_this
## # A tibble: 1 × 4
##       A     B C     D           
##   <int> <int> <chr> <chr>       
## 1    16     7 c     this_endwith
# Data.table
# The functional form is like(vector, pattern, ignore.case = FALSE, fixed = FALSE)
R_DT[D %like% "^startswith"]
R_DT[D %like% "endwith$"]
##     A B C               D
## 1: 19 1 b startswith_this
##     A B C            D
## 1: 16 7 c this_endwith

Regex-replace rows

# Pandas
# Create a copy
copy_PD = PD.copy()
# Modify in place via subset-select-assign
copy_PD.loc[copy_PD.D.str.startswith('startswith', na=False), 'D'] = 'replaced_start'
copy_PD.loc[copy_PD.D.str.endswith('endwith', na=False), 'D'] = 'replaced_end'
copy_PD
# Replace all '_' with ' '
copy_PD.replace(to_replace='_', value=' ', inplace=False, regex=True)
# Replace 'replaced' with 'key' and '_' with white space all at once
copy_PD.replace(to_replace='^replaced_', value='key ', inplace=False, regex=True)
##     A  B  C               D
## 0  19  1  b  replaced_start
## 1   4  2  b             NaN
## 2   2  3  b               c
## 3   9  4  a               d
## 4  11  5  a               e
## 5  15  6  a             NaN
## 6  16  7  c    replaced_end
## 7   6  8  c               k
## 8  12  9  c               z
##     A  B  C               D
## 0  19  1  b  replaced start
## 1   4  2  b             NaN
## 2   2  3  b               c
## 3   9  4  a               d
## 4  11  5  a               e
## 5  15  6  a             NaN
## 6  16  7  c    replaced end
## 7   6  8  c               k
## 8  12  9  c               z
##     A  B  C          D
## 0  19  1  b  key start
## 1   4  2  b        NaN
## 2   2  3  b          c
## 3   9  4  a          d
## 4  11  5  a          e
## 5  15  6  a        NaN
## 6  16  7  c    key end
## 7   6  8  c          k
## 8  12  9  c          z
# Pydatatable
PY_DT[dt.re.match(column=f.D, pattern="^startswith"), f.D] = 'replaced_start'
PY_DT[dt.re.match(column=f.D, pattern="endwith$"), f.D] = 'replaced_end'

Naively traverses the column D once with stringr::str_detect.

# Dplyr
mutate(TB,
  D = base::replace(
    D, stringr::str_detect(D, "^startswith"), "replaced_start"
  )
)
mutate(TB,
  D = base::replace(
    D, stringr::str_detect(D, "endwith$"), "replaced_end"
  )
)
## # A tibble: 9 × 4
##       A     B C     D             
##   <int> <int> <chr> <chr>         
## 1    19     1 b     replaced_start
## 2     4     2 b     <NA>          
## 3     2     3 b     c             
## 4     9     4 a     d             
## 5    11     5 a     e             
## 6    15     6 a     <NA>          
## 7    16     7 c     this_endwith  
## 8     6     8 c     k             
## 9    12     9 c     z             
## # A tibble: 9 × 4
##       A     B C     D              
##   <int> <int> <chr> <chr>          
## 1    19     1 b     startswith_this
## 2     4     2 b     <NA>           
## 3     2     3 b     c              
## 4     9     4 a     d              
## 5    11     5 a     e              
## 6    15     6 a     <NA>           
## 7    16     7 c     replaced_end   
## 8     6     8 c     k              
## 9    12     9 c     z
# Data.table
# Create copy
copy_R_DT = copy(R_DT)
copy_R_DT[D %like% "^startswith", D := "replaced_start"]
copy_R_DT[D %like% "endwith$", D := "replaced_end"]
copy_R_DT
##     A B C              D
## 1:  9 4 a              d
## 2: 11 5 a              e
## 3: 15 6 a           <NA>
## 4:  2 3 b              c
## 5:  4 2 b           <NA>
## 6: 19 1 b replaced_start
## 7:  6 8 c              k
## 8: 12 9 c              z
## 9: 16 7 c   replaced_end

Select rows between a numeric range

# Pandas
# Possible values for 'inclusive' are “both”, “neither”, “left”, “right”
PD[PD.A.between(left=7, right=15, inclusive='both')]
##     A  B  C    D
## 3   9  4  a    d
## 4  11  5  a    e
## 5  15  6  a  NaN
## 8  12  9  c    z
# Pydatatable
PY_DT[(f.A >= 7) & (f.A <= 15), :]
##    |     A      B  C      D    
##    | int32  int32  str32  str32
## -- + -----  -----  -----  -----
##  0 |     9      4  a      d    
##  1 |    11      5  a      e    
##  2 |    15      6  a      NA   
##  3 |    12      9  c      z    
## [4 rows x 4 columns]
# Dplyr
filter(TB, between(A, 7, 15))
## # A tibble: 4 × 4
##       A     B C     D    
##   <int> <int> <chr> <chr>
## 1     9     4 a     d    
## 2    11     5 a     e    
## 3    15     6 a     <NA> 
## 4    12     9 c     z
# Data.table
R_DT[A %between% c(7, 15), ]
# Not inclusive
R_DT[data.table::between(A, 7, 14, incbounds = FALSE), ]
##     A B C    D
## 1:  9 4 a    d
## 2: 11 5 a    e
## 3: 15 6 a <NA>
## 4: 12 9 c    z
##     A B C D
## 1:  9 4 a d
## 2: 11 5 a e
## 3: 12 9 c z

Column Selection

Select a single column

# Pandas
# PD.loc[:, 'D']
PD['D']
## 0    startswith_this
## 1                NaN
## 2                  c
## 3                  d
## 4                  e
## 5                NaN
## 6       this_endwith
## 7                  k
## 8                  z
## Name: D, dtype: object
# Pydatatable
# PY_DT['D']
# PY_DT[:, f.D]
PY_DT[:, 'D']
##    | D              
##    | str32          
## -- + ---------------
##  0 | startswith_this
##  1 | NA             
##  2 | c              
##  3 | d              
##  4 | e              
##  5 | NA             
##  6 | this_endwith   
##  7 | k              
##  8 | z              
## [9 rows x 1 column]
# Dplyr
# select(TB, D)
select(TB, "D")
## # A tibble: 9 × 1
##   D              
##   <chr>          
## 1 startswith_this
## 2 <NA>           
## 3 c              
## 4 d              
## 5 e              
## 6 <NA>           
## 7 this_endwith   
## 8 k              
## 9 z
# Data.table
# The drop argument preserver dimensionality in base R data frames
R_DT[, .(D), drop = FALSE]
##                  D
## 1:               d
## 2:               e
## 3:            <NA>
## 4:               c
## 5:            <NA>
## 6: startswith_this
## 7:               k
## 8:               z
## 9:    this_endwith

Select a single column by position

# Pandas
PD.iloc[:, 3]
## 0    startswith_this
## 1                NaN
## 2                  c
## 3                  d
## 4                  e
## 5                NaN
## 6       this_endwith
## 7                  k
## 8                  z
## Name: D, dtype: object
# Pydatatable
# PY_DT[:, 3]
PY_DT[3]
##    | D              
##    | str32          
## -- + ---------------
##  0 | startswith_this
##  1 | NA             
##  2 | c              
##  3 | d              
##  4 | e              
##  5 | NA             
##  6 | this_endwith   
##  7 | k              
##  8 | z              
## [9 rows x 1 column]
# Dplyr
select(TB, 4)
## # A tibble: 9 × 1
##   D              
##   <chr>          
## 1 startswith_this
## 2 <NA>           
## 3 c              
## 4 d              
## 5 e              
## 6 <NA>           
## 7 this_endwith   
## 8 k              
## 9 z
# Data.table
R_DT[, 4]
##                  D
## 1:               d
## 2:               e
## 3:            <NA>
## 4:               c
## 5:            <NA>
## 6: startswith_this
## 7:               k
## 8:               z
## 9:    this_endwith

Select multiple columns

# Pandas
PD.loc[:, ['A', 'B']]
##     A  B
## 0  19  1
## 1   4  2
## 2   2  3
## 3   9  4
## 4  11  5
## 5  15  6
## 6  16  7
## 7   6  8
## 8  12  9
# Pydatatable
# PY_DT[:, ['A', 'B']]
PY_DT[:, [f.A, f.B]]
##    |     A      B
##    | int32  int32
## -- + -----  -----
##  0 |    19      1
##  1 |     4      2
##  2 |     2      3
##  3 |     9      4
##  4 |    11      5
##  5 |    15      6
##  6 |    16      7
##  7 |     6      8
##  8 |    12      9
## [9 rows x 2 columns]
# Dplyr
select(TB, c(A, B))
## # A tibble: 9 × 2
##       A     B
##   <int> <int>
## 1    19     1
## 2     4     2
## 3     2     3
## 4     9     4
## 5    11     5
## 6    15     6
## 7    16     7
## 8     6     8
## 9    12     9
# Data.table
R_DT[, .(A, B)]
##     A B
## 1:  9 4
## 2: 11 5
## 3: 15 6
## 4:  2 3
## 5:  4 2
## 6: 19 1
## 7:  6 8
## 8: 12 9
## 9: 16 7

Select multiple columns by position

# Pandas
PD.iloc[:, [0, 3]]
##     A                D
## 0  19  startswith_this
## 1   4              NaN
## 2   2                c
## 3   9                d
## 4  11                e
## 5  15              NaN
## 6  16     this_endwith
## 7   6                k
## 8  12                z
# Pydatatable
PY_DT[:, [0, 3]]
##    |     A  D              
##    | int32  str32          
## -- + -----  ---------------
##  0 |    19  startswith_this
##  1 |     4  NA             
##  2 |     2  c              
##  3 |     9  d              
##  4 |    11  e              
##  5 |    15  NA             
##  6 |    16  this_endwith   
##  7 |     6  k              
##  8 |    12  z              
## [9 rows x 2 columns]
# Dplyr
select(TB, c(1, 4))
## # A tibble: 9 × 2
##       A D              
##   <int> <chr>          
## 1    19 startswith_this
## 2     4 <NA>           
## 3     2 c              
## 4     9 d              
## 5    11 e              
## 6    15 <NA>           
## 7    16 this_endwith   
## 8     6 k              
## 9    12 z
# Data.table
R_DT[, c(1, 4)]
##     A               D
## 1:  9               d
## 2: 11               e
## 3: 15            <NA>
## 4:  2               c
## 5:  4            <NA>
## 6: 19 startswith_this
## 7:  6               k
## 8: 12               z
## 9: 16    this_endwith

Select multiple columns by slicing (position)

# Pandas
PD.iloc[:, 2:4]
##    C                D
## 0  b  startswith_this
## 1  b              NaN
## 2  b                c
## 3  a                d
## 4  a                e
## 5  a              NaN
## 6  c     this_endwith
## 7  c                k
## 8  c                z
# Pydatatable
PY_DT[:, 2:4]
##    | C      D              
##    | str32  str32          
## -- + -----  ---------------
##  0 | b      startswith_this
##  1 | b      NA             
##  2 | b      c              
##  3 | a      d              
##  4 | a      e              
##  5 | a      NA             
##  6 | c      this_endwith   
##  7 | c      k              
##  8 | c      z              
## [9 rows x 2 columns]
# Dplyr
select(TB, 4:3)
## # A tibble: 9 × 2
##   D               C    
##   <chr>           <chr>
## 1 startswith_this b    
## 2 <NA>            b    
## 3 c               b    
## 4 d               a    
## 5 e               a    
## 6 <NA>            a    
## 7 this_endwith    c    
## 8 k               c    
## 9 z               c
# Data.table
R_DT[, 4:3]
##                  D C
## 1:               d a
## 2:               e a
## 3:            <NA> a
## 4:               c b
## 5:            <NA> b
## 6: startswith_this b
## 7:               k c
## 8:               z c
## 9:    this_endwith c

Select multiple columns by slicing (names)

# Pandas
PD.loc[:, 'B':'D']
##    B  C                D
## 0  1  b  startswith_this
## 1  2  b              NaN
## 2  3  b                c
## 3  4  a                d
## 4  5  a                e
## 5  6  a              NaN
## 6  7  c     this_endwith
## 7  8  c                k
## 8  9  c                z
# Pydatatable
PY_DT[:, 'B':'D']
##    |     B  C      D              
##    | int32  str32  str32          
## -- + -----  -----  ---------------
##  0 |     1  b      startswith_this
##  1 |     2  b      NA             
##  2 |     3  b      c              
##  3 |     4  a      d              
##  4 |     5  a      e              
##  5 |     6  a      NA             
##  6 |     7  c      this_endwith   
##  7 |     8  c      k              
##  8 |     9  c      z              
## [9 rows x 3 columns]
# Dplyr
select(TB, B:D)
## # A tibble: 9 × 3
##       B C     D              
##   <int> <chr> <chr>          
## 1     1 b     startswith_this
## 2     2 b     <NA>           
## 3     3 b     c              
## 4     4 a     d              
## 5     5 a     e              
## 6     6 a     <NA>           
## 7     7 c     this_endwith   
## 8     8 c     k              
## 9     9 c     z
# Data.table
R_DT[, .SD, .SDcols = B:D]
##    B C               D
## 1: 4 a               d
## 2: 5 a               e
## 3: 6 a            <NA>
## 4: 3 b               c
## 5: 2 b            <NA>
## 6: 1 b startswith_this
## 7: 8 c               k
## 8: 9 c               z
## 9: 7 c    this_endwith

Select multiple columns by position

# Pandas
PD.iloc[:, [0, 3]]
##     A                D
## 0  19  startswith_this
## 1   4              NaN
## 2   2                c
## 3   9                d
## 4  11                e
## 5  15              NaN
## 6  16     this_endwith
## 7   6                k
## 8  12                z
# Pydatatable
PY_DT[:, [0, 3]]
##    |     A  D              
##    | int32  str32          
## -- + -----  ---------------
##  0 |    19  startswith_this
##  1 |     4  NA             
##  2 |     2  c              
##  3 |     9  d              
##  4 |    11  e              
##  5 |    15  NA             
##  6 |    16  this_endwith   
##  7 |     6  k              
##  8 |    12  z              
## [9 rows x 2 columns]
# Dplyr
select(TB, c(1, 4))
## # A tibble: 9 × 2
##       A D              
##   <int> <chr>          
## 1    19 startswith_this
## 2     4 <NA>           
## 3     2 c              
## 4     9 d              
## 5    11 e              
## 6    15 <NA>           
## 7    16 this_endwith   
## 8     6 k              
## 9    12 z
# Data.table
R_DT[, c(1, 4)]
##     A               D
## 1:  9               d
## 2: 11               e
## 3: 15            <NA>
## 4:  2               c
## 5:  4            <NA>
## 6: 19 startswith_this
## 7:  6               k
## 8: 12               z
## 9: 16    this_endwith

Select multiple columns by slicing (position)

# Pandas
PD.iloc[:, 2:4]
##    C                D
## 0  b  startswith_this
## 1  b              NaN
## 2  b                c
## 3  a                d
## 4  a                e
## 5  a              NaN
## 6  c     this_endwith
## 7  c                k
## 8  c                z
# Pydatatable
PY_DT[:, 2:4]
##    | C      D              
##    | str32  str32          
## -- + -----  ---------------
##  0 | b      startswith_this
##  1 | b      NA             
##  2 | b      c              
##  3 | a      d              
##  4 | a      e              
##  5 | a      NA             
##  6 | c      this_endwith   
##  7 | c      k              
##  8 | c      z              
## [9 rows x 2 columns]
# Dplyr
select(TB, 4:3)
## # A tibble: 9 × 2
##   D               C    
##   <chr>           <chr>
## 1 startswith_this b    
## 2 <NA>            b    
## 3 c               b    
## 4 d               a    
## 5 e               a    
## 6 <NA>            a    
## 7 this_endwith    c    
## 8 k               c    
## 9 z               c
# Data.table
R_DT[, 4:3]
##                  D C
## 1:               d a
## 2:               e a
## 3:            <NA> a
## 4:               c b
## 5:            <NA> b
## 6: startswith_this b
## 7:               k c
## 8:               z c
## 9:    this_endwith c

Exclude columns by name

# Pandas
PD.drop(columns=['B', 'C'], inplace = False)
# Second way
PD.loc[:, PD.columns.difference(['B', 'C'])]
##     A                D
## 0  19  startswith_this
## 1   4              NaN
## 2   2                c
## 3   9                d
## 4  11                e
## 5  15              NaN
## 6  16     this_endwith
## 7   6                k
## 8  12                z
##     A                D
## 0  19  startswith_this
## 1   4              NaN
## 2   2                c
## 3   9                d
## 4  11                e
## 5  15              NaN
## 6  16     this_endwith
## 7   6                k
## 8  12                z
# Pydatatable
# del PY_DT[: , ['B', 'C']] 
PY_DT[:, f[:].remove([f.B, f.C])]
##    |     A  D              
##    | int32  str32          
## -- + -----  ---------------
##  0 |    19  startswith_this
##  1 |     4  NA             
##  2 |     2  c              
##  3 |     9  d              
##  4 |    11  e              
##  5 |    15  NA             
##  6 |    16  this_endwith   
##  7 |     6  k              
##  8 |    12  z              
## [9 rows x 2 columns]
# Dplyr
select(TB, -c(B, C))
## # A tibble: 9 × 2
##       A D              
##   <int> <chr>          
## 1    19 startswith_this
## 2     4 <NA>           
## 3     2 c              
## 4     9 d              
## 5    11 e              
## 6    15 <NA>           
## 7    16 this_endwith   
## 8     6 k              
## 9    12 z
# Data.table
R_DT[, .SD, .SDcols = !c("B", "C")]
# Second way
R_DT[, !c("B", "C")]
# Delete by reference
copied <- copy(R_DT)
copied[, c("B", "C") := NULL]
copied
##     A               D
## 1:  9               d
## 2: 11               e
## 3: 15            <NA>
## 4:  2               c
## 5:  4            <NA>
## 6: 19 startswith_this
## 7:  6               k
## 8: 12               z
## 9: 16    this_endwith
##     A               D
## 1:  9               d
## 2: 11               e
## 3: 15            <NA>
## 4:  2               c
## 5:  4            <NA>
## 6: 19 startswith_this
## 7:  6               k
## 8: 12               z
## 9: 16    this_endwith
##     A               D
## 1:  9               d
## 2: 11               e
## 3: 15            <NA>
## 4:  2               c
## 5:  4            <NA>
## 6: 19 startswith_this
## 7:  6               k
## 8: 12               z
## 9: 16    this_endwith

Exclude columns by position

# Pandas
PD.drop(PD.columns[[1, 2, 3]], axis=1, inplace=False)
##     A
## 0  19
## 1   4
## 2   2
## 3   9
## 4  11
## 5  15
## 6  16
## 7   6
## 8  12
# Pydatatable
# Deselecting 
PY_DT[:, [pos for pos in range(PY_DT.ncols) if pos not in [1, 2, 3]]]
# Deleting in place
copied = PY_DT.copy()
del copied[:, [1, 2, 3]]
copied
##    |     A
##    | int32
## -- + -----
##  0 |    19
##  1 |     4
##  2 |     2
##  3 |     9
##  4 |    11
##  5 |    15
##  6 |    16
##  7 |     6
##  8 |    12
## [9 rows x 1 column]
##    |     A
##    | int32
## -- + -----
##  0 |    19
##  1 |     4
##  2 |     2
##  3 |     9
##  4 |    11
##  5 |    15
##  6 |    16
##  7 |     6
##  8 |    12
## [9 rows x 1 column]
# Dplyr
select(TB, -c(2, 3, 4))
## # A tibble: 9 × 1
##       A
##   <int>
## 1    19
## 2     4
## 3     2
## 4     9
## 5    11
## 6    15
## 7    16
## 8     6
## 9    12
# Data.table
R_DT[, !c(2, 3, 4)]
# Delete by reference
copied <- copy(R_DT)
copied[, c(2, 3, 4) := NULL]
copied
##     A
## 1:  9
## 2: 11
## 3: 15
## 4:  2
## 5:  4
## 6: 19
## 7:  6
## 8: 12
## 9: 16
##     A
## 1:  9
## 2: 11
## 3: 15
## 4:  2
## 5:  4
## 6: 19
## 7:  6
## 8: 12
## 9: 16

Select columns using regex

# Import re
from re import match
# Add three columns with prefixes
NEW_PD = pd.concat(
  (PD, pd.DataFrame(
    {'prefix_D': list(range(0, 9)),
    'prefix_F': list(range(0, 9)),
    'prefix_G': list(range(0, 9))}
  )),
  axis=1)
# Pandas
NEW_PD.filter(regex='^prefix_', axis=1)
# Second way via list comp
NEW_PD.loc[:, [col for col in NEW_PD.columns if match(r'^prefix_', col)]]
##    prefix_D  prefix_F  prefix_G
## 0         0         0         0
## 1         1         1         1
## 2         2         2         2
## 3         3         3         3
## 4         4         4         4
## 5         5         5         5
## 6         6         6         6
## 7         7         7         7
## 8         8         8         8
##    prefix_D  prefix_F  prefix_G
## 0         0         0         0
## 1         1         1         1
## 2         2         2         2
## 3         3         3         3
## 4         4         4         4
## 5         5         5         5
## 6         6         6         6
## 7         7         7         7
## 8         8         8         8
# Add columns with prefix
copied = PY_DT.copy()
# Updates in place
copied[:, update(prefix_D = 3, prefix_F = 4, prefix_G = 4)]
# Pydatatable
copied[:, [col for col in copied.names if match(r'^prefix_', col)]]
##    | prefix_D  prefix_F  prefix_G
##    |    int32     int32     int32
## -- + --------  --------  --------
##  0 |        3         4         4
##  1 |        3         4         4
##  2 |        3         4         4
##  3 |        3         4         4
##  4 |        3         4         4
##  5 |        3         4         4
##  6 |        3         4         4
##  7 |        3         4         4
##  8 |        3         4         4
## [9 rows x 3 columns]
# Add columns with prefix
copied = mutate(
  TB, 
  prefix_D = sample(100, size = nrow(TB)), 
  prefix_F = 3,
  prefix_G = 'F')
# contains("pattern")
# ends_with("pattern")
# matches("pattern") matches a regular expression
# Dplyr
select(copied, starts_with("prefix_"))
## # A tibble: 9 × 3
##   prefix_D prefix_F prefix_G
##      <int>    <dbl> <chr>   
## 1       34        3 F       
## 2       47        3 F       
## 3        7        3 F       
## 4       49        3 F       
## 5       65        3 F       
## 6       96        3 F       
## 7       22        3 F       
## 8       82        3 F       
## 9       37        3 F

We can filter columns to include in .SD based on their names according to regular expressions via .SDcols=patterns(regex1, regex2, ...). The included columns will be the intersection of the columns identified by each pattern; pattern unions can easily be specified with | in a regex.

# Data.table
# Add columns with prefix
copied = copy(R_DT)
copied[, c("prefix_D", "prefix_F", "prefix_G") := .(4, 33, 5)]
# Data.table
copied[, .SD, .SDcols = patterns("^prefix_")]
##    prefix_D prefix_F prefix_G
## 1:        4       33        5
## 2:        4       33        5
## 3:        4       33        5
## 4:        4       33        5
## 5:        4       33        5
## 6:        4       33        5
## 7:        4       33        5
## 8:        4       33        5
## 9:        4       33        5

Related