Getting started

Using tidypolars requires importing data as Polars DataFrames or LazyFrames. You can read files with the read_*_polars() functions (such as read_parquet_polars()) to import them as DataFrames, or with the scan_*_polars() functions (such as scan_parquet_polars()) to import them as LazyFrames. These functions cover several file formats, including CSV, Parquet, and JSON.

Note: in examples and some tutorials, the functions as_polars_df() and as_polars_lf() are sometimes used to convert an existing R data.frame to a Polars DataFrame or LazyFrame. These are merely convenience functions for quickly converting an existing dataset to Polars, which is useful for demonstration purposes. However, converting from R to Polars has a cost and hurts performance. In real-life use cases, be sure to load the data with the read_*_polars() or scan_*_polars() functions mentioned above.
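As a minimal sketch of the import functions described above, the snippet below writes a small CSV file to a temporary location and reads it back both eagerly and lazily. It assumes that read_csv_polars() and scan_csv_polars() accept a file path as their first argument; the file and object names (tmp_csv, cars_df, cars_lf) are made up for illustration.

```r
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# Write a small built-in dataset to a temporary CSV file (base R only).
tmp_csv <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp_csv, row.names = FALSE)

# Eager import: the whole file is read into a Polars DataFrame.
cars_df <- read_csv_polars(tmp_csv)

# Lazy import: nothing is read until the query is evaluated.
cars_lf <- scan_csv_polars(tmp_csv)

# Build a query on the LazyFrame, then evaluate it with compute().
heavy_cars <- cars_lf |>
  filter(wt > 3.5) |>
  select(mpg, cyl, wt) |>
  compute()

heavy_cars
```

With the lazy version, the filter and column selection are part of the query that Polars plans before reading the file, which is the main reason to prefer scan_*_polars() for large inputs.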

Here, we’re going to use the who dataset from the tidyr package. We import it both as a classic R data.frame and as a Polars DataFrame so that we can easily compare dplyr and tidypolars functions.

library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)

who_df <- tidyr::who
who_pl <- as_polars_df(tidyr::who)

tidypolars provides methods for dplyr and tidyr S3 generics. Put simply, you can use the same functions on a Polars DataFrame or LazyFrame as in a classic tidyverse workflow, and it should just work (if it doesn’t, please open an issue). Note that you still need to load dplyr and tidyr in your code.

Here’s an example of some dplyr and tidyr code on the classic R data.frame:

who_df |> 
  filter(year > 1990) |> 
  drop_na(newrel_f3544) |> 
  select(iso3, year, matches("^newrel(.*)_f")) |> 
  arrange(iso3, year) |> 
  rename_with(.fn = toupper) |> 
  head()
#> # A tibble: 6 × 9
#>   ISO3   YEAR NEWREL_F014 NEWREL_F1524 NEWREL_F2534 NEWREL_F3544 NEWREL_F4554
#>   <chr> <dbl>       <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
#> 1 AGO    2013         626         2644         2480         1671          991
#> 2 AIA    2013           0            0            0            0            0
#> 3 ALB    2013           5           28           34           13           18
#> 4 AND    2013           0            0            0            1            0
#> 5 ARE    2013           5            4            9            3            3
#> 6 ARG    2013         431          927          808          537          395
#> # ℹ 2 more variables: NEWREL_F5564 <dbl>, NEWREL_F65 <dbl>

We can simply use our Polars dataset instead:

who_pl |> 
  filter(year > 1990) |> 
  drop_na(newrel_f3544) |> 
  select(iso3, year, matches("^newrel(.*)_f")) |> 
  arrange(iso3, year) |> 
  rename_with(.fn = toupper) |> 
  head()
#> shape: (6, 9)
#> ┌──────┬────────┬─────────────┬────────────┬───┬────────────┬────────────┬────────────┬────────────┐
#> │ ISO3 ┆ YEAR   ┆ NEWREL_F014 ┆ NEWREL_F15 ┆ … ┆ NEWREL_F35 ┆ NEWREL_F45 ┆ NEWREL_F55 ┆ NEWREL_F65 │
#> │ ---  ┆ ---    ┆ ---         ┆ 24         ┆   ┆ 44         ┆ 54         ┆ 64         ┆ ---        │
#> │ str  ┆ f64    ┆ f64         ┆ ---        ┆   ┆ ---        ┆ ---        ┆ ---        ┆ f64        │
#> │      ┆        ┆             ┆ f64        ┆   ┆ f64        ┆ f64        ┆ f64        ┆            │
#> ╞══════╪════════╪═════════════╪════════════╪═══╪════════════╪════════════╪════════════╪════════════╡
#> │ AGO  ┆ 2013.0 ┆ 626.0       ┆ 2644.0     ┆ … ┆ 1671.0     ┆ 991.0      ┆ 481.0      ┆ 314.0      │
#> │ AIA  ┆ 2013.0 ┆ 0.0         ┆ 0.0        ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
#> │ ALB  ┆ 2013.0 ┆ 5.0         ┆ 28.0       ┆ … ┆ 13.0       ┆ 18.0       ┆ 14.0       ┆ 34.0       │
#> │ AND  ┆ 2013.0 ┆ 0.0         ┆ 0.0        ┆ … ┆ 1.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
#> │ ARE  ┆ 2013.0 ┆ 5.0         ┆ 4.0        ┆ … ┆ 3.0        ┆ 3.0        ┆ 1.0        ┆ 6.0        │
#> │ ARG  ┆ 2013.0 ┆ 431.0       ┆ 927.0      ┆ … ┆ 537.0      ┆ 395.0      ┆ 307.0      ┆ 374.0      │
#> └──────┴────────┴─────────────┴────────────┴───┴────────────┴────────────┴────────────┴────────────┘

If you use the Polars lazy API, you need to call compute() at the end of the chained expression to evaluate the query:

who_pl_lazy <- as_polars_lf(tidyr::who)

who_pl_lazy |> 
  filter(year > 1990) |> 
  drop_na(newrel_f3544) |> 
  select(iso3, year, matches("^newrel(.*)_f")) |> 
  arrange(iso3, year) |> 
  rename_with(.fn = toupper) |> 
  compute() |> 
  head()
#> shape: (6, 9)
#> ┌──────┬────────┬─────────────┬────────────┬───┬────────────┬────────────┬────────────┬────────────┐
#> │ ISO3 ┆ YEAR   ┆ NEWREL_F014 ┆ NEWREL_F15 ┆ … ┆ NEWREL_F35 ┆ NEWREL_F45 ┆ NEWREL_F55 ┆ NEWREL_F65 │
#> │ ---  ┆ ---    ┆ ---         ┆ 24         ┆   ┆ 44         ┆ 54         ┆ 64         ┆ ---        │
#> │ str  ┆ f64    ┆ f64         ┆ ---        ┆   ┆ ---        ┆ ---        ┆ ---        ┆ f64        │
#> │      ┆        ┆             ┆ f64        ┆   ┆ f64        ┆ f64        ┆ f64        ┆            │
#> ╞══════╪════════╪═════════════╪════════════╪═══╪════════════╪════════════╪════════════╪════════════╡
#> │ AGO  ┆ 2013.0 ┆ 626.0       ┆ 2644.0     ┆ … ┆ 1671.0     ┆ 991.0      ┆ 481.0      ┆ 314.0      │
#> │ AIA  ┆ 2013.0 ┆ 0.0         ┆ 0.0        ┆ … ┆ 0.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
#> │ ALB  ┆ 2013.0 ┆ 5.0         ┆ 28.0       ┆ … ┆ 13.0       ┆ 18.0       ┆ 14.0       ┆ 34.0       │
#> │ AND  ┆ 2013.0 ┆ 0.0         ┆ 0.0        ┆ … ┆ 1.0        ┆ 0.0        ┆ 0.0        ┆ 0.0        │
#> │ ARE  ┆ 2013.0 ┆ 5.0         ┆ 4.0        ┆ … ┆ 3.0        ┆ 3.0        ┆ 1.0        ┆ 6.0        │
#> │ ARG  ┆ 2013.0 ┆ 431.0       ┆ 927.0      ┆ … ┆ 537.0      ┆ 395.0      ┆ 307.0      ┆ 374.0      │
#> └──────┴────────┴─────────────┴────────────┴───┴────────────┴────────────┴────────────┴────────────┘

Note: several functions trigger the evaluation of a lazy query: compute(), collect(), as.data.frame(), and as_tibble(). If you want to return a Polars DataFrame, use compute(). If you want to return a standard R data.frame, for example to use it in statistical analysis, use any of the other three functions. Be warned that if the dataset is too big for your available memory, evaluating the query will crash the R session.
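To illustrate the difference between these functions, here is a small sketch that evaluates the same lazy query twice: once with compute() and once with as.data.frame(). It uses the built-in mtcars dataset converted with as_polars_lf() purely for demonstration; the object names (query, res_polars, res_r) are hypothetical.

```r
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# Define a lazy query; nothing is evaluated at this point.
query <- as_polars_lf(mtcars) |>
  filter(cyl == 6) |>
  select(mpg, cyl, hp)

# compute() evaluates the query and keeps the result in Polars.
res_polars <- compute(query)

# as.data.frame() evaluates the query and returns a standard R data.frame.
res_r <- as.data.frame(query)

# Inspect what each call returned.
class(res_polars)
class(res_r)
```

Keeping the result in Polars with compute() is useful when more tidypolars steps follow, while the R data.frame is what most modelling functions expect.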

tidypolars also supports many functions from base R, lubridate, and stringr. When these are used inside filter(), mutate(), or summarize(), tidypolars automatically converts them to use the Polars engine under the hood. Take a look at the vignette “R and Polars expressions” for more information.
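As a rough sketch of what this looks like in practice, the example below uses two stringr functions inside filter() and mutate() on the who dataset. It assumes that str_detect() and str_to_upper() are among the functions tidypolars can translate; the vignette mentioned above is the reference for what is actually supported. The names who_a and country_upper are made up for illustration.

```r
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)
library(stringr)

who_pl <- as_polars_df(tidyr::who)

who_a <- who_pl |>
  # stringr::str_detect() is translated to a Polars expression
  filter(str_detect(country, "^A"), year > 2010) |>
  # same for stringr::str_to_upper()
  mutate(country_upper = str_to_upper(country)) |>
  select(country, country_upper, year) |>
  # convert to an R data.frame to inspect the result with base R
  as.data.frame()

head(who_a)
```

The code is written exactly as it would be for a regular data.frame with dplyr; only the input object differs.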