Getting started

library(astgrepr)

This vignette covers the basics you need to start using astgrepr. If you want to know more about advanced rules and other topics, I invite you to read the docs of the Rust crate ast-grep, on which this package is built.

First steps

My main incentive for building this package was to provide a faster linter for R code. lintr is a great tool, but I’m jealous of the Python ecosystem that has a lightning-fast linter (Ruff).

Therefore, as a motivation for this vignette, let’s say we have the following code and that we want to find bad patterns:

src <- "x <- rnorm(100, mean = 2)
any(is.na(y))
plot(x)
any(is.na(x))
any(duplicated(variable))"

I can already see two of them:

any(is.na()) is slower than anyNA() (lintr)
any(duplicated()) is slower than anyDuplicated() > 0 (lintr)

Let’s start by building the abstract syntax tree (AST) corresponding to this code. This has to be the first step because all other functions depend on this tree:

root <- src |>
  tree_new() |>
  tree_root()

root
#> <AST node>

Rules and nodes

“Rules” are one of the key elements of astgrepr. They basically define what we are looking for in the code. One can build a simple rule with ast_rule():

ast_rule(id = "any_na", pattern = "any(is.na($VAR))")
#> <ast-grep rule: 'any_na'>
#> pattern: any(is.na($VAR))

ast_rule() has many arguments, and a rule can also include pattern_rule() and relational_rule(), but we keep it simple for now. Once a rule is created, it can be applied to a node:

root |> 
  node_find(
    ast_rule(id = "any_na", pattern = "any(is.na($VAR))"),
    ast_rule(id = "any_dup", pattern = "any(duplicated($VAR))")
  )
#> <List of 2 rules>
#> |--any_na: 1 node
#> |--any_dup: 1 node

Like most astgrepr functions, node_find() returns a nested list. Lists are nested on two levels: rules and nodes. For each rule, the list holds the nodes that were matched.

Here, node_find() returned a list of two rules, and each of them contains a single node. This is expected: node_find() stops after the first node that matches the rule. If we want to look for all nodes that match this rule, we can use node_find_all():

found_nodes <- root |> 
  node_find_all(
    ast_rule(id = "any_na", pattern = "any(is.na($VAR))"),
    ast_rule(id = "any_dup", pattern = "any(duplicated($VAR))")
  )

found_nodes
#> <List of 2 rules>
#> |--any_na: 2 nodes
#> |--any_dup: 1 nodes

More generally, most functions come in a single-node and a multi-node variant. For instance, we use node_text_all() to extract the text corresponding to each node and node_range_all() to get their start and end coordinates in the original code¹:

found_nodes |> 
  node_text_all()
#> $any_na
#> $any_na$node_1
#> [1] "any(is.na(y))"
#> 
#> $any_na$node_2
#> [1] "any(is.na(x))"
#> 
#> 
#> $any_dup
#> $any_dup$node_1
#> [1] "any(duplicated(variable))"

found_nodes |>
  node_range_all()
#> $any_na
#> $any_na$node_1
#> $any_na$node_1$start
#> [1] 1 0
#> 
#> $any_na$node_1$end
#> [1]  1 13
#> 
#> 
#> $any_na$node_2
#> $any_na$node_2$start
#> [1] 3 0
#> 
#> $any_na$node_2$end
#> [1]  3 13
#> 
#> 
#> 
#> $any_dup
#> $any_dup$node_1
#> $any_dup$node_1$start
#> [1] 4 0
#> 
#> $any_dup$node_1$end
#> [1]  4 25
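
For comparison, the single-node variants can be chained after node_find(). This is a minimal sketch that reuses ast_rule(), tree_new(), tree_root() and node_find() from above, and assumes that node_text() and node_range() are the single-node counterparts of node_text_all() and node_range_all(). It requires astgrepr to be installed.

```r
library(astgrepr)

src <- "x <- rnorm(100, mean = 2)
any(is.na(y))
plot(x)
any(is.na(x))
any(duplicated(variable))"

root <- src |>
  tree_new() |>
  tree_root()

# node_find() keeps only the first node that matches the rule
first_match <- root |>
  node_find(ast_rule(id = "any_na", pattern = "any(is.na($VAR))"))

# Single-node variants (assumed counterparts of the *_all() functions)
first_text <- node_text(first_match)
first_range <- node_range(first_match)
```

Since node_find() returns at most one node per rule, the results should have one entry per rule rather than one per matched node.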

Let’s sum up what we have so far: the original code (src), and the location (node_range_all()) and content (node_text_all()) of the patterns we were looking for. This is already enough to build a linter².
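
As a rough illustration of that last point, here is a minimal base R sketch that turns matched text and start positions into lint-style messages. The two input lists are copied by hand from the output above so that the example runs without astgrepr. The helper format_lints() and the message wording are hypothetical and are not part of the package.

```r
# Values copied by hand from the node_text_all() and node_range_all() output
texts <- list(
  any_na = list(node_1 = "any(is.na(y))", node_2 = "any(is.na(x))"),
  any_dup = list(node_1 = "any(duplicated(variable))")
)
starts <- list(
  any_na = list(node_1 = c(1, 0), node_2 = c(3, 0)),
  any_dup = list(node_1 = c(4, 0))
)

# Hypothetical messages, one per rule id
messages <- c(
  any_na = "use anyNA() instead",
  any_dup = "use anyDuplicated() > 0 instead"
)

# Walk the two nesting levels (rules, then nodes) and build one line per match
format_lints <- function(texts, starts, messages) {
  out <- character(0)
  for (rule in names(texts)) {
    for (node in names(texts[[rule]])) {
      # Positions are 0-indexed, so add 1 to get the usual line:column form
      pos <- starts[[rule]][[node]] + 1
      out <- c(out, sprintf(
        "%d:%d [%s] %s: %s",
        pos[1], pos[2], rule, texts[[rule]][[node]], messages[[rule]]
      ))
    }
  }
  out
}

lints <- format_lints(texts, starts, messages)
writeLines(lints)
```

The nested loop mirrors the two levels of the lists returned by astgrepr: the outer level iterates over rules and the inner one over the nodes matched by each rule.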

astgrepr offers another feature: code rewriting.

Modifying nodes

Wouldn’t it be nice if our IDE (say, RStudio) could automatically fix those patterns?

To do so, we need two new functions: node_replace_all() and tree_rewrite(). The first one takes a replacement for each rule and applies it to the matched nodes, and the second one rewrites the tree based on those replacements. First, let’s see what node_replace_all() returns when it is given the output of node_find_all():

nodes_to_replace <- root |>
  node_find_all(
    ast_rule(id = "any_na", pattern = "any(is.na($VAR))"),
    ast_rule(id = "any_dup", pattern = "any(duplicated($VAR))")
  )

nodes_to_replace
#> <List of 2 rules>
#> |--any_na: 2 nodes
#> |--any_dup: 1 nodes

fixes <- nodes_to_replace |>
  node_replace_all(
    any_na = "anyNA(~~VAR~~)",
    any_dup = "anyDuplicated(~~VAR~~) > 0"
  )

fixes
#> $node_1
#> $node_1[[1]]
#> [1] 26
#> 
#> $node_1[[2]]
#> [1] 39
#> 
#> $node_1[[3]]
#> [1] "anyNA(y)"
#> 
#> 
#> $node_2
#> $node_2[[1]]
#> [1] 48
#> 
#> $node_2[[2]]
#> [1] 61
#> 
#> $node_2[[3]]
#> [1] "anyNA(x)"
#> 
#> 
#> $node_1
#> $node_1[[1]]
#> [1] 62
#> 
#> $node_1[[2]]
#> [1] 87
#> 
#> $node_1[[3]]
#> [1] "anyDuplicated(variable) > 0"
#> 
#> 
#> attr(,"class")
#> [1] "astgrep_replacements" "list"

It returns a nested list (once again) with the replacement for each node, in which ~~VAR~~ has been substituted by the code that $VAR matched, and the start and end offsets indicating where this replacement should be inserted in the original code. To finalize our code rewrite, we now need to apply those changes to the original tree with tree_rewrite():

# original code
cat(src)
#> x <- rnorm(100, mean = 2)
#> any(is.na(y))
#> plot(x)
#> any(is.na(x))
#> any(duplicated(variable))

# new code
tree_rewrite(root, fixes)
#> x <- rnorm(100, mean = 2)
#> anyNA(y)
#> plot(x)
#> anyNA(x)
#> anyDuplicated(variable) > 0
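
To make the role of those offsets concrete, here is a minimal base R sketch of how a list of (start, end, replacement) triplets can be spliced into a string. It is not how tree_rewrite() is implemented. The helper apply_fixes() is hypothetical, and the triplets are copied by hand from the output of node_replace_all() above. The sketch assumes that offsets are 0-indexed with an exclusive end, which is consistent with the values shown (26 to 39 for the 13 characters of any(is.na(y))), and that the code is plain ASCII so that offsets and characters coincide.

```r
src <- "x <- rnorm(100, mean = 2)
any(is.na(y))
plot(x)
any(is.na(x))
any(duplicated(variable))"

# Triplets (start, end, replacement) copied by hand from the output above
fixes <- list(
  list(26, 39, "anyNA(y)"),
  list(48, 61, "anyNA(x)"),
  list(62, 87, "anyDuplicated(variable) > 0")
)

# Hypothetical helper: splice each replacement into the code
apply_fixes <- function(code, fixes) {
  starts <- vapply(fixes, function(f) f[[1]], numeric(1))
  # Go from the last match to the first so that earlier offsets stay valid
  for (f in fixes[order(starts, decreasing = TRUE)]) {
    before <- substr(code, 1, f[[1]])              # text before the match
    after <- substr(code, f[[2]] + 1, nchar(code)) # text after the match
    code <- paste0(before, f[[3]], after)
  }
  code
}

new_code <- apply_fixes(src, fixes)
cat(new_code)
```

Applying the replacements in reverse order matters because each one changes the length of the string, which would shift the offsets of every match located after it.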

And that’s it. Building a linter or a code rewriter is a massive effort that is not among astgrepr’s objectives, but I hope this tool can serve as a foundation to build one.

¹ Note that in each sublist, the first value refers to the row and the second one to the column. Also, those values are 0-indexed, so 1 corresponds to the second row or column.
² Of course, more work is needed to make the IDE report those lints, but this is outside the scope of astgrepr.