R sub() and gsub() Functions: Complete Guide & Examples

Updated on June 10, 2026

R Programming

By Prajwal CN, Bradley Kouchi and Manikandan Kurup

Introduction

R’s sub() and gsub() functions are the primary base R tools for string pattern replacement. Both accept a pattern, a replacement string, and a character vector or data frame column, and both support full regular expression syntax by default. The key difference is scope: sub() replaces only the first match in each string element, while gsub() replaces all matches globally. Modeled on Unix sed conventions and included in base R since its initial release, both functions require no external packages and are universally available across every R installation.

As of modern R releases, sub() and gsub() remain core, stable functions used daily for data cleaning, text preprocessing, log parsing, and column standardization.

In this tutorial, you will learn the syntax and parameters of both functions, use regular expressions with character classes, anchors, and capture group backreferences, explore advanced features including case-insensitive matching and the PCRE2 (Perl Compatible Regular Expressions version 2) engine enabled by perl = TRUE, replace multiple patterns at once using Reduce() and stringr::str_replace_all(), apply gsub() to data frame columns with practical cleaning examples, and compare gsub() against stringr and stringi alternatives.

Key Takeaways

Use sub() to replace only the first occurrence of a pattern per string element; use gsub() to replace all occurrences.
Both functions support full regular expressions by default. Set fixed = TRUE to match a literal string with no regex interpretation.
Set perl = TRUE to switch from the default TRE engine to PCRE2, which adds lookaheads, lookbehinds, named capture groups, and \U/\L case modifiers in replacement strings.
Use \\1, \\2, etc. in the replacement string to reference capture groups from the pattern.
To replace multiple different patterns in one operation, use stringr::str_replace_all() with a named vector, or chain gsub() calls with Reduce().
Apply gsub() to a data frame column directly: df$col <- gsub("pattern", "replacement", df$col).
Use ignore.case = TRUE for case-insensitive matching. Combine with \\b for whole-word replacement, but note that \\b is ASCII-only in TRE; use perl = TRUE for Unicode-aware word boundaries.
Both functions pass NA values through unchanged without error. Handle missing values explicitly before calling sub() or gsub() if your pipeline depends on detecting or imputing them.

Prerequisites

To follow this tutorial, you will need:

R installed locally or on a server.
The stringr package for the comparison section, installed with install.packages("stringr") (optional).

What are R’s `sub()` and `gsub()` functions?

sub() and gsub() are base R string replacement functions that find a pattern in a character vector and substitute it with a replacement string. The pattern can be a fixed literal or a regular expression. They are vectorized, meaning they process each element of the input vector independently in a single call.

`sub()` function: syntax and parameters

The full syntax for sub() is:

sub(pattern, replacement, x,
    ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

Here’s what each function argument means:

Parameter	Type	Description
`pattern`	character	The string or regex to search for.
`replacement`	character	The string to substitute in place of each match. Supports backreferences (`\\1`, `\\2`, etc.).
`x`	character vector	The input to search. Data frame columns are passed as vectors.
`ignore.case`	logical	When `TRUE`, matching is case-insensitive. Default: `FALSE`.
`perl`	logical	When `TRUE`, uses the PCRE2 regex engine instead of TRE. Default: `FALSE`.
`fixed`	logical	When `TRUE`, `pattern` is treated as a literal string, not a regex. Default: `FALSE`.
`useBytes`	logical	When `TRUE`, matching is byte-by-byte. Rarely needed. Default: `FALSE`.

`gsub()` function: syntax and parameters

gsub() shares identical syntax and parameters with sub(). The replacement string supports the same backreference notation.

gsub(pattern, replacement, x,
     ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

The only behavioral difference between the two functions is described in the following table:

Feature	`sub()`	`gsub()`
Replacements per element	First match only	All matches
Typical use case	Remove or change a leading token	Global search-and-replace
Syntax	Identical to `gsub()`	Identical to `sub()`
Regex support	Yes	Yes
Performance on long strings	Slightly faster (stops at first match)	Scans the entire string
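
As a quick side-by-side check of the table above, the sketch below runs both functions on the same input. The vector `tokens` is made up for illustration and is not part of the tutorial's data.

```r
# Hypothetical input: one element with several hyphens, one with none
tokens <- c("a-b-c", "no dash")

first_only <- sub("-", "+", tokens)   # replaces the first hyphen in each element
all_hits   <- gsub("-", "+", tokens)  # replaces every hyphen in each element

first_only
# [1] "a+b-c"   "no dash"
all_hits
# [1] "a+b+c"   "no dash"
```

Elements that contain no match are returned unchanged by both functions.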

Basic usage examples

Let’s look at how to use sub() and gsub(). This section demonstrates the core behavior of each function with practical examples.

Note: R’s console pads element spacing for readability based on the width of the longest element in the output; exact whitespace may vary slightly from what is shown here.

Replacing the first match only with `sub()`

sub() scans each string element and stops after it finds and replaces the first match.

# Replace the first occurrence of "R" in a tutorial description
tutorial_text <- "In this tutorial, we will install R and add packages from CRAN."
sub("R", "The R Language", tutorial_text)

Running this command produces the following output:

Output

[1] "In this tutorial, we will install The R Language and add packages from CRAN."

Only the first "R" is replaced. The "R" inside "CRAN" later in the same string is left unchanged because sub() stops after the first match in each string element.

Replacing all matches with `gsub()`

gsub() scans the entire string and replaces every match it finds. Using word-boundary anchors (\\b) ensures the pattern matches only the intended whole-word token rather than any substring.

# Replace ALL standalone occurrences of "R" using word boundaries
tutorial_text <- "R is open-source. Use R for data analysis and R for visualization."
gsub("\\bR\\b", "The R Language", tutorial_text)

The output looks like this:

Output

[1] "The R Language is open-source. Use The R Language for data analysis and The R Language for visualization."

Every standalone "R" is replaced. The \\b word-boundary anchor is important here: without it, the bare pattern "R" would also match the letter inside words such as "CRAN" or "TRE", producing unintended substitutions.

Using `sub()` and `gsub()` on a data frame column

Both functions accept a data frame column vector as the x argument. Pass the column, assign the result back, and the data frame is updated in place.

# Create a sample data frame
marine_species <- data.frame(
  creature            = c("Starfish", "Blue Crab", "Bluefin Tuna", "Blue Shark", "Blue Whale"),
  population_millions = c(5, 6, 4, 2, 2),
  stringsAsFactors    = FALSE
)

# sub(): replace the first "Blue" in each element
sub("Blue", "Green", marine_species$creature)

Running this command produces the following output:

Output

[1] "Starfish"      "Green Crab"    "Greenfin Tuna" "Green Shark"   "Green Whale"

Notice that "Bluefin Tuna" becomes "Greenfin Tuna" because the substring "Blue" appears at the start of "Bluefin". sub() is not word-aware and matches any occurrence, including inside longer words. To write the result back into the data frame, assign directly to the column:

marine_species$creature <- sub("Blue", "Green", marine_species$creature)

Using regular expressions with `sub()` and `gsub()`

Both functions treat pattern as a regular expression by default using R’s TRE engine (a modified POSIX ERE implementation). The following examples cover the most common regex constructs.

Regex engine quick reference: Base R uses TRE regex by default. Setting perl = TRUE switches to PCRE2. The stringr package uses ICU regex through stringi, which has slightly different syntax and behavior from both TRE and PCRE2. Features available in one engine may not be available in another.

Matching character classes and wildcards

A character class, written inside square brackets, matches any one character from the defined set. Negating the class with ^ matches any character NOT in the set.

# Remove all digits from product codes
product_codes <- c("SKU-1234", "SKU-5678", "SKU-ABCD")
gsub("[0-9]", "", product_codes)

Stripping the digits produces the following:

Output

[1] "SKU-"     "SKU-"     "SKU-ABCD"

The dot (.) wildcard matches any single character (with perl = TRUE, it does not match a newline by default). The following example shows . matching a letter or a digit between "l" and "g":

variants <- c("log", "lag", "l9g", "lg")
gsub("l.g", "[match]", variants)

This results in the following output:

Output

[1] "[match]" "[match]" "[match]" "lg"

"lg" is unchanged because there is no character between "l" and "g" for the dot to match. Quantifiers like + (one or more) and * (zero or more) extend a match across multiple characters. When using advanced regex constructs consistently across environments, perl = TRUE may produce more predictable behavior.
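
To make the difference between the two quantifiers concrete, here is a minimal sketch. The `variants` vector is illustrative only.

```r
# Hypothetical inputs with zero, one, two, and three "o" characters
variants <- c("lg", "log", "loog", "looog")

# "o+" needs at least one "o", so "lg" does not match
one_or_more <- gsub("lo+g", "[match]", variants)

# "o*" also accepts zero "o" characters, so "lg" matches too
zero_or_more <- gsub("lo*g", "[match]", variants)

one_or_more
# [1] "lg"      "[match]" "[match]" "[match]"
zero_or_more
# [1] "[match]" "[match]" "[match]" "[match]"
```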

Using anchors (`^` and `$`) in patterns

^ matches the position at the start of a string; $ matches the position at the end. These anchors are useful for stripping leading or trailing content without affecting the interior of the string.

# Remove trailing whitespace from column labels
messy_labels <- c("Revenue   ", "Costs  ", "Profit ")
gsub("\\s+$", "", messy_labels)

The cleaned labels appear as:

Output

[1] "Revenue" "Costs"   "Profit"

In R, regex escape sequences require double backslashes in string literals: \\s in the R source becomes \s (the whitespace class) in the regex engine. R’s TRE engine supports \s as a documented extension to POSIX ERE, so the example above works without perl = TRUE. For maximum portability outside R or across regex tools, the strictly POSIX-compatible equivalent is [[:space:]].
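
As a sketch of the POSIX-class form, the example below trims both leading and trailing whitespace by combining the two anchors with alternation. The input vector is a made-up variation on the labels above; base R's trimws() is shown only as a cross-check.

```r
# Hypothetical labels with whitespace on either side
messy_labels <- c("  Revenue   ", "Costs  ", " Profit")

# "^[[:space:]]+" matches leading whitespace, "[[:space:]]+$" matches trailing
trimmed <- gsub("^[[:space:]]+|[[:space:]]+$", "", messy_labels)

trimmed
# [1] "Revenue" "Costs"   "Profit"

# trimws() gives the same result for spaces like these
identical(trimmed, trimws(messy_labels))
# [1] TRUE
```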

Greedy matching and lazy quantifiers

By default, quantifiers in R’s regex engines are greedy: they match as many characters as possible while still allowing the overall pattern to succeed. This can produce unexpected results when you mean to match the shortest possible substring.

# Greedy: .*  consumes from the first "<" to the LAST ">"
html_tags <- c("<b>bold</b>", "<em>italic</em>")
gsub("<.*>", "", html_tags)

You should see the following output:

Output

[1] "" ""

Within each string element, the greedy .* matched everything between the first < and the last >, deleting the entire string contents. To match the shortest possible span, append ? to the quantifier to make it lazy (.*?). R’s default TRE engine and PCRE2 both accept this syntax; the example below sets perl = TRUE.

# Lazy: .*?  stops at the NEAREST ">"
gsub("<.*?>", "", html_tags, perl = TRUE)

You’ll see the following output:

Output

[1] "bold"   "italic"

.*? stops at the first > it encounters, so each individual tag is removed and the text content is preserved.
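
A negated character class is another common way to keep each match inside a single tag, and it does not depend on lazy quantifiers. The sketch below reuses the same input; it is an alternative technique, not a replacement for the lazy form above.

```r
html_tags <- c("<b>bold</b>", "<em>italic</em>")

# "[^>]*" matches any run of characters that are not ">",
# so the match cannot run past the end of the current tag
stripped <- gsub("<[^>]*>", "", html_tags)

stripped
# [1] "bold"   "italic"
```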

Escaping special characters

The regex metacharacters ., *, +, ?, (, ), [, ], {, }, ^, $, |, and \ must be escaped with \\ to match them literally.

# Replace literal dots in a version string with hyphens
version_string <- "R 4.6.0 released"
gsub("\\.", "-", version_string)

Replacing the dots produces:

Output

[1] "R 4-6-0 released"

Alternatively, pass fixed = TRUE to skip regex parsing entirely. gsub(".", "-", version_string, fixed = TRUE) produces the same result with no escaping required. Use fixed = TRUE whenever the pattern contains no regex syntax and you want to guarantee literal matching. When fixed = TRUE is enabled, regex metacharacters and PCRE features are disabled because the pattern is treated literally, so combining it with perl = TRUE has no effect.
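
The contrast is easiest to see by running the same unescaped pattern both ways. This is a small illustrative sketch using the version string from above.

```r
version_string <- "R 4.6.0 released"

# fixed = TRUE: "." is a literal dot
literal_dot <- gsub(".", "-", version_string, fixed = TRUE)

# Default regex mode: "." is a wildcard, so every character is replaced
regex_dot <- gsub(".", "-", version_string)

literal_dot
# [1] "R 4-6-0 released"
regex_dot
# [1] "----------------"
```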

Advanced pattern matching techniques

The techniques in this section address more complex matching scenarios: case-insensitive replacement, capture group backreferences, and the PCRE2 engine features unlocked by perl = TRUE.

Case-insensitive replacement with `ignore.case = TRUE`

By default, pattern matching is case-sensitive. Setting ignore.case = TRUE applies the pattern regardless of capitalization.

# Normalize mixed-case product labels to a standard form
product_labels <- c("Widget Pro", "WIDGET PRO", "widget pro", "Super Widget Pro")
gsub("\\bwidget pro\\b", "StandardWidget", product_labels, ignore.case = TRUE)

After normalization, the labels read:

Output

[1] "StandardWidget"      "StandardWidget"      "StandardWidget"      "Super StandardWidget"

The \\b word-boundary anchor restricts the match to the whole phrase and prevents it from matching substrings inside longer words. ignore.case = TRUE works with both sub() and gsub() and is compatible with both TRE and PCRE2.

Using backreferences and capture groups

Enclosing part of a pattern in parentheses creates a capture group. In the replacement string, use \\1 for the first group, \\2 for the second, and so on to reuse the captured text at a new position.

# Reformat "First Last" names to "Last, First"
full_names <- c("Alice Johnson", "Bob Martinez", "Carol White")
gsub("(\\w+) (\\w+)", "\\2, \\1", full_names)

The reformatted names are:

Output

[1] "Johnson, Alice" "Martinez, Bob"  "White, Carol"

This simplified example assumes exactly two space-separated words. It does not handle middle names, hyphenated surnames, or apostrophes in names such as "O'Brien". For production use, consider a more specific pattern or a dedicated name-parsing library. Note also that character classes such as \\w behave more consistently with perl = TRUE or ICU-based engines when the input may contain Unicode characters; in TRE’s default mode, \\w matches only ASCII word characters.

Backreferences also work for reformatting structured strings such as dates.

# Reformat ISO dates (YYYY-MM-DD) to US format (MM/DD/YYYY)
iso_dates <- c("2025-03-15", "2024-11-01", "2026-06-04")
gsub("(\\d{4})-(\\d{2})-(\\d{2})", "\\2/\\3/\\1", iso_dates)

Converting to US format returns:

Output

[1] "03/15/2025" "11/01/2024" "06/04/2026"

Enabling PCRE with `perl = TRUE`

Setting perl = TRUE switches the regex engine from TRE to PCRE2.

PCRE2 provides improved Unicode handling and advanced regex features for UTF-8 text. It also unlocks features that TRE does not support: lookaheads, lookbehinds, named capture groups, possessive quantifiers, and the \U and \L case-conversion modifiers in replacement strings.

# Use \U to uppercase each word (PCRE2 case modifier, perl = TRUE required)
product_names <- c("widget pro", "super gadget", "nano device")
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", product_names, perl = TRUE)

Title-casing each word outputs:

Output

[1] "Widget Pro"   "Super Gadget" "Nano Device"

\\U\\1 uppercases the first character of each word and \\L\\2 lowercases the rest. R interprets these case modifiers in the replacement string only when perl = TRUE is set; without it, they are not treated as case-conversion operators.
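
R's replacement syntax also accepts \\E, which ends a case conversion started by \\U or \\L. The sketch below uses made-up email-style strings to show the conversion being limited to the first capture group.

```r
# Hypothetical identifiers; only the part before "@" should be uppercased
emails <- c("alice@example.com", "bob@example.org")

# \\U starts uppercasing, \\E stops it before "@" and the second group
upper_user <- gsub("(\\w+)@(\\w+)", "\\U\\1\\E@\\2", emails, perl = TRUE)

upper_user
# [1] "ALICE@example.com" "BOB@example.org"
```

Without \\E, the uppercase conversion would carry on through the rest of the replacement string, including the second group.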

Lookaheads and lookbehinds (`perl = TRUE` only)

Lookaheads ((?=...)) and lookbehinds ((?<=...)) are zero-width assertions that match a position based on what immediately precedes or follows it, without consuming those surrounding characters. They are PCRE2-only features and require perl = TRUE.

# Insert an underscore between a letter and a digit (e.g., "Revenue2024" to "Revenue_2024")
field_names <- c("Revenue2024", "Costs2024", "Profit2024")
gsub("(?<=[A-Za-z])(?=[0-9])", "_", field_names, perl = TRUE)

The updated field names are:

Output

[1] "Revenue_2024" "Costs_2024"   "Profit_2024"

The lookbehind (?<=[A-Za-z]) asserts that a letter precedes the current position; the lookahead (?=[0-9]) asserts that a digit follows. An underscore is inserted at that position without consuming any of the surrounding characters.
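
Another familiar use of a lookahead is inserting thousands separators. The sketch below is illustrative only (for real reporting, format() with big.mark is usually simpler) and assumes the inputs are plain digit strings.

```r
# Hypothetical digit strings without separators
amounts <- c("1234567", "999", "45000")

# Match a digit only when it is followed by one or more complete
# groups of three digits up to the end of the string
with_commas <- gsub("(\\d)(?=(\\d{3})+$)", "\\1,", amounts, perl = TRUE)

with_commas
# [1] "1,234,567" "999"       "45,000"
```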

Replacing multiple patterns in R

A single gsub() call accepts only one pattern. When you need to replace several different patterns in the same pass, you can chain gsub() calls with Reduce(), or use stringr::str_replace_all(), which natively accepts a named vector of pattern-replacement pairs.

Applying multiple `gsub()` replacements with `Reduce()`

Build a named character vector where each name is the search pattern and each value is its replacement, then pass it through Reduce() to apply each substitution sequentially.

# Expand SMS-style abbreviations in survey responses
survey_responses <- c("pls send info asap", "thx for ur help", "gr8 service btw")

replacements <- c(
  "pls"  = "please",
  "asap" = "as soon as possible",
  "thx"  = "thanks",
  "ur"   = "your",
  "gr8"  = "great",
  "btw"  = "by the way"
)

result <- Reduce(
  function(text, pat) gsub(pat, replacements[pat], text, fixed = TRUE),
  names(replacements),
  init = survey_responses
)
result

Expanding the abbreviations returns:

Output

[1] "please send info as soon as possible"
[2] "thanks for your help"
[3] "great service by the way"

Reduce() applies each gsub() call in order, passing the result of each step as input to the next. The order of entries in replacements matters when one substitution could change a string that a later pattern would also match. The example below shows this hazard concretely:

# Hazardous order: "cat" is replaced by "dog", then "dog" is replaced by "wolf"
chained_replacements <- c("cat" = "dog", "dog" = "wolf")
Reduce(
  function(text, pat) gsub(pat, chained_replacements[pat], text, fixed = TRUE),
  names(chained_replacements),
  init = "my cat and my dog"
)

The chained substitutions produce:

Output

[1] "my wolf and my wolf"

Both "cat" and "dog" ended up as "wolf" because the first substitution transformed "cat" into "dog", which the second substitution then caught. To avoid this, reorder the pairs so later patterns cannot match the output of earlier ones, or route the replacements through temporary placeholder tokens. Note that str_replace_all() with a named vector also applies its replacements one after another, so it does not remove this hazard on its own.
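
One base R way to sidestep the chaining hazard is to find all matches in the original string first and then look each match up in the replacement map. The sketch below does this with gregexpr() and regmatches(); the names `swap`, `text`, and `any_key` are illustrative, and the approach assumes the keys contain no regex metacharacters.

```r
# Hypothetical pattern-to-replacement map with a chaining hazard
swap <- c(cat = "dog", dog = "wolf")
text <- "my cat and my dog"

# One alternation pattern that matches any key: "cat|dog"
any_key <- paste(names(swap), collapse = "|")

# Locate every match in the ORIGINAL text, then swap each match for its value
hits <- gregexpr(any_key, text)
regmatches(text, hits) <- lapply(
  regmatches(text, hits),
  function(found) unname(swap[found])
)

text
# [1] "my dog and my wolf"
```

Because every match is located before any replacement is written, the "dog" produced from "cat" is never scanned again.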

Using `stringr::str_replace_all()` with a named vector

stringr::str_replace_all() accepts the same named vector directly, making multi-pattern replacement more concise.

library(stringr)

result <- str_replace_all(survey_responses, replacements)
result

Using str_replace_all() gives the same result:

Output

[1] "please send info as soon as possible"
[2] "thanks for your help"
[3] "great service by the way"

Both approaches produce identical output. str_replace_all() is the more readable option when the replacement vector grows beyond two or three entries and integrates naturally into |> and %>% pipeline chains.

Applying `gsub()` and `sub()` to data frames

Applying either function to a data frame column follows the same pattern as applying it to a plain character vector: pass the column as x and assign the result back to the same column. The following sections cover single-column replacement, multi-column replacement with lapply(), and a complete real-world data cleaning example.

One important behavior to know before working with real data: both functions pass NA values through unchanged. A vector element that is NA before the call remains NA after it, without producing an error.

# NA values are preserved, not replaced or raised as errors
gsub("cat", "dog", c("cat", NA, "catfish"))

Note that NA is left unchanged:

Output

[1] "dog"     NA        "dogfish"

If your data cleaning pipeline depends on detecting or imputing missing values, handle NA values before calling gsub() rather than relying on the replacement to modify them.
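
A minimal sketch of that ordering follows: record which elements are missing, impute them, and only then run the replacement. The placeholder value "unknown" is an arbitrary choice for illustration.

```r
animals <- c("cat", NA, "catfish")

# Record the missing positions before any text processing
was_missing <- is.na(animals)

# Impute with an explicit placeholder (hypothetical choice)
animals[was_missing] <- "unknown"

cleaned <- gsub("cat", "dog", animals)

cleaned
# [1] "dog"     "unknown" "dogfish"
```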

Replacing values in a single column

To apply gsub() to a data frame column, pass the column as the x argument and assign the result back to the same column.

# Clean a price column from a CSV import
sales_data <- data.frame(
  product = c("Laptop", "Phone", "Tablet", "Monitor"),
  price   = c("$1,299", "$899", "$2,450", "$349"),
  stringsAsFactors = FALSE
)

# The character class [$,] matches either $ or , in a single regex pass
sales_data$price <- as.numeric(gsub("[$,]", "", sales_data$price))
sales_data

The cleaned data frame looks like this:

Output

  product price
1  Laptop  1299
2   Phone   899
3  Tablet  2450
4 Monitor   349

Applying `gsub()` across multiple columns with `lapply()`

To apply the same transformation across several columns without repeating the call, pass a column selector to lapply().

# Strip formatting from phone number columns
contact_data <- data.frame(
  primary_phone   = c("(555) 123-4567", "(555) 987-6543"),
  secondary_phone = c("(555) 111-2222", "(555) 333-4444"),
  stringsAsFactors = FALSE
)

phone_cols <- c("primary_phone", "secondary_phone")

contact_data[phone_cols] <- lapply(contact_data[phone_cols], function(col) {
  gsub("[^0-9]", "", col)
})

contact_data

Stripped phone numbers appear as:

Output

  primary_phone secondary_phone
1    5551234567      5551112222
2    5559876543      5553334444

lapply() returns a list with one cleaned vector per selected column, and assigning that list to contact_data[phone_cols] replaces each of those columns with its cleaned version.
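
When the column names are not known in advance, one option is to select every character column programmatically. The sketch below assumes a small made-up data frame, `contacts`, and leaves non-character columns untouched.

```r
# Hypothetical data frame with mixed column types
contacts <- data.frame(
  name   = c("Ann", "Raj"),
  phone  = c("(555) 123-4567", "(555) 987-6543"),
  visits = c(3L, 7L),
  stringsAsFactors = FALSE
)

# TRUE for each character column, FALSE otherwise
is_text <- vapply(contacts, is.character, logical(1))

# Remove parentheses from every character column in one step
contacts[is_text] <- lapply(contacts[is_text], function(col) gsub("[()]", "", col))

contacts$phone
# [1] "555 123-4567" "555 987-6543"
```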

Practical data cleaning example

The following example cleans a data frame that simulates a raw CSV import with inconsistent number formatting and mixed date separators.

# Simulate a messy CSV import
orders <- data.frame(
  order_id   = c("ORD-001", "ORD-002", "ORD-003"),
  amount     = c("$1,299.00", "$ 450.50", "$3,000.00"),
  order_date = c("01-15-2025", "02/03/2025", "2025.03.10"),
  stringsAsFactors = FALSE
)

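# Step 1: strip currency symbol, whitespace, and commas; convert to numeric
# ([[:space:]] is used because \\s is not read as a whitespace class inside [...] by the default TRE engine)
orders$amount <- as.numeric(gsub("[$,[:space:]]", "", orders$amount))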

# Step 2: normalize all date separators to hyphens
orders$order_date <- gsub("[/.]", "-", orders$order_date)

orders

After both cleaning steps, the data frame shows:

Output

  order_id amount order_date
1  ORD-001 1299.0 01-15-2025
2  ORD-002  450.5 02-03-2025
3  ORD-003 3000.0 2025-03-10

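Two vectorized gsub() calls handle the normalization without explicit loops or external packages. Only the separators are unified, however: the third date is still in year-first order, so the column needs a further reordering step before it can be parsed with a single date format.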
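
As a possible follow-up step, a capture-group pattern can move year-first dates into the same month-first layout as the others before parsing. This sketch assumes every value is either MM-DD-YYYY or YYYY-MM-DD, which holds for the sample data but would need checking on real input.

```r
# Dates as they look after the separator cleanup above
order_dates <- c("01-15-2025", "02-03-2025", "2025-03-10")

# Only strings that start with a four-digit year match and get reordered
month_first <- sub("^(\\d{4})-(\\d{2})-(\\d{2})$", "\\2-\\3-\\1", order_dates)

month_first
# [1] "01-15-2025" "02-03-2025" "03-10-2025"

# With one consistent layout, a single format string parses the column
parsed <- as.Date(month_first, format = "%m-%d-%Y")
```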

`sub()` and `gsub()` vs stringr: when to use which

gsub() and stringr::str_replace_all() both replace all pattern matches in a character vector. The choice between them typically comes down to dependency preferences, pipeline style, and feature needs.

Comparison table: `gsub()` vs `str_replace_all()`

The table below covers the most common decision factors.

Feature	`gsub()`	`str_replace_all()`
Package	base R (no install)	stringr (tidyverse)
Multiple patterns in one call	No	Yes (named vector)
Regex engine (default)	TRE (POSIX ERE)	ICU via stringi
Regex engine (PCRE)	Yes (`perl = TRUE`)	Always ICU
Pipe-friendly	Moderate	High (native with `|>` or `%>%`)
Unicode support	Good (excellent with `perl = TRUE`)	Excellent (ICU always)
External dependency	None	stringr + stringi

Performance considerations on large vectors

For single-pattern replacements on typical vectors, gsub() and str_replace_all() perform comparably. For high-volume workloads with fixed (non-regex) patterns, stringi::stri_replace_all_fixed() is often faster because it is optimized for literal string matching rather than full regex interpretation.

str_replace_all() is often more convenient and may perform better than chaining multiple gsub() calls, depending on the workload. For performance-critical pipelines, always benchmark with your actual data before assuming one approach is faster.

The following example shows stringi in use for a fixed-string replacement:

library(stringi)

# High-performance fixed-string replacement using stringi
product_descriptions <- c("apple and apple pie", "apple juice", "pineapple")
stri_replace_all_fixed(product_descriptions, "apple", "pear")

The replacement returns:

Output

[1] "pear and pear pie" "pear juice"        "pinepear"

Literal replacement affects substrings inside larger words. stri_replace_all_fixed() replaced "apple" inside "pineapple", producing "pinepear". This is expected behavior for any literal-match function. Use stri_replace_all_regex() with a word-boundary pattern (\\bapple\\b) if you need to match only whole words.

Unicode and multibyte string handling

Recent R versions provide significantly improved UTF-8 support across platforms.

For data containing non-ASCII characters, emoji, or text from non-Latin scripts, two options produce more predictable results: use perl = TRUE with gsub() to activate PCRE2’s Unicode support, or switch to str_replace_all(), which always uses the ICU engine through stringi.

If you encounter unexpected matching behavior with multibyte text, inspect encoding with Encoding(x) and normalize to UTF-8 with enc2utf8(x) before calling gsub().
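
The sketch below walks through that check on a single made-up string. The \u escape is used so the example does not depend on the encoding of the script file.

```r
# Hypothetical input containing one non-ASCII character (u with diaeresis)
city <- "Z\u00fcrich"

# Inspect the declared encoding of the string
Encoding(city)

# Normalize to UTF-8 before matching
city_utf8 <- enc2utf8(city)

# Replace the non-ASCII character using the PCRE2 engine
ascii_city <- gsub("\u00fc", "ue", city_utf8, perl = TRUE)

ascii_city
# [1] "Zuerich"
```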

Common errors and how to fix them

The following subsections cover the four most common sources of unexpected behavior when using sub() and gsub(), along with concrete fixes for each.

“invalid regular expression” error

This error appears when the pattern argument contains unescaped metacharacters or malformed syntax. Unmatched parentheses, unescaped square brackets, and unclosed quantifiers are the most common causes.

# Unmatched "(" causes an invalid regular expression error
gsub("(error", "warning", "connection (error) occurred")

R raises an error instead of returning a result:

Output

Error in gsub("(error", "warning", "connection (error) occurred") :
  invalid regular expression '(error', reason 'Missing ')''

Fix the error by escaping the parenthesis with \\(, or by setting fixed = TRUE if you want a literal string match.

# Escaped version
gsub("\\(error", "warning", "connection (error) occurred")

With the parenthesis escaped, R returns:

Output

[1] "connection warning) occurred"

Backslash escaping issues in patterns and replacement strings

R requires double backslashes (\\) in string literals to represent a single regex backslash (\). A regex requiring a literal backslash needs four backslashes in the R source ("\\\\" becomes the two-character regex \\, which the engine interprets as one literal backslash). This same double-backslash rule applies to the replacement argument, not just the pattern. To use a backreference in a replacement string, write "\\1" in R source, which the regex engine sees as \1 and resolves to the first capture group. To insert a literal backslash character into the replacement output, write "\\\\" in R source, which the engine sees as \\ and outputs as a single \.

# Replace backslashes in a Windows-style file path with forward slashes
file_path <- "C:\\Users\\alice\\Documents"
gsub("\\\\", "/", file_path)

The normalized path is:

Output

[1] "C:/Users/alice/Documents"

Unexpected behavior with `fixed = TRUE` vs regex patterns

When fixed = TRUE is set, all regex metacharacters lose their special meaning and are matched literally. Passing a regex pattern such as [0-9]+ to a call with fixed = TRUE will search for the literal string "[0-9]+" rather than digits. If a replacement produces fewer substitutions than expected, check that fixed = TRUE is not unintentionally enabled.
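
A minimal sketch of the difference, using a made-up string that happens to contain the pattern text itself:

```r
# Hypothetical input containing both the literal text "[0-9]+" and real digits
note <- "id [0-9]+ 42"

# fixed = TRUE: only the literal characters "[0-9]+" are matched
as_literal <- gsub("[0-9]+", "#", note, fixed = TRUE)

# Default regex mode: each run of digits is matched instead
as_regex <- gsub("[0-9]+", "#", note)

as_literal
# [1] "id # 42"
as_regex
# [1] "id [#-#]+ #"
```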

Pattern matches more than intended

Unanchored patterns match anywhere in the string, including inside longer words. Use ^ and $ to anchor to the full string, or \\b for whole-word boundaries, to restrict the scope of the match. The ignore.case = TRUE argument combined with \\b is a reliable pattern for case-insensitive whole-word replacement.
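
The three scopes can be compared on one small, illustrative vector:

```r
animals <- c("cat", "catfish", "a cat")

# Unanchored: matches inside "catfish" as well
unanchored <- gsub("cat", "dog", animals)

# Anchored to the full string: only the element that is exactly "cat"
full_string <- gsub("^cat$", "dog", animals)

# Word boundaries, case-insensitive: whole-word "cat" anywhere in the string
whole_word <- gsub("\\bCAT\\b", "dog", animals, ignore.case = TRUE)

unanchored
# [1] "dog"     "dogfish" "a dog"
full_string
# [1] "dog"     "catfish" "a cat"
whole_word
# [1] "dog"     "catfish" "a dog"
```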

FAQs

1. What is the difference between `sub()` and `gsub()` in R?

The primary distinction lies in how many matches are replaced per string element:

sub() replaces only the first occurrence of the specified pattern within each element of a character vector. This is useful when you want to remove or change only the initial match, such as stripping a single prefix, leaving later occurrences unchanged.
gsub() replaces all occurrences of the pattern throughout each string element—making it ideal for global search-and-replace operations or thorough data cleaning.

Both functions share identical syntax and accept the same arguments, so switching from one to the other is as simple as changing the function name. For most data cleaning and normalization workflows, gsub() is more commonly used due to its global matching. Choose sub() when precise control over just the first match is needed.

2. How do I replace multiple different patterns at once using `gsub()` in R?

Base R’s gsub() does not allow replacing several different patterns in one call. However, there are two commonly used strategies:

Chaining with Reduce(): Prepare a named character vector where names are patterns and values are replacements. Use Reduce() to apply a wrapper function over each pattern-replacement pair in sequence, so every match gets replaced one-by-one.
```
patterns <- c("foo" = "bar", "baz" = "qux")
text <- "foo and baz"
Reduce(function(x, y) gsub(y, patterns[y], x), names(patterns), init = text)
```
Using stringr::str_replace_all(): The stringr package’s str_replace_all() natively accepts a named vector of pattern-replacement pairs, enabling you to perform all substitutions in a single line.
```
library(stringr)
str_replace_all(text, patterns)
```

str_replace_all() is recommended when handling more than a couple of patterns, as it produces cleaner and more maintainable code, particularly within a tidyverse workflow.

3. How do I make `gsub()` case-insensitive in R?

To enable case-insensitive matching in gsub(), simply set the ignore.case argument to TRUE:

gsub("hello", "hi", x, ignore.case = TRUE)

This replaces all variations of "hello" regardless of capitalization ("Hello", "HELLO", etc.). If you want to avoid matching parts of other words (e.g., replacing "hello" inside "chello"), combine ignore.case = TRUE with the word boundary anchor:

gsub("\\bhello\\b", "hi", x, ignore.case = TRUE)

This ensures only whole-word matches are replaced.

4. What does `perl = TRUE` do in `gsub()` and `sub()`?

The perl = TRUE argument tells R to use the PCRE2 engine instead of the default TRE engine. Enabling this unlocks powerful regex features such as:

Lookaheads ((?=...)) and lookbehinds ((?<=...))
Named capture groups ((?<name>...))
Possessive quantifiers (*+, ++, etc.)
Case conversion operators (\\U, \\L, \\E) for changing case in replacements
Improved Unicode support for working with UTF-8 data

If your pattern or replacement relies on advanced regex capabilities only available in PCRE2, set perl = TRUE to activate them.

5. How do I use backreferences in `gsub()` replacement strings?

Backreferences let you re-use matched groups from the pattern in the replacement string. To do this:

Place the portion you want to capture in parentheses inside the pattern argument. Each pair of parentheses defines a numbered capture group: (...).
In the replacement argument, reference the groups using \\1, \\2, etc., where the number corresponds to the order of the groups.

For example:

gsub("(\\w+) (\\w+)", "\\2, \\1", "John Smith")
# Result: "Smith, John"

If using perl = TRUE, you can further enhance the match with case-modifying escapes: \\U\\1 uppercases a group, \\L\\2 lowercases, and so forth—enabling sophisticated string transformations.

6. How do I apply `gsub()` to a column in a data frame?

To use gsub() on a single column, simply pass the column and assign the result back:

df$column <- gsub("pattern", "replacement", df$column)

For multiple columns, leverage lapply() to apply the replacement across all desired columns in one step:

df[cols] <- lapply(df[cols], function(col) gsub("pattern", "replacement", col))

This approach is scalable and avoids repeatedly writing out the same gsub() call for each column. It is especially valuable for cleaning and standardizing large datasets.
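Here is a small end-to-end sketch of both patterns. The data frame `customers` and its column names are invented for illustration and do not come from a real dataset.

```r
# Hypothetical data frame with messy text columns
customers <- data.frame(
  id    = 1:3,
  first = c("  Ann", "Bob ", " Cy "),
  last  = c("Lee  ", " Ray", "Poe"),
  price = c("$1,200", "$85", "$3,450"),
  stringsAsFactors = FALSE
)

# Single column: strip "$" and "," and then convert to numeric
customers$price <- as.numeric(gsub("[$,]", "", customers$price))

# Multiple columns: trim leading and trailing whitespace in one step
name_cols <- c("first", "last")
customers[name_cols] <- lapply(
  customers[name_cols],
  function(col) gsub("^\\s+|\\s+$", "", col)
)
```

After these calls, `customers$price` holds the numbers 1200, 85, and 3450, and the name columns contain no surrounding spaces. Assigning back into `customers[name_cols]` keeps the object a data frame and leaves the `id` column untouched.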

7. Should I use `gsub()` or `stringr::str_replace_all()` in R?

Choose based on your context and project requirements:

Use gsub() when you want a solution with no external dependencies, when relying on PCRE2-specific features via perl = TRUE, or if performance benchmarks indicate it’s optimal for your use case. It is native to base R and requires no extra installation.
Use stringr::str_replace_all() for tidyverse-style code, when you need to handle multiple patterns simultaneously, or when you desire consistent and modern Unicode handling (ICU engine via stringi). It also typically integrates better in data pipelines and can improve readability with named vectors of replacements.

Your choice should depend on your workflow preferences, desired features, and the complexity of the replacement task at hand.
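To compare the two styles on a multi-pattern task, the following sketch applies the same named vector of replacements with base R's Reduce() and, if the stringr package happens to be installed, with str_replace_all(). The names `spellings` and `british_text` are assumptions made for this example.

```r
# Hypothetical lookup: names are patterns, values are replacements
spellings <- c(colour = "color", centre = "center", favourite = "favorite")
british_text <- c("The colour of the centre", "My favourite colour")

# Base R: fold gsub() over the patterns, feeding each result into the next call
base_result <- Reduce(
  function(acc, pat) gsub(pat, spellings[[pat]], acc, fixed = TRUE),
  names(spellings),
  british_text
)
# base_result is c("The color of the center", "My favorite color")

# stringr: the named vector is passed directly (runs only if stringr is installed)
if (requireNamespace("stringr", quietly = TRUE)) {
  tidy_result <- stringr::str_replace_all(british_text, spellings)
}
```

The base R version needs no extra packages, while the stringr version expresses the same intent in a single call.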

8. Why is `gsub()` not replacing my pattern as expected?

Common reasons gsub() does not replace your pattern as expected:

- Unescaped special regex characters: Regex metacharacters (such as ., *, +, [, ], (, ), |, ?, ^, and $) have special meanings in patterns. To match them literally, either set fixed = TRUE to treat the pattern as a plain string (no regex), or prefix each metacharacter with a double backslash, e.g., \\. to match a literal dot.
- Using PCRE2-only features without perl = TRUE: Advanced constructs such as lookahead ((?=...)), lookbehind ((?<=...)), and named capture groups are only available when perl = TRUE is set in the function call.
- Encoding issues with non-ASCII strings: If your input contains non-ASCII (Unicode) characters, encoding mismatches can prevent patterns from matching. Check the declared encoding with Encoding(x) and, if needed, convert to UTF-8 with enc2utf8(x) before running gsub().
- Unanchored patterns matching more text than intended: A pattern without anchors can match inside larger words or substrings, leading to unintended replacements. Add the word boundary anchor (\\b) or the string start (^) and end ($) anchors to restrict matches to the desired scope.

Tip: Test your pattern before applying gsub(). grepl(pattern, x) confirms which elements match at all, and regmatches(x, gregexpr(pattern, x)) shows the exact text each match covers.
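The checks above can be sketched as follows, using small made-up strings. All variable names here are illustrative.

```r
# An unescaped dot matches any character, so every character is replaced
dot_regex   <- gsub(".", "_", "a.b")                  # "___"
dot_escaped <- gsub("\\.", "_", "a.b")                # "a_b"
dot_fixed   <- gsub(".", "_", "a.b", fixed = TRUE)    # "a_b"

# grepl() reports which elements contain a match
files <- c("file.txt", "filetxt", "notes.md")
has_txt <- grepl("\\.txt$", files)                    # TRUE FALSE FALSE

# regmatches() + gregexpr() show exactly what text the pattern matched
phrase <- "cat concat"
unanchored <- regmatches(phrase, gregexpr("cat", phrase))[[1]]        # "cat" "cat"
whole_word <- regmatches(phrase, gregexpr("\\bcat\\b", phrase))[[1]]  # "cat"
```

Seeing two matches for the unanchored pattern and one for the anchored pattern makes it clear that the first version would also alter "concat".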

Conclusion

sub() and gsub() are foundational base R functions for string replacement that work on character vectors, data frame columns, and any object coercible to character, with no external dependencies. sub() targets the first match per element; gsub() replaces every match. Full regex support with the TRE engine covers most use cases, and perl = TRUE unlocks PCRE2 features including lookaheads, lookbehinds, and case modifiers for advanced formatting tasks. For replacing multiple patterns in a single call or for tidyverse pipeline compatibility, stringr::str_replace_all() is the natural complement.

About the author(s)

Prajwal CN

Author

Bradley Kouchi

Editor

Manikandan Kurup

Editor

Senior Technical Content Engineer I

With over 6 years of experience in tech publishing, Mani has edited and published more than 75 books covering a wide range of data science topics. Known for his strong attention to detail and technical knowledge, Mani specializes in creating clear, concise, and easy-to-understand content tailored for developers.

Category:

Tutorial

Tags:

R Programming

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
