By Prajwal CN, Bradley Kouchi and Manikandan Kurup

R’s sub() and gsub() functions are the primary base R tools for string pattern replacement. Both accept a pattern, a replacement string, and a character vector or data frame column, and both support full regular expression syntax by default. The key difference is scope: sub() replaces only the first match in each string element, while gsub() replaces all matches globally. Modeled on Unix sed conventions and included in base R since its initial release, both functions require no external packages and are universally available across every R installation.
As of modern R releases, sub() and gsub() remain core, stable functions used daily for data cleaning, text preprocessing, log parsing, and column standardization.
In this tutorial, you will learn the syntax and parameters of both functions, use regular expressions with character classes, anchors, and capture group backreferences, explore advanced features including case-insensitive matching and the PCRE2 (Perl Compatible Regular Expressions version 2) engine enabled by perl = TRUE, replace multiple patterns at once using Reduce() and stringr::str_replace_all(), apply gsub() to data frame columns with practical cleaning examples, and compare gsub() against stringr and stringi alternatives.
Key Takeaways
- Use sub() to replace only the first occurrence of a pattern per string element; use gsub() to replace all occurrences.
- Use fixed = TRUE to match a literal string with no regex interpretation.
- Set perl = TRUE to switch from the default TRE engine to PCRE2, which adds lookaheads, lookbehinds, named capture groups, and \U/\L case modifiers in replacement strings.
- Use \\1, \\2, etc. in the replacement string to reference capture groups from the pattern.
- To replace multiple patterns, use stringr::str_replace_all() with a named vector, or chain gsub() calls with Reduce().
- Apply gsub() to a data frame column directly: df$col <- gsub("pattern", "replacement", df$col).
- Use ignore.case = TRUE for case-insensitive matching. Combine with \\b for whole-word replacement, but note that word-boundary behavior on non-ASCII text differs between TRE and PCRE2, so test with representative data.
- Both functions pass NA values through unchanged without error. Handle missing values explicitly before calling sub() or gsub() if your pipeline depends on detecting or imputing them.
To follow this tutorial, you will need:
- The stringr package for the comparison section, installed with install.packages("stringr") (optional).
What are the sub() and gsub() functions?
sub() and gsub() are base R string replacement functions that find a pattern in a character vector and substitute it with a replacement string. The pattern can be a fixed literal or a regular expression. They are vectorized, meaning they process each element of the input vector independently in a single call.
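As a minimal sketch of that vectorized behavior, the snippet below runs both functions over the same small vector. The vector `codes` and the result names are made up for illustration.

```r
# Hypothetical input: three strings with zero, one, or two hyphens
codes <- c("a-b-c", "d-e", "f")

# sub() changes only the first hyphen in each element
first_only <- sub("-", "+", codes)
first_only
# [1] "a+b-c" "d+e"   "f"

# gsub() changes every hyphen in each element
all_hits <- gsub("-", "+", codes)
all_hits
# [1] "a+b+c" "d+e"   "f"
```

Each element is processed on its own, and elements without a match are returned unchanged.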
sub() function: syntax and parameters
The full syntax for sub() is:
sub(pattern, replacement, x,
ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Here’s what each function argument means:
| Parameter | Type | Description |
|---|---|---|
| pattern | character | The string or regex to search for. |
| replacement | character | The string to substitute in place of each match. Supports backreferences (\\1, \\2, etc.). |
| x | character vector | The input to search. Data frame columns are passed as vectors. |
| ignore.case | logical | When TRUE, matching is case-insensitive. Default: FALSE. |
| perl | logical | When TRUE, uses the PCRE2 regex engine instead of TRE. Default: FALSE. |
| fixed | logical | When TRUE, pattern is treated as a literal string, not a regex. Default: FALSE. |
| useBytes | logical | When TRUE, matching is byte-by-byte. Rarely needed. Default: FALSE. |
gsub() function: syntax and parameters
gsub() shares identical syntax and parameters with sub(). The replacement string supports the same backreference notation.
gsub(pattern, replacement, x,
ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
The only behavioral difference between the two functions is described in the following table:
| Feature | sub() | gsub() |
|---|---|---|
| Replacements per element | First match only | All matches |
| Typical use case | Remove or change a leading token | Global search-and-replace |
| Syntax | Identical to gsub() | Identical to sub() |
| Regex support | Yes | Yes |
| Performance on long strings | Slightly faster (stops at first match) | Scans the entire string |
Let’s look at how to use sub() and gsub(). This section demonstrates the core behavior of each function with practical examples.
Note: R’s console pads element spacing for readability based on the width of the longest element in the output; exact whitespace may vary slightly from what is shown here.
Using sub()
sub() scans each string element and stops after it finds and replaces the first match.
# Replace the first occurrence of "R" in a tutorial description
tutorial_text <- "In this tutorial, we will install R and add packages from CRAN."
sub("R", "The R Language", tutorial_text)
Running this command produces the following output:
[1] "In this tutorial, we will install The R Language and add packages from CRAN."
Only the first "R" is replaced. sub() replaces only the first match found in each string element, so later matches in the same string remain unchanged.
Using gsub()
gsub() scans the entire string and replaces every match it finds. Using word-boundary anchors (\\b) ensures the pattern matches only the intended whole-word token rather than any substring.
# Replace ALL standalone occurrences of "R" using word boundaries
tutorial_text <- "R is open-source. Use R for data analysis and R for visualization."
gsub("\\bR\\b", "The R Language", tutorial_text)
The output looks like this:
[1] "The R Language is open-source. Use The R Language for data analysis and The R Language for visualization."
Every standalone "R" is replaced. The \\b word-boundary anchor is important here: without it, the bare pattern "R" would also match the letter inside words such as "CRAN" or "TRE", producing unintended substitutions.
Using sub() and gsub() on a data frame column
Both functions accept a data frame column vector as the x argument. Pass the column and assign the result back to update the data frame.
# Create a sample data frame
marine_species <- data.frame(
creature = c("Starfish", "Blue Crab", "Bluefin Tuna", "Blue Shark", "Blue Whale"),
population_millions = c(5, 6, 4, 2, 2),
stringsAsFactors = FALSE
)
# sub(): replace the first "Blue" in each element
sub("Blue", "Green", marine_species$creature)
Running this command produces the following output:
[1] "Starfish" "Green Crab" "Greenfin Tuna" "Green Shark" "Green Whale"
Notice that "Bluefin Tuna" becomes "Greenfin Tuna" because the substring "Blue" appears at the start of "Bluefin". sub() is not word-aware and matches any occurrence, including inside longer words. To write the result back into the data frame, assign directly to the column:
marine_species$creature <- sub("Blue", "Green", marine_species$creature)
Regular expressions with sub() and gsub()
Both functions treat pattern as a regular expression by default using R’s TRE engine (a modified POSIX ERE implementation). The following examples cover the most common regex constructs.
Regex engine quick reference: Base R uses TRE regex by default. Setting perl = TRUE switches to PCRE2. The stringr package uses ICU regex through stringi, which has slightly different syntax and behavior from both TRE and PCRE2. Features available in one engine may not be available in another.
A character class, written inside square brackets, matches any one character from the defined set. Negating the class with ^ matches any character NOT in the set.
# Remove all digits from product codes
product_codes <- c("SKU-1234", "SKU-5678", "SKU-ABCD")
gsub("[0-9]", "", product_codes)
Stripping the digits produces the following:
[1] "SKU-" "SKU-" "SKU-ABCD"
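The negated form mentioned above can be sketched with the same vector: `[^0-9]` matches every character that is not a digit, so removing those matches keeps only the digits. The result name `digits_only` is illustrative.

```r
# Keep only the digits by deleting every non-digit character
product_codes <- c("SKU-1234", "SKU-5678", "SKU-ABCD")
digits_only <- gsub("[^0-9]", "", product_codes)
digits_only
# [1] "1234" "5678" ""
```

The third element becomes an empty string because it contains no digits at all.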
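The dot (.) wildcard matches any single character. With perl = TRUE, the dot does not match a newline by default, while the default TRE engine lets it match a newline as well. The following example shows . matching any character between "l" and "g", including letters, digits, and punctuation: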
variants <- c("log", "lag", "l9g", "lg")
gsub("l.g", "[match]", variants)
This results in the following output:
[1] "[match]" "[match]" "[match]" "lg"
"lg" is unchanged because there is no character between "l" and "g" for the dot to match. Quantifiers like + (one or more) and * (zero or more) extend a match across multiple characters. When using advanced regex constructs consistently across environments, perl = TRUE may produce more predictable behavior.
Anchors (^ and $) in patterns
^ matches the position at the start of a string; $ matches the position at the end. These anchors are useful for stripping leading or trailing content without affecting the interior of the string.
# Remove trailing whitespace from column labels
messy_labels <- c("Revenue ", "Costs ", "Profit ")
gsub("\\s+$", "", messy_labels)
The cleaned labels appear as:
[1] "Revenue" "Costs" "Profit"
In R, regex escape sequences require double backslashes in string literals: \\s in the R source becomes \s (the whitespace class) in the regex engine. R’s TRE engine supports \s as a documented extension to POSIX ERE, so the example above works without perl = TRUE. For maximum portability outside R or across regex tools, the strictly POSIX-compatible equivalent is [[:space:]].
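To illustrate the POSIX class together with both anchors, here is a small sketch that trims leading and trailing whitespace in one call. The vector `messy` is a made-up example.

```r
# Hypothetical labels with leading and/or trailing whitespace
messy <- c("  Revenue ", "\tCosts", "Profit  ")

# ^...+ strips leading whitespace, ...+$ strips trailing whitespace
trimmed_both <- gsub("^[[:space:]]+|[[:space:]]+$", "", messy)
trimmed_both
# [1] "Revenue" "Costs"   "Profit"

# sub() with only the ^ branch removes leading whitespace and keeps the rest
trimmed_left <- sub("^[[:space:]]+", "", messy)
```

For this particular task, base R's trimws() gives the same result on these inputs and may read more clearly.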
By default, quantifiers in R’s regex engines are greedy: they match as many characters as possible while still allowing the overall pattern to succeed. This can produce unexpected results when you mean to match the shortest possible substring.
# Greedy: .* consumes from the first "<" to the LAST ">"
html_tags <- c("<b>bold</b>", "<em>italic</em>")
gsub("<.*>", "", html_tags)
You should see the following output:
[1] "" ""
Within each string element, the greedy .* matched everything between the first < and the last >, deleting the entire string contents. To match the shortest possible span, use a lazy quantifier (.*?). R's default TRE engine also accepts lazy quantifiers; the example below sets perl = TRUE to run the pattern through PCRE2.
# Lazy: .*? stops at the NEAREST ">"
gsub("<.*?>", "", html_tags, perl = TRUE)
You’ll see the following output:
[1] "bold" "italic"
.*? stops at the first > it encounters, so each individual tag is removed and the text content is preserved.
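An alternative that does not depend on lazy quantifiers is a negated character class. The sketch below matches "<", then any run of characters that are not ">", then ">", so the match cannot run past the nearest closing bracket. The result name is illustrative.

```r
# Negated class: [^>]* can never cross a ">" character
html_tags <- c("<b>bold</b>", "<em>italic</em>")
stripped <- gsub("<[^>]*>", "", html_tags)
stripped
# [1] "bold"   "italic"
```

This is fine for simple, well-formed snippets like these; it is not a general HTML parser.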
The regex metacharacters ., *, +, ?, (, ), [, ], {, }, ^, $, |, and \ must be escaped with \\ to match them literally.
# Replace literal dots in a version string with hyphens
version_string <- "R 4.6.0 released"
gsub("\\.", "-", version_string)
Replacing the dots produces:
[1] "R 4-6-0 released"
Alternatively, pass fixed = TRUE to skip regex parsing entirely. gsub(".", "-", version_string, fixed = TRUE) produces the same result with no escaping required. Use fixed = TRUE whenever the pattern contains no regex syntax and you want to guarantee literal matching. When fixed = TRUE is enabled, regex metacharacters and PCRE features are disabled because the pattern is treated literally, so combining it with perl = TRUE has no effect.
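The difference between regex and literal matching is easiest to see side by side. In this sketch, the inputs are made up for illustration.

```r
samples <- c("a.b", "axb")

# Regex mode: the dot is a wildcard, so both elements match
as_regex <- gsub("a.b", "X", samples)
as_regex
# [1] "X" "X"

# Literal mode: only the element containing an actual dot matches
as_literal <- gsub("a.b", "X", samples, fixed = TRUE)
as_literal
# [1] "X"   "axb"

# Parentheses need no escaping when fixed = TRUE
relabelled <- gsub("(draft)", "[final]", "report (draft) v2", fixed = TRUE)
relabelled
# [1] "report [final] v2"
```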
The techniques in this section address more complex matching scenarios: case-insensitive replacement, capture group backreferences, and the PCRE2 engine features unlocked by perl = TRUE.
Case-insensitive matching with ignore.case = TRUE
By default, pattern matching is case-sensitive. Setting ignore.case = TRUE applies the pattern regardless of capitalization.
# Normalize mixed-case product labels to a standard form
product_labels <- c("Widget Pro", "WIDGET PRO", "widget pro", "Super Widget Pro")
gsub("\\bwidget pro\\b", "StandardWidget", product_labels, ignore.case = TRUE)
After normalization, the labels read:
[1] "StandardWidget" "StandardWidget" "StandardWidget" "Super StandardWidget"
The \\b word-boundary anchor restricts the match to the whole phrase and prevents it from matching substrings inside longer words. ignore.case = TRUE works with both sub() and gsub() and is compatible with both TRE and PCRE2.
Enclosing part of a pattern in parentheses creates a capture group. In the replacement string, use \\1 for the first group, \\2 for the second, and so on to reuse the captured text at a new position.
# Reformat "First Last" names to "Last, First"
full_names <- c("Alice Johnson", "Bob Martinez", "Carol White")
gsub("(\\w+) (\\w+)", "\\2, \\1", full_names)
The reformatted names are:
[1] "Johnson, Alice" "Martinez, Bob" "White, Carol"
This simplified example assumes exactly two space-separated words. It does not handle middle names, hyphenated surnames, or apostrophes in names such as "O'Brien". For production use, consider a more specific pattern or a dedicated name-parsing library. Note also that character classes such as \\w can behave differently across engines when the input contains non-ASCII characters: TRE's interpretation depends on the locale, while with perl = TRUE \\w matches only ASCII word characters by default unless the pattern begins with (*UCP). Test against representative data before relying on either engine.
Backreferences also work for reformatting structured strings such as dates.
# Reformat ISO dates (YYYY-MM-DD) to US format (MM/DD/YYYY)
iso_dates <- c("2025-03-15", "2024-11-01", "2026-06-04")
gsub("(\\d{4})-(\\d{2})-(\\d{2})", "\\2/\\3/\\1", iso_dates)
Converting to US format returns:
[1] "03/15/2025" "11/01/2024" "06/04/2026"
Using perl = TRUE
Setting perl = TRUE switches the regex engine from TRE to PCRE2.
PCRE2 provides improved Unicode handling and advanced regex features for UTF-8 text. It also unlocks features that TRE does not support: lookaheads, lookbehinds, named capture groups, possessive quantifiers, and the \U and \L case-conversion modifiers in replacement strings.
# Use \U to uppercase each word (PCRE2 case modifier, perl = TRUE required)
product_names <- c("widget pro", "super gadget", "nano device")
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", product_names, perl = TRUE)
Title-casing each word outputs:
[1] "Widget Pro" "Super Gadget" "Nano Device"
\\U\\1 uppercases the first character of each word and \\L\\2 lowercases the rest. R interprets these case-conversion escapes in the replacement string only when perl = TRUE; without it they are not treated as case modifiers.
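Two further sketches of the case escapes, using made-up inputs: the first converts a snake_case identifier to CamelCase, and the second uses \\E to stop the conversion partway through the replacement.

```r
# Uppercase the letter that follows the start of the string or an underscore,
# dropping the underscore itself (group 1 is not reused in the replacement)
camel <- gsub("(^|_)([a-z])", "\\U\\2", "snake_case_name", perl = TRUE)
camel
# [1] "SnakeCaseName"

# \\E ends the case conversion started by \\U, so only group 1 is uppercased
shout_user <- sub("(\\w+)@(\\w+)", "\\U\\1\\E@\\2", "alice@example", perl = TRUE)
shout_user
# [1] "ALICE@example"
```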
Lookaheads and lookbehinds (perl = TRUE only)
Lookaheads ((?=...)) and lookbehinds ((?<=...)) are zero-width assertions that match a position based on what immediately precedes or follows it, without consuming those surrounding characters. They are PCRE2-only features and require perl = TRUE.
# Insert an underscore between a letter and a digit (e.g., "Revenue2024" to "Revenue_2024")
field_names <- c("Revenue2024", "Costs2024", "Profit2024")
gsub("(?<=[A-Za-z])(?=[0-9])", "_", field_names, perl = TRUE)
The updated field names are:
[1] "Revenue_2024" "Costs_2024" "Profit_2024"
The lookbehind (?<=[A-Za-z]) asserts that a letter precedes the current position; the lookahead (?=[0-9]) asserts that a digit follows. The underscore is inserted at that position without consuming any of the surrounding characters.
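The same position-matching idea can insert thousands separators. This sketch assumes the inputs are plain digit strings with no sign or decimal part; the vector `raw_counts` is hypothetical.

```r
# Insert a comma at every position that has a digit before it and
# a multiple of three digits between it and the end of the string
raw_counts <- c("1234567", "999", "1000", "12")
with_commas <- gsub("(?<=\\d)(?=(\\d{3})+$)", ",", raw_counts, perl = TRUE)
with_commas
# [1] "1,234,567" "999"       "1,000"     "12"
```

For numeric values rather than strings, format(x, big.mark = ",") is usually the simpler choice.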
A single gsub() call accepts only one pattern. When you need to replace several different patterns in the same pass, you can chain gsub() calls with Reduce(), or use stringr::str_replace_all(), which natively accepts a named vector of pattern-replacement pairs.
Chaining gsub() replacements with Reduce()
Build a named character vector where each name is the search pattern and each value is its replacement, then pass it through Reduce() to apply each substitution sequentially.
# Expand SMS-style abbreviations in survey responses
survey_responses <- c("pls send info asap", "thx for ur help", "gr8 service btw")
replacements <- c(
"pls" = "please",
"asap" = "as soon as possible",
"thx" = "thanks",
"ur" = "your",
"gr8" = "great",
"btw" = "by the way"
)
result <- Reduce(
function(text, pat) gsub(pat, replacements[pat], text, fixed = TRUE),
names(replacements),
init = survey_responses
)
result
Expanding the abbreviations returns:
[1] "please send info as soon as possible"
[2] "thanks for your help"
[3] "great service by the way"
Reduce() applies each gsub() call in order, passing the result of each step as input to the next. The order of entries in replacements matters when one substitution could change a string that a later pattern would also match. The example below shows this hazard concretely:
# Hazardous order: "cat" is replaced by "dog", then "dog" is replaced by "wolf"
chained_replacements <- c("cat" = "dog", "dog" = "wolf")
Reduce(
function(text, pat) gsub(pat, chained_replacements[pat], text, fixed = TRUE),
names(chained_replacements),
init = "my cat and my dog"
)
The chained substitutions produce:
[1] "my wolf and my wolf"
Both "cat" and "dog" ended up as "wolf" because the first substitution transformed "cat" into "dog", which the second substitution then caught. To avoid this, reorder the pairs so later patterns cannot match the output of earlier ones. Note that str_replace_all() with a named vector also applies its pairs one after another to the evolving string, so switching to it does not remove this hazard by itself.
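One base R way to apply every replacement against the original text in a single pass is to match all patterns with one alternation and look each hit up in the named vector. This is a sketch under two assumptions: the names contain no regex metacharacters, and no name is a prefix of another. The variable names are illustrative.

```r
swap_map <- c("cat" = "dog", "dog" = "wolf")
sentence <- "my cat and my dog"

# Build a single pattern that matches any of the names: "cat|dog"
any_name <- paste(names(swap_map), collapse = "|")

# Locate all hits, then replace each hit with its own mapped value
hits <- gregexpr(any_name, sentence)
regmatches(sentence, hits) <- lapply(
  regmatches(sentence, hits),
  function(found) unname(swap_map[found])
)
sentence
# [1] "my dog and my wolf"
```

Because every hit is located before any replacement happens, the output of one substitution is never rescanned by another.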
Using stringr::str_replace_all() with a named vector
stringr::str_replace_all() accepts the same named vector directly, making multi-pattern replacement more concise.
library(stringr)
result <- str_replace_all(survey_responses, replacements)
result
Using str_replace_all() gives the same result:
[1] "please send info as soon as possible"
[2] "thanks for your help"
[3] "great service by the way"
Both approaches produce identical output. str_replace_all() is the more readable option when the replacement vector grows beyond two or three entries and integrates naturally into |> and %>% pipeline chains.
Applying gsub() and sub() to data frames
Applying either function to a data frame column follows the same pattern as applying it to a plain character vector: pass the column as x and assign the result back to the same column. The following sections cover single-column replacement, multi-column replacement with lapply(), and a complete real-world data cleaning example.
One important behavior to know before working with real data: both functions pass NA values through unchanged. A vector element that is NA before the call remains NA after it, without producing an error.
# NA values are preserved, not replaced or raised as errors
gsub("cat", "dog", c("cat", NA, "catfish"))
Note that NA is left unchanged:
[1] "dog" NA "dogfish"
If your data cleaning pipeline depends on detecting or imputing missing values, handle NA values before calling gsub() rather than relying on the replacement to modify them.
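A minimal sketch of handling missing values first, assuming that the placeholder "unknown" is an acceptable fill value for your data (that choice is an assumption, not a recommendation):

```r
animals <- c("cat", NA, "catfish")

# Record which elements were missing, then fill them explicitly
was_missing <- is.na(animals)
animals[was_missing] <- "unknown"

cleaned <- gsub("cat", "dog", animals)
cleaned
# [1] "dog"     "unknown" "dogfish"
```

Keeping `was_missing` around lets later steps distinguish imputed values from real ones.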
To apply gsub() to a data frame column, pass the column as the x argument and assign the result back to the same column.
# Clean a price column from a CSV import
sales_data <- data.frame(
product = c("Laptop", "Phone", "Tablet", "Monitor"),
price = c("$1,299", "$899", "$2,450", "$349"),
stringsAsFactors = FALSE
)
# The character class [$,] matches either $ or , in a single regex pass
sales_data$price <- as.numeric(gsub("[$,]", "", sales_data$price))
sales_data
The cleaned data frame looks like this:
product price
1 Laptop 1299
2 Phone 899
3 Tablet 2450
4 Monitor 349
Applying gsub() across multiple columns with lapply()
To apply the same transformation across several columns without repeating the call, pass a column selector to lapply().
# Strip formatting from phone number columns
contact_data <- data.frame(
primary_phone = c("(555) 123-4567", "(555) 987-6543"),
secondary_phone = c("(555) 111-2222", "(555) 333-4444"),
stringsAsFactors = FALSE
)
phone_cols <- c("primary_phone", "secondary_phone")
contact_data[phone_cols] <- lapply(contact_data[phone_cols], function(col) {
gsub("[^0-9]", "", col)
})
contact_data
Stripped phone numbers appear as:
primary_phone secondary_phone
1 5551234567 5551112222
2 5559876543 5553334444
lapply() returns a list that R automatically assigns back to the selected columns.
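When the target columns are not known in advance, one option is to select every character column programmatically. The data frame `survey` below is made up for illustration.

```r
survey <- data.frame(
  name  = c(" Ana ", "Ben  "),
  city  = c("Lima ", " Oslo"),
  score = c(4.5, 3.0),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns hold character data
is_text <- vapply(survey, is.character, logical(1))

# Trim leading and trailing whitespace in those columns only
survey[is_text] <- lapply(survey[is_text], function(col) {
  gsub("^\\s+|\\s+$", "", col)
})
```

The numeric `score` column is skipped, so it keeps its type and values.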
The following example cleans a data frame that simulates a raw CSV import with inconsistent number formatting and mixed date separators.
# Simulate a messy CSV import
orders <- data.frame(
order_id = c("ORD-001", "ORD-002", "ORD-003"),
amount = c("$1,299.00", "$ 450.50", "$3,000.00"),
order_date = c("01-15-2025", "02/03/2025", "2025.03.10"),
stringsAsFactors = FALSE
)
# Step 1: strip currency symbol, spaces, and commas; convert to numeric
# [:space:] is used because TRE does not treat \\s as a whitespace class inside [...]
orders$amount <- as.numeric(gsub("[$,[:space:]]", "", orders$amount))
# Step 2: normalize all date separators to hyphens
orders$order_date <- gsub("[/.]", "-", orders$order_date)
orders
After both cleaning steps, the data frame shows:
order_id amount order_date
1 ORD-001 1299.0 01-15-2025
2 ORD-002 450.5 02-03-2025
3 ORD-003 3000.0 2025-03-10
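Two vectorized gsub() calls handle the cleanup without explicit loops or external packages. Note that Step 2 unifies only the separators: the third date is still year-first, while the first two are month-first.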
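If the field order of the dates also needs to be unified, a backreference can reorder the parts. This sketch assumes that any value shaped like NN-NN-NNNN is month-day-year; that is an assumption about the data, and the vector below simply repeats the separator-normalized values.

```r
order_dates <- c("01-15-2025", "02-03-2025", "2025-03-10")

# Reorder MM-DD-YYYY to YYYY-MM-DD; values that do not match are left as-is
iso_dates <- sub("^(\\d{2})-(\\d{2})-(\\d{4})$", "\\3-\\1-\\2", order_dates)
iso_dates
# [1] "2025-01-15" "2025-02-03" "2025-03-10"
```

The anchors ^ and $ make sure the pattern only rewrites values that have exactly this shape.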
sub() and gsub() vs stringr: when to use which
gsub() and stringr::str_replace_all() both replace all pattern matches in a character vector. The choice between them typically comes down to dependency preferences, pipeline style, and feature needs.
gsub() vs str_replace_all()
The table below covers the most common decision factors.
| Feature | gsub() | str_replace_all() |
|---|---|---|
| Package | base R (no install) | stringr (tidyverse) |
| Multiple patterns in one call | No | Yes (named vector) |
| Regex engine (default) | TRE (POSIX ERE) | ICU via stringi |
| Regex engine (PCRE) | Yes (perl = TRUE) | Always ICU |
| Pipe-friendly | Moderate | High (works with the native pipe and %>%) |
| Unicode support | Good (excellent with perl = TRUE) | Excellent (ICU always) |
| External dependency | None | stringr + stringi |
For single-pattern replacements on typical vectors, gsub() and str_replace_all() perform comparably. For high-volume workloads with fixed (non-regex) patterns, stringi::stri_replace_all_fixed() is often faster because it is optimized for literal string matching rather than full regex interpretation.
str_replace_all() is often more convenient and may perform better than chaining multiple gsub() calls, depending on the workload. For performance-critical pipelines, always benchmark with your actual data before assuming one approach is faster.
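A rough way to benchmark with base R alone is system.time(). The sketch below compares regex and fixed matching in gsub() on a synthetic vector; the input and its size are arbitrary assumptions, and timings will differ by machine, so no expected numbers are shown.

```r
# Synthetic workload: 100,000 copies of the same short string
big_input <- rep("apple and apple pie", 1e5)

# Time the default regex engine against literal matching
time_regex <- system.time(out_regex <- gsub("apple", "pear", big_input))
time_fixed <- system.time(out_fixed <- gsub("apple", "pear", big_input, fixed = TRUE))

# Both approaches must agree before their timings are worth comparing
identical(out_regex, out_fixed)

# Elapsed seconds for each run
time_regex[["elapsed"]]
time_fixed[["elapsed"]]
```

For more careful measurements, repeat each call several times and compare medians rather than single runs.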
The following example shows stringi in use for a fixed-string replacement:
library(stringi)
# High-performance fixed-string replacement using stringi
product_descriptions <- c("apple and apple pie", "apple juice", "pineapple")
stri_replace_all_fixed(product_descriptions, "apple", "pear")
The replacement returns:
[1] "pear and pear pie" "pear juice" "pinepear"
Literal replacement affects substrings inside larger words. stri_replace_all_fixed() replaced "apple" inside "pineapple", producing "pinepear". This is expected behavior for any literal-match function. Use stri_replace_all_regex() with a word-boundary pattern (\\bapple\\b) if you need to match only whole words.
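Recent R versions have improved UTF-8 support across platforms; notably, since R 4.2.0, R on recent versions of Windows uses UTF-8 as its native encoding, which brings its behavior closer to macOS and Linux.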
For data containing non-ASCII characters, emoji, or text from non-Latin scripts, two options produce more predictable results: use perl = TRUE with gsub() to activate PCRE2’s Unicode support, or switch to str_replace_all(), which always uses the ICU engine through stringi.
If you encounter unexpected matching behavior with multibyte text, inspect encoding with Encoding(x) and normalize to UTF-8 with enc2utf8(x) before calling gsub().
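The inspection and normalization steps can be sketched as follows. The input is constructed by hand as a Latin-1 string to simulate text read from a legacy file; the variable names are illustrative.

```r
# Build a Latin-1 encoded string: byte 0xE9 is "e with acute accent" in Latin-1
legacy_text <- "caf\xe9"
Encoding(legacy_text) <- "latin1"
Encoding(legacy_text)
# [1] "latin1"

# Convert to UTF-8 before pattern matching
utf8_text <- enc2utf8(legacy_text)
Encoding(utf8_text)
# [1] "UTF-8"

# \u00e9 is the same character written as a Unicode escape
ascii_text <- gsub("\u00e9", "e", utf8_text, fixed = TRUE)
ascii_text
# [1] "cafe"
```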
The following subsections cover the four most common sources of unexpected behavior when using sub() and gsub(), along with concrete fixes for each.
This error appears when the pattern argument contains unescaped metacharacters or malformed syntax. Unmatched parentheses, unescaped square brackets, and unclosed quantifiers are the most common causes.
# Unmatched "(" causes an invalid regular expression error
gsub("(error", "warning", "connection (error) occurred")
R raises an error instead of returning a result:
Error in gsub("(error", "warning", "connection (error) occurred") :
invalid regular expression '(error', reason 'Missing ')''
Fix the error by escaping the parenthesis with \\(, or by setting fixed = TRUE if you want a literal string match.
# Escaped version
gsub("\\(error", "warning", "connection (error) occurred")
With the parenthesis escaped, R returns:
[1] "connection warning) occurred"
R requires double backslashes (\\) in string literals to represent a single regex backslash (\). A regex requiring a literal backslash needs four backslashes in the R source ("\\\\" becomes the two-character regex \\, which the engine interprets as one literal backslash). This same double-backslash rule applies to the replacement argument, not just the pattern. To use a backreference in a replacement string, write "\\1" in R source, which the regex engine sees as \1 and resolves to the first capture group. To insert a literal backslash character into the replacement output, write "\\\\" in R source, which the engine sees as \\ and outputs as a single \.
# Replace backslashes in a Windows-style file path with forward slashes
file_path <- "C:\\Users\\alice\\Documents"
gsub("\\\\", "/", file_path)
The normalized path is:
[1] "C:/Users/alice/Documents"
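Two ways to reduce the backslash count, shown as a sketch: raw strings (available since R 4.0.0) pass their contents through without R-level escaping, and fixed = TRUE removes the regex-level escaping.

```r
file_path <- "C:\\Users\\alice\\Documents"

# Raw string: r"(\\)" is the two-character regex \\, i.e. one literal backslash
via_raw <- gsub(r"(\\)", "/", file_path)

# Literal matching: "\\" is a single backslash character, matched as-is
via_fixed <- gsub("\\", "/", file_path, fixed = TRUE)

via_raw
# [1] "C:/Users/alice/Documents"
identical(via_raw, via_fixed)
# [1] TRUE
```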
fixed = TRUE vs regex patterns
When fixed = TRUE is set, all regex metacharacters lose their special meaning and are matched literally. Passing a regex pattern such as [0-9]+ to a call with fixed = TRUE will search for the literal string "[0-9]+" rather than digits. If a replacement produces fewer substitutions than expected, check that fixed = TRUE is not unintentionally enabled.
Unanchored patterns match anywhere in the string, including inside longer words. Use ^ and $ to anchor to the full string, or \\b for whole-word boundaries, to restrict the scope of the match. The ignore.case = TRUE argument combined with \\b is a reliable pattern for case-insensitive whole-word replacement.
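The three scopes can be compared on one made-up vector:

```r
words <- c("cat", "concatenate", "cat food")

# Unanchored: matches "cat" anywhere, including inside "concatenate"
anywhere <- gsub("cat", "dog", words)
anywhere
# [1] "dog"         "condogenate" "dog food"

# Word boundaries: matches "cat" only as a whole word
whole_word <- gsub("\\bcat\\b", "dog", words)
whole_word
# [1] "dog"         "concatenate" "dog food"

# String anchors: matches only when the entire element is "cat"
whole_string <- gsub("^cat$", "dog", words)
whole_string
# [1] "dog"         "concatenate" "cat food"
```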
What is the difference between sub() and gsub() in R?
The primary distinction lies in how many matches are replaced per string element:
- sub() replaces only the first occurrence of the specified pattern within each element of a character vector. This is useful when you want to remove or change only the initial match, such as stripping a single prefix, leaving later occurrences unchanged.
- gsub() replaces all occurrences of the pattern throughout each string element, making it ideal for global search-and-replace operations or thorough data cleaning.
Both functions share identical syntax and accept the same arguments, so switching from one to the other is as simple as changing the function name. For most data cleaning and normalization workflows, gsub() is more commonly used due to its global matching. Choose sub() when precise control over just the first match is needed.
How do I replace multiple patterns with gsub() in R?
Base R’s gsub() does not allow replacing several different patterns in one call. However, there are two commonly used strategies:
Chaining with Reduce(): Prepare a named character vector where names are patterns and values are replacements. Use Reduce() to apply a wrapper function over each pattern-replacement pair in sequence, so every match gets replaced one-by-one.
patterns <- c("foo" = "bar", "baz" = "qux")
text <- "foo and baz"
Reduce(function(x, y) gsub(y, patterns[y], x), names(patterns), init = text)
Using stringr::str_replace_all(): The stringr package’s str_replace_all() natively accepts a named vector of pattern-replacement pairs, enabling you to perform all substitutions in a single line.
library(stringr)
str_replace_all(text, patterns)
str_replace_all() is recommended when handling more than a couple of patterns, as it produces cleaner and more maintainable code, particularly within a tidyverse workflow.
How do I make gsub() case-insensitive in R?
To enable case-insensitive matching in gsub(), set the ignore.case argument to TRUE:
gsub("hello", "hi", x, ignore.case = TRUE)
This replaces all variations of "hello" regardless of capitalization ("Hello", "HELLO", etc.). If you want to avoid matching parts of other words (e.g., replacing "hello" inside "chello"), combine ignore.case = TRUE with the word boundary anchor:
gsub("\\bhello\\b", "hi", x, ignore.case = TRUE)
This ensures only whole-word matches are replaced.
What does perl = TRUE do in gsub() and sub()?
The perl = TRUE argument tells R to use the PCRE2 engine instead of the default TRE engine. Enabling this unlocks powerful regex features such as:
- Lookaheads ((?=...)) and lookbehinds ((?<=...))
- Named capture groups ((?<name>...))
- Possessive quantifiers (*+, ++, etc.)
- Case-conversion escapes (\\U, \\L, \\E) for changing case in replacements
If your pattern or replacement relies on advanced regex capabilities only available with PCRE2, set perl = TRUE to activate them.
How do I use backreferences in gsub() replacement strings?
Backreferences let you reuse matched groups from the pattern in the replacement string. To do this:
- Wrap parts of the pattern argument in parentheses; each pair of parentheses defines a numbered capture group: (...).
- In the replacement argument, reference the groups using \\1, \\2, etc., where the number corresponds to the order of the groups.
For example:
gsub("(\\w+) (\\w+)", "\\2, \\1", "John Smith")
# Result: "Smith, John"
If using perl = TRUE, you can also apply case-modifying escapes: \\U\\1 uppercases a group and \\L\\2 lowercases one, which enables more sophisticated string transformations.
How do I apply gsub() to a column in a data frame?
To use gsub() on a single column, pass the column and assign the result back:
df$column <- gsub("pattern", "replacement", df$column)
For multiple columns, leverage lapply() to apply the replacement across all desired columns in one step:
df[cols] <- lapply(df[cols], function(col) gsub("pattern", "replacement", col))
This approach is scalable and avoids repeatedly writing out the same gsub() call for each column. It is especially valuable for cleaning and standardizing large datasets.
Should I use gsub() or stringr::str_replace_all() in R?
Choose based on your context and project requirements:
- Use gsub() when you want a solution with no external dependencies, when relying on PCRE2-specific features via perl = TRUE, or if performance benchmarks indicate it’s optimal for your use case. It is native to base R and requires no extra installation.
- Use stringr::str_replace_all() for tidyverse-style code, when you need to handle multiple patterns in a single call, or when you want consistent and modern Unicode handling (ICU engine via stringi). It also typically integrates better in data pipelines and can improve readability with named vectors of replacements.
Your choice should depend on your workflow preferences, desired features, and the complexity of the replacement task at hand.
Why is gsub() not replacing my pattern as expected?
Common causes include:
Unescaped special regex characters: Regex metacharacters (such as ., *, +, [, ], (, ), |, ?, ^, $, etc.) have special meanings in patterns. If you intend to match them literally, either:
- Set fixed = TRUE to treat the pattern as a plain string (no regex), or
- Escape the metacharacter, for example \\. to match a literal dot.
Using PCRE2-only features without perl = TRUE: Some advanced regex constructs like lookahead ((?=...)), lookbehind ((?<=...)), and named capture groups are only available when perl = TRUE is set in the function call. Make sure to use perl = TRUE if your pattern relies on these features.
Encoding issues with non-ASCII strings: If your input contains non-ASCII (Unicode) characters, encoding mismatches can prevent patterns from matching. Check your string’s encoding with Encoding(x). If needed, convert it to UTF-8 using enc2utf8(x) before running gsub().
Unanchored patterns matching more text than intended: If your pattern lacks anchors, it may match inside larger words or substrings, leading to unintended replacements. Add the word boundary anchor (\\b) or string start (^) and end ($) anchors to restrict pattern matches to the desired scope.
Tip: Always test your pattern with grepl(pattern, x) first to confirm it matches exactly what you expect, before applying gsub() to perform replacements.
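A short sketch of that check, using a hypothetical pattern for "v" followed by a major.minor version number:

```r
candidates <- c("v1.2", "v12", "version 3")
version_pattern <- "^v[0-9]+\\.[0-9]+$"

# Which elements would the pattern touch?
matches <- grepl(version_pattern, candidates)
matches
# [1]  TRUE FALSE FALSE

# Inspect the matching elements before replacing anything
candidates[matches]
# [1] "v1.2"
```

Once the logical vector looks right, the same pattern can be passed to sub() or gsub() with more confidence.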
sub() and gsub() are foundational base R functions for string replacement that work on character vectors, data frame columns, and any object coercible to character, with no external dependencies. sub() targets the first match per element; gsub() replaces every match. Full regex support with the TRE engine covers most use cases, and perl = TRUE unlocks PCRE2 features including lookaheads, lookbehinds, and case modifiers for advanced formatting tasks. For replacing multiple patterns in a single call or for tidyverse pipeline compatibility, stringr::str_replace_all() is the natural complement.