Regular Expressions

This chapter will discuss regular expressions (regexp) and related features in detail. As discussed in earlier chapters:

  • /searchpattern search for the given pattern in the forward direction
  • ?searchpattern search for the given pattern in the backward direction
  • :range s/searchpattern/replacestring/flags search and replace
    • :s is short for the :substitute command
    • the delimiter after the replacestring portion is optional if you are not using flags

info Recall that you need to add the / prefix for built-in help on regular expressions, :h /^ for example.

Flags

  • g replace all occurrences within a matching line
    • by default, only the first matching portion will be replaced
  • c ask for confirmation before each replacement
  • i ignore case for searchpattern
  • I don't ignore case for searchpattern

These flags are applicable for the substitute command but not the / or ? searches. Flags can also be combined, for example:

  • s/cat/Dog/gi replace every occurrence of cat with Dog
    • Case is ignored, so Cat, cAt, CAT, etc will also be replaced
    • Note that i doesn't affect the case of the replacement string

info See :h s_flags for a complete list of flags and more details about them.
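
The combined flags can be tried out on a throwaway buffer. This is only a sketch: it assumes default settings (in particular, 'gdefault' and 'ignorecase' are off), and the sample line is made up for illustration.

```vim
" Open a scratch window holding one sample line (hypothetical text).
new
call setline(1, 'cat Cat CAT cot')

" g replaces every match on the line, i ignores case in the search pattern.
s/cat/Dog/gi

" The line is now: Dog Dog Dog cot
echo getline(1)

" Discard the scratch buffer.
bwipeout!
```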

Anchors

By default, regexp will match anywhere in the text. You can use line and word anchors to specify additional restrictions regarding the position of matches. These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as metacharacters in regular expressions parlance. In case you need to match those characters literally, you need to escape them with a \ character (discussed in the Escaping metacharacters section later in this chapter).

  • ^ restricts the match to the start-of-line
    • ^This matches This is a sample but not Do This
  • $ restricts the match to the end-of-line
    • )$ matches apple (5) but not def greeting():
  • ^$ match empty lines
  • \<pattern restricts the match to the start of a word
    • word characters include alphabets, digits and underscore
    • \<his matches his or to-his or history but not this or _hist
  • pattern\> restricts the match to the end of a word
    • his\> matches his or to-his or this but not history or _hist
  • \<pattern\> restricts the match between the start of a word and end of a word
    • \<his\> matches his or to-his but not this or history or _hist

info End-of-line can be \r (carriage return), \n (newline) or \r\n depending on your operating system and the fileformat setting.

info See :h pattern-atoms for more details.
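
The word anchors can be compared side by side with the substitute() function, which applies a pattern to a string instead of a buffer line. A minimal sketch, assuming 'ignorecase' is off; the variable name text is only for illustration.

```vim
let text = 'his to-his this history _hist'

" Start-of-word anchor only.
echo substitute(text, '\<his', 'X', 'g')
" => X to-X this Xtory _hist

" End-of-word anchor only.
echo substitute(text, 'his\>', 'X', 'g')
" => X to-X tX history _hist

" Both anchors: whole words only.
echo substitute(text, '\<his\>', 'X', 'g')
" => X to-X this history _hist

" ^ restricts the match to the start of the string.
echo substitute('This and This', '^This', 'X', 'g')
" => X and This
```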

Dot metacharacter

  • . match any single character other than end-of-line
    • c.t matches cat or cot or c2t or c^t or c.t or c;t but not cant or act or sit
  • \_. match any single character, including end-of-line

info As seen above, matching the end-of-line character requires special attention, which is why examples and descriptions in this chapter assume you are operating line wise unless otherwise mentioned. You'll later see how \_ is used in many more places to include end-of-line in the matches.

Greedy Quantifiers

Quantifiers can be applied to literal characters, the dot metacharacter, groups, backreferences and character classes. Basic examples are shown below; more will be discussed in the sections to follow.

  • * match zero or more times
    • abc* matches ab or abc or abccc or abcccccc but not bc
    • Error.*valid matches Error: invalid input but not valid Error
    • s/a.*b/X/ replaces table bottle bus with tXus
  • \+ match one or more times
    • abc\+ matches abc or abccc but not ab or bc
  • \? match zero or one times
    • \= can also be used, helpful if you are searching backwards with the ? command
    • abc\? matches ab or abc. This will match abccc or abcccccc as well, but only the abc portion
    • s/abc\?/X/ replaces abcc with Xc
  • \{m,n} match m to n times (inclusive)
    • ab\{1,4}c matches abc or abbc or xabbbcz but not ac or abbbbbc
    • if you are familiar with BRE, you can also use \{m,n\} (ending brace is escaped)
  • \{m,} match at least m times
    • ab\{3,}c matches xabbbcz or abbbbbc but not ac or abc or abbc
  • \{,n} match up to n times (including 0 times)
    • ab\{,2}c matches abc or ac or abbc but not xabbbcz or abbbbbc
  • \{n} match exactly n times
    • ab\{3}c matches xabbbcz but not abbc or abbbbbc

Greedy quantifiers will consume as much as possible, provided the overall pattern is also matched. That's how the Error.*valid example worked. If .* had consumed everything after Error, there wouldn't be any more characters to try to match valid. How the regexp engine handles matching a varying number of characters depends on the implementation details (backtracking, NFA, etc).

info See :h pattern-overview for more details.

info If you are familiar with other regular expression flavors like Perl, Python, etc, you'd be surprised by the use of \ in the above examples. If you use the \v very magic modifier (discussed later in this chapter), the \ won't be needed.
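
Here is a short sketch of greedy matching, using the substitute() and matchstr() functions on sample strings taken from the examples above. matchstr() returns the matched portion, or an empty string if there is no match.

```vim
" * is greedy: the match runs from the first a to the last b.
echo substitute('table bottle bus', 'a.*b', 'X', '')
" => tXus

" \? matches zero or one c, so only one c is consumed.
echo substitute('abcc', 'abc\?', 'X', '')
" => Xc

" \{1,4} takes as many b characters as it can (three here).
echo matchstr('xabbbcz', 'ab\{1,4}c')
" => abbbc

" Five b characters exceed the upper limit, so nothing matches.
echo matchstr('abbbbbc', 'ab\{1,4}c')
" => (empty string)
```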

Non-greedy Quantifiers

Non-greedy quantifiers match as minimally as possible, provided the overall pattern is also matched.

  • \{-} match zero or more times as minimally as possible
    • s/t.\{-}a/X/g replaces that is quite a fabricated tale with XX fabricaXle
      • the matching portions are tha, t is quite a and ted ta
    • s/t.*a/X/g replaces that is quite a fabricated tale with Xle since * is greedy
  • \{-m,n} match m to n times as minimally as possible
    • m or n can be left out as seen in the previous section
    • s/.\{-2,5}/X/ replaces 123456789 with X3456789 (here . matched 2 times)
    • s/.\{-2,5}6/X/ replaces 123456789 with X789 (here . matched 5 times)

info See :h pattern-overview and stackoverflow: non-greedy matching for more details.
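
The difference between greedy and non-greedy quantifiers is easiest to see on the same input. A minimal sketch with the substitute() function, reusing the sample strings from this section:

```vim
let text = 'that is quite a fabricated tale'

" Non-greedy: each match stops at the nearest a.
echo substitute(text, 't.\{-}a', 'X', 'g')
" => XX fabricaXle

" Greedy: a single match extends to the last a.
echo substitute(text, 't.*a', 'X', 'g')
" => Xle

" The lower bound of 2 is enough when nothing else has to match.
echo substitute('123456789', '.\{-2,5}', 'X', '')
" => X3456789

" The trailing 6 forces the quantifier up to 5 characters.
echo substitute('123456789', '.\{-2,5}6', 'X', '')
" => X789
```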

Character Classes

To create a custom placeholder for a limited set of characters, you can enclose them inside the [] metacharacters. Character classes have their own versions of metacharacters and provide special predefined sets for common use cases.

  • [aeiou] match any lowercase vowel character
  • [^aeiou] match any character other than lowercase vowels
  • [a-d] match any of a or b or c or d
    • the range metacharacter - can be applied between any two characters
  • \a match any alphabet character [a-zA-Z]
  • \A match other than alphabets [^a-zA-Z]
  • \l match lowercase alphabets [a-z]
  • \L match other than lowercase alphabets [^a-z]
  • \u match uppercase alphabets [A-Z]
  • \U match other than uppercase alphabets [^A-Z]
  • \d match any digit character [0-9]
  • \D match other than digits [^0-9]
  • \o match any octal character [0-7]
  • \O match other than octals [^0-7]
  • \x match any hexadecimal character [0-9a-fA-F]
  • \X match other than hexadecimals [^0-9a-fA-F]
  • \h match alphabets and underscore [a-zA-Z_]
  • \H match other than alphabets and underscore [^a-zA-Z_]
  • \w match any word character (alphabets, digits, underscore) [a-zA-Z0-9_]
    • this definition is same as seen earlier with word boundaries
  • \W match other than word characters [^a-zA-Z0-9_]
  • \s match space and tab characters [ \t]
  • \S match other than space and tab characters [^ \t]

Here are some examples with character classes:

  • c[ou]t matches cot or cut
  • \<[ot][on]\> matches oo or on or to or tn as whole words only
  • ^[on]\{2,}$ matches no or non or noon or on etc as whole lines only
  • s/"[^"]\+"/X/g replaces "mango" and "(guava)" with X and X
  • s/\d\+/-/g replaces Sample123string777numbers with Sample-string-numbers
  • s/\<0*[1-9]\d\{2,}\>/X/g replaces 0501 035 26 98234 with X 035 26 X (numbers >=100 with optional leading zeros)
  • s/\W\+/ /g replaces load2;err_msg--\ant with load2 err_msg ant

info To include the end-of-line character, use \_ instead of \ for any of the above escape sequences. For example, \_s will help you match across lines. Similarly, use \_[] for bracketed classes.

warning The above escape sequences do not have special meaning within bracketed classes. For example, [\d\s] will only match \ or d or s. You can use named character sets in such scenarios. For example, [[:digit:][:blank:]] matches digits or space or tab characters. See :h :alnum: for the full list and more details.

info The predefined sets are also better in terms of performance compared to bracketed versions. And there are more such sets than the ones discussed above. See :h character-classes for more details.
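
A few of the character class examples, sketched with the substitute() function so that they can be run without a buffer. The last call uses a made-up sample string to illustrate a named character set inside a bracketed class.

```vim
" Negated class: everything between a pair of double quotes.
echo substitute('"mango" and "(guava)"', '"[^"]\+"', 'X', 'g')
" => X and X

" Predefined set for digits.
echo substitute('Sample123string777numbers', '\d\+', '-', 'g')
" => Sample-string-numbers

" Numbers >= 100 with optional leading zeros.
echo substitute('0501 035 26 98234', '\<0*[1-9]\d\{2,}\>', 'X', 'g')
" => X 035 26 X

" Named character set inside a bracketed class.
echo substitute('a1 b22', '[[:digit:]]\+', 'N', 'g')
" => aN bN
```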

Alternation and Grouping

Alternation helps you to match multiple terms and they can have their own anchors as well (since each alternative is a regexp pattern). Often, there are some common things among the regular expression alternatives. In such cases, you can group them using a pair of parentheses metacharacters. Similar to a(b+c)d = abd+acd in maths, you get a(b|c)d = abd|acd in regular expressions.

  • \| match either of the specified patterns
    • min\|max matches min or max
    • one\|two\|three matches one or two or three
    • \<par\>\|er$ matches the whole word par or a line ending with er
  • \(pattern\) group a pattern to apply quantifiers, create a terser regexp by taking out common elements, etc
    • a\(123\|456\)b is equivalent to a123b\|a456b
    • hand\(y\|ful\) matches handy or handful
    • hand\(y\|ful\)\? matches hand or handy or handful
    • \(to\)\+ matches to or toto or tototo and so on
    • re\(leas\|ceiv\)\?ed matches reed or released or received

There can be tricky situations when using alternation. Say, you want to match are or spared — which one should get precedence? The bigger word spared or the substring are inside it or based on something else? The alternative which matches earliest in the input gets precedence, irrespective of the order of the alternatives.

  • s/are\|spared/X/g replaces rare spared area with rX X Xa
    • s/spared\|are/X/g will also give the same result

In case of matches starting from the same location, for example spa and spared, the leftmost alternative gets precedence. Sort by longest term first if you don't want shorter terms to take precedence.

  • s/spa\|spared/**/g replaces spared spare with **red **re
  • s/spared\|spa/**/g replaces spared spare with ** **re
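
Both precedence rules can be checked with the substitute() function. This sketch reuses the sample strings from the bullets above:

```vim
" Earliest match wins, whatever the order of the alternatives.
echo substitute('rare spared area', 'are\|spared', 'X', 'g')
" => rX X Xa
echo substitute('rare spared area', 'spared\|are', 'X', 'g')
" => rX X Xa

" Same starting position: the leftmost alternative wins.
echo substitute('spared spare', 'spa\|spared', '**', 'g')
" => **red **re
echo substitute('spared spare', 'spared\|spa', '**', 'g')
" => ** **re
```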

Backreference

The groupings seen in the previous section are also known as capture groups. The string captured by these groups can be referred to later using a backreference \N where N is the capture group you want. Backreferences can be used in both search and replacement sections.

  • \(pattern\) capture group for later use via backreferences
  • \%(pattern\) non-capturing group
  • leftmost group is 1, second leftmost group is 2 and so on (maximum 9 groups)
  • \1 backreference to the first capture group
  • \2 backreference to the second capture group
  • \9 backreference to the ninth capture group
  • & or \0 backreference to the entire matched portion (replacement section only)

Here are some examples:

  • \(\a\)\1 matches two consecutive repeated alphabets like ee, TT, pp and so on
    • recall that \a refers to [a-zA-Z]
  • \(\a\)\1\+ matches two or more consecutive repeated alphabets like ee, ttttt, PPPPPPPP and so on
  • s/\d\+/(&)/g replaces 52 apples 31 mangoes with (52) apples (31) mangoes (surround digits with parentheses)
  • s/\(\w\+\),\(\w\+\)/\2,\1/g replaces good,bad 42,24 with bad,good 24,42 (swap words separated by comma)
  • s/\(_\)\?_/\1/g replaces _fig __123__ _bat_ with fig _123_ bat (reduce __ to _ and delete if it is a single _)
  • s/\(\d\+\)\%(abc\)\+\(\d\+\)/\2:\1/ replaces 12abcabcabc24 with 24:12 (match digits separated by one or more abc sequences, swap the numbers with : as the separator)
    • note the use of non-capturing group for abc since it isn't needed later
    • s/\(\d\+\)\(abc\)\+\(\d\+\)/\3:\1/ does the same if only capturing groups are used

Referring to the text matched by a capture group with a quantifier will give only the last match, not the entire match. Use a capture group around the grouping and quantifier together to get the entire matching portion. In such cases, the inner grouping is an ideal candidate for a non-capturing group.

  • s/a \(\d\{3}\)\+/b (\1)/ replaces a 123456789 with b (789)
    • a 4839235 will be replaced with b (923)5
  • s/a \(\%(\d\{3}\)\+\)/b (\1)/ replaces a 123456789 with b (123456789)
    • a 4839235 will be replaced with b (483923)5
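
The backreference behavior described above, sketched with the substitute() and matchstr() functions. The string 'a street' is a made-up sample; the rest come from the bullets in this section.

```vim
" Backreference in the search pattern: a repeated letter.
echo matchstr('a street', '\(\a\)\1')
" => ee

" Backreferences in the replacement: swap the two words.
echo substitute('good,bad 42,24', '\(\w\+\),\(\w\+\)', '\2,\1', 'g')
" => bad,good 24,42

" & stands for the entire matched portion.
echo substitute('52 apples 31 mangoes', '\d\+', '(&)', 'g')
" => (52) apples (31) mangoes

" A quantified capture group keeps only its last repetition.
echo substitute('a 123456789', 'a \(\d\{3}\)\+', 'b (\1)', '')
" => b (789)

" Capturing around the quantifier keeps all repetitions.
echo substitute('a 123456789', 'a \(\%(\d\{3}\)\+\)', 'b (\1)', '')
" => b (123456789)
```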

Lookarounds

Lookarounds help to create custom anchors and add conditions within the searchpattern. These assertions are also known as zero-width patterns because they add restrictions similar to anchors and are not part of the matched portions.

info Vim's syntax differs from that usually found in programming languages like Perl, Python and JavaScript. The syntax starting with \@ is always added as a suffix to the pattern atom used in the assertion. For example, (?!\d) and (?<=pat.*) in other languages are specified as \d\@! and \(pat.*\)\@<= respectively in Vim.

  • \@! negative lookahead assertion
    • ice\d\@! matches ice as long as it is not immediately followed by a digit character, for example ice or iced! or icet5 or ice.123 but not ice42 or ice123
    • s/ice\d\@!/X/g replaces iceiceice2 with XXice2
    • s/par\(.*\<par\>\)\@!/X/g replaces par with X as long as whole word par is not present later in the line, for example parse and par and sparse is converted to parse and X and sXse
    • at\(\(go\)\@!.\)*par matches cat,dog,parrot but not cat,god,parrot (i.e. match at followed by par as long as go isn't present in between, this is an example of negating a grouping)
  • \@<! negative lookbehind assertion
    • _\@<!ice matches ice as long as it is not immediately preceded by a _ character, for example ice or _(ice) or 42ice but not _ice
    • \(cat.*\)\@<!dog matches dog as long as cat is not present earlier in the line, for example fox,parrot,dog,cat but not fox,cat,dog,parrot
  • \@= positive lookahead assertion
    • ice\d\@= matches ice as long as it is immediately followed by a digit character, for example ice42 or ice123 but not ice or iced! or icet5 or ice.123
    • s/ice\d\@=/X/g replaces ice ice_2 ice2 iced with ice ice_2 X2 iced
  • \@<= positive lookbehind assertion
    • _\@<=ice matches ice as long as it is immediately preceded by a _ character, for example _ice or (_ice) but not ice or _(ice) or 42ice

info You can also specify the number of bytes to search for lookbehind patterns. This will significantly speed up the matching process. You have to specify the number between the @ and < characters. For example, _\@1<=ice will look back only one byte before ice for matching purposes. \(cat.*\)\@10<!dog will look back only ten bytes before dog to check the given assertion.
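
The four assertions can be tried with the substitute() function. In this sketch, 'ice _ice 42ice' is a made-up sample string and the other inputs are taken from the bullets above. Note that the asserted text is never part of the replaced portion.

```vim
" Negative lookahead: ice not followed by a digit.
echo substitute('iceiceice2', 'ice\d\@!', 'X', 'g')
" => XXice2

" Positive lookahead: ice followed by a digit (the digit is kept).
echo substitute('ice ice_2 ice2 iced', 'ice\d\@=', 'X', 'g')
" => ice ice_2 X2 iced

" Negative lookbehind: ice not preceded by _
echo substitute('ice _ice 42ice', '_\@<!ice', 'X', 'g')
" => X _ice 42X

" Positive lookbehind, limited to one byte: ice preceded by _
echo substitute('ice _ice 42ice', '_\@1<=ice', 'X', 'g')
" => ice _X 42ice

" Negated grouping: par with no whole word par later on.
echo substitute('parse and par and sparse', 'par\(.*\<par\>\)\@!', 'X', 'g')
" => parse and X and sXse
```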

Atomic Grouping

As discussed earlier, both greedy and non-greedy quantifiers will try to satisfy the overall pattern by varying the number of characters matched by the quantifiers. You can use atomic grouping to safeguard a pattern from further backtracking. Similar to lookarounds, you need to use \@> as a suffix, for example \(pattern\)\@>.

  • s/\(0*\)\@>\d\{3,\}/(&)/g replaces only numbers >= 100 irrespective of any number of leading zeros, for example 0501 035 154 is converted to (0501) 035 (154)
    • \(0*\)\@> matches the 0 character zero or more times, but it will not give up this portion to satisfy overall pattern
    • s/0*\d\{3,\}/(&)/g replaces 0501 035 154 with (0501) (035) (154) (here 035 is matched because 0* will match zero times to satisfy the overall pattern)
  • s/\(::.\{-}::\)\@>par// replaces fig::1::spar::2::par::3 with fig::1::spar::3
    • \(::.\{-}::\)\@> will match only from :: to the very next ::
    • s/::.\{-}::par// replaces fig::1::spar::2::par::3 with fig::3 (matches from the first :: to the first occurrence of ::par)
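
A side-by-side sketch of the atomic and regular versions, using the substitute() function on the sample strings from this section:

```vim
" Atomic: once 0* has taken the leading zeros, it won't give them back.
echo substitute('0501 035 154', '\(0*\)\@>\d\{3,}', '(&)', 'g')
" => (0501) 035 (154)

" Regular: 0* backtracks to zero matches, so 035 is matched too.
echo substitute('0501 035 154', '0*\d\{3,}', '(&)', 'g')
" => (0501) (035) (154)

let text = 'fig::1::spar::2::par::3'

" Atomic: the group spans only from one :: to the very next ::
echo substitute(text, '\(::.\{-}::\)\@>par', '', '')
" => fig::1::spar::3

" Regular: the non-greedy quantifier expands until ::par is found.
echo substitute(text, '::.\{-}::par', '', '')
" => fig::3
```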

Set start and end of the match

Some of the positive lookbehind and lookahead usage can be replaced with \zs and \ze respectively.

  • \zs set the start of the match (portion before \zs won't be part of the match)
    • s/\<\w\zs\w*\W*//g replaces sea eat car rat eel tea with secret
    • same as s/\(\<\w\)\@<=\w*\W*//g or s/\(\<\w\)\w*\W*/\1/g
  • \ze set the end of the match (portion after \ze won't be part of the match)
    • s/ice\ze\d/X/g replaces ice ice_2 ice2 iced with ice ice_2 X2 iced
    • same as s/ice\d\@=/X/g or s/ice\(\d\)/X\1/g

info As per :h \zs and :h \ze, these "Can be used multiple times, the last one encountered in a matching branch is used."
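
A brief sketch of \zs and \ze with the substitute() and matchstr() functions. The date string in the last call is a made-up sample.

```vim
" Keep the first letter of each word, delete the rest.
echo substitute('sea eat car rat eel tea', '\<\w\zs\w*\W*', '', 'g')
" => secret

" The digit must be present, but it isn't part of the match.
echo substitute('ice ice_2 ice2 iced', 'ice\ze\d', 'X', 'g')
" => ice ice_2 X2 iced

" matchstr() returns only the portion after \zs
echo matchstr('date:2024-06-25', 'date:\zs.*')
" => 2024-06-25
```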

Magic modifiers

These escape sequences change certain aspects of the syntax and behavior of the search pattern that comes after such a modifier. You can use multiple such modifiers as needed for particular sections of the pattern.

Magic and nomagic

  • \m magic mode (this is the default setting)
  • \M nomagic mode
    • ., * and ~ are no longer metacharacters (compared to magic mode)
    • \., \* and \~ will make them behave as metacharacters
    • ^ and $ would still behave as metacharacters
    • \Ma.b matches only a.b
    • \Ma\.b matches a.b as well as a=b or a<b or acb etc

Very magic

The default syntax of Vim regexp has only a few metacharacters like ., *, ^ and $. If you are familiar with regexp usage in programming languages such as Perl, Python and JavaScript, you can use \v to get a similar syntax in Vim. This will allow the use of more metacharacters such as (), {}, +, ? and so on without having to prefix them with a \ metacharacter. From :h magic documentation:

Use of \v means that after it, all ASCII characters except 0-9, a-z, A-Z and _ have special meaning

  • \v<his> matches his or to-his but not this or history or _hist
  • a<b.*\v<end> matches c=a<b #end but not c=a<b #bending
    • note that \v is used after a<b to avoid having to escape the first <
  • \vone|two|three matches one or two or three
  • \vabc+ matches abc or abccc but not ab or bc
  • s/\vabc?/X/ replaces abcc with Xc
  • s/\vt.{-}a/X/g replaces that is quite a fabricated tale with XX fabricaXle
  • \vab{3}c matches xabbbcz but not abbc or abbbbbc
  • s/\v(\w+),(\w+)/\2,\1/g replaces good,bad 42,24 with bad,good 24,42
    • compare this to the default mode: s/\(\w\+\),\(\w\+\)/\2,\1/g
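
The very magic versions can be verified with the substitute() function. This sketch repeats a few of the bullets above on their sample strings:

```vim
" Groups and + without backslashes.
echo substitute('good,bad 42,24', '\v(\w+),(\w+)', '\2,\1', 'g')
" => bad,good 24,42

" ? as the zero-or-one quantifier.
echo substitute('abcc', '\vabc?', 'X', '')
" => Xc

" {-} as the non-greedy quantifier.
echo substitute('that is quite a fabricated tale', '\vt.{-}a', 'X', 'g')
" => XX fabricaXle

" < and > as word anchors.
echo substitute('his to-his this history _hist', '\v<his>', 'X', 'g')
" => X to-X this history _hist
```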

Very nomagic

From :h magic documentation:

Use of \V means that after it, only a backslash and terminating character (usually / or ?) have special meaning

  • \V^.*{}$ matches ^.*{}$ literally
  • \V^.*{}$\.\*abcd matches ^.*{}$ literally only if abcd is found later in the line
    • \V^.*{}$\m.*abcd can also be used
  • \V\^This matches This is a sample but not Do This
  • \V)\$ matches apple (5) but not def greeting():
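
Here is a sketch comparing \V with \M, using the substitute() function. The input strings are made up for illustration.

```vim
" \V: every character after it is literal, including ^ . * { } and $
echo substitute('x ^.*{}$ y', '\V^.*{}$', 'X', '')
" => x X y
echo substitute('a*b.c', '\Va*b.c', 'X', '')
" => X

" \M: the dot is literal unless it is escaped.
echo substitute('a.b a=b acb', '\Ma.b', 'X', 'g')
" => X a=b acb
echo substitute('a.b a=b acb', '\Ma\.b', 'X', 'g')
" => X X X
```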

Case sensitivity

These will override flags and settings, if any. Unlike the magic modifiers, you cannot apply \c or \C for a specific portion of the pattern.

  • \c case insensitive search
    • \cthis matches this or This or THIs and so on
      • th\cis or this\c and so on will also result in the same behavior
  • \C case sensitive search
    • \Cthis match exactly this but not This or THIs and so on
      • th\Cis or this\C and so on will also result in the same behavior
  • s/\Ccat/dog/gi replaces cat Cat CAT with dog Cat CAT since the i flag gets overridden
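
A minimal sketch of \c and \C with the substitute() function. That function follows the 'ignorecase' setting, and these modifiers override it either way.

```vim
" \c anywhere in the pattern makes the whole search case insensitive.
echo substitute('cat Cat CAT', '\ccat', 'dog', 'g')
" => dog dog dog
echo substitute('cat Cat CAT', 'ca\ct', 'dog', 'g')
" => dog dog dog

" \C makes the whole search case sensitive.
echo substitute('cat Cat CAT', '\Ccat', 'dog', 'g')
" => dog Cat CAT
```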

Changing Case

These can be used in the replacement section:

  • \u Uppercases the next character
  • \U UPPERCASES the following characters
  • \l lowercases the next character
  • \L lowercases the following characters
  • \e or \E will end further case changes
  • \L or \U will also override any existing conversion

Examples:

  • s/\<\l/\u&/g replaces hello. how are you? with Hello. How Are You?
    • recall that \l in the search section is equivalent to [a-z]
  • s/\<\L/\l&/g replaces HELLO. HOW ARE YOU? with hELLO. hOW aRE yOU?
    • recall that \L in the search section is equivalent to [^a-z] (the uppercase letters match here since they aren't lowercase alphabets)
  • s/\v(\l)_(\l)/\1\u\2/g replaces aug_price next_line with augPrice nextLine
  • s/.*/\L&/ replaces HaVE a nICe dAy with have a nice day
  • s/\a\+/\u\L&/g replaces HeLLo:bYe gOoD:beTTEr with Hello:Bye Good:Better
    • s/\a\+/\L\u&/g can also be used in this case
  • s/\v(\a+)(:\a+)/\L\1\U\2/g replaces Hi:bYe gOoD:baD with hi:BYE good:BAD
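
The case conversion sequences also work in the replacement argument of the substitute() function. This sketch assumes 'ignorecase' is off and repeats some of the examples above:

```vim
" \u affects only the next character.
echo substitute('hello. how are you?', '\<\l', '\u&', 'g')
" => Hello. How Are You?

" snake_case to camelCase.
echo substitute('aug_price next_line', '\v(\l)_(\l)', '\1\u\2', 'g')
" => augPrice nextLine

" \u and \L combined: first letter uppercase, the rest lowercase.
echo substitute('HeLLo:bYe gOoD:beTTEr', '\a\+', '\u\L&', 'g')
" => Hello:Bye Good:Better

" \U overrides the earlier \L for the second group.
echo substitute('Hi:bYe gOoD:baD', '\v(\a+)(:\a+)', '\L\1\U\2', 'g')
" => hi:BYE good:BAD
```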

Alternate delimiters

From :h substitute documentation:

Instead of the / which surrounds the pattern and replacement string, you can use any other single-byte character, but not an alphanumeric character, \, " or |. This is useful if you want to include a / in the search pattern or replacement string.

  • s#/home/learnbyexample/#\~/# replaces /home/learnbyexample/reports with ~/reports
    • compare this with s/\/home\/learnbyexample\//\~\//
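
The alternate delimiter applies to the s command itself, so this sketch runs it on a throwaway buffer rather than through a function. The scratch window is only for illustration.

```vim
" Open a scratch window holding the sample path.
new
call setline(1, '/home/learnbyexample/reports')

" With # as the delimiter, the / characters need no escaping.
s#/home/learnbyexample/#\~/#

" The line is now: ~/reports
echo getline(1)

" Discard the scratch buffer.
bwipeout!
```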

Escape sequences

Certain characters like tab, carriage return, newline, etc have escape sequences to represent them. Additionally, any character can be represented using its codepoint value in decimal, octal and hexadecimal formats. Unlike character set escape sequences like \w, these can be used inside character classes as well. If the escape sequences behave differently in searchpattern and replacestring portions, they'll be highlighted in the descriptions below.

  • \t tab character
  • \b backspace character
  • \r matches carriage return for searchpattern, produces newline for replacestring
  • \n matches end-of-line for searchpattern, produces ASCII NUL for replacestring
    • \n can also match \r or \r\n (where \r is carriage return) depending upon the fileformat setting
  • \%d matches character specified by decimal digits
    • \%d39 matches the single quote character
  • \%o matches character specified by octal digits
    • \%o47 matches the single quote character
  • \%x matches character specified by hexadecimal digits (max 2 digits)
    • \%x27 matches the single quote character
  • \%u matches character specified by hexadecimal digits (max 4 digits)
  • \%U matches character specified by hexadecimal digits (max 8 digits)

info Using \% sequences to insert characters in replacestring hasn't been implemented yet. See vi.stackexchange: Replace with hex character for workarounds.

info See ASCII code table for a handy cheatsheet with all the ASCII characters and conversion tables. See codepoints for Unicode characters.
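
A short sketch of the codepoint escapes in a search pattern, using the substitute() function. The sample strings are made up; note that the first argument of the last call is double quoted so that \t becomes an actual tab character.

```vim
" Decimal 39, octal 47 and hexadecimal 27 all refer to the single quote.
echo substitute("it's", '\%d39', '-', '')
" => it-s
echo substitute("it's", '\%o47', '-', '')
" => it-s
echo substitute("it's", '\%x27', '-', '')
" => it-s

" \t keeps its meaning inside a bracketed class.
echo substitute("a\tb c", '[\t ]', '_', 'g')
" => a_b_c
```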

Escaping metacharacters

To match the metacharacters literally (including character class metacharacters like -), i.e. to remove their special meaning, prefix those characters with a \ (backslash) character. To indicate a literal \ character, use \\. Depending on the pattern, you can also use a different magic modifier to reduce the need for escaping. Assume default magicness for the below examples unless otherwise specified.

  • ^ and $ do not require escaping if they are used out of position
    • b^2 matches a^2 + b^2 - C*3
    • $4 matches this ebook is priced $40
    • \^super matches ^superscript (you need the \ here since ^ is at the customary position)
  • [ and ] do not require escaping if only one of them is used
    • b[1 matches ab[123
    • 42] matches xyz42] =
    • b\[123] or b[123\] matches ab[123] = d
  • [ in the substitute command requires careful consideration
    • s/b[1/X/ replaces b[1/X/ with nothing (here the whole of b[1/X/ ends up as the searchpattern and the replacestring is empty)
    • s/b\[1/X/ replaces ab[123 with aX23
  • \Va*b.c or a\*b\.c matches a*b.c
  • & in the replacement section requires escaping to represent it literally
    • s/and/\&/ replaces apple and mango with apple & mango

The following can be used to match character class metacharacters literally in addition to escaping them with a \ character:

  • - can be specified at the start or end of the list, for example [-0-5] and [a-z-]
  • ^ should be other than the first character, for example [+a^.]
  • ] should be the first character, for example []a-z] and [^]a]

Replacement expressions

  • \= when replacestring starts with \=, it is treated as an expression
  • s/date:\zs/\=strftime("%Y-%m-%d")/ appends the current date
    • for example, changes date: to date:2024-06-25
  • s/\d\+/\=submatch(0)*2/g multiplies matching numbers by 2
    • for example, changes 4 and 10 to 8 and 20
    • submatch() function is similar to backreferences, 0 gives the entire matched string, 1 refers to the first capture group and so on
  • s/\(.*\)\zs/\=" = " . eval(submatch(1))/ appends result of an expression
    • for example, changes 10 * 2 - 3 to 10 * 2 - 3 = 17
    • . is the string concatenation operator
    • eval() here executes the contents of the first capture group as an expression
  • s/"[^"]\+"/\=substitute(submatch(0), '[aeiou]', '\u&', 'g')/g affects vowels only inside double quotes
    • for example, changes "mango" and "guava" to "mAngO" and "gUAvA"
    • substitute() function works similarly to the s command
    • first argument is the text to work on
    • second argument is similar to searchpattern
    • third argument is similar to replacestring
    • fourth argument is flags, use an empty string if not required
    • see :h substitute() for more details and differences compared to the s command
  • perldo s/\d+/$&*2/ge changes 4 and 10 to 8 and 20
    • useful if you are familiar with Perl regular expressions and the perl interface is available with your Vim installation
    • note that the default range is 1,$ (the s command works only on the current line by default)
    • see :h perldo for restrictions and more details

info See :h usr_41.txt for details about Vim script.

info See :h sub-replace-expression for more details.

info See also stackoverflow: find all occurrences and replace with user input.
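
Replacement expressions can also be passed to the substitute() function. A minimal sketch: the first two calls mirror the examples above, and the toupper() call on a made-up string is added for illustration.

```vim
" Multiply every number by 2.
echo substitute('4 and 10', '\d\+', '\=submatch(0) * 2', 'g')
" => 8 and 20

" Append the result of evaluating the whole string as an expression.
echo substitute('10 * 2 - 3', '.\+', '\=submatch(0) . " = " . eval(submatch(0))', '')
" => 10 * 2 - 3 = 17

" Any function can be applied to the matched text.
echo substitute('fig apple', '\w\+', '\=toupper(submatch(0))', 'g')
" => FIG APPLE
```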

Miscellaneous

  • \%V match inside the visual area only
    • s/\%V10/20/g replaces 10 with 20 only inside the visual area
    • without \%V, the replacement would happen anywhere on the lines covered by the visual selection
  • \%[set] match zero or more of these characters in the same order, as much as possible
    • spa\%[red] matches spa or spar or spare or spared (longest match wins)
      • same as \vspa(red|re|r)? or \vspa(red?|r)? and so on
    • ap\%[[pt]ly] matches ap or app or appl or apply or apt or aptl or aptly
  • \_^ and \_$ restrict the match to start-of-line and end-of-line respectively, useful for multiline patterns
  • \%^ and \%$ restrict the match to start-of-file and end-of-file respectively
  • ~ represents the last replacement string
    • s/apple/banana/ followed by /~ will search for banana
    • s/apple/banana/ followed by s/fig/(~)/ will use (banana) as the replacement string
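
The optional sequence \%[] can be tried with the substitute() and matchstr() functions. In this sketch, the first input string is a made-up sample built from the words listed above.

```vim
" Each word matches spa plus as much of r, e, d (in order) as is present.
echo substitute('spa spar spare spared', 'spa\%[red]', 'X', 'g')
" => X X X X

" The longest possible match is used.
echo matchstr('spared', 'spa\%[red]')
" => spared
```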

Further Reading