BRE/ERE Regular Expressions
This chapter covers Basic and Extended Regular Expressions as implemented in GNU sed. Unless otherwise indicated, examples and descriptions will assume ASCII input.
By default, sed treats the search pattern as a Basic Regular Expression (BRE). The -E option enables Extended Regular Expression (ERE). Older sed versions used -r for ERE, which can still be used, but -E is more portable. In GNU sed, BRE and ERE differ only in how metacharacters are represented; there are no feature differences.
See also POSIX specification for BRE and ERE.
The example_files directory has all the files used in the examples.
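As a quick illustration of the point that BRE and ERE differ only in notation, here is a minimal sketch that writes the same grouping and quantifier in both syntaxes. It assumes GNU sed (the `\+` form in BRE is a GNU extension); the sample input and the variable names `s`, `bre` and `ere` are made up for this illustration.

```shell
# hypothetical sample input, not one of the example files
s='ababc abd'

# BRE: grouping and the + quantifier need a backslash prefix
bre=$(echo "$s" | sed 's/\(ab\)\+c/X/')

# ERE: the same metacharacters are written without the backslash
ere=$(echo "$s" | sed -E 's/(ab)+c/X/')

echo "$bre"   # X abd
echo "$ere"   # X abd
```

Both commands perform the same substitution; only the escaping differs.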
Line Anchors
Instead of matching anywhere in the line, restrictions can be specified. These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as metacharacters in regular expressions parlance. In case you need to match those characters literally, you need to escape them with a \ character (discussed in the Matching the metacharacters section).
There are two line anchors:
^ metacharacter restricts the match to the start of the line
$ metacharacter restricts the match to the end of the line
$ cat anchors.txt
sub par
spar
apparent effort
two spare computers
cart part tart mart
# lines starting with 's'
$ sed -n '/^s/p' anchors.txt
sub par
spar
# lines ending with 'rt'
$ sed -n '/rt$/p' anchors.txt
apparent effort
cart part tart mart
# change only whole line 'par'
$ printf 'spared no one\npar\nspar\n' | sed 's/^par$/PAR/'
spared no one
PAR
spar
The anchors can be used by themselves as a pattern too. This helps to insert text at the start or end of an input line, emulating string concatenation operations. This might not feel like a useful capability, but combined with other features it becomes quite a handy tool.
# add '* ' at the start of every input line
$ printf 'spared no one\npar\nspar\n' | sed 's/^/* /'
* spared no one
* par
* spar
# append '.' only if a line doesn't contain space characters
$ printf 'spared no one\npar\nspar\n' | sed '/ /! s/$/./'
spared no one
par.
spar.
Word Anchors
The second type of restriction is word anchors. A word character is any letter (irrespective of case), digit or the underscore character. You might wonder why digits and underscores are included, why not only letters? This comes from variable and function naming conventions, which typically allow letters, digits and underscores. So, the definition is more programming oriented than natural language.
The escape sequence \b denotes a word boundary. This works for both the start of word and the end of word anchoring. Start of word means either the character prior to the word is a non-word character or there is no character (start of line). Similarly, end of word means the character after the word is a non-word character or no character (end of line). This implies that you cannot have word boundaries without a word character. Here are some examples:
$ cat anchors.txt
sub par
spar
apparent effort
two spare computers
cart part tart mart
# words starting with 'par'
$ sed -n '/\bpar/p' anchors.txt
sub par
cart part tart mart
# words ending with 'par'
$ sed -n '/par\b/p' anchors.txt
sub par
spar
# replace only whole word 'par'
$ sed -n 's/\bpar\b/***/p' anchors.txt
sub ***
Alternatively, you can use \< to indicate the start of word anchor and \> to indicate the end of word anchor. Using \b is preferred as it is more commonly used in other regular expression implementations and has \B as its opposite.
\bREGEXP\b behaves a bit differently than \<REGEXP\>. See the Word boundary differences section for details.
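For a feel of the dedicated start and end of word anchors, here is a small sketch assuming GNU sed. The input is a made-up subset of lines (not read from anchors.txt), and the variable names are hypothetical.

```shell
# made-up input for illustration
lines=$(printf 'sub par\nspar\napparent effort\n')

# \< anchors to the start of a word
start=$(echo "$lines" | sed -n '/\<par/p')
echo "$start"   # sub par

# \> anchors to the end of a word (matches 'sub par' and 'spar')
end=$(echo "$lines" | sed -n '/par\>/p')
echo "$end"

# both together match only the whole word 'par'
whole=$(echo "$lines" | sed -n 's/\<par\>/***/p')
echo "$whole"   # sub ***
```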
Opposite Word Anchor
The word boundary has an opposite anchor too. \B matches wherever \b doesn't match. This duality will be seen later with some other escape sequences too.
# match 'par' if it is surrounded by word characters
$ sed -n '/\Bpar\B/p' anchors.txt
apparent effort
two spare computers
# match 'par' but not at the start of a word
$ sed -n '/\Bpar/p' anchors.txt
spar
apparent effort
two spare computers
# match 'par' but not at the end of a word
$ sed -n '/par\B/p' anchors.txt
apparent effort
two spare computers
cart part tart mart
# \b matches at the start and at the end of the word
$ echo 'copper' | sed 's/\b/:/g'
:copper:
# \B matches between adjacent word characters
$ echo 'copper' | sed 's/\B/:/g'
c:o:p:p:e:r
Negative logic is handy in many text processing situations. But use it with care, as you might end up matching things you didn't intend.
Alternation
Often, you'd want to search for multiple terms. In a conditional expression, you can use the logical operators to combine multiple conditions. With regular expressions, the | metacharacter is similar to logical OR. The regular expression will match if any of the patterns separated by | is satisfied.
The | metacharacter syntax varies between BRE and ERE. Quoting from the manual:
In GNU sed, the only difference between basic and extended regular expressions is in the behavior of a few special characters: ?, +, parentheses, braces ({}), and |.
Here are some examples:
# BRE vs ERE
$ sed -n '/two\|sub/p' anchors.txt
sub par
two spare computers
$ sed -nE '/two|sub/p' anchors.txt
sub par
two spare computers
# match 'cat' or 'dog' or 'fox'
# note the use of 'g' flag for multiple replacements
$ echo 'cats dog bee parrot foxed' | sed -E 's/cat|dog|fox/--/g'
--s -- bee parrot --ed
Here's an example of alternate patterns with their own anchors:
# lines with whole word 'par' or lines ending with 's'
$ sed -nE '/\bpar\b|s$/p' anchors.txt
sub par
two spare computers
Alternation precedence
There are some tricky corner cases when using alternation. If it is used for filtering a line, there is no ambiguity. However, for use cases like substitution, it depends on a few factors. Say you want to replace are or spared: which one should get precedence? The bigger word spared, the substring are inside it, or something else?
The alternative which matches earliest in the input gets precedence.
# here, the output will be same irrespective of alternation order
# note that 'g' flag isn't used here, so only the first match gets replaced
$ echo 'cats dog bee parrot foxed' | sed -E 's/bee|parrot|at/--/'
c--s dog bee parrot foxed
$ echo 'cats dog bee parrot foxed' | sed -E 's/parrot|at|bee/--/'
c--s dog bee parrot foxed
In case of matches starting from the same location, for example spar and spared, the longest matching portion gets precedence. Unlike other regular expression implementations, left-to-right priority for alternation comes into play only if the lengths of the matches are the same. See Longest match wins and Backreferences sections for more examples. See regular-expressions: alternation for more information on this topic.
$ echo 'spared party parent' | sed -E 's/spa|spared/**/g'
** party parent
$ echo 'spared party parent' | sed -E 's/spared|spa/**/g'
** party parent
# other regexp flavors like Perl have left-to-right priority
$ echo 'spared party parent' | perl -pe 's/spa|spared/**/'
**red party parent
Grouping
Often, there are some common things among the regular expression alternatives. It could be common characters or qualifiers like the anchors. In such cases, you can group them using a pair of parentheses metacharacters. Similar to a(b+c)d = abd+acd in maths, you get a(b|c)d = abd|acd in regular expressions.
# without grouping
$ printf 'red\nreform\nread\ncrest\n' | sed -nE '/reform|rest/p'
reform
crest
# with grouping
$ printf 'red\nreform\nread\ncrest\n' | sed -nE '/re(form|st)/p'
reform
crest
# without grouping
$ sed -nE '/\bpar\b|\bpart\b/p' anchors.txt
sub par
cart part tart mart
# taking out common anchors
$ sed -nE '/\b(par|part)\b/p' anchors.txt
sub par
cart part tart mart
# taking out common characters as well
# you'll later learn a better technique instead of using empty alternate
$ sed -nE '/\bpar(|t)\b/p' anchors.txt
sub par
cart part tart mart
Matching the metacharacters
You have already seen a few metacharacters and escape sequences that help compose a regular expression. To match the metacharacters literally, i.e. to remove their special meaning, prefix those characters with a \ character. To indicate a literal \ character, use \\. Some of the metacharacters, like the line anchors, lose their special meaning when not used in their customary positions with BRE syntax. If there are many metacharacters to be escaped, try to work out if the command can be simplified by switching between ERE and BRE.
# line anchors aren't special away from customary positions with BRE
$ printf 'a^2 + b^2 - C*3\nd = c^2' | sed -n '/b^2/p'
a^2 + b^2 - C*3
# but you'll have to escape them with ERE: sed -nE '/\$b/p'
$ printf '$a = $b + $c\n$x = 4' | sed -n '/$b/p'
$a = $b + $c
# here $ requires escaping even with BRE
$ echo '$a = $b + $c' | sed 's/\$//g'
a = b + c
# BRE vs ERE
$ printf '(a/b) + c\n3 + (a/b) - c\n' | sed -n '/^(a\/b)/p'
(a/b) + c
$ printf '(a/b) + c\n3 + (a/b) - c\n' | sed -nE '/^\(a\/b\)/p'
(a/b) + c
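To make the advice about switching between ERE and BRE more concrete, here is a small sketch assuming GNU sed. The input string and the variable names are made up for this illustration. The text to match contains parentheses, + and *, most of which are literal in BRE but special in ERE.

```shell
# made-up input for illustration
s='(a+b)*c = 42'

# BRE: ( ) and + are literal here, only * needs escaping
bre=$(echo "$s" | sed 's/(a+b)\*c/X/')

# ERE: ( ) + and * all have to be escaped to match literally
ere=$(echo "$s" | sed -E 's/\(a\+b\)\*c/X/')

echo "$bre"   # X = 42
echo "$ere"   # X = 42
```

In this case the BRE version needs one backslash while the ERE version needs four.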
Handling the replacement section metacharacters will be discussed in the Backreferences section.
Using different delimiters
The / character is idiomatically used as the REGEXP delimiter. But any character other than \ and the newline character can be used instead. This helps to avoid or reduce the need for escaping delimiter characters. The syntax is simple for substitution and transliteration commands: just use a different character instead of /.
# instead of this
$ echo '/home/learnbyexample/reports' | sed 's/\/home\/learnbyexample\//~\//'
~/reports
# use a different delimiter
$ echo '/home/learnbyexample/reports' | sed 's#/home/learnbyexample/#~/#'
~/reports
$ echo 'a/b/c/d' | sed 'y/a\/d/1-4/'
1-b-c-4
$ echo 'a/b/c/d' | sed 'y,a/d,1-4,'
1-b-c-4
For address matching, syntax is a bit different — the first delimiter has to be escaped. For address ranges, start and end REGEXP can have different delimiters, as they are independent.
$ printf '/home/joe/1\n/home/john/1\n'
/home/joe/1
/home/john/1
# here ; is used as the delimiter
$ printf '/home/joe/1\n/home/john/1\n' | sed -n '\;/home/joe/;p'
/home/joe/1
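The note about address ranges can be sketched as follows, assuming GNU sed. The paths and variable names are made up for this illustration. The start address uses ; as the delimiter and the end address uses #, each introduced by a backslash.

```shell
# made-up paths for illustration
paths=$(printf '/home/joe/1\n/tmp/cache\n/var/log/x\n/opt/tool\n')

# range starts at a line beginning with /tmp (delimiter is ;)
# and ends at a line beginning with /var (delimiter is #)
range=$(echo "$paths" | sed -n '\;^/tmp;,\#^/var#p')

echo "$range"
# /tmp/cache
# /var/log/x
```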
See also a bit of history on why / is commonly used as the delimiter.
The dot metacharacter
The dot metacharacter serves as a placeholder to match any character (including the newline character). Later you'll learn how to define your own custom placeholder for a limited set of characters.
# 3 character sequence starting with 'c' and ending with 't'
$ echo 'tac tin cot abc:tyz excited' | sed 's/c.t/-/g'
ta-in - ab-yz ex-ed
# any character followed by 3 and again any character
$ printf '42\t3500\n' | sed 's/.3.//'
4200
# N command is handy here to show that . matches \n as well
$ printf 'abc\nxyz\n' | sed 'N; s/c.x/ /'
ab yz
Quantifiers
Alternation helps you match one among multiple patterns. Combining the dot metacharacter with quantifiers (and alternation if needed) paves a way to perform logical AND between patterns. For example, you can check if a string matches two patterns with any number of characters in between. Quantifiers can be applied to characters, groupings and some more constructs that'll be discussed later. Apart from the ability to specify exact quantity and bounded range, these can also match unbounded varying quantities.
First up, the ? metacharacter which quantifies a character or group to match 0 or 1 times. This helps to define optional patterns and build terser patterns.
# same as: sed -E 's/\b(fe.d|fed)\b/X/g'
# BRE version: sed 's/fe.\?d\b/X/g'
$ echo 'fed fold fe:d feeder' | sed -E 's/\bfe.?d\b/X/g'
X fold X feeder
# same as: sed -nE '/\bpar(|t)\b/p'
$ sed -nE '/\bpart?\b/p' anchors.txt
sub par
cart part tart mart
# same as: sed -E 's/part|parrot/X/g'
$ echo 'par part parrot parent' | sed -E 's/par(ro)?t/X/g'
par X X parent
# same as: sed -E 's/part|parrot|parent/X/g'
$ echo 'par part parrot parent' | sed -E 's/par(en|ro)?t/X/g'
par X X X
# matches '<' or '\<' and they are both replaced with '\<'
$ echo 'apple \< fig ice < apple cream <' | sed -E 's/\\?</\\</g'
apple \< fig ice \< apple cream \<
The * metacharacter quantifies a character or group to match 0 or more times.
# 'f' followed by zero or more of 'e' followed by 'd'
$ echo 'fd fed fod fe:d feeeeder' | sed 's/fe*d/X/g'
X X fod fe:d Xer
# zero or more of '1' followed by '2'
$ echo '3111111111125111142' | sed 's/1*2/-/g'
3-511114-
The + metacharacter quantifies a character or group to match 1 or more times.
# 'f' followed by one or more of 'e' followed by 'd'
# BRE version: sed 's/fe\+d/X/g'
$ echo 'fd fed fod fe:d feeeeder' | sed -E 's/fe+d/X/g'
fd X fod fe:d Xer
# one or more of '1' followed by optional '4' and then '2'
$ echo '3111111111125111142' | sed -E 's/1+4?2/-/g'
3-5-
You can specify a range of integer numbers, both bounded and unbounded, using {} metacharacters. There are four ways to use this quantifier as listed below:
Quantifier | Description |
---|---|
{m,n} | match m to n times |
{m,} | match at least m times |
{,n} | match up to n times (including 0 times) |
{n} | match exactly n times |
# note that stray characters like space are not allowed anywhere within {}
# BRE version: sed 's/ab\{1,4\}c/X/g'
$ echo 'ac abc abbc abbbc abbbbbbbbc' | sed -E 's/ab{1,4}c/X/g'
ac X X X abbbbbbbbc
$ echo 'ac abc abbc abbbc abbbbbbbbc' | sed -E 's/ab{3,}c/X/g'
ac abc abbc X X
$ echo 'ac abc abbc abbbc abbbbbbbbc' | sed -E 's/ab{,2}c/X/g'
X X X abbbc abbbbbbbbc
$ echo 'ac abc abbc abbbc abbbbbbbbc' | sed -E 's/ab{3}c/X/g'
ac abc abbc X abbbbbbbbc
With ERE, you have to escape { to represent it literally. Unlike ), you don't have to escape the } character.
$ echo 'a{5} = 10' | sed -E 's/a\{5}/x/'
x = 10
$ echo 'report_{a,b}.txt' | sed -E 's/_{a,b}/_c/'
sed: -e expression #1, char 12: Invalid content of \{\}
$ echo 'report_{a,b}.txt' | sed -E 's/_\{a,b}/_c/'
report_c.txt
Conditional AND
Next up, constructing an AND conditional using the dot metacharacter and quantifiers.
# match 'Error' followed by zero or more characters followed by 'valid'
$ echo 'Error: not a valid input' | sed -n '/Error.*valid/p'
Error: not a valid input
To allow matching in any order, you'll have to bring in alternation as well.
# 'cat' followed by 'dog' or 'dog' followed by 'cat'
$ echo 'two cats and a dog' | sed -E 's/cat.*dog|dog.*cat/pets/'
two pets
$ echo 'two dogs and a cat' | sed -E 's/cat.*dog|dog.*cat/pets/'
two pets
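The same idea works for filtering lines. Here is a minimal sketch assuming GNU sed, with made-up input and variable names. The second command is an alternative formulation using nested addresses, which isn't covered in this section and is shown only for comparison.

```shell
# made-up input for illustration
lines=$(printf 'two cats and a dog\ntwo dogs and a cat\nonly cats here\n')

# print lines containing both 'cat' and 'dog', in either order
both=$(echo "$lines" | sed -nE '/cat.*dog|dog.*cat/p')

# alternative: check for 'dog' only on lines that already matched 'cat'
nested=$(echo "$lines" | sed -n '/cat/{/dog/p;}')

echo "$both"
# two cats and a dog
# two dogs and a cat
echo "$nested"   # same two lines as above
```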
Longest match wins
You've already seen an example where the longest matching portion was chosen if the alternatives started from the same location. For example spar|spared will result in spared being chosen over spar. The same applies whenever there are two or more matching possibilities from the same starting location. For example, f.?o will match foo instead of fo if the input string to match is foot.
# longest match among 'foo' and 'fo' wins here
$ echo 'foot' | sed -E 's/f.?o/X/'
Xt
# everything will match here
$ echo 'car bat cod map scat dot abacus' | sed 's/.*/X/'
X
# longest match happens when (1|2|3)+ matches up to '1233' only
# so that '12apple' can match as well
$ echo 'fig123312apple' | sed -E 's/g(1|2|3)+(12apple)?/X/'
fiX
# in other implementations like Perl, that is not the case
# precedence is left-to-right for greedy quantifiers
$ echo 'fig123312apple' | perl -pe 's/g(1|2|3)+(12apple)?/X/'
fiXapple
While determining the longest match, the overall regular expression matching is also considered. That's how the Error.*valid example worked. If .* had consumed everything after Error, there wouldn't be any more characters to try to match valid. So, among the varying quantity of characters to match for .*, the longest portion that satisfies the overall regular expression is chosen. Something like a.*b will match from the first a in the input string to the last b. In other implementations, like Perl, this is achieved through a process called backtracking. These approaches have their own advantages and disadvantages and have cases where the pattern can result in exponential time consumption.
# from the start of line to the last 'b' in the line
$ echo 'car bat cod map scat dot abacus' | sed 's/.*b/-/'
-acus
# from the first 'b' to the last 't' in the line
$ echo 'car bat cod map scat dot abacus' | sed 's/b.*t/-/'
car - abacus
# from the first 'b' to the last 'at' in the line
$ echo 'car bat cod map scat dot abacus' | sed 's/b.*at/-/'
car - dot abacus
# here 'm*' will match 'm' zero times as that gives the longest match
$ echo 'car bat cod map scat dot abacus' | sed 's/a.*m*/-/'
c-
Character classes
To create a custom placeholder for a limited set of characters, enclose them inside [] metacharacters. It is similar to using single character alternations inside a grouping, but with added flexibility and features. Character classes have their own versions of metacharacters and provide special predefined sets for common use cases. Quantifiers are also applicable to character classes.
# same as: sed -nE '/cot|cut/p' and sed -nE '/c(o|u)t/p'
$ printf 'cute\ncat\ncot\ncoat\ncost\nscuttle\n' | sed -n '/c[ou]t/p'
cute
cot
scuttle
# same as: sed -nE '/.(a|e|o)t/p'
$ printf 'meeting\ncute\nboat\nat\nfoot\n' | sed -n '/.[aeo]t/p'
meeting
boat
foot
# same as: sed -E 's/\b(s|o|t)(o|n)\b/X/g'
$ echo 'no so in to do on' | sed 's/\b[sot][on]\b/X/g'
no X in X do X
# lines made up of letters 'o' and 'n', line length at least 2
# words.txt contains dictionary words, one word per line
$ sed -nE '/^[on]{2,}$/p' words.txt
no
non
noon
on
Character class metacharacters
Character classes have their own metacharacters to help define the sets succinctly. Metacharacters outside of character classes like ^, $, () etc either don't have special meaning or have a completely different one inside the character classes.
First up, the - metacharacter that helps to define a range of characters instead of having to specify them all individually.
# same as: sed -E 's/[0123456789]+/-/g'
$ echo 'Sample123string42with777numbers' | sed -E 's/[0-9]+/-/g'
Sample-string-with-numbers
# whole words made up of lowercase alphabets and digits only
$ echo 'coat Bin food tar12 best' | sed -E 's/\b[a-z0-9]+\b/X/g'
X Bin X X X
# whole words made up of lowercase alphabets, starting with 'p' to 'z'
$ echo 'road i post grip read eat pit' | sed -E 's/\b[p-z][a-z]*\b/X/g'
X i X grip X eat X
Character classes can also be used to construct numeric ranges. However, it is easy to miss corner cases and some ranges are complicated to construct.
# numbers between 10 to 29
$ echo '23 154 12 26 34' | sed -E 's/\b[12][0-9]\b/X/g'
X 154 X X 34
# numbers >= 100 with optional leading zeros
$ echo '0501 035 154 12 26 98234' | sed -E 's/\b0*[1-9][0-9]{2,}\b/X/g'
X 035 X 12 26 X
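As an illustration of how quickly numeric ranges get involved, here is a sketch (assuming GNU sed) that matches whole numbers from 1 to 31 by combining three alternatives. The input and variable names are made up for this example, and leading zeros such as 07 are deliberately not handled.

```shell
# made-up input for illustration
nums='0 7 15 31 32 100'

# 1-9, or 10-29, or 30-31, as whole words only
days=$(echo "$nums" | sed -E 's/\b([1-9]|[12][0-9]|3[01])\b/X/g')

echo "$days"   # 0 X X X 32 100
```

Each sub-range needs its own alternative, which is why it is easy to overlook a corner case.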
Next metacharacter is ^ which has to be specified as the first character of the character class. It negates the set of characters, so all characters other than those specified will be matched. As highlighted earlier, handle negative logic with care, as you might end up matching more than you wanted.
# replace all non-digit characters
$ echo 'Sample123string42with777numbers' | sed -E 's/[^0-9]+/-/g'
-123-42-777-
# delete last two columns
$ echo 'apple:123:banana:cherry' | sed -E 's/(:[^:]+){2}$//'
apple:123
# sequence of characters surrounded by double quotes
$ echo 'I like "mango" and "guava"' | sed -E 's/"[^"]+"/X/g'
I like X and X
# sometimes it is simpler to positively define a set than negation
# same as: sed -n '/^[^aeiou]*$/p'
$ printf 'tryst\nfun\nglyph\npity\nwhy\n' | sed '/[aeiou]/d'
tryst
glyph
why
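Negated character classes are also a common way to stop a match from extending too far, given that the longest match wins. Here is a minimal sketch, assuming GNU sed and a made-up input string, that contrasts .* with a negated class.

```shell
# made-up input for illustration
s='apple:123:banana'

# .* is greedy, so everything up to the last ':' is removed
greedy=$(echo "$s" | sed 's/.*://')

# [^:]* cannot cross a ':', so only the first field is removed
limited=$(echo "$s" | sed 's/[^:]*://')

echo "$greedy"    # banana
echo "$limited"   # 123:banana
```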
Escape sequence sets
Some commonly used character sets have predefined escape sequences:
\w matches all word characters [a-zA-Z0-9_] (recall the description for word boundaries)
\W matches all non-word characters (recall duality seen earlier, like \b and \B)
\s matches all whitespace characters: tab, newline, vertical tab, form feed, carriage return and space
\S matches all non-whitespace characters
These escape sequences cannot be used inside character classes. Also, as mentioned earlier, these definitions assume ASCII input.
# match all non-word characters
$ echo 'load;err_msg--\nant,r2..not' | sed -E 's/\W+/-/g'
load-err_msg-nant-r2-not
# replace all sequences of whitespaces with a single space
$ printf 'hi \v\f there.\thave \ra nice\t\tday\n' | sed -E 's/\s+/ /g'
hi there. have a nice day
# \w would simply match \ and w inside character classes
$ echo 'w=y\x+9*3' | sed 's/[\w=]//g'
yx+9*3
sed doesn't support \d and \D, commonly featured in other implementations as a shortcut for all the digits and non-digits.
# \d will match just the 'd' character
$ echo '42\d123' | sed -E 's/\d+/-/g'
42\-123
# \d here matches all digit characters
$ echo '42\d123' | perl -pe 's/\d+/-/g'
-\d-
Named character sets
A named character set is defined by a name enclosed between [: and :] and has to be used within a character class [], along with other characters as needed.
Named set | Description |
---|---|
[:digit:] | [0-9] |
[:lower:] | [a-z] |
[:upper:] | [A-Z] |
[:alpha:] | [a-zA-Z] |
[:alnum:] | [0-9a-zA-Z] |
[:xdigit:] | [0-9a-fA-F] |
[:cntrl:] | control characters — first 32 ASCII characters and 127th (DEL) |
[:punct:] | all the punctuation characters |
[:graph:] | [:alnum:] and [:punct:] |
[:print:] | [:alnum:] , [:punct:] and space |
[:blank:] | space and tab characters |
[:space:] | whitespace characters, same as \s |
Here are some examples:
$ s='err_msg xerox ant m_2 P2 load1 eel'
$ echo "$s" | sed -E 's/\b[[:lower:]]+\b/X/g'
err_msg X X m_2 P2 load1 X
$ echo "$s" | sed -E 's/\b[[:lower:]_]+\b/X/g'
X X X m_2 P2 load1 X
$ echo "$s" | sed -E 's/\b[[:alnum:]]+\b/X/g'
err_msg X X m_2 X X X
$ echo ',pie tie#ink-eat_42' | sed -E 's/[^[:punct:]]+//g'
,#-_
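Since \w cannot be used inside a character class, a named set can stand in for it when other characters need to be part of the same class. Here is a sketch assuming GNU sed and ASCII input; the input string and variable names are chosen for this illustration, and printf is used instead of echo so that the backslash in the input is passed through unchanged.

```shell
# input containing word characters, '=' and a literal backslash
s='w=y\x+9*3'

# [[:alnum:]_] covers the same characters as \w, and '=' is added to the set
out=$(printf '%s\n' "$s" | sed 's/[[:alnum:]_=]//g')

printf '%s\n' "$out"   # \+*
```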
Matching character class metacharacters literally
Specific placement is needed to match character class metacharacters literally.
- should be the first or the last character.
# same as: sed -E 's/[-a-z]{2,}/X/g'
$ echo 'ab-cd gh-c 12-423' | sed -E 's/[a-z-]{2,}/X/g'
X X 12-423
] should be the first character.
# no match
$ printf 'int a[5]\nfig\n1+1=2\n' | sed -n '/[=]]/p'
# correct usage
$ printf 'int a[5]\nfig\n1+1=2\n' | sed -n '/[]=]/p'
int a[5]
1+1=2
[ can be used anywhere in the character set, but not combinations like [. or [:. Using [][] will match both [ and ].
$ echo 'int a[5]' | sed -n '/[x[.y]/p'
sed: -e expression #1, char 9: unterminated address regex
$ echo 'int a[5]' | sed -n '/[x[y.]/p'
int a[5]
^ should be other than the first character.
$ echo 'f*(a^b) - 3*(a+b)/(a-b)' | sed 's/a[+^]b/c/g'
f*(c) - 3*(c)/(a-b)
As seen in the examples above, combinations like [. or [: cannot be used together to mean two individual characters, as they have special meaning within []. See Character Classes and Bracket Expressions section in info sed for more details.
Escape sequences
Certain ASCII characters like tab \t, carriage return \r, newline \n, etc have escape sequences to represent them. Additionally, any character can be represented using its ASCII value in decimal \dNNN or octal \oNNN or hexadecimal \xNN formats. Unlike character set escape sequences like \w, these can be used inside character classes. As \ is special inside character class, use \\ to represent it literally (technically, this is only needed if the combination of \ and the character(s) that follows is a valid escape sequence).
# \t represents the tab character
$ printf 'apple\tbanana\tcherry\n' | sed 's/\t/ /g'
apple banana cherry
$ echo 'a b c' | sed 's/ /\t/g'
a b c
# these escape sequence work inside character class too
$ printf 'a\t\r\fb\vc\n' | sed -E 's/[\t\v\f\r]+/:/g'
a:b:c
# representing single quotes
# use \d039 and \o047 for decimal and octal respectively
$ echo "universe: '42'" | sed 's/\x27/"/g'
universe: "42"
$ echo 'universe: "42"' | sed 's/"/\x27/g'
universe: '42'
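The decimal and octal forms mentioned in the comment above can be tried out as follows. This is a minimal sketch assuming GNU sed; the variable names are made up for this illustration. All three commands represent the single quote character (ASCII 39 in decimal, 47 in octal, 27 in hexadecimal).

```shell
s="universe: '42'"

# decimal, octal and hexadecimal forms of the single quote character
dec=$(printf '%s\n' "$s" | sed 's/\d039/"/g')
oct=$(printf '%s\n' "$s" | sed 's/\o047/"/g')
hex=$(printf '%s\n' "$s" | sed 's/\x27/"/g')

echo "$dec"   # universe: "42"
echo "$oct"   # universe: "42"
echo "$hex"   # universe: "42"
```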
If a metacharacter is specified using the ASCII value format in the search section, it will still act as the metacharacter. However, metacharacters specified using the ASCII value format in the replacement section act as literal characters. Undefined escape sequences (both search and replacement section) will be treated as the character it escapes, for example, \e will match e (not \ and e).
# \x5e is ^ character, acts as line anchor here
$ printf 'cute\ncot\ncat\ncoat\n' | sed -n '/\x5eco/p'
cot
coat
# & metacharacter in replacement will be discussed in the next section
# it represents the entire matched portion
$ echo 'hello world' | sed 's/.*/"&"/'
"hello world"
# \x26 is & character, acts as a literal character here
$ echo 'hello world' | sed 's/.*/"\x26"/'
"&"
See sed manual: Escapes for full list and details such as precedence rules. See also stackoverflow: behavior of ASCII value format inside character classes.
Backreferences
The grouping metacharacters () are also known as capture groups. Similar to variables in programming languages, the portion captured by () can be referred to later using backreferences. The syntax is \N where N is the capture group you want. Leftmost ( in the regular expression is \1, next one is \2 and so on up to \9. Backreferences can be used in both the search and replacement sections.
# whole words that have at least one consecutive repeated character
# word boundaries are not needed here as longest match wins
$ echo 'effort flee facade oddball rat tool' | sed -E 's/\w*(\w)\1\w*/X/g'
X X facade X rat X
# reduce \\ to \ and delete if it is a single \
$ echo '\[\] and \\w and \[a-zA-Z0-9\_\]' | sed -E 's/(\\?)\\/\1/g'
[] and \w and [a-zA-Z0-9_]
# remove two or more duplicate words separated by spaces
# \b prevents false matches like 'the theatre', 'sand and stone' etc
$ echo 'aa a a a 42 f_1 f_1 f_13.14' | sed -E 's/\b(\w+)( \1)+\b/\1/g'
aa a 42 f_1 f_13.14
# 8 character lines having the same 3 lowercase letters at the start and end
$ sed -nE '/^([a-z]{3})..\1$/p' words.txt
mesdames
respires
restores
testates
\0 or & represents the entire matched string in the replacement section.
# duplicate the first column value and add it as the final column
# same as: sed -E 's/^([^,]+).*/\0,\1/'
$ echo 'one,2,3.14,42' | sed -E 's/^([^,]+).*/&,\1/'
one,2,3.14,42,one
# surround the entire line with double quotes
$ echo 'hello world' | sed 's/.*/"&"/'
"hello world"
$ echo 'hello world' | sed 's/.*/Hi. &. Have a nice day/'
Hi. hello world. Have a nice day
If a quantifier is applied on a pattern grouped inside () metacharacters, you'll need an outer () group to capture the matching portion. Other regular expression engines like PCRE (Perl Compatible Regular Expressions) provide non-capturing groups to handle such cases. In sed you'll have to consider the extra capture groups.
# uppercase the first letter of the first column (\u will be discussed later)
# surround the third column with double quotes
# note the numbers used in the replacement section
$ echo 'one,2,3.14,42' | sed -E 's/^(([^,]+,){2})([^,]+)/\u\1"\3"/'
One,2,"3.14",42
Here's an example where alternation order matters when the matching portions have the same length. The aim is to delete all whole words unless they start with g or p and contain y. See stackoverflow: Non greedy matching in sed for another use case.
$ s='tryst,fun,glyph,pity,why,group'
# all words get deleted because \b\w+\b gets priority here
$ echo "$s" | sed -E 's/\b\w+\b|(\b[gp]\w*y\w*\b)/\1/g'
,,,,,
# capture group gets priority here, so words in the capture group are retained
$ echo "$s" | sed -E 's/(\b[gp]\w*y\w*\b)|\b\w+\b/\1/g'
,,glyph,pity,,
As \ and & are special characters in the replacement section, use \\ and \& respectively for literal representation.
$ echo 'apple and fig' | sed 's/and/[&]/'
apple [and] fig
$ echo 'apple and fig' | sed 's/and/[\&]/'
apple [&] fig
$ echo 'apple and fig' | sed 's/and/\\/'
apple \ fig
Backreference will provide the string that was matched, not the pattern that was inside the capture group. For example, if ([0-9][a-f]) matches 3b, then backreferencing will give 3b and not any other valid match like 8f, 0a etc. This is akin to how variables behave in programming: only the expression result stays after variable assignment, not the expression itself.
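The note above can be checked with a short sketch, assuming GNU sed. The inputs and variable names are made up for this illustration. The backreference succeeds only when the captured text itself is repeated.

```shell
# \1 has to repeat the exact text that was captured, here '3b'
same=$(echo '3b 3b 8f' | sed -E 's/([0-9][a-f]) \1/X/')
echo "$same"    # X 8f

# '8f' fits the pattern [0-9][a-f] but is not the captured text '3b'
other=$(echo '3b 8f' | sed -E 's/([0-9][a-f]) \1/X/')
echo "$other"   # 3b 8f
```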
Known Bugs
Visit sed bug list for known issues.
Here's an issue for certain usage of backreferences and quantifier that was filed by yours truly.
# takes some time and results in no output
# aim is to get words having two occurrences of repeated characters
# works if you use perl -ne 'print if /^(\w*(\w)\2\w*){2}$/'
$ sed -nE '/^(\w*(\w)\2\w*){2}$/p' words.txt | head -n5
# works when nesting is unrolled
$ sed -nE '/^\w*(\w)\1\w*(\w)\2\w*$/p' words.txt | head -n5
Abbott
Annabelle
Annette
Appaloosa
Appleseed
unix.stackexchange: Why doesn't this sed command replace the 3rd-to-last "and"? shows another interesting bug when word boundaries and group repetition are involved. Some examples are shown below. Again, the workaround is to expand the group.
# wrong output
$ echo 'cocoa' | sed -nE '/(\bco){2}/p'
cocoa
# correct behavior, no output
$ echo 'cocoa' | sed -nE '/\bco\bco/p'
# wrong output, there's only 1 whole word 'it' after 'with'
$ echo 'it line with it here sit too' | sed -E 's/with(.*\bit\b){2}/XYZ/'
it line XYZ too
# correct behavior, input isn't modified
$ echo 'it line with it here sit too' | sed -E 's/with.*\bit\b.*\bit\b/XYZ/'
it line with it here sit too
Changing word boundaries to \< and \> results in a different issue:
# this correctly doesn't modify the input
$ echo 'it line with it here sit too' | sed -E 's/with(.*\<it\>){2}/XYZ/'
it line with it here sit too
# this correctly modifies the input
$ echo 'it line with it here it too' | sed -E 's/with(.*\<it\>){2}/XYZ/'
it line XYZ too
# but this one fails to modify the input
# expected output: it line XYZ too sit
$ echo 'it line with it here it too sit' | sed -E 's/with(.*\<it\>){2}/XYZ/'
it line with it here it too sit
Cheatsheet and summary
Note | Description |
---|---|
BRE | Basic Regular Expression, enabled by default |
ERE | Extended Regular Expression, enabled with the -E option |
Note: only ERE syntax is covered below | |
metacharacters | characters with special meaning in REGEXP |
^ | restricts the match to the start of the line |
$ | restricts the match to the end of the line |
\b | restricts the match to start/end of words |
word characters: alphabets, digits, underscore | |
\B | matches wherever \b doesn't match |
\< | start of word anchor |
\> | end of word anchor |
pat1|pat2 | combine multiple patterns as conditional OR |
each alternative can have independent anchors | |
alternative which matches earliest in the input gets precedence | |
and the leftmost longest portion wins in case of a tie | |
() | group pattern(s) |
a(b|c)d | same as abd|acd |
\^ | prefix metacharacters with \ to match them literally |
\\ | to match \ literally |
switching between ERE and BRE helps in some cases | |
/ | idiomatically used as the delimiter for REGEXP |
any character except \ and newline character can also be used | |
. | match any character (including the newline character) |
? | match 0 or 1 times |
* | match 0 or more times |
+ | match 1 or more times |
{m,n} | match m to n times |
{m,} | match at least m times |
{,n} | match up to n times (including 0 times) |
{n} | match exactly n times |
pat1.*pat2 | any number of characters between pat1 and pat2 |
pat1.*pat2|pat2.*pat1 | match both pat1 and pat2 in any order |
[ae;o] | match any of these characters once |
quantifiers are applicable to character classes too | |
[3-7] | range of characters from 3 to 7 |
[^=b2] | match other than = or b or 2 |
[a-z-] | - should be the first/last character to match literally |
[+^] | ^ shouldn't be the first character |
[]=] | ] should be the first character |
combinations like [. or [: have special meaning | |
\w | similar to [a-zA-Z0-9_] for matching word characters |
\s | similar to [ \t\n\r\f\v] for matching whitespace characters |
use \W and \S for their opposites respectively | |
[:digit:] | named character set, same as [0-9] |
\xNN | represent a character using its ASCII value in hexadecimal |
use \dNNN for decimal and \oNNN for octal | |
\N | backreference, gives matched portion of Nth capture group; applies to both the search and replacement sections; possible values: \1, \2 up to \9 |
\0 or & | represents entire matched string in the replacement section |
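A few of the cheatsheet entries can be tried out with short one-liners. The following is a minimal sketch, assuming GNU sed is available as sed. The sample strings are made up for illustration and aren't part of the example files.

```shell
# \b restricts the match to whole words
echo 'par spar apparent spare part' | sed 's/\bpar\b/X/g'
# X spar apparent spare part

# alternation: the longest portion wins when alternatives match from the same position
echo 'spared party parent' | sed -E 's/spa|spared/**/g'
# ** party parent

# backreference in the search section: words having a consecutive repeated character
echo 'effort flee facade oddball rat tool' | sed -E 's/\w*(\w)\1\w*/X/g'
# X X facade X rat X

# & in the replacement section represents the entire matched portion
echo 'hello world' | sed -E 's/\w+/[&]/g'
# [hello] [world]

# + is a literal character in BRE, whereas ERE needs \+ to match it literally
echo 'a+b=c' | sed 's/a+b/X/'
echo 'a+b=c' | sed -E 's/a\+b/X/'
# X=c (for both commands)
```

The last pair shows why switching between BRE and ERE can reduce the number of escapes needed.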
Regular expressions are a feature that you'll encounter in multiple command line programs and programming languages. They are a versatile tool for text processing. Although the BRE/ERE implementation provides fewer features than those found in programming languages, they are sufficient for most of the tasks you'll need for command line usage. It takes a lot of time to get used to the syntax and features of regular expressions, so I'd encourage you to practice a lot and maintain notes. It'd also help to consider regular expressions as a mini-programming language in itself, given their flexibility and complexity. In the next chapter, you'll learn about flags that add more features to regular expressions usage.
Exercises
The exercises directory has all the files used in this section.
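Several exercises below ask you to display only the modified lines. One way to do that is to combine the -n option with the p flag of the substitute command. Here is a minimal sketch, assuming GNU sed and using made-up input rather than the exercise files:

```shell
# -n suppresses automatic printing
# the p flag prints a line only if the substitution succeeded
printf 'apple\nbanana\ncherry\n' | sed -n 's/an/AN/gp'
# bANANa
```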
1) For the input file patterns.txt, display all lines that start with den or end with ly.
$ sed ##### add your solution here
2 lonely
dent
lovely
2) For the input file patterns.txt, replace all occurrences of 42 with [42] unless it is at the edge of a word. Display only the modified lines.
$ sed ##### add your solution here
Hi[42]Bye nice1[42]3 bad42
eqn2 = pressure*3+42/5-1[42]56
cool_[42]a 42fake
_[42]_
3) For the input file patterns.txt, add [] around words starting with s and containing e and t in any order. Display only the modified lines.
$ sed ##### add your solution here
[sets] tests Sauerkraut
[site] cite kite bite [store_2]
[subtle] sequoia
a [set]
4) From the input file patterns.txt, display lines having the same first and last word character.
$ sed ##### add your solution here
Not a pip DOWN
y
1 dentist 1
_42_
5) For the input file patterns.txt, display lines containing *[5] literally.
$ sed ##### add your solution here
(9-2)*[5]
6) sed -nE '/\b[a-z](on|no)[a-z]\b/p' is the same as sed -nE '/\b[a-z][on]{2}[a-z]\b/p'. True or False? The sample input shown below might help to understand the differences, if any.
$ printf 'known\nmood\nknow\npony\ninns\n'
known
mood
know
pony
inns
7) For the input file patterns.txt, display all lines starting with hand and ending immediately with s or y or le or no further characters.
$ sed ##### add your solution here
handle
handy
hands
hand
8) For the input file patterns.txt, replace 42//5 or 42/5 with 8. Display only the modified lines.
$ sed ##### add your solution here
eqn3 = r*42-5/3+42///5-83+a
eqn1 = a+8-c
eqn2 = pressure*3+8-14256
9) For the given quantifiers, what would be the equivalent form using the {m,n} representation?
? is same as
* is same as
+ is same as
10) True or False? In ERE, (a*|b*) is same as (a|b)*.
11) For the input file patterns.txt, construct two different REGEXPs to get the outputs as shown below. Display only the modified lines.
# delete from '(' till next ')'
$ sed ##### add your solution here
a/b + c%d
*[5]
def factorial
12- *4)
Hi there. Nice day
# delete from '(' till next ')' but not if there is '(' in between
$ sed ##### add your solution here
a/b + c%d
*[5]
def factorial
12- (e+*4)
Hi there. Nice day(a
12) For the input file anchors.txt, convert markdown anchors to corresponding hyperlinks as shown below.
$ cat anchors.txt
# <a name="regular-expressions"></a>Regular Expressions
## <a name="subexpression-calls"></a>Subexpression calls
## <a name="the-dot-meta-character"></a>The dot meta character
$ sed ##### add your solution here
[Regular Expressions](#regular-expressions)
[Subexpression calls](#subexpression-calls)
[The dot meta character](#the-dot-meta-character)
13) For the input file patterns.txt, replace the space character that occurs after a word ending with a or r with a newline character, only if the line also contains an uppercase letter. Display only the modified lines.
$ sed ##### add your solution here
par
car
tar
far
Cart
Not a
pip DOWN
14) Surround all whole words with (). Additionally, if the whole word is imp or ant, delete them. Can you do it with a single substitution?
$ words='tiger imp goat eagle ant important'
$ echo "$words" | sed ##### add your solution here
(tiger) () (goat) (eagle) () (important)
15) For the input file patterns.txt, display lines containing car but not as a whole word.
$ sed ##### add your solution here
scar
care
a huge discarded pile of books
scare
part cart mart
16) Will the ERE pattern ^a\w+([0-9]+:fig)? match the same characters for the input apple42:banana314 and apple42:fig100? If not, why not?
17) For the input file patterns.txt, display lines starting with 4 or - or u or sub or care.
$ sed ##### add your solution here
care
4*5]
-handy
subtle sequoia
unhand
18) For the given input string, replace all occurrences of digit sequences with only the unique non-repeating sequence. For example, 232323 should be changed to 23 and 897897 should be changed to 897. If there are no repeats (for example 1234) or if the repeats end prematurely (for example 12121), it should not be changed.
$ s='1234 2323 453545354535 9339 11 60260260'
$ echo "$s" | sed ##### add your solution here
1234 23 4535 9339 1 60260260
19) Replace sequences made up of words separated by : or . by the first word of the sequence. Such sequences will end when : or . is not followed by a word character.
$ ip='wow:Good:2_two.five: hi-2 bye kite.777:water.'
$ echo "$ip" | sed ##### add your solution here
wow hi-2 bye kite
20) Replace sequences made up of words separated by : or . by the last word of the sequence. Such sequences will end when : or . is not followed by a word character.
$ ip='wow:Good:2_two.five: hi-2 bye kite.777:water.'
$ echo "$ip" | sed ##### add your solution here
five hi-2 bye water
21) Replace all whole words with X unless the word is preceded by a ( character.
$ s='guava (apple) berry) apple (mango) (grape'
$ echo "$s" | sed ##### add your solution here
X (apple) X) X (mango) (grape
22) Surround whole words with [] only if they are followed by : or , or -.
$ ip='Poke,on=-=so_good:ink.to/is(vast)ever2-sit'
$ echo "$ip" | sed ##### add your solution here
[Poke],on=-=[so_good]:ink.to/is(vast)[ever2]-sit
23) The fields.txt file has fields separated by the : character. Delete : and the last field if there is a digit character anywhere before the last field.
$ cat fields.txt
42:cat
twelve:a2b
we:be:he:0:a:b:bother
apple:banana-42:cherry:
dragon:unicorn:centaur
$ sed ##### add your solution here
42
twelve:a2b
we:be:he:0:a:b
apple:banana-42:cherry
dragon:unicorn:centaur
24) Are the commands sed -n '/a^b/p' and sed -nE '/a^b/p' equivalent?
25) What characters can be used as REGEXP delimiters?