Regexp gotcha 1: grouping common portions
Similar to a(b+c)d = abd+acd in maths, you get a(b|c)d = abd|acd in regular expressions. However, you'll have to be careful if quantifiers are involved.
For example, (a*|b*) isn't the same as (a|b)*. Can you reason out why? Here's a railroad diagram to help you out:
Credit: debuggex.com
The difference is that (a*|b*) only matches same-letter sequences like a, bb, aaaaaa, etc. But (a|b)* can match mixed sequences like ababbba too. You can also simplify (a|b)* to [ab]* since it is just single-character alternation in this particular example.
Here's an illustration using Python:
>>> import re
>>> test = ['aa', 'abbaba', 'aaabbb', 'bbbbb', 'abc']
>>> [s for s in test if re.fullmatch(r'(a*|b*)', s)]
['aa', 'bbbbb']
>>> [s for s in test if re.fullmatch(r'(a|b)*', s)]
['aa', 'abbaba', 'aaabbb', 'bbbbb']
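The two equivalences mentioned earlier, a(b|c)d = abd|acd and (a|b)* = [ab]*, can be checked the same way. This is only a minimal sketch: the samples list is made up for this check, and agreeing on a handful of strings illustrates the equivalence rather than proving it.

```python
import re

# Hypothetical sample strings, chosen only for this check
samples = ['abd', 'acd', 'ad', 'abcd', 'abbd']

# Grouping the common portions vs. spelling out every alternative
grouped = [s for s in samples if re.fullmatch(r'a(b|c)d', s)]
expanded = [s for s in samples if re.fullmatch(r'abd|acd', s)]
print(grouped == expanded)

# Same test strings as above, plus the empty string
test = ['aa', 'abbaba', 'aaabbb', 'bbbbb', 'abc', '']

# Single-character alternation vs. the equivalent character class
alternation = [s for s in test if re.fullmatch(r'(a|b)*', s)]
char_class = [s for s in test if re.fullmatch(r'[ab]*', s)]
print(alternation == char_class)
```

Both comparisons print True. The empty string is included because the * quantifier also allows zero repetitions, so both patterns accept it.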
Want to learn regular expressions from the basics with plenty of examples and exercises? I've written regexp ebooks for Python, JavaScript, Ruby and CLI tools.