Text processing
This chapter will primarily focus on str methods to solve a wide variety of text processing tasks. You'll also see a few examples using the string and re modules.
join
The join() method is similar to what the print() function does with the sep option, except that you get a str object as the result. The iterable you pass to join() must contain only string elements.
>>> print(1, 2)
1 2
>>> ' '.join((1, 2))
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    ' '.join((1, 2))
    ~~~~~~~~^^^^^^^^
TypeError: sequence item 0: expected str instance, int found
>>> ' '.join(('1', '2'))
'1 2'
>>> c = ' :: '
>>> c.join(['This', 'is', 'a', 'sample', 'string'])
'This :: is :: a :: sample :: string'
As an exercise, check what happens if you pass multiple string values separated by commas to join() instead of an iterable.
The print() function uses an object's __str__() method to get its string representation. The __repr__() method is used as a fallback.
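Here's a minimal sketch of both points above. Point is a hypothetical class made up for this illustration: it defines only __repr__(), so str() and print() fall back to it. The map(str, ...) call is one common way to convert non-string items before passing them to join().

```python
# join() needs strings, so convert other types first
nums = (1, 2, 3)
joined = ' '.join(map(str, nums))
print(joined)  # 1 2 3

# Point is a hypothetical class that defines only __repr__()
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f'Point({self.x}, {self.y})'

# no __str__() defined here, so str() and print() fall back to __repr__()
p = Point(1, 2)
as_text = str(p)
print(p)  # Point(1, 2)
```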
Transliteration
The translate() method accepts a table of codepoints (numerical value of a character) mapped to another character/codepoint or None (if the character has to be deleted). You can use the ord() built-in function to get the codepoint of a character. Or, you can use the str.maketrans() method to generate the mapping for you.
>>> ord('a')
97
>>> ord('A')
65
>>> str.maketrans('aeiou', 'AEIOU')
{97: 65, 101: 69, 105: 73, 111: 79, 117: 85}
>>> greeting = 'have a nice day'
>>> greeting.translate(str.maketrans('aeiou', 'AEIOU'))
'hAvE A nIcE dAy'
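The str.maketrans() method also accepts a single dict argument, which is handy when you want to mix replacement and deletion in one table. The mapping below is only an illustration (the chosen characters are arbitrary): keys are single characters, and values can be a string of any length or None.

```python
# keys must be single characters (or codepoints)
# values can be a string of any length, a codepoint or None (delete)
table = str.maketrans({'a': 'A', 'e': None, 'n': '[n]'})
print(table)  # {97: 'A', 101: None, 110: '[n]'}

greeting = 'have a nice day'
result = greeting.translate(table)
print(result)  # hAv A [n]ic dAy
```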
The string module has a collection of constants that are often useful in text processing. Here's an example of deleting punctuation characters.
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> para = '"Hi", there! How *are* you? All fine here.'
>>> para.translate(str.maketrans('', '', string.punctuation))
'Hi there How are you All fine here'
>>> chars_to_delete = ''.join(set(string.punctuation) - set('.!?'))
>>> para.translate(str.maketrans('', '', chars_to_delete))
'Hi there! How are you? All fine here.'
As an exercise, read the documentation for features covered in this section. See also stackoverflow: character translation examples.
Removing leading and trailing characters
The strip() method removes consecutive characters from the start/end of the given string. By default this method removes whitespace characters, which you can change by passing a str argument. You can use the lstrip() and rstrip() methods to work only on the leading and trailing characters respectively.
>>> greeting = ' \t\r\n have a nice \t day \f\v\r\t\n '
>>> greeting.strip()
'have a nice \t day'
>>> greeting.lstrip()
'have a nice \t day \x0c\x0b\r\t\n '
>>> greeting.rstrip()
' \t\r\n have a nice \t day'
>>> '"Hi". How are you!?'.strip(string.punctuation)
'Hi". How are you'
The removeprefix() and removesuffix() methods will delete a substring from the start/end of the input string.
>>> 'spare'.removeprefix('sp')
'are'
>>> 'free'.removesuffix('e')
'fre'
# difference between remove and strip
>>> 'cared'.removesuffix('de')
'cared'
# strip uses given argument as a set of characters to be removed in any order
>>> 'cared'.rstrip('de')
'car'
Dealing with case
Here are five different methods for changing the case of characters. For word-level transformation, each run of consecutive letters is treated as a word, so words are not delimited by whitespace alone.
>>> sentence = 'thIs iS a saMple StrIng'
>>> sentence.capitalize()
'This is a sample string'
>>> sentence.title()
'This Is A Sample String'
>>> sentence.lower()
'this is a sample string'
>>> sentence.upper()
'THIS IS A SAMPLE STRING'
>>> sentence.swapcase()
'THiS Is A SAmPLE sTRiNG'
The string.capwords() function is similar to title() but also allows a specific word separator (whose default is whitespace).
>>> phrase = 'this-IS-a:colon:separated,PHRASE'
>>> phrase.title()
'This-Is-A:Colon:Separated,Phrase'
>>> string.capwords(phrase, ':')
'This-is-a:Colon:Separated,phrase'
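Apostrophes are a common case where this difference shows up. A short sketch, using a sample sentence chosen for illustration:

```python
import string

text = "they're bill's friends"

# title() starts a new word after any non-letter, including the apostrophe
titled = text.title()
print(titled)  # They'Re Bill'S Friends

# capwords() splits on whitespace by default, so apostrophes are left alone
capped = string.capwords(text)
print(capped)  # They're Bill's Friends
```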
is methods
The islower(), isupper() and istitle() methods check if the given string conforms to the specific case pattern. Characters other than letters do not influence the result, but at least one letter needs to be present for a True output.
>>> 'αλεπού'.islower()
True
>>> '123'.isupper()
False
>>> 'ABC123'.isupper()
True
>>> 'Today is Sunny'.istitle()
False
Here are some examples with the isnumeric() and isascii() methods. As an exercise, read the documentation for the rest of the is methods.
# checks if the string has numeric characters only (at least one)
>>> '153'.isnumeric()
True
>>> ''.isnumeric()
False
>>> '1.2'.isnumeric()
False
>>> '-1'.isnumeric()
False
# False if any character codepoint is outside the range 0x00 to 0x7F
>>> '123—456'.isascii()
False
>>> 'happy learning!'.isascii()
True
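The isdecimal(), isdigit() and isnumeric() methods are easy to mix up. Here's a small sketch comparing them on a few sample strings (picked for illustration); the documentation has the precise definitions.

```python
# compare three related methods on plain digits, a superscript and a fraction
samples = ['153', '²', '½']
checks = {s: (s.isdecimal(), s.isdigit(), s.isnumeric()) for s in samples}

for s, flags in checks.items():
    print(s, flags)
# 153 (True, True, True)
# ² (False, True, True)
# ½ (False, False, True)
```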
Substring and count
The in operator checks if the LHS string is a substring of the RHS string.
>>> sentence = 'This is a sample string'
>>> 'is a' in sentence
True
>>> 'this' in sentence
False
>>> 'this' in sentence.lower()
True
>>> 'test' not in sentence
True
The count() method gives the number of times the given substring is present (non-overlapping).
>>> sentence = 'This is a sample string'
>>> sentence.count('is')
2
>>> sentence.count('w')
0
>>> word = 'phototonic'
>>> word.count('oto')
1
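If you do need overlapping matches, one possible approach is to search repeatedly with find(), moving the start position by one character each time. The count_overlapping() function below is a hypothetical helper written for this illustration, not a built-in.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing overlaps."""
    if not sub:
        raise ValueError('sub must not be empty')
    total = 0
    start = text.find(sub)
    while start != -1:
        total += 1
        # resume one character after the start of the last match
        start = text.find(sub, start + 1)
    return total

word = 'phototonic'
print(word.count('oto'))               # 1
print(count_overlapping(word, 'oto'))  # 2
```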
Match at the start and end of strings
The startswith() and endswith() methods check for the presence of substrings only at the start and end of an input string.
>>> sentence = 'This is a sample string'
>>> sentence.startswith('This')
True
>>> sentence.startswith('is')
False
>>> sentence.endswith('ing')
True
>>> sentence.endswith('ly')
False
If you need to check for multiple conditions, pass a tuple argument.
>>> words = ['refuse', 'impossible', 'present', 'read']
>>> prefix = ('im', 're')
>>> for w in words:
...     if w.startswith(prefix):
...         print(w)
...
refuse
impossible
read
split
The split() method splits a string based on the given substring and returns a list. By default, whitespace characters are used for splitting. You can also control the number of splits.
>>> greeting = ' \t\r\n have a nice \t day \f\v\r\t\n '
# note that the leading/trailing whitespaces do not create empty elements
>>> greeting.split()
['have', 'a', 'nice', 'day']
# note that the empty elements are preserved here
>>> ':car::jeep::'.split(':')
['', 'car', '', 'jeep', '', '']
>>> 'apple<=>grape<=>mango<=>fig'.split('<=>', maxsplit=1)
['apple', 'grape<=>mango<=>fig']
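The default behavior is not the same as passing a single space as the separator. A short sketch with a sample string chosen for illustration:

```python
line = 'a  b c'  # two spaces between 'a' and 'b'

# default: runs of whitespace act as a single separator
default_split = line.split()
print(default_split)  # ['a', 'b', 'c']

# explicit separator: every single space splits, so an empty element appears
space_split = line.split(' ')
print(space_split)  # ['a', '', 'b', 'c']
```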
As an exercise, read the documentation for the rsplit(), partition() and rpartition() methods.
replace
Use the replace() method for substitution operations. An optional count keyword argument allows you to specify the number of replacements to be made.
>>> phrase = '2 be or not 2 be'
>>> phrase.replace('2', 'to')
'to be or not to be'
>>> phrase.replace('2', 'to', count=1)
'to be or not 2 be'
# recall that string is immutable, you'll need to re-assign if needed
>>> phrase
'2 be or not 2 be'
>>> phrase = phrase.replace('2', 'to')
>>> phrase
'to be or not to be'
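As far as I know, passing count as a keyword argument needs Python 3.13 or newer (do check the documentation for your version). The positional form shown in this sketch should work on older versions as well.

```python
phrase = '2 be or not 2 be'

# count passed positionally as the third argument
first_only = phrase.replace('2', 'to', 1)
print(first_only)  # to be or not 2 be
```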
re module
Regular expressions are a versatile tool for text processing. Here are some common use cases:
- Sanitizing a string to ensure that it satisfies a known set of rules. For example, to check if a given string matches password rules.
- Filtering or extracting portions on an abstract level like letters, digits, punctuation and so on.
- Qualified string replacement. For example, at the start or the end of a string, only whole words, based on surrounding text, etc.
You can use the built-in re module to perform such tasks. Here are some examples:
>>> import re
# extract non-colon character sequences
>>> ip = ':car::jeep::'
# using the 'split' method will result in possible empty elements
>>> ip.split(':')
['', 'car', '', 'jeep', '', '']
# with regular expressions, you can choose to match only the non-empty portions
# [^:] is a character class to match non : characters
# + is a quantifier that matches the preceding element one or more times
>>> re.findall(r'[^:]+', ip)
['car', 'jeep']
# replace only whole words 'par' OR 'hand' with 'X'
# \b is an anchor to restrict the matching to the start/end of words
# () has many uses, helps to group common elements here
# similar to 'a(b+c)d = abd+acd' in maths, you get 'a(b|c)d = abd|acd'
# | is similar to the 'or' operator
>>> ip = 'par spare part hand handy unhanded'
>>> re.sub(r'\b(par|hand)\b', 'X', ip)
'X spare part X handy unhanded'
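Here's a sketch for the password use case mentioned earlier. The rules are made up for this illustration (at least 8 characters, no whitespace, at least one ASCII letter and one digit), and is_valid_password() is a hypothetical helper name.

```python
import re

def is_valid_password(pw):
    """Check hypothetical rules: 8+ non-whitespace characters,
    with at least one ASCII letter and at least one digit."""
    return bool(
        re.fullmatch(r'\S{8,}', pw)      # whole string: 8 or more non-whitespace
        and re.search(r'[A-Za-z]', pw)   # at least one letter anywhere
        and re.search(r'[0-9]', pw)      # at least one digit anywhere
    )

print(is_valid_password('hunter2024'))  # True
print(is_valid_password('short1'))      # False
print(is_valid_password('password'))    # False
```

Checking each rule with a separate pattern keeps the individual expressions easy to read, at the cost of scanning the string more than once.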
See my book Understanding Python re(gex)? for a detailed guide on regular expressions. You'll also get to learn the third-party regex module.
Exercises
Write a function that checks if two strings are anagrams irrespective of case. Assume that the input is made up of letters only.
>>> anagram('god', 'Dog')
True
>>> anagram('beat', 'table')
False
>>> anagram('Beat', 'abet')
True
Read the documentation and implement these formatting examples with the equivalent str methods.
>>> fruit = 'apple'
>>> f'{fruit:=>10}'
'=====apple'
>>> f'{fruit:=<10}'
'apple====='
>>> f'{fruit:=^10}'
'==apple==='
>>> f'{fruit:^10}'
'  apple   '
Write a function that returns a list of words present in the input string.
>>> words('"Hi", there! How *are* you? All fine here.')
['Hi', 'there', 'How', 'are', 'you', 'All', 'fine', 'here']
>>> words('This-Is-A:Colon:Separated,Phrase')
['This', 'Is', 'A', 'Colon', 'Separated', 'Phrase']