Python String Processing Cheatsheet
Try this string processing primer cheatsheet to gain an understanding of using Python to manipulate and process strings at a basic level.
String Processing in Python
Natural language processing and text analytics are hot areas of research and application at the moment. These fields entail all sorts of specific skills and concepts requiring thorough understanding before moving into meaningful practice. Prior to getting to that point, however, basic string manipulation and processing is a must.
There are 2 distinct types of broad computational string processing skills that need to be broached, in my opinion. The first of these is regular expressions, a pattern-based approach to text matching. There are numerous great introductions to regular expressions one can search out, but visual learners may appreciate the fast.ai Code-First Intro to Natural Language Processing course video on the topic.
The other distinct computational string processing skill is being able to leverage a given programming language's standard library for basic string manipulation. As such, this article is a short Python string processing primer.
Note that meaningful text analytics go way beyond string processing, and the core of these more advanced techniques may not require you to manipulate text on your own very often. However, text data processing is an important and time-consuming part of a successful text analytics project, and these above-mentioned string processing skills will be invaluable here. Understanding the computational processing of text at a basic level is conceptually very important to understanding more advanced text analytics techniques as well.
Many of the following examples make use of the Python standard library string module, and so having it handy for reference is a good idea.
This handy cheatsheet contains all of the code in this downloadable PDF.
Stripping Whitepsace
Stripping whitespace is an elementary string processing requirement. You can strip leading whitespace with the lstrip()
method (left), trailing whitespace with rstrip()
(right), and both leading and trailing with strip()
.
s = ' This is a sentence with whitespace. \n' print('Strip leading whitespace: {}'.format(s.lstrip())) print('Strip trailing whitespace: {}'.format(s.rstrip())) print('Strip all whitespace: {}'.format(s.strip()))
Strip leading whitespace: This is a sentence with whitespace. Strip trailing whitespace: This is a sentence with whitespace. Strip all whitespace: This is a sentence with whitespace.
Interested in stripping characters other than whitespace? The same methods are helpful, and are used by passing in the character(s) you want stripped.
s = 'This is a sentence with unwanted characters.AAAAAAAA' print('Strip unwanted characters: {}'.format(s.rstrip('A')))
Strip unwanted characters: This is a sentence with unwanted characters.
Don't forget to check out the string format()
documentation if necessary.
Splitting Strings
Splitting strings into lists of smaller substrings is often useful and easily accomplished in Python with the split()
method.
s = 'KDnuggets is a fantastic resource' print(s.split())
['KDnuggets', 'is', 'a', 'fantastic', 'resource']
By default, split()
splits on whitespace, but other character(s) sequences can be passed in as well.
s = 'these,words,are,separated,by,comma' print('\',\' separated split -> {}'.format(s.split(','))) s = 'abacbdebfgbhhgbabddba' print('\'b\' separated split -> {}'.format(s.split('b')))
',' separated split -> ['these', 'words', 'are', 'separated', 'by', 'comma'] 'b' separated split -> ['a', 'ac', 'de', 'fg', 'hhg', 'a', 'dd', 'a']
Joining List Elements Into a String
Need the opposite of the above operation? You can join list element strings into a single string in Python using the join()
method.
s = ['KDnuggets', 'is', 'a', 'fantastic', 'resource'] print(' '.join(s))
KDnuggets is a fantastic resource
Ain't that the truth! And if you want to join list elements with something other than whitespace in between? This thing may be a little bit stranger, but also easily accomplished.
s = ['Eleven', 'Mike', 'Dustin', 'Lucas', 'Will'] print(' and '.join(s))
Eleven and Mike and Dustin and Lucas and Will
Reversing a String
Python does not have a built-in string reverse method. However, given that strings can be sliced like lists, reversing one can be done in the same succinct fashion that a list's elements can be reversed.
s = 'KDnuggets' print('The reverse of KDnuggets is {}'.format(s[::-1]))
The reverse of KDnuggets is: steggunDK
Converting Uppercase and Lowercase
Converting between cases can be done with the upper()
, lower()
, and swapcase()
methods.
s = 'KDnuggets' print('\'KDnuggets\' as uppercase: {}'.format(s.upper())) print('\'KDnuggets\' as lowercase: {}'.format(s.lower())) print('\'KDnuggets\' as swapped case: {}'.format(s.swapcase()))
'KDnuggets' as uppercase: KDNUGGETS 'KDnuggets' as lowercase: kdnuggets 'KDnuggets' as swapped case: kdNUGGETS
Checking for String Membership
The easiest way to check for string membership in Python is using the in
operator. The syntax is very natural language-like.
s1 = 'perpendicular' s2 = 'pen' s3 = 'pep' print('\'pen\' in \'perpendicular\' -> {}'.format(s2 in s1)) print('\'pep\' in \'perpendicular\' -> {}'.format(s3 in s1))
'pen' in 'perpendicular' -> True 'pep' in 'perpendicular' -> False
If you are more interested in finding the location of a substring within a string (as opposed to simply checking whether or not the substring is contained), the find() string method can be more helpful.
s = 'Does this string contain a substring?' print('\'string\' location -> {}'.format(s.find('string'))) print('\'spring\' location -> {}'.format(s.find('spring')))
'string' location -> 10 'spring' location -> -1
find()
returns the index of the first character of the first occurrence of the substring by default, and returns -1
if the substring is not found. Check the documentation for available tweaks to this default behavior.
Replacing Substrings
What if you want to replace substrings, instead of just find them? The Python replace()
string method will take care of that.
s1 = 'The theory of data science is of the utmost importance.' s2 = 'practice' print('The new sentence: {}'.format(s1.replace('theory', s2)))
The new sentence: The practice of data science is of the utmost importance.
An optional count argument can specify the maximum number of successive replacements to make if the same substring occurs multiple times.
Combining the Output of Multiple Lists
Have multiple lists of strings you want to combine together in some element-wise fashion? No problem with the zip()
function.
countries = ['USA', 'Canada', 'UK', 'Australia'] cities = ['Washington', 'Ottawa', 'London', 'Canberra'] for x, y in zip(countries, cities): print('The capital of {} is {}.'.format(x, y))
The capital of USA is Washington. The capital of Canada is Ottawa. The capital of UK is London. The capital of Australia is Canberra.
Checking for Anagrams
Want to check if a pair of strings are anagrams of one another? Algorithmically, all we need to do is count the occurrences of each letter for each string and check if these counts are equal. This is straightforward using the Counter
class of the collections
module.
from collections import Counter def is_anagram(s1, s2): return Counter(s1) == Counter(s2) s1 = 'listen' s2 = 'silent' s3 = 'runner' s4 = 'neuron' print('\'listen\' is an anagram of \'silent\' -> {}'.format(is_anagram(s1, s2))) print('\'runner\' is an anagram of \'neuron\' -> {}'.format(is_anagram(s3, s4)))
'listen' an anagram of 'silent' -> True 'runner' an anagram of 'neuron' -> False
Checking for Palindromes
How about if you want to check whether a given word is a palindrome? Algorithmically, we need to create a reverse of the word and then use the ==
operator to check if these 2 strings (the original and the reverse) are equal.
def is_palindrome(s): reverse = s[::-1] if (s == reverse): return True return False s1 = 'racecar' s2 = 'hippopotamus' print('\'racecar\' a palindrome -> {}'.format(is_palindrome(s1))) print('\'hippopotamus\' a palindrome -> {}'.format(is_palindrome(s2)))
'racecar' is a palindrome -> True 'hippopotamus' is a palindrome -> False
Make sure you download the Python String Processing Cheatsheet here.