KDnuggets Home » News » 2020 » Jan » Tutorials, Overviews » 10 Python String Processing Tips & Tricks ( 20:n03 )

10 Python String Processing Tips & Tricks


Pursuing a text analytics path but don't know where to start? Try this string processing primer to first gain an understanding of using Python to manipulate and process strings at a basic level.



Figure

 

Natural language processing and text analytics are hot areas of research and application at the moment. These fields entail all sorts of specific skills and concepts requiring thorough understanding before moving into meaningful practice. Prior to getting to that point, however, basic string manipulation and processing is a must.

There are 2 distinct types of broad computational string processing skills that need to be broached, in my opinion. The first of these is regular expressions, a pattern-based approach to text matching. There are numerous great introductions to regular expressions one can search out, but visual learners may appreciate the fast.ai Code-First Intro to Natural Language Processing course video on the topic.

The other distinct computational string processing skill is being able to leverage a given programming language's standard library for basic string manipulation. As such, this article is a short Python string processing primer for those toying with the idea of pursuing a more in-depth text analytics career.

Want to unlock value and understanding in all of that text your company has? You better make sure you understand the most basic of basics first. Have a look at these beginner tricks for insight.

Note that meaningful text analytics go way beyond string processing, and the core of these more advanced techniques may not require you to manipulate text on your own very often. However, text data pre-processing is an important and time-consuming part of a successful text analytics project, and these above-mentioned string processing skills will be invaluable here. Understanding the computational processing of text at a basic level is conceptually very important to understanding more advanced text analytics techniques as well.

Many of the following examples make use of the Python standard library string module, and so having it handy for reference is a good idea.

 

1. Stripping Whitepsace

 
Stripping whitespace is an elementary string processing requirement. You can strip leading whitespace with the lstrip() method (left), trailing whitespace with rstrip() (right), and both leading and trailing with strip().

s = '   This is a sentence with whitespace.       \n'

print('Strip leading whitespace: {}'.format(s.lstrip()))
print('Strip trailing whitespace: {}'.format(s.rstrip()))
print('Strip all whitespace: {}'.format(s.strip()))
Strip leading whitespace: This is a sentence with whitespace.       

Strip trailing whitespace:    This is a sentence with whitespace.
Strip all whitespace: This is a sentence with whitespace.

 

Interested in stripping characters other than whitespace? The same methods are helpful, and are used by passing in the character(s) you want stripped.

s = 'This is a sentence with unwanted characters.AAAAAAAA'

print('Strip unwanted characters: {}'.format(s.rstrip('A')))
Strip unwanted characters: This is a sentence with unwanted characters.

 

Don't forget to check out the string format() documentation if necessary.

 

2. Splitting Strings

 
Splitting strings into lists of smaller substrings is often useful and easily accomplished in Python with the split() method.

s = 'KDnuggets is a fantastic resource'

print(s.split())
['KDnuggets', 'is', 'a', 'fantastic', 'resource']

 

By default, split() splits on whitespace, but other character(s) sequences can be passed in as well.

s = 'these,words,are,separated,by,comma'
print('\',\' separated split -> {}'.format(s.split(',')))

s = 'abacbdebfgbhhgbabddba'
print('\'b\' separated split -> {}'.format(s.split('b')))
',' separated split -> ['these', 'words', 'are', 'separated', 'by', 'comma']
'b' separated split -> ['a', 'ac', 'de', 'fg', 'hhg', 'a', 'dd', 'a']

 

 

3. Joining List Elements Into a String

 
Need the opposite of the above operation? You can join list element strings into a single string in Python using the join() method.

s = ['KDnuggets', 'is', 'a', 'fantastic', 'resource']

print(' '.join(s))
KDnuggets is a fantastic resource

 

Ain't that the truth! And if you want to join list elements with something other than whitespace in between? This thing may be a little bit stranger, but also easily accomplished.

s = ['Eleven', 'Mike', 'Dustin', 'Lucas', 'Will']

print(' and '.join(s))
Eleven and Mike and Dustin and Lucas and Will

 

 

4. Reversing a String

 
Python does not have a built-in string reverse method. However, given that strings can be sliced like lists, reversing one can be done in the same succinct fashion that a list's elements can be reversed.

s = 'KDnuggets'

print('The reverse of KDnuggets is {}'.format(s[::-1]))
The reverse of KDnuggets is: steggunDK

 

 

5. Converting Uppercase and Lowercase

 
Converting between cases can be done with the upper(), lower(), and swapcase() methods.

s = 'KDnuggets'

print('\'KDnuggets\' as uppercase: {}'.format(s.upper()))
print('\'KDnuggets\' as lowercase: {}'.format(s.lower()))
print('\'KDnuggets\' as swapped case: {}'.format(s.swapcase()))
'KDnuggets' as uppercase: KDNUGGETS
'KDnuggets' as lowercase: kdnuggets
'KDnuggets' as swapped case: kdNUGGETS

 

 

6. Checking for String Membership

 
The easiest way to check for string membership in Python is using the in operator. The syntax is very natural language-like.

s1 = 'perpendicular'
s2 = 'pen'
s3 = 'pep'

print('\'pen\' in \'perpendicular\' -> {}'.format(s2 in s1))
print('\'pep\' in \'perpendicular\' -> {}'.format(s3 in s1))
'pen' in 'perpendicular' -> True
'pep' in 'perpendicular' -> False

 

If you are more interested in finding the location of a substring within a string (as opposed to simply checking whether or not the substring is contained), the find() string method can be more helpful.

s = 'Does this string contain a substring?'

print('\'string\' location -> {}'.format(s.find('string')))
print('\'spring\' location -> {}'.format(s.find('spring')))
'string' location -> 10
'spring' location -> -1

 

find() returns the index of the first character of the first occurrence of the substring by default, and returns -1 if the substring is not found. Check the documentation for available tweaks to this default behavior.

 

7. Replacing Substrings

 
What if you want to replace substrings, instead of just find them? The Python replace() string method will take care of that.

s1 = 'The theory of data science is of the utmost importance.'
s2 = 'practice'

print('The new sentence: {}'.format(s1.replace('theory', s2)))
The new sentence: The practice of data science is of the utmost importance.

 

An optional count argument can specify the maximum number of successive replacements to make if the same substring occurs multiple times.

 

8. Combining the Output of Multiple Lists

 
Have multiple lists of strings you want to combine together in some element-wise fashion? No problem with the zip() function.

countries = ['USA', 'Canada', 'UK', 'Australia']
cities = ['Washington', 'Ottawa', 'London', 'Canberra']

for x, y in zip(countries, cities):
  print('The capital of {} is {}.'.format(x, y))
The capital of USA is Washington.
The capital of Canada is Ottawa.
The capital of UK is London.
The capital of Australia is Canberra.

 

 

9. Checking for Anagrams

 
Want to check if a pair of strings are anagrams of one another? Algorithmically, all we need to do is count the occurrences of each letter for each string and check if these counts are equal. This is straightforward using the Counter class of the collections module.

from collections import Counter
def is_anagram(s1, s2):
  return Counter(s1) == Counter(s2)

s1 = 'listen'
s2 = 'silent'
s3 = 'runner'
s4 = 'neuron'

print('\'listen\' is an anagram of \'silent\' -> {}'.format(is_anagram(s1, s2)))
print('\'runner\' is an anagram of \'neuron\' -> {}'.format(is_anagram(s3, s4)))
'listen' an anagram of 'silent' -> True
'runner' an anagram of 'neuron' -> False

 

 

10. Checking for Palindromes

 
How about if you want to check whether a given word is a palindrome? Algorithmically, we need to create a reverse of the word and then use the == operator to check if these 2 strings (the original and the reverse) are equal.

def is_palindrome(s):
  reverse = s[::-1]
  if (s == reverse):
    return True
  return False

s1 = 'racecar'
s2 = 'hippopotamus'

print('\'racecar\' a palindrome -> {}'.format(is_palindrome(s1)))
print('\'hippopotamus\' a palindrome -> {}'.format(is_palindrome(s2)))
'racecar' is a palindrome -> True
'hippopotamus' is a palindrome -> False

 

These string processing "tricks" won't make you a text analytics or natural language processing expert on their own, but they may give someone the interest in pursuing these fields and learning the techniques which would be necessary for eventually becoming just such an expert.

 
Related:


Sign Up

By subscribing you accept KDnuggets Privacy Policy