Python 3

Regular Expressions: Text Matching and Extraction

Text Matching: Why is it so Useful?

Short answer: validation and extraction of formatted text.


Either we want to see whether a string contains the pattern we seek, or we want to pull selected information from the string. In the past, we've isolated or extracted text using basic Python features.


In the case of fixed-width text, we have been able to use a slice.

line = '19340903  3.4 0.9'
year = line[0:4]                 # year == '1934' (a string)

In the case of delimited text, we have been able to use split()

line = '19340903,3.4,0.9'
els = line.split(',')
yearmonthday = els[0]            # '19340903'
MktRF = els[1]                   # '3.4'

In the case of formatted text, there is no obvious way to do it.

# how would we extract 'Jun' (the month) from this string?
log_line = '66.108.19.165 - - [09/Jun/2003:19:56:33 -0400] "GET /~jjk265/cd.jpg HTTP/1.1" 200 175449'

We may be able to use split() and slicing in some combination to get what we want, but it would be awkward and time consuming. So we're going to learn how to use regular expressions.





Preview: a complex example

Here we demonstrate what regexes look like and how they're used.


Just as an example to show you what we'll be doing with regexes, the following regex pattern could be used to pull the IP address and 'Jun' (the month) from the log line:

import re

log_line = '66.108.19.165 - - [09/Jun/2003:19:56:33 -0400] "GET /~jjk265/cd.jpg HTTP/1.1" 200 175449'

reg = re.search(r'(\d{2,3}\.\d{2,3}\.\d{2,3}\.\d{2,3}) - - \[\d\d\/(\w{3})\/\d{4}', log_line)

print(reg.group(1))   # 66.108.19.165
print(reg.group(2))   # Jun

Reading from left to right, the pattern (shown in the r'' string) says this: "2-3 digits, followed by a period, followed by 2-3 digits, followed by a period, followed by 2-3 digits, followed by a period, followed by 2-3 digits (all of this grouped for extraction), followed by a space, dash, space, dash, space, followed by an opening square bracket, followed by 2 digits, followed by a forward slash, followed by 3 "word" characters (and this text grouped for extraction), followed by a slash, followed by 4 digit characters."

Now, that may seem complex. Many of our regexes will be much simpler, although this one isn't terribly unusual. But the power of this tool is in allowing us to describe a complex string in terms of the pattern of the text -- not the specific text -- and either pull parts of it out or make sure it matches the pattern we're looking for. This is the purpose of regular expressions.

The parentheses within the pattern are called groupings and allow us to extract (i.e., copy out) the text that matched the pattern. We have parentheses around the IP address and the month, so we can use .group(1) and .group(2) to copy out these values, based on the order of parentheses in the pattern.





Python's re module and re.search()

import re makes regular expressions available to us. Everything we do with regular expressions is done through re.


re.search() takes two arguments: the regex pattern string and the string to be searched. It returns a match object if the pattern is found and None if it is not, so it can be used directly in an if test: a match object counts as true, None as false.

# weblog is an open file containing lines like these:
#   66.108.19.165 - - [09/Jun/2003:19:56:33 -0400] "GET /~jjk265/cd.jpg HTTP/1.1" 200 175449
#   66.108.19.165 - - [09/Jun/2003:19:56:44 -0400] "GET /~dbb212/mysong.mp3 HTTP/1.1" 200 175449
#   66.108.19.165 - - [09/Jun/2003:19:56:45 -0400] "GET /~jjk265/cd2.jpg HTTP/1.1" 200 175449

# script snippet:
for line in weblog.readlines():
    if re.search(r'~jjk265', line):
        print(line)                      # prints 2 of the above lines
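
A short sketch of what re.search() actually hands back: a match object when the pattern is found, and None when it is not. That is why it can sit directly in an if test. The two strings here are shortened stand-ins for the log lines, not real log data.

```python
import re

hit = re.search(r'~jjk265', 'GET /~jjk265/cd.jpg HTTP/1.1')
miss = re.search(r'~jjk265', 'GET /~dbb212/mysong.mp3 HTTP/1.1')

print(hit is not None)         # True  -- a match object came back
print(miss)                    # None  -- no match
print(bool(hit), bool(miss))   # True False
```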




not for negating a search

As with any if test, the test can be negated with not. Now we're saying "if this pattern does not match".


# again, weblog is an open file containing lines like these:
#   66.108.19.165 - - [09/Jun/2003:19:56:33 -0400] "GET /~jjk265/cd.jpg HTTP/1.1" 200 175449
#   66.108.19.165 - - [09/Jun/2003:19:56:44 -0400] "GET /~dbb212/mysong.mp3 HTTP/1.1" 200 175449
#   66.108.19.165 - - [09/Jun/2003:19:56:45 -0400] "GET /~jjk265/cd2.jpg HTTP/1.1" 200 175449

# script snippet:
for line in weblog.readlines():
    if not re.search(r'~jjk265', line):
        print(line)                      # prints 1 of the above lines -- the one without jjk265




The raw string (r'')

The raw string is like a normal string, but it does not process escapes. An escaped character is one preceded by a backslash, which turns the combination into a special character. \n is the one we're familiar with -- the escaped n converts to a newline character, which marks the end of a line in a multi-line string.


A raw string wouldn't process the escape, so r'\n' is literally a backslash followed by an n.

var = "\n"            # one character, a newline
var2 = r'\n'          # two characters, a backslash followed by an n
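
Here is a small sketch of why the raw string matters for regexes in particular. In a normal string, Python turns '\b' into a backspace character before re ever sees it; in a raw string the backslash and the b arrive intact and re reads them as a word boundary. The sample sentence is just an illustration.

```python
import re

print(len("\n"))     # 1 -- a single newline character
print(len(r"\n"))    # 2 -- a backslash and an n

sentence = 'Regex is good.'
plain = re.search('\bis\b', sentence)    # pattern holds two backspace characters
raw = re.search(r'\bis\b', sentence)     # pattern holds two word boundaries

print(plain)         # None
print(raw.group())   # is
```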




The re Bestiary

We call it a "bestiary" because it contains many strange animals. Each of these animals takes the shape of a special character (like $, ^, |); or a regular text character that has been escaped (like \w, \d, \s); or a combination of characters in a group (like {2,3}, [aeiou]). Our bestiary can be summarized thusly:


Anchor Characters and the Boundary Character
(matches at start or end of string or of a word)
$, ^, \b
Character Classes
(match any of a group of characters)
\w, \d, \s, \W, \S, \D
Custom Character Classes
(a user-defined group of characters to match)
[aeiou], [a-zA-Z]
The Wildcard
(matches on any character except newline)
.
Quantifiers
(specify how many characters to match)
+, *, ?
Custom Quantifiers
(user-defined how many characters to match)
{2,3}, {2,}, {2}
Groupings
(extract text, quantify group, or match on alternates)
(parentheses groups)





Patterns can match anywhere, but must match on consecutive characters

Patterns can match any string of consecutive characters. A match can occur anywhere in the string.


import re

str1 = 'hello there'
str2 = 'why hello there'
str3 = 'hel lo'

if re.search(r'hello', str1):  print('matched')   # matched
if re.search(r'hello', str2):  print('matched')   # matched
if re.search(r'hello', str3):  print('matched')   # does not match

Note that 'hello' matches at the start of the first string and the middle of the second string. But it doesn't match in the third string, even though all the characters we are looking for are there. This is because the space in str3 is unaccounted for - always remember - matches take place on consecutive characters.





Anchors ^ and $

Our match may require that the search text appear at the beginning or end of a string. The anchor characters can require this.


This program uses the $ end-anchor to list only those files in the directory that end in .txt:

import os, re
for filename in os.listdir(r'/path/to/directory'):
    if re.search(r'\.txt$', filename):     # look for '.txt' at end of filename
        print(filename)

This program uses the ^ start-anchor to print all the lines in the file that don't begin with a hash mark:

for text_line in open(r'/path/to/file.py'):
    if not re.search(r'^#', text_line):        # look for '#' at start of line
        print(text_line)

When they are used as anchors, we will always expect ^ to appear at the start of our pattern, and $ to appear at the end.





The \b Word Boundary

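A "word boundary" is the position between a "word" character (a letter, digit or underscore) and anything else: a space, a punctuation character, or the start or end of the string. \b matches that position without consuming a character.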


We can use \b to search for a whole word.


sen_with_is = 'Regex is good.'

print(re.search(r'\bis\b', sen_with_is))       # prints a match object: 'is' is a whole word


sen_without_is = 'This regex fails.'

print(re.search(r'\bis\b', sen_without_is))    # prints None: 'is' appears only inside 'This'

Keep in mind however that punctuation also counts as a word boundary. So even middle-of-word characters like ' (apostrophe) will match a boundary:

sen_with_can = 'Yes we can!'

print(re.search(r'\bcan\b', sen_with_can))     # prints a match object


sen_without_can = "No we can't!"

print(re.search(r'\bcan\b', sen_without_can))  # also a match object: the apostrophe ends the "word" can




Built-In Character Classes

A character class is a special regex entity that will match on any of a set of characters. The three built-in character classes are these:


\d
[0-9] (Digits)
\w
[a-zA-Z0-9_] ("Word" characters -- letters, digits or underscores)
\s
[ \t\n\r\f\v] ('Whitespace' characters -- spaces, tabs, newlines, and rarer ones such as carriage returns)


So a \d will match on a 5, 9, 3, etc.; a \w will match on any of those, or on a, Z, _ (underscore). Keep in mind that although they match on any of several characters, a single instance of a character class matches on only one character. For example, a \d will match on a single digit like '5', but it won't match on both characters in '55'. To match on 55, you could say \d\d.
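
The one-character rule can be checked directly. This is a minimal sketch using .group() with no argument, which returns the whole of the matched text.

```python
import re

one = re.search(r'\d', '55')
two = re.search(r'\d\d', '55')
none = re.search(r'\d\d', '5')

print(one.group())   # 5   -- a single \d takes only the first digit
print(two.group())   # 55  -- two \d's take both
print(none)          # None -- there is no second digit to match
```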





Built-in Character Classes: digit

The \d character class matches on any digit. This example lists only those files with names formatted with a particular syntax -- YYYY-MM-DD.txt:


import re
dirlist = ('.', '..', '2010-12-15.txt', '2010-12-16.txt', 'testfile.txt')
for filename in dirlist:
    if re.search(r'^\d\d\d\d-\d\d-\d\d\.txt$', filename):
        print(filename)

Here's another example, validation: this regex uses the pattern ^\d\d\d\d$ to check to see that the user entered a four-digit year:

import re
answer = input("Enter your birth year in the form YYYY\n")
if re.search(r'^\d\d\d\d$', answer):
    print("Your birth year is ", answer)
else:
    print("Sorry, that was not YYYY")




Built-in Character Classes: "word" characters

A "word" character casts a wider net: it will match on any letter, digit or underscore.


In this example, we require the user to enter a username of exactly five "word" characters:

username = input('Please enter a username: ')
if not re.search(r'^\w\w\w\w\w$', username):
    print("use five numbers, letters, or underscores\n")

As you can see, the anchors prevent the input from exceeding 5 characters.





Built-in Character Classes: spaces

A space character class matches on any whitespace character, most commonly a space (' '), a newline ('\n') or a tab ('\t'). This program searches for whitespace anywhere in the string; if the search succeeds, the password is rejected:


new_password = input('Please enter a password (no spaces):  ')
if re.search(r'\s', new_password):
    print("password must not contain spaces")

Note in particular that the regex pattern \s is not anchored anywhere. So the regex will match if a space occurs anywhere in the string. You may also reflect that we treat spaces pretty roughly - always stripping them off. They always get in the way! And they're invisible, too, and still we feel the need to persecute them. What a nuisance.





Inverse Character Classes

Each built-in character class has an uppercase inverse, which matches on anything that is not in the usual character set.


Not a digit: \D


So \D matches on letters, underscores, special characters - anything that is not a digit. This program checks for a non-digit in the user's account number:

account_number = input('Account number:  ')
if re.search(r'\D', account_number):
    print("account number must be all digits!")

Not a "word" character: \W


Here's a username checker, which simply looks for a non-"word" character:

username = input('Username: ')
if re.search(r'\W', username):
    print("username must be only letters, numbers, and underscores")

Not a space character: \S


These two regexes check for a non-space at the start and end of the string:

sentence = input('Enter a sentence: ')
if re.search(r'^\S', sentence) and re.search(r'\S$', sentence):
    print("the sentence does not begin or end with a space, tab or newline.")




Custom Character Classes

Consider this table of character classes and the list of characters they match on:


digit class    \d    [0123456789] or [0-9]
"word" class   \w    [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] or [a-zA-Z0-9_]
space class    \s    [ \t\n\r\f\v]


In fact, the bracketed ranges can be used to create our own character classes. We simply place members of the class within the brackets and use it in the same way we might use \d or the others.


A custom class can contain a range of characters. This example looks for letters only (there is no built-in class for letters):

import re
import sys
ui = input("please enter a username, starting with a letter:  ")
if not re.search(r'^[a-zA-Z]', ui):
    sys.exit("invalid user name entered")

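This custom class [.,;:?!-] matches on any one of these punctuation characters (the hyphen is placed last so that it is taken literally rather than as a range). This example finds punctuation at the end of each word and strips it off, one character at a time: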

import re
text_line = 'Will I?  I will.  Today, tomorrow; yesterday and before that.'
for word in text_line.split():
    while re.search(r'[.,;:?!-]$', word):
        word = word[:-1]
    print(word)




Inverse Custom Character Classes

Like \S for \s, the inverse character class matches on anything not in the list. It is designated with a caret just inside the open bracket:


import re
for text_line in open('unknown_text.txt'):
    for word in text_line.split():
        while re.search(r'[^a-zA-Z]$', word):
            word = word[:-1]
        print(word)

It would be easy to confuse the caret at the start of a pattern with the caret at the start of a custom character class -- just keep in mind that one appears at the very start of the pattern, and the other at the start of the bracketed list.
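
Both uses of the ^ character can appear in one pattern. The sketch below keeps only the strings whose first character is not a hash mark; the sample lines are made up for illustration.

```python
import re

lines = ['# a comment', 'x = 1', '  # indented comment']

# first ^ : anchor, "at the start of the string"
# second ^: inside the brackets, "any character except #"
pattern = r'^[^#]'

kept = [line for line in lines if re.search(pattern, line)]
print(kept)   # ['x = 1', '  # indented comment']
```

Note that the indented comment is kept: its first character is a space, not a hash mark.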





The Wildcard (.)

The ultimate character class, it matches on every character except for a newline. (We might surmise this is because we are often working with line-oriented input, with pesky newlines at the end of every line. Not matching on them means we never have to worry about stripping or watching out for newlines.)

import re
username = input('5-character username only please: ')
if not re.search(r'^.....$', username):   # five dots here
    print("you can use any characters except newline, but there must "
          "be five of them.\n")




Quantifiers: specifies how many to match on

A quantifier appears immediately after a character, character class, or grouping (coming up). It describes how many of the preceding characters there may be in our matched text.


We can say three digits (\d{3}), between 1 and 3 "word" characters (\w{1,3}), one or more letters ([a-zA-Z]+), zero or more spaces (\s*), one or more x's (x+). Anything that matches on a character can be quantified.

+      : 1 or more
*      : 0 or more
?      : 0 or 1
{3,10} : between 3 and 10
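
To see the quantifiers side by side, this sketch applies each one to the letter a in the pattern c-a-t and reports which sample strings match. The sample strings and the results dictionary are made up for the illustration.

```python
import re

samples = ['ct', 'cat', 'caat', 'caaaat']

results = {}
for quantifier in ['+', '*', '?', '{2,3}']:
    pattern = r'^ca' + quantifier + r't$'      # e.g. ^ca+t$
    results[quantifier] = [s for s in samples if re.search(pattern, s)]
    print(quantifier, results[quantifier])

# +      ['cat', 'caat', 'caaaat']
# *      ['ct', 'cat', 'caat', 'caaaat']
# ?      ['ct', 'cat']
# {2,3}  ['caat']
```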

In this example directory listing, we are interested only in files with the pattern config_ followed by an integer of any size. We know that there could be a config_1.txt, a config_12.txt, or a config_120.txt. So, we simply specify "one or more digits":

import re
filenames = ['config_1.txt', 'config_10.txt', 'notthis.txt', '.', '..']
wanted_files = []
for file in filenames:
    if re.search(r'^config_\d+\.txt$', file):
        wanted_files.append(file)

Here, we validate user input to make sure it matches the pattern for a valid NYU Net ID. The pattern for an NYU Net ID is: two or three letters followed by one or more digits:

import re
ui = input("please enter your net id:  ")
if not re.search(r'^[A-Za-z]{2,3}\d+$', ui):
    print("that is not valid NYU Net ID!")

A simple email address is one or more "word" characters, followed by an @ sign, followed by one or more "word" characters, followed by a period, followed by 2 or more letters:

import re
email_address = input('Email address:  ')
if re.search(r'^\w+@\w+\.[A-Za-z]{2,}$', email_address):
    print("email address validated")

Of course email addresses can be more complicated than this - but for this exercise it works well.





flags: re.IGNORECASE

We can modify our matches with qualifiers called flags. The re.IGNORECASE flag will match any letters, whether upper or lowercase. In this example, extensions may be upper or lowercase - this file matcher doesn't care!


import re
dirlist = ('thisfile.jpg', 'thatfile.txt', 'otherfile.mpg', 'myfile.TXT')
for file in dirlist:
    if re.search(r'\.txt$', file, re.IGNORECASE):   #'.txt' or '.TXT'
        print(file)

The flag is passed as the third argument to search, and can also be passed to other re search methods.
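
A brief sketch of the flag in action. re.I is the short alias for re.IGNORECASE, and the flag can also be passed by keyword as flags=; the filename is taken from the example above.

```python
import re

same_flag = (re.I == re.IGNORECASE)
without_flag = re.search(r'\.txt$', 'myfile.TXT')
with_flag = re.search(r'\.txt$', 'myfile.TXT', re.I)
with_keyword = re.search(r'\.txt$', 'myfile.TXT', flags=re.IGNORECASE)

print(same_flag)                  # True
print(without_flag)               # None -- 'TXT' is not 'txt'
print(with_flag.group())          # .TXT
print(with_keyword.group())       # .TXT
```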





re.search(), re.compile() and the compile object

re.search() is the one-step method we've been using to test matching. Actually, regex matching is done in two steps: compiling and searching. re.search() conveniently puts the two together.


In some cases, a pattern should be compiled first before matching begins. This would be useful if the pattern is to be matched on a great number of strings, as in this weblog example:

import re
access_log = '/home1/d/dbb212/public_html/python/examples/access_log'
weblog = open(access_log)
patternobj = re.compile(r'edg205')
for line in weblog.readlines():
    if patternobj.search(line):
        print(line, end='')      # line already ends with a newline
weblog.close()

The pattern object is returned from re.compile, and can then be called with search. Here we're calling search repeatedly, so it is likely more efficient to compile once and then search with the compiled object.





Grouping for Alternates: Vertical Bar

We can group several characters together with parentheses. The parentheses do not affect the match, but they do designate a part of the matched string to be handled later. We do this to allow for alternate matches, for quantifying a portion of the pattern, or to extract text.


Inside a group, the vertical bar separates the allowable alternates. In this example, the argument must be Q1, Q2, Q3 or Q4, followed by a dash and a four-digit year; because of the anchors, no other characters are allowed:

import re
import sys

program_arg = sys.argv[1]
if not re.search(r'^Q(1|2|3|4)\-\d{4}$', program_arg):
    sys.exit("quarter argument must match the pattern 'Q[num]-YYYY' "
         "where [num] is 1-4 and YYYY is a 4-digit year")




Grouping for Quantifying

Let's expand our email address pattern and make it possible to match on any of these examples:

good_emails = [
    'joe@apex.com',
    'joe.wilson@apex.com',
    'joe.wilson@eng.apex.com',
    'joe.g.zebulon.wilson@my.subdomain.eng.apex.com'
]

And let's make sure our regex fails on any of these:

bad_emails = [
    '.joe@apex.com',          # leading period
    'joe.wilson@apex.com.',   # trailing period
    'joe..wilson@apex.com'    # two periods together
]

How can we include the period while making sure it doesn't appear at the start or end, or repeated, as it does in the bad_emails list?


Look for a repeating pattern of groups of characters in the good_emails. In these combinations, we are attempting to account for multi-part names and for subdomains, which could conceivably be chained together. In this case, there is a pattern joe., that we can match with \w+\. (the period must be escaped, because an unescaped period is the wildcard). If we see that this may repeat, we can group the pattern and apply a quantifier to it:

import re
for address in good_emails + bad_emails:                # concatenates two lists
    if re.search(r'^(\w+\.)*\w+@(\w+\.)+[A-Za-z]{2,}$', address):
        print("{0}:  good".format(address))
    else:
        print("{0}:  bad".format(address))




Grouping for Extraction: the matchobject group() method.

We use the group() method of the match object to extract the text that matched the group.


Here's an example, using our log file. What if we wanted to capture the last two numbers (the status code and the number of bytes served), and add up the bytes?

log_lines = [
'66.108.19.165 - - [09/Jun/2003:19:56:33 -0400] "GET /~jjk265/cd.jpg HTTP/1.1" 200 175449',
'216.39.48.10 - - [09/Jun/2003:19:57:00 -0400] "GET /~rba203/about.html HTTP/1.1" 200 1566',
'216.39.48.10 - - [09/Jun/2003:19:57:16 -0400] "GET /~dd595/frame.htm HTTP/1.1" 400 1144'
]

import re
bytes_sum = 0
for line in log_lines:
    matchobj = re.search(r'(\d+) (\d+)$', line) # last two numbers in line
    status_code = matchobj.group(1)
    bytes = matchobj.group(2)
    bytes_sum += int(bytes)                     # sum the bytes




Understanding "AttributeError: 'NoneType' object has no attribute 'group'"

It means your match was not successful.


re.search() is expected to return an re.Match object; we can then call .group() on the object to retrieve extracted text.


In the below example, note that the 2nd line does not have a number at the end, so the regex will not match there and re.search() will return None:

log_lines = [
'"GET /~jjk265/cd.jpg HTTP/1.1" 200 175449',
'"GET /~rba203/about.html HTTP/1.1" 200 -',    # no number at the end
'"GET /~dd595/frame.htm HTTP/1.1" 400 1144'
]

for line in log_lines:
    matchobj = re.search(r'\d+\s+(\d+)$', line)     # if no match, returns None
    bytes = matchobj.group(1)                       # AttributeError:  'NoneType' object has no attribute 'group'

If re.search() does not match and returns None, calling .group() on None will raise an AttributeError (in other words, None does not have a .group() method). This can be particularly confusing when matching on numerous lines of text, since most may match. For the one line that does not match, however, the exception will be raised.


It can be difficult to pinpoint the unmatching line until we use a print statement:

for line in log_lines:
    print(f'about to match on line:  {line}')
    matchobj = re.search(r'\d+\s+(\d+)$', line)

We may see many lines printed, but the last one printed before the error occurs will be the first one that did not match.
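
Once the failing line is found, one common way to handle it is to test the result of re.search() before calling .group(). The sketch below is one possible approach, not the only one: it uses the same three sample lines, skips any line that does not match, and remembers it in a list (skipped is a name introduced here for illustration).

```python
import re

log_lines = [
    '"GET /~jjk265/cd.jpg HTTP/1.1" 200 175449',
    '"GET /~rba203/about.html HTTP/1.1" 200 -',
    '"GET /~dd595/frame.htm HTTP/1.1" 400 1144',
]

bytes_sum = 0
skipped = []                                  # lines the pattern did not match
for line in log_lines:
    matchobj = re.search(r'\d+\s+(\d+)$', line)
    if matchobj is None:                      # no match: do not call .group()
        skipped.append(line)
        continue
    bytes_sum += int(matchobj.group(1))

print(bytes_sum)       # 176593
print(len(skipped))    # 1
```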





groups()

groups() returns all grouped matches.


If you wish to grab all the matches into a tuple rather than call them by number, use groups(). You can then read variables from the tuple, or assign groups() to named variables.


In this example, the Salesforce Customer Relationship Management system has a field in one of its objects that discounts certain revenue and explains the reason. Our job is to extract the code and the reason from the string:

import re
my_GL_codes = [
    '12520 - Allowance for Customer Concessions',
    '12510 - Allowance for Unmet Mins',
    '40000 - Platform Revenue',
    '21130 - Pre-paid Revenue',
    '12500 - Allowance for Doubtful Accounts'
]
for field in my_GL_codes:
    codeobj = re.search(r'^(\d+)\s*\-\s*(.+)$', field)      # GL code
    name_tuple = codeobj.groups()
    print('tuple from groups:  ', name_tuple)
    code, reason = name_tuple
    print("extracted:  '{0}':  '{1}'".format(code, reason))
    print()




findall() for multiple matches

findall() matches the same pattern repeatedly, returning all matched text within a string.


findall() with a groupless pattern


Usually re tries to match a pattern once -- and after it finds the first match, it quits searching. But we may want to find as many matches as we can -- and return the entire set of matches in a list. findall() lets us do that:

text = "There are seven words in this sentence"
words = re.findall(r'\w+', text)
print(words)  # ['There', 'are', 'seven', 'words', 'in', 'this', 'sentence']

This program prints a list of the words in the sentence. The pattern \w+ is applied again and again, each time to the text remaining after the last match. This pattern could be used as a word counting algorithm (we would count the elements in words), except for words with internal punctuation, such as can't, which would be split in two.


findall() with groups


When a match pattern contains more than one grouping, findall returns a list of tuples, one tuple per match:

text = "High: 33, low: 17"
temp_tuples = re.findall(r'(\w+):\s+(\d+)', text)
print(temp_tuples)                       # [('High', '33'), ('low', '17')]
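
The shape of what findall() returns depends on how many groups the pattern has. This sketch runs three variations of the same pattern over the same text: no groups gives the whole matched text, one group gives just that group's text, and two or more groups give tuples.

```python
import re

text = "High: 33, low: 17"

no_groups = re.findall(r'\w+:\s+\d+', text)
one_group = re.findall(r'\w+:\s+(\d+)', text)
two_groups = re.findall(r'(\w+):\s+(\d+)', text)

print(no_groups)    # ['High: 33', 'low: 17']
print(one_group)    # ['33', '17']
print(two_groups)   # [('High', '33'), ('low', '17')]
```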




re.sub() for substitutions

re.sub() replaces matched text with replacement text.


Regular expressions are used for matching so that we may inspect text. But they can also be used for substitutions, meaning that they have the power to modify text as well.


This example replaces Windows '\r\n' line endings with Unix '\n'.

text = re.sub(r'\r\n', '\n', text)

Here's another simple example:

string = "My name is David"
string = re.sub('David', 'John', string)

print(string)                            # 'My name is John'
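
The examples above replace fixed text, but the first argument to re.sub() can be any pattern. As a small sketch with made-up strings, this collapses each run of whitespace to a single space, then uses the optional count argument to limit how many replacements are made.

```python
import re

messy = 'too    many \t spaces\n here'
tidy = re.sub(r'\s+', ' ', messy)             # each run of whitespace -> one space
print(tidy)                                   # too many spaces here

masked = re.sub(r'\d', '#', 'a1b2c3', count=2)   # replace only the first two digits
print(masked)                                 # a#b#c3
```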




re.split(): split on a pattern of characters

Sometimes the 'split characters' have variations to consider.


In the example below, the user has been asked to enter numbers separated by commas. However, we don't know whether they will introduce spaces between them. The most elegant way to handle this is to split on a pattern:


import re

user_list = input('please enter numbers separated by commas: ')  # str, '5, 9,3,  10'

# split on zero or more spaces, followed by comma, followed by zero or more spaces
numbers = re.split(r'\s*,\s*', user_list)                        # ['5', '9', '3', '10']
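
For comparison, here is a sketch of the same input handled by the plain string split() and by re.split(). The input string is hard-coded in place of input() so the result is repeatable.

```python
import re

user_list = '5, 9,3,  10'

plain = user_list.split(',')
print(plain)                          # ['5', ' 9', '3', '  10'] -- spaces left behind

numbers = re.split(r'\s*,\s*', user_list)
print(numbers)                        # ['5', '9', '3', '10']

as_ints = [int(n) for n in numbers]
print(as_ints)                        # [5, 9, 3, 10]
```

The pattern only removes whitespace around the commas; whitespace at the very start or end of the input would still need strip().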




"Whole File" Matching: Matching on Contents of a File

An entire file as a single string opens up additional matching possibilities.


This example opens and reads a web page (which we might have retrieved with a function like urllib.request.urlopen), then looks to see if the word "advisory" appears in the text. If it does, it prints the page:


file = open('weather-ny.html')
text = file.read()
if re.search(r'advisory', text, re.I):
    print("weather advisory:  ", text)




"Whole File" Matching: re.MULTILINE (^ and $ can match at start or end of line)

Within a file of many lines, we can specify start or end of a single line.


We have been working with text files primarily in a line-oriented (or, in database terminology, record-oriented) way, and regexes are no exception, since most file data is organized this way. However, it can be useful to dispense with looping and use regexes to match within an entire file, read into a string variable with read().


In this example, we could certainly use a loop and split() to get the info we want. But with a regex we can grab it straight from the file with a single search:

# passwd file:
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false

# python script:
import re
passwd_text = open('/etc/passwd').read()
mobj = re.search(r'^root:[^:]+:[^:]+:[^:]+:([^:]+):([^:]+)', passwd_text, re.MULTILINE)
if mobj:
    info = mobj.groups()
    print("root:  Name %s, Home Dir %s" % (info[0], info[1]))

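We can even use findall to extract all the information from a file. Keep in mind that the reading and the matching are still done in just two lines. Because the pattern has two groups, findall returns a list of 2-tuples, which dict() turns into a mapping of username to home directory: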

import re
passwd_text = open('/etc/passwd').read()
lot = re.findall(r'^(\w+):[^:]+:[^:]+:[^:]+:[^:]+:([^:]+)', passwd_text, re.MULTILINE)

mydict = dict(lot)

print(mydict)
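
To isolate what re.MULTILINE changes, this sketch runs the same anchored patterns with and without the flag. The text is the three sample passwd lines from above, hard-coded as a string so that no file is needed.

```python
import re

passwd_text = (
    'nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false\n'
    'root:*:0:0:System Administrator:/var/root:/bin/sh\n'
    'daemon:*:1:1:System Services:/var/root:/usr/bin/false\n'
)

# without the flag, ^ matches only at the very start of the whole string
no_flag = re.search(r'^root:', passwd_text)
print(no_flag)                                   # None

# with the flag, ^ also matches at the start of every line
with_flag = re.search(r'^root:', passwd_text, re.MULTILINE)
print(with_flag.group())                         # root:

users = re.findall(r'^(\w+):', passwd_text, re.MULTILINE)
print(users)                                     # ['nobody', 'root', 'daemon']

# likewise, $ matches at the end of every line
shells = re.findall(r':([^:\n]+)$', passwd_text, re.MULTILINE)
print(shells)                                    # ['/usr/bin/false', '/bin/sh', '/usr/bin/false']
```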




"Whole File" Matching: re.DOTALL(allow the wildcard (.) to match on newlines)

Matching the wildcard on newlines may be needed for a multi-line file string.


Normally, the wildcard doesn't match on newlines. When working with whole files, we may want to grab text that spans multiple lines, using a wildcard.


# search file sample.txt
some text we don't want
==start text==
this is some text that we do want.
the extracted text should continue,
including just about any character,
until we get to
==end text==
other text we don't want

# python script:
import re
text = open('sample.txt').read()
matchobj = re.search(r'==start text==(.+)==end text==', text, re.DOTALL)
print(matchobj.group(1))
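
The effect of re.DOTALL is easiest to see by running the same pattern with and without it. This sketch uses a shortened, hard-coded version of the sample text rather than reading sample.txt.

```python
import re

text = (
    "some text we don't want\n"
    "==start text==\n"
    "line one\n"
    "line two\n"
    "==end text==\n"
    "other text we don't want\n"
)
pattern = r'==start text==(.+)==end text=='

# without the flag, . stops at the first newline, so the pattern cannot match
no_flag = re.search(pattern, text)
print(no_flag)                        # None

# with the flag, . crosses newlines and the group spans both lines
matchobj = re.search(pattern, text, re.DOTALL)
print(repr(matchobj.group(1)))        # '\nline one\nline two\n'
print(matchobj.group(1).strip())      # the two lines without the surrounding newlines
```

The extracted text includes the newlines just inside the two markers, which is why strip() is often applied to the result.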



