Python - RegEx

A RegEx, or regular expression, is a sequence of characters that defines a search pattern. It is used to check whether a string contains a specified search pattern. Consider the following search pattern:

^P....n$

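The above search pattern checks whether a string consists of exactly six characters, starting with P and ending with n. The ^ and $ anchor the pattern to the start and the end of the string, and each . matches any single character except a newline.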

Please note that Python has a built-in RegEx module called re, which needs to be imported to work with regular expressions.

Example:

In the example below, the ^P....n$ search pattern is checked against the given string called MyString.

import re

MyString = "Python"
#six characters, starts with P and ends with n
x = re.search("^P....n$", MyString)
if x:
  print("Pattern found.")
else:
  print("Pattern not found.")


MyString = "Python!."
#eight characters, so the anchored pattern cannot match
x = re.search("^P....n$", MyString)
if x:
  print("Pattern found.")
else:
  print("Pattern not found.")

The output of the above code will be:

Pattern found.
Pattern not found.

Metacharacters

Metacharacters are special characters that the RegEx engine interprets differently from ordinary characters. The metacharacters are:

Character	Description	Example
[]	To specify a set of characters	"[a-z]"
.	To specify any character except new line	"He..o"
^	To specify starts with character(s)	"^Hello"
$	To specify ends with character(s)	"World$"
*	To check zero or more occurrences of specified character(s)	"Helx*"
+	To check one or more occurrences of specified character(s)	"Helx+"
{}	To check the specified number of occurrences of specified character(s)	"Hel{2}"
?	To check zero or one occurrences of specified character(s)	"He?l"
|	To specify either or	"go|come"
()	To group sub-patterns	"(x|y|z)abc"
\	To escape various characters including all metacharacters	"\$"
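
As a rough illustration of how a few of these metacharacters behave, the sketch below runs some of the example patterns from the table against sample strings. The sample strings and variable names are made up for this illustration.

```python
import re

#"He..o": any two characters between "He" and "o"
dot_match = re.search("He..o", "Hello World")
print(dot_match.group())          # Hello

#"^Hello" and "World$": anchors for the start and the end of the string
starts = re.search("^Hello", "Hello World")
ends = re.search("World$", "Hello World")
print(bool(starts), bool(ends))   # True True

#"Helx*": "Hel" followed by zero or more "x"
star_matches = re.findall("Helx*", "Hel Helx Helxx")
print(star_matches)               # ['Hel', 'Helx', 'Helxx']

#"go|come": either alternative
either = re.findall("go|come", "come and go")
print(either)                     # ['come', 'go']

#"\$": an escaped dollar sign matches a literal $ character
dollar = re.findall(r"\$", "Price: $5 or $10")
print(dollar)                     # ['$', '$']
```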

Special Sequences

A special sequence is a backslash (\) followed by one of the characters in the list below, and it has a special meaning:

Character	Description	Example
\A	Matches if the specified characters are at the beginning of the string.	"\AThe"
\b	Matches if the specified characters are at the beginning or at the end of a word.	"\bain" "ain\b"
\B	Matches if the specified characters are present, but NOT at the beginning (or at the end) of a word.	"\Bain" "ain\B"
\d	Matches any digit character (such as 0-9).	"\d"
\D	Matches any character that is NOT a digit.	"\D"
\s	Matches any white space character.	"\s"
\S	Matches any character that is NOT a white space character.	"\S"
\w	Matches any word character (letters from a to z and A to Z, digits from 0-9, and the underscore _ character).	"\w"
\W	Matches any character that is NOT a word character.	"\W"
\Z	Matches if the specified characters are at the end of the string.	"rain\Z"
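
The following sketch tries several of these special sequences on one sample string. The sample text and variable names are assumptions made for illustration. The patterns are written as raw strings (r"...") so that Python does not interpret the backslashes itself before the RegEx engine sees them.

```python
import re

text = "The rain in Spain costs 25 euros"

#\d matches each digit separately
digits = re.findall(r"\d", text)
print(digits)                     # ['2', '5']

#\A anchors the match at the beginning of the string
at_start = re.search(r"\AThe", text)
print(bool(at_start))             # True

#\b: no word in the text begins with "ain", but two words end with it
begins_with_ain = re.findall(r"\bain", text)
ends_with_ain = re.findall(r"ain\b", text)
print(begins_with_ain)            # []
print(ends_with_ain)              # ['ain', 'ain']

#\s matches each white space character
spaces = re.findall(r"\s", text)
print(len(spaces))                # 6

#\Z anchors the match at the end of the string
at_end = re.search(r"euros\Z", text)
print(bool(at_end))               # True
```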

Sets

A set is a collection of characters inside a pair of square brackets [] with a special meaning:

Set	Description
[abc]	Matches if one of the specified characters (a, b, or c) is present.
[a-d]	Matches if any lower case character alphabetically between a and d is present.
[^abc]	Matches any character EXCEPT a, b, and c.
[123]	Matches if any of the specified digits (1, 2, or 3) is present.
[0-9]	Matches any digit between 0 and 9.
[1-8][0-9]	Matches any two-digit number from 10 to 89.
[a-zA-Z]	Matches any character alphabetically between a and z, lower case or upper case.
[+]	In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string.
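
A minimal sketch of some of these sets in use is shown below. The sample string and variable names are made up for this illustration.

```python
import re

text = "Room 42, Floor 7 + basement"

#[abc]: any single a, b, or c
abc = re.findall("[abc]", text)
print(abc)            # ['b', 'a']

#[0-9]: any single digit
single_digits = re.findall("[0-9]", text)
print(single_digits)  # ['4', '2', '7']

#[1-8][0-9]: a two-digit number from 10 to 89
two_digits = re.findall("[1-8][0-9]", text)
print(two_digits)     # ['42']

#[+]: inside a set, + is an ordinary character
plus = re.findall("[+]", text)
print(plus)           # ['+']

#[^a-zA-Z ]: anything that is neither a letter nor a space
others = re.findall("[^a-zA-Z ]", text)
print(others)         # ['4', '2', ',', '7', '+']
```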

The findall() Function

The findall() function returns a list containing all matches. The list contains the matches in the order they are found. If no matches are found, an empty list is returned.

Example:

In the example below, the findall() function is used to find all matches of comma (,) and ampersand (&) in the given string.

import re

MyString = "31 January, 28 February, 31 March"

#find all matches of comma (,)
x = re.findall(",", MyString)
print(x)

#find all matches of ampersand (&)
y = re.findall("&", MyString)
print(y)

The output of the above code will be:

[',', ',']
[]
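
The findall() function is not limited to literal characters. As a small sketch, the same string can be searched with patterns built from the special sequences and sets described earlier. The patterns and variable names here are chosen only for illustration.

```python
import re

MyString = "31 January, 28 February, 31 March"

#find all runs of one or more digits
numbers = re.findall(r"\d+", MyString)
print(numbers)   # ['31', '28', '31']

#find all words that start with an upper case letter
months = re.findall("[A-Z][a-z]+", MyString)
print(months)    # ['January', 'February', 'March']
```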

The search() Function

The search() function is used to search the string for a match, and returns a Match object if there is a match. If there is more than one match, only the first occurrence of the match is returned. In case of no match, None is returned.

Example:

In the example below, the search() function is used to find the first match of comma (,) and ampersand (&) in the given string.

import re

MyString = "31 January, 28 February, 31 March"

#find first match of comma (,)
x = re.search(",", MyString)
print("First comma starting point:", x.start())

#find first match of ampersand (&)
x = re.search("&", MyString)
print("First ampersand starting point:", x)

The output of the above code will be:

First comma starting point: 10
First ampersand starting point: None
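
Because search() returns None when nothing matches, calling a method such as start() on the result of a failed search raises an error. The sketch below shows one way to guard against that, and also reads the matched text and its position from the Match object with group() and span(). The helper first_match_text is a hypothetical name introduced only for this illustration.

```python
import re

MyString = "31 January, 28 February, 31 March"

#hypothetical helper: return the matched text, or None when nothing matches
def first_match_text(pattern, text):
  m = re.search(pattern, text)
  return m.group() if m else None

#first occurrence of digits, a space, and a capitalized word
m = re.search(r"\d+ [A-Z][a-z]+", MyString)
print(m.group())   # 31 January
print(m.span())    # (0, 10)

#no ampersand in the string, so the helper returns None
print(first_match_text("&", MyString))   # None
```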

The split() Function

The split() function returns a list where the string has been split at each match. The number of splits can be controlled by specifying the maxsplit parameter.

Example:

In the example below, the split() function returns a list where the string has been split at each match.

import re

MyString = "31 January, 28 February, 31 March"

#create list containing elements split using comma (,)
x = re.split(",", MyString)
print("The List contains: ", x)

#create list containing elements split using comma (,)
#maximum number of splits is specified as 1
y = re.split(",", MyString, maxsplit=1)
print("The List contains: ", y)

The output of the above code will be:

The List contains:  ['31 January', ' 28 February', ' 31 March']
The List contains:  ['31 January', ' 28 February, 31 March']
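
Note that the elements above keep the space that followed each comma. As a possible refinement, the separator can be a pattern rather than a single character. The sketch below, with variable names chosen for illustration, splits on a comma followed by any amount of white space.

```python
import re

MyString = "31 January, 28 February, 31 March"

#split on a comma followed by zero or more white space characters
parts = re.split(r",\s*", MyString)
print(parts)        # ['31 January', '28 February', '31 March']

#same pattern, but split only at the first match
first_only = re.split(r",\s*", MyString, maxsplit=1)
print(first_only)   # ['31 January', '28 February, 31 March']
```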

The sub() Function

The sub() function is used to replace the matches with the specified text. The number of replacements can be controlled by specifying the count parameter.

Example:

In the example below, the sub() function is used to replace the comma (,) with an asterisk (*).

import re

MyString = "31 January, 28 February, 31 March"

#replacing comma (,) with asterisk (*)
x = re.sub(",", "*", MyString)
print("The String contains: ", x)

#replacing comma (,) with asterisk (*)
#maximum number of replacements is specified as 1
y = re.sub(",", "*", MyString, count=1)
print("The String contains: ", y)

The output of the above code will be:

The String contains:  31 January* 28 February* 31 March
The String contains:  31 January* 28 February, 31 March
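
The first argument of sub() can also be a pattern instead of a literal character. The following sketch, using patterns and variable names picked only for illustration, replaces each comma together with the white space after it, and then replaces only the first run of digits.

```python
import re

MyString = "31 January, 28 February, 31 March"

#replace a comma and any following white space with "; "
semicolons = re.sub(r",\s*", "; ", MyString)
print(semicolons)    # 31 January; 28 February; 31 March

#replace only the first run of digits with "#"
first_number = re.sub(r"\d+", "#", MyString, count=1)
print(first_number)  # # January, 28 February, 31 March
```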