Python - RegEx
A RegEx, or Regular Expression, is a sequence of characters that defines a search pattern. It is used to check whether a string contains the specified search pattern. Consider the following search pattern:
^P....n$
The above search pattern checks whether a string is exactly six characters long, starts with P, and ends with n.
Python has a built-in RegEx module called re, which must be imported before working with Regular Expressions.
Example:
In the example below, the search pattern ^P....n$ is checked against the string MyString, first with a value that matches and then with one that does not.
    import re

    MyString = "Python"
    x = re.search("^P....n$", MyString)
    if x:
        print("Pattern found.")
    else:
        print("Pattern not found.")

    MyString = "Python!."
    x = re.search("^P....n$", MyString)
    if x:
        print("Pattern found.")
    else:
        print("Pattern not found.")
The output of the above code will be:
    Pattern found.
    Pattern not found.
MetaCharacters
Metacharacters are special characters that the RegEx engine interprets differently from ordinary characters. The metacharacters are:
Character | Description | Example |
---|---|---|
[] | Specifies a set of characters | "[a-z]" |
. | Matches any character except a newline | "He..o" |
^ | Matches at the start of the string | "^Hello" |
$ | Matches at the end of the string | "World$" |
* | Matches zero or more occurrences of the preceding character or group | "Helx*" |
+ | Matches one or more occurrences of the preceding character or group | "Helx+" |
{} | Matches the specified number of occurrences of the preceding character or group | "Hel{2}" |
? | Matches zero or one occurrence of the preceding character or group | "He?l" |
\| | Matches either the pattern on its left or the pattern on its right | "go\|come" |
() | Groups sub-patterns | "(x\|y\|z)abc" |
\ | Escapes a character, including any metacharacter, so that it is matched literally | "\$" |
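As a rough illustration of the table above, the sketch below tries several of the example patterns against sample strings. The sample strings and the helper name `matches` are assumptions made for this illustration only.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

# . matches any single character except a newline
dot = matches(r"He..o", "Hello")             # True
# ^ and $ anchor the pattern to the start and end of the string
starts = matches(r"^Hello", "Hello World")   # True
ends = matches(r"World$", "Hello World")     # True
# * accepts zero or more of the preceding character, + needs at least one
star = matches(r"Helx*", "Hello")            # True: zero "x" is acceptable
plus = matches(r"Helx+", "Hello")            # False: no "x" after "Hel"
# {2} requires exactly two of the preceding character
braces = matches(r"Hel{2}", "Hello")         # True: matches "Hell"
# | chooses between alternatives
either = matches(r"go|come", "come here")    # True
# \ escapes a metacharacter so that it is matched literally
dollar = matches(r"\$", "Price: $5")         # True

print(dot, starts, ends, star, plus, braces, either, dollar)
```

The patterns are written as raw strings (the r prefix) so that Python passes any backslash to the RegEx engine unchanged.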
Special Sequences
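A special sequence is a backslash (\) followed by a character that gives it a special meaning. In Python code, write these patterns as raw strings (for example r"\bain") so that the backslash reaches the RegEx engine unchanged. The special sequences are: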
Character | Description | Example |
---|---|---|
\A | Matches if the specified characters are at the beginning of the string. | r"\AThe" |
\b | Matches if the specified characters are at the beginning or at the end of a word. | r"\bain" r"ain\b" |
\B | Matches if the specified characters are present, but NOT at the beginning (or at the end) of a word. | r"\Bain" r"ain\B" |
\d | Matches any digit (a number from 0 to 9). | r"\d" |
\D | Matches any character that is NOT a digit. | r"\D" |
\s | Matches any white space character. | r"\s" |
\S | Matches any character that is NOT a white space character. | r"\S" |
\w | Matches any word character (letters from a to z and A to Z, digits from 0-9, and the underscore _ character). | r"\w" |
\W | Matches any character that is NOT a word character. | r"\W" |
\Z | Matches if the specified characters are at the end of the string. | r"rain\Z" |
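The following sketch shows a few of these special sequences in use. The sample sentence and variable names are assumptions chosen for illustration. The last call shows what can go wrong without a raw string: Python reads "\b" in an ordinary string as a backspace character.

```python
import re

text = "The rain in Spain stays at 42 degrees"

# \A anchors the match at the start of the string
at_start = re.search(r"\AThe", text) is not None    # True

# \b marks a word boundary: only words that END with "ain" qualify
words_ending_ain = re.findall(r"\w*ain\b", text)    # ['rain', 'Spain']

# \d matches one digit; + extends it to a run of digits
numbers = re.findall(r"\d+", text)                  # ['42']

# \s matches a single white space character
space_count = len(re.findall(r"\s", text))          # 7

# Without the r prefix, "\b" is a backspace, so nothing is found
no_raw = re.findall("\bain", text)                  # []

print(at_start, words_ending_ain, numbers, space_count, no_raw)
```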
Sets
A set is a collection of characters inside a pair of square brackets [] with a special meaning:
Set | Description |
---|---|
[abc] | Matches if any one of the specified characters (a, b, or c) is present. |
[a-d] | Matches if any lower case character, alphabetically between a and d is present. |
[^abc] | Matches for any character EXCEPT a, b, and c. |
[123] | Matches if any of the specified digits (1, 2, or 3) are present. |
[0-9] | Matches for any digit between 0 and 9. |
[1-8][0-9] | Matches for any two-digit number from 10 to 89. |
[a-zA-Z] | Matches for any character alphabetically between a and z, lower case or upper case. |
[+] | In sets, +, *, ., \|, (), $, and {} have no special meaning, so [+] means: return a match for any + character in the string. |
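A short sketch of the sets above, applied with findall() to a made-up sample string (the string and variable names are assumptions for illustration):

```python
import re

text = "Order 27 shipped to Zone b, bay 5+"

# [abc] matches any single a, b, or c
abc = re.findall(r"[abc]", text)                # ['b', 'b', 'a']

# [^a-zA-Z ] matches anything that is neither a letter nor a space
non_letters = re.findall(r"[^a-zA-Z ]", text)   # ['2', '7', ',', '5', '+']

# [1-8][0-9] matches two-digit numbers from 10 to 89
two_digit = re.findall(r"[1-8][0-9]", text)     # ['27']

# Inside a set, + is an ordinary character
plus_signs = re.findall(r"[+]", text)           # ['+']

print(abc, non_letters, two_digit, plus_signs)
```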
The findall() Function
The findall() function returns a list containing all matches. The list contains the matches in the order they are found. If no matches are found, an empty list is returned.
Example:
In the example below, the findall() function is used to find all matches of comma (,) and ampersand (&) in the given string.
    import re

    MyString = "31 January, 28 February, 31 March"

    # find all matches of comma (,)
    x = re.findall(",", MyString)
    print(x)

    # find all matches of ampersand (&)
    y = re.findall("&", MyString)
    print(y)
The output of the above code will be:
    [',', ',']
    []
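findall() becomes more useful with a pattern rather than a literal character. The sketch below reuses the same string; the patterns and variable names are assumptions for illustration. When a pattern contains more than one capturing group, findall() returns a list of tuples, one per match.

```python
import re

MyString = "31 January, 28 February, 31 March"

# every run of digits, in the order found
days = re.findall(r"\d+", MyString)
print(days)     # ['31', '28', '31']

# two capturing groups: each result is a (day, month) tuple
pairs = re.findall(r"(\d+) (\w+)", MyString)
print(pairs)    # [('31', 'January'), ('28', 'February'), ('31', 'March')]
```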
The search() Function
The search() function is used to search the string for a match, and returns a Match object if there is a match. If there is more than one match, only the first occurrence of the match is returned. In case of no match, None is returned.
Example:
In the example below, the search() function is used to find the first match of a comma (,) and of an ampersand (&) in the given string. Because there is no ampersand, the second call returns None.
    import re

    MyString = "31 January, 28 February, 31 March"

    # find first match of comma (,)
    x = re.search(",", MyString)
    print("First comma starting point:", x.start())

    # find first match of ampersand (&); no match, so x is None
    x = re.search("&", MyString)
    print("First ampersand starting point:", x)
The output of the above code will be:
    First comma starting point: 10
    First ampersand starting point: None
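Since search() returns None when nothing matches, calling a method such as start() on the result without checking it first raises an AttributeError. A minimal sketch of the usual guard follows; the pattern and variable names are assumptions for illustration.

```python
import re

MyString = "31 January, 28 February, 31 March"

# look for a number followed by a word starting with "F"
match = re.search(r"\d+ F\w+", MyString)
if match:
    found = match.group()   # the matched text: '28 February'
    span = match.span()     # (start, end) indexes: (12, 23)
    print(found, span)

# no ampersand in the string, so the result is None
missing = re.search("&", MyString)
if missing is None:
    print("No ampersand found.")
```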
The split() Function
The split() function returns a list where the string has been split at each match. The number of splits can be limited by specifying the maxsplit parameter.
Example:
In the example below, the split() function returns a list where the string has been split at each match.
    import re

    MyString = "31 January, 28 February, 31 March"

    # create a list of elements split at each comma (,)
    x = re.split(",", MyString)
    print("The List contains:", x)

    # create a list of elements split at a comma (,)
    # with the maximum number of splits set to 1
    y = re.split(",", MyString, maxsplit=1)
    print("The List contains:", y)
The output of the above code will be:
    The List contains: ['31 January', ' 28 February', ' 31 March']
    The List contains: ['31 January', ' 28 February, 31 March']
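Splitting on a plain comma leaves a leading space on every element after the first, as the output above shows. One possible way to avoid this, sketched here as an illustration, is to split on a pattern that also consumes the white space after the comma.

```python
import re

MyString = "31 January, 28 February, 31 March"

# ,\s* matches a comma plus any white space that follows it
parts = re.split(r",\s*", MyString)
print(parts)        # ['31 January', '28 February', '31 March']

# the same pattern with at most one split
first_only = re.split(r",\s*", MyString, maxsplit=1)
print(first_only)   # ['31 January', '28 February, 31 March']
```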
The sub() Function
The sub() function replaces the matches with the specified text. The number of replacements can be limited by specifying the count parameter.
Example:
In the example below, the sub() function is used to replace each comma (,) with an asterisk (*).
    import re

    MyString = "31 January, 28 February, 31 March"

    # replace every comma (,) with an asterisk (*)
    x = re.sub(",", "*", MyString)
    print("The String contains:", x)

    # replace a comma (,) with an asterisk (*)
    # with the maximum number of replacements set to 1
    y = re.sub(",", "*", MyString, count=1)
    print("The String contains:", y)
The output of the above code will be:
    The String contains: 31 January* 28 February* 31 March
    The String contains: 31 January* 28 February, 31 March
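The replacement text can also refer back to groups captured by the pattern. The sketch below is an illustration with assumed patterns and variable names: \1 and \2 in the replacement stand for the first and second capturing groups.

```python
import re

MyString = "31 January, 28 February, 31 March"

# swap each "day month" pair into "month day"
swapped = re.sub(r"(\d+) (\w+)", r"\2 \1", MyString)
print(swapped)      # January 31, February 28, March 31

# replace only the first comma and the white space after it
first_only = re.sub(r",\s*", " / ", MyString, count=1)
print(first_only)   # 31 January / 28 February, 31 March
```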