This article explains how to use metacharacters (also called operators) in Python regular expressions. We will walk through each metacharacter with short, clear examples that you can use in your own code.
We can use both special and ordinary characters inside a regular expression. Most ordinary characters, like 'A' or 'p', are the simplest regular expressions; they match themselves. You can also concatenate ordinary characters, so the pattern "PYnative" matches the string 'PYnative'.
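As a quick check of the point above, here is a minimal sketch. The two sample sentences are made up for illustration and are not part of the original example.

```python
import re

# Ordinary characters match themselves, so this pattern is just literal text.
pattern = "PYnative"

# Hypothetical sample strings, chosen only for this illustration.
found = re.search(pattern, "Welcome to PYnative tutorials")
missing = re.search(pattern, "Welcome to pynative tutorials")  # different case

print(found.group())  # PYnative
print(missing)        # None, because matching is case-sensitive by default
```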
Apart from these, we also have special characters called metacharacters. Each metacharacter has its own purpose and can be very helpful when you solve programming tasks with regular expressions.
What is Metacharacter in a Regular Expression?
In Python, metacharacters are special characters that affect how the regular expressions around them are interpreted. Metacharacters don’t match themselves. Instead, they signal that some rule should be applied. Characters or signs like |, +, or * are metacharacters. For example, the ^ (caret) metacharacter matches the regex pattern only at the start of the string.
Metacharacters are also called operators, signs, or symbols.
First, let’s see the list of regex metacharacters we can use in Python and their meaning.
Metacharacter | Description |
---|---|
. (DOT) | Matches any character except a newline. |
^ (Caret) | Matches pattern only at the start of the string. |
$ (Dollar) | Matches pattern at the end of the string. |
* (Asterisk) | Matches 0 or more repetitions of the regex. |
+ (Plus) | Matches 1 or more repetitions of the regex. |
? (Question mark) | Matches 0 or 1 repetition of the regex. |
[] (Square brackets) | Used to indicate a set of characters. Matches any single character in brackets. For example, [abc] will match either a, b, or c. |
| (Pipe) | Used to specify alternative patterns. For example, P1|P2 matches either P1 or P2, where P1 and P2 are two different regexes. |
\ (Backslash) | Used to escape special characters or to signal a special sequence. For example, to search for one of the special characters, put a \ in front of it. |
[^...] | Matches any single character not in brackets. |
(...) | Groups the regular expression inside the parentheses and matches it. For example, (abc) will match the substring 'abc'. |
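The last few rows of the table (pipe, negated set, and parentheses) can be tried out with a short sketch like the one below. The sample string is hypothetical and was chosen only to keep the results easy to read.

```python
import re

text = "cat bat rat"  # hypothetical sample string

# Pipe: match either alternative.
either = re.findall(r"cat|rat", text)
print(either)  # ['cat', 'rat']

# Negated set: any single character that is neither 'a' nor a space.
not_a = re.findall(r"[^a ]", text)
print(not_a)  # ['c', 't', 'b', 't', 'r', 't']

# Parentheses: group part of the pattern and read it back with group(1).
match = re.search(r"(b)at", text)
print(match.group())   # bat
print(match.group(1))  # b
```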
Regex . dot metacharacter
Inside a regular expression, the dot operator represents any character except the newline character (\n). "Any character" means uppercase or lowercase letters, digits 0 through 9, symbols such as the dollar sign ($) or the pound sign (#), punctuation marks such as the exclamation mark (!), question mark (?), comma (,), or colon (:), as well as whitespace.
Let’s write a basic pattern to verify that the DOT matches any character except the new line.
Example
import re
target_string = "Emma loves \n Python"
# dot(.) metacharacter to match any character
result = re.search(r'.', target_string)
print(result.group())
# Output 'E'
# .+ to match any string except newline
result = re.search(r'.+', target_string)
print(result.group())
# Output 'Emma loves '
Explanation
Here, I used the search() method to search for the pattern specified in the first argument. In the second pattern, notice that I used the dot (.) followed by the plus (+) sign. The plus sign is a repetition operator in regular expressions, and it means that the preceding character or pattern should repeat one or more times.
This means that we are looking to match a sequence of at least one character, none of which is a newline.
Next, we used the group() method to see the result. As you can see, the substring up to the newline (\n) is returned because the DOT matches any character except the newline.
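To see the same behavior across a whole string, the sketch below uses findall() instead of search(). The short two-line string is an assumption made for illustration.

```python
import re

sample = "ab\ncd"  # hypothetical string with a newline in the middle

# findall returns every single-character match; the newline is skipped.
chars = re.findall(r".", sample)
print(chars)  # ['a', 'b', 'c', 'd']

# .+ stops at the newline, so each line becomes a separate match.
lines = re.findall(r".+", sample)
print(lines)  # ['ab', 'cd']
```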
DOT to match a newline character
If you want the DOT to match the newline character as well, pass the re.DOTALL or re.S flag as an argument to the search() method. Let’s try this.
Example
import re
str1 = "Emma is a Python developer \n She also knows ML and AI"
# dot(.) characters to match newline
result = re.search(r".+", str1, re.S)
print(result.group())
Output
Emma is a Python developer 
 She also knows ML and AI
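A side-by-side comparison can make the effect of the flag easier to see. This is a small sketch that reuses the string from the example; the variable names are my own.

```python
import re

str1 = "Emma is a Python developer \n She also knows ML and AI"

# Without the flag, the match stops right before the newline.
without_flag = re.search(r".+", str1).group()
# With re.DOTALL (re.S is the short alias), the dot also matches the newline.
with_flag = re.search(r".+", str1, re.DOTALL).group()

print(repr(without_flag))  # 'Emma is a Python developer '
print("\n" in with_flag)   # True
print(with_flag == str1)   # True, the whole string is matched
```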
Regex ^ caret metacharacter
target_string = "Emma is a Python developer and her salary is 5000$ \n Emma also knows ML and AI"
In Python, the caret operator or sign is used to match a pattern only at the beginning of the string. For example, looking at our target string, we notice two things.
- We have a newline inside the string.
- The string starts with the word Emma, which is a four-letter word.
So, assuming we want to match any four-letter word at the beginning of the string, we would use the caret (^) metacharacter. Let’s test this.
Example
import re
target_string = "Emma is a Python developer \n Emma also knows ML and AI"
# caret (^) matches at the beginning of a string
result = re.search(r"^\w{4}", target_string)
print(result.group())
# Output 'Emma'
Explanation
In this line of code, we are using the search() method, and the regular expression pattern starts with the caret.
To match a four-letter word at the beginning of the string, I used the \w special sequence, which matches any alphanumeric character: lowercase and uppercase letters, digits, and the underscore character.
The 4 inside curly braces says that the alphanumeric character must occur exactly four times in a row, i.e. Emma.
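It can also help to see the caret fail. The sketch below, with a shortened version of the target string and literal words chosen for illustration, shows that a word in the middle of the string is found by a plain pattern but not by an anchored one.

```python
import re

target_string = "Emma is a Python developer"

at_start = re.search(r"^Emma", target_string)        # anchored, at the start
not_at_start = re.search(r"^Python", target_string)  # anchored, not at the start
anywhere = re.search(r"Python", target_string)       # no anchor

print(at_start.group())  # Emma
print(not_at_start)      # None
print(anywhere.start())  # 10
```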
caret ( ^ ) to match a pattern at the beginning of each new line
Normally, the caret matches the pattern only at the very beginning of the string, even if the string is multiline, meaning it contains newlines.
However, if you want to match the pattern at the beginning of each new line, use the re.M flag. The re.M flag (short for re.MULTILINE) is used for multiline matching.
As you know, our string contains a newline in the middle. Let’s test this.
Example
import re
str1 = "Emma is a Python developer and her salary is 5000$ \nEmma also knows ML and AI"
# caret (^) matches at the beginning of each new line
# Using re.M flag
result = re.findall(r"^\w{4}", str1, re.M)
print(result)
# Output ['Emma', 'Emma']
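For comparison, here is a short sketch that runs the same pattern with and without the flag. The variable names are my own; the string is the one from the example.

```python
import re

str1 = "Emma is a Python developer and her salary is 5000$ \nEmma also knows ML and AI"

# Default mode: ^ matches only at the very start of the string.
default_mode = re.findall(r"^\w{4}", str1)
# Multiline mode: ^ also matches right after each newline.
multiline_mode = re.findall(r"^\w{4}", str1, re.MULTILINE)

print(default_mode)    # ['Emma']
print(multiline_mode)  # ['Emma', 'Emma']
```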
Regex $ dollar metacharacter
This time we are going to look at the dollar sign metacharacter, which is the counterpart of the caret (^).
In Python, the dollar ($) operator or sign matches the regular expression pattern at the end of the string. Let’s test this by matching the word AI, which is present at the end of the string, using the dollar ($) metacharacter.
Example
import re
str1 = "Emma is a Python developer \nEmma also knows ML and AI"
# dollar sign($) to match at the end of the string
result = re.search(r"\w{2}$", str1)
print(result.group())
# Output 'AI'
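Two details of the dollar sign are worth trying out: it can be combined with re.M to match at the end of every line, and by default it also matches just before a newline that ends the string. The sample strings in this sketch are assumptions made for illustration.

```python
import re

str1 = "Emma knows ML\nEmma knows AI"  # hypothetical two-line string

# Default mode: $ matches only at the end of the string.
last_only = re.findall(r"\w{2}$", str1)
# Multiline mode: $ also matches at the end of each line.
each_line = re.findall(r"\w{2}$", str1, re.M)

print(last_only)  # ['AI']
print(each_line)  # ['ML', 'AI']

# $ also matches just before a trailing newline at the end of the string.
trailing = re.search(r"AI$", "Emma knows AI\n")
print(trailing.group())  # AI
```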
Regex * asterisk/star metacharacter
Another very useful and widely used metacharacter in regular expression patterns is the asterisk (*). In Python, the asterisk operator or sign inside a pattern means that the preceding expression or character should repeat 0 or more times, with as many repetitions as possible, meaning it is a greedy repetition.
When we say the * (asterisk) is greedy, we mean that out of all the allowed repetition counts (zero or more), it takes the largest one that still lets the overall pattern match.
Let’s see the example to match all the numbers from the following string using an asterisk (*) metacharacter.
target_string = "Numbers are 8,23, 886, 4567, 78453"
Pattern to match: \d\d*
Let’s understand this pattern first.
As you can see, the pattern is made of two consecutive \d. The \d special sequence represents any digit.
The most important thing to keep in mind here is that the asterisk (*) at the end of the pattern means zero or more repetitions of the preceding expression. In this case, the preceding expression is the last \d, not both of them.
This means that we are searching for numbers with a minimum of 1 digit and any number of further digits.
We may get the following possible matches:
- A single digit, meaning 0 repetitions according to the asterisk, or
- a two-digit number, meaning 1 repetition according to the asterisk, or
- a three-digit number, meaning 2 repetitions of the last \d, or
- a four-digit number, and so on.
There is no upper limit of repetitions enforced by the * (asterisk) metacharacter. However, the lower limit is zero.
So \d\d* means that the re.findall() method should return all the numbers from the target string.
Example
import re
str1 = "Numbers are 8,23, 886, 4567, 78453"
# asterisk sign(*) to match 0 or more repetitions
result = re.findall(r"\d\d*", str1)
print(result)
# Output ['8', '23', '886', '4567', '78453']
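The "zero repetitions" case and the greedy behavior are easier to see on a tiny input. The following sketch uses a made-up string of a's and b's, plus one of the numbers from the example.

```python
import re

sample = "a ab abbb"  # hypothetical sample string

# b* allows zero b's, so a lone 'a' is still a match.
result = re.findall(r"ab*", sample)
print(result)  # ['a', 'ab', 'abbb']

# Greedy: \d* takes as many digits as it can, so the whole number is matched.
greedy = re.search(r"\d\d*", "78453").group()
print(greedy)  # 78453
```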
Regex + plus metacharacter
Another very useful and widely used metacharacter in regular expression patterns is the plus (+). In Python, the plus operator (+) inside a pattern means that the preceding expression or character should repeat one or more times, with as many repetitions as possible, meaning it is a greedy repetition.
When we say the plus is greedy, we mean that out of all the allowed repetition counts (one or more), it takes the largest one that still lets the overall pattern match.
Let’s use the same example to match numbers with two or more digits using the plus (+) metacharacter.
Pattern to match: \d\d+
This means that we are searching for numbers with a minimum of 2 digits and any number of further digits.
We can get the following possible matches:
- A two-digit number, meaning 1 repetition according to the plus (+), or
- a three-digit number, meaning 2 repetitions of the last \d, or
- a four-digit number, and so on.
There is no upper limit of repetitions enforced by the + (plus) metacharacter. However, the lower limit is 1.
So \d\d+ means that the re.findall() method should return all the numbers with a minimum of two digits from the target string.
Example
import re
str1 = "Numbers are 8,23, 886, 4567, 78453"
# Plus sign(+) to match 1 or more repetitions
result = re.findall(r"\d\d+", str1)
print(result)
# Output ['23', '886', '4567', '78453']
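To contrast the lower limit of 1 with the asterisk’s lower limit of 0, here is a small sketch. The string of a's and b's is made up for illustration; the numbers string is the one from the example.

```python
import re

sample = "a ab abbb"  # hypothetical sample string

# b+ needs at least one b, so the lone 'a' is no longer a match.
result = re.findall(r"ab+", sample)
print(result)  # ['ab', 'abbb']

# A single \d+ already means "one or more digits", so it finds every number.
str1 = "Numbers are 8,23, 886, 4567, 78453"
all_numbers = re.findall(r"\d+", str1)
print(all_numbers)  # ['8', '23', '886', '4567', '78453']
```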
The ? question mark metacharacter
In Python, the question mark operator or sign (?) inside a regex pattern means the preceding character or expression repeats either zero or one time only. This means that the number of possible repetitions is strictly limited on both ends.
Let’s see an example to compare ? with the * and + metacharacters when handling repetitions.
Pattern to match: \d\d\d\d\d?
As you know, the question mark allows the preceding character to repeat either zero or one time.
We have five \d, which means that we want to match numbers having at least four digits, while the fifth \d may repeat 0 or 1 times, meaning it is either absent or present exactly once.
Example
import re
target_string = "Numbers are 8,23, 886, 4567, 78453"
# Question mark sign(?) to match 0 or 1 repetitions
result = re.findall(r"\d\d\d\d\d?", target_string)
print(result)
# Output ['4567', '78453']
We have set a minimum of four and a maximum of five digits per match. And indeed, the result contains only the four-digit and five-digit numbers.
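To compare the three repetition metacharacters directly, the sketch below applies ?, *, and + to the same letter. The sample words (including the misspelled "colouur") are made up for illustration.

```python
import re

sample = "color colour colouur"  # hypothetical sample string

optional = re.findall(r"colou?r", sample)  # u may appear 0 or 1 times
star = re.findall(r"colou*r", sample)      # u may appear 0 or more times
plus = re.findall(r"colou+r", sample)      # u must appear 1 or more times

print(optional)  # ['color', 'colour']
print(star)      # ['color', 'colour', 'colouur']
print(plus)      # ['colour', 'colouur']
```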
The \ backslash metacharacter
In Python, the backslash metacharacter has two primary purposes inside regex patterns.
- It can signal a special sequence, for example, \d for matching any digit from 0 to 9.
- If your expression needs to search for one of the special characters, you can use a backslash (\) to escape it.
For example, say you want to search for the question mark (?) inside a string. You can use a backslash to escape it, because the question mark has a special meaning inside a regular expression pattern.
Let’s understand each of these two scenarios, one by one.
To indicate a special sequence
- \d for any digit
- \w for any alphanumeric character (letters, digits, and the underscore)
- \s for any whitespace character
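The three special sequences above can be tried on one short string. This is a minimal sketch; the sample string, which contains a space and a tab, is an assumption made for illustration.

```python
import re

sample = "Room 42\tis_free"  # hypothetical string with a space and a tab

digits = re.findall(r"\d", sample)   # single digits
words = re.findall(r"\w+", sample)   # runs of letters, digits, or underscores
spaces = re.findall(r"\s", sample)   # whitespace characters

print(digits)  # ['4', '2']
print(words)   # ['Room', '42', 'is_free']
print(spaces)  # [' ', '\t']
```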
Escape special character using a backslash (\)
Let’s take the DOT metacharacter you’ve seen so far. The DOT has a special meaning when used inside a regular expression: it matches any character except the newline.
However, in ordinary text, the dot is used to end a sentence. So the question is how to match an actual dot inside a string using a regex pattern, given that the DOT already has a special meaning inside a pattern.
The solution is to use the backslash, and this is called escaping. You can use the backslash to escape the dot inside the regular expression pattern. This removes its special meaning, so it matches only an actual dot in the target string.
Let’s see an example of this.
import re
str1 = "Emma is a Python developer. Emma salary is 5000$. Emma also knows ML and AI."
# escape dot
res = re.findall(r"\.", str1)
print(res)
# Output ['.', '.', '.']
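When the text you want to match contains several special characters, escaping each one by hand gets tedious. As an alternative, the standard library offers re.escape(), which adds the backslashes for you. The literal "5000$." below is chosen only for illustration.

```python
import re

str1 = "Emma is a Python developer. Emma salary is 5000$. Emma also knows ML and AI."

literal = "5000$."  # contains two metacharacters: $ and .

# Used as-is, $ is read as "end of string", so nothing matches.
unescaped = re.findall(literal, str1)
print(unescaped)  # []

# re.escape() backslash-escapes the special characters in the literal.
escaped = re.findall(re.escape(literal), str1)
print(escaped)  # ['5000$.']
```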
The [] square brackets metacharacter
Square brackets are useful in a regex pattern because they represent sets of characters and character classes.
Let’s say we want to look for any occurrences of the letters e, d, or k inside our target string. Or, in simple terms, match any of these letters inside the string. We can use square brackets to represent the set of characters, like [edk]. Note that the set is case-sensitive: [edk] does not match the uppercase E in Emma.
import re
str1 = "Emma is a Python developer. Emma also knows ML and AI."
res = re.findall(r"[edk]", str1)
print(res)
# Output ['d', 'e', 'e', 'e', 'k', 'd']
Note: The operation here is "or", meaning this is equivalent to saying I am looking for any occurrences of e or d or k. The result is a list containing all the matches that were found inside the target string.
This operation is useful when you want to search for several characters at the same time without knowing in advance whether any or all of them occur in the string.
We can also use square brackets to specify an interval or range of characters by putting a dash between the two ends of the range.
For instance, let’s say that we want to match any letter from m to p inside our target string. To do this, we can write a regex like [m-p], which matches all the occurrences of the letters m, n, o, and p.
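Here is a minimal sketch of the range syntax. The sample strings are made up for illustration, and the second pattern shows that several ranges can sit inside one pair of brackets.

```python
import re

sample = "open map"  # hypothetical sample string

# [m-p] matches any single lowercase letter from m to p.
res = re.findall(r"[m-p]", sample)
print(res)  # ['o', 'p', 'n', 'm', 'p']

# Several ranges can be combined in one set: uppercase letters or digits.
mixed = re.findall(r"[A-Z0-9]", "Py3 ML")
print(mixed)  # ['P', '3', 'M', 'L']
```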