In this article, we will see how to locate the position of a regex match in a string using the start(), end(), and span() methods of the Python re.Match object.
We will solve the following three scenarios:
- Get the start and end position of a regex match in a string
- Find the indexes of all regex matches
- Get the positions and values of each match
Note: The Python re module offers the search(), match(), and finditer() methods to match a regex pattern. search() and match() return a Match object if a match is found (and None otherwise), while finditer() returns an iterator of Match objects. Use the Match object to extract information about the matching string with the start(), end(), and span() methods.
These Match object methods are used to access the index positions of the matching string.
- start() returns the starting position of the match
- end() returns the ending position of the match (the index just past the last matched character)
- span() returns a tuple containing the (start, end) positions of the match
Example to get the position of a regex match
In this example, we will search for any 4-digit number inside the string. To achieve this, we must first write the regular expression pattern.
Pattern to match any 4 digit number: \d{4}
Steps:
- Search for the pattern using the search() method.
- Next, extract the match value using group().
- Now, use the start() and end() methods to get the starting and ending index of the match.
- Also, we can use the span() method to get both the start and end indexes in a single tuple.
import re
target_string = "Abraham Lincoln was born on February 12, 1809,"
# \d to match digits
res = re.search(r'\d{4}', target_string)
# match value
print(res.group())
# Output 1809
# start and end position
print(res.span())
# Output (41, 45)
# start position
print(res.start())
# Output 41
# end position
print(res.end())
# Output 45
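Because search() returns None when the pattern is not found, calling start() on the result would raise an AttributeError. Below is a minimal sketch of a guard, assuming a hypothetical helper named find_position (it is not part of the re module):

```python
import re


def find_position(pattern, text):
    """Return the (start, end) span of the first match, or None if there is none.

    find_position is a hypothetical helper for illustration only.
    """
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()


target_string = "Abraham Lincoln was born on February 12, 1809,"

print(find_position(r"\d{4}", target_string))
# Output (41, 45)

print(find_position(r"\d{6}", target_string))
# Output None
```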
Access the matching string using start() and end()
Now, you can save these positions and use them whenever you want to retrieve a matching string from the target string. We can use string slicing to access the matching string directly using the index positions obtained from the start() and end() methods.
Example
import re
target_string = "Abraham Lincoln was born on February 12, 1809,"
res = re.search(r'\d{4}', target_string)
print(res.group())
# Output 1809
# save start and end positions
start = res.start()
end = res.end()
print(target_string[start:end])
# Output 1809
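The saved positions can also be used to cut the match out of the original string or to replace it. The sketch below shows one illustrative use; the placeholder text "****" is an arbitrary choice made for this example:

```python
import re

target_string = "Abraham Lincoln was born on February 12, 1809,"
res = re.search(r"\d{4}", target_string)

# span() returns both positions at once
start, end = res.span()

# end is exclusive, so end - start equals the length of the match
print(end - start == len(res.group()))
# Output True

# rebuild the string with the matched part replaced by a placeholder
masked = target_string[:start] + "****" + target_string[end:]
print(masked)
# Output Abraham Lincoln was born on February 12, ****,
```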
Find the indexes of all regex matches
Suppose you want to find all matches of a regular expression in Python and, apart from the match values, you also need the index of each match. In such cases, use the finditer() method of the Python re module instead of findall().
This is because findall() returns only the matched strings in the form of a Python list, whereas finditer() returns an iterator yielding Match objects for the regex pattern. We can then iterate over the Match objects to extract each match along with its position.
In this example, we will find all 5-letter words inside the following string and also print their start and end positions.
import re
target_string = "Jessa scored 56 and Kelly scored 65 marks"
count = 0
# \w matches any word character (letter, digit, or underscore)
# \b indicates a word boundary
# {5} matches exactly five characters
for match in re.finditer(r'\b\w{5}\b', target_string):
    count += 1
    print("match", count, match.group(), "start index", match.start(), "End index", match.end())
Output
match 1 Jessa start index 0 End index 5
match 2 Kelly start index 20 End index 25
match 3 marks start index 36 End index 41
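To keep the positions and the values of each match together for later use, the Match objects can be collected into a list. This is a minimal sketch; the variable names are illustrative:

```python
import re

target_string = "Jessa scored 56 and Kelly scored 65 marks"
pattern = r"\b\w{5}\b"

# findall() returns only the matched strings
values = re.findall(pattern, target_string)
print(values)
# Output ['Jessa', 'Kelly', 'marks']

# finditer() yields Match objects, so the positions are available too
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, target_string)]
print(matches)
# Output [(0, 5, 'Jessa'), (20, 25, 'Kelly'), (36, 41, 'marks')]
```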
Find the indexes of all occurrences of a word in a string
Example
import re
target_string = "Emma knows Python. Emma knows ML and AI"
# find all occurrences of the word "emma", ignoring case,
# and print the start and end index of each occurrence
cnt = 0
for match in re.finditer(r'emma', target_string, re.IGNORECASE):
    cnt += 1
    print("match", cnt, "start index", match.start(), "End index", match.end())
Output
match 1 start index 0 End index 4
match 2 start index 19 End index 23
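The same loop can be wrapped in a small function. The sketch below assumes a hypothetical helper named find_word_positions. It uses re.escape() so that any special characters in the word are treated literally, and \b so that only whole words are matched, which is a stricter rule than the plain pattern used above:

```python
import re


def find_word_positions(word, text, ignore_case=True):
    """Return a list of (start, end) spans for every whole-word occurrence.

    find_word_positions is a hypothetical helper for illustration only.
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = r"\b" + re.escape(word) + r"\b"
    return [m.span() for m in re.finditer(pattern, text, flags)]


target_string = "Emma knows Python. Emma knows ML and AI"

print(find_word_positions("emma", target_string))
# Output [(0, 4), (19, 23)]

print(find_word_positions("emma", target_string, ignore_case=False))
# Output []
```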
Points to be remembered while using the start() method
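Since the re.match() method only checks whether the regular expression matches at the start of a string, start() will always be zero for a successful re.match() call. If the pattern does not match at the beginning, re.match() returns None, so there is no Match object to call start() on.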
However, the re.search() method scans through the entire target string and returns the first location where the pattern matches, so the match may not start at zero in that case.
Now let's match any ten consecutive alphanumeric characters in the target string using both the match() and search() methods.
Example
import re
target_string = "Emma is a basketball player who was born on June 17, 1993"
# match() only checks the start of the string, where "Emma" has just four word characters
result = re.match(r"\w{10}", target_string)
# printing match
print("Match: ", result) # None
# using search()
result = re.search(r"\w{10}", target_string)
# printing match
print("Match value: ", result.group()) # basketball
print("Match starts at", result.start()) # index 10
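For comparison, here is a short sketch of the case where match() does succeed, using a shorter pattern chosen only for illustration. It also shows that the match() method of a compiled pattern accepts an optional pos argument, in which case start() reports that position rather than zero:

```python
import re

target_string = "Emma is a basketball player who was born on June 17, 1993"

# re.match() succeeds only when the pattern matches at index 0
result = re.match(r"\w{4}", target_string)
print(result.group(), result.start())
# Output Emma 0

# a compiled pattern can begin matching at a given position
pattern = re.compile(r"\w{10}")
result_at_pos = pattern.match(target_string, 10)
print(result_at_pos.group(), result_at_pos.start())
# Output basketball 10
```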