Python regular expressions (Regex) are a powerful way to search, extract, validate, and manipulate text in Python. From cleaning datasets and parsing logs to validating user input, mastering Python’s built-in re module improves coding efficiency.
This Regex exercise set helps you build hands-on experience with pattern matching through 30 exercises, progressing from basic matching to advanced text extraction and manipulation.
Each coding challenge includes a Practice Problem, Hint, Solution code, and detailed Explanation, ensuring you don’t just copy code, but genuinely practice and understand how and why it works.
- All solutions have been fully tested on Python 3.
- Read Python Regex: Python’s RE module for pattern matching with regular expressions.
- Use our Online Code Editor to solve these exercises in real time.
What you’ll practice:
- Basic Matching & Quantifiers: Character classes, sets, and repetition (*, +, ?, {m,n})
- Word & Character Boundaries: Using ^, $, \b, and \B
- Data Validation & Cleaning: Validating IDs, formatting, and standardizing data
- Search & Extraction: Using re.search(), re.findall(), and re.finditer()
- String Manipulation: Performing advanced replacements with re.sub()
Who is this for?
Beginner to intermediate Python developers with basic knowledge of Python strings who want practical experience with the re module.
Table of contents
- Exercise 1: Check Allowed Characters
- Exercise 2: Match Zero or More
- Exercise 3: Match One or More
- Exercise 4: Match Optional Characters
- Exercise 5: Match Exact Occurrences
- Exercise 6: Match Range of Occurrences
- Exercise 7: Find Underscore Joined Lowercase
- Exercise 8: PascalCase Match
- Exercise 9: Match Start and End
- Exercise 10: Match Word at Start
- Exercise 11: Match Word at End
- Exercise 12: Find a Specific Letter
- Exercise 13: Find Letter in Middle
- Exercise 14: Match Adjacent Words
- Exercise 15: Filter by Starting Letter
- Exercise 16: Validate Alphanumeric ID
- Exercise 17: Check Starting Number
- Exercise 18: Number at End
- Exercise 19: Clean IP Addresses
- Exercise 20: Convert Date Format
- Exercise 21: Extract 1-3 Digit Numbers
- Exercise 22: Search Literal Strings
- Exercise 23: Find Pattern Location
- Exercise 24: Find All Substrings
- Exercise 25: Iterate Matches
- Exercise 26: Extract Date from URL
- Exercise 27: Extract All Numbers
- Exercise 28: Extract Email Addresses
- Exercise 29: Swap Characters
- Exercise 30: Replace Multiple Delimiters
Exercise 1: Check Allowed Characters
Problem Statement: Write a Python program to verify that a string contains only alphanumeric characters (a-z, A-Z, and 0-9).
Purpose: This exercise helps you practice using regular expressions to validate input strings. Checking for allowed characters is a foundational technique used in form validation, data sanitisation, and security-sensitive input handling.
Given Input: text = "Hello123"
Expected Output: Valid: contains only alphanumeric characters
▼ Hint
- Import the re module at the top of your program.
- Use re.fullmatch() to check whether the entire string matches a pattern, not just part of it.
- The pattern [a-zA-Z0-9]+ matches one or more alphanumeric characters.
- re.fullmatch() returns a match object if the whole string matches, or None if it does not.
▼ Solution & Explanation
Explanation:
- import re: Loads Python’s built-in regular expression module, which is required for all re functions.
- [a-zA-Z0-9]: A character class that matches any single uppercase letter, lowercase letter, or digit.
- +: A quantifier meaning one or more of the preceding character class, so the string must not be empty.
- re.fullmatch(): Requires the pattern to match the entire string from start to finish. This is stricter than re.search(), which would match even a partial substring.
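A minimal sketch of the approach described above. The wording of the "Invalid" branch and the variable name message are assumptions added for illustration; only the "Valid" message comes from the expected output.

```python
import re

text = "Hello123"

# fullmatch() succeeds only if every character belongs to the class
if re.fullmatch(r"[a-zA-Z0-9]+", text):
    message = "Valid: contains only alphanumeric characters"
else:
    # Assumed wording: the exercise only specifies the valid case
    message = "Invalid: contains non-alphanumeric characters"

print(message)
```

Changing text to something like "Hello 123" should take the else branch, because the space is outside the character class.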
Exercise 2: Match Zero or More
Problem Statement: Write a Python program to match a string that has an a followed by zero or more bs (e.g., a, ab, abb).
Purpose: This exercise introduces the * quantifier, one of the most commonly used tools in regular expressions. Understanding zero-or-more matching is essential for parsing optional repeated elements in text processing and pattern recognition.
Given Input: test_strings = ["a", "ab", "abb", "abbb", "b", "ba"]
See: Python Regex Metacharacters and Operators
Expected Output:
a -> Match
ab -> Match
abb -> Match
abbb -> Match
b -> No match
ba -> No match
▼ Hint
- The pattern ab* means: the letter a, followed by zero or more bs.
- Use re.fullmatch() so that strings like ba or abc are not incorrectly accepted.
- Loop over a list of test strings and print whether each one matches or not.
▼ Solution & Explanation
Explanation:
- ab*: Matches a literal a followed by zero or more occurrences of b. The * quantifier means the b is entirely optional but can repeat any number of times.
- re.fullmatch(): Ensures the entire string is evaluated against the pattern. Without it, re.search() would match the a inside ba and produce a false positive.
- f"{s:<6}": A format specifier that left-aligns the string in a field of width 6, making the output easier to read in a column.
- Why b and ba fail: b has no leading a, and ba has the letters in the wrong order, so neither satisfies the ab* pattern.
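One way to write the loop described above, as a sketch. The results dictionary is an illustrative name, not part of the exercise.

```python
import re

test_strings = ["a", "ab", "abb", "abbb", "b", "ba"]
pattern = r"ab*"  # an a followed by zero or more bs

# Record whether the whole string matches the pattern
results = {s: bool(re.fullmatch(pattern, s)) for s in test_strings}

for s, matched in results.items():
    # {s:<6} left-aligns the input in a six-character column
    print(f"{s:<6} -> {'Match' if matched else 'No match'}")
```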
Exercise 3: Match One or More
Problem Statement: Write a Python program to match a string that has an a followed by one or more bs (e.g., ab, abb, but not a).
Purpose: This exercise demonstrates the + quantifier, which enforces that at least one occurrence of a character must be present. It is a small but critical distinction from * and is widely used when a repeated element is required rather than optional.
Given Input: test_strings = ["a", "ab", "abb", "abbb", "b", "ba"]
Expected Output:
a -> No match
ab -> Match
abb -> Match
abbb -> Match
b -> No match
ba -> No match
▼ Hint
- The pattern ab+ means: the letter a, followed by one or more bs.
- This is identical in structure to Exercise 2, but swapping * for + makes the b mandatory.
- Use re.fullmatch() to reject strings like ba where the order is incorrect.
▼ Solution & Explanation
Explanation:
- ab+: Matches the letter a followed by one or more bs. The + quantifier requires at least one b to be present, unlike * which allows zero.
- Why a now fails: The lone a matched in Exercise 2 because * allowed zero bs. Here, + demands at least one, so a alone is rejected.
- re.fullmatch(): Continues to play an important role by preventing partial matches. Without it, re.search(r"ab+", "abXYZ") would incorrectly return a match.
- Key distinction: * means zero or more; + means one or more. This single character difference changes whether the repeated element is optional or required.
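The same loop with + in place of * might look like the sketch below. The names results and partial are illustrative, and the abXYZ check mirrors the example in the explanation.

```python
import re

test_strings = ["a", "ab", "abb", "abbb", "b", "ba"]
pattern = r"ab+"  # an a followed by one or more bs

results = {s: bool(re.fullmatch(pattern, s)) for s in test_strings}

for s, matched in results.items():
    print(f"{s:<6} -> {'Match' if matched else 'No match'}")

# search() accepts a partial match, which is why fullmatch() is used above
partial = re.search(pattern, "abXYZ")
print(partial.group())
```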
Exercise 4: Match Optional Characters
Problem Statement: Write a Python program to match a string that has an a followed by zero or one b (i.e., exactly a or ab, nothing else).
Purpose: This exercise introduces the ? quantifier, which marks a character as optional but non-repeating. It is commonly used when parsing elements that may or may not appear, such as an optional sign in a number, an optional prefix, or an optional suffix in a word.
Given Input: test_strings = ["a", "ab", "abb", "abbb", "b", "ba"]
Expected Output:
a -> Match
ab -> Match
abb -> No match
abbb -> No match
b -> No match
ba -> No match
▼ Hint
- The pattern ab? means: the letter a, followed by zero or one b.
- The ? quantifier does not allow repetition. It only permits the character to appear once at most.
- Use re.fullmatch() so that abb and longer strings are correctly rejected.
▼ Solution & Explanation
Explanation:
- ab?: Matches the letter a followed by an optional single b. The ? quantifier means zero or one occurrence, so only a and ab are valid.
- Why abb fails: The ? quantifier allows at most one b. Two or more bs exceed the limit, so abb and abbb do not match when using re.fullmatch().
- Comparison with * and +: All three quantifiers are closely related. ? is 0-1, * is 0 to infinity, and + is 1 to infinity. Choosing the right one depends on how many repetitions are acceptable.
- Common use case: The ? quantifier is frequently used in real-world patterns, for example https? matches both http and https in URL validation.
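A short sketch of the optional-character check, followed by the https? use case mentioned above. The names results and schemes are illustrative.

```python
import re

test_strings = ["a", "ab", "abb", "abbb", "b", "ba"]
pattern = r"ab?"  # an a followed by at most one b

results = {s: bool(re.fullmatch(pattern, s)) for s in test_strings}

for s, matched in results.items():
    print(f"{s:<6} -> {'Match' if matched else 'No match'}")

# The same quantifier makes the trailing s of https optional
schemes = [bool(re.fullmatch(r"https?", s)) for s in ("http", "https", "httpss")]
print(schemes)
```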
Exercise 5: Match Exact Occurrences
Problem Statement: Write a Python program to match a string that has an a followed by exactly three bs (i.e., only abbb is a valid match).
Purpose: This exercise introduces curly-brace quantifiers, which allow you to specify an exact number of repetitions. Exact-count matching is useful in tasks such as validating fixed-length codes, parsing structured data fields, and enforcing strict formatting rules.
Given Input: test_strings = ["a", "ab", "abb", "abbb", "abbbb", "b"]
Expected Output:
a -> No match
ab -> No match
abb -> No match
abbb -> Match
abbbb -> No match
b -> No match
▼ Hint
- Use curly braces to specify an exact count: b{3} means exactly three bs.
- The full pattern ab{3} matches an a followed by exactly three bs.
- Use re.fullmatch() to ensure strings with more than three bs (like abbbb) are rejected.
- You can also use the range form {m,n} to match between m and n repetitions if a range is ever needed.
▼ Solution & Explanation
Explanation:
- b{3}: A curly-brace quantifier that matches the character b repeated exactly three times. It is equivalent to writing bbb explicitly, but is more readable and easier to adjust.
- ab{3}: The {3} applies only to the immediately preceding element, which is b. The a is still a single literal character.
- Why abbbb fails: re.fullmatch() requires the entire string to match the pattern. Four bs exceed the exact count of three, so the match fails.
- Range variant: {m,n} matches between m and n repetitions inclusive. For example, b{2,4} would match bb, bbb, or bbbb. Omitting n, as in b{2,}, means two or more, which behaves like + with a higher minimum.
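The exact-count check can be sketched as follows. The results name is illustrative.

```python
import re

test_strings = ["a", "ab", "abb", "abbb", "abbbb", "b"]
pattern = r"ab{3}"  # {3} applies to b only: one a, then exactly three bs

results = {s: bool(re.fullmatch(pattern, s)) for s in test_strings}

for s, matched in results.items():
    print(f"{s:<6} -> {'Match' if matched else 'No match'}")
```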
Exercise 6: Match Range of Occurrences
Problem Statement: Write a Python program to match a string that has an a followed by two to three bs (i.e., abb or abbb).
Purpose: This exercise introduces the range form of curly-brace quantifiers, {m,n}, which lets you set a lower and upper bound on repetitions. Range quantifiers are useful when validating fields that must fall within a length window, such as short codes, postal abbreviations, or bounded identifiers.
Given Input: test_strings = ["a", "ab", "abb", "abbb", "abbbb", "b"]
Expected Output:
a -> No match
ab -> No match
abb -> Match
abbb -> Match
abbbb -> No match
b -> No match
▼ Hint
- The pattern ab{2,3} means: the letter a followed by two to three bs.
- The {m,n} quantifier is inclusive on both ends, so both abb (two bs) and abbb (three bs) are valid.
- Use re.fullmatch() to ensure strings with only one b or more than three bs are correctly rejected.
- Do not add a space between the comma and the numbers inside the curly braces: write {2,3}, not {2, 3}.
▼ Solution & Explanation
Explanation:
- b{2,3}: A range quantifier that matches b repeated at least twice and at most three times. Both bounds are inclusive, so two and three are both accepted.
- Why ab fails: A single b falls below the minimum of two required by {2,3}, so ab does not satisfy the pattern.
- Why abbbb fails: Four bs exceed the upper bound of three. Because re.fullmatch() requires the entire string to be consumed, the extra b causes the match to fail.
- Relationship to other quantifiers: {2,3} is the bounded middle ground between exact matching ({3}) and open-ended matching ({2,}, which means two or more). Choosing the right form depends on how tightly you need to constrain the input.
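A sketch of the range check, using the same loop shape as the earlier exercises. The results name is illustrative.

```python
import re

test_strings = ["a", "ab", "abb", "abbb", "abbbb", "b"]
pattern = r"ab{2,3}"  # one a, then two or three bs (no space after the comma)

results = {s: bool(re.fullmatch(pattern, s)) for s in test_strings}

for s, matched in results.items():
    print(f"{s:<6} -> {'Match' if matched else 'No match'}")
```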
Exercise 7: Find Underscore Joined Lowercase
Problem Statement: Write a Python program to find sequences of lowercase letters joined with an underscore (e.g., hello_world).
Purpose: This exercise practises matching multi-part patterns that involve a separator character between word segments. Recognising underscore-joined identifiers is directly applicable to parsing Python variable names, snake_case tokens in configuration files, and structured log fields.
Given Input: test_strings = ["hello_world", "foo_bar", "hello", "hello_", "_world", "Hello_world", "hello_World"]
Expected Output:
hello_world -> Match
foo_bar -> Match
hello -> No match
hello_ -> No match
_world -> No match
Hello_world -> No match
hello_World -> No match
▼ Hint
- The pattern needs to match one or more lowercase letters, then a literal underscore, then one or more lowercase letters.
- Use [a-z]+ to match a sequence of lowercase letters only. This deliberately excludes uppercase letters and digits.
- Combine the parts as [a-z]+_[a-z]+ and use re.fullmatch() to reject strings with a leading or trailing underscore or any uppercase letters.
▼ Solution & Explanation
Explanation:
- [a-z]+: A character class that matches one or more lowercase ASCII letters. The + ensures at least one letter must appear on each side of the underscore.
- _: A literal underscore acting as the required separator between the two lowercase word segments.
- Why hello_ and _world fail: The pattern requires at least one lowercase letter both before and after the underscore. A trailing or leading underscore with nothing on the other side leaves one side unsatisfied.
- Why Hello_world and hello_World fail: The character class [a-z] matches only lowercase letters. An uppercase letter anywhere in the string causes re.fullmatch() to return no match.
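The three-part pattern could be applied like this. This is a sketch, and results is an illustrative name.

```python
import re

test_strings = ["hello_world", "foo_bar", "hello", "hello_",
                "_world", "Hello_world", "hello_World"]
pattern = r"[a-z]+_[a-z]+"  # lowercase run, underscore, lowercase run

results = {s: bool(re.fullmatch(pattern, s)) for s in test_strings}

for s, matched in results.items():
    print(f"{s:<12} -> {'Match' if matched else 'No match'}")
```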
Exercise 8: PascalCase Match
Problem Statement: Write a Python program to find sequences of one uppercase letter followed by lowercase letters (e.g., Hello, World, Python).
Purpose: This exercise practises combining character classes to enforce a strict positional rule: one thing here, something else there. Matching PascalCase or title-case words is a common requirement when parsing names, class identifiers, or capitalised tokens in natural language processing.
Given Input: test_strings = ["Hello", "World", "python", "HELLO", "Hello123", "H", "Ha"]
Expected Output:
Hello -> Match
World -> Match
python -> No match
HELLO -> No match
Hello123 -> No match
H -> No match
Ha -> Match
▼ Hint
- Split the pattern into two parts: one character class for the single uppercase letter, and another for the one or more lowercase letters that follow.
- Use [A-Z] to match exactly one uppercase letter, and [a-z]+ to match one or more lowercase letters.
- Use re.fullmatch() to reject strings like Hello123 where digits appear after the lowercase letters, and H where no lowercase letters follow.
▼ Solution & Explanation
Explanation:
- [A-Z]: Matches exactly one uppercase ASCII letter. No quantifier is attached, so it cannot match zero or two uppercase letters.
- [a-z]+: Matches one or more lowercase letters immediately after the uppercase one. The + ensures the string cannot end at just the capital letter.
- Why H fails: The + on [a-z] requires at least one lowercase letter to follow the capital. A lone uppercase letter does not satisfy the pattern.
- Why HELLO fails: After matching the first H with [A-Z], the pattern expects lowercase letters. The remaining characters ELLO are uppercase, so [a-z]+ finds nothing to match and the overall match fails.
- Why Hello123 fails: re.fullmatch() requires the entire string to be consumed. After matching Hello, the digits 123 remain unmatched, causing the full match to fail.
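A sketch of the capitalised-word check described above. The results name is illustrative.

```python
import re

test_strings = ["Hello", "World", "python", "HELLO", "Hello123", "H", "Ha"]
pattern = r"[A-Z][a-z]+"  # exactly one capital, then one or more lowercase letters

results = {s: bool(re.fullmatch(pattern, s)) for s in test_strings}

for s, matched in results.items():
    print(f"{s:<9} -> {'Match' if matched else 'No match'}")
```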
Exercise 9: Match Start and End
Problem Statement: Write a Python program to match a string that starts with a, ends with b, and has any characters in between (e.g., a123b, axyzb).
Purpose: This exercise introduces the dot . wildcard and shows how re.fullmatch() does the job of the ^ and $ anchors without writing them explicitly. Matching by a known start and end while allowing arbitrary content in between is a practical technique used in file extension checks, protocol parsing, and delimiter-bounded field extraction.
Given Input: test_strings = ["a123b", "axyzb", "ab", "a b", "ab ", "b123a", "a123"]
Refer: Python regex re.match() for pattern matching
Expected Output:
a123b -> Match
axyzb -> Match
ab -> Match
a b -> Match
ab  -> No match (trailing space)
b123a -> No match
a123 -> No match
▼ Hint
- The dot . in a regular expression matches any single character except a newline by default.
- Combine . with * to allow zero or more characters between a and b. This means the string ab with nothing in between is also a valid match.
- The pattern a.*b used with re.fullmatch() will anchor it to the full string automatically, so explicit ^ and $ anchors are not needed in this case.
▼ Solution & Explanation
Explanation:
- .: The dot wildcard matches any single character except a newline. It does not represent a literal period; to match a literal dot you would need to escape it as \..
- .*: Combines the dot with the * quantifier to match zero or more of any character. This allows the middle section of the string to be empty, one character, or arbitrarily long.
- Why ab matches: The .* portion matches zero characters, so a immediately followed by b satisfies the pattern.
- Why ab (trailing space) fails: re.fullmatch() requires the entire string to be consumed. The trailing space is not part of the pattern, so the match fails. This highlights how re.fullmatch() is stricter than re.search() or re.match() for boundary checking.
- Greedy behaviour: By default .* is greedy and will match as many characters as possible while still allowing the overall pattern to succeed. In this pattern it consumes everything up to the last b in the string.
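A sketch of the start-and-end check. repr() is used when printing so that the trailing space in "ab " stays visible; that formatting choice and the results name are assumptions, not part of the exercise.

```python
import re

test_strings = ["a123b", "axyzb", "ab", "a b", "ab ", "b123a", "a123"]
pattern = r"a.*b"  # a, then any characters (possibly none), then b

results = {s: bool(re.fullmatch(pattern, s)) for s in test_strings}

for s, matched in results.items():
    # repr() shows the quotes, making the trailing space in 'ab ' visible
    print(f"{s!r:<8} -> {'Match' if matched else 'No match'}")
```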
Exercise 10: Match Word at Start
Problem Statement: Write a Python program to match a specific word only if it appears at the very beginning of a string.
Purpose: This exercise introduces the caret anchor ^, which asserts that a match must occur at the start of the string. Start-of-string anchoring is essential in command parsing, log processing, and any situation where the position of a token in a line carries meaning.
Given Input: test_strings = ["Hello world", "Hello", "Say Hello", "hello world", "HelloWorld"]
Refer: Python regex search
Expected Output:
Hello world -> Match
Hello -> Match
Say Hello -> No match
hello world -> No match
HelloWorld -> No match
▼ Hint
- Place ^ at the start of the pattern to anchor it to the beginning of the string.
- Use a word boundary \b after the word to ensure you are matching the whole word and not just a prefix (e.g., so that HelloWorld is not accepted as a match for Hello).
- Use re.match() or re.search() with ^ rather than re.fullmatch(), because the string may contain additional content after the target word.
▼ Solution & Explanation
Explanation:
- ^: The start-of-string anchor. It does not consume any characters; it simply asserts that the next part of the pattern must begin at position zero of the string.
- Hello: A literal sequence of five characters. The match is case-sensitive by default, so hello (all lowercase) does not satisfy this part of the pattern.
- \b: A word boundary assertion. It matches the position between a word character and a non-word character (or the start or end of the string), ensuring that Hello is treated as a complete word. Without it, HelloWorld would also match because Hello appears at the start.
- Why re.search() is used here instead of re.fullmatch(): The goal is only to check the beginning of the string. The string may legitimately contain more content after the word, as in Hello world. Using re.fullmatch() would incorrectly reject those valid strings.
- Why Say Hello fails: Although Hello appears in the string, it is not at the start. The ^ anchor fails at position zero because the string begins with S, not H.
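The anchored search might be written as in this sketch. The results name is illustrative.

```python
import re

test_strings = ["Hello world", "Hello", "Say Hello", "hello world", "HelloWorld"]
pattern = r"^Hello\b"  # Hello at position zero, ending on a word boundary

# search() is enough here because ^ already pins the match to the start
results = {s: bool(re.search(pattern, s)) for s in test_strings}

for s, matched in results.items():
    print(f"{s:<12} -> {'Match' if matched else 'No match'}")
```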
Exercise 11: Match Word at End
Problem Statement: Write a Python program to match a specific word only if it appears at the end of a string, ignoring any optional trailing punctuation.
Purpose: This exercise introduces the dollar anchor $ and combines it with an optional character class to handle real-world strings that may end with punctuation. End-of-string anchoring is commonly used in sentence parsing, command validation, and log line analysis where the final token carries meaning.
Given Input: test_strings = ["I love Python", "Python is great", "I love Python!", "python", "I love Python."]
Expected Output:
I love Python -> Match
Python is great -> No match
I love Python! -> Match
python -> No match
I love Python. -> Match
▼ Hint
- Place $ at the end of the pattern to anchor it to the end of the string.
- To allow optional trailing punctuation, add [.,!?]? just before the $. The ? makes the punctuation character optional.
- Use a word boundary \b before the target word to avoid matching it as a suffix of a longer word.
- Use re.search() rather than re.fullmatch(), as the word may be preceded by other content in the string.
▼ Solution & Explanation
Explanation:
- \b: A word boundary assertion placed before Python to ensure the match begins at a word edge and does not accidentally match a longer token like CPython.
- Python: A literal, case-sensitive sequence. The lowercase string python does not match because regular expressions are case-sensitive by default.
- [.,!?]?: A character class covering common punctuation marks, made optional by ?. This allows the word to be followed by at most one punctuation character before the end of the string.
- $: The end-of-string anchor. It asserts that nothing may follow the matched content, so Python is great fails because Python is not at the end.
- Why re.search() is used: The target word is preceded by other content in most test strings. re.search() scans the entire string for a match at any position, while still respecting the $ anchor to enforce the end-of-string constraint.
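A sketch that combines the pieces above into one pattern. The results name and the extra CPython check are illustrative.

```python
import re

test_strings = ["I love Python", "Python is great", "I love Python!",
                "python", "I love Python."]
# Word boundary, the literal word, one optional punctuation mark, end of string
pattern = r"\bPython[.,!?]?$"

results = {s: bool(re.search(pattern, s)) for s in test_strings}

for s, matched in results.items():
    print(f"{s:<16} -> {'Match' if matched else 'No match'}")
```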
Exercise 12: Find a Specific Letter
Problem Statement: Write a Python program to find all words in a string that contain the letter z.
Purpose: This exercise practises using re.findall() to extract multiple matches from a string in a single call. Scanning text for words that contain a specific character is a foundational technique in search tools, spell checkers, and vocabulary analysis.
Given Input: text = "The pizza was amazing but the fizz and buzz were too loud"
Expected Output: ['pizza', 'amazing', 'fizz', 'buzz']
Refer: Python regex find all matches
▼ Hint
- A word containing z can be broken down as: zero or more word characters, then a z, then zero or more word characters.
- Use \w to match any word character (letters, digits, and underscores) and combine it with * on each side of the z.
- Add word boundaries \b on both sides of the pattern to ensure you extract complete words rather than partial substrings.
- Use re.findall() to return all matches as a list in a single call.
▼ Solution & Explanation
Explanation:
- \w*: Matches zero or more word characters on either side of the z. Using * rather than + ensures the pattern also captures words where z appears at the very start or end, such as fizz or buzz.
- z: The literal character being searched for. Because the match is case-sensitive by default, uppercase Z would not be captured. To include both cases you could use [zZ] or pass the re.IGNORECASE flag.
- \b on both sides: Word boundary assertions ensure the pattern matches complete words only. Without them, the pattern could return partial matches from within longer tokens.
- re.findall(): Scans the entire string from left to right and returns a list of all non-overlapping matches. This is more concise than manually looping over words and checking each one with re.search().
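A sketch of the extraction described above. The variable name words is illustrative.

```python
import re

text = "The pizza was amazing but the fizz and buzz were too loud"

# Whole words that contain at least one z, at any position
words = re.findall(r"\b\w*z\w*\b", text)
print(words)
```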
Exercise 13: Find Letter in Middle
Problem Statement: Write a Python program to find words containing the letter z, but only if the z is not at the start or end of the word.
Purpose: This exercise builds on the previous one by adding positional constraints within a word. Requiring at least one character on both sides of a target letter is a practical technique used in linguistic pattern matching, morphological analysis, and filtering tokens by internal structure.
Given Input: text = "The pizza was amazing but the fizz and buzz were too loud"
Expected Output: ['pizza', 'amazing']
▼ Hint
- The difference from Exercise 12 is that the word must neither start nor end with z.
- Replacing \w* with \w+ on each side of the z is not enough on its own: \w also matches z, so \b\w+z\w+\b still accepts fizz (fi, then z, then the final z as the trailing \w+).
- Instead, require the first and last characters to be word characters other than z. The negated class [^\Wz] expresses that: any word character except z.
- Keep the \b word boundaries on the outside so that only complete words are returned. The full pattern is \b[^\Wz]\w*z\w*[^\Wz]\b.
▼ Solution & Explanation
Explanation:
- [^\Wz] at the start: Requires the first character of the word to be a word character that is not z. This eliminates words where z is the first letter.
- \w*z\w*: Requires a z somewhere between the first and last characters, with any number of word characters around it.
- [^\Wz] at the end: Requires the last character of the word to be a word character that is not z. This eliminates words like fizz and buzz where z is the final character.
- Contrast with Exercise 12: The * quantifier (zero or more) on both sides allowed z at any position, including the first or last character. Here the two [^\Wz] classes force every matched word to begin and end with something other than z.
- Why fizz and buzz are excluded: Both words end in z. The closing [^\Wz] must match the last character before the word boundary, and z is excluded from that class, so these words are filtered out.
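The sketch below compares two candidate patterns on the given input. Because \w can itself match z, the plain \w+z\w+ form still returns fizz and buzz. The second pattern, which excludes z from the first and last character of the word, is one way (an assumption, not necessarily the only one) to obtain the expected output. The names naive and strict are illustrative.

```python
import re

text = "The pizza was amazing but the fizz and buzz were too loud"

# \w+ on both sides: the trailing \w+ can match the final z of fizz
naive = re.findall(r"\b\w+z\w+\b", text)

# [^\Wz] means "a word character other than z" for the first and last position
strict = re.findall(r"\b[^\Wz]\w*z\w*[^\Wz]\b", text)

print(naive)
print(strict)
```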
Exercise 14: Match Adjacent Words
Problem Statement: Write a Python program to match if two consecutive words in a sentence both start with the letter P.
Purpose: This exercise practises matching multi-token patterns separated by whitespace. Detecting adjacent words that share a property is useful in natural language processing tasks such as identifying repeated initials, alliterative phrases, and consecutive proper nouns.
Given Input: test_strings = ["Peter Parker is here", "Paul and Peter met", "Pretty Please", "Python Programming is fun", "No match here"]
Expected Output:
Peter Parker is here -> Match: Peter Parker
Paul and Peter met -> No match
Pretty Please -> Match: Pretty Please
Python Programming is fun -> Match: Python Programming
No match here -> No match
▼ Hint
- A word starting with P can be matched with P\w*: a literal uppercase P followed by zero or more word characters.
- Between the two words there will be one or more whitespace characters. Use \s+ to match that gap.
- Combine the two word patterns and the whitespace into a single pattern: P\w*\s+P\w*.
- Use re.search() so the pair can be found anywhere within a longer sentence.
▼ Solution & Explanation
Explanation:
- P\w*: Matches an uppercase P followed by zero or more word characters. Using * rather than + means the single letter P on its own would also be a valid match. Because the pattern has no leading \b, the first P may also sit inside a longer word (as in HP Printer); prefix the pattern with \b if that matters.
- \s+: Matches one or more whitespace characters between the two words. Using + rather than a literal space handles edge cases such as multiple spaces or a tab character separating the words.
- result.group(): Returns the exact substring that was matched. This makes the output more informative by showing precisely which consecutive pair was found.
- Why Paul and Peter met fails: Although both Paul and Peter start with P, they are not consecutive. The word and sits between them, breaking the adjacency required by the pattern.
- Case sensitivity note: The pattern only matches words starting with uppercase P. To also match lowercase p, you could use [Pp]\w* or pass the re.IGNORECASE flag to re.search().
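A sketch of the adjacent-word search. The matches dictionary is an illustrative name; result follows the name used in the explanation.

```python
import re

test_strings = ["Peter Parker is here", "Paul and Peter met", "Pretty Please",
                "Python Programming is fun", "No match here"]
pattern = r"P\w*\s+P\w*"  # P-word, whitespace, P-word

matches = {}
for s in test_strings:
    result = re.search(pattern, s)
    # group() is the exact text of the matched pair
    matches[s] = result.group() if result else None
    outcome = f"Match: {result.group()}" if result else "No match"
    print(f"{s:<26} -> {outcome}")
```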
Exercise 15: Filter by Starting Letter
Problem Statement: Write a Python program to find all words starting with either a or e in a given string.
Purpose: This exercise practises using a character class to match any one of several possible starting characters. Filtering words by their first letter is a common requirement in text analysis, vocabulary sorting, concordance building, and educational language tools.
Given Input: text = "an eagle soared above the endless empty arena every afternoon"
Expected Output: ['an', 'eagle', 'above', 'endless', 'empty', 'arena', 'every', 'afternoon']
▼ Hint
- A word starting with a or e can be matched with a character class at the front: [ae] covers both options in a single concise expression.
- Follow the character class with \w* to capture the rest of the word after the initial letter.
- Add a \b word boundary at the start so the pattern matches only at the beginning of a word and not in the middle of one.
- Use re.findall() to collect all matching words from the string at once.
▼ Solution & Explanation
Explanation:
- \b: A word boundary placed at the start of the pattern ensures that matching begins only at the edge of a word. Without it, the pattern could match e or a appearing in the interior of a longer word.
- [ae]: A character class that matches either the lowercase letter a or the lowercase letter e. This is more concise than the alternation operator (a|e) for single-character options and integrates naturally with the rest of the pattern.
- \w*: Matches zero or more word characters following the initial letter, capturing the full remainder of the word. Using * means single-letter words like a are also captured if they appear in the text.
- re.findall(): Returns every non-overlapping match as a plain list of strings. Because the pattern contains no capturing groups, each element in the list is the full matched word rather than a tuple.
- Extending the pattern: To match words starting with any vowel, expand the character class to [aeiou]. To make the match case-insensitive and include uppercase initials, either use [aeAE] or pass re.IGNORECASE as a flag to re.findall().
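A sketch of the filter described above. The variable name words is illustrative.

```python
import re

text = "an eagle soared above the endless empty arena every afternoon"

# \b pins the match to a word start; [ae] picks the initial; \w* takes the rest
words = re.findall(r"\b[ae]\w*", text)
print(words)
```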
Exercise 16: Validate Alphanumeric ID
Problem Statement: Write a Python program to match a string that contains only uppercase letters, lowercase letters, numbers, and underscores, with no spaces or special characters allowed.
Purpose: This exercise practises building strict allowlist patterns for input validation. Alphanumeric-plus-underscore strings are the standard format for identifiers in most programming languages, database column names, and API keys. Being able to validate this format reliably is a foundational defensive-programming skill.
Given Input: test_strings = ["user_123", "User_Name", "invalid id", "bad-char!", "_leadingUnderscore", "ALL_CAPS_99"]
Expected Output:
user_123 -> Valid
User_Name -> Valid
invalid id -> Invalid
bad-char! -> Invalid
_leadingUnderscore -> Valid
ALL_CAPS_99 -> Valid
▼ Hint
- The shorthand \w matches any word character, which for ASCII text is exactly the set of letters (upper and lower), digits, and the underscore. This makes it a natural fit for this pattern.
- Use \w+ with re.fullmatch() to require the entire string to consist of one or more such characters, with nothing else permitted.
- No additional character class is needed because \w already excludes spaces, hyphens, exclamation marks, and all other special characters.
▼ Solution & Explanation
Explanation:
- \w: A shorthand character class. For ASCII text it is equivalent to [a-zA-Z0-9_]: it matches any uppercase or lowercase letter, any digit, and the underscore, and it does not match spaces, hyphens, punctuation, or any other special character. In Python 3, \w in a str pattern is Unicode-aware by default, so it also accepts letters and digits from other scripts (for example é); pass re.ASCII to restrict it to [a-zA-Z0-9_].
- \w+: Requires at least one word character. An empty string would not match, which is typically the correct behaviour for an identifier validator.
- re.fullmatch(): Enforces that every character in the string belongs to \w. A single disallowed character anywhere in the string, such as the space in invalid id or the hyphen in bad-char!, causes the entire match to fail.
- Why _leadingUnderscore is valid: The underscore is part of the \w character class, so a leading underscore is perfectly acceptable. This aligns with Python’s own identifier rules, where _name is a valid variable name.
- Alternative approach: You could write the explicit character class [a-zA-Z0-9_]+ instead of \w+. The two are equivalent for ASCII input (or when re.ASCII is set). \w is shorter and more idiomatic, while the explicit class is the stricter choice when non-ASCII letters must be rejected.
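A sketch of the validator, followed by a check of how \w treats a non-ASCII letter. The string caf\u00e9 is a hypothetical extra input that is not part of the exercise, and results, unicode_ok, and ascii_only are illustrative names.

```python
import re

test_strings = ["user_123", "User_Name", "invalid id", "bad-char!",
                "_leadingUnderscore", "ALL_CAPS_99"]

results = {s: bool(re.fullmatch(r"\w+", s)) for s in test_strings}

for s, valid in results.items():
    print(f"{s:<19} -> {'Valid' if valid else 'Invalid'}")

# Hypothetical extra input: \w is Unicode-aware for str patterns by default
unicode_ok = bool(re.fullmatch(r"\w+", "caf\u00e9"))
# With re.ASCII, \w is limited to [a-zA-Z0-9_]
ascii_only = bool(re.fullmatch(r"\w+", "caf\u00e9", flags=re.ASCII))
print(unicode_ok, ascii_only)
```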
Exercise 17: Check Starting Number
Problem Statement: Write a Python program to verify if a string starts with a specific number.
Purpose: This exercise practises anchoring a numeric pattern at the start of a string. Detecting a specific leading number is useful in tasks such as validating version strings, parsing log lines that begin with a status code, and routing input based on a numeric prefix.
Given Input: test_strings = ["42 is the answer", "42", "The answer is 42", "420 wide", "142 steps"], target number: 42
Expected Output:
42 is the answer -> Match
42 -> Match
The answer is 42 -> No match
420 wide -> No match
142 steps -> No match
▼ Hint
- Use the ^ anchor to assert that the number must appear at the very beginning of the string.
- Add a word boundary \b after the number so that 42 does not incorrectly match the start of 420 or 4200.
- Use re.search() or re.match(). Both respect the ^ anchor, but re.match() implicitly starts at the beginning of the string even without ^.
▼ Solution & Explanation
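A minimal sketch of one possible solution, built around the `rf"^{target}\b"` pattern that the explanation below walks through. The helper name `starts_with_number` is an assumption made for illustration.

```python
import re

def starts_with_number(text, target):
    """Return True when text begins with the target number as a whole token."""
    # ^ pins the match to position zero; \b stops 42 from matching inside 420.
    pattern = rf"^{target}\b"
    return re.search(pattern, text) is not None

test_strings = ["42 is the answer", "42", "The answer is 42", "420 wide", "142 steps"]
target = 42

for s in test_strings:
    result = "Match" if starts_with_number(s, target) else "No match"
    print(f"{s} -> {result}")
```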
Explanation:
- `rf"^{target}\b"`: An f-string prefixed with both `r` and `f`. The `r` prefix treats backslashes as raw characters (needed for `\b`), and the `f` prefix allows the `{target}` variable to be interpolated into the pattern at runtime. This makes the pattern reusable for any target number without editing the regex directly.
- `^`: Anchors the match to the very start of the string. Strings like `The answer is 42` and `142 steps` fail immediately because their first character is not a digit matching the target.
- `\b` after the number: A word boundary that prevents the pattern from matching `42` at the start of `420`. Without this, `420 wide` would incorrectly be reported as a match because its first two characters are `42`.
- Why `142 steps` fails: The `^` anchor requires the match to start at position zero. The string begins with `1`, not `4`, so the pattern fails before consuming any characters.
Exercise 18: Number at End
Problem Statement: Write a Python program to check if a string ends with a number.
Purpose: This exercise practises anchoring a numeric pattern at the end of a string using the $ anchor combined with \d. Detecting a trailing number is useful when processing filenames with version suffixes, log entries that end with a numeric code, or any structured string where a numeric tail carries meaning.
Given Input: test_strings = ["version 2", "file_backup_3", "hello", "order 99b", "track5", "2024"]
Expected Output:
version 2 -> Ends with a number
file_backup_3 -> Ends with a number
hello -> Does not end with a number
order 99b -> Does not end with a number
track5 -> Ends with a number
2024 -> Ends with a number
▼ Hint
- Use `\d` to match any digit character (0-9).
- Place `$` at the end of the pattern to assert that the digit must be the very last character of the string.
- Use `\d+` rather than `\d` if you want to match one or more trailing digits as a group, though for this check either form works since you only need to confirm the string ends with at least one digit.
▼ Solution & Explanation
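A minimal sketch of one possible solution, assuming the `\d+$` pattern used with `re.search()` as described in the explanation below. The helper name `ends_with_number` is hypothetical.

```python
import re

def ends_with_number(text):
    """Return True when the string finishes with one or more digits."""
    return re.search(r"\d+$", text) is not None

test_strings = ["version 2", "file_backup_3", "hello", "order 99b", "track5", "2024"]

for s in test_strings:
    label = "Ends with a number" if ends_with_number(s) else "Does not end with a number"
    print(f"{s} -> {label}")
```

One caveat worth knowing: `$` also matches just before a single trailing newline, so `"track5\n"` would pass this check. If that matters for your input, `\Z` matches only at the true end of the string.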
Explanation:
- `\d`: A shorthand character class that matches any single decimal digit, equivalent to `[0-9]` for ASCII input. It does not match letters, underscores, or any other character.
- `\d+`: Matches one or more consecutive digits. Using `+` means the pattern captures the full trailing numeric run (e.g., `2024` rather than just `4`), which is useful if you later want to extract the value via `result.group()`.
- `$`: Anchors the match so the digit sequence must appear at the very end of the string (or just before a single trailing newline). Combined with `re.search()`, the engine scans the string for a digit run that terminates exactly at the last position.
- Why `order 99b` fails: The string ends with the letter `b`, not a digit. Even though digits appear earlier in the string, the `$` anchor requires the final character to satisfy `\d`.
- Extracting the number: If you need the trailing number itself rather than just confirming its presence, replace the print statement with `print(result.group())` to display the matched digit sequence.
Exercise 19: Clean IP Addresses
Problem Statement: Write a Python program to remove leading zeros from each segment of an IP address (e.g., convert 192.168.001.001 to 192.168.1.1).
Purpose: This exercise introduces re.sub() with a callable replacement function, a powerful technique that goes beyond simple string substitution. Normalising IP address segments is a practical data-cleaning task encountered in network log processing, configuration file parsing, and input sanitisation pipelines.
Given Input: ip_addresses = ["192.168.001.001", "010.000.000.001", "255.255.255.000", "192.168.1.1"]
Expected Output:
192.168.001.001 -> 192.168.1.1
010.000.000.001 -> 10.0.0.1
255.255.255.000 -> 255.255.255.0
192.168.1.1 -> 192.168.1.1
Refer: Python re.sub() regex replace
▼ Hint
- Use `re.sub()` with a pattern that matches each numeric segment and a replacement function that converts the matched string to an integer using `int()` and then back to a string with `str()`. Converting to `int` automatically strips any leading zeros.
- The pattern `\d+` will match each individual numeric segment between the dots, since the dots themselves are not digits and are therefore skipped.
- Pass a lambda as the replacement argument to `re.sub()`: `lambda m: str(int(m.group()))`.
▼ Solution & Explanation
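A minimal sketch of one possible solution, using the `\d+` pattern and the lambda replacement named in the hint. The helper name `strip_leading_zeros` is an assumption for illustration.

```python
import re

def strip_leading_zeros(ip):
    """Normalise each numeric segment by round-tripping it through int()."""
    return re.sub(r"\d+", lambda m: str(int(m.group())), ip)

ip_addresses = ["192.168.001.001", "010.000.000.001", "255.255.255.000", "192.168.1.1"]

for ip in ip_addresses:
    print(f"{ip} -> {strip_leading_zeros(ip)}")
```

This sketch only normalises the digits; it does not check that the result is a valid IP address (four segments, each between 0 and 255).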
Explanation:
- `re.sub(pattern, repl, string)`: Finds every non-overlapping match of `pattern` in `string` and replaces each one with the value returned by `repl`. When `repl` is a callable rather than a plain string, it receives the match object as its argument and its return value is used as the replacement text.
- `\d+`: Matches each run of one or more consecutive digits. Because the dot separators in the IP address are not digits, `re.sub()` naturally processes each of the four segments independently without needing to split the string manually.
- `lambda m: str(int(m.group()))`: The replacement function receives a match object `m` for each segment. `m.group()` returns the matched text (e.g., `"001"`), `int()` converts it to an integer (dropping leading zeros), and `str()` converts it back to a string for substitution.
- Why `192.168.1.1` is unchanged: The segments `192`, `168`, `1`, and `1` have no leading zeros, so converting them to `int` and back to `str` produces the same value. `re.sub()` handles already-clean inputs safely.
- Alternative without a lambda: You could split on `.`, apply `str(int(seg))` to each part in a list comprehension, and rejoin with `.`. The `re.sub()` approach is more concise and generalises better to less-regular input formats.
Exercise 20: Convert Date Format
Problem Statement: Write a Python program to convert a date string from yyyy-mm-dd format to dd-mm-yyyy format.
Purpose: This exercise introduces capturing groups in re.sub(), one of the most practical regex techniques for restructuring text. Reformatting dates is a ubiquitous data-wrangling task in ETL pipelines, report generation, and any system that exchanges data between regions with different date conventions.
Given Input: dates = ["2024-01-15", "1999-12-31", "2000-07-04", "2024-11-05"]
Expected Output:
2024-01-15 -> 15-01-2024
1999-12-31 -> 31-12-1999
2000-07-04 -> 04-07-2000
2024-11-05 -> 05-11-2024
▼ Hint
- Wrap each part of the date in a capturing group using parentheses: one group for the year, one for the month, and one for the day.
- In the replacement string, refer to the captured groups using backreferences: `\1` for the first group (year), `\2` for the second (month), and `\3` for the third (day).
- To swap the format, write the replacement string as `\3-\2-\1`, which places the day first, then the month, then the year.
▼ Solution & Explanation
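A minimal sketch of one possible solution, assuming three capturing groups and the `\3-\2-\1` replacement described in the hint. The helper names `to_dd_mm_yyyy` and `to_dd_mm_yyyy_named` are hypothetical; the second shows the named-group variant mentioned in the explanation below.

```python
import re

def to_dd_mm_yyyy(date):
    """Reorder yyyy-mm-dd into dd-mm-yyyy using numbered backreferences."""
    return re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3-\2-\1", date)

def to_dd_mm_yyyy_named(date):
    """Same conversion, written with named groups for readability."""
    pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    return re.sub(pattern, r"\g<day>-\g<month>-\g<year>", date)

dates = ["2024-01-15", "1999-12-31", "2000-07-04", "2024-11-05"]

for d in dates:
    print(f"{d} -> {to_dd_mm_yyyy(d)}")
```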
Explanation:
- `(\d{4})`: The first capturing group. It matches exactly four consecutive digits and captures them as group 1, representing the year portion of the date.
- `(\d{2})`: Used twice: once for the month (group 2) and once for the day (group 3). Each matches exactly two consecutive digits, corresponding to the zero-padded month and day values in the source format.
- Backreferences in the replacement string: `\1`, `\2`, and `\3` refer to the text captured by the first, second, and third groups respectively. Writing `\3-\2-\1` reorders them to day-month-year without any manual string slicing.
- Zero-padding is preserved: Because the groups capture the raw digit strings rather than converting them to integers, leading zeros in the month and day (e.g., `01`, `07`) are carried over unchanged into the output. This is the correct behaviour for date formatting.
- Named groups as an alternative: For improved readability in complex patterns, you can use named groups: `(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})` and reference them in the replacement as `\g<day>-\g<month>-\g<year>`. Named groups make the intent of each captured segment self-documenting.
Exercise 21: Extract 1-3 Digit Numbers
Problem Statement: Write a Python program to search a string and extract all numbers that are between 1 and 3 digits long.
Purpose: This exercise practises combining numeric patterns with range quantifiers and word boundaries to extract only the numbers that satisfy a length constraint. Selective numeric extraction is commonly needed in data parsing tasks where very large numbers (such as timestamps or IDs) must be excluded from results that are intended to capture shorter codes, counts, or scores.
Given Input: text = "There are 3 cats, 12 dogs, 500 fish, 1000 birds, and 42 turtles in the sanctuary"
Expected Output: ['3', '12', '500', '42']
▼ Hint
- Use `\d{1,3}` to match a run of one to three digits.
- Wrap the pattern in word boundaries `\b` on both sides. Without them, `\d{1,3}` would match the first three digits of a longer number like `1000` and incorrectly include it in the results.
- Use `re.findall()` to return all qualifying matches as a list of strings in a single call.
▼ Solution & Explanation
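A minimal sketch of one possible solution, assuming the `\b\d{1,3}\b` pattern described in the hint. The helper name `extract_short_numbers` is an assumption for illustration.

```python
import re

def extract_short_numbers(text):
    """Return every standalone number that is one to three digits long."""
    return re.findall(r"\b\d{1,3}\b", text)

text = "There are 3 cats, 12 dogs, 500 fish, 1000 birds, and 42 turtles in the sanctuary"

matches = extract_short_numbers(text)
print(matches)

# Without the boundaries, 1000 is split into partial matches instead of being excluded.
print(re.findall(r"\d{1,3}", text))
```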
Explanation:
- `\d{1,3}`: A range quantifier that matches a consecutive run of digits with a minimum length of one and a maximum of three. On its own, without boundaries, this is a greedy sub-pattern that will match within any larger digit sequence.
- `\b` on both sides: Word boundaries are essential here. They assert that the digit run must be bordered by a non-word character (or the start or end of the string) on each side. This prevents `1000` from being partially matched as `100` and correctly excludes it from the results entirely.
- Why `1000` is excluded: The number `1000` has four digits. The opening `\b` anchors the pattern at the start of the digit run, and `\d{1,3}` can only match up to three of those digits. The closing `\b` then finds itself between two digit characters, which is not a word boundary, so the match fails for the whole token.
- `re.findall()`: Returns each match as a plain string rather than a match object. Because the pattern contains no capturing groups, the full matched text is returned for each hit, giving a clean list of number strings that can be converted to integers with `list(map(int, matches))` if needed.
Exercise 22: Search Literal Strings
Problem Statement: Write a Python program to search for a set of specific literal strings within a larger text and report which ones are found and where.
Purpose: This exercise introduces the alternation operator |, which allows a single pattern to match any one of several fixed strings. Searching for multiple literals simultaneously is more efficient than running separate searches and is widely used in keyword filtering, content moderation, and log scanning.
Given Input: text = "The quick brown fox jumps over the lazy dog", target words: fox and dog
Expected Output:
Found "fox" at index 16-19
Found "dog" at index 40-43
▼ Hint
- Use the alternation operator `|` to combine the target words into a single pattern: `fox|dog`.
- Wrap each alternative in word boundaries to avoid partial matches inside longer words (e.g., matching `dog` inside `hotdog`).
- Use `re.finditer()` rather than `re.findall()` so you have access to the match object and can call `.start()` and `.end()` on each result to retrieve the exact position.
▼ Solution & Explanation
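A minimal sketch of one possible solution, assuming a grouped alternation wrapped in word boundaries and scanned with `re.finditer()`. The helper name `find_words` is hypothetical, and building the pattern from a list with `re.escape()` is an added convenience rather than something the exercise requires.

```python
import re

def find_words(text, words):
    """Return (word, start, end) for every whole-word hit of any target word."""
    # Escape each word, then join with | so the list becomes e.g. fox|dog.
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = rf"\b({alternatives})\b"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

text = "The quick brown fox jumps over the lazy dog"

for word, start, end in find_words(text, ["fox", "dog"]):
    print(f'Found "{word}" at index {start}-{end}')
```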
Explanation:
- `fox|dog`: The alternation operator `|` instructs the regex engine to attempt matching `fox` first and, if that fails at the current position, to attempt `dog`. The alternatives are evaluated left to right. Any number of alternatives can be chained with additional `|` characters.
- Parentheses around the alternatives: The grouping `(fox|dog)` ensures the `|` operator applies only between `fox` and `dog`, not to any surrounding pattern elements. Without parentheses, a pattern like `\bfox|dog\b` would be parsed as `(\bfox)` or `(dog\b)`, producing incorrect boundary behaviour.
- `re.finditer()`: Returns an iterator of match objects rather than a list of strings. This gives access to positional metadata for each hit without storing all matches in memory at once, which matters for large texts.
- `match.start()` and `match.end()`: Return the start index (inclusive) and end index (exclusive) of the matched substring within the original string. For `fox` in this text, `start()` returns `16` and `end()` returns `19`, meaning the match spans characters at positions 16, 17, and 18.
Exercise 23: Find Pattern Location
Problem Statement: Write a Python program to find a literal string in a text and return its exact starting and ending index position.
Purpose: This exercise focuses on using re.search() to locate a single pattern and then extracting precise position information from the resulting match object. Knowing the exact span of a match is essential in text editors, syntax highlighters, and any tool that needs to annotate or replace a specific region of a string.
Given Input: text = "The quick brown fox jumps over the lazy dog", target: "brown fox"
Expected Output: Found "brown fox" at start=10, end=19
▼ Hint
- Use `re.search()` with the literal target string as the pattern. Because you are searching for a fixed phrase with no special regex characters, no escaping is needed for this input, but using `re.escape()` around the target is a good habit to protect against inputs that contain characters like `.` or `*`.
- Check that the result is not `None` before accessing the match object.
- Use `.start()` and `.end()` on the match object, or use `.span()` to get both values as a tuple in a single call.
▼ Solution & Explanation
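A minimal sketch of one possible solution, assuming `re.search()` on an escaped literal as described in the hint. The helper name `find_span` is an assumption for illustration.

```python
import re

def find_span(text, target):
    """Return (start, end) of the first literal occurrence, or None if absent."""
    match = re.search(re.escape(target), text)
    return match.span() if match else None

text = "The quick brown fox jumps over the lazy dog"
target = "brown fox"

span = find_span(text, target)
if span is not None:
    start, end = span
    print(f'Found "{target}" at start={start}, end={end}')
else:
    print(f'"{target}" not found')
```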
Explanation:
- `re.escape(target)`: Escapes any characters in the target string that have special meaning in a regular expression, such as `.`, `*`, `+`, or `(`. For the input `"brown fox"` this makes no visible difference, but it is the correct practice whenever the search term comes from user input or an external source where special characters cannot be guaranteed absent.
- `re.search()`: Scans the entire string from left to right and returns the first match object it finds, or `None` if the pattern is not present. Unlike `re.match()`, it does not restrict the search to the start of the string.
- `match.start()` and `match.end()`: Return the zero-based start and end positions of the matched substring. The end index is exclusive, following Python’s standard slice convention. For `"brown fox"`, which begins at position 10, the end value is 19 because the substring occupies indices 10 through 18.
- `match.span()` as an alternative: Calling `match.span()` returns the tuple `(start, end)` in a single call. This is convenient when you need to pass the position to another function or unpack it with `start, end = match.span()`.
Exercise 24: Find All Substrings
Problem Statement: Write a Python program to find all occurrences of a specific substring within a string using re.findall().
Purpose: This exercise demonstrates how re.findall() handles repeated occurrences of the same substring and builds familiarity with its return behaviour. Counting and collecting all occurrences of a substring is a routine operation in text analysis, frequency counting, and search-and-highlight features.
Given Input: text = "cat and cattle and catfish and catch and tomcat", target: "cat"
Expected Output:
Occurrences of "cat": ['cat', 'cat', 'cat', 'cat', 'cat']
Total count: 5
▼ Hint
- Pass the literal target string directly to `re.findall()`. It will return a list containing one entry per occurrence.
- For this exercise the search is intentionally substring-level, not whole-word, so do not add `\b` word boundaries. The goal is to find `"cat"` wherever it appears, including inside words like `cattle`, `catfish`, and `tomcat`.
- Use `len()` on the returned list to get the total count without any additional counting logic.
▼ Solution & Explanation
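A minimal sketch of one possible solution, assuming `re.findall()` on an escaped literal with no word boundaries. The helper name `find_all` is hypothetical.

```python
import re

def find_all(text, target):
    """Return every literal occurrence of target, including those inside longer words."""
    return re.findall(re.escape(target), text)

text = "cat and cattle and catfish and catch and tomcat"
target = "cat"

matches = find_all(text, target)
print(f'Occurrences of "{target}": {matches}')
print(f"Total count: {len(matches)}")
```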
Explanation:
- `re.escape(target)`: Wraps the target string to neutralise any regex metacharacters it might contain. For `"cat"` this has no effect, but it makes the code robust against targets like `"c.t"`, which without escaping would be interpreted as a regex pattern rather than a literal string.
- No word boundaries by design: This exercise deliberately omits `\b` to perform a raw substring search. All five occurrences of `"cat"` are captured regardless of whether they appear as standalone words (`cat`), prefixes (`cattle`, `catfish`, `catch`), or suffixes (`tomcat`). Compare this with Exercise 12, where `\b` was used to restrict matches to complete words.
- `re.findall()` return value: When the pattern contains no capturing groups, `re.findall()` returns a list of the matched strings themselves. Every element is identical here because the pattern is a fixed literal, but the list length directly encodes the frequency of the substring.
- `len(matches)`: A straightforward way to obtain the occurrence count without a separate loop or counter variable. It is equivalent to `text.count(target)` for plain substring counting, but the regex approach scales to more complex patterns where `str.count()` is not applicable.
Exercise 25: Iterate Matches
Problem Statement: Write a Python program to find the occurrence and position of all matches of a substring within a string using re.finditer().
Purpose: This exercise demonstrates how re.finditer() differs from re.findall() by returning full match objects rather than plain strings. Having access to the position of every occurrence, alongside the matched text itself, is essential in tools that need to annotate, highlight, or replace matches at precise locations within a document.
Given Input: text = "cat and cattle and catfish and catch and tomcat", target: "cat"
Expected Output:
Match 1: "cat" found at position 0-3
Match 2: "cat" found at position 8-11
Match 3: "cat" found at position 19-22
Match 4: "cat" found at position 31-34
Match 5: "cat" found at position 44-47
Refer: Python regex capturing groups
▼ Hint
- Use `re.finditer()` in place of `re.findall()`. It returns an iterator of match objects, each of which carries both the matched text and its position.
- Use `enumerate()` on the iterator to get a running match number alongside each match object, which makes the output easier to read.
- Access `match.group()` for the matched text, and `match.start()` and `match.end()` for the positional span.
▼ Solution & Explanation
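A minimal sketch of one possible solution, assuming `re.finditer()` wrapped in `enumerate(..., start=1)` as the hint suggests. The helper name `iter_matches` is an assumption for illustration.

```python
import re

def iter_matches(text, target):
    """Yield (number, matched_text, start, end) for each literal occurrence."""
    for number, match in enumerate(re.finditer(re.escape(target), text), start=1):
        yield number, match.group(), match.start(), match.end()

text = "cat and cattle and catfish and catch and tomcat"
target = "cat"

for number, found, start, end in iter_matches(text, target):
    print(f'Match {number}: "{found}" found at position {start}-{end}')
```

Because `iter_matches` is a generator, matches are produced one at a time; wrap the call in `list()` if you need them all at once.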
Explanation:
- `re.finditer()`: Returns a lazy iterator of match objects rather than materialising all results into a list at once. This is more memory-efficient than `re.findall()` for large texts, because each match object is produced and processed one at a time.
- `enumerate(..., start=1)`: Wraps the iterator to produce `(counter, match_object)` pairs. The `start=1` argument makes the counter begin at 1 instead of 0, which reads more naturally in human-facing output like `Match 1`, `Match 2`, and so on.
- `match.group()`: Returns the exact text that was matched. In this exercise every match is the same string `"cat"`, but for variable patterns this method would return different text for each hit.
- Contrast with Exercise 24: Both exercises search the same text for the same target. Exercise 24 uses `re.findall()` to retrieve matched strings and a total count. This exercise uses `re.finditer()` to retrieve matched strings together with their exact positions. The two functions are complementary: use `re.findall()` when you only need the values, and `re.finditer()` when you also need positional information or want to process matches one at a time without building a full list in memory.
Exercise 26: Extract Date from URL
Problem Statement: Write a Python program to extract the year, month, and day components from a URL string formatted as https://example.com/yyyy/mm/dd/article-slug.
Purpose: This exercise practises using multiple capturing groups to pull structured data out of a predictably formatted string. Extracting date segments from URLs is a common task in web scraping, content management systems, and analytics pipelines where publication dates are embedded in permalink structures.
Given Input: urls = ["https://example.com/2026/05/22/my-article", "https://news.site.org/2019/11/03/breaking-story", "https://blog.example.com/2023/07/30/summer-update"]
Expected Output:
URL: https://example.com/2026/05/22/my-article
Year: 2026 | Month: 05 | Day: 22
URL: https://news.site.org/2019/11/03/breaking-story
Year: 2019 | Month: 11 | Day: 03
URL: https://blog.example.com/2023/07/30/summer-update
Year: 2023 | Month: 07 | Day: 30
▼ Hint
- Use three capturing groups, one for each date component: `(\d{4})` for the four-digit year, `(\d{2})` for the two-digit month, and `(\d{2})` for the two-digit day.
- Separate the groups with a literal forward slash `/` to match the URL path structure.
- Use `re.search()` to locate the date pattern anywhere within the URL string, then unpack the three groups from the match object using `match.groups()`.
▼ Solution & Explanation
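A minimal sketch of one possible solution, assuming the slash-delimited pattern `/(\d{4})/(\d{2})/(\d{2})/` discussed in the explanation below. The helper name `extract_date` is hypothetical.

```python
import re

DATE_IN_PATH = r"/(\d{4})/(\d{2})/(\d{2})/"

def extract_date(url):
    """Return (year, month, day) as strings, or None when no date path is present."""
    match = re.search(DATE_IN_PATH, url)
    return match.groups() if match else None

urls = [
    "https://example.com/2026/05/22/my-article",
    "https://news.site.org/2019/11/03/breaking-story",
    "https://blog.example.com/2023/07/30/summer-update",
]

for url in urls:
    parts = extract_date(url)
    if parts is not None:
        year, month, day = parts
        print(f"URL: {url}")
        print(f"Year: {year} | Month: {month} | Day: {day}")
```

The pattern checks only the shape of the path. It would also accept out-of-range values such as month `13`, so validate the parts separately if that matters.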
Explanation:
- Leading and trailing `/` in the pattern: The forward slashes outside the capturing groups anchor each date segment within the URL path structure. This prevents the pattern from accidentally matching a four-digit number that appears in a different part of the URL, such as a port number or a numeric slug.
- `(\d{4})`, `(\d{2})`, `(\d{2})`: Three capturing groups that isolate the year, month, and day respectively. The fixed-width quantifiers mirror the expected format exactly, so a three-digit year or single-digit month would not match.
- `match.groups()`: Returns all captured groups as a tuple in the order they appear in the pattern. Unpacking directly into `year, month, day` gives each value a meaningful name without needing to index into the tuple manually.
- Zero-padding is preserved: Because the groups capture raw digit strings rather than converting to integers, leading zeros in the month and day (e.g., `05`, `03`) are retained in the output. This matches the source format and avoids an unintended change in representation.
- Named groups as an alternative: The pattern could be rewritten as `/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/`, after which values can be accessed by name with `match.group("year")` and so on. Named groups improve readability when a pattern has many components.
Exercise 27: Extract All Numbers
Problem Statement: Write a Python program to separate and extract all numeric values from a mixed string of text and digits.
Purpose: This exercise practises using re.findall() with a digit pattern to strip numbers out of unstructured mixed content. Extracting numeric values from prose is a frequent requirement in data entry parsing, invoice processing, scientific text mining, and any pipeline that ingests human-written content containing figures.
Given Input: text = "In 2024 there were 1200 participants across 3 events, with scores of 98.5, 76, and 100"
Expected Output: ['2024', '1200', '3', '98.5', '76', '100']
▼ Hint
- To capture both integers and decimal numbers, your pattern needs to handle an optional fractional part: one or more digits, followed optionally by a dot and one or more further digits.
- The pattern `\d+\.?\d*` matches a run of digits, an optional dot, and zero or more digits after the dot. This covers integers like `2024` and decimals like `98.5`.
- Use `re.findall()` to collect all matches as a list of strings in one call.
▼ Solution & Explanation
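A minimal sketch of one possible solution, assuming the `\d+\.?\d*` pattern from the hint. The helper name `extract_numbers` and its `strict` parameter are hypothetical; `strict=True` switches to the `\d+\.\d+|\d+` alternation mentioned in the explanation below.

```python
import re

def extract_numbers(text, strict=False):
    """Return integers and decimals found in text as a list of strings."""
    # The strict form never captures a trailing dot such as the one in "98."
    pattern = r"\d+\.\d+|\d+" if strict else r"\d+\.?\d*"
    return re.findall(pattern, text)

text = "In 2024 there were 1200 participants across 3 events, with scores of 98.5, 76, and 100"

matches = extract_numbers(text)
print(matches)

# Convert to numeric values when arithmetic is needed.
values = [float(n) for n in matches]
print(sum(values))
```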
Explanation:
- `\d+`: Matches one or more consecutive digit characters before any decimal point. This handles pure integers like `2024`, `1200`, `3`, `76`, and `100`, and also anchors the start of a decimal number such as `98.5`.
- `\.?`: Matches a literal dot zero or one time. The backslash is necessary because an unescaped `.` in a regex pattern matches any character. The `?` makes it optional so that integers without a fractional part are still matched.
- `\d*`: Matches zero or more digits after the optional dot. Using `*` rather than `+` means a trailing dot with no digits (e.g., `98.`) is still captured, with the fractional part being empty. If you need to exclude such cases, replace `\d*` with `\d+` and use a full alternation: `\d+\.\d+|\d+`.
- Results are strings: `re.findall()` always returns strings, not numeric types. To work with the values arithmetically, convert them with `float(n)` or `int(n)` as appropriate: `[float(n) for n in matches]`.
Exercise 28: Extract Email Addresses
Problem Statement: Write a Python program to extract all valid email addresses from a large block of unstructured text.
Purpose: This exercise practises constructing a multi-part pattern that mirrors the structural rules of a real-world format. Email extraction is one of the most common practical applications of regular expressions, appearing in contact harvesting tools, data cleaning pipelines, and communication platform integrations.
Refer: Regex Special Sequences and Character classes
Given Input:
text = """Please reach out to support@example.com for help. You can also contact the team at admin.team@company.org or sales@shop.co.uk. Invalid addresses like @nodomain and user@ should be ignored. For billing queries write to billing_dept+invoices@finance.example.net."""
Expected Output: ['support@example.com', 'admin.team@company.org', 'sales@shop.co.uk', 'billing_dept+invoices@finance.example.net']
▼ Hint
- An email address has three structural parts: the local part before the `@`, the `@` symbol itself, and the domain part after it.
- The local part can contain letters, digits, dots, underscores, hyphens, and plus signs. Match it with a character class such as `[\w.+\-]+`.
- The domain part consists of one or more labels separated by dots, where each label contains letters, digits, or hyphens. The top-level domain (e.g., `com`, `org`, `uk`) must have at least two characters.
- Use `re.findall()` to collect all matches. Keep the pattern free of capturing groups (use a non-capturing group `(?:...)` wherever grouping is needed) so that each complete email address is returned as one string.
▼ Solution & Explanation
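A minimal sketch of one possible solution, assembling the five pattern pieces that the explanation below describes. The helper name `extract_emails` is an assumption, and the line breaks in the sample text are a guess at the original layout. This pattern is a pragmatic approximation and does not implement the full email address grammar.

```python
import re

# local part, "@", first domain label, optional extra labels, top-level domain
EMAIL_PATTERN = r"[\w.+\-]+@[\w\-]+(?:\.[\w\-]+)*\.[a-zA-Z]{2,}"

def extract_emails(text):
    """Return every substring that looks like an email address."""
    return re.findall(EMAIL_PATTERN, text)

text = """Please reach out to support@example.com for help.
You can also contact the team at admin.team@company.org or sales@shop.co.uk.
Invalid addresses like @nodomain and user@ should be ignored.
For billing queries write to billing_dept+invoices@finance.example.net."""

emails = extract_emails(text)
print(emails)
```

Note that the sentence-ending full stops after `sales@shop.co.uk` and `finance.example.net` are not captured, because the final `\.[a-zA-Z]{2,}` requires at least two letters after a dot.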
Explanation:
- `[\w.+\-]+`: Matches the local part of the email address. The character class includes word characters (`\w`, which covers letters, digits, and underscores), dots, plus signs, and hyphens. The `+` quantifier requires at least one character, which rejects entries like `@nodomain` that have nothing before the `@`.
- `@`: A literal at-sign acting as the required separator between the local part and the domain. Its presence is mandatory, so bare strings without `@` are never matched.
- `[\w\-]+`: Matches the first label of the domain (e.g., `example`, `company`, `shop`). This rejects `user@` because there is nothing after the `@` to satisfy the `+` quantifier.
- `(?:\.[\w\-]+)*`: A non-capturing group that matches zero or more additional dot-separated domain labels. The `?:` prefix means the group is used purely to apply the `*` repetition to the whole `\.[\w\-]+` unit, without creating a capture that would affect the return value of `re.findall()`. This handles subdomains such as `finance.example` in the final test address.
- `\.[a-zA-Z]{2,}`: Matches the mandatory top-level domain: a literal dot followed by at least two letters. This enforces that the address ends with a recognisable TLD (e.g., `.com`, `.org`, `.uk`, `.net`) and rejects fragments that trail off without one.
Exercise 29: Swap Characters
Problem Statement: Write a Python program to replace all whitespace characters with an underscore, and all underscores with a whitespace, in a single pass over the string.
Purpose: This exercise introduces the technique of performing two simultaneous character substitutions without one replacement interfering with the other. True in-place swapping requires a strategy that distinguishes the original characters from those that have already been substituted, and is a practical problem in slug generation, identifier normalisation, and format conversion pipelines.
Given Input: test_strings = ["hello world", "hello_world", "the quick_brown fox_jumps", "no_change"]
Expected Output:
hello world -> hello_world
hello_world -> hello world
the quick_brown fox_jumps -> the_quick brown_fox jumps
no_change -> no change
▼ Hint
- Running two separate `re.sub()` calls in sequence will not work: the second call will undo part of the first. You need to handle both substitutions in a single pass.
- Use a pattern that matches either a space or an underscore: `[ _]`.
- Pass a lambda as the replacement argument. Inside the lambda, check what character was matched using `m.group()` and return the opposite character.
▼ Solution & Explanation
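A minimal sketch of one possible solution, assuming the `[ _]` character class and a lambda replacement as described in the hint. The helper name `swap_space_underscore` is hypothetical. Like the hint's pattern, it handles the plain space character only, not tabs or newlines.

```python
import re

def swap_space_underscore(text):
    """Swap every space with an underscore and vice versa in a single pass."""
    return re.sub(r"[ _]", lambda m: "_" if m.group() == " " else " ", text)

test_strings = ["hello world", "hello_world", "the quick_brown fox_jumps", "no_change"]

for s in test_strings:
    print(f"{s} -> {swap_space_underscore(s)}")

# What goes wrong with two sequential substitutions: every underscore is lost.
broken = re.sub(r"_", " ", re.sub(r" ", "_", "the quick_brown fox_jumps"))
print(broken)
```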
Explanation:
- Why two sequential `re.sub()` calls fail: If you first replace spaces with underscores and then replace underscores with spaces, the second call converts both the original underscores and the newly inserted ones back to spaces, producing a result with no underscores at all. The single-pass approach avoids this by deciding the replacement for each character before any substitutions have been written back into the string.
- `[ _]`: A character class that matches either a single space or a single underscore. Each character is handled individually as the regex engine scans left to right, so mixed strings like `the quick_brown fox_jumps` are processed correctly in one pass.
- Lambda as the replacement: The callable form of `re.sub()` receives a fresh match object for each character hit. The lambda inspects `m.group()` and returns the opposite character. Because the decision is made per-match before the string is modified, there is no risk of a substituted character being re-evaluated.
- Alternative using a translation table: For plain character-for-character swaps without regex, `text.translate(str.maketrans(" _", "_ "))` performs the same operation in a single pass and is slightly more efficient for this specific case. The regex approach is shown here because it scales to more complex conditional substitutions that `str.translate()` cannot handle.
Exercise 30: Replace Multiple Delimiters
Problem Statement: Write a Python program to replace all occurrences of spaces, commas, and dots in a string with a colon.
Purpose: This exercise demonstrates how a single re.sub() call with a character class can replace multiple different delimiters simultaneously, replacing the need for chained str.replace() calls. Normalising mixed delimiters into a single consistent separator is a standard data-cleaning step in CSV processing, configuration parsing, and token splitting.
Given Input: test_strings = ["one two three", "one,two,three", "one.two.three", "one, two. three", "no.delimiters,here today"]
Expected Output:
one two three -> one:two:three
one,two,three -> one:two:three
one.two.three -> one:two:three
one, two. three -> one::two::three
no.delimiters,here today -> no:delimiters:here:today
▼ Hint
- Use a character class that lists all three target delimiters: `[ ,.]`. Inside a character class, the dot is treated as a literal character and does not need to be escaped.
- Pass the plain string `":"` as the replacement argument to `re.sub()`. Every matched delimiter, regardless of which one it is, will be replaced with a colon.
- Note that a space followed immediately by a comma (or any two adjacent delimiters) will produce two consecutive colons in the output, because each delimiter is replaced independently.
▼ Solution & Explanation
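A minimal sketch of one possible solution, assuming the `[ ,.]` character class and a plain `":"` replacement as described in the hint. The helper name `replace_delimiters` and its `collapse` parameter are hypothetical; `collapse=True` uses the `[ ,.]+` variant mentioned in the explanation below.

```python
import re

def replace_delimiters(text, collapse=False):
    """Replace each space, comma, or dot with a colon."""
    # With collapse=True, a run of adjacent delimiters becomes a single colon.
    pattern = r"[ ,.]+" if collapse else r"[ ,.]"
    return re.sub(pattern, ":", text)

test_strings = [
    "one two three",
    "one,two,three",
    "one.two.three",
    "one, two. three",
    "no.delimiters,here today",
]

for s in test_strings:
    print(f"{s} -> {replace_delimiters(s)}")
```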
Explanation:
- `[ ,.]`: A character class that matches any one of three characters: a space, a comma, or a dot. Inside a character class, the dot loses its wildcard meaning and is treated as a literal period, so no backslash is needed. Each character in the class is an independent alternative; the regex engine replaces whichever one it encounters at each position.
- Plain string replacement: The second argument to `re.sub()` is the string `":"`. Because every matched delimiter is replaced with the same fixed value, no lambda or backreference is needed, keeping the call simple and readable.
- Why `one, two. three` produces double colons: The comma and the space are two separate characters, and each is an independent match. The comma is replaced with `:` and the space immediately after it is also replaced with `:`, producing `::`. This is the correct and expected behaviour for a per-character replacement. If you want to collapse consecutive delimiters into a single colon, change the pattern to `[ ,.]+` to match one or more delimiters as a group.
- Advantage over chained `str.replace()`: Replacing three delimiters with `str.replace()` would require three separate calls: `s.replace(" ", ":").replace(",", ":").replace(".", ":")`. The `re.sub()` approach handles all three in a single pass over the string, which is more concise and scales more cleanly to larger sets of delimiters.
