Strings and RegEx with Python
Strings
A string (a str object) in Python is a sequence of Unicode characters. Strings are written within either single or double quotes, and triple quotes (''' or """) can be used for multiline strings. Strings may include letters, numbers, symbols, and spaces. Since Python does not have a separate character data type, a single character is simply a string of length 1.
Strings are immutable, so once created their contents cannot be modified in place. Like arrays, strings are ordered sequences of characters, so they support common sequence operations such as concatenation (+), repetition (*), indexing, and slicing.
Let’s write a simple code snippet to extract words from a sentence using indexing and slicing. The variable word1 is assigned the substring sentence[6:11], which includes the characters from index 6 up to, but not including, index 11. The variable word2 is assigned the substring sentence[-7:], which runs from the seventh character from the end through the last character (inclusive).
The variable repetition shows how the * operator repeats a string, and the variable concatenation shows how the + operator joins two strings. The outputs of the print statements are shown as comments. Repetition and concatenation are two instances of operator overloading, where the behavior of an operator (+ or *) changes depending on the type (int or str) of its operands.
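To make immutability concrete, here is a minimal sketch. The variable names (text, renamed, and so on) are only illustrative. It shows that reading characters by index or slice works, that assigning to an index raises a TypeError, and that "changing" a string really means building a new one.

```python
text = "Python"

# Indexing and slicing read characters without changing the string.
first = text[0]       # first character
last = text[-1]       # last character
middle = text[1:4]    # characters at indexes 1, 2 and 3

# Item assignment is not allowed because str is immutable.
try:
    text[0] = "J"
    mutated = True
except TypeError:
    mutated = False

# Build a new string instead of modifying the old one.
renamed = "J" + text[1:]

print(first, last, middle)  # P n yth
print(mutated)              # False
print(renamed, text)        # Jython Python
```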
sentence = "Let's Learn Python "
word1 = sentence[6:11]
word2 = sentence[-7:]
repetition = (word2 * 3) # "Python Python Python "
print(repetition.rstrip()) # "Python Python Python"
concatenation = word1 + ' ' + word2 # "Learn Python "
print(concatenation.rstrip()) # "Learn Python"
Did you notice that the example above uses the rstrip() method to remove trailing whitespace from the output? Python provides many more built-in string methods for string manipulation, as listed in the official Python documentation.
Let’s write another simple example to show how some of these methods work. The variable sentence is assigned a multiline text with leading and trailing whitespace. The strip() method trims this whitespace. The capitalize() method converts the first character to uppercase and the remaining characters to lowercase. String methods do not modify the original string, so we have to reassign the returned value or assign it to a new variable.
The split() method, called without arguments, splits the sentence on any whitespace (including the line break) and returns the words as a Python list (words). The join() method converts this list back into a single string, placing the separator between the items.
sentence = ''' the quick brown fox
jumps over the lazy dog '''
sentence = sentence.strip().capitalize()
print(sentence)
# "The quick brown fox
# jumps over the lazy dog"
words = sentence.lower().split()
print(words)
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
pangram = " ".join(words)
print(pangram)
# "the quick brown fox jumps over the lazy dog"
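As a small follow-up sketch (the sample strings here are made up for illustration), the snippet below shows that string methods return new strings and leave the original untouched, and that split() and join() also accept an explicit separator.

```python
original = "  Hello, World  "

trimmed = original.strip()    # new string without the outer whitespace
shouted = trimmed.upper()     # new string in upper case

# split() accepts an explicit separator instead of whitespace.
parts = "a,b,c".split(",")
rejoined = "-".join(parts)

print(repr(original))  # '  Hello, World  ' (unchanged)
print(trimmed)         # Hello, World
print(shouted)         # HELLO, WORLD
print(parts)           # ['a', 'b', 'c']
print(rejoined)        # a-b-c
```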
RegEx
A RegEx, or regular expression, is a sequence of characters that defines a search pattern, which can be used for pattern matching or text manipulation. Python has a built-in module called re for working with regular expressions. The module provides functions such as search(), match(), and findall() for pattern matching.
Regular expressions are especially useful for working with complex text patterns: validating email addresses, mobile numbers, or dates, extracting specific information from a dataset, or replacing text based on a certain condition.
Let’s implement a simple Python program to validate a mobile number using RegEx. We import the re module in order to use its search() function, which takes a RegEx pattern and a target string as arguments. The function validate_mobile performs the pattern matching and returns a Boolean value indicating whether a match was found. Based on this return value, an if-else statement prints whether the mobile number is valid or not.
The variable pattern is assigned a raw string (r""), so backslashes reach the RegEx engine unchanged. The pattern starts with a caret (^) and ends with a dollar sign ($), indicating that it must match from the start through the end of the string. Parentheses are used in RegEx to create groups. Here we have two groups: one for the country code and another for the rest of the digits. The country code should begin with a literal + symbol, so we put a \ before the + to remove its special RegEx meaning. The \d{1,3} that follows matches between one and three digits. The \s? matches zero or one whitespace character, making the whitespace optional. The ? after the closing parenthesis applies to the whole group (\+\d{1,3}\s?), making the country code optional. The second group, (\d{9}), matches exactly 9 digits.
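Before moving on, here is a brief sketch of how the three functions differ. The sample text is invented for illustration. match() only checks the beginning of the string, search() returns the first hit anywhere in it, and findall() returns every hit as a list.

```python
import re

text = "Order 66 shipped on 2024-05-17 with 3 items"

# match() only looks at the start of the string, so it finds nothing here.
at_start = re.match(r"\d+", text)

# search() scans the whole string and stops at the first hit.
first_number = re.search(r"\d+", text)

# findall() returns every non-overlapping hit as a list of strings.
all_numbers = re.findall(r"\d+", text)

print(at_start)              # None
print(first_number.group())  # 66
print(all_numbers)           # ['66', '2024', '05', '17', '3']
```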
import re

def validate_mobile(mobile):
    pattern = r"^(\+\d{1,3}\s?)?(\d{9})$"
    match = re.search(pattern, mobile)
    return bool(match)

MOBILE = "+94 987654321"
is_valid = validate_mobile(MOBILE)
if is_valid:
    print(f"{MOBILE} is valid")
else:
    print(f"{MOBILE} is not valid")
# +94 987654321 is valid
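Two details of this pattern are worth a closer look, sketched below with the same pattern (the names PATTERN, country_code, and subscriber are only illustrative). First, the groups can be read back from the match object with group(). Second, with search() the $ anchor also matches just before a trailing newline, so an input ending in "\n" still passes. If that matters for your use case, re.fullmatch() is one possible stricter alternative.

```python
import re

PATTERN = r"^(\+\d{1,3}\s?)?(\d{9})$"

match = re.search(PATTERN, "+94 987654321")
country_code = match.group(1)  # first group, including the optional space
subscriber = match.group(2)    # second group, the 9 digits

# With search(), `$` also matches just before a trailing newline.
search_accepts_newline = bool(re.search(PATTERN, "987654321\n"))

# fullmatch() requires the whole string to match, newline included.
fullmatch_accepts_newline = bool(re.fullmatch(PATTERN, "987654321\n"))

print(repr(country_code))         # '+94 '
print(subscriber)                 # 987654321
print(search_accepts_newline)     # True
print(fullmatch_accepts_newline)  # False
```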
RegEx patterns
Refer to the official Python documentation to explore the other special characters used to define RegEx patterns. Some common ones are listed in the table below.
| Character | Description |
|---|---|
| ^ | Starts with |
| $ | Ends with |
| . | Any single character |
| * | 0 or more occurrences |
| + | 1 or more occurrences |
| ? | 0 or 1 occurrence |
| \s | Any whitespace character |
| \S | Any non-whitespace character |
| \d | Any digit |
| \D | Any non-digit |
| {n} | Exactly n occurrences |
| {n,m} | At least n, at most m occurrences |
| [xyz] | Any one of the enclosed characters |
| [^xyz] | Any one character that is not among the enclosed characters |
| [x-z] | Any one character in the enclosed range |
- Note: Use the escape character (backslash, \) when you need one of these symbols to match literally rather than with its RegEx meaning.
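The effect of escaping can be seen in a short sketch like the one below. The sample strings are made up. It compares an unescaped dot with an escaped one and uses re.escape(), a helper from the re module that escapes the special characters in a string for you.

```python
import re

price_text = "Total: $4.50 (approx.)"

# Unescaped, `.` matches any character, so "4x50" is accepted as well.
loose = re.search(r"\d.\d\d", "4x50")

# Escaped, `\$` and `\.` match a literal dollar sign and a literal dot.
strict = re.search(r"\$\d\.\d\d", price_text)

# re.escape() escapes the special characters in a string automatically.
literal = re.escape("(approx.)")
found = re.search(literal, price_text)

print(loose.group())   # 4x50
print(strict.group())  # $4.50
print(found.group())   # (approx.)
```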