Regular Expression

A Regular Expression (regex or RE for short) is, as the name suggests, an expression made up of a sequence of characters that defines a search pattern. Take this simple Regular Expression as an example:

\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b 

This expression can be used to find email addresses in a large corpus of text. This is useful because otherwise you would have to go through the whole document manually and pick out every email id. After going through this article, you’ll know how the above Regular Expression works, and much more. Many programming languages, such as Java, Python, JavaScript and PHP, support Regular Expressions, but the syntax and features vary somewhat between them. Here are the subtopics we are going to cover in this article:

  1. What is Regular Expression in Python?
  2. How to write Regular Expression in Python?
  3. Examples of Regular Expression in Python
  4. Regular Expression program in Python
    1. Email Validation in Python using Regular Expression
    2. Validate mobile number using Regular Expression in Python
Regular expression in Python

What is Regular Expression in Python?

In Python, Regular Expressions (also called REs, regexes or regex patterns) are available through the re module, which is built into Python, so you don’t need to install it separately.

The re module offers a set of functions that allow us to search a string for a match:

Function    Description
findall     Returns a list containing all matches
compile     Returns a compiled regex (pattern) object
search      Returns a Match object if there is a match anywhere in the string
split       Returns a list where the string has been split at each match
sub         Replaces one or many matches with a string
subn        Similar to sub, except that it returns a tuple of 2 items: the new string and the number of substitutions made
group       A method of the Match object (not of re itself); returns the entire match, or a specific subgroup when given its number
match       Similar to search, but only matches at the beginning of the string

We shall use all of these methods once we know how to write Regular Expressions which we will learn in the next section.

How to write Regular Expression in Python?

To learn how to write REs, let us first clarify some of the basics. In an RE we use either literals or meta-characters. Literals are the characters themselves and have no special meaning. Here is an example in which I use literals to find a specific string in the text using the findall method of the re module.

import re
string="Hello my name is Hussain"
print(re.findall(r"Hussain",string))

  
OUTPUT:
['Hussain']

As you can see, we used the word ‘Hussain’ itself to find it in the text. This is not practical when we have to extract thousands of names from a corpus of text. To do that, we need to describe a pattern, and for that we use meta-characters.

Meta-characters

Meta-characters are characters with a special meaning; unlike literals, they are not interpreted as the characters themselves. We may further classify meta-characters into identifiers and modifiers.

Identifiers are used to recognise a certain type of character. For example, to find all the digits in a string we can use the identifier ‘\d’:

import re
string="Hello I live on street 9 which is near street 23"
print(re.findall(r"\d",string))


OUTPUT:
['9', '2', '3']

But there seems to be a problem with this. It only returns single digits, and even worse, it splits the number 23 into two digits. So how can we tackle this problem? Can using two \d help?

import re
string="Hello I live on street 9 which is near street 23"
print(re.findall(r"\d\d",string))

OUTPUT:
['23']

Using two identifiers did help, but now it can only find two-digit numbers, which is not what we wanted.

One way to solve this problem is to use modifiers, but first, here are some identifiers that we can use in Python. We shall use some of them in the examples in the next section.

\d = any digit
\D = anything but a digit
\s = whitespace (space, tab, newline)
\S = anything but whitespace
\w = any word character (letter, digit or underscore)
\W = anything but a word character
. = any character, except for a new line
\b = word boundary (the position between a word character and a non-word character)
\. = period. Must use a backslash, because ‘.’ normally means any character.
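To see how these identifiers differ in practice, here is a small sketch. The sample text and the variable names are made up for illustration and are not part of the original examples.

```python
import re

# Hypothetical sample text, chosen only to exercise each identifier.
text = "Room 42, floor_3 is open."

digits = re.findall(r"\d", text)        # every single digit
words = re.findall(r"\w+", text)        # runs of letters, digits and underscores
spaces = re.findall(r"\s", text)        # each whitespace character
whole_is = re.findall(r"\bis\b", text)  # "is" only as a whole word
periods = re.findall(r"\.", text)       # literal periods

print(digits)       # ['4', '2', '3']
print(words)        # ['Room', '42', 'floor_3', 'is', 'open']
print(len(spaces))  # 4
print(whole_is)     # ['is']
print(periods)      # ['.']
```

Note that \w treats the underscore and the digit in floor_3 as part of the word, which is why it comes back as a single item.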

 

Modifiers are a set of meta-characters that add more functionality to identifiers. Going back to the example above, we will see how we can use the modifier “+” to get numbers of any length from the string. This modifier matches 1 or more repetitions of whatever precedes it.

import re
string="Hello I live on street 9 which is near street 23"
print(re.findall(r"\d+",string))


OUTPUT:
['9', '23']

Great! Finally, we got our desired results. By using the ‘+’ modifier with the \d identifier, I can extract numbers of any length. Here are a few of the modifiers that we are also going to use in the examples section ahead.

+ = match 1 or more repetitions
? = match 0 or 1 repetitions
* = match 0 or more repetitions
$ = matches at the end of the string
^ = matches at the start of the string
| = matches either/or. Example: x|y will match either x or y
[] = a set of characters; matches any one character from the set, which may include ranges such as a-z
{x} = expect exactly x repetitions of the preceding element
{x,y} = expect between x and y repetitions of the preceding element
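The modifiers listed above can be tried out one at a time. The following is a minimal sketch; all sample strings and variable names are invented for illustration.

```python
import re

# Hypothetical sample data for illustration only.
spellings = "color colour colouur"

optional_u = re.findall(r"\bcolou?r\b", spellings)  # ? : zero or one "u"
any_u = re.findall(r"\bcolou*r\b", spellings)       # * : zero or more "u"
starts = bool(re.search(r"^Hello", "Hello world"))  # ^ : start of the string
ends = bool(re.search(r"world$", "Hello world"))    # $ : end of the string
pets = re.findall(r"cat|dog", "cat, dog and bird")  # | : either alternative
vowels = re.findall(r"[aeiou]", "regex")            # [] : one character from the set
years = re.findall(r"\b\d{4}\b", "In 1999 and 2024, not 123 or 12345")  # {x}
short_nums = re.findall(r"\b\d{2,3}\b", "7 42 123 4567")                # {x,y}

print(optional_u)    # ['color', 'colour']
print(any_u)         # ['color', 'colour', 'colouur']
print(starts, ends)  # True True
print(pets)          # ['cat', 'dog']
print(vowels)        # ['e', 'e']
print(years)         # ['1999', '2024']
print(short_nums)    # ['42', '123']
```

The \b on both sides of the counted patterns matters here: without it, \d{4} would also pick four digits out of the middle of 12345.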

 

Did you notice that we put the character r at the start of every RE? This r prefix makes the string a raw string literal. It changes how the string literal is interpreted: such literals are stored as they appear.

For example, \ usually introduces an escape sequence, but it is just a backslash when the string is prefixed with an r. Regular Expression syntax involves many backslash-escaped characters, and to prevent Python from interpreting them as escape sequences we use raw string literals.
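A short sketch makes the difference visible. It uses \b, which Python itself reads as a backspace character in an ordinary string, but which the regex engine reads as a word boundary when the backslash is preserved. The sample text is made up for illustration.

```python
import re

plain = "\b"   # without r, Python turns \b into a single backspace character
raw = r"\b"    # with r, the backslash and the letter b are both kept

print(len(plain))  # 1
print(len(raw))    # 2

text = "cat concat cat"

# The regex engine only sees a word boundary when the backslash survives.
with_raw = re.findall(r"\bcat\b", text)
without_raw = re.findall("\bcat\b", text)

print(with_raw)     # ['cat', 'cat']
print(without_raw)  # []
```

The second search finds nothing because it is looking for the letters cat surrounded by two backspace characters, which do not occur in the text.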

Examples of Regular Expression in Python

Let us explore some examples related to the meta-characters. Here we are going to see how we use different meta-characters and what effect they have on the output:

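import re
string="get out Of my house !!!"
# \w+ matches each run of one or more word characters
print(re.findall(r"\w+",string))
OUTPUT:
['get', 'out', 'Of', 'my', 'house']

import re
string="get out Of my house !!!"
# \w{2} matches exactly two word characters at a time, scanning left to right
print(re.findall(r"\w{2}",string))
OUTPUT:
['ge', 'ou', 'Of', 'my', 'ho', 'us']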

import re
string="abc abcccc abbbc ac def"
# a, then zero or more b, then c, as a whole word
print(re.findall(r"\bab*c\b",string))
OUTPUT:
['abc', 'abbbc', 'ac']

import re
string="abc abcccc abbbc ac def"
# a, then one or more b, then c, as a whole word
print(re.findall(r"\bab+c\b",string))
OUTPUT:
['abc', 'abbbc']

import re
string="get out Of my house !!!"
# whole words that are exactly two characters long
print(re.findall(r"\b\w{2}\b",string))
OUTPUT:
['Of', 'my']

import re
string="name and names are 23 blah blah"
# whole words that end in e or es
print(re.findall(r"\b\w+es?\b",string))
OUTPUT:
['name', 'names', 'are']

import re
string='''I am Hussain Mujtaba and M12  !a
'''
# M, then any five characters (each . is one character except a newline), then a
print(re.findall(r"M.....a",string))
OUTPUT:
['Mujtaba', 'M12  !a']

import re
string='''123345678
'''
# any single character that is 1, 2 or 3
print(re.findall(r"[123]",string))
OUTPUT:
['1', '2', '3', '3']

import re
string='''123345678
'''
# ^ at the start of a set negates it: any character except 1, 2 or 3
print(re.findall(r"[^123]",string))
OUTPUT:
['4', '5', '6', '7', '8', '\n']
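
import re
string='''
hello I am a student from India
'''
# an uppercase letter followed by one or more lowercase letters
print(re.findall(r"[A-Z][a-z]+",string))
OUTPUT:
['India']

import re
string='''
hello I am a student from India
'''
# whole words that start with a letter from A to I (either case), followed by lowercase letters
print(re.findall(r"\b[A-Ia-i][a-z]+\b",string))
OUTPUT:
['hello', 'am', 'from', 'India']

import re
string='''
hello I am a student from India. Hello again
'''
# [hH] matches either h or H; inside a set, | would be a literal character, so it is not needed
print(re.findall(r"\b[hH]\w+\b",string))
OUTPUT:
['hello', 'Hello']

import re
string='''
hello I am a student from India and heere is now
'''
# \1 refers back to group 1, so this finds a lowercase letter repeated twice in a row;
# findall returns the content of the group, not the whole match
print(re.findall(r"([a-z])\1",string))
OUTPUT:
['l', 'e']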

Now that we have seen enough of meta-characters, we will see how some of the methods of the re module work. First, let us start with re.compile.

re.compile

We can compile a regular expression pattern into a pattern object, which can then be used for pattern matching. This also lets us search for the same pattern again without rewriting it. Here is an example:

import re
pattern=re.compile('[A-Z][a-z]+')
result=pattern.findall('Great Learning is all about excellence')
print(result)
OUTPUT:
['Great', 'Learning'] 
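The benefit of compiling shows up when the same pattern is reused. Here is a minimal sketch in which one pattern object is applied to several strings and with several methods; the second sentence and the variable names are made up for illustration.

```python
import re

# Compile once, then reuse the same pattern object for several operations.
capitalised = re.compile(r"[A-Z][a-z]+")

sentences = [
    "Great Learning is all about excellence",
    "regular expressions in Python",   # hypothetical second sample
]

found = [capitalised.findall(s) for s in sentences]
first = capitalised.search(sentences[0])         # Match object for the first hit
replaced = capitalised.sub("X", sentences[0])    # replace every hit with X

print(found)          # [['Great', 'Learning'], ['Python']]
print(first.group())  # Great
print(replaced)       # X X is all about excellence
```

A pattern object offers the same operations as the module-level functions (findall, search, split, sub and so on), just without the pattern argument.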

re.search

The re.search function searches the string for a match and returns a Match object if there is a match. If there is more than one match, only the first occurrence of the match will be returned. Here is an example:

import re
txt = "The rain in Spain"
print(re.search("Spain", txt))
OUTPUT:
<re.Match object; span=(12, 17), match='Spain'>
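The Match object carries more than its printed form suggests. The sketch below, which reuses the same sentence with a different pattern chosen for illustration, reads the matched text and its position, and shows that re.search returns None when nothing matches.

```python
import re

txt = "The rain in Spain"
match = re.search(r"ai", txt)

if match:                    # re.search returns None when nothing matches
    print(match.group())     # ai
    print(match.start())     # 5
    print(match.end())       # 7
    print(match.span())      # (5, 7)

missing = re.search(r"Portugal", txt)
print(missing)               # None
```

Although "ai" occurs twice in the sentence, only the first occurrence (inside "rain") is reported. Checking the result before calling methods on it avoids an AttributeError in the no-match case.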

re.split

The re.split function returns a list where the string has been split at each match:

import re

s_nums = 'one1two22three333four'

print(re.split(r'\d+', s_nums))
OUTPUT:
['one', 'two', 'three', 'four']
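re.split has a few options worth knowing. As a rough sketch: the maxsplit argument limits the number of splits, and wrapping the pattern in parentheses keeps the separators in the result. The last sample string is made up for illustration.

```python
import re

s_nums = 'one1two22three333four'

limited = re.split(r'\d+', s_nums, maxsplit=1)   # stop after the first split
with_seps = re.split(r'(\d+)', s_nums)           # a capturing group keeps the separators
fields = re.split(r'[;,]\s*', 'a, b;c,  d')      # split on , or ; plus any following spaces

print(limited)    # ['one', 'two22three333four']
print(with_seps)  # ['one', '1', 'two', '22', 'three', '333', 'four']
print(fields)     # ['a', 'b', 'c', 'd']
```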

re.sub

The re.sub function replaces the matches with the text of your choice:

import re
s = 'aaa@xxx.com bbb@yyy.com ccc@zzz.com'
print(re.sub(r'[a-z]*@', 'XXX@', s))
OUTPUT:
XXX@xxx.com XXX@yyy.com XXX@zzz.com
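The replacement does not have to be a fixed string. The following sketch shows three variations under simple assumptions: a replacement that refers back to captured groups, a replacement computed by a function (the helper name mask is hypothetical), and the count argument that limits how many matches are replaced.

```python
import re

s = 'aaa@xxx.com bbb@yyy.com ccc@zzz.com'

# \1 and \2 in the replacement refer to the first and second captured groups.
swapped = re.sub(r'([a-z]+)@([a-z.]+)', r'\2:\1', s)

def mask(m):
    """Keep the first letter of the name and replace the rest with asterisks."""
    name = m.group(1)
    return name[0] + '*' * (len(name) - 1) + '@'

# A function used as the replacement receives each Match object in turn.
masked = re.sub(r'([a-z]+)@', mask, s)

# count=1 replaces only the first match.
only_first = re.sub(r'[a-z]*@', 'XXX@', s, count=1)

print(swapped)     # xxx.com:aaa yyy.com:bbb zzz.com:ccc
print(masked)      # a**@xxx.com b**@yyy.com c**@zzz.com
print(only_first)  # XXX@xxx.com bbb@yyy.com ccc@zzz.com
```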

re.subn

As mentioned earlier, the re.subn function is similar to the re.sub function, but it returns a tuple of 2 items containing the new string and the number of substitutions made.

import re
s = 'aaa@xxx.com bbb@yyy.com ccc@zzz.com'
print(re.subn(r'[a-z]*@', 'XXX@', s))
OUTPUT:
('XXX@xxx.com XXX@yyy.com XXX@zzz.com', 3)
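Because the result is a tuple, it is usually convenient to unpack it into two names. A minimal sketch, with variable names chosen for illustration, that also shows what comes back when nothing matches:

```python
import re

s = 'aaa@xxx.com bbb@yyy.com ccc@zzz.com'

# Unpack the tuple into the new string and the substitution count.
new_s, count = re.subn(r'[a-z]*@', 'XXX@', s)
print(count)   # 3

# With no match, the string comes back unchanged and the count is 0.
unchanged, zero = re.subn(r'\d+', '#', s)
print(unchanged == s, zero)   # True 0
```

The count is a simple way to tell whether a substitution changed anything at all.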

re.match

The re.match function checks for a match only at the beginning of the string. So, if the pattern matches at the very start of the string, it returns the Match object. But if the pattern only occurs somewhere later in the string, even at the start of another line, it returns None. Here is an example:

import re 
String ='''learning regex with 
           great learning is easy 
           also regex is very useful for string matching. 
           It is fast too.'''
  
# Use of re.search() Method 
print(re.search('learning', String)) 
# Use of re.match() Method 
print(re.match('learning', String)) 
# Use of re.search() Method 
print(re.search('great learning', String)) 
# Use of re.match() Method 
print(re.match('great learning', String)) 
OUTPUT:
<re.Match object; span=(0, 8), match='learning'>
<re.Match object; span=(0, 8), match='learning'>
<re.Match object; span=(32, 46), match='great learning'>
None
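Two related tools are worth a quick look alongside re.match: the re.MULTILINE flag, which lets ^ in a search match at the start of every line, and re.fullmatch, which requires the whole string to match. The sketch below uses a shortened two-line sample made up for illustration.

```python
import re

text = "learning regex\ngreat learning is easy"

at_start = re.match(r"learning", text)       # pattern is at index 0, so this matches
not_at_start = re.match(r"great", text)      # "great" starts the second line: None

# With re.MULTILINE, ^ in a search also matches right after each newline.
line_start = re.search(r"^great", text, re.MULTILINE)

# re.fullmatch succeeds only if the entire string fits the pattern.
partial = re.fullmatch(r"learning", "learning regex")
whole = re.fullmatch(r"\w+ \w+", "learning regex")

print(at_start.span())    # (0, 8)
print(not_at_start)       # None
print(line_start.span())  # (15, 20)
print(partial)            # None
print(whole.group())      # learning regex
```

re.match itself still matches only at the beginning of the string, even in MULTILINE mode, so re.search with ^ is the tool for matching at the start of each line.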

Match.group

The group method of a Match object returns the entire match (or a specific subgroup, when given its number). We can define subgroups in a Regular Expression by enclosing parts of it in parentheses. Here is an example to make it clear:

import re
string = "Dogs are more loyal than cats"
matchObj = re.match(r'(.*) are (.*?) .*', string)
print("matchObj.group() : ", matchObj.group(0))
print("matchObj.group(1) : ", matchObj.group(1))
print("matchObj.group(2) : ", matchObj.group(2))
OUTPUT:
matchObj.group() :  Dogs are more loyal than cats
matchObj.group(1) :  Dogs
matchObj.group(2) :  more

We define a group in a Regular Expression by enclosing it in parentheses. As you can see, we have defined two groups in the above Regular Expression, one before the word are and another after it. Thus in group 1 we have Dogs, and in group 2 we have more.
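Groups can also be given names, and a Match object can return all of its groups at once. The following sketch uses a different, more specific pattern than the one above; the pattern and the group names (subject, degree, quality, other) are chosen purely for illustration.

```python
import re

string = "Dogs are more loyal than cats"

# (?P<name>...) defines a named group; the names here are illustrative.
pattern = r"(?P<subject>\w+) are (?P<degree>\w+) (?P<quality>\w+) than (?P<other>\w+)"
match = re.match(pattern, string)

all_groups = match.groups()       # every group as a tuple
subject = match.group("subject")  # look a group up by name
by_name = match.groupdict()       # all named groups as a dictionary
pair = match.group(1, 4)          # several groups at once, by number

print(all_groups)  # ('Dogs', 'more', 'loyal', 'cats')
print(subject)     # Dogs
print(by_name)     # {'subject': 'Dogs', 'degree': 'more', 'quality': 'loyal', 'other': 'cats'}
print(pair)        # ('Dogs', 'cats')
```

Named groups tend to make longer patterns easier to read, since the code that consumes the match no longer depends on counting parentheses.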

Regular Expression program in Python

Now that we have seen how to use different REs, we are going to use them to write some programs. Let us first write a program that can validate an email id.


Email Validation in Python using Regular Expression

This program is going to take in different email ids and check whether each one is valid. First, we will look for patterns in different email ids, and then, based on those, design an RE that can identify emails. Let us take a look at some valid emails:

  • dani123@gmail.com
  • mysite@ourearth.com
  • my.ownsite@ourearth.org
  • mysite@you.me.net
  • rahimrar@jkbnet.in

Now let us take a look at some examples of invalid email ids:

  • mysite.ourearth.com [ @ is not present ]
  • mysite@.com.my [ the domain cannot start with a dot “.” ]
  • @you.me.net [ no character before @ ]
  • mysite123@gmail.b [ “.b” is not a valid tld (top-level domain) ]
  • mysite@.org.org [ the domain cannot start with a dot “.” ]
  • .mysite@mysite.org [ an email should not start with “.” ]
  • mysite()*@gmail.com [ the regular expression only allows letters, digits, and the characters . _ % + - ]
  • mysite..1234@yahoo.com [ double dots are not allowed ]

From the above examples we are able to find these patterns in email ids:

The personal_info part contains the following ASCII characters.

  1. Uppercase (A-Z) and lowercase (a-z) English letters.
  2. Digits (0-9).
  3. Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
  4. The character . (period, dot or full stop), provided that it is not the first or last character and that it does not appear twice in a row.

The domain name part [for example com, org, net, in, us, info] contains letters, digits, hyphens, and dots.

The RE used in the program accepts only a subset of these special characters (. _ % + -). Finally, here is the program that can validate email ids using Regular Expressions:

import re
def validate_email(email):
    if re.search(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b', email):
        print("Valid Email")
    else:
        print("Invalid Email")
validate_email("dani123@gmail.com")
validate_email("dani123gmail.com.in")
OUTPUT:
Valid Email
Invalid Email
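One thing to be aware of: because re.search looks for the pattern anywhere in the string, and the character sets include the dot, the program above reports some of the addresses from the invalid list as valid (for example .mysite@mysite.org and mysite..1234@yahoo.com). The following is a stricter sketch, not a complete implementation of the email standard. It assumes the rules listed in this section, restricted to the special characters the original pattern allows, and uses re.fullmatch so the whole string must fit. The names EMAIL_RE and is_valid_email are hypothetical.

```python
import re

# A stricter pattern, assuming only the rules described in this article.
EMAIL_RE = re.compile(
    r"[a-zA-Z0-9_%+-]+(?:\.[a-zA-Z0-9_%+-]+)*"  # local part: no leading, trailing or double dots
    r"@"
    r"(?:[a-zA-Z0-9-]+\.)+"                     # one or more domain labels, each followed by a dot
    r"[a-zA-Z]{2,}"                             # top-level domain of at least two letters
)

def is_valid_email(email):
    """Return True only if the entire string looks like an email id."""
    return EMAIL_RE.fullmatch(email) is not None

valid = ["dani123@gmail.com", "mysite@ourearth.com", "my.ownsite@ourearth.org",
         "mysite@you.me.net", "rahimrar@jkbnet.in"]
invalid = ["mysite.ourearth.com", "mysite@.com.my", "@you.me.net",
           "mysite123@gmail.b", "mysite@.org.org", ".mysite@mysite.org",
           "mysite()*@gmail.com", "mysite..1234@yahoo.com"]

print(all(is_valid_email(e) for e in valid))    # True
print(any(is_valid_email(e) for e in invalid))  # False
```

Returning a boolean instead of printing makes the check easier to reuse, for example inside a form handler or a filter over a list of addresses.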

Validate mobile number using Regular Expression in Python


In this section, we are going to validate the phone numbers. As the format of phone numbers can vary, we are going to make a program that can identify such numbers:

  • +91-1234-567-890
  • +911234567890
  • +911234-567890
  • 01234567890
  • 01234-567890
  • 1234-567-890
  • 1234567890

Here we can see that a number can have a prefix of +91 or 0. Also, there can be dashes after the first four digits of the number, and then after the next three digits. You can try to find more patterns if they exist and then write your own regular expression.

import re
def validate_number(number):
    if re.search(r'^\+91-?\d{4}-?\d{3}-?\d{3}$|^0?\d{4}-?\d{3}-?\d{3}$', number):
        print("Valid Number")
    else:
        print("Invalid Number")
validate_number("+91-1234-567-890")
validate_number("+911234567890")
validate_number("+911234-567890")
validate_number("01234567890")
validate_number("01234-567890")
validate_number("1234-567-890")
validate_number("1234567890")
validate_number("12344567890")
validate_number("123-4456-7890")
OUTPUT:
Valid Number
Valid Number
Valid Number
Valid Number
Valid Number
Valid Number
Valid Number
Invalid Number
Invalid Number
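Once a number is known to be valid, it is often useful to store it in a single canonical form. As a sketch under the same format assumptions as above, the pattern below accepts the same inputs but captures the three digit blocks as groups, so the prefix and dashes can be dropped. The names PHONE_RE and normalise_number are hypothetical.

```python
import re

# Same formats as the pattern above: optional +91 (with optional dash) or 0 prefix,
# then 4 + 3 + 3 digits with optional dashes in between.
PHONE_RE = re.compile(r"(?:\+91-?|0)?(\d{4})-?(\d{3})-?(\d{3})")

def normalise_number(number):
    """Return the 10 digits without prefix or dashes, or None if the format is not recognised."""
    match = PHONE_RE.fullmatch(number)
    if match is None:
        return None
    return "".join(match.groups())

print(normalise_number("+91-1234-567-890"))  # 1234567890
print(normalise_number("01234-567890"))      # 1234567890
print(normalise_number("1234567890"))        # 1234567890
print(normalise_number("12344567890"))       # None
print(normalise_number("123-4456-7890"))     # None
```

Using re.fullmatch removes the need for ^ and $ in the pattern, and the captured groups do the work of stripping the formatting.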

This brings us to the end of this article, where we learned about Regular Expressions in Python and how to use them in different scenarios.
