- Extracting email addresses using regular expressions in Python
- How To Extract Emails From a Text File Using Python
- Using RegEx with Python
- Applications of Regular Expressions
- A small tutorial on RegEx Python library
- Limitations of matching for special characters
- Compiling a regular expression
- The match() function
- Advanced matching entities
- Flags for the match function:
- The search() function
- Extracting the email using RegEx module
- Conclusion
Extracting email addresses using regular expressions in Python
Suppose you need to pull specific data out of a text, such as phone numbers, email addresses, dates, or a particular set of words. How can you do this efficiently? Regular expressions are well suited to the task.
As an example, let us use a regular expression to pick out only the email addresses from a given input.
Examples:
Input : Hello shubhamg199630@gmail.com Rohit neeraj@gmail.com
Output : shubhamg199630@gmail.com neeraj@gmail.com
Here we have selected only the email addresses from the given input string.
Input : My 2 favourite numbers are 7 and 10
Output : 2 7 10
Here we have selected only the digits.
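The two examples above can be reproduced with a short sketch. The patterns used here (`\S+@\S+` and `\d+`) are illustrative choices rather than the only way to do it, and the variable names are assumptions made for this example.

```python
import re

text = "Hello shubhamg199630@gmail.com Rohit neeraj@gmail.com"
# \S+@\S+ : one or more non-whitespace characters on each side of "@"
emails = re.findall(r"\S+@\S+", text)
print(emails)   # ['shubhamg199630@gmail.com', 'neeraj@gmail.com']

sentence = "My 2 favourite numbers are 7 and 10"
# \d+ : one or more consecutive digits
numbers = re.findall(r"\d+", sentence)
print(numbers)  # ['2', '7', '10']
```

Note that `findall()` returns the digits as strings; convert them with `int()` if numeric values are needed.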
Regular Expression
A regular expression is a sequence of characters mainly used to find and replace patterns in a string or file.
Searching and extracting are such common tasks that Python ships with a powerful library, the re module, that handles many of them quite elegantly.
Symbol | Usage |
---|---|
$ | Matches the end of the line |
\s | Matches whitespace |
\S | Matches any non-whitespace character |
* | Repeats a character zero or more times |
*? | Repeats a character zero or more times (non-greedy) |
+ | Repeats a character one or more times |
+? | Repeats a character one or more times (non-greedy) |
[aeiou] | Matches a single character in the listed set |
[^XYZ] | Matches a single character not in the listed set |
[a-z0-9] | The set of characters can include a range |
( | Indicates where string extraction is to start |
) | Indicates where string extraction is to end |
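As a rough illustration of the symbols in the table, the sketch below tries a few of them on made-up strings. The sample strings and variable names are assumptions chosen only for this example.

```python
import re

# Greedy vs non-greedy: \S+ takes as much as it can, \S+? as little as it can
greedy = re.search(r"\S+@", "a@b@c").group()
lazy = re.search(r"\S+?@", "a@b@c").group()
print(greedy, lazy)   # a@b@ a@

# Character sets: listed characters, negated set, and ranges
vowels = re.findall(r"[aeiou]", "regex")
others = re.findall(r"[^aeiou]", "regex")
chunks = re.findall(r"[a-z0-9]+", "abc-123 XYZ")
print(vowels, others, chunks)   # ['e', 'e'] ['r', 'g', 'x'] ['abc', '123']

# $ anchors the match to the end of the line
last_word = re.findall(r"\S+$", "hello world")
print(last_word)      # ['world']

# Parentheses mark the part to extract: only the text after "@" is returned
domains = re.findall(r"@(\S+)", "From: tom@example.com")
print(domains)        # ['example.com']
```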
How To Extract Emails From a Text File Using Python
In this article, we are going to see how we can extract emails from a text file using Python. To make things easier, we shall make use of regular expressions. These are special character sequences that have been used for string manipulation for a very long time.
Using RegEx with Python
Regular expressions are valuable whenever we need to manipulate strings or bring our output into a consistent format. The “re” module is a built-in module in Python. In the following sub-sections, we will cover the basic operations and then move on to the main topic.
Applications of Regular Expressions
To get a clearer idea, here are some of the applications:
- Finding a specific pattern in a string.
- Matching a particular keyword or letter in a sentence.
- Extraction of useful symbols or patterns from a long text.
- Performing complex string operations.
A small tutorial on RegEx Python library
A regular expression lets us match a specific pattern in a given text, so it is worth learning the basics before tackling this topic. Regular expressions are useful not only for email extraction; they have long been used in ETL (Extract, Transform, Load) processing of text in big-data pipelines.
There are four basic functions to perform four basic operations on strings:
- match(): To match a particular string pattern at the beginning of the text.
- search(): To find the first occurrence of a string pattern anywhere in the given text.
- findall(): Finds all the matching strings in the whole text and returns them as a list.
- finditer(): Finds all the matches and returns them as an iterator of match objects.
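The difference between these functions is easiest to see side by side. The following is a minimal sketch on a made-up sentence; the text and variable names are assumptions for illustration.

```python
import re

text = "cats chase rats; rats chase cats"

at_start = re.match(r"rats", text)       # None: the text does not begin with "rats"
first = re.search(r"rats", text)         # first occurrence anywhere in the text
every = re.findall(r"[cr]ats", text)     # list of all matching substrings
# finditer() yields match objects, so positions are available for each hit
positions = [m.start() for m in re.finditer(r"rats", text)]

print(at_start)       # None
print(first.span())   # (11, 15)
print(every)          # ['cats', 'rats', 'rats', 'cats']
print(positions)      # [11, 17]
```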
Limitations of matching for special characters
There is a set of special characters (metacharacters) that are not matched literally; instead, they help describe complex patterns in a string. Here is a list of those: . ^ $ * + ? { } [ ] \ | ( )
Point to remember: whenever we write a pattern, we should declare it as a raw string by placing the letter “r” before the opening quote. This stops Python from interpreting backslashes as escape sequences before the pattern reaches the RegEx engine. Ex: myPattern = r"myString".
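To see why the raw-string prefix matters, here is a small sketch using `\b`, which means "word boundary" to the regex engine but "backspace" inside an ordinary Python string. The sample text is an assumption for illustration.

```python
import re

text = "the cat sat"

# Without r, Python turns "\b" into a backspace character before re sees it
plain = re.findall("\bcat\b", text)
# With r, the backslashes reach the regex engine, where \b is a word boundary
raw = re.findall(r"\bcat\b", text)

print(plain)                    # []
print(raw)                      # ['cat']
print(len("\b"), len(r"\b"))    # 1 2
```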
Compiling a regular expression
Before performing string operations, we can compile our expression into a pattern object. This object lets us call the four functions above as methods. To compile an expression, we pass our pattern to the re.compile() function. Here we also set the flag to re.UNICODE; in Python 3 this is already the default for string patterns, so the flag is optional.
    import re

    myPattern = re.compile("python", flags = re.UNICODE)
    print(type(myPattern))
Now we have successfully created a pattern object. We will use it to call the functions and perform all the operations.
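One reason to compile a pattern is to reuse it across many strings. A minimal sketch, assuming a made-up list of lines:

```python
import re

word = re.compile(r"python")
lines = ["python is fun", "I like tea", "learn python today"]

# The same compiled object is reused for every line
hits = [line for line in lines if word.search(line)]
print(hits)           # ['python is fun', 'learn python today']
print(word.pattern)   # python
```

The `pattern` attribute holds the original pattern string, which can be handy when debugging.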
The match() function
This method returns a match object if the characters at the start of the string match the pattern.
    match = myPattern.match("python")
    print(match.group())
The group() method returns the text that was matched. A match object is created only when the pattern matches the start of our sample string. We can check the matched index range using the span() method.
    print("The pattern matches upto {}".format(match.span()))

    # output
    The pattern matches upto (0, 6)
Please remember that if the function does not find a match, no object is created and the return value is None. The span() method returns the start and end index positions of the match as a tuple. The match() method also accepts two extra parameters, namely:
- pos: Starting position/index of the matching text/string.
- endpos: Ending position/index up to which the text/string is searched.
    match = myPattern.match("hello python", pos = 6)
    print(match.group())
    print("The pattern matches upto {}".format(match.span()))

    # output
    python
    The pattern matches upto (6, 12)
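Because match() returns None when nothing matches, calling group() directly on the result can raise an AttributeError. The sketch below shows one way to guard against that and what endpos does; the variable names are assumptions for this example.

```python
import re

my_pattern = re.compile("python")

# No pos given: matching starts at index 0, where the text is "hello"
no_match = my_pattern.match("hello python")
if no_match is None:
    print("No match at the start of the string")

# endpos=10 means only "hello pyth" is considered, so the match fails
clipped = my_pattern.match("hello python", pos=6, endpos=10)
print(clipped)        # None

# With the full range available, the match succeeds
full = my_pattern.match("hello python", pos=6, endpos=12)
print(full.span())    # (6, 12)
```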
Advanced matching entities
Sometimes our string may contain digits, spaces, alphanumeric characters, etc. To match these reliably, re provides a set of special sequences. We need to specify those in our raw strings.
- \d: To match digit characters from 0 to 9.
- \D: To match any character that is not a digit.
- \s: For any whitespace characters, such as the space, “\n”, “\t”, “\r”.
- \S: For any non-whitespace character.
- \w: Matching alphanumeric characters and the underscore.
- \W: Matching any character that is not alphanumeric or an underscore.
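The sequences above can be tried out on a short made-up string. This is only a sketch; the sample text is an assumption.

```python
import re

sample = "Room 42, floor_3!"

digits = re.findall(r"\d+", sample)       # runs of digits
non_digits = re.findall(r"\D+", sample)   # runs of anything except digits
spaces = re.findall(r"\s", sample)        # each whitespace character
words = re.findall(r"\w+", sample)        # letters, digits and underscore
symbols = re.findall(r"\W", sample)       # everything \w does not match

print(digits)       # ['42', '3']
print(non_digits)   # ['Room ', ', floor_', '!']
print(spaces)       # [' ', ' ']
print(words)        # ['Room', '42', 'floor_3']
print(symbols)      # [' ', ',', ' ', '!']
```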
Flags for the match function:
Flags lend an extra helping hand when we perform complex text analysis. Below is a list of some flags:
- re.ASCII or re.A: Makes \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching.
- re.DEBUG: Displays debug information about the compiled expression.
- re.IGNORECASE or re.I: This flag performs case-insensitive matching.
- re.MULTILINE or re.M: Makes ^ and $ match at the start and end of each line, not only at the start and end of the whole string.
For more info about flags please go through this link: https://docs.python.org/3/library/re.html#flags
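A short sketch of how three of these flags change the result. The sample strings are assumptions; flags can be combined with the `|` operator.

```python
import re

text = "Rain came.\nrain stayed."

exact = re.findall(r"rain", text)
any_case = re.findall(r"rain", text, re.IGNORECASE)
# Without MULTILINE, ^ anchors only at the very start of the string
string_start = re.findall(r"^rain", text, re.I)
line_starts = re.findall(r"^rain", text, re.I | re.M)

print(exact)          # ['rain']
print(any_case)       # ['Rain', 'rain']
print(string_start)   # ['Rain']
print(line_starts)    # ['Rain', 'rain']

# ASCII restricts \w to [a-zA-Z0-9_], so the accented letter is left out
unicode_word = re.findall(r"\w+", "caf\u00e9")
ascii_word = re.findall(r"\w+", "caf\u00e9", re.ASCII)
print(unicode_word, ascii_word)   # ['café'] ['caf']
```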
The search() function
The search() function scans a string for a specific pattern/word/letter/character and returns a match object for the first occurrence it finds, or None if there is none.
    import re

    text = "rain rain come soon, come fast, make the land green"
    mySearch = re.search("rain", text, re.IGNORECASE)
    print("Successfully found", mySearch.group(), "from", mySearch.start(), "to", mySearch.end())

    # output
    Successfully found rain from 0 to 4
Extracting the email using RegEx module
Now that we have covered the basics, it is time for a bigger challenge. Let us combine file reading and regular expressions in one script and extract some email addresses from a file.
Sample file:
Hello my name is Tom the cat. I like to play and work with my dear friend jerry mouse. We both have our office and email addresses also. They are [email protected], [email protected]. Our friend spike has also joined us in our company. His email address is spikethedo[email protected]. We all entertaint the children through our show.
This simple file contains three email addresses mixed in with ordinary text, which makes the task a little harder, but our code will handle it. Using the regex knowledge from above, we are ready to implement it.
The regular expression for this is: "[0-9a-zA-Z]+@[0-9a-zA-Z]+\.[0-9a-zA-Z]+"
    import re

    try:
        # "with" closes the file automatically when the block ends
        with open("data.txt") as file:
            for line in file:
                line = line.strip()
                # raw string, so that "\." reaches the regex engine unchanged
                emails = re.findall(r"[0-9a-zA-Z]+@[0-9a-zA-Z]+\.[0-9a-zA-Z]+", line)
                if len(emails) > 0:
                    print(emails)
    except FileNotFoundError as e:
        print(e)
Explanation:
- The pattern says: extract text that starts with alphanumeric characters, followed by an “@” symbol, then more alphanumeric characters, then a dot “.”, and after the dot the same type of characters again.
- Do not use the dot on its own; escape it with a backslash, “\.”, to tell the Python regex engine that we mean a literal dot. An unescaped dot matches any character except a newline.
- Save the sample text in a file.
- Open the file in reading mode.
- Loop over the file with a line variable; the loop reads the text line by line.
- Strip each line to remove leading and trailing whitespace.
- Call the findall() function with our pattern as the first argument and the line variable as the second. It returns a list of every substring of the line that matches the pattern.
- If the list is not empty, print it.
- The outer code is just a try-except block that handles the case where the file is missing.
    ['[email protected]', '[email protected]']
    ['[email protected]']
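The pattern used here only accepts letters and digits, so addresses with dots in the name or more than one dot in the domain are cut short. As a hedged sketch, the variant below widens the pattern and wraps it in a helper; `EMAIL_RE`, `extract_emails`, and the example.com addresses are all hypothetical names and data introduced for illustration, and the pattern is still far from a full validator.

```python
import re

# Hypothetical, slightly more permissive pattern:
# local part may contain . _ % + - ; the domain may have several dotted labels
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def extract_emails(text):
    """Return the unique addresses found in text, in order of first appearance."""
    seen = []
    for address in EMAIL_RE.findall(text):
        if address not in seen:
            seen.append(address)
    return seen

sample = ("Write to tom.cat@example.com or jerry@mail.example.org. "
          "Again: tom.cat@example.com.")
found = extract_emails(sample)
print(found)   # ['tom.cat@example.com', 'jerry@mail.example.org']
```

The group in the pattern is non-capturing, `(?:...)`, so that findall() returns whole addresses rather than only the last dotted label. The sentence-ending full stop is left out because every dot in the domain must be followed by at least one more character.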
Conclusion
Thus, in a few lines of code, we have implemented a script that extracts email addresses from a given text.