Python Multiline Regular Expressions

Python — Regular Expression to Match a Multiline Block of Text

  1. Reason to Write Regex to Match Multiline String
  2. Possible Solutions to Match Multiline String

This article discusses ways to search for a specific pattern in multiline strings. It covers several approaches for known and unknown patterns and explains how the matching patterns work.

Reason to Write Regex to Match Multiline String

Suppose that we have the following block of text:

Any compiled body of information is known as a data set. Depending on the situation's specifics, this may be a database or a simple array.\n \n IBM first used the term "data set," which meant essentially the same thing as "file," to describe a collection of related records. 

From the text block above, we need to find the opening text and the text that appears a few lines below it. Note that \n denotes a newline and is not literal text.

In short, we want to find and match text across multiple lines, ignoring any empty lines in between. For the text above, a single regular expression query should return both the line beginning "Any compiled body" and the line beginning "IBM first used the term".

Possible Solutions to Match Multiline String

Before discussing the solutions to this particular problem, it is essential to understand the different aspects of the regex (regular expression) API, particularly those used frequently throughout the solution.

So, let’s start with re.compile().

Python re.compile() Method

re.compile() compiles a regex pattern into a regular expression object that we can use for matching with match(), search(), and the other methods described below.

One advantage of re.compile() over uncompiled patterns is reusability. We can use the compiled expression multiple times instead of passing the pattern string on every call.

    import re as regex

    pattern = regex.compile(".+World")
    print(pattern.match("Hello World!"))   # anchored at the start of the string
    print(pattern.search("Hello World!"))  # scans for the first match anywhere

Python re.search() Method

re.search() scans a string for a match and returns a Match object if one is found, or None otherwise. If several matches exist, it returns the first one.

We can also call it directly without re.compile(), which is convenient when the pattern is needed only once.

    import re as regex

    print(regex.search(".+World", "Hello World!"))

Python re.finditer() Method

re.finditer() matches a pattern within a string and returns an iterator that delivers Match objects for all non-overlapping matches.

We can then use the iterator to iterate over the matches and perform the necessary operations; the matches are ordered in the way they are found, from left to right in the string.

    import re as regex

    matches = regex.finditer(r'[aeoui]', 'vowel letters')
    for match in matches:
        print(match)

Python re.findall() Method

re.findall() returns a list of all non-overlapping matches of a pattern in a string. Each item is a string, or a tuple of strings when the pattern contains more than one capturing group. The string is scanned from left to right, and the matches are returned in the order in which they are found.

    import re as regex

    # Find every run of uppercase letters
    string = ',,21312414.ABCDEFGw#########'
    print(regex.findall(r'[A-Z]+', string))
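The shape of the list returned by re.findall() depends on the capturing groups in the pattern. The following is a small sketch with a made-up input string; the variable names are illustrative only.

```python
import re

text = "width=10\nheight=20"

# No capturing group: a list of whole matches.
whole = re.findall(r"\w+=\d+", text)

# Two capturing groups: a list of (group1, group2) tuples.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(whole)  # ['width=10', 'height=20']
print(pairs)  # [('width', '10'), ('height', '20')]
```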

Python re.MULTILINE Flag

With re.MULTILINE, ^ matches at the beginning of every line instead of only at the beginning of the string, and $ likewise matches at the end of every line.
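A minimal sketch of the difference, using a made-up two-line string (the variable names are illustrative only):

```python
import re

text = "first line\nsecond line"

# Without the flag, ^ anchors only at the very start of the string.
no_flag = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also anchors right after every newline.
with_flag = re.findall(r"^\w+", text, re.MULTILINE)

print(no_flag)    # ['first']
print(with_flag)  # ['first', 'second']
```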

Python Regex Symbols

Regex symbols can quickly become confusing when combined in complex patterns. Below are the symbols used in our solutions, with a short description of each.

  • ^ asserts position at the start of a line (with re.MULTILINE; otherwise only at the start of the string)
  • String matches the characters "String" literally (case sensitive)
  • . matches any single character except a newline
  • + matches the previous token one or more times, as many times as possible (greedy)
  • \n matches a newline character
  • \r matches a carriage return (CR) character
  • ? matches the previous token zero or one time
  • +? matches the previous token one or more times, as few times as possible (lazy)
  • a-z, inside a character class such as [a-z], matches a single character in the range a to z (case sensitive)
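A few of these symbols can be tried out directly. This is only a sketch on invented sample strings, showing greedy versus lazy repetition, the dot stopping at a newline, and an optional carriage return.

```python
import re

text = "aaa\nbbb"

greedy = re.match(r"a+", text).group()    # + takes as many 'a's as it can
lazy = re.match(r"a+?", text).group()     # +? stops after the minimum
dot_run = re.match(r".+", text).group()   # . does not cross the newline
optional = re.findall(r"\r?\n", "x\r\ny\nz")  # ? makes the \r optional

print(repr(greedy))   # 'aaa'
print(repr(lazy))     # 'a'
print(repr(dot_run))  # 'aaa'
print(optional)       # ['\r\n', '\n']
```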

Use re.compile() to Match a Multiline Block of Text in Python

Let’s work through a few different patterns.

Pattern 1: Use re.search() for Known Pattern

    import re as regex

    multiline_string = "Regular\nExpression"
    print(regex.search(r'^Expression', multiline_string, regex.MULTILINE))

The expression above first asserts its position at the start of a line (because of ^) and then looks for the exact text "Expression".

The MULTILINE flag ensures that the start of every line is checked for "Expression" instead of just the start of the string.

Pattern 2: Use re.search() for Unknown Pattern

    import re as regex

    # The lone "\n" is the empty line between the two paragraphs.
    data = (
        "Any compiled body of information is known as a data set. Depending on the situation's specifics, this may be a database or a simple array.\n"
        "\n"
        'IBM first used the term "data set," which meant essentially the same thing as "file," to describe a collection of related records.\n'
    )

    result = regex.compile(r"^(.+)(?:\n|\r\n)+((?:(?:\n|\r\n?).+)+)", regex.MULTILINE)

    print(result.search(data)[0].replace("\n", ""))

Output:

    Any compiled body of information is known as a data set. Depending on the situation's specifics, this may be a database or a simple array.IBM first used the term "data set," which meant essentially the same thing as "file," to describe a collection of related records.

The regular expression can be broken down into smaller pieces for readability:

In the first capturing group (.+), every character on the line is matched (except line terminators), as many times as possible.

Next, the non-capturing group (?:\n|\r\n)+ matches either a newline or a carriage return followed by a newline, one or more times.

The second capturing group ((?:(?:\n|\r\n?).+)+) wraps a repeated non-capturing group. Inside it, (?:\n|\r\n?) matches exactly one line terminator: a newline, or a carriage return optionally followed by a newline. Because this group must itself start with a line terminator, at least two line breaks (that is, at least one empty line) must separate the first line from the text that follows.

After the terminator, .+ matches every character on the following line, excluding line terminators. The terminator-plus-line pair is repeated as many times as possible, so the group collects all consecutive non-empty lines.
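To see what each capturing group holds, the same pattern can be run against a shorter string. The sample text below is invented for illustration; only the pattern comes from the example above.

```python
import re

# Hypothetical sample: one line, an empty line, then two consecutive lines.
sample = "First paragraph line.\n\nSecond paragraph line.\nStill second paragraph."

pattern = re.compile(r"^(.+)(?:\n|\r\n)+((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
match = pattern.search(sample)

first = match.group(1)  # the opening line
rest = match.group(2)   # every following non-empty line, each with its leading newline

print(repr(first))  # 'First paragraph line.'
print(repr(rest))   # '\nSecond paragraph line.\nStill second paragraph.'
```

Note that the second group keeps the newline in front of each line it collects, which is why the article's example strips newlines before printing.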

Pattern 3: Use re.finditer() for Unknown Pattern

    import re as regex

    data = (
        "Regex In Python\n"
        "\n"
        "Regex is a feature available in all programming languages used to find patterns in text or data.\n"
    )

    # [\a-z] is a range from the bell character \a (code 7) up to "z", so it also
    # matches newlines, spaces, uppercase letters, digits, and most punctuation.
    query = regex.compile(r"^(.+?)\n([\a-z]+)", regex.MULTILINE)

    for match in query.finditer(data):
        topic, content = match.groups()
        print("Topic:", topic)
        print("Content:", content.strip())  # drop the surrounding newlines

Output:

    Topic: Regex In Python
    Content: Regex is a feature available in all programming languages used to find patterns in text or data.

The above expression can be explained as follows:

In the first capturing group (.+?), characters are matched (except line terminators, as before) as few times as possible, so the group stops at the end of the first line. After that, a single newline character \n is matched.

The second capturing group ([\a-z]+) then matches one or more characters from the class [\a-z]. Because \a is the bell character (code 7), this range runs from code 7 up to z and therefore covers newlines, spaces, digits, uppercase letters, and most punctuation, not just lowercase letters. That is why the group runs across the empty line and through the whole sentence that follows.
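A more explicit way to express the same idea is to match the separating newlines and the following line directly. This is an alternative sketch, not the pattern used in the article; the pattern and the `sections` name are assumptions made for illustration.

```python
import re

data = (
    "Regex In Python\n"
    "\n"
    "Regex is a feature available in all programming languages used to find patterns in text or data.\n"
)

# A title line, one or more newlines, then the next non-empty line.
query = re.compile(r"^(.+)\n+(.+)", re.MULTILINE)

sections = [m.groups() for m in query.finditer(data)]
for topic, content in sections:
    print("Topic:", topic)
    print("Content:", content)
```

Here neither group can contain a newline, so no stripping is needed before printing.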

Use re.findall() to Match a Multiline Block of Text in Python

    import re as regex

    data = (
        "When working with regular expressions, the sub() function of the re library is an invaluable tool.\n"
        "\n"
        "the subroutine looks over the string for the given pattern and applies the given replacement to all instances where it is found.\n"
    )

    query = regex.findall(r'([^\n\r]+)[\n\r]([a-z \n\r]+)', data)

    for results in query:
        for result in results:
            print(result.replace("\n", ""))

Output:

    When working with regular expressions, the sub() function of the re library is an invaluable tool.
    the subroutine looks over the string for the given pattern and applies the given replacement to all instances where it is found

To better understand the pattern, let’s break it down group by group.

In the first capturing group ([^\n\r]+), every character except a newline or carriage return is matched, as many times as possible.

After that, [\n\r] matches a single character that is either a newline or a carriage return.

In the second capturing group ([a-z \n\r]+), lowercase letters a-z, spaces, newlines, and carriage returns are matched as many times as possible. The group stops at the first character outside that set, which is why the final period is missing from the output.
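The effect of the restricted character class is easier to see on a short string. The sample text here is made up; the pattern is the one from the example above.

```python
import re

text = "Heading line.\n\nlower case body text only.\n"

result = re.findall(r"([^\n\r]+)[\n\r]([a-z \n\r]+)", text)

# One tuple per match: (first line, run of lowercase text and line breaks).
print(result)  # [('Heading line.', '\nlower case body text only')]
```

The second group ends just before the period because "." is not in [a-z \n\r]; an uppercase letter or a digit would cut it short in the same way.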

Hello! I am Salman Bin Mehmood (Baum), a software developer who helps organizations address complex problems. My expertise lies in back-end development, data science, and machine learning. I am a lifelong learner, currently working on the metaverse and enrolled in a course on building an AI application with Python. I love solving problems and developing bug-free software for people. I write content related to Python and hot technologies.

python multiline regex

I’m having an issue compiling the correct regular expression for a multiline match. Can someone point out what I’m doing wrong? I’m looping through a basic dhcpd.conf file with hundreds of entries such as:

I’ve gotten various regexes to work for the MAC and fixed-address but cannot combine them to match properly.

    import re

    f = open('/etc/dhcp3/dhcpd.conf', 'r')
    re_hostinfo = re.compile(r'(hardware ethernet (.*))\;(?:\n|\r|\r\n?)(.*)', re.MULTILINE)
    for host in f:
        match = re_hostinfo.search(host)
        if match:
            print(match.groups())

Currently my match groups will look like:

    ('hardware ethernet 00:22:38:8f:1f:43', '00:22:38:8f:1f:43', '')

But I am looking for something like:

    ('hardware ethernet 00:22:38:8f:1f:43', '00:22:38:8f:1f:43', 'node20007.domain.com')

If the file is exactly this format, it might be easier to just split lines on spaces and take the element at the end as the value.

2 Answers

Update: I’ve just noticed the real reason that you are getting the results that you got; in your code:

    for host in f:
        match = re_hostinfo.search(host)
        if match:
            print(match.groups())

host refers to a single line, but your pattern needs to work over two lines.

    data = f.read()
    for x in regex.finditer(data):
        process(x.groups())

where regex is a compiled pattern that matches over two lines.
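As a rough, self-contained sketch of that approach: the dhcpd.conf excerpt below is a hypothetical sample (the real file layout is an assumption), and `host_re` and `hosts` are names chosen here for illustration.

```python
import re

# Hypothetical dhcpd.conf excerpt; the real file layout is an assumption.
data = (
    "host node20007 {\n"
    "    hardware ethernet 00:22:38:8f:1f:43;\n"
    "    fixed-address node20007.domain.com;\n"
    "}\n"
    "host node20008 {\n"
    "    hardware ethernet 00:22:38:8f:1f:44;\n"
    "    fixed-address node20008.domain.com;\n"
    "}\n"
)

# \s+ matches the newline and indentation between the two statements.
host_re = re.compile(r"hardware ethernet\s+(\S+);\s+fixed-address\s+(\S+);")

# Search the whole text at once so a match can span two lines.
hosts = [m.groups() for m in host_re.finditer(data)]
print(hosts)
```

With a real file, `data` would come from reading the file in one call instead of iterating over it line by line.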

If your file is large, and you are sure that the pieces of interest are always spread over two lines, then you could read the file a line at a time, check the line for the first part of the pattern, setting a flag to tell you whether the next line should be checked for the second part. If you are not sure, it’s getting complicated, maybe enough to start looking at the pyparsing module.
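The line-at-a-time idea with a flag could look roughly like this. It is only a sketch: the helper `iter_hosts`, the two patterns, and the sample text are assumptions, and it relies on the address always being on the line directly after the MAC.

```python
import io
import re

mac_re = re.compile(r"hardware ethernet\s+(\S+);")
addr_re = re.compile(r"fixed-address\s+(\S+);")

def iter_hosts(lines):
    """Yield (mac, address) pairs from an iterable of lines."""
    pending_mac = None  # set when the previous line held a MAC
    for line in lines:
        if pending_mac is not None:
            # The previous line matched, so check this one for the address.
            addr = addr_re.search(line)
            if addr:
                yield pending_mac, addr.group(1)
            pending_mac = None
        mac = mac_re.search(line)
        if mac:
            pending_mac = mac.group(1)

# Hypothetical sample standing in for the open file object.
sample = io.StringIO(
    "host node20007 {\n"
    "    hardware ethernet 00:22:38:8f:1f:43;\n"
    "    fixed-address node20007.domain.com;\n"
    "}\n"
)
pairs = list(iter_hosts(sample))
print(pairs)
```

Because it only ever holds one line and one pending value, this version keeps memory use flat regardless of file size.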

Now back to the original answer, discussing the pattern that you should use:

You don’t need MULTILINE; just match whitespace. Build up your pattern using these building blocks:

    (1) fixed text
    (2) one or more whitespace characters
    (3) one or more non-whitespace characters

and then put in parentheses to get your groups.

    >>> m = re.search(r'(hardware ethernet\s+(\S+));\s+\S+\s+(\S+);', data)
    >>> print(m.groups())
    ('hardware ethernet 00:22:38:8f:1f:43', '00:22:38:8f:1f:43', 'node20007.domain.com')
    >>>

Please consider using "verbose mode". You can use it to document exactly which pieces of the pattern match which pieces of data, and it can often help in getting the pattern right in the first place. Example:

    >>> regex = re.compile(r"""
    ...     (hardware[ ]ethernet \s+
    ...         (\S+) # MAC
    ...     ) ;
    ...     \s+ # includes newline
    ...     \S+ # variable(??) text e.g. "fixed-address"
    ...     \s+
    ...     (\S+) # e.g. "node20007.domain.com"
    ...     ;
    ... """, re.VERBOSE)
    >>> print(regex.search(data).groups())
    ('hardware ethernet 00:22:38:8f:1f:43', '00:22:38:8f:1f:43', 'node20007.domain.com')
    >>>
