Finding text between two specified words in Python, when one of the two words changes
Basically, I am trying to extract text between two strings inside a loop, where one of the two words changes after each piece of information is extracted. For example, the string is:
string = alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo
So I want to extract the text between alpha and end, and then between bravo and end. I have quite a few of these unique words in my file, so I use a list and a counter to go through them. See the code below:
    string = 'alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo'
    words = ['alpha', 'bravo']  # there will be more words here
    counter = 0
    stringOut = ''

    # going through the list of words
    while counter < len(words):
        firstWord = words[counter]
        lastWord = 'end'
        data = string[string.find(firstWord)+len(firstWord):string.find(lastWord)].strip()
        # this will give the text between the first occurrence of "alpha" and "end"
        # since I want just the smallest string between "alpha" and "end", I use another
        # while loop to see if firstWord occurs again
        while firstWord in data:
            ignore, ignore2, data = data.partition(str(firstWord))
        counter = counter + 1
        stringOut += str(data) + str('\n')

    print('output string is \n' + str(stringOut))

    # this code gives the correct output for the text between the first word ("alpha") and "end".
    # but when the list moves to the next string "bravo", it takes the text between the first "bravo"
    # and the "end" that was associated with the information required for "alpha" ("somethingA")
Am I correct to assume that any text you already parsed is no longer useful? So if you first extract 'alpha 111 bravo 222 alpha somethingA end', you then want to search only the rest of the string, ', 333 bravo somethingB end 444 alpha 555 bravo'? Kind of like slicing the string each time?
There are many of these words. In a big text I have, say, many occurrences of alpha, bravo, charlie, etc., and of end. I want to extract all the text between "alpha" and "end", "bravo" and "end", "charlie" and "end", etc., hence the while loop. But alpha, bravo and charlie can each appear more than once, and I want only the smallest string that lies between those two words.
3 Answers
I morphed your request into a method/function (iterator). I hope this helps you 🙂
    string = 'alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo'
    words = ['alpha', 'bravo']

    def method(string, words, end_word):
        # each segment ends where an end_word used to be
        segments = string.split(end_word)
        counter = 0
        while counter < len(words):
            # keep only what follows the last occurrence of the word in its segment
            data = segments[counter].split(words[counter])[-1]
            counter += 1
            yield data.strip()

    for r in method(string, words, 'end'):
        print(r)

    >>>
    somethingA
    somethingB
Note: this solution works only if the string is parsed forward and never needs to be looked back on.
Please note that, without further input from you, I do not know exactly how to restrict this, but at the moment the length of words must be less than or equal to the number of occurrences of end_word in the string.
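One way to sidestep that restriction, as a minimal sketch: pair each word with a segment using zip, so iteration stops at whichever runs out first. The name method_safe and the decision to drop the trailing piece after the last end word are my own assumptions, not part of the answer above.

```python
def method_safe(string, words, end_word):
    """Yield the text between each word and the end_word that closes its segment."""
    # The piece after the last end_word has no closing marker, so drop it.
    segments = string.split(end_word)[:-1]
    # zip stops at the shorter input, so extra words never raise IndexError.
    for word, segment in zip(words, segments):
        if word in segment:
            # Text after the last occurrence of word is the shortest candidate.
            yield segment.split(word)[-1].strip()


string = 'alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo'
found = list(method_safe(string, ['alpha', 'bravo', 'charlie'], 'end'))
print(found)
```

Here 'charlie' is silently skipped because there are only two closed segments; whether skipping or raising is preferable depends on your data.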
Yes, you are correct in assuming that text already parsed is no longer useful. Does the code you suggested slice the string?
And what could you do if you want the string to remain the same and search it again later on? So that after you find somethingA, the whole string is still intact when searching for somethingB?
For example, suppose my words list has 'alpha' again: ['alpha', 'bravo', 'alpha']. The splitting function omits the part already used, so if that information is required again it can no longer be found.
You pass the string into the function, and the original string stays intact; only within the function is it sliced up and consumed to produce the desired result. Run your own tests on it if you want 🙂
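To make the "string stays intact" point concrete, here is a rough sketch that never splits anything: it only moves a search position forward with str.find and str.rfind. The function name extract_in_order and the rule that a word with no closing end is skipped are assumptions of mine.

```python
def extract_in_order(text, words, end_word="end"):
    """Return (word, shortest text between word and the next end_word), scanning forward."""
    results = []
    pos = 0  # everything before pos counts as already consumed
    for word in words:
        start = text.find(word, pos)
        if start == -1:
            continue  # word does not occur in the remaining text
        end = text.find(end_word, start + len(word))
        if end == -1:
            continue  # no closing marker after this word
        # The last occurrence of word before the marker gives the shortest span.
        start = text.rfind(word, start, end)
        results.append((word, text[start + len(word):end].strip()))
        pos = end + len(end_word)
    return results


text = 'alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo'
results = extract_in_order(text, ['alpha', 'bravo', 'alpha'])
print(results)
```

Because only pos changes, text can be searched again later from position 0 with a different word list.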
    import re

    string = 'alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo'
    words = ['alpha', 'bravo']  # there will be more words here

    for word in words:
        # the greedy .* pushes the match to the last occurrence of word that is still followed by end
        expr = re.compile(r'.*' + word + '(.+?)end')
        out = expr.findall(string)
        print(word + " => " + out[0].strip())

    >>>
    alpha => somethingA
    bravo => somethingB
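A slightly hardened variant of the same regex idea, as a sketch only: re.escape protects words that contain regex metacharacters, and re.DOTALL lets the text span line breaks. The helper name last_between is hypothetical. Note that the leading greedy .* means the last usable occurrence of the word wins.

```python
import re


def last_between(text, word, end_word="end"):
    """Text between the LAST occurrence of word that is followed by end_word, or None."""
    pattern = r".*" + re.escape(word) + r"(.+?)" + re.escape(end_word)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


string = 'alpha 111 bravo 222 alpha somethingA end, 333 bravo somethingB end 444 alpha 555 bravo'
for word in ['alpha', 'bravo']:
    print(word + " => " + str(last_between(string, word)))
```

If a word can be closed by end more than once and you need the first such span instead, this pattern is not the right tool; a forward scan with str.find is easier to reason about.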
    string = 'alpha bravo . alpha charlie somethingAC end . . bravo delta somethingBD end alpha . bravo . '
    words = ['alpha', 'bravo', 'charlie', 'delta']

    def method(string, words, end_word, single=True):
        segments = string.split(end_word)
        for word in words:
            for segment in segments:
                if word in segment:
                    data = segment.split(word)[-1]
                    yield (word, data.strip())
                    if single:
                        break
Notice the new argument, single. By default only one result per word is yielded; if you set it to False, the function searches for each word in every segment of the string. Since I am not sure what you want, you can always remove it later.
    # each word only once
    for r in method(string, words, 'end'):
        print(r)

    >>>
    ('alpha', 'charlie somethingAC')
    ('bravo', '. alpha charlie somethingAC')
    ('charlie', 'somethingAC')
    ('delta', 'somethingBD')

    # each word for each segment
    for r in method(string, words, 'end', False):
        print(r)

    >>>
    ('alpha', 'charlie somethingAC')
    ('alpha', '. bravo .')
    ('bravo', '. alpha charlie somethingAC')
    ('bravo', 'delta somethingBD')
    ('bravo', '.')
    ('charlie', 'somethingAC')
    ('delta', 'somethingBD')
As a bonus, I am including this generator expression in list-comprehension form:
    def method1(string, words, end_word, single=True):
        return (
            [(word, segment.split(word)[-1])
             for segment in string.split(end_word)
             if word in segment][:(1 if single else None)]
            for word in words
        )
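Since method1 returns a generator that produces one list per word, consuming it takes an extra step. A small usage sketch follows; flattening with itertools.chain and stripping the whitespace afterwards are my additions, not part of the answer.

```python
from itertools import chain


def method1(string, words, end_word, single=True):
    # Same expression as in the answer: one list of matches per word.
    return ([(word, segment.split(word)[-1])
             for segment in string.split(end_word)
             if word in segment][:(1 if single else None)]
            for word in words)


string = 'alpha bravo . alpha charlie somethingAC end . . bravo delta somethingBD end alpha . bravo . '
words = ['alpha', 'bravo', 'charlie', 'delta']

per_word = list(method1(string, words, 'end'))  # one list per word
# Flatten the per-word lists and strip the surrounding whitespace.
flat = [(word, data.strip()) for word, data in chain.from_iterable(per_word)]
print(flat)
```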
Regex to find words between two tags
Also, it's not good practice to use str as a variable name in Python, because str is the built-in string type and the name would shadow it.
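A tiny illustration of what goes wrong, as a sketch (the function name shadowed is made up for the demo): once str is rebound to a string, calling str(...) in that scope fails.

```python
text = "alpha 1 end"  # a neutral name keeps the built-in str available
number_as_text = str(42)


def shadowed():
    str = "alpha 1 end"  # this local name now hides the built-in type
    try:
        return str(42)
    except TypeError as exc:
        # The string object is not callable, so the conversion fails.
        return "TypeError: " + str.__class__.__name__ + " is not callable (" + exc.args[0] + ")"


print(number_as_text)
print(shadowed())
```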
By the way, the regex can be:
import re print re.findall("(.*?) ", input, re.DOTALL) print re.findall("(.*?) ", input, re.DOTALL)
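For reference, a minimal sketch of tag-to-tag extraction with re.findall. The tag name addr and the sample text are hypothetical, chosen only for illustration. It shows both a capture group and the lookbehind/lookahead form, each of which leaves the tags out of the result.

```python
import re

# Hypothetical input; the tag name <addr> is illustrative only.
doc = "<addr>1 Main St\nSpringfield</addr> filler <addr>2 Side Rd</addr>"

# Capture group: findall returns only the group, so the tags are excluded.
with_group = re.findall(r"<addr>(.*?)</addr>", doc, flags=re.DOTALL)

# Lookbehind/lookahead: the tags are asserted but are not part of the match.
with_lookaround = re.findall(r"(?<=<addr>).*?(?=</addr>)", doc, flags=re.DOTALL)

print(with_group)
print(with_lookaround)
```

re.DOTALL matters here because the first address spans two lines; without it, the dot would stop at the newline. For real HTML or XML, a parser is usually the safer choice.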
If you do not want to match the tags, use positive lookahead and lookbehind: (?<=

This example works for simple parsing only. Have a look at the official Python documentation on re.

To parse HTML, you should consider @sabuj-hassan's answer, but please remember to check this Stack Overflow gem as well.

Well, this is not exactly the code example I gave, but I suggest you replace re.DOTALL with flags=re.DOTALL. I just tried it and it still works. Hope that solves your problem 🙂

In this example, I would like to search for "Part 1" and "Part 3" and then get everything in between, which would be: ". Part 2. ". I'm using Python 2.x.

Or use re.findall, if there is more than one occurrence.

Note that the wanted answer is ". Part 2. ", so the regular expression should be r'Part 1(.*?)Part 3'. See my answer below.

Do you mind explaining what the regular expression means?

I'm trying it with my own strings but somehow something is not working. It's always returning None. See:

    ht = re.search(pattern=r'(fun n : nat => (.*?) : 0 + n = n)', string='(fun n : nat => eq_refl : 0 + n = n)')

for my string:

    import re
    ppt = '(fun n : nat => ?Goal : 0 + n = n)'
    match = re.search(r'^(.*)\?\w+(.*)$', ppt)
    re_ppt = f'(.+)'
    print(re_ppt)
    print(re.match(re_ppt, "(fun n : nat => eq_refl : 0 + n = n)").groups())

Without regular expressions, this one works for your example:

@VetrivelPS it's 6 because the length of "Part 1" is 6. str.find() returns the location of the first character.

I feel these examples of increasing complexity are interesting:

How would I get the string between the first two paragraph tags? And then, how would I get the string between the 2nd paragraph tags?

.+? The following is your text run in console. I'd like to find the string between the two paragraph tags. And also this string

I've given you an upvote, but please correct the misspellings in the first fragment. Should be

This might sound stupid, but how come this doesn't work? split = re.findall(' .+? ',content). second = split[1]. It gives me an index out of range error. How do I get the 2nd element?

    (.+?) ',string)
    >>> split[1]
    'And also this string'

If you want the string between the p tags (excluding the p tags), then add parentheses around .+? in the findall method.

    import re
    # simple example
    pattern = r"

Probably you are looking for **XML tree and elements**

XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level.

19.7.1.2. Parsing XML

We’ll be using the following XML document as the sample data for this section:

We have a number of ways to import the data. Reading the file from disk:

    import xml.etree.ElementTree as ET
    tree = ET.parse('country_data.xml')
    root = tree.getroot()

Reading the data from a string:

    root = ET.fromstring(country_data_as_string)
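A self-contained sketch of the from-a-string route. The XML below is a shortened stand-in that I made up in the spirit of the documentation sample; it is not the original country_data.xml.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the sample document.
country_data_as_string = """<data>
    <country name="Liechtenstein"><rank>1</rank></country>
    <country name="Singapore"><rank>4</rank></country>
</data>"""

root = ET.fromstring(country_data_as_string)

# Element.get reads an attribute; find returns the first matching child element.
ranks = {country.get("name"): country.find("rank").text
         for country in root.findall("country")}

print(root.tag)
print(ranks)
```

Compared with a regex, the parser hands back the text of each element directly, so there is no pattern to keep in sync with the markup.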
Match text between two strings with regular expression
3 Answers
    >>> import re
    >>> s = 'Part 1. Part 2. Part 3 then more text'
    >>> re.search(r'Part 1\.(.*?)Part 3', s).group(1)
    ' Part 2. '
    >>> re.search(r'Part 1(.*?)Part 3', s).group(1)
    '. Part 2. '

    >>> import re
    >>> s = 'Part 1. Part 2. Part 3 then more text'
    >>> re.search(r'Part 1(.*?)Part 3', s).group(1)
    '. Part 2. '

    >>> s = 'Part 1. Part 2. Part 3 then more text'
    >>> a, b = s.find('Part 1'), s.find('Part 3')
    >>> s[a+6:b]
    '. Part 2. '
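The hard-coded 6 in the str.find version is just len('Part 1'). A small helper, sketched here under the assumption that a missing marker should yield None, avoids the magic number; the name between is my own.

```python
def between(text, start, end):
    """Return the text between the first start marker and the next end marker, or None."""
    a = text.find(start)
    if a == -1:
        return None
    a += len(start)  # skip past the start marker itself
    b = text.find(end, a)  # only look for the end marker after the start
    if b == -1:
        return None
    return text[a:b]


s = 'Part 1. Part 2. Part 3 then more text'
print(repr(between(s, 'Part 1', 'Part 3')))
```

Searching for the end marker from position a onward also guards against an end marker that happens to appear before the start marker.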
    import re

    # one hole
    ppt = 'abc HERE abc'
    ept = 'abc TERM abc'
    re_ppt = ppt.replace('HERE', '(.+)')
    print()
    out = re.search(pattern=re_ppt, string=ept)
    print(out)
    print(out.groups())

    # two holes
    ppt = 'abc HERE abc HERE abc'
    ept = 'abc TERM1 abc TERM2 abc'
    re_ppt = ppt.replace('HERE', '(.+)')
    print()
    out = re.search(pattern=re_ppt, string=ept)
    print(out)
    print(out.groups())
    print()

    # a meta-variable (?Goal) marks the hole; everything else is escaped
    ppt = """(fun n : nat => nat_ind (fun n0 : nat => n0 + 0 = n0) ?Goal"""
    ept = """(fun n : nat => nat_ind (fun n0 : nat => n0 + 0 = n0) (eq_refl : 0 + 0 = 0)"""
    pattern_meta_var = r'\?(\w)+'
    _ppt = re.sub(pattern=pattern_meta_var, repl='HERE', string=ppt)
    _ppt = re.escape(_ppt)
    re_ppt = _ppt.replace('HERE', '(.+)')
    out = re.search(pattern=re_ppt, string=ept)
    print(out)
    print(out.groups())
    print()

    # Sometimes the proof term that fills the hole has no whitespace around it, while the ppt
    # has spaces around the place where the hole would be. So I strip the whitespace around each
    # hole and insert a regex that accepts a hole with or without surrounding whitespace. That way,
    # if the proof term in the hole does have surrounding whitespace, the hole still matches.
    ppt = """\n (fun (n' : nat) (IH : n' + 0 = n') => ?Goal0) n)"""
    ept = """\n (fun (n' : nat) (IH : n' + 0 = n') =>\n\teq_ind_r (fun n0 : nat => S n0 = S n') eq_refl IH : S n' + 0 = S n') n)"""
    pattern_meta_var = r'\s*\?(\w)+\s*'
    _ppt = re.sub(pattern=pattern_meta_var, repl='HERE', string=ppt)
    _ppt = re.escape(_ppt)
    re_ppt = _ppt.replace('HERE', r'\s*(.+)\s*')
    out = re.search(pattern=re_ppt, string=ept)
    print(out)
    assert out is not None, f'expected two holes matched but got {out}'
    print(out.groups())
    print()

    # two meta-variables in one term
    ppt = """(fun n : nat => nat_ind (fun n0 : nat => n0 + 0 = n0) ?Goal (fun (n' : nat) (IH : n' + 0 = n') => ?Goal0) n)"""
    ept = """(fun n : nat => nat_ind (fun n0 : nat => n0 + 0 = n0) (eq_refl : 0 + 0 = 0) (fun (n' : nat) (IH : n' + 0 = n') => eq_ind_r (fun n0 : nat => S n0 = S n') eq_refl IH : S n' + 0 = S n') n)"""
    pattern_meta_var = r'\s*\?(\w)+\s*'
    _ppt = re.sub(pattern=pattern_meta_var, repl='HERE', string=ppt)
    _ppt = re.escape(_ppt)
    re_ppt = _ppt.replace('HERE', r'\s*(.+)\s*')
    out = re.search(pattern=re_ppt, string=ept)
    print(out)
    print(out.groups())
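As for why the re.search call quoted in the comments returns None: in that pattern the parentheses open groups instead of matching literal brackets, and " +" means "one or more spaces" instead of a literal plus sign. A minimal sketch of the difference, with variable names of my own choosing:

```python
import re

target = "(fun n : nat => eq_refl : 0 + n = n)"

# Unescaped: "(" starts a group and " +" repeats the space, so nothing matches.
raw = re.search(r"(fun n : nat => (.*?) : 0 + n = n)", target)

# Escape the literal parts, then splice the capture group between them.
pattern = re.escape("(fun n : nat => ") + r"(.*?)" + re.escape(" : 0 + n = n)")
escaped = re.search(pattern, target)

print(raw)
print(escaped.group(1))
```

This is the same idea as the re.escape step in the code above, reduced to a single hole.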
Get string between two strings
3 Answers
import re matches = re.findall(r'
>>>import re >>>string = """
@Zorgan, could you have changed the content? it works in my console: >>> split = re.findall('import re string = """
["I'd like to find the string between the two paragraph tags.", 'And also this string']
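The markup in the snippets above did not survive, so here is a hedged reconstruction of the idea. It assumes the lost tags were plain p tags and that the two sentences sat in adjacent paragraphs; the variable names are mine.

```python
import re

# Assumed reconstruction of the sample text; the original markup is not shown above.
html = ("<p>I'd like to find the string between the two paragraph tags.</p>"
        "<p>And also this string</p>")

# Without a group, each result still includes the tags.
with_tags = re.findall(r"<p>.+?</p>", html)

# With parentheses around .+?, findall returns only the text inside the tags.
inner = re.findall(r"<p>(.+?)</p>", html)

print(inner)
print(inner[1])
```

An index-out-of-range error on split[1] would simply mean findall found fewer than two matches in the content it was given.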