Regex for word exclusion
This is the best solution using regular expressions: match a word (at least one non-whitespace character), consume the whitespace, consume another word, consume more whitespace, then match another word. re.search() returns a match object, and you can extract the matched strings from it.
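The pattern this answer describes is not shown above, so the following is only a sketch of one way to write it. It assumes a "word" is any run of non-whitespace characters; the pattern and the sample string are illustrative, not the answerer's exact code.

```python
import re

# Assumed pattern: word, whitespace, word, whitespace, word.
# Only the first and third words are captured.
pattern = re.compile(r"(\S+)\s+\S+\s+(\S+)")

match = pattern.search("term1 and term2")
if match:
    first, third = match.groups()
    print(first, third)
```

Because \S+ accepts any non-whitespace character, punctuation attached to a word is captured along with it.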
@steveha, I don’t think so. Try the regex on the sentence "Hello, beautiful world!". It will match "Hello," and "world!", i.e. including the punctuation ("," and "!").
@steveha, it is more appropriate to use \b word boundaries, as in my answer: stackoverflow.com/questions/6770681/regex-for-word-exclusion/…
@polishchuk, it depends on whether he wants punctuation to delimit words, or not. I agree that it is good to know about the alternatives. If he is parsing some sort of data file, he might want the above expression rather than the one that knows about punctuation.
Instead of regex, consider:
>>> "term1 and term2".split()[:3:2]
['term1', 'term2']
>>> "term1 anbd term2".split()[:3:2]
['term1', 'term2']
@steveha: it’s just "up to the third element, taking every other element". There aren’t even any negative numbers involved 🙂
Sorry, but my personal definition of "tricky" is: if I look at it and say to myself "What! What the heck is that doing?" then it is "tricky". Perhaps this means I need to spend more time slicing; maybe it’s totally obvious to everyone else. 🙂
By the way, I totally agree with using the .split() method function rather than a regex to split the string. Much as I love regular expressions, the simpler way is best.
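For readers who find the [:3:2] slice hard to read, here is a small sketch that spells out what it does. The variable names are illustrative only.

```python
sentence = "term1 and term2"
words = sentence.split()          # ['term1', 'and', 'term2']

# [:3:2] means: start at index 0, stop before index 3, step by 2,
# so it picks indices 0 and 2.
picked = words[:3:2]
explicit = [words[0], words[2]]   # same result, spelled out

# Slicing does not raise on short input; it just returns fewer items.
short = "term1".split()[:3:2]
```

The explicit indexing version raises IndexError when the string has fewer than three words, while the slice quietly returns a shorter list, so the choice depends on how malformed input should be handled.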
You can use the regex \b\w+\b to split your sentence into words, then take the first and third.
import re

pat = re.compile(r'\b\w+\b')  # pre-compile the pattern
# For this example the pre-compiling doesn't really matter.
temp = re.findall(pat, "Hello, beautiful world!")
lst = [temp[0], temp[2]]  # sets lst to ["Hello", "world"]
As far as I can tell, this pattern matches exactly one word. It is not clear to me how to use this to answer his question. If you apply this to the string "Hello, beautiful world!" this will match Hello, and I’m not sure how to write a pattern that will discard the comma along with the white space without messing up. To use this pattern, perhaps it would be better to simply split the string on white space (using the string .split() method function) and then use this pattern to strip punctuation off the first and third words found.
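The split-then-strip idea from this comment could look roughly like the sketch below. It assumes the input has at least three whitespace-separated words and that each contains at least one word character; the variable names are made up for illustration.

```python
import re

pat = re.compile(r"\b\w+\b")

sentence = "Hello, beautiful world!"
words = sentence.split()   # ['Hello,', 'beautiful', 'world!']

# Use the pattern to pull the bare word out of the first and third tokens,
# which drops the attached punctuation.
first = pat.search(words[0]).group()
third = pat.search(words[2]).group()
```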
@steveha, I don’t know Python, but I thought it has a function similar to .NET’s Regex.Matches, which returns all occurrences of words, i.e. Hello, beautiful and world.
I don’t usually use it, so I forgot about it, but there is a function re.findall() that will find all occurrences of a pattern within a string. I just tested it and it worked, so I thank you for teaching me something. Here’s the code I tested:
re.findall(r'\b\w+\b', "Hello, beautiful world!")
I just tested this, it works 🙂
[] surround a character class: a set of characters to match, or not match. Your regex says "one or more characters, none of which are a space, a, n or d", which is why you get the result you do.
Getting correct answers to these sorts of things requires correct questions. What’s special about the word "and" in your case? Do you want "every word that is not and", or do you want "the first and third word of the string, no matter what the words are", or just what?
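The asker's exact regex is not shown here, so the following sketch uses an assumed pattern, [^ and]+, to illustrate how a negated character class excludes individual characters rather than a word. The sample strings are made up.

```python
import re

# Assumed pattern: one or more characters that are not " ", "a", "n" or "d".
char_class = re.compile(r"[^ and]+")

# Appears to work, but only because neither term contains a, n or d.
lucky = char_class.findall("term1 and term2")

# Breaks as soon as the words themselves contain those letters.
unlucky = char_class.findall("sand and dune")

# Filtering whole words instead gives the intended result.
by_word = [w for w in "sand and dune".split() if w != "and"]
```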
Your description of the desired output in the second case sounds like you want "every word that is not and". There are much simpler ways to get this. Regexes are not really as useful as people want them to be.
The split method of strings cuts it into words. From there, we can use a list comprehension to filter out any words that are "and". It looks like:
[word for word in sentence.split() if word != "and"]
See? It’s practically plain English.
Exclude a Specific String From Regex Using Python
The re module is Python’s standard-library module for regular expressions. In most cases, we are interested in matching patterns. However, in other cases, we want to exclude some strings based on a given pattern.
This article covers different methods used to exclude substrings, giving examples in each case.
Using re Special Characters
Two special characters come in handy in this case: [ ] and ^. The [ ] brackets are used to specify a set of characters we wish to match.
For example, “[abc]” matches any one of the characters “a”, “b” or “c”, and “[0-5][0-9]” matches two-digit strings from 00 to 59.
If the first character in the set, [ ], is “^”, then all the characters in the set are excluded. For example, “[^xyz]” matches all characters except “x”, “y” and “z”. Here are more examples:
Regular expression | Meaning |
“[^a-z]” | Matches all characters except any lowercase ASCII alphabetical letter |
“[^.]” | Matches any character except a period |
“[^\w]” | Matches any character except word characters |
“[a^z]” | Matches “a”, “^”, and “z”. |
Table 1: Some regular expressions showing how “[ ]” and “^” characters can be used to exclude substrings.
Note: “[^^]” will match any character except “^”. ^ has no special meaning if it’s not the first character in the set, [ ] (see the last row of Table 1).
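The rows of Table 1 can be checked with a short sketch like the one below. The sample string and variable names are chosen only for illustration.

```python
import re

text = "aB3.z^"

not_lower = re.findall(r"[^a-z]", text)      # everything except a-z
not_period = re.findall(r"[^.]", text)       # everything except "."
not_word = re.findall(r"[^\w]", text)        # everything except word characters
literal_caret = re.findall(r"[a^z]", text)   # "^" is literal when not first
not_caret = re.findall(r"[^^]", text)        # everything except "^"
```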
Let’s now look at some examples in code.
Example 1: Excluding all integers
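The pattern “\d+” matches any run of one or more digits in the string passed. To match everything except digits, use a negated set such as “[^\d]+” (equivalent to “\D+”). Note that “[^\d+]” puts the “+” inside the set, where it is a literal plus sign, so that pattern also excludes “+” characters.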
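As a rough illustration of excluding integers, the sketch below compares a few patterns on a made-up sample string. Inside a set, “+” is a literal character rather than a quantifier, which is why the last two results differ.

```python
import re

text = "Order 66 shipped 3+4 items"

digits = re.findall(r"\d+", text)               # runs of digits
no_digits = re.sub(r"\d+", "", text)            # text with the digits removed
non_digit_runs = re.findall(r"[^\d]+", text)    # runs of non-digits
plus_excluded = re.findall(r"[^\d+]+", text)    # "+" inside the set is literal
```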
A regular expression to exclude a word/string
This matches strings such as /hello or /hello123. However, I would like it to exclude a couple of string values such as /ignoreme and /ignoreme2. I’ve tried a few variants but can’t seem to get any to work! My latest feeble attempt was
Here’s yet another way (using a negative look-ahead):
^/(?!ignoreme|ignoreme2|ignoremeN)([a-z0-9]+)$
Note: There’s only one capturing expression: ([a-z0-9]+).
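A quick way to try the negative look-ahead is the Python sketch below; the answer does not name a regex flavor, so using Python's re module here is an assumption, and the helper name capture is made up. The second pattern, exact, is a hypothetical variant that rejects only the exact names.

```python
import re

# Pattern from the answer: reject paths that start with a listed prefix.
pattern = re.compile(r"^/(?!ignoreme|ignoreme2|ignoremeN)([a-z0-9]+)$")

# Hypothetical variant: anchor inside the look-ahead so only exact names fail.
exact = re.compile(r"^/(?!(?:ignoreme|ignoreme2)$)([a-z0-9]+)$")

def capture(path, regex=pattern):
    """Return the captured name, or None when the path is excluded."""
    m = regex.match(path)
    return m.group(1) if m else None
```

Note that the look-ahead in the original pattern is not anchored at the end, so it also rejects any path that merely begins with ignoreme, such as /ignoremenot. Whether that is desirable depends on the use case.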
Brilliant, that seems to have done the trick. I actually need this rule for URL rewriting and I wanted to ignore the "images", "css" and "js" folders. So my rule is as follows: ^/(?!css|js|images)([a-z]+)/?(\?(.+))?$ and it rewrites to /Profile.aspx?id=$1&$3. Will this rule work correctly and propagate the query string too? So if someone visits mydomain.com/hello?abc=123 I’d like it to rewrite to mydomain.com/Profile.aspx?id=hello&abc=123. I’m also a bit unsure about the performance of (.+) at the end to capture the query string in the original request.
Sounds like this is another question. The regexp that you have looks like it will capture the query string; test and see if your query string comes along. Also, (\?(.+))?$ should be fast. I wouldn’t worry too much about speed.
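One way to "test and see" is to run the rule through Python's re module, as in the sketch below. This only approximates a real URL rewriter, which may treat the query string separately from the path; the rewrite function is a made-up helper, not part of any rewriting engine.

```python
import re

rule = re.compile(r"^/(?!css|js|images)([a-z]+)/?(\?(.+))?$")

def rewrite(url_path):
    """Apply the rule; return the rewritten path, or None if it does not apply."""
    m = rule.match(url_path)
    if m is None:
        return None
    name, query = m.group(1), m.group(3)   # $1 and $3 in the rewrite target
    target = "/Profile.aspx?id=" + name
    if query:                              # append the query string only if present
        target += "&" + query
    return target
```

In this sketch the "&" is added only when a query string was captured, whereas a literal $1&$3 substitution would presumably leave a trailing "&" for requests without one. As with the earlier pattern, the look-ahead also excludes names that merely start with css, js or images.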
This didn’t work for me, while Alix Axel’s solution did work. I’m using Java’s java.util.regex.Pattern class.
I confirm Mark’s reMark 😉 — for example, Pycharm is Java-based, isn’t it? So, considering regexes in Pycharm search Alix’s solution works, the other does not.
@Black, that site has a setting for treating / as a delimiter (though I don’t know what it’s delimiting, and am not sure what that feature is for). Here, the / is part of the expression, so the site is getting confused. Try setting the delimiter on the site to something else. Then try a test string that the pattern will match, like /hello .
python regex exclude text containing word
Does it have to use regex? What if the string is just daylight? How about today? How about this day is awful?
I get your point, but this is targeted only at some particular words, like people’s names, etc. Maybe I didn’t pick the best words for the example. I thought regex would be cool, but I’m starting to think it might be better to do it in a couple more lines of code without regex.
# ^(?s)((?!\bX\b).)*\bW\b((?!\bY\b).)*$

^
(?s)
( (?! \b X \b ) . )*
\b W \b
( (?! \b Y \b ) . )*
$
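A minimal Python sketch of this first pattern follows. The words "bad", "day" and "off" are hypothetical stand-ins for X, W and Y, and the dot-all option is passed as a flag argument rather than as an inline (?s) group.

```python
import re

# Hypothetical stand-ins: X = "bad", W = "day", Y = "off".
pattern = re.compile(
    r"^((?!\bbad\b).)*\bday\b((?!\boff\b).)*$",
    re.DOTALL,
)

def is_clean(text):
    """True if some whole-word "day" has no "bad" before it and no "off" after it."""
    return pattern.search(text) is not None
```

The \b boundaries mean that daylight and today do not count as occurrences of the word.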
Edit: it was unclear whether you meant that X, W and Y are separated by whitespace
or by any number of characters. This expanded, commented example shows both ways.
Good luck!
Note: the (?add-remove) construct is a modifier group. Typically it’s a way to
embed options like s (dot-all), i (ignore case), etc. within the regex.
Here (?s) means add the dot-all modifier, and (?si) is the same but with ignore case as well.
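The effect of these modifiers can be seen in a short Python sketch; the answer does not say which regex flavor it targets, so Python's re module is an assumption here, and the sample text is made up. One Python-specific caveat: from Python 3.11 on, a global inline group such as (?s) must appear at the very start of the pattern, so ^(?s) would need to be written as (?s)^.

```python
import re

text = "first\nsecond"

# Without dot-all, "." does not match the newline.
plain = re.search(r"first.second", text)

# (?s) turns on dot-all for the whole pattern.
dotall = re.search(r"(?s)first.second", text)

# (?si) adds ignore-case on top of dot-all.
both = re.search(r"(?si)FIRST.SECOND", text)

# The same options passed as flag arguments instead of an inline group.
with_flags = re.search(r"FIRST.SECOND", text, re.S | re.I)
```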
# ^(?s)(?!.*(?:\bX\b\s+\bW\b|\bW\b\s+\bY\b))(?:.*\b(W)\b.*|.*)$

# This regex validates that W is not preceded by X
# nor followed by Y.
# It also optionally finds W.
# Only fails if it's invalid.
# If passed, can check if W is present by
# examining capture group 1.

^                      # Beginning of string
(?s)                   # Modifier group, with s = DOT_ALL
(?!                    # Negative lookahead assertion
     .*                     # 0 or more any character (dot-all is set, so we match newlines too)
     (?:
          \b X \b \s+ \b W \b     # Trying to match X, 1 or more whitespaces, then W
       |  \b W \b \s+ \b Y \b     # Or, trying to match W, 1 or more whitespaces, then Y

          # Substitute this to find any interval between XWY
          # \b X \b .* \b W \b    <- Trying to match X, 0 or more any char, then W
          # | \b W \b .* \b Y \b  <- Or, trying to match W, 0 or more any char, then Y
     )
)
# Still at start of line.
# If here, we didn't find any XW, nor WY.
# Optionally finds W in group 1.
(?:
     .*
     \b
     ( W )                  # (1), W
     \b
     .*
  |
     .*
)
$                      # End of string
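The expanded form maps naturally onto Python's verbose mode, as in the sketch below. Again, "bad", "day" and "off" are hypothetical stand-ins for X, W and Y, the check function is a made-up helper, and the options are passed as flags rather than inline.

```python
import re

# Hypothetical stand-ins: X = "bad", W = "day", Y = "off".
pattern = re.compile(
    r"""
    ^
    (?!                       # fail if "bad day" or "day off" occurs anywhere
        .*
        (?: \b bad \b \s+ \b day \b
          | \b day \b \s+ \b off \b
        )
    )
    (?: .* \b (day) \b .*     # group 1 is set when "day" is present
      | .*
    )
    $
    """,
    re.VERBOSE | re.DOTALL,
)

def check(text):
    """Return (is_valid, found_w) for the text."""
    m = pattern.search(text)
    if m is None:
        return (False, False)
    return (True, m.group(1) is not None)
```

Unlike the first pattern, this one accepts text that does not contain the word at all, and it only rejects X or Y when they sit directly next to W, separated by whitespace.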