How do I remove a substring from the end of a string?
strip strips the characters given from both ends of the string; in your case it strips '.', 'c', 'o' and 'm'.
It will also remove those characters from the front of the string. If you just want it to remove from the end, use rstrip()
Yeah. str.strip doesn't do what you think it does. str.strip removes any of the characters specified from the beginning and the end of the string. So, 'acbacda'.strip('ad') gives 'cbac'; the a at the beginning and the da at the end were stripped. Cheers.
@scvalex, wow, I just realised this after having used it that way for ages; it's dangerous because the code often happens to work anyway
24 Answers
strip doesn't mean "remove this substring". x.strip(y) treats y as a set of characters and strips any characters in that set from both ends of x.
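To make the character-set behaviour concrete with the url from the question:

```python
url = 'abcdc.com'

# '.com' is treated as the character set {'.', 'c', 'o', 'm'}, not as a
# substring, so even rstrip() eats the trailing 'c' of 'abcdc'.
print(url.strip('.com'))   # abcd
print(url.rstrip('.com'))  # abcd
```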
On Python 3.9 and newer you can use the removeprefix and removesuffix methods to remove an entire substring from either side of the string:
url = 'abcdc.com'
url.removesuffix('.com')    # Returns 'abcdc'
url.removeprefix('abcdc.')  # Returns 'com'
The relevant Python Enhancement Proposal is PEP-616.
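One useful property defined in PEP 616 (requires Python 3.9+): a non-matching suffix is a no-op, so no endswith() guard is needed, and only one copy of the suffix is removed per call:

```python
url = 'abcdc.com'

# No match: the original string comes back unchanged.
print(url.removesuffix('.net'))          # abcdc.com
# At most one copy of the suffix is removed per call.
print('a.com.com'.removesuffix('.com'))  # a.com
```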
On Python 3.8 and older you can use endswith and slicing:
url = 'abcdc.com'
if url.endswith('.com'):
    url = url[:-4]
import re
url = 'abcdc.com'
url = re.sub(r'\.com$', '', url)
Yeah, I think the first example, with the endswith() test, is the better one; the regex one involves some performance penalty (parsing the regex, etc.). I wouldn't go with the rsplit() one, but that's because I don't know exactly what you're trying to achieve. I figure it's removing the .com if and only if it appears at the end of the url? The rsplit solution would give you trouble if you used it on domain names like 'www.commercialthingie.co.uk'
What if I write EXAMPLE.COM? Domain names are not case-sensitive. (This is a vote for the regex solution.)
It is not a rewrite; the rsplit() solution doesn't have the same behaviour as the endswith() one when the original string has the substring not at the end but somewhere in the middle. For instance: "www.comeandsee.com".rsplit(".com", 1)[0] == "www.comeandsee" but "www.comeandsee.net".rsplit(".com", 1)[0] == "www"
The syntax s[:-n] has a caveat: for n = 0 , this doesn’t return the string with the last zero characters chopped off, but the empty string instead.
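A small sketch of that edge case, plus one way to guard against it:

```python
url = 'abcdc.com'
suffix = ''

# len('') == 0, and url[:-0] means url[:0] -- the empty string.
print(repr(url[:-len(suffix)]))  # ''

# Guarding on the truthiness of the suffix sidesteps the problem.
result = url[:-len(suffix)] if suffix else url
print(result)                    # abcdc.com
```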
If you are sure that the string only appears at the end, then the simplest way would be to use ‘replace’:
url = 'abcdc.com'
print(url.replace('.com', ''))
That will also mangle a url like www.computerhope.com. Do a check with endswith() first and it should be fine.
"If you are sure that the string only appears at the end": do you mean "if you are sure that the substring appears only once"? replace seems to work also when the substring is in the middle, but as the other comment suggests it will replace any occurrence of the substring; why it should be at the end I don't understand.
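To illustrate both caveats raised in these comments:

```python
# replace() substitutes every occurrence, wherever it appears,
# so it corrupts names that merely contain '.com'.
print('www.computerhope.com'.replace('.com', ''))  # wwwputerhope

# Restricting the removal to a true suffix with endswith() is safe.
url = 'www.computerhope.com'
if url.endswith('.com'):
    url = url[:-len('.com')]
print(url)  # www.computerhope
```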
def strip_end(text, suffix):
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text
@yarichu I copied the code from PEP 616, which introduced this exact function into the stdlib. Another reason I think this way is better: why you would need len(text) - len(suffix) is unclear when you can just use negative indices in Python (in fact, you fixed that bug in an edit, and there used to be a comment here incorrectly telling you that you don't need the len(text), so this seems error-prone), whereas if suffix makes it clear exactly what you're actually checking and why.
Since it seems like nobody has pointed this out yet:
url = "www.example.com"
new_url = url[:url.rfind(".")]
This should be more efficient than the methods using split() as no new list object is created, and this solution works for strings with several dots.
Wow, that is a nice trick. I couldn't get this to fail, but I also had a hard time thinking up ways it might fail. I like it, but it is very "magical"; it's hard to know what it does just by looking at it. I had to mentally process each part of the line to "get it".
This fails if the searched-for string is NOT present, and it wrongly removes the last character instead.
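A sketch of that failure mode, with an explicit guard:

```python
# rfind() returns -1 when no dot is present, and s[:-1] then chops
# the last character -- a silent, wrong result.
name = 'localhost'
print(name[:name.rfind('.')])  # localhos

# Checking for -1 first keeps the string intact.
i = name.rfind('.')
print(name[:i] if i != -1 else name)  # localhost
```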
Starting in Python 3.9, you can use removesuffix instead:
'abcdc.com'.removesuffix('.com') # 'abcdc'
Depends on what you know about your url and exactly what you're trying to do. If you know that it will always end in '.com' (or '.net' or '.org') then
is the quickest solution. If it's a more general URL then you're probably better off looking into the urlparse library that comes with Python.
If, on the other hand, you simply want to remove everything after the final '.' in a string, then
will work. Or if you just want everything up to the first '.', then try
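The answer's code snippets appear to have been lost in extraction; these are plausible reconstructions of the three cases it describes, not the exact originals:

```python
url = 'www.abcdc.com'

# Always ends in '.com': slicing off the known suffix is the quickest.
if url.endswith('.com'):
    print(url[:-4])             # www.abcdc

# Remove everything after the final '.':
print(url.rsplit('.', 1)[0])    # www.abcdc

# Keep only everything up to the first '.':
print(url.split('.', 1)[0])     # www
```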
If you know it’s an extension, then
url = 'abcdc.com'
url.rsplit('.', 1)[0]  # split at '.', starting from the right, maximum 1 split
This works equally well with abcdc.com or www.abcdc.com or abcdc.[anything] and is more extensible.
This seems the most obvious and cleanest way to me. It doesn't have to be an extension, though; you can just split on the whole substring to be matched.
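As the earlier comment shows, a bare rsplit on the whole substring misfires when '.com' occurs mid-string; pairing it with endswith() keeps it safe:

```python
# '.com' occurs in the middle, so rsplit alone truncates too much.
print('www.comeandsee.net'.rsplit('.com', 1)[0])  # www

# Guarded version: only split when the suffix is really at the end.
url = 'www.comeandsee.net'
if url.endswith('.com'):
    url = url.rsplit('.com', 1)[0]
print(url)  # www.comeandsee.net
```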
def remove_suffix(text, suffix): return text[:-len(suffix)] if text.endswith(suffix) and len(suffix) != 0 else text
remove_suffix = lambda text, suffix: text[:-len(suffix)] if text.endswith(suffix) and len(suffix) != 0 else text
For urls (as it seems to be a part of the topic by the given example), one can do something like this:
import os
url = 'http://www.stackoverflow.com'
name, ext = os.path.splitext(url)
print(name, ext)

# Or:
ext = '.' + url.split('.')[-1]
name = url[:-len(ext)]
print(name, ext)
Both will output: ('http://www.stackoverflow', '.com')
This can also be combined with str.endswith(suffix) if you need to split off just ".com", or anything specific.
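A few splitext behaviours worth knowing before relying on it (illustrative examples, not part of the original answer):

```python
import os.path

# splitext splits on the last dot of the final path component...
print(os.path.splitext('abcdc.com'))       # ('abcdc', '.com')
# ...returns an empty extension when there is no dot...
print(os.path.splitext('abcdc'))           # ('abcdc', '')
# ...and only takes the last extension of a multi-dot name.
print(os.path.splitext('archive.tar.gz'))  # ('archive.tar', '.gz')
```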
DISCLAIMER: This method has a critical flaw in that the partition is not anchored to the end of the url and may return spurious results. For example, the result for the URL "www.comcast.net" is "www" (incorrect) instead of the expected "www.comcast.net". This solution is therefore evil. Don't use it unless you know what you are doing!
This is fairly easy to type and also correctly returns the original string (no error) when the suffix ‘.com’ is missing from url .
+1 partition is preferred when only one split is needed since it always returns an answer, an IndexError won’t occur.
This doesn’t correctly handle the suffix not being there. For example, it will incorrectly return www for www.comcast.net .
Assuming you want to remove the domain, no matter what it is (.com, .net, etc.), I recommend finding the . and removing everything from that point on.
url = 'abcdc.com'
dot_index = url.rfind('.')
url = url[:dot_index]
Here I’m using rfind to solve the problem of urls like abcdc.com.net which should be reduced to the name abcdc.com .
If you're also concerned about leading www. prefixes, you should explicitly check for them:
if url.startswith("www."):
    url = url.replace("www.", "", 1)
The 1 in replace is for strange edge cases like www.net.www.com
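A quick check of that edge case:

```python
url = 'www.net.www.com'

# With count=1, only the leading 'www.' is removed.
if url.startswith('www.'):
    url = url.replace('www.', '', 1)
print(url)  # net.www.com

# Slicing off the known prefix length is an equivalent alternative.
url2 = 'www.net.www.com'
if url2.startswith('www.'):
    url2 = url2[len('www.'):]
print(url2)  # net.www.com
```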
If your url gets any wilder than that look at the regex answers people have responded with.
If you mean to only strip the extension:
'.'.join('abcdc.com'.split('.')[:-1]) # 'abcdc'
It works with any extension, even when other dots are present in the filename. It simply splits the string on dots into a list and joins it back without the last element.
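One caveat worth noting: if the string contains no dot at all, the only element is dropped and an empty string comes back:

```python
print('.'.join('abcdc.com'.split('.')[:-1]))    # abcdc

# No dot: split('.') yields one element, [:-1] drops it, join yields ''.
print(repr('.'.join('abcdc'.split('.')[:-1])))  # ''

# rsplit with maxsplit=1 degrades more gracefully here.
print('abcdc'.rsplit('.', 1)[0])                # abcdc
```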
If you need to strip a suffix from a string when it is present, and otherwise do nothing, here are my best solutions. You will probably want to use one of the first two implementations, but I have included the third for completeness.
def remove_suffix(v, s):
    return v[:-len(s)] if v.endswith(s) else v

remove_suffix("abc.com", ".com") == 'abc'
remove_suffix("abc", ".com") == 'abc'
import re

def remove_suffix_compile(suffix_pattern):
    r = re.compile(f"(.*?)({suffix_pattern})?$")
    return lambda v: r.match(v)[1]

remove_domain = remove_suffix_compile(r"\.[a-zA-Z0-9]+")
remove_domain("abc.com") == "abc"
remove_domain("sub.abc.net") == "sub.abc"
remove_domain("abc.") == "abc."
remove_domain("abc") == "abc"
For a collection of constant suffixes the asymptotically fastest way for a large number of calls:
def remove_suffix_preprocess(*suffixes):
    suffixes = set(suffixes)
    try:
        suffixes.remove('')
    except KeyError:
        pass

    def helper(suffixes, pos):
        if len(suffixes) == 1:
            suf = suffixes[0]
            l = -len(suf)
            ls = slice(0, l)
            return lambda v: v[ls] if v.endswith(suf) else v
        si = iter(suffixes)
        ml = len(next(si))
        exact = False
        for suf in si:
            l = len(suf)
            if -l == pos:
                exact = True
            else:
                ml = min(len(suf), ml)
        ml = -ml
        suffix_dict = {}
        for suf in suffixes:
            sub = suf[ml:pos]
            if sub in suffix_dict:
                suffix_dict[sub].append(suf)
            else:
                suffix_dict[sub] = [suf]
        if exact:
            del suffix_dict['']
            for key in suffix_dict:
                suffix_dict[key] = helper([s[:pos] for s in suffix_dict[key]], None)
            return lambda v: suffix_dict.get(v[ml:pos], lambda v: v)(v[:pos])
        else:
            for key in suffix_dict:
                suffix_dict[key] = helper(suffix_dict[key], ml)
            return lambda v: suffix_dict.get(v[ml:pos], lambda v: v)(v)

    return helper(tuple(suffixes), None)

domain_remove = remove_suffix_preprocess(".com", ".net", ".edu", ".uk", '.tv', '.co.uk', '.org.uk')
The final one is probably significantly faster in PyPy than in CPython. The regex variant is likely faster than this for virtually all cases that do not involve huge dictionaries of potential suffixes that cannot easily be represented as a regex, at least in CPython.
In PyPy the regex variant is almost certainly slower for a large number of calls or long strings, even if the re module uses a DFA-compiling regex engine, as the vast majority of the overhead of the lambdas will be optimized out by the JIT.
In CPython, however, the fact that you're running C code for the regex comparison almost certainly outweighs the algorithmic advantages of the suffix-collection version in almost all cases.