How to extract a filename from a URL and append a word to it?
I would like to extract the file name in this URL: 09-09-201315-47-571378756077.jpg. Once I have this file name, I’m going to save the file under that name on the Desktop.
filename = **extracted file name from the url**
download_photo = urllib.urlretrieve(url, "/home/ubuntu/Desktop/%s.jpg" % (filename))
After this, I’m going to resize the photo. Once that is done, I’m going to save the resized version and append the word "_small" to the end of the filename.
downloadedphoto = Image.open("/home/ubuntu/Desktop/%s.jpg" % (filename))
resize_downloadedphoto = downloadedphoto.resize((300, 300), Image.ANTIALIAS)
resize_downloadedphoto.save("/home/ubuntu/Desktop/%s.jpg" % (filename + "_small"))
From this, what I am trying to achieve is to get two files, the original photo with the original name, then the resized photo with the modified name. Like so:
09-09-201315-47-571378756077.jpg
09-09-201315-47-571378756077_small.jpg
import os
from urllib.parse import urlparse

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
a = urlparse(url)
print(a.path)  # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg
Your URL might contain percent-encoded characters like %20 for a space or %E7%89%B9%E8%89%B2 for "特色". If that’s the case, you’ll need to unquote (or unquote_plus) them. You can also use pathlib.Path().name instead of os.path.basename, which makes it easier to add a suffix to the name (as asked in the original question):
from pathlib import Path
from urllib.parse import urlparse, unquote

url = "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"
url_parsed = urlparse(url)
print(unquote(url_parsed.path))  # Output: /kyle/09-09-2013 - 15-47-571378756077.jpg

file_path = Path("/home/ubuntu/Desktop/") / unquote(Path(url_parsed.path).name)
print(file_path)  # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077.jpg

new_file = file_path.with_stem(file_path.stem + "_small")  # with_stem() requires Python 3.9+
print(new_file)  # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077_small.jpg
Also, an alternative is to use unquote(urlparse(url).path.split("/")[-1]).
@elky One does need urlparse. Only with urlparse will a URL with a query string, like http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg?size=1000px, yield the filename 09-09-201315-47-571378756077.jpg. If you only use os.path.basename(url), the extracted filename will include the query string: 09-09-201315-47-571378756077.jpg?size=1000px. This is usually not the desired result.
@Jean-FrancoisT. it doesn’t work, you just didn’t think of the edge cases, like when you have a percent-encoded #. Try Path(unquote(urlparse('http://example.com/my%20%23superawesome%20picture.jpg').path)).name vs Path(urlparse(unquote('http://example.com/my%20%23superawesome%20picture.jpg')).path).name. It’s just never a good idea to blindly modify something you intend to parse before parsing it.
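The two orderings from the comment above can be compared directly. This is a minimal sketch; it uses PurePosixPath instead of Path so the result does not depend on the operating system, which is a choice made here and not part of the comment.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

url = "http://example.com/my%20%23superawesome%20picture.jpg"

# Parse first, then decode: the encoded "#" stays inside the path.
parse_then_unquote = PurePosixPath(unquote(urlparse(url).path)).name

# Decode first, then parse: the decoded "#" now starts a fragment,
# so everything after it is cut off from the path.
unquote_then_parse = PurePosixPath(urlparse(unquote(url)).path).name

print(repr(parse_then_unquote))  # 'my #superawesome picture.jpg'
print(repr(unquote_then_parse))  # the part after "#" is missing
```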
In [1]: os.path.basename("https://example.com/file.html")
Out[1]: 'file.html'

In [2]: os.path.basename("https://example.com/file")
Out[2]: 'file'

In [3]: os.path.basename("https://example.com/")
Out[3]: ''

In [4]: os.path.basename("https://example.com")
Out[4]: 'example.com'
Nobody has thus far provided a complete solution.
A URL can contain a ?[query-string] and/or a #[fragment Identifier] (but only in that order: ref)
In [1]: from os import path

In [2]: def get_filename(url):
   ...:     fragment_removed = url.split("#")[0]  # keep to left of first #
   ...:     query_string_removed = fragment_removed.split("?")[0]
   ...:     scheme_removed = query_string_removed.split("://")[-1].split(":")[-1]
   ...:     if scheme_removed.find("/") == -1:
   ...:         return ""
   ...:     return path.basename(scheme_removed)
   ...:

In [3]: get_filename("a.com/b")
Out[3]: 'b'

In [4]: get_filename("a.com/")
Out[4]: ''

In [5]: get_filename("https://a.com/")
Out[5]: ''

In [6]: get_filename("https://a.com/b")
Out[6]: 'b'

In [7]: get_filename("https://a.com/b?c=d#e")
Out[7]: 'b'
@Pi "Nobody has thus far provided a complete solution": the accepted answer is a "complete solution" that throws out the # and ? parts of the URL, which it does using the URL parsing built into Python (which might handle an edge case you didn’t think of).
I prefer this answer to the one above that uses urllib.parse.urlparse with os.path.basename by @Boris, because this answer only imports the os package, not urllib, which is mostly duplicated by Requests and superseded by urllib2. One less dependency to become obsolete and cause future code maintenance.
@RichLysakowskiPhD there is no such thing as urllib2 on Python 3 and requests uses urllib.parse under the hood. How is implementing URL parsing yourself a smaller maintenance burden than an import?
@Boris you are right. urllib2 does not exist in Python 3, so urllib built into Python or requests is the way to go. Thank you for clarifying with a source url : github.com/psf/requests/blob/…
filename = url[url.rfind("/")+1:]
filename_small = filename.replace(".", "_small.")
Maybe use ".jpg" in the last case, since a "." can also appear elsewhere in the filename.
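One way to address the concern in the comment above without hard-coding ".jpg" is to split off only the final extension with os.path.splitext. This is a sketch, and add_suffix is a hypothetical helper name, not something from the answer.

```python
import os


def add_suffix(filename, suffix="_small"):
    """Insert suffix before the final extension of filename."""
    root, ext = os.path.splitext(filename)  # splits at the last dot only
    return root + suffix + ext


print(add_suffix("09-09-201315-47-571378756077.jpg"))
# 09-09-201315-47-571378756077_small.jpg
print(add_suffix("report_01.01.2022.pdf"))
# report_01.01.2022_small.pdf
```

By comparison, filename.replace(".", "_small.") would rewrite every dot in a name like report_01.01.2022.pdf.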
You could just split the url by "/" and retrieve the last member of the list:
url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
filename = url.split("/")[-1]
# 09-09-201315-47-571378756077.jpg
Then use replace to change the ending:
small_jpg = filename.replace(".jpg", "_small.jpg") #09-09-201315-47-571378756077_small.jpg
With python3 (from 3.4 upwards) you can abuse the pathlib library in the following way:
from pathlib import Path

p = Path('http://example.com/somefile.html')
print(p.name)  # >>> 'somefile.html'
print(p.stem)  # >>> 'somefile'
print(p.suffix)  # >>> '.html'
print(f'{p.stem}-spamspam{p.suffix}')  # >>> 'somefile-spamspam.html'
❗️ WARNING
The pathlib module is NOT meant for parsing URLs — it is designed to work with POSIX paths only. Don’t use it in production code! It’s a dirty quick hack for non-critical code. The fact that pathlib also works with URLs can be considered an accident that might be fixed in future releases. The code is only provided as an example of what you can but probably should not do. If you need to parse URLs in a canonic way then prefer using urllib.parse or alternatives. Or, if you make an assumption that the portion after the domain and before the parameters+queries+hash is supposedly a POSIX path then you can extract just the path fragment using urllib.parse.urlparse and then use pathlib.Path to manipulate it.
This breaks with URLs with stuff after the path. Path('http://example.com/somefile.html?some-querystring#some-id').name will return 'somefile.html?some-querystring#some-id'.
Use urllib.parse.urlparse to get just the path part of the URL, and then use pathlib.Path on that path to get the filename:
from urllib.parse import urlparse
from pathlib import Path

url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true&some_more=true#and-an-anchor"
a = urlparse(url)
a.path  # '/some/long/path/a_filename.jpg'
Path(a.path).name  # 'a_filename.jpg'
@Stephen it will work because pathlib uses forward slashes when defining paths, even on Windows. However, note that pathlib converts "/" to "\" on Windows when you convert Path objects to str or bytes, so if you’re modifying the above code to do something different, like getting the filename and the part before it (as in path/a_filename.jpg), but you want to keep forward slashes as forward slashes, you can do str(PurePosixPath(urlparse(url).path)) instead of str(Path(urlparse(url).path)).
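The PurePosixPath variant from the comment above could look like the following sketch. The URL is a shortened version of the one in the answer, and taking the last two path components is just one illustrative way to get "the filename and the part before it".

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true#and-an-anchor"

# PurePosixPath always uses forward slashes, on every operating system.
url_path = PurePosixPath(urlparse(url).path)

print(url_path.name)         # a_filename.jpg
print(str(url_path.parent))  # /some/long/path

# Last directory plus filename, still with forward slashes.
tail = str(PurePosixPath(*url_path.parts[-2:]))
print(tail)                  # path/a_filename.jpg
```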
Sometimes there is a query string:
filename = url.split("/")[-1].split("?")[0]
new_filename = filename.replace(".jpg", "_small.jpg")
A simple version using the os package:
import os

def get_url_file_name(url):
    url = url.split("#")[0]
    url = url.split("?")[0]
    return os.path.basename(url)

print(get_url_file_name("example.com/myfile.tar.gz"))  # 'myfile.tar.gz'
print(get_url_file_name("example.com/"))  # ''
print(get_url_file_name("https://example.com/"))  # ''
print(get_url_file_name("https://example.com/hello.zip"))  # 'hello.zip'
print(get_url_file_name("https://example.com/args.tar.gz?c=d#e"))  # 'args.tar.gz'
Sometimes the link you have redirects somewhere else (that was the case for me). In that case you have to resolve the redirects first:
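import requests

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
# requests.head() does not follow redirects by default, so enable it explicitly
response = requests.head(url, allow_redirects=True)
url = response.url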
then you can continue with the best answer at the moment (Ofir’s)
import os
from urllib.parse import urlparse

a = urlparse(url)
print(a.path)  # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg
It doesn’t work with this URL, however, as the page isn’t available anymore.
helps you to extract the image name. To append to the name:
imageName = '09-09-201315-47-571378756077'
new_name = '{0}_small.jpg'.format(imageName)
I see people using the Pathlib library to parse URLs. This is not a good idea! Pathlib is not designed for it, use special libraries like urllib or similar instead.
This is the most stable version I could come up with. It handles params as well as fragments:
from urllib.parse import urlparse, ParseResult

def update_filename(url):
    parsed_url = urlparse(url)
    path = parsed_url.path
    filename = path[path.rfind('/') + 1:]
    if not filename:
        return
    file, extension = filename.rsplit('.', 1)
    new_path = parsed_url.path.replace(filename, f"{file}_small.{extension}")
    parsed_url = ParseResult(**{**parsed_url._asdict(), 'path': new_path})
    return parsed_url.geturl()
assert update_filename('https://example.com/') is None
assert update_filename('https://example.com/path/to/') is None
assert update_filename('https://example.com/path/to/report.pdf') == 'https://example.com/path/to/report_small.pdf'
assert update_filename('https://example.com/path/to/filename with spaces.pdf') == 'https://example.com/path/to/filename with spaces_small.pdf'
assert update_filename('https://example.com/path/to/report_01.01.2022.pdf') == 'https://example.com/path/to/report_01.01.2022_small.pdf'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2#test') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2#test'
urllib2 file name
Is there an easy way to get the file name other than parsing the original URL? EDIT: Changed openfile to urlopen. Not sure how that happened. EDIT2: I ended up using:
filename = url.split('/')[-1].split('#')[0].split('?')[0]
Do make sure you know what you want in these two cases: a trailing slash (http://example.com/somefile/) and no path (http://example.com). Your example will fail on the latter for sure (returning "example.com"). So will @insin’s final answer. That’s another reason why using urlsplit is good advice.
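The two edge cases named in the comment above can be made explicit in a small helper. This is a sketch that assumes Python 3, where urlsplit lives in urllib.parse (the thread itself uses the Python 2 urlparse module). filename_from_url and its default parameter are hypothetical names.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, default=None):
    """Return the last path segment of url, or default if there is none."""
    # urlsplit keeps the host out of the path, so "http://example.com"
    # yields an empty path instead of "example.com".
    name = posixpath.basename(urlsplit(url).path)
    return name or default


print(filename_from_url("http://example.com/somefile.zip"))  # somefile.zip
print(filename_from_url("http://example.com/somefile/"))     # None
print(filename_from_url("http://example.com"))               # None
```

Returning a caller-supplied default leaves the decision about what a trailing slash or a missing path should mean to the calling code.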
Lots of answers here miss the fact that there are two places to look for a file name: the URL and the Content-Disposition header field. All the current answers that mention the header neglect to mention that cgi.parse_header() will parse it correctly. There is a better answer here: stackoverflow.com/a/11783319/205212
You could potentially lift the intended filename if the server was sending a Content-Disposition header by checking remotefile.info()['Content-Disposition'], but as it is I think you’ll just have to parse the url.
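If the server does send that header, its filename parameter can be pulled out with the standard library's email.message.Message, as in this sketch. This is a swapped-in technique, not what the answer or the comments use, and it works on a header string so no network request is needed. filename_from_content_disposition is a hypothetical helper name and the header values are made-up examples.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() reads the "filename" parameter of Content-Disposition.
    return msg.get_filename()


print(filename_from_content_disposition('attachment; filename="somefile.zip"'))
# somefile.zip
print(filename_from_content_disposition("inline"))
# None
```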
You could use urlparse.urlsplit, but if you have any URLs like the second example, you’ll end up having to pull the file name out yourself anyway:
>>> urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')
>>> urlparse.urlsplit('http://example.com/somedir/somefile.zip')
('http', 'example.com', '/somedir/somefile.zip', '', '')
Might as well just do this:
>>> 'http://example.com/somefile.zip'.split('/')[-1]
'somefile.zip'
>>> 'http://example.com/somedir/somefile.zip'.split('/')[-1]
'somefile.zip'
I would always use urlsplit() and never straight string splitting. The latter will choke if you have a URL that has a fragment or query appended, say example.com/filename.html?cookie=55#Section_3.
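The failure mode described in the comment above can be reproduced with a short sketch. It assumes Python 3's urllib.parse, and the scheme and the somedir segment in the URL are added here for illustration.

```python
import posixpath
from urllib.parse import urlsplit

url = "http://example.com/somedir/filename.html?cookie=55#Section_3"

# Plain string splitting keeps the query string and fragment.
naive = url.split("/")[-1]

# urlsplit separates query and fragment before the path is inspected.
parsed = posixpath.basename(urlsplit(url).path)

print(naive)   # filename.html?cookie=55#Section_3
print(parsed)  # filename.html
```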
If you only want the file name itself, assuming that there are no query variables at the end like http://example.com/somedir/somefile.zip?foo=bar, then you can use os.path.basename for this:
[user@host]$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.basename("http://example.com/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip?foo=bar")
'somefile.zip?foo=bar'
Some other posters mentioned using urlparse, which will work, but you’d still need to strip the leading directory from the file name. If you use os.path.basename() then you don’t have to worry about that, since it returns only the final part of the URL or file path.