Building URLs in Python
Building URLs is really common in applications and APIs because most applications are pretty interconnected. But how should we do it in Python? Here’s my take on the subject.
Different codebases might have different requirements.
Let’s see how the different options compare.
The standard way
Python has a built-in module made specifically for parsing URLs, called urllib.parse.
You can use the urllib.parse.urlsplit function to break a URL string into a five-item named tuple. The items are parsed according to the general URL structure:
scheme://netloc/path?query#fragment
The opposite of breaking a URL into parts is building one from its parts with the urllib.parse.urlunsplit function.
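As a quick illustration of the two functions, here is a minimal sketch that splits a URL and then puts it back together. The example URL is made up for this demonstration.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical URL used only for illustration.
url = "https://example.com/api/v1/book/12?format=mp3#details"

parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.netloc)    # example.com
print(parts.path)      # /api/v1/book/12
print(parts.query)     # format=mp3
print(parts.fragment)  # details

# The result is a named tuple, so _replace() returns a modified copy.
http_parts = parts._replace(scheme="http")

# urlunsplit() accepts any five-item iterable and rebuilds the string.
rebuilt = urlunsplit(parts)
rebuilt_http = urlunsplit(http_parts)
print(rebuilt_http)  # http://example.com/api/v1/book/12?format=mp3#details
```

Because the split result is an ordinary named tuple, changing a single component and rebuilding the URL takes only two calls.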
If you check the library documentation you’ll notice that there is also a urlparse function. The difference between it and the urlsplit function is an additional item in the parse result for path parameters.
https://www.example.com/some/path;parameter=12?q=query
Path parameters are separated with a semicolon from the path and located before the query arguments that start with a question mark. Most of the time you don’t need them but it is good to know that they exist.
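To make the difference concrete, this small sketch runs both functions on the example URL above and compares where the path parameter ends up.

```python
from urllib.parse import urlparse, urlsplit

url = "https://www.example.com/some/path;parameter=12?q=query"

parsed = urlparse(url)  # six items: includes a separate `params` field
split = urlsplit(url)   # five items: the parameter stays inside the path

print(parsed.path)    # /some/path
print(parsed.params)  # parameter=12
print(split.path)     # /some/path;parameter=12
print(split.query)    # q=query
```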
So how would you then build a URL with urllib.parse?
Let’s assume that you want to call some API and need a function for building the API URL. The required URL could be for example:
https://example.com/api/v1/book/12?format=mp3&token=abbadabba
Here is how we could build the URL:
import os
from urllib.parse import urlunsplit, urlencode

# Scheme and host are configurable through the environment.
SCHEME = os.environ.get("API_SCHEME", "https")
NETLOC = os.environ.get("API_NETLOC", "example.com")


def build_api_url(book_id, format, token):
    path = f"/api/v1/book/{book_id}"
    query = urlencode(dict(format=format, token=token))
    # The last item is the fragment, which is left empty here.
    return urlunsplit((SCHEME, NETLOC, path, query, ""))
Calling the function works as expected:
>>> build_api_url(12, "mp3", "abbadabba")
'https://example.com/api/v1/book/12?format=mp3&token=abbadabba'
I used environment variables for the scheme and netloc because typically your program is calling a specific API endpoint that you might want to configure via the environment.
I also introduced the urlencode function which transforms a dictionary to a series of key=value pairs separated with & characters. This can be handy if you have lots of query arguments as a dictionary of values can be easier to manipulate.
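Besides joining the pairs, urlencode also escapes characters that would otherwise break the query string. The sketch below uses made-up values to show this, along with the doseq flag for list values.

```python
from urllib.parse import urlencode

# Hypothetical query values chosen to show the escaping.
params = {"format": "mp3", "title": "War & Peace", "lang": ["en", "fi"]}

# Spaces become "+", and "&" inside a value is percent-encoded.
# doseq=True expands list values into repeated keys.
query = urlencode(params, doseq=True)
print(query)  # format=mp3&title=War+%26+Peace&lang=en&lang=fi
```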
The urllib.parse module also contains urljoin, which is loosely similar to os.path.join. It can be used to build URLs by combining a base URL with a path. Let’s modify the example code a bit.
import os
from urllib.parse import urljoin, urlencode

BASE_URL = os.environ.get("BASE_URL", "https://example.com/")


def build_api_url(book_id, format, token):
    path = f"/api/v1/book/{book_id}"
    # urljoin does not add the "?" separator, so it is prepended here.
    query = "?" + urlencode(dict(format=format, token=token))
    return urljoin(BASE_URL, path + query)
This time the whole base URL comes from the environment. The path and query are combined with the base URL using the urljoin function. Notice that this time the question mark at the beginning of the query needs to be set manually.
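One thing to watch out for with this approach: a path that starts with a slash replaces any path already present in the base URL. The sketch below assumes a hypothetical base URL with a /service/ prefix to show the effect.

```python
from urllib.parse import urljoin

path = "/api/v1/book/12"

# Base URL without a path: the result is what you would expect.
plain = urljoin("https://example.com/", path)
print(plain)  # https://example.com/api/v1/book/12

# Hypothetical base URL with a path prefix: the leading slash in `path`
# replaces the whole base path, so "/service" is lost.
lost = urljoin("https://example.com/service/", path)
print(lost)  # https://example.com/api/v1/book/12

# Dropping the leading slash keeps the prefix, as long as the base ends with "/".
kept = urljoin("https://example.com/service/", path.lstrip("/"))
print(kept)  # https://example.com/service/api/v1/book/12
```

If the base URL in your environment can carry a path prefix, it may be safer to join relative paths without a leading slash.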
The manual way
Libraries can be nice but sometimes you just want to get things done without thinking that much. Here’s a straightforward way to build a URL manually.
import os

# Strip the trailing slash so it is never doubled in the final URL.
BASE_URL = os.environ.get("BASE_URL", "https://example.com").rstrip("/")


def build_api_url(book_id, format, token):
    return f"{BASE_URL}/api/v1/book/{book_id}?format={format}&token={token}"
The f-strings in Python make this quite clean, especially with URLs that always have the same structure and not that many parameters. The BASE_URL initialization strips the trailing forward slash from the environment variable. This way the user doesn’t have to remember if it should be included or not.
Note that I haven’t added any validation for the input parameters in these examples, so you may need to take that into consideration.
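As one possible way to handle that, the sketch below adds a simple check and escapes the query values. The validation rule (book ids are positive integers) is an assumption made for this example, not something the API above dictates.

```python
import os
from urllib.parse import urlencode

BASE_URL = os.environ.get("BASE_URL", "https://example.com").rstrip("/")


def build_api_url(book_id, format, token):
    # Assumed rule for this sketch: book ids are positive integers.
    if not isinstance(book_id, int) or book_id <= 0:
        raise ValueError(f"invalid book id: {book_id!r}")
    # urlencode escapes characters such as "&" or spaces in the values.
    query = urlencode({"format": format, "token": token})
    return f"{BASE_URL}/api/v1/book/{book_id}?{query}"


print(build_api_url(12, "mp3", "a b&c"))
```

With the default base URL, the call above returns https://example.com/api/v1/book/12?format=mp3&token=a+b%26c, while a value like "../admin" as the book id raises a ValueError instead of ending up in the path.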
The Furl way
Then there is a library called furl which aims to make URL parsing and manipulation easy. It can be installed with pip:
$ python3 -m pip install furl
import os
from furl import furl

BASE_URL = os.environ.get("BASE_URL", "https://example.com")


def build_api_url(book_id, format, token):
    f = furl(BASE_URL)
    # The /= operator appends to the existing path.
    f /= f"/api/v1/book/{book_id}"
    f.args["format"] = format
    f.args["token"] = token
    return f.url
There are a few more lines here compared to the previous example. First we need to initialize a furl object from the base URL. The path can be appended using the /= operator, which the library overloads for this purpose.
The query arguments can be set through the dictionary-like args property. Finally, the complete URL is available through the url property.
Here’s an alternative implementation using the set() method to change the path and query arguments of an existing URL.
def build_api_url(book_id, format, token):
    return (
        furl(BASE_URL)
        .set(
            path=f"/api/v1/book/{book_id}",
            args={"format": format, "token": token},
        )
        .url
    )
In addition to building URLs, furl lets you modify existing URLs and parse parts of them. You can find many more examples in the API documentation.
Conclusion
These are just some examples of how to create URLs. Which one do you prefer?
Introduction to Urljoin in Python
This tutorial describes Python urljoin and its behavior when using it. It also demonstrates the use of urljoin in Python using different example codes.
Introduction to urljoin in Python
URLs usually carry essential information that can be used when evaluating a website, a visitor’s search, or the arrangement of the material in each area of a site.
Although URLs can look pretty complex, Python comes with several valuable libraries that let you parse and join URLs and retrieve their constituent parts.
The urllib package in Python 3 enables users to access websites from within their scripts and contains several modules for working with URLs, including functions like urljoin().
The urllib package is central to working with URLs in Python: it lets programs visit and interact with websites through their Uniform Resource Locator.
The package consists of the modules urllib.request, urllib.error, urllib.parse, and urllib.robotparser.
Use of the urljoin() Method
The urljoin() function is helpful when many related URLs are needed, for instance when generating the URLs for a set of pages on a website by adding new values to a base URL.
urljoin(baseurl, newurl, allow_fragments=True)
urljoin() constructs a full (absolute) URL by combining a base URL (baseurl) with another URL (newurl). Informally, it uses components of the base URL, in particular the addressing scheme, the network location, and (part of) the path, to provide the parts missing from the relative URL. In the standard library, the first two parameters are named base and url.
>>> from urllib.parse import urljoin
>>> urljoin('http://www.cwi.nl:50/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl:50/%7Eguido/FAQ.html'
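For the "many related URLs" use case, a minimal sketch could look like the following. The base URL and page names are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical base page and sibling page names.
base = "https://example.com/docs/index.html"
pages = ["intro.html", "install.html", "faq.html"]

# Each relative name replaces the last path segment of the base URL.
urls = [urljoin(base, page) for page in pages]
for url in urls:
    print(url)
# https://example.com/docs/intro.html
# https://example.com/docs/install.html
# https://example.com/docs/faq.html
```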
The allow_fragments argument has the same meaning and default as for urlparse(). If newurl is an absolute URL (that is, it starts with // or scheme://), the hostname and/or scheme of newurl will be present in the output. As an example:
>>> from urllib.parse import urljoin
>>> urljoin('http://www.cwi.nl:50/%7Eguido/Python.html', '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'
If this is not the output you expect, preprocess newurl with urlsplit() and urlunsplit(), removing any scheme and network location parts.
- urlparse(): This function separates a URL into its components so that any particular part can be picked out.
- urlsplit(): This function is an alternative to urlparse() that does not split the parameters from the URL. It is helpful for URLs following RFC 2396, which allows parameters on each path segment.
- urlunsplit(): This function combines the elements of a tuple, as returned by urlsplit(), into a complete URL string.
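The preprocessing step mentioned above can be sketched as follows. The helper name join_same_host is made up for this example; it drops the scheme and network location of the second URL before joining, so the result always stays on the base host.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def join_same_host(base, url):
    """Join `url` onto `base`, ignoring any scheme or host in `url`."""
    parts = urlsplit(url)
    # Rebuild the URL with empty scheme and netloc, keeping the rest.
    relative = urlunsplit(("", "", parts.path, parts.query, parts.fragment))
    return urljoin(base, relative)


base = "http://www.cwi.nl:50/%7Eguido/Python.html"
print(urljoin(base, "//www.python.org/%7Eguido"))
# http://www.python.org/%7Eguido
print(join_same_host(base, "//www.python.org/%7Eguido"))
# http://www.cwi.nl:50/%7Eguido
```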
Use the urljoin() Function to Build URLs
The requests package re-exports urljoin() through its requests.compat module, so it can also assist in building URLs and manipulating them dynamically. Any parent directory of a URL can be derived programmatically, and parts of the URL can be substituted with new values to build new URLs.
The following code uses urljoin() to reach different levels of a URL path and to add new values to the base URL in order to build a new URL.
from requests.compat import urljoin

base = 'https://stackoverflow.com/questions/10893374'
print(urljoin(base, '.'))
print(urljoin(base, '..'))
print(urljoin(base, '. '))
print(urljoin(base, '/10893374/'))
url_query = urljoin(base, '?vers=1.0')
print(url_query)
url_sec = urljoin(url_query, '#section-5.4')
print(url_sec)
https://stackoverflow.com/questions/
https://stackoverflow.com/
https://stackoverflow.com/questions/.
https://stackoverflow.com/10893374/
https://stackoverflow.com/questions/10893374?vers=1.0
https://stackoverflow.com/questions/10893374?vers=1.0#section-5.4
Is there a way to split URLs in Python? Of course, yes!
We can split URLs into many components beyond the primary address. The additional parameters used for a particular query, or the fragments attached to the URL, are separated using the urlparse() function, as shown below.
from requests.compat import urlparse

url_01 = 'https://docs.python.org/3/library/__main__.html?highlight=python%20hello%20world'
url_02 = 'https://docs.python.org/2/py-modindex.html#cap-f'
print(urlparse(url_01))
print(urlparse(url_02))
ParseResult(scheme='https', netloc='docs.python.org', path='/3/library/__main__.html', params='', query='highlight=python%20hello%20world', fragment='')
ParseResult(scheme='https', netloc='docs.python.org', path='/2/py-modindex.html', params='', query='', fragment='cap-f')
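Once a URL is parsed, the individual parts are available as attributes, and the query string can be decoded further with parse_qs(). This sketch imports directly from urllib.parse, which is where requests.compat gets these functions from.

```python
from urllib.parse import urlparse, parse_qs

url = "https://docs.python.org/3/library/__main__.html?highlight=python%20hello%20world"
result = urlparse(url)

print(result.netloc)  # docs.python.org
print(result.path)    # /3/library/__main__.html

# parse_qs() decodes the query string into a dict of lists.
query = parse_qs(result.query)
print(query)  # {'highlight': ['python hello world']}
```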
Use urljoin() to Form URLs
The examples below form URLs from different parts to show how the urljoin() function imported from urllib.parse behaves. Each case is then explained in turn.
>>> from urllib.parse import urljoin
>>> urljoin('test', 'task')
'task'
>>> urljoin('http://test', 'task')
'http://test/task'
>>> urljoin('http://test/add', 'task')
'http://test/task'
>>> urljoin('http://test/add/', 'task')
'http://test/add/task'
>>> urljoin('http://test/add/', '/task')
'http://test/task'
>>> from urllib.parse import urljoin
>>> urljoin('test', 'task')
'task'
In the above snippet, the first argument can be considered the baseurl (following the urljoin() signature), which corresponds to the page currently displayed in the browser.
The second argument, newurl, can be considered the href of an anchor on that page. The result is the URL of the page the user lands on after clicking the link.
In the remaining examples, the baseurl also includes a scheme and a domain.
>>> from urllib.parse import urljoin
>>> urljoin('http://test', 'task')
'http://test/task'
>>> from urllib.parse import urljoin
>>> urljoin('http://test/add', 'task')
'http://test/task'
Adding another part, as in test/add above, still leads to test/task: because the base does not end with a slash, add is treated as the last segment (like a file name) and is replaced by the relative link task.
>>> from urllib.parse import urljoin
>>> urljoin('http://test/add/', 'task')
'http://test/add/task'
Here the base is test/add/, with a trailing slash, so the relative link resolves to a different URL: test/add/task.
>>> from urllib.parse import urljoin
>>> urljoin('http://test/add/', '/task')
'http://test/task'
If the user is on test/add/ and the href is /task, the leading slash makes the link relative to the site root, so it leads to test/task. So, we can say that urljoin() in Python is a handy function that helps work out URLs as necessary.
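The cases discussed above can be collected into one small, self-checking script. This is just a convenience sketch: the cases list repeats the examples from this section, and the assert verifies each expected result.

```python
from urllib.parse import urljoin

# (base URL, relative URL, expected result)
cases = [
    ("test", "task", "task"),
    ("http://test", "task", "http://test/task"),
    ("http://test/add", "task", "http://test/task"),
    ("http://test/add/", "task", "http://test/add/task"),
    ("http://test/add/", "/task", "http://test/task"),
]

for base, url, expected in cases:
    result = urljoin(base, url)
    assert result == expected, (base, url, result)
    print(f"{base!r:20} + {url!r:8} -> {result!r}")
```

The pattern to remember is that a trailing slash on the base keeps its last segment, while a leading slash on the relative URL resets the path to the site root.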
Nimesha has been a full-stack software engineer for more than five years. He loves technology, as technology has the power to solve many of our problems within just a minute. He has been contributing to various projects over the last 5+ years and has worked with almost all of the three tiers (DB, M-Tier, and Client). Recently, he has started working with DevOps technologies such as Azure administration, Kubernetes, Terraform automation, and Bash scripting as well.