Python find url in text

In this tutorial, we’ll try to understand how to extract links from a text file in Python with the help of some examples.

There are multiple ways to extract URLs from a text file using Python. Two commonly used methods are regular expressions (via the re module) and the urlparse() function from the urllib.parse module.


Let’s now look at both methods in detail.

We’ll work with a text file, “learn.txt”, which contains some words and URLs, to demonstrate both methods. Its contents are printed in the first code example below.

Regular expressions are commonly used to extract information from text through pattern matching. The idea is to define a pattern (or rule) and then scan the text for matches. Since URLs follow a recognizable pattern (for example, they start with http:// or https://), we can use regular expressions to extract them from a text file.

You can use Python’s built-in re module to work with regular expressions. We’ll use the re.findall() function to find all the matching URLs in a text. The syntax is as follows.

Basic Syntax:

    re.findall(regex, text)

Parameters: regex is the regular expression pattern to match, and text is the string to search. The function returns a list of all non-overlapping matches.
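
As a quick illustration of the call shape, here is a minimal sketch. The sample string and the deliberately simple pattern are made up for this example and are not the pattern used in the rest of the tutorial.

```python
import re

# Hypothetical sample text, made up for illustration.
sample = "Docs at https://docs.python.org and code at http://example.com/repo"

# A deliberately simple pattern: "http" or "https", then "://",
# then a run of non-whitespace characters.
simple_pattern = r"https?://\S+"

# re.findall returns a list of every non-overlapping match, in order.
found = re.findall(simple_pattern, sample)
print(found)
```

A pattern this loose also swallows any punctuation that directly follows a URL, which is one reason real-world patterns tend to be longer.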

Let’s now use this method to extract all the URLs from the text file “learn.txt”. First, let’s read the contents of the file to a string.

    # open the text file and read its contents into a string
    with open("learn.txt", "r") as text_file:
        s = text_file.read()

    # display the text
    print(s)

This is a sample text file with some words and URLs. For example, you can visit the Python website at https://www.python.org to learn more about the Python programming language. You can also visit the Google website at http://www.google.com to search for information on the internet. Finally, you can visit Data Science Parichay's website at https://datascienceparichay.com to learn about data science via easy to understand tutorials and examples.

Let’s now extract the URLs from this string (the contents of the text file) using a regular expression.

    import re

    # extract the URLs
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', s)

    # display the extracted URLs
    print(urls)

['https://www.python.org', 'http://www.google.com', 'https://datascienceparichay.com']

You can see that we were able to extract all three URLs in the above text file.
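
One caveat worth sketching: the character classes in this pattern include '.' and ',', so a URL that is immediately followed by sentence punctuation is captured together with it. The snippet below is only an illustration under that assumption. The sample sentence is hypothetical, and stripping trailing punctuation is one possible cleanup, not part of the original method.

```python
import re

# The pattern from the example above, with the digit alternative written as [0-9].
URL_PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Hypothetical text in which each URL is followed directly by punctuation.
text = "See https://www.python.org, or search at http://www.google.com."

# The raw matches keep the trailing comma and period.
raw = re.findall(URL_PATTERN, text)

# Strip punctuation that most likely belongs to the sentence, not the URL.
cleaned = [u.rstrip(".,;:!?") for u in raw]

print(raw)
print(cleaned)
```

This cleanup would also trim a URL that legitimately ends in one of those characters, so treat it as a heuristic.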

The urllib.parse module in Python provides a urlparse() function that parses a URL into its components, following the general structure of a URL: scheme://netloc/path;parameters?query#fragment.
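
To make that structure concrete, here is a small sketch that parses a single URL and prints each component. The URL is hypothetical and was chosen only so that every component is non-empty.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen so that every component is non-empty.
parts = urlparse("https://example.com/docs/page;v=1?lang=en&q=url#intro")

# Each attribute corresponds to one piece of
# scheme://netloc/path;parameters?query#fragment
print(parts.scheme)
print(parts.netloc)
print(parts.path)
print(parts.params)
print(parts.query)
print(parts.fragment)
```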

The idea is to split the string into tokens (words) and try to parse each one as a URL. If the parsed token has a scheme, we treat it as a URL and add it to our urls list.

Let’s print out the contents of the text file “learn.txt” that we read above again.

This is a sample text file with some words and URLs. For example, you can visit the Python website at https://www.python.org to learn more about the Python programming language. You can also visit the Google website at http://www.google.com to search for information on the internet. Finally, you can visit Data Science Parichay's website at https://datascienceparichay.com to learn about data science via easy to understand tutorials and examples.

Let’s now extract the URLs from the above text.

    from urllib.parse import urlparse

    # extract the URLs using the urlparse() function
    urls = [urlparse(url).geturl() for url in s.split() if urlparse(url).scheme]

    # display the extracted URLs
    print(urls)

['https://www.python.org', 'http://www.google.com', 'https://datascienceparichay.com']

We get the same result as above.
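
Checking only for a scheme is fairly permissive, because any token of the form name:rest parses with a scheme. As a hedged sketch of a stricter variant, the hypothetical helper looks_like_web_url below (not part of the original tutorial) also requires an http or https scheme and a non-empty host. The sample text is made up.

```python
from urllib.parse import urlparse

def looks_like_web_url(token):
    """Hypothetical helper: accept only http(s) tokens that also have a host."""
    parsed = urlparse(token)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Hypothetical text with a token that has a scheme but is not a web link.
text = "Mail me at mailto:someone@example.com or visit https://www.python.org today"

# Scheme-only check, as in the example above: the mailto: token passes too.
loose = [t for t in text.split() if urlparse(t).scheme]

# Stricter check: only the web link passes.
strict = [t for t in text.split() if looks_like_web_url(t)]

print(loose)
print(strict)
```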

For more on the urlparse() function, refer to its documentation.


Authors

Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

I’m an undergraduate student at IIT Madras interested in exploring new technologies. I have worked on various projects related to data science, machine learning, and neural networks, including image classification using convolutional neural networks and stock prediction using recurrent neural networks. In my blog articles I try to provide a complete guide to a particular topic, covering as many examples and edge cases as possible.


Python: Find urls in a string

Python Regular Expression: Exercise-42 with Solution

Write a Python program to find URLs in a string.

Sample Solution:

Python Code:

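    import re

    # sample HTML containing two links
    text = '<p>Contents :</p><a href="https://w3resource.com">Python Examples</a><a href="http://github.com">Even More Examples</a>'

    # find every http/https URL in the text
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)

    print("Original string: ", text)
    print("Urls: ", urls)

Sample Output:

    Original string:  <p>Contents :</p><a href="https://w3resource.com">Python Examples</a><a href="http://github.com">Even More Examples</a>
    Urls:  ['https://w3resource.com', 'http://github.com']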



Python Program: How To Find URL in Python String?

The findall() function of the re module returns a list of all non-overlapping matches of a pattern in a string, whereas the search() function returns a match object for the first location where the pattern matches, or None if there is no match.
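
The difference between the two can be sketched in a few lines. The sample string and the simple pattern here are assumptions made for illustration only.

```python
import re

# Hypothetical sample string with two links.
text = "First https://example.com/a then https://example.com/b"
pattern = r"https?://\S+"

# findall: a list with every match.
all_matches = re.findall(pattern, text)

# search: a match object describing only the first match (or None).
first_match = re.search(pattern, text)

print(all_matches)
print(first_match.group(0), first_match.span())
```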

Let’s get started and see the use of these methods with the help of program examples.

Program To Find URL in String in Python

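In this program, we use the findall() function, which takes a regex as its first argument and the string to search as its second. Because this pattern contains capturing groups, findall() returns a list of tuples, one per match, and the first element of each tuple is the complete URL. See the code example.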

    # Python program to find URLs in a string
    import re

    # Take a string containing a URL
    string = "Python Programs: https://www.javaexercise.com/"

    # Regex that matches URLs
    regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    url = re.findall(regex, string)

    # Keep the first group of each match, which holds the whole URL
    result = [x[0] for x in url]

    # Display result
    print("Urls: ", result)
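
The x[0] step in the program above is needed because of how findall() treats capturing groups. The following sketch shows that behavior with a much simpler, made-up pattern and sample string, plus re.finditer() as one possible way to always get the whole match.

```python
import re

# Hypothetical sample string.
text = "Mirror one: https://example.com/a and mirror two: http://example.org/b"

# One capturing group: findall returns only the group contents.
with_group = re.findall(r"(https?)://\S+", text)

# Two capturing groups: findall returns one tuple per match.
as_tuples = re.findall(r"((https?)://\S+)", text)

# finditer yields match objects, so group(0) is always the whole match.
whole = [m.group(0) for m in re.finditer(r"(https?)://\S+", text)]

print(with_group)
print(as_tuples)
print(whole)
```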

In this program, we use the search() function of the re module, which returns a match object for the first URL found. Calling group("url") on that object gives the URL as a string. See the code below.

    # Python program to find a URL in a string
    import re

    # Take a string containing a URL
    string = "Visit at : https://www.javaexercise.com/"

    # Search for the first URL and read it from the named group "url"
    result = re.search(r"(?P<url>https?://[^\s]+)", string).group("url")

    # Display result
    print("Urls: ", result)
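
Note that re.search() returns None when nothing matches, so chaining .group() directly raises an AttributeError on text without a URL. A minimal defensive sketch, using a hypothetical helper named first_url, might look like this:

```python
import re

def first_url(text):
    """Hypothetical helper: return the first http(s) URL in text, or None."""
    match = re.search(r"(?P<url>https?://\S+)", text)
    # Only call group() when a match object was actually returned.
    return match.group("url") if match else None

print(first_url("Visit at : https://www.javaexercise.com/"))
print(first_url("no links in this sentence"))
```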


How to extract a URL from a string?


    import re

    text = 'some string [url]http://vk.com/club123456768[/url] from which the url needs to be extracted [url]http://vk.com/club42[/url]'
    url_pattern = r'\[url\](.*?)\[\/url\]'
    urls = re.findall(url_pattern, text)
    # ['http://vk.com/club123456768', 'http://vk.com/club42']
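
The .*? in that pattern is a non-greedy (lazy) match, which matters when a string contains more than one tagged URL. Here is a small sketch of the difference, using a shortened, made-up sample string:

```python
import re

# Hypothetical sample with two [url]...[/url] pairs.
text = 'a [url]http://vk.com/club1[/url] b [url]http://vk.com/club42[/url]'

# Lazy: stops at the nearest closing tag, giving one result per pair.
lazy = re.findall(r'\[url\](.*?)\[/url\]', text)

# Greedy: runs to the last closing tag, merging both pairs into one match.
greedy = re.findall(r'\[url\](.*)\[/url\]', text)

print(lazy)
print(greedy)
```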

Best answer (marked as the solution by aalexandrov)

0x10,
His string does not contain the [url] tag :-). The forum adds it automatically.

    import re

    text = 'some string http://vk.com/club123456768 from which the url needs to be extracted http://vk.com/club42'
    url_pattern = r'http://[\S]+'
    urls = re.findall(url_pattern, text)
    print(urls)
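
The pattern in this answer matches only links that start with http://. As a hedged variation that is not part of the original answer, the sketch below also accepts https:// and removes repeated links while keeping their order. The sample string is made up.

```python
import re

# Hypothetical sample with an https link that appears twice.
text = 'see https://vk.com/club42 and http://vk.com/club42 and again https://vk.com/club42'

# "s?" makes the "s" optional, so both http:// and https:// match.
urls = re.findall(r'https?://\S+', text)

# dict.fromkeys keeps the first occurrence of each URL, in order.
unique_urls = list(dict.fromkeys(urls))

print(urls)
print(unique_urls)
```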

aalexandrov,
P.S. In the advanced editing mode, under Additional Options, there is an "Other" section with a checkbox: uncheck "Automatically insert links".
