Extracting Substrings in Python

How to Get a Substring of a String in Python

Learning anything new can be a challenge. The more you work with Python, the more you notice how often strings pop up. String manipulation in Python is an important skill. In this article, we give you an introduction to generating a substring of a string in Python.

Python is a great language to learn especially if you’re a beginner, as we discuss in this article. We even have a course on working with strings in Python. It contains interactive exercises designed to start from the basic level and teach you all you need to know about this important data type. Once you’re comfortable working with strings, you can work on some interesting data science problems. Take a look at the Python for Data Science course, which gives you an introduction to this diverse topic.

Slicing and Splitting Strings

The first way to get a substring of a string in Python is by slicing and splitting. Let’s start by defining a string, then jump into a few examples:

>>> string = 'This is a sentence. Here is 1 number.'

You can break this string up into substrings, each of which has the str data type. Even if your string is a number, it is still of this data type. You can test this with the built-in type() function. Numbers may be of other types as well, including the decimal data type, which we discuss here.
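A quick check of this, as a minimal sketch (the names `numeric_text` and `number` are illustrative and not from the article):

```python
# A string made of digits is still a str; only an explicit conversion changes the type.
numeric_text = '42'
number = int(numeric_text)

print(type(numeric_text))  # <class 'str'>
print(type(number))        # <class 'int'>
```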


Much like arrays and lists in Python, strings can be sliced by specifying the start and the end indexes, inside square brackets and separated by a colon. This returns a substring of the original string.

Remember indexing in Python starts from 0. To get the first 7 characters from the string, simply do the following:
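A minimal sketch of that slice, assuming the `string` defined earlier (the name `first_seven` is illustrative):

```python
string = 'This is a sentence. Here is 1 number.'

# Characters at indexes 0 through 6; the end index 7 is excluded.
first_seven = string[:7]
print(first_seven)  # This is
```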

Notice here we didn’t explicitly specify the start index. Therefore, it takes a default value of 0.

By the way, if you want more information about the print() function, check out this article. There’s probably more to it than you realize.

We can also index relative to the end of the string by specifying a negative start value:
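One way this might look, again assuming the same `string` (the name `last_seven` is illustrative):

```python
string = 'This is a sentence. Here is 1 number.'

# A negative start counts back from the end of the string.
last_seven = string[-7:]
print(last_seven)  # number.
```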

Since we didn’t specify an end value, it takes the default value of len(string). If you know the start and the end indexes of a particular word, you can extract it from the string like this:
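As a sketch, the word "sentence" occupies indexes 10 through 17 of the example string, so a slice from 10 to 18 picks it out (the name `word` is illustrative):

```python
string = 'This is a sentence. Here is 1 number.'

# Start at index 10 and stop just before index 18.
word = string[10:18]
print(word)  # sentence
```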

However, this is not optimal for extracting individual words from a string since it requires knowing the indexes in advance.

Another option to get a substring of the string is to break it into words, which can be done with the string.split() method. This takes two optional arguments: a separator string to split at (defaults to any run of whitespace), and the maximum number of splits (defaults to -1, which means no limit). For example, to split at each space, you can do the following, which returns a list of strings:

>>> string.split(' ')
['This', 'is', 'a', 'sentence.', 'Here', 'is', '1', 'number.']

But notice the full stop (point character) is included at the end of the words “sentence” and “number”. We’ll come back to this later in the article when we look at regular expressions.
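The two optional arguments of split() can be sketched as follows, assuming the same example string (the names `words` and `head` are illustrative):

```python
string = 'This is a sentence. Here is 1 number.'

# With no arguments, split() breaks at any run of whitespace.
words = string.split()

# A maximum of 2 splits; the remainder stays together in the last item.
head = string.split(' ', 2)

print(words)
print(head)  # ['This', 'is', 'a sentence. Here is 1 number.']
```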

There are plenty of built-in string methods in Python. They allow you to modify a string, test its properties, or search in it. A useful method to generate a more complex substring of a string in Python is the string.join() method. It takes an iterable of strings and joins them. Here’s an example:

>>> print(' and '.join(['one', 'two', 'three']))
one and two and three

With a clever indexing trick, this can be used to print a substring containing every second word from the original:

>>> print(' '.join(string.split(' ')[::2]))
This a Here 1

Since the join() method accepts any iterable of strings, such as a list, you can use a list comprehension to create a substring from all words with a length equal to 4, for example. For those of you looking for a more challenging exercise, try this for yourself. We’ll also show you a different method to do this later in the article. If you want to know how to write strings to a file in Python, check out this article.
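One possible solution to that exercise, as a sketch rather than the only approach (the names `four_letter_words` and `substring` are illustrative). Note that split(' ') leaves punctuation attached, so 'sentence.' and 'number.' are measured with their full stops:

```python
string = 'This is a sentence. Here is 1 number.'

# Keep only the items whose length is exactly 4, then join them with spaces.
four_letter_words = [word for word in string.split(' ') if len(word) == 4]
substring = ' '.join(four_letter_words)

print(substring)  # This Here
```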

The parse Module

There’s a little-known Python module called parse with great functionality for generating a substring in Python. This module doesn’t come standard with Python and needs to be installed separately. The usual way is to run pip install parse from your terminal.

Here’s how to get a substring using the parse function, which accepts two arguments:

>>> import parse
>>> substring = parse.parse('This is {}. Here is 1 {}.', 'This is a sentence. Here is 1 number.')
>>> substring.fixed
('a sentence', 'number')

Accessing the fixed attribute of substring returns a tuple with the substrings extracted from the second argument at the positions of the curly braces {} in the first argument. For those of you familiar with string formatting, this may look suspiciously familiar. Indeed, the parse module is the opposite of format(). Check this out, which does the opposite of the above code snippet:

>>> print('This is {}. Here is 1 {}.'.format('a sentence', 'number'))
This is a sentence. Here is 1 number.

While we’re talking about the parse module, it’s worth discussing the search function, since searching is a common use case when working with strings. The first argument of search defines what you’re looking for by specifying the search term with curly braces. The second defines where to look.

>>> result = parse.search('is a {}.', 'This is a sentence. Here is 1 number')
>>> result.fixed
('sentence',)

Once again, the fixed attribute holds a tuple with the results. If you want the start and the end indexes of the result, use the spans attribute. Using the parse module to search in a string is nice: it’s pretty robust to how you define what you’re searching for (i.e., the first argument).

Regular Expressions

The last Python module we want to discuss is re, which is short for “regex,” which is itself short for “regular expression.” Regular expressions can be a little intimidating – they involve defining highly specialized and sometimes complicated patterns to search in strings.

You can use regex to extract substrings in Python. The topic is too deep to cover here comprehensively, so we’ll just mention some useful functions and give you a feel for how to define the search patterns. For more information on this module and its functionality, see the documentation.

The findall() function takes two required arguments: pattern and string. Let’s start by extracting all words from the string we used above:

>>> import re
>>> re.findall(r'[a-z]+', 'This is a sentence. Here is 1 number.', flags=re.IGNORECASE)
['This', 'is', 'a', 'sentence', 'Here', 'is', 'number']

The [a-z] pattern matches any lowercase letter, the + means one or more of them in a row (so words of any length match), and the re.IGNORECASE flag makes the match case-insensitive. Compare this to the result we got above by using string.split(), and you’ll notice the full stops are not included, and neither is the digit 1.

Now, let’s extract all numbers from the string:

>>> re.findall(r'\b\d+\b', 'This is a sentence. Here is 1 number.')
['1']

\b matches a word boundary at the start and end of the pattern, \d matches any digit from 0 to 9, and again the + means one or more, so numbers of any length match. As another example, we can find all words with a length of 4 characters with the following:

>>> re.findall(r'\b\w{4}\b', 'This is a sentence. Here is 1 number.')
['This', 'Here']

\w matches any word character (a letter, digit, or underscore), and {4} defines the length of the words to match. To generate a substring, you just need to use string.join() as we did above. This is an alternative approach to the list comprehension we mentioned earlier, which may also be used to generate a substring with all words of length 4.
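Putting findall() and join() together might look like this, as a minimal sketch assuming the same example string (the names `matches` and `substring` are illustrative):

```python
import re

string = 'This is a sentence. Here is 1 number.'

# Find every run of exactly four word characters bounded by word boundaries.
matches = re.findall(r'\b\w{4}\b', string)

# Join the matched words back into a single space-separated substring.
substring = ' '.join(matches)
print(substring)  # This Here
```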

There are other functions in this module worth taking a look at. match() may be used to determine if the pattern matches at the beginning of the string, and search() scans through the string to look for any location where the pattern occurs.
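The difference between the two can be sketched as follows, assuming the same example string (the variable names are illustrative):

```python
import re

string = 'This is a sentence. Here is 1 number.'

# match() only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r'This', string)
not_at_start = re.match(r'Here', string)

# search() scans the whole string and returns the first occurrence.
found = re.search(r'Here', string)

print(at_start.group())  # This
print(not_at_start)      # None
print(found.span())      # (20, 24)
```

Both functions return a match object on success and None on failure, so checking the result before calling group() or span() is a sensible habit.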

Closing Thoughts on Generating Substrings in Python

In this article, we have discussed extracting and printing substrings of strings in Python. Use this as a foundation to explore other topics such as scraping data from a website. Can you define a regex pattern to extract an email address from a string? Or remove punctuation from this paragraph? If you can, you’re on your way to becoming a data wrangler!

If you also work a lot with tabular data, we have an article that shows you how to pretty-print tables in Python. Slowly adding all these skills to your toolbox will turn you into an expert programmer.


Extract Substring From a String in Python


  1. Extract Substring Using String Slicing in Python
  2. Extract Substring Using the slice() Constructor in Python
  3. Extract Substring Using Regular Expression in Python

A string is a sequence of characters. We deal with strings all the time, no matter if we are doing software development or competitive programming. Sometimes, while writing programs, we have to access sub-parts of a string. These sub-parts are more commonly known as substrings. A substring is a contiguous sequence of characters within a string.

In Python, we can easily do this task using string slicing or regular expressions (regex).

Extract Substring Using String Slicing in Python

There are a few ways to do string slicing in Python. Indexing is the most basic and the most commonly used method. Refer to the following code.

myString = "Mississippi"
print(myString[:])        # Line 1
print(myString[4 : ])     # Line 2
print(myString[ : 8])     # Line 3
print(myString[2 : 7])    # Line 4
print(myString[4 : -1])   # Line 5
print(myString[-6 : -1])  # Line 6

Output:

Mississippi
issippi
Mississi
ssiss
issipp
ssipp

In the above code, we add [] brackets at the end of the variable storing the string. We use this notation for indexing. Inside these brackets, we add some integer values that represent indexes.

This is the format for the brackets: [start : stop : step], with the values separated by colons ( : ).

By default, start is 0 (the first index), stop is the length of the string (so the slice runs through the last character), and step is 1. start represents the index where the substring begins, stop represents the index just past where it ends, and step represents the amount by which the index is incremented after each character.

The substring returned runs from the start index up to the stop - 1 index, because the character at the stop index itself is excluded. So, if we wish to retrieve Miss from Mississippi, we should use [0 : 4].

  • [:] -> Returns the whole string.
  • [4 : ] -> Returns a substring starting from index 4 till the last index.
  • [ : 8] -> Returns a substring starting from index 0 till index 7.
  • [2 : 7] -> Returns a substring starting from index 2 till index 6.
  • [4 : -1] -> Returns a substring starting from index 4 till the second-to-last index. -1 refers to the last index in Python, and as the stop value it is excluded.
  • [-6 : -1] -> Returns a substring starting from the sixth index from the end till the second-to-last index.
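The examples above all leave step at its default. A short sketch of the exclusive stop and of the step value, using the same word (the name `my_string` is illustrative):

```python
my_string = "Mississippi"

# stop is exclusive: indexes 0, 1, 2 and 3 are returned.
print(my_string[0:4])   # Miss

# A step of 2 takes every second character, starting at index 0.
print(my_string[::2])   # Msispi

# A step of -1 walks the string backwards, which reverses it.
print(my_string[::-1])  # ippississiM
```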
