Bytestring to string python

Contents

Converting Bytes to a String in Python
Converting Bytes to a String in Python 3
Converting Bytes to a String with decode()
Converting Bytes to a String with codecs
Converting Bytes to a String with str()
Converting Bytes to a String in Python 2
Converting Bytes to Unicode (Python 2)
Converting Bytes to a String with decode() (Python 2)
Converting Bytes to a String with codecs (Python 2)
Be Mindful of Your Encoding
Python Bytes to String – How to Convert a Bytestring
What is a bytestring?
How to Convert Bytes to a String in Python
Using the decode() method
Using the str() constructor
Using the bytes() constructor
Using the codecs module
Conclusion

Converting Bytes to a String in Python

In this article, we'll look at how to convert bytes to a string in Python. By the end, you'll have a clear idea of what these types are and how to handle data with them effectively.

The task differs depending on the version of Python you're using. Although Python 2 has reached its end of life, many projects still use it, so we'll cover both the Python 2 and Python 3 approaches.

Converting Bytes to a String in Python 3

Starting with Python 3, the old ASCII way of doing things had to go, and Python became completely Unicode.

This means that we lost the explicit unicode type, u"string": every string is a u"string"!

To differentiate these strings from good old bytestrings, we were introduced to a new specifier for them: b"string".

This specifier was added in Python 2.6, but it served no real purpose other than preparing for Python 3, as all strings were bytestrings in 2.6.

Bytestrings in Python 3 are officially called bytes, an immutable sequence of integers in the range 0 <= x < 256.
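To make the "immutable sequence of integers" description concrete, here is a minimal sketch. The variable names are illustrative only and do not come from the original article.

```python
# A bytes object is an immutable sequence of integers in range(256).
data = b"Hi \xf0\x9f\x8d\x95"

first = data[0]        # indexing yields an int, not a 1-character string
as_ints = list(data)   # iterating yields ints as well
chunk = data[0:2]      # slicing yields another bytes object

print(first)    # 72
print(as_ints)  # [72, 105, 32, 240, 159, 141, 149]
print(chunk)    # b'Hi'

# Immutability: item assignment is rejected.
try:
    data[0] = 104
    mutated = True
except TypeError as exc:
    mutated = False
    print("TypeError:", exc)
```

The last four integers are the UTF-8 encoding of the pizza emoji, which is why decoding (rather than indexing) is needed to get text back.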

Converting Bytes to a String with decode()

Let's take a look at how we can convert bytes to a string using the built-in decode() method of the bytes class:

    b = b"Lets grab a \xf0\x9f\x8d\x95!"
    # Let's check the type
    print(type(b))
    # <class 'bytes'>

    # Now, let's decode/convert them into a string
    s = b.decode('UTF-8')
    print(s)
    # Lets grab a 🍕!

By passing in the encoding format, we've decoded the bytes object into a string and printed it.

Converting Bytes to a String with codecs

Alternatively, we can use the built-in codecs module for this purpose:

    import codecs

    b = b'Lets grab a \xf0\x9f\x8d\x95!'
    print(codecs.decode(b, 'UTF-8'))
    # Lets grab a 🍕!

You don't strictly need to pass the encoding parameter, because codecs.decode() defaults to UTF-8, but passing it explicitly is recommended:

    print(codecs.decode(b))
    # Lets grab a 🍕!

Converting Bytes to a String with str()

Finally, you can use the str() function, which accepts various values and converts them into strings:

    b = b'Lets grab a \xf0\x9f\x8d\x95!'
    print(str(b, 'UTF-8'))
    # Lets grab a 🍕!

Make sure to pass the encoding argument to str(), though. Without it, str() does not decode anything and returns the printable representation of the bytes object instead:

    print(str(b))
    # b'Lets grab a \xf0\x9f\x8d\x95!'

This brings us to encodings again. If you specify the wrong encoding, the best case is that your program crashes because it can't decode the data; the worse case is that decoding succeeds silently and produces garbage. For example, if we decoded the 18-byte version of our message (with the apostrophe) as UTF-16 using str(), we'd be greeted with:

    b = b"Let's grab a \xf0\x9f\x8d\x95!"
    print(repr(str(b, 'UTF-16')))
    # '敌❴\u2073牧扡愠\uf020趟↕'  (on a little-endian machine)

The 17-byte literal used in the earlier snippets would raise a UnicodeDecodeError instead, because UTF-16 needs an even number of bytes.

This matters all the more because Python 3 uses Unicode everywhere, so if you work with files or data sources that use an obscure encoding, pay extra attention to it.
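As a rough illustration of paying attention to encodings when working with files, the sketch below writes UTF-8 bytes to a temporary file and reads them back in two ways. The temporary file is only a stand-in for a real data source, and all names here are assumptions made for the example.

```python
import os
import tempfile

text = "Let's grab a 🍕!"

# Write the text as UTF-8 bytes to a temporary file (stand-in for real data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    # Option 1: read raw bytes, then decode explicitly.
    with open(path, "rb") as f:
        raw = f.read()
    decoded = raw.decode("utf-8")

    # Option 2: let open() decode, naming the encoding explicitly
    # instead of relying on the platform default.
    with open(path, "r", encoding="utf-8") as f:
        read_as_text = f.read()
finally:
    os.remove(path)

print(type(raw).__name__, len(raw))     # bytes 18
print(decoded == read_as_text == text)  # True
```

Either way, the encoding is stated in the code rather than guessed, which is the point the paragraph above makes.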

Converting Bytes to a String in Python 2

In Python 2, a bunch of bytes and a string are practically the same thing: strings are objects consisting of 1-byte-long characters, meaning that each character can store 256 values. That's why they are sometimes called bytestrings.

This is great when working with byte data: we just load it into a variable and we are ready to print:

    s = "Hello world!"
    print(s)
    # Hello world!
    print(len(s))
    # 12

However, using Unicode characters in bytestrings changes this behavior a bit (the output below assumes the text was entered as UTF-8):

    s = "Let's grab a 🍕!"
    print(repr(s))
    # "Let's grab a \xf0\x9f\x8d\x95!"
    # Where has the pizza gone to?
    print(len(s))
    # 18
    # Shouldn't that be 15?

Converting Bytes to Unicode (Python 2)

Here we'll have to use Python 2's unicode type, which is assumed and used automatically in Python 3. It stores strings as a sequence of code points rather than bytes.

The \xf0\x9f\x8d\x95 above represents bytes as two-digit hex numbers, because Python doesn't know how to represent them as ASCII characters. With a unicode string, the same text looks like this:

    >>> u = u"Let's grab a 🍕!"
    >>> u
    u"Let's grab a \U0001f355!"
    >>> print(u)
    Let's grab a 🍕!
    # Yum.
    >>> len(u)
    15
    # (16 on narrow builds, where the pizza is stored as a surrogate pair)

As you can see above, the Unicode string contains \U0001f355, an escaped Unicode character that our terminal prints as a slice of pizza! Setting this up was as simple as putting the u specifier before the string literal.

So, how do I switch between the two?

You can get a Unicode string by decoding your bytestring. This can be done by constructing a unicode object, passing the bytestring and a string containing the encoding name as arguments, or by calling .decode(encoding) on the bytestring.

Converting Bytes to a String with decode() (Python 2)

Both the unicode() constructor and the decode() method are shown below. The codecs.decode(s, encoding) function from the codecs module is another option.

    >>> s = "Let's grab a \xf0\x9f\x8d\x95!"
    >>> u = unicode(s, 'UTF-8')
    >>> print(u)
    Let's grab a 🍕!
    >>> print(s.decode('UTF-8'))
    Let's grab a 🍕!

Converting Bytes to a String with codecs (Python 2)

Or, using the codecs module:

    >>> import codecs
    >>> print(codecs.decode(s, 'UTF-8'))
    Let's grab a 🍕!

Be Mindful of Your Encoding

A word of caution here: bytes can be interpreted differently under different encodings. With around 80 different encodings available out of the box, it might not be easy to know whether you've got the right one!

    >>> s = '\xf8\xe7'
    # This one will let us know we used the wrong encoding
    >>> s.decode('UTF-8')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xf8 in position 0: invalid start byte
    # These two overlap, and this is a valid string in both
    >>> print(s.decode('latin1'))
    øç
    >>> print(s.decode('iso8859_5'))
    јч

The original message was either øç or јч, and both appear to be valid conversions.
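The same ambiguity exists in Python 3. The following sketch repeats the experiment with a bytes literal; the variable names are illustrative, and ascii() is used for printing so the output does not depend on the terminal's encoding.

```python
raw = b"\xf8\xe7"

# UTF-8 rejects these bytes outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("UTF-8 failed:", exc.reason)  # UTF-8 failed: invalid start byte

# Two single-byte encodings both "succeed", with different results.
as_latin1 = raw.decode("latin1")       # øç
as_cyrillic = raw.decode("iso8859_5")  # јч

print(ascii(as_latin1))    # '\xf8\xe7'
print(ascii(as_cyrillic))  # '\u0458\u0447'
```

Nothing in the bytes themselves says which result is right; that information has to come from wherever the data originated.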


Python Bytes to String – How to Convert a Bytestring

Shittu Olumide

In this article, you will learn how to convert a bytestring. I know the word bytestring might sound technical and difficult to understand. But trust me – we will break the process down and understand everything about bytestrings before writing the Python code that converts bytes to a string.

So let’s start by defining a bytestring.

What is a bytestring?

A bytestring is a sequence of bytes, a fundamental data type in computing. Bytestrings are typically displayed as a sequence of characters, with each character (or escape such as \xf0) standing for one byte of data.

Bytes are often used to represent information that is not character-based, such as images, audio, video, or other types of binary data.

In Python, a bytestring is a bytes object. When it holds text, that text has been encoded with a character encoding such as UTF-8, ASCII, or Latin-1. You can create a bytestring with the bytes() or bytearray() constructors (bytearray() gives a mutable variant), and convert between strings and bytes with the str.encode() and bytes.decode() methods.

Note that in Python 3.x, bytestrings and strings are distinct data types, and cannot be used interchangeably without encoding or decoding.

This is because a Python 3 string (str) is a sequence of Unicode code points, whereas in Python 2 the default string type was a sequence of bytes and implicit conversions assumed ASCII. So when working with bytestrings in Python 3.x, it's important to be aware of the encoding used and to properly encode and decode data as needed.
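A small sketch of what "distinct data types" means in practice, assuming Python 3 (the names are made up for the example): mixing the two raises TypeError until one side is converted explicitly.

```python
text = "café"
data = text.encode("utf-8")  # str -> bytes

print(len(text), len(data))  # 4 5  (é takes two bytes in UTF-8)

# Mixing the two types raises TypeError instead of converting silently.
try:
    combined = data + text
    mixed_ok = True
except TypeError:
    mixed_ok = False
    combined = data + text.encode("utf-8")  # convert explicitly, then join

round_trip = data.decode("utf-8")  # bytes -> str
print(round_trip == text)  # True
```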

How to Convert Bytes to a String in Python

Now that we have a basic understanding of what a bytestring is, let's take a look at how we can convert bytes to a string using Python methods, constructors, and modules.

Using the decode() method

decode() is a method that you can use to convert bytes into a string. It is commonly used when working with text data that is encoded in a specific character encoding, such as UTF-8 or ASCII. It simply works by taking an encoded byte string as input and returning a decoded string.

decoded_string = byte_string.decode(encoding)

Where byte_string is the input byte string that we want to decode and encoding is the character encoding used by the byte string.

Here is some example code that demonstrates how to use the decode() method to convert a byte string to a string:

    # Define a byte string
    byte_string = b"hello world"

    # Convert the byte string to a string using the decode() method
    decoded_string = byte_string.decode("utf-8")

    # Print the decoded string
    print(decoded_string)
    # hello world

In this example, we define a byte string b"hello world" and convert it to a string using the decode() method with the UTF-8 character encoding. The resulting decoded string is "hello world", which is then printed to the console.

Note that the decode() method also takes an optional errors parameter (for example 'strict', 'ignore', or 'replace') that controls how decoding errors are handled. The final flag, which tells a decoder whether to expect more input, belongs to the incremental decoders in the codecs module rather than to bytes.decode().
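To show what the errors parameter does, here is a minimal sketch that decodes bytes which are not valid UTF-8. The input value and the variable names are assumptions chosen for the example.

```python
bad = b"caf\xe9 ok"  # Latin-1 bytes, not valid UTF-8

# Try three lenient error handlers.
results = {}
for mode in ("replace", "ignore", "backslashreplace"):
    results[mode] = bad.decode("utf-8", errors=mode)
    print(mode, ascii(results[mode]))

# errors="strict" is the default and raises instead.
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

'replace' substitutes U+FFFD for the bad byte, 'ignore' drops it, and 'backslashreplace' keeps it visible as an escape sequence. The lenient modes lose or alter data, so 'strict' is usually the safer default unless the loss is acceptable.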

Using the str() constructor

You can use the str() constructor in Python to convert a byte string (bytes object) to a string object. This is useful when we are working with data that has been encoded in a byte string format, such as when reading data from a file or receiving data over a network socket.

The str() constructor takes the byte string as its first argument, plus the encoding to decode it with. The encoding matters: calling str() on a bytes object without an encoding (or errors) argument does not decode anything and instead returns the printable representation, such as "b'Hello, world!'".

    # Define a byte string
    byte_string = b"Hello, world!"

    # Convert the byte string to a string using the str() constructor
    string = str(byte_string, encoding='utf-8')

    # Print the string
    print(string)
    # Hello, world!

In this example, we define a byte string b"Hello, world!" and use the str() constructor to convert it to a string object. We specify the encoding format as utf-8 using the encoding parameter. Finally, we print the resulting string to the console.

Using the bytes() constructor

The bytes() constructor goes in the opposite direction: it is a built-in that creates a new bytes object. It accepts an iterable of integers (each in the range 0 to 255), or a string together with an encoding, and returns the corresponding bytes. This is useful when we are working with binary data, or when converting between different types of data that use bytes as their underlying representation. The example below encodes a string with bytes() and then converts it back with decode().

    # Define a string
    string = "Hello, world!"

    # Convert the string to a bytes object
    bytes_object = bytes(string, 'utf-8')

    # Print the bytes object
    print(bytes_object)
    # b'Hello, world!'

    # Convert the bytes object back to a string
    decoded_string = bytes_object.decode('utf-8')

    # Print the decoded string
    print(decoded_string)
    # Hello, world!

In this example, we start by defining a string variable string . We then use the bytes() constructor to convert the string to a bytes object, passing in the string and the encoding ( utf-8 ) as arguments. We print the resulting bytes object to the console.

Next, we use the decode() method to convert the bytes object back to a string, passing in the same encoding ( utf-8 ) as before. We print the decoded string to the console as well.
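For comparison, the sketch below shows the other input forms that bytes() accepts, using made-up variable names. It is meant as an illustration of the constructor, not as part of the original example.

```python
from_ints = bytes([72, 105, 33])   # iterable of ints in range(256)
from_text = bytes("Hi!", "utf-8")  # str plus encoding, same as "Hi!".encode("utf-8")
zeros = bytes(3)                   # an int gives that many zero bytes

print(from_ints)                  # b'Hi!'
print(from_ints == from_text)     # True
print(zeros)                      # b'\x00\x00\x00'
print(from_ints.decode("ascii"))  # Hi!
```

Whichever way the bytes object was built, decode() is still the step that turns it back into a string.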

Using the codecs module

The codecs module in Python provides a way to convert data between different encodings, such as between byte strings and Unicode strings. It contains a number of classes and functions that you can use to perform various encoding and decoding operations.

To convert Python bytes to a string, we can use the decode() function provided by the codecs module. This function takes two arguments: the first is the byte string that we want to decode, and the second is the encoding that we want to use.

    import codecs

    # byte string to be converted
    b_string = b'\xc3\xa9\xc3\xa0\xc3\xb4'

    # decoding the byte string to unicode string
    u_string = codecs.decode(b_string, 'utf-8')

    print(u_string)
    # éàô

In this example, we have a byte string b_string which contains some non-ASCII characters. We use the codecs.decode() function to convert this byte string to a Unicode string.

The first argument to this function is the byte string to be decoded, and the second argument is the encoding used in the byte string (in this case, it is utf-8). The resulting Unicode string is stored in u_string.

To convert a Unicode string to a byte string using the codecs module, we use the encode() function. Here is an example:

    import codecs

    # unicode string to be converted
    u_string = 'This is a test.'

    # encoding the unicode string to byte string
    b_string = codecs.encode(u_string, 'utf-8')

    print(b_string)
    # b'This is a test.'

In this example, we have a Unicode string u_string. We use the codecs.encode() function to convert this Unicode string to a byte string. The first argument to this function is the Unicode string to be encoded, and the second argument is the encoding to use for the byte string (in this case, it is utf-8). The resulting byte string is stored in b_string.
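The codecs module also provides incremental decoders, which help when bytes arrive in chunks (for example from a socket) and a multi-byte character may be split across two chunks. The following is a minimal sketch under that assumption; the chunk contents and variable names are invented for illustration.

```python
import codecs

# "é" is two bytes in UTF-8 (0xC3 0xA9); split it across two chunks.
chunks = [b"caf\xc3", b"\xa9!"]

decoder = codecs.getincrementaldecoder("utf-8")()
pieces = []
for i, chunk in enumerate(chunks):
    is_last = i == len(chunks) - 1
    # final=False lets the decoder hold back an incomplete character.
    pieces.append(decoder.decode(chunk, final=is_last))

text = "".join(pieces)
print([ascii(p) for p in pieces])  # ["'caf'", "'\\xe9!'"]
```

Calling chunks[0].decode("utf-8") directly would raise UnicodeDecodeError here, because the first chunk ends in the middle of a character.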

Conclusion

Understanding bytestrings and string conversion is important because it is a fundamental aspect of working with text data in any programming language.

In Python, this is particularly relevant due to the increasing popularity of data science and natural language processing applications, which often involve working with large amounts of text data.
