Python encoding bytes to str

Python How to Convert Bytes to String (5 Approaches)

To convert bytes into a string in Python, use the bytes.decode() method.

    name_byte = b'Alice'
    name_str = name_byte.decode()
    print(name_str)

However, depending on the context and your needs, there are other ways to convert bytes to strings.

In this guide, you will learn how to convert bytes to a string in 5 different ways, each suited to a different situation.

Here’s a short overview of the byte-to-string conversion methods:

Method | Example
1. The decode() method of a bytes object | byte_string.decode('UTF-8')
2. The built-in str() function | str(byte_string, 'UTF-8')
3. The codecs decode() function | codecs.decode(byte_string)
4. The pandas str.decode() method | df['column'].str.decode("utf-8")
5. The join() method with the map() function | "".join(map(chr, byte_str))

Bytes vs Strings in Python

You may be looking to convert bytes to strings without being quite sure what bytes are. Before jumping into the conversions, let’s take a quick look at what bytes are in the first place.

Why Bytes?

A computer doesn’t understand the notion of “text” or “number” as is. This is because computers operate on bits, that is, 0s and 1s.

Data is stored on a computer in groups of bits known as bytes. On virtually all modern systems a byte has 8 bits, although some historical architectures used other sizes.
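To make the link between bits and bytes concrete, here is a minimal sketch (the variable names are illustrative, not part of the original guide) that prints the 8-bit pattern behind two byte values:

```python
# 65 and 108 are the byte values of the letters "A" and "l".
byte_values = [65, 108]

# format(value, '08b') renders an integer as 8 binary digits, zero-padded.
bit_patterns = [format(value, '08b') for value in byte_values]

print(bit_patterns)  # ['01000001', '01101100']
```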

Byte Strings in Python

In Python, a byte string is a sequence of bytes: raw values that the computer can store and transmit directly, but that are not meant to be read by humans.

A string is a sequence of characters and is something we humans can understand but cannot directly store in a computer.

This is why any string needs to be converted to a byte string before the computer can use it.

In Python, a bytes object is an immutable sequence of byte values (integers from 0 to 255), and it is often the encoded form of a string. A bytes literal is prefixed with the letter b.
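As a small illustration of the string-to-bytes conversion described above (a sketch with made-up variable names, not code from the original guide), str.encode() produces the bytes for a string. For plain ASCII text the bytes look like the text itself, while other characters can take more than one byte in UTF-8:

```python
ascii_text = 'Alice'
accented_text = '\u00e9'  # the single character "é"

ascii_bytes = ascii_text.encode('utf-8')
accented_bytes = accented_text.encode('utf-8')

print(ascii_bytes)     # b'Alice'
print(accented_bytes)  # b'\xc3\xa9'

# One character, but two bytes once encoded as UTF-8.
print(len(accented_text), len(accented_bytes))  # 1 2
```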

For example, take a look at these two variables:

    name1 = 'Alice'
    name2 = b'Alice'

The first is a regular string and the second is a bytes object. You can verify this by printing out the data types of these variables:

    name1 = 'Alice'
    name2 = b'Alice'
    print(type(name1))  # <class 'str'>
    print(type(name2))  # <class 'bytes'>

As mentioned earlier, a byte string is supposed to be hard for humans to read. The code above doesn’t make this obvious, because you can read b'Alice' just fine.

Byte String vs String in Python

To see the main difference between the byte string and a string, let’s print the words character by character.

First, let’s do the name1 variable:

    name1 = 'Alice'
    name2 = b'Alice'
    for c in name1:
        print(c)

Now, let’s print each byte in the name2 bytes object:

    name1 = 'Alice'
    name2 = b'Alice'
    for c in name2:
        print(c)
    # prints 65, 108, 105, 99 and 101, one number per line

Here you can see there is no way for you to tell what those numbers mean. They are the byte values of the characters in the string, which is something a computer can work with directly.

To make one more thing clear, let’s see what happens if we print the bytes object name2 as-is:

    name1 = 'Alice'
    name2 = b'Alice'
    print(name2)  # b'Alice'

To your surprise, it clearly says “Alice”. This isn’t too hard to read, is it?

The byte string prints out as readable text because what you see is actually a string representation of the bytes object.

Python does this for the developer’s convenience.

If there was no special string representation for a bytes object, printing bytes would be nonsense.
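A short sketch of what that string representation does with bytes that are not printable ASCII (the values below are chosen only for illustration): printable bytes are shown as characters, everything else as \x escapes.

```python
# 65 and 108 are printable ASCII ("A" and "l"); 0 and 255 are not.
raw = bytes([65, 108, 0, 255])

representation = repr(raw)
print(representation)  # b'Al\x00\xff'

# The underlying data is still just a sequence of integers.
print(list(raw))  # [65, 108, 0, 255]
```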

Anyway, now you understand what a bytes object is in Python and how it differs from the str object.

Now, let’s see how to convert between bytes and string.

1. The decode() Method

Given a bytes object, you can use its decode() method to convert the bytes to a string.

You can also pass the encoding to this method as an argument. If you omit it, UTF-8 is used.

For example, let’s use the UTF-8 encoding for converting bytes to a string:

    byte_string = b"Do you want a slice of \xf0\x9f\x8d\x95?"
    string = byte_string.decode('UTF-8')
    print(string)  # Do you want a slice of 🍕?
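Besides the encoding, decode() accepts an optional errors argument that controls what happens when the bytes are not valid in the chosen encoding. The sketch below is an illustration with made-up data rather than part of the original guide; it shows the 'replace' and 'ignore' handlers:

```python
# "café" encoded as Latin-1: the final byte 0xe9 is not valid UTF-8 on its own.
data = b'caf\xe9'

# 'replace' substitutes the Unicode replacement character U+FFFD.
replaced = data.decode('utf-8', errors='replace')

# 'ignore' silently drops the bytes that cannot be decoded.
ignored = data.decode('utf-8', errors='ignore')

print(repr(replaced))  # 'caf\ufffd'
print(repr(ignored))   # 'caf'
```

Both handlers lose information, so they are best kept for cases where an approximate result is acceptable.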

2. The str() Function

Another approach to convert bytes to string is by using the built-in str() function.

When you pass an encoding as the second argument, str() does the exact same thing as the decode() method in the previous example.

    byte_string = b"Do you want a slice of \xf0\x9f\x8d\x95?"
    string = str(byte_string, 'UTF-8')
    print(string)

Perhaps the only downside to this approach is in the code readability.

If you compare these two lines:

    name_str = str(byte_string, 'UTF-8')
    name_str = byte_string.decode('UTF-8')

You can see the latter is more explicit about decoding the bytes to a string.
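One pitfall worth illustrating here (this example is an addition, not from the original guide): if you call str() on a bytes object without an encoding, Python does not decode anything. It returns the printable representation of the bytes object, including the b prefix and the quotes.

```python
byte_string = b'Alice'

# No encoding given: you get the repr of the bytes object as a string.
without_encoding = str(byte_string)

# Encoding given: the bytes are actually decoded.
with_encoding = str(byte_string, 'utf-8')

print(without_encoding)  # b'Alice'
print(with_encoding)     # Alice
```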

3. Codecs decode() Function

Python also has a built-in codecs module for text decoding and encoding.

This module also has its own decode() function, which you can use to convert bytes to strings. Its counterpart, codecs.encode(), goes the other way. If you do not pass an encoding, UTF-8 is used.

    import codecs

    byte_string = b"Do you want a slice of \xf0\x9f\x8d\x95?"
    string = codecs.decode(byte_string)
    print(string)
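One situation where the codecs module may be more convenient than bytes.decode() is when the data arrives in chunks and a multi-byte character is split across two of them. The following is a minimal sketch under that assumption (the chunk boundaries are invented for illustration), using codecs.getincrementaldecoder():

```python
import codecs

# The 4-byte pizza emoji is split across two chunks on purpose.
chunks = [b'slice of \xf0\x9f', b'\x8d\x95?']

# An incremental decoder buffers incomplete sequences between calls.
decoder = codecs.getincrementaldecoder('utf-8')()

parts = [decoder.decode(chunk) for chunk in chunks]
# final=True flushes the decoder and raises if bytes are left dangling.
parts.append(decoder.decode(b'', final=True))

text = ''.join(parts)
print(text)  # slice of 🍕?
```

Decoding each chunk separately with bytes.decode() would raise a UnicodeDecodeError on the first chunk, because it ends in the middle of a character.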

4. Pandas decode() Function

If you are working with pandas and you have a data frame that consists of bytes, you can easily convert them to strings by calling the str.decode() function on a column.

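    import pandas as pd

    # Sample data reconstructed from the output shown below
    data_bytes = {'column': [b'Alice', b'Bob', b'Charlie']}
    df = pd.DataFrame(data=data_bytes)
    data_strings = df['column'].str.decode("utf-8")
    print(data_strings)

Output:

    0      Alice
    1        Bob
    2    Charlie
    Name: column, dtype: object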

5. map() Function: Convert a Byte List to String

In Python, a string is a group of characters.

Each character is associated with a Unicode code point, which is an integer.

Thus, you can convert an integer to a character in Python.

To do this, you can call the built-in chr() function on an integer.

Given a list of integers, you can use the map() function to map each integer to a character.

Here is how it looks in code:

    byte_data = [65, 108, 105, 99, 101]
    strings = "".join(map(chr, byte_data))
    print(strings)  # Alice

This piece of code:

  1. Converts the integers to the corresponding characters.
  2. Returns an iterator over those characters.
  3. Merges the characters into a single string.
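Note that chr() treats every integer as a separate code point, so this approach only gives the expected text when each byte stands for exactly one character (as in ASCII). The sketch below, added for illustration, compares it with building a bytes object and decoding it for a multi-byte UTF-8 character:

```python
# The four UTF-8 bytes of the pizza emoji, as integers.
byte_data = [240, 159, 141, 149]

# chr() on each byte yields four unrelated characters.
via_chr = "".join(map(chr, byte_data))

# bytes(...) followed by decode() interprets the bytes as UTF-8.
via_decode = bytes(byte_data).decode('utf-8')

print(len(via_chr))     # 4
print(len(via_decode))  # 1
print(via_decode)       # 🍕
```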

To learn more about the map() function in Python, feel free to read this article.

Be Careful with the Encoding

There are dozens of byte-to-string encodings out there.

In this guide, we only used the UTF-8 encoding, which is the most popular encoding type.

UTF-8 is also the default encoding for encode() and decode() in Python 3. However, UTF-8 is not always the correct encoding. For example, decoding the following bytes as UTF-8 fails:

    s = b"test \xe7\xf8\xe9"
    s.decode('UTF-8')

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 5: invalid continuation byte

This error means the bytes are not a valid UTF-8 sequence: the byte 0xe7 announces a multi-byte character, but the bytes that follow it are not valid continuation bytes.

In other words, you should be using a different encoding.

You can use a module like chardet to detect the character encodings. (Notice that this module is not maintained, but most of the info you learn about it is still applicable.)

However, no approach is 100% foolproof. The module only gives you its best guess about the encoding, together with a confidence score.

Anyway, let’s say the above byte string could plausibly have been encoded with either latin1 or iso8859_5.

Now let’s make the conversion:

    s = b"test \xe7\xf8\xe9"
    print(s.decode('latin1'))     # test çøé
    print(s.decode('iso8859_5'))  # test чјщ

This time there is no error. The decoding works with both encodings, but each one produces a different result.

So be careful with the encodings!

If you see an error when doing a conversion, the first thing you need to do is to figure out the encoding used. Then you should use that particular encoding to encode/decode your values to get it right.
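When you have a short list of plausible encodings, one possible approach is to try them in order and keep the first one that decodes without an error. The helper below is a hypothetical sketch (decode_with_fallback and its default candidate list are assumptions made for this example), not a standard library function:

```python
def decode_with_fallback(data, encodings=('utf-8', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError('none of the candidate encodings could decode the data')


text, used = decode_with_fallback(b"test \xe7\xf8\xe9")
print(text, used)  # test çøé latin-1
```

Keep in mind that a decode that succeeds is not proof that the encoding is right. Latin-1 in particular accepts every possible byte, so it never raises and should come last in the list.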

Conclusion

Today you learned how to convert bytes to strings in Python.

To recap, there are several ways to convert bytes to strings in Python.

  • To convert a byte string to a string, use the bytes.decode() method.
  • If you have a list of integer byte values, call the chr() function on each value using the map() function (or a for loop).
  • If you have a pandas dataframe with bytes, call the .str.decode() method on the column with bytes.

By default, encode() and decode() use UTF-8 in Python 3.

However, this is not always applicable. Trying to decode non-UTF-8 bytes as UTF-8 produces an error. In this situation, you should determine the right character encoding before encoding or decoding. You can use a module like chardet to do this.

Converting Between Bytes and Strings

You cannot avoid working with bytes. For example, when you work with the network or the file system, the result is most often returned as bytes.

Accordingly, you need to know how to convert bytes to a string and back. This is what an encoding is for.

You can think of an encoding as an encryption key that specifies:

  • how to “encrypt” a string into bytes (str -> bytes). This uses the encode method (similar to encrypt)
  • how to “decrypt” bytes into a string (bytes -> str). This uses the decode method (similar to decrypt)

This analogy makes it clear that the string-to-bytes and bytes-to-string conversions must use the same encoding.
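The following sketch illustrates that point under simple assumptions (the choice of cp1251 as the mismatched encoding is only an example): decoding with the encoding that was used for encoding restores the text, while a different encoding gives garbled output.

```python
text = 'привет'
encoded = text.encode('utf-8')

# Same encoding in both directions: the round trip is lossless.
same = encoded.decode('utf-8')

# Different encoding: no error here, but the result is not the original text.
wrong = encoded.decode('cp1251')

print(same == text)   # True
print(wrong == text)  # False
```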

encode, decode

To convert a string to bytes, use the encode method:

    In [1]: hi = 'привет'

    In [2]: hi.encode('utf-8')
    Out[2]: b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

    In [3]: hi_bytes = hi.encode('utf-8')

To get a string from bytes, use the decode method:

    In [4]: hi_bytes
    Out[4]: b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

    In [5]: hi_bytes.decode('utf-8')
    Out[5]: 'привет'

str.encode, bytes.decode

The encode method can also be called on the str class itself (like the other string methods):

    In [6]: hi
    Out[6]: 'привет'

    In [7]: str.encode(hi, encoding='utf-8')
    Out[7]: b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

Likewise, the decode method can be called on the bytes class (like its other methods):

    In [8]: hi_bytes
    Out[8]: b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

    In [9]: bytes.decode(hi_bytes, encoding='utf-8')
    Out[9]: 'привет'

In these methods, the encoding can be passed as a keyword argument (as in the examples above) or as a positional argument:

    In [10]: hi_bytes
    Out[10]: b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

    In [11]: bytes.decode(hi_bytes, 'utf-8')
    Out[11]: 'привет'

How to Work with Unicode and Bytes

There is a very simple rule that helps you avoid at least some of the problems. It is called the “Unicode sandwich”:

  • bytes that the program reads should be converted to Unicode (a string) as early as possible
  • inside the program, work with Unicode
  • Unicode should be converted to bytes as late as possible, right before the data is sent out
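The three layers of the sandwich can be sketched as follows. This is a minimal illustration under stated assumptions: the shout() function is hypothetical, and io.BytesIO stands in for a real file or socket.

```python
import io


def shout(raw: bytes, encoding: str = 'utf-8') -> bytes:
    # Bottom slice: decode incoming bytes as early as possible.
    text = raw.decode(encoding)
    # Filling: all processing happens on str objects.
    result = text.upper()
    # Top slice: encode back to bytes as late as possible.
    return result.encode(encoding)


# io.BytesIO simulates a source that delivers bytes.
incoming = io.BytesIO('привет'.encode('utf-8'))
outgoing = shout(incoming.read())

print(outgoing.decode('utf-8'))  # ПРИВЕТ
```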
