pandas.read_parquet#
pandas.read_parquet(path, engine='auto', columns=None, storage_options=None, use_nullable_dtypes=_NoDefault.no_default, dtype_backend=_NoDefault.no_default, filesystem=None, **kwargs) [source]#
Load a parquet object from the file path, returning a DataFrame.
Parameters:
path : str, path object or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files. Both pyarrow and fastparquet support paths to directories as well as file URLs. A directory path could be: file://localhost/path/to/tables or s3://bucket/partition_dir.
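For instance, pointing read_parquet at a single file or at a partitioned directory looks the same from the caller's side. A minimal sketch (the local paths below are hypothetical):

# Minimal sketch; the file and directory paths below are hypothetical.
import pandas as pd

# Read a single parquet file from local disk
df_single = pd.read_parquet("/path/to/table.parquet")

# Read a directory containing multiple partitioned parquet files
df_partitioned = pd.read_parquet("/path/to/tables/")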
engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
Parquet library to use. If 'auto', then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try 'pyarrow', falling back to 'fastparquet' if 'pyarrow' is unavailable.
When using the 'pyarrow' engine and no storage options are provided and a filesystem is implemented by both pyarrow.fs and fsspec (e.g. "s3://"), then the pyarrow.fs filesystem is attempted first. Use the filesystem keyword with an instantiated fsspec filesystem if you wish to use its implementation.
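For example, to force the fsspec implementation for an "s3://" path instead of pyarrow.fs, you can pass an instantiated fsspec filesystem. A sketch assuming s3fs is installed; the bucket and object key are hypothetical:

# Sketch: use an fsspec (s3fs) filesystem explicitly instead of pyarrow.fs.
# Assumes s3fs is installed; the bucket and object key are hypothetical.
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=True)  # anonymous access to a public bucket
df = pd.read_parquet("s3://my-bucket/data.parquet", filesystem=fs)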
columns : list, default None
If not None, only these columns will be read from the file.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://" and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
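As an illustration, fsspec-style options can be passed directly through storage_options; the sketch below assumes anonymous read access to a hypothetical public S3 bucket:

# Sketch: forward fsspec/s3fs options via storage_options.
# The bucket and key are hypothetical; "anon": True requests anonymous access.
import pandas as pd

df = pd.read_parquet(
    "s3://my-public-bucket/data.parquet",
    storage_options={"anon": True},
)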
use_nullable_dtypes : bool, default False
If True, use dtypes that use pd.NA as missing value indicator for the resulting DataFrame (only applicable for the pyarrow engine). As new dtypes that support pd.NA are added in the future, the output with this option will change to use those dtypes. Note: this is an experimental option, and behaviour (e.g. additional supported dtypes) may change without notice.
Deprecated since version 2.0.
dtype_backend : {'numpy_nullable', 'pyarrow'}, default 'numpy_nullable'
Back-end data type applied to the resultant DataFrame (still experimental), as illustrated in the sketch below. Behaviour is as follows:
- "numpy_nullable": returns nullable-dtype-backed DataFrame (default).
- "pyarrow": returns pyarrow-backed nullable ArrowDtype DataFrame.
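A minimal sketch of the effect of each backend, assuming pyarrow is installed (the exact dtype strings may vary by version):

# Sketch: compare the two dtype backends on the same parquet bytes.
from io import BytesIO
import pandas as pd

data = pd.DataFrame({"a": [1, 2, 3]}).to_parquet()

df_numpy = pd.read_parquet(BytesIO(data), dtype_backend="numpy_nullable")
df_arrow = pd.read_parquet(BytesIO(data), dtype_backend="pyarrow")

print(df_numpy.dtypes)  # typically: a    Int64           (nullable, NumPy-backed)
print(df_arrow.dtypes)  # typically: a    int64[pyarrow]  (ArrowDtype)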
filesystem : fsspec or pyarrow filesystem, default None
Filesystem object to use when reading the parquet file. Only implemented for engine="pyarrow".
**kwargs
Any additional kwargs are passed to the engine.
Returns: DataFrame
Examples
Create a parquet object that serializes a DataFrame.
>>> original_df = pd.DataFrame(
...     {"foo": range(5), "bar": range(5, 10)}
... )
>>> original_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
>>> df_parquet_bytes = original_df.to_parquet()
>>> from io import BytesIO
>>> restored_df = pd.read_parquet(BytesIO(df_parquet_bytes))
>>> restored_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
>>> restored_df.equals(original_df)
True
>>> restored_bar = pd.read_parquet(BytesIO(df_parquet_bytes), columns=["bar"])
>>> restored_bar
   bar
0    5
1    6
2    7
3    8
4    9
>>> restored_bar.equals(original_df[['bar']])
True
The function uses kwargs that are passed directly to the engine. In the following example, we use the filters argument of the pyarrow engine to filter the rows of the DataFrame.
Since pyarrow is the default engine, we can omit the engine argument. Note that the filters argument is implemented by the pyarrow engine, which can benefit from multithreading and also potentially be more economical in terms of memory.
>>> sel = [("foo", ">", 2)]
>>> restored_part = pd.read_parquet(BytesIO(df_parquet_bytes), filters=sel)
>>> restored_part
   foo  bar
0    3    8
1    4    9
pd.read_parquet: Read Parquet Files in Pandas
In this tutorial, you’ll learn how to use the Pandas read_parquet function to read parquet files in Pandas. While CSV files may be the ubiquitous file format for data analysts, they have limitations as your data size grows. This is where Apache Parquet files can help!
By the end of this tutorial, you’ll have learned:
- What Apache Parquet files are
- How to read parquet files with Pandas using the pd.read_parquet() function
- How to specify which columns to read in a parquet file
- How to speed up reading parquet files with PyArrow
- How to specify the engine used to read a parquet file in Pandas
What are Apache Parquet Files?
The Apache Parquet format is a column-oriented data file format. This means data is stored by column, rather than by row. The benefits of this include significantly faster access to data, especially when querying only a subset of columns, because only the particular columns you need are read, rather than entire records.
The format is an open-source format that is specifically designed for data storage and retrieval. Because of this, its encoding schema is designed for handling massive amounts of data, especially spread across different files.
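As a quick illustration of that column-level access, you can ask read_parquet to load only the columns you need; the file name below is hypothetical:

# Sketch: read only a subset of columns from a (hypothetical) parquet file.
import pandas as pd

# Only 'price' and 'quantity' are read from disk; the file's other columns
# are never loaded into memory.
df = pd.read_parquet("sales.parquet", columns=["price", "quantity"])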
Understanding the Pandas read_parquet Function
Before diving into using the Pandas read_parquet() function, let’s take a look at the different parameters and default arguments of the function. This will give you a strong understanding of the function’s abilities.
Let’s take a look at the pd.read_parquet() function:
# Understanding the Pandas read_parquet() Function
import pandas as pd

pd.read_parquet(
    path,
    engine='auto',
    columns=None,
    storage_options=None,
    use_nullable_dtypes=False
)
We can see that the function offers 5 parameters, 4 of which have default arguments provided. The table below breaks down the function’s parameters and provides descriptions of how they can be used.