pandas.read_parquet#
pandas.read_parquet(path, engine='auto', columns=None, storage_options=None, use_nullable_dtypes=_NoDefault.no_default, dtype_backend=_NoDefault.no_default, filesystem=None, **kwargs) [source]#
Load a parquet object from the file path, returning a DataFrame.
Parameters:
path : str, path object or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files. Both pyarrow and fastparquet support paths to directories as well as file URLs. A directory path could be: file://localhost/path/to/tables or s3://bucket/partition_dir.
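For instance, pointing read_parquet at a single file or at a partitioned directory looks the same from the caller's side. A minimal sketch (the local paths below are hypothetical):

# Minimal sketch; the file and directory paths below are hypothetical.
import pandas as pd

# Read a single parquet file from local disk
df_single = pd.read_parquet("/path/to/table.parquet")

# Read a directory containing multiple partitioned parquet files
df_partitioned = pd.read_parquet("/path/to/tables/")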
engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
Parquet library to use. If 'auto', then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try 'pyarrow', falling back to 'fastparquet' if 'pyarrow' is unavailable.
When using the 'pyarrow' engine and no storage options are provided and a filesystem is implemented by both pyarrow.fs and fsspec (e.g. "s3://"), then the pyarrow.fs filesystem is attempted first. Use the filesystem keyword with an instantiated fsspec filesystem if you wish to use its implementation.
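For example, to force the fsspec implementation for an "s3://" path instead of pyarrow.fs, you can pass an instantiated fsspec filesystem. A sketch assuming s3fs is installed; the bucket and object key are hypothetical:

# Sketch: use an fsspec (s3fs) filesystem explicitly instead of pyarrow.fs.
# Assumes s3fs is installed; the bucket and object key are hypothetical.
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=True)  # anonymous access to a public bucket
df = pd.read_parquet("s3://my-bucket/data.parquet", filesystem=fs)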
columns : list, default None
If not None, only these columns will be read from the file.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://" and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
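As an illustration, fsspec-style options can be passed directly through storage_options; the sketch below assumes anonymous read access to a hypothetical public S3 bucket:

# Sketch: forward fsspec/s3fs options via storage_options.
# The bucket and key are hypothetical; "anon": True requests anonymous access.
import pandas as pd

df = pd.read_parquet(
    "s3://my-public-bucket/data.parquet",
    storage_options={"anon": True},
)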
use_nullable_dtypes : bool, default False
If True, use dtypes that use pd.NA as missing value indicator for the resulting DataFrame (only applicable for the pyarrow engine). As new dtypes that support pd.NA are added in the future, the output with this option will change to use those dtypes. Note: this is an experimental option, and behaviour (e.g. additional supported dtypes) may change without notice.
Deprecated since version 2.0.
dtype_backend : {'numpy_nullable', 'pyarrow'}, default 'numpy_nullable'
Back-end data type applied to the resultant DataFrame (still experimental), as illustrated in the sketch below. Behaviour is as follows:
- "numpy_nullable": returns nullable-dtype-backed DataFrame (default).
- "pyarrow": returns pyarrow-backed nullable ArrowDtype DataFrame.
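A minimal sketch of the effect of each backend, assuming pyarrow is installed (the exact dtype strings may vary by version):

# Sketch: compare the two dtype backends on the same parquet bytes.
from io import BytesIO
import pandas as pd

data = pd.DataFrame({"a": [1, 2, 3]}).to_parquet()

df_numpy = pd.read_parquet(BytesIO(data), dtype_backend="numpy_nullable")
df_arrow = pd.read_parquet(BytesIO(data), dtype_backend="pyarrow")

print(df_numpy.dtypes)  # typically: a    Int64           (nullable, NumPy-backed)
print(df_arrow.dtypes)  # typically: a    int64[pyarrow]  (ArrowDtype)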
filesystem : fsspec or pyarrow filesystem, default None
Filesystem object to use when reading the parquet file. Only implemented for engine="pyarrow".
**kwargs
Any additional kwargs are passed to the engine.
Returns: DataFrame
Examples
Create a parquet object that serializes a DataFrame.
>>> original_df = pd.DataFrame(
...     {"foo": range(5), "bar": range(5, 10)}
... )
>>> original_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
>>> df_parquet_bytes = original_df.to_parquet()
>>> from io import BytesIO
>>> restored_df = pd.read_parquet(BytesIO(df_parquet_bytes))
>>> restored_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
>>> restored_df.equals(original_df)
True
>>> restored_bar = pd.read_parquet(BytesIO(df_parquet_bytes), columns=["bar"])
>>> restored_bar
   bar
0    5
1    6
2    7
3    8
4    9
>>> restored_bar.equals(original_df[['bar']])
True
The function uses kwargs that are passed directly to the engine. In the following example, we use the filters argument of the pyarrow engine to filter the rows of the DataFrame.
Since pyarrow is the default engine, we can omit the engine argument. Note that the filters argument is implemented by the pyarrow engine, which can benefit from multithreading and also potentially be more economical in terms of memory.
>>> sel = [("foo", ">", 2)]
>>> restored_part = pd.read_parquet(BytesIO(df_parquet_bytes), filters=sel)
>>> restored_part
   foo  bar
0    3    8
1    4    9
pd.read_parquet: Read Parquet Files in Pandas
In this tutorial, you’ll learn how to use the Pandas read_parquet function to read parquet files in Pandas. While CSV files may be the ubiquitous file format for data analysts, they have limitations as your data size grows. This is where Apache Parquet files can help!
By the end of this tutorial, you’ll have learned:
- What Apache Parquet files are
- How to read parquet files with Pandas using the pd.read_parquet() function
- How to specify which columns to read in a parquet file
- How to speed up reading parquet files with PyArrow
- How to specify the engine used to read a parquet file in Pandas
What are Apache Parquet Files?
The Apache Parquet format is a column-oriented data file format. This means data is stored by column, rather than by row. The benefits of this include significantly faster access to data, especially when querying only a subset of columns, because only the particular columns you need are read, rather than entire records.
The format is an open-source format that is specifically designed for data storage and retrieval. Because of this, its encoding schema is designed for handling massive amounts of data, especially spread across different files.
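As a quick illustration of that column-level access, you can ask read_parquet to load only the columns you need; the file name below is hypothetical:

# Sketch: read only a subset of columns from a (hypothetical) parquet file.
import pandas as pd

# Only 'price' and 'quantity' are read from disk; the file's other columns
# are never loaded into memory.
df = pd.read_parquet("sales.parquet", columns=["price", "quantity"])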
Understanding the Pandas read_parquet Function
Before diving into using the Pandas read_parquet() function, let’s take a look at the different parameters and default arguments of the function. This will give you a strong understanding of the function’s abilities.
Let’s take a look at the pd.read_parquet() function:
# Understanding the Pandas read_parquet() Function
import pandas as pd

pd.read_parquet(
    path,
    engine='auto',
    columns=None,
    storage_options=None,
    use_nullable_dtypes=False
)
We can see that the function offers 5 parameters, 4 of which have default arguments provided. The table below breaks down the function’s parameters and provides descriptions of how they can be used.