Correlation heatmap in Python with matplotlib

Contents

How to create a seaborn correlation heatmap in Python?
Installation
Correlation heatmap
Matplotlib heatmap for correlation matrix using dataframe
Create a dataframe
Create a correlation matrix of the Dataframe
What is a correlation matrix?
Plot Matplotlib heatmap of correlation matrix
Using Pandas background_gradient for heatmap

How to create a seaborn correlation heatmap in Python?

Seaborn is a Python data visualization library built on top of matplotlib. It offers a high-level interface for drawing statistical graphics that are both informative and attractive. A heatmap is one of the plot types seaborn supports; it portrays variation in related data using a color palette. This article focuses on the correlation heatmap and on how seaborn, together with pandas and matplotlib, can be used to generate one for a dataframe.

Installation

Like any other Python library, seaborn can be installed with pip:

pip install seaborn

Seaborn also ships with the Anaconda distribution, so in an Anaconda environment it can usually be imported right away. If it is missing, it can be installed with conda:

conda install seaborn

Correlation heatmap

A correlation heatmap displays a 2D correlation matrix as a grid of colored cells. The variables of the dataset label both the rows and the columns, and each cell shows the correlation between its row variable and its column variable. The color of a cell encodes the correlation coefficient, usually on a monochromatic or diverging color scale. This makes correlation heatmaps well suited to data analysis, because patterns become easy to read and differences between pairs of variables stand out. Like a regular heatmap, a correlation heatmap comes with a colorbar that maps colors back to values.

The following steps show how a correlation heatmap can be produced:

Import all required modules first
Import the file where your data is stored
Plot a heatmap
Display it using matplotlib

The heatmap() function of the seaborn module is used for plotting.

Apart from data, all of its parameters are optional; data is the dataset to plot. To get a correlation heatmap, pass the result of the dataframe's corr() method as data. Note that corr() works on numeric columns only: older pandas versions silently dropped non-numeric columns, whereas pandas 2.0 and later raise an error unless you pass numeric_only=True or select the numeric columns yourself.

The example uses a dataset downloaded from kaggle.com. It contains data about bestselling novels on Amazon.

Dataset used – Bestsellers
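The steps listed above can be sketched as follows. This is a minimal sketch rather than the article's original listing: the Kaggle file is not reproduced here, so a small made-up table stands in for it, and its column names (Name, User Rating, Reviews, Price, Year) are assumptions for illustration only. With the real file you would load the data with pd.read_csv instead of building the dataframe by hand.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in rows; the column names are assumptions, not the
# actual layout of the Kaggle bestsellers file.
data = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C", "Book D", "Book E"],
    "User Rating": [4.7, 4.6, 4.8, 4.3, 4.9],
    "Reviews": [17350, 2052, 18979, 21424, 7665],
    "Price": [8, 22, 15, 6, 12],
    "Year": [2016, 2011, 2018, 2017, 2019],
})

# Keep only the numeric columns before computing the correlation matrix.
corr = data.select_dtypes(include="number").corr()

# annot=True writes the coefficient into each cell; vmin/vmax pin the
# color scale to the full range of a correlation coefficient.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heatmap (illustrative data)")
plt.tight_layout()
```

Calling plt.show() afterwards displays the figure; plt.savefig("heatmap.png") would write it to a file instead.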


Matplotlib heatmap for correlation matrix using dataframe

We will first create a dataframe with a list of countries and their GDP, population, GDP per capita, agricultural land and CO2 emission as separate columns. The values used for these columns are fake and do not represent the real figures.

Once this dataframe is created, we will generate a correlation matrix to find the correlation between each pair of columns and plot this matrix as a heatmap using Matplotlib. Finally, we will also explore the pandas background_gradient style function, which colors the cell backgrounds in a gradient style.

Create a dataframe

Let’s create a dataframe with the following six columns: countries, GDP_trillion, Population, GDP_per_capita, Agricultural_land and Co2_emission

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd

countries = ["china", "usa", "france", "russia", "japan", "india", "UK"]
GDP_trillion = [16, 23, 2, 1, 5, 3, 3]
Population = [1.4, 0.3, 0.068, 0.14, 0.12, 1.3, 0.067]
GDP_per_capita = [2.12, 3.45, 6.23, 7.89, 4.23, 5.34, 6.9]
Agricultural_land = [0.92, 1.2, 0.45, 0.73, 0.21, 0.34, 0.91]
Co2_emission = [0.40, 0.45, 0.34, 0.23, 0.97, 0.21, 0.74]

# Build the dataframe from a dict that maps column names to value lists
df = pd.DataFrame({
    'countries': countries,
    'GDP_trillion': GDP_trillion,
    'Population': Population,
    'GDP_per_capita': GDP_per_capita,
    'Agricultural_land': Agricultural_land,
    'Co2_emission': Co2_emission
})

Create a correlation matrix of the Dataframe

What is a correlation matrix?

A correlation matrix shows the correlation coefficient of every variable in the dataset with every other variable in the dataset.

In the heatmap, the degree of correlation between any two variables is depicted in two ways: the color of the box and the number inside the box.

The closer the absolute value of the number is to 1, the stronger the correlation. A positive number indicates a positive correlation and a negative number a negative correlation. Values of 1 and -1 indicate perfect correlation between two variables.
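To make the coefficient concrete, the sketch below computes Pearson's r for a single pair of columns by hand (covariance divided by the product of the standard deviations) and compares it with the value pandas returns. The two lists are the GDP_trillion and Population values from the example dataframe; the helper name pearson_r is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

gdp = [16, 23, 2, 1, 5, 3, 3]
population = [1.4, 0.3, 0.068, 0.14, 0.12, 1.3, 0.067]

manual = pearson_r(gdp, population)
from_pandas = pd.Series(gdp).corr(pd.Series(population))  # Pearson by default

# The two values should agree up to floating-point rounding.
print(round(manual, 4), round(from_pandas, 4))
# A variable correlated with itself gives 1 (up to rounding).
print(pearson_r(gdp, gdp))
```

This is also why the diagonal of a correlation matrix is always 1: each diagonal cell compares a column with itself.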

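# Pearson correlation of the numeric columns only; the text column
# 'countries' is excluded (numeric_only requires pandas 1.5 or later)
pear_corr = df.corr(method='pearson', numeric_only=True)
pear_corr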

Plot Matplotlib heatmap of correlation matrix

We will create the heatmap of the correlation matrix with matplotlib by passing the pear_corr matrix defined above to the imshow function.

First we create a figure of size 8×8 inches, then pass pear_corr to imshow with interpolation set to nearest. Since we want a colorbar that shows which color corresponds to which correlation value, we add that as well.

fig, ax = plt.subplots(figsize=(8, 8))
im = ax.imshow(pear_corr, interpolation='nearest')
fig.colorbar(im, orientation='vertical', fraction=0.05)

The output is just a color-coded heatmap; the axis labels and the correlation score for each cell are still missing.

To create the axis ticks and label them, we will use set_xticks and set_yticks together with set_xticklabels and set_yticklabels. The labels are the column names of the correlation matrix, i.e. pear_corr.columns (df.columns would also include the non-numeric countries column).

To annotate each cell of the heatmap with its correlation score, we will use the text method of matplotlib to position and color the score labels.

fig, ax = plt.subplots(figsize=(8, 8))
im = ax.imshow(pear_corr, interpolation='nearest')
fig.colorbar(im, orientation='vertical', fraction=0.05)

# Show all ticks and label them with the correlation matrix column names
labels = pear_corr.columns
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=65, fontsize=15)
ax.set_yticklabels(labels, rotation=0, fontsize=15)

# Loop over data dimensions and create text annotations
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, round(pear_corr.to_numpy()[i, j], 2),
                ha="center", va="center", color="black")

plt.show()
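Because a correlation coefficient always lies between -1 and 1, it can help to pin the color scale to that range so that colors stay comparable between plots. The sketch below is one possible variation on the plot above, not part of the original walkthrough: it assumes a diverging colormap (coolwarm) and switches the annotation color to white on strongly colored cells. The function name plot_corr_heatmap and the 0.6 threshold are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_corr_heatmap(corr, threshold=0.6):
    """Draw a correlation matrix with a color scale fixed to [-1, 1]."""
    fig, ax = plt.subplots(figsize=(8, 8))
    values = corr.to_numpy()
    im = ax.imshow(values, cmap="coolwarm", vmin=-1, vmax=1,
                   interpolation="nearest")
    fig.colorbar(im, orientation="vertical", fraction=0.05)

    labels = list(corr.columns)
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=65)
    ax.set_yticklabels(labels)

    for i in range(values.shape[0]):
        for j in range(values.shape[1]):
            # White text on saturated cells, black text on pale ones.
            color = "white" if abs(values[i, j]) > threshold else "black"
            ax.text(j, i, f"{values[i, j]:.2f}",
                    ha="center", va="center", color=color)
    return fig, ax

# Small numeric table reusing three columns of the example data.
demo = pd.DataFrame({
    "GDP_trillion": [16, 23, 2, 1, 5, 3, 3],
    "Population": [1.4, 0.3, 0.068, 0.14, 0.12, 1.3, 0.067],
    "Co2_emission": [0.40, 0.45, 0.34, 0.23, 0.97, 0.21, 0.74],
})
fig, ax = plot_corr_heatmap(demo.corr())
```

With fixed limits, a correlation of zero always maps to the neutral middle of the colormap, whatever values happen to occur in the matrix.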

Using Pandas background_gradient for heatmap

Alternatively, if you are working with a dataframe such as the correlation matrix pear_corr created above, you can use the pandas background_gradient style function.

The background color of each cell is determined by the data in each column, each row or the whole frame, or by a given gradient map. This function requires matplotlib to be installed.

Across Column

If you only want to apply the color coding along each column of the dataframe, set axis=0. The colors are then scaled separately within each column: with the Greens colormap used here, the darker the green, the higher the correlation value in that column. To show negative correlations in red and positive ones in green, a diverging colormap such as RdYlGn can be passed instead.

pear_corr.style.background_gradient(cmap='Greens', axis=0)

Across Rows

If we want to apply the color coding across each row of the dataframe instead, set axis=1.

pear_corr.style.background_gradient(cmap='Greens', axis=1)

For entire Dataframe

Finally, if axis=None, the gradient is applied across the entire dataframe. Note that axis=0 is the default, so None has to be passed explicitly.

pear_corr.style.background_gradient(cmap='Greens', axis=None)

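There is an optional parameter called gmap (gradient map) that determines the background colors. If it is not supplied, the underlying data is used. You can pass a dataframe or series with the same index and column labels, or an ndarray or list-like with the same shape as the underlying data, taking the axis setting into account. For example, to color by the strength of the correlation regardless of its sign:

gmap = pear_corr.abs()
pear_corr.style.background_gradient(cmap='Greens', axis=None, gmap=gmap)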

Updated: January 17, 2022
