Yes Energy News and Insights

Exploratory Data Analysis: Using Pandas Profiling for Easier & More Precise Data Science Projects

Exploratory Data Analysis (EDA) is an approach to data analysis that sheds more light on your data set before you move on to the rest of your data science project. Adding EDA to your process can make your analysis easier and more precise, and can suggest new hypotheses to test. EDA can take many forms, but it often includes analysis of datatypes, statistical summaries, histograms, correlation plots, and missing data.

The data scientists here at Yes Energy who use Python rely on pandas to perform EDA. Pandas is a package that lets you create dataframes and analyze them. Functions such as describe, info, corr, and isnull give you a quick first look at the data. With pandas you can also use histograms and other charts to visualize relationships and distributions as part of your EDA.
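As a rough illustration of that first pass, here is a minimal sketch using the four pandas functions named above. The dataframe is synthetic, and the column names (load_mw, wind_gen_mw, dart_spread) are assumptions made for this example, not actual DataSignals fields.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and values are illustrative only.
rng = np.random.default_rng(0)
n = 48
data = pd.DataFrame({
    "load_mw": rng.normal(90_000, 5_000, n),
    "wind_gen_mw": rng.gamma(2.0, 1_500, n),
    "dart_spread": rng.normal(0, 8, n),
})
data.loc[[3, 17], "wind_gen_mw"] = np.nan  # inject two missing values

summary = data.describe()              # count, mean, std, min, quartiles, max
missing_counts = data.isnull().sum()   # number of missing values per column
correlations = data.corr()             # pairwise Pearson correlations

data.info()                            # prints dtypes and non-null counts
print(summary)
print(missing_counts)
print(correlations)
```

Each call answers one narrow question, which is why a single combined report is convenient once the number of columns grows.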

We’ve taken this a step further by relying on pandas_profiling for our basic EDA. Pandas_profiling returns a report with everything mentioned above, and more!

In this blog post we’ll walk you through an EDA example using pandas_profiling on PJM DART Spreads, Wind Generation, and Load.

Start your analysis with a simple installation of the library:

pip install pandas-profiling

Once installed, you can run a profile against any dataframe. Here’s an example of the report returned for a dataset that we’re preparing for some machine learning (we’ve named this dataframe “data”).
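Running the profile takes only a couple of lines. The sketch below is not our production code: the dataframe is synthetic, the column names and the output file name data_profile.html are assumptions, and the import falls back to ydata_profiling because newer releases of the library are published under that name.

```python
import importlib

import numpy as np
import pandas as pd


def load_profile_report():
    """Return the ProfileReport class, or None if the library is not installed."""
    for module_name in ("pandas_profiling", "ydata_profiling"):
        try:
            return importlib.import_module(module_name).ProfileReport
        except ImportError:
            continue
    return None


# Synthetic stand-in for the "data" dataframe; columns are illustrative only.
rng = np.random.default_rng(0)
n = 48
data = pd.DataFrame({
    "load_mw": rng.normal(90_000, 5_000, n),
    "wind_gen_mw": rng.gamma(2.0, 1_500, n),
    "dart_spread": rng.normal(0, 8, n),
})

ProfileReport = load_profile_report()
if ProfileReport is None:
    print("Profiling library not installed; skipping the report.")
else:
    profile = ProfileReport(data)          # build the profile for the dataframe
    profile.to_file("data_profile.html")   # hypothetical output path
```

Opening the HTML file in a browser shows the same sections described in the rest of this post.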


[Figure: Dataset info section of the pandas_profiling report]

In the first section of the report we can learn about the size of our data and see how much of the dataframe is missing. We also get a summary of the variable types. The Warnings section is useful for identifying data concerns and areas where you might want to do additional processing and engineering. These can include constant columns, columns that may need a date type transformation, and highly correlated variables.
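To make the warnings concrete, here is a hedged sketch of how similar checks could be computed by hand in pandas. It is not how pandas_profiling implements them. The data is synthetic, the column names are hypothetical, and the 0.9 correlation threshold is an assumption chosen for this example.

```python
import numpy as np
import pandas as pd

# Synthetic data with one constant column, one near-duplicate pair,
# and some missing values. All names are illustrative.
rng = np.random.default_rng(1)
n = 200
scheduled = rng.normal(10_000, 1_000, n)
data = pd.DataFrame({
    "scheduled_mw": scheduled,
    "actual_mw": scheduled + rng.normal(0, 50, n),  # tracks the schedule closely
    "wind_gen_mw": rng.gamma(2.0, 1_500, n),
    "market": "PJM",                                # constant column
})
data.loc[:19, "wind_gen_mw"] = np.nan               # 20 of 200 rows missing

# Share of missing values per column.
missing_share = data.isnull().mean()

# Columns holding a single distinct value carry no information for a model.
constant_columns = [c for c in data.columns if data[c].nunique(dropna=True) <= 1]

# Pairs of numeric columns whose absolute correlation exceeds the threshold.
threshold = 0.9  # assumed cutoff for this sketch
corr = data.select_dtypes("number").corr().abs()
high_corr_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]

print(missing_share)
print(constant_columns)
print(high_corr_pairs)
```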

[Figure: Variable statistics for the wind data]

The next section of the report includes valuable statistical reports on each variable. The example above includes statistics on wind data. It allows us to understand not only the mean, min, and max, but also the skewness of the data and shape via histograms.
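The same per-variable statistics can be approximated directly in pandas and NumPy. The sketch below uses a synthetic, right-skewed series as a stand-in for wind generation. The series name, the distribution, and the choice of 10 bins are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed series standing in for wind generation (illustrative).
rng = np.random.default_rng(2)
wind = pd.Series(rng.gamma(2.0, 1_500, 1_000), name="wind_gen_mw")

stats = {
    "mean": wind.mean(),
    "min": wind.min(),
    "max": wind.max(),
    "skewness": wind.skew(),  # positive values indicate a long right tail
}
print(stats)

# Bin the values into 10 equal-width buckets and draw a rough text histogram.
counts, bin_edges = np.histogram(wind, bins=10)
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:8.0f} to {right:8.0f} MW | {'#' * (int(count) // 10)}")
```

A clearly positive skewness, as here, is a hint that a transformation such as a log may be worth testing before modeling.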

[Figure: Correlations section of the pandas_profiling report]

The next section will include correlations on the various items in the dataframe. Here we can see correlations between schedules and actual data. We can also see relationships between wind and capacity as well as offline capacity and scheduled capacity.
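When reading a correlation matrix, it helps to remember that the measure matters. The sketch below compares two methods that pandas supports, Pearson and Spearman, on a synthetic relationship that is monotonic but not linear. The column names and the cubic relationship are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, monotonic but nonlinear relationship (illustrative only).
rng = np.random.default_rng(3)
wind_speed = rng.uniform(3, 12, 500)
data = pd.DataFrame({
    "wind_speed_ms": wind_speed,
    "wind_gen_mw": wind_speed ** 3,
})

pearson = data.corr(method="pearson")    # measures linear association
spearman = data.corr(method="spearman")  # measures rank (monotonic) association

print(pearson.loc["wind_speed_ms", "wind_gen_mw"])
print(spearman.loc["wind_speed_ms", "wind_gen_mw"])
```

The rank-based measure treats this relationship as perfect, while the linear measure comes out lower, so a modest Pearson value does not by itself rule out a strong dependence.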

The other sections of the profile report include deeper analysis around missing data and snapshots of the first and last rows.

Once you’ve run a profile of the data, it’s much easier to identify where you might need to review inputs to improve your models.

With over 1400 datatypes in DataSignals, the use cases for this type of EDA are truly endless. Here are a few examples of other data science projects where you could apply pandas_profiling:

  • Reviewing data drivers associated with DART Spreads

  • Analyzing indicators of Real-Time pricing events

  • Identifying variables that result in stronger constraint binding intervals

  • Any data prep for an econometric model

If you want to see an example of pandas_profiling in action, let us know. We’re happy to share more examples of this using our DataSignals API!