Exploratory Data Analysis: Using Pandas Profiling for Easier and More Precise Data Science Projects
by Gaby Flores
Exploratory data analysis (EDA) is an approach to data analysis that can help shed more light on your data set before moving on to the rest of your data science project. Adding EDA to your process can make your analysis easier and more precise and can suggest new hypotheses to test. EDA can take many forms, but it often includes analysis of data types, statistical summaries, histograms, correlation plots, and missing data.
Yes Energy data scientists who work in Python use pandas to perform EDA. Pandas is a package that lets you create dataframes and analyze them. Methods such as describe, info, corr, and isnull give you summary statistics, column types, correlations, and missing-value counts to start your EDA. With pandas, you can also plot histograms and other charts to visualize the relationships and distributions in your data.
We’ve taken this further by relying on pandas_profiling for our basic EDA. Pandas_profiling returns a report with everything mentioned above, and more!
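As a minimal sketch of those four pandas calls, here is a small made-up dataframe. The column names and values are illustrative assumptions, not real market data.

```python
import io

import pandas as pd

# Hypothetical sample data; column names and values are assumptions for illustration.
data = pd.DataFrame(
    {
        "dart_spread": [1.5, -0.5, 2.0, float("nan"), 0.5],
        "wind_mw": [100.0, 150.0, 120.0, 130.0, 100.0],
        "load_mw": [900.0, 950.0, 1000.0, 1050.0, 1100.0],
    }
)

summary = data.describe()      # count, mean, std, min, quartiles, max per column
missing = data.isnull().sum()  # number of missing values per column
correlations = data.corr()     # pairwise Pearson correlations

# info() prints to stdout by default; capture it in a buffer so it can be reused.
buffer = io.StringIO()
data.info(buf=buffer)
info_text = buffer.getvalue()

print(summary)
print(missing)
print(correlations)
print(info_text)
```

Each call answers a different first question about the data: how it is distributed, what is missing, what moves together, and how each column is typed.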
An Example
Let's look at an EDA example using pandas_profiling on PJM DART spreads, wind generation, and load.
Start your analysis with a simple installation of the library:
pip install pandas-profiling
Once installed, you can run a profile against any dataframe. Here’s an example of the report returned based on our dataset – we've named this dataframe “data” – that we’re preparing for some machine learning.
import pandas_profiling
pandas_profiling.ProfileReport(data)
In the first section of the report, we learn about the size of our data and how much of it is missing. We also get a summary of the variable types. The warnings section flags data concerns and areas where you might want to do additional processing and feature engineering. These can include constant columns, columns that may need to be converted to a date type, and highly correlated variables.
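To make those overview checks concrete, here is a rough pandas-only sketch of the same ideas. It is not how pandas_profiling computes its warnings; the column names and the 0.9 correlation threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names and values are assumptions.
data = pd.DataFrame(
    {
        "wind_mw": [100.0, 150.0, float("nan"), 130.0],
        "wind_forecast_mw": [110.0, 160.0, 125.0, 140.0],
        "market": ["PJM", "PJM", "PJM", "PJM"],
        "timestamp": [
            "2024-01-01 00:00",
            "2024-01-01 01:00",
            "2024-01-01 02:00",
            "2024-01-01 03:00",
        ],
    }
)

n_rows, n_cols = data.shape           # size of the dataframe
missing_share = data.isnull().mean()  # fraction of missing values per column

# Constant columns carry no information for a model.
constant_cols = [c for c in data.columns if data[c].nunique(dropna=False) == 1]

# A string column holding timestamps usually needs a date type conversion.
data["timestamp"] = pd.to_datetime(data["timestamp"])

# Flag numeric pairs whose absolute correlation exceeds an assumed threshold.
threshold = 0.9
corr = data.select_dtypes(include="number").corr().abs()
high_corr_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]

print(n_rows, n_cols)
print(missing_share)
print(constant_cols)
print(high_corr_pairs)
```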
The next section of the report includes valuable statistics on each variable. In our example, it includes statistics on wind data. It allows us to understand not only the mean, min, and max but also the skewness of the data and its shape via histograms.
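The same per-variable statistics can be approximated directly in pandas. The sketch below uses a made-up wind series (the values are assumptions) and replaces the plotted histogram with bin counts so that it runs without a plotting library.

```python
import pandas as pd

# Hypothetical wind values with one large outlier, chosen to produce a right skew.
wind = pd.Series([0.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 10.0], name="wind_mw")

stats = {
    "mean": wind.mean(),
    "min": wind.min(),
    "max": wind.max(),
    "skewness": wind.skew(),  # positive values indicate a long right tail
}

# A text stand-in for a histogram: how many observations fall in each of 5 bins.
bin_counts = pd.cut(wind, bins=5).value_counts(sort=False)

print(stats)
print(bin_counts)
```

A mean well above the bulk of the observations together with positive skewness is a hint that a few large values are pulling the average up.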
The next section shows correlations among the variables in the dataframe. In our example, we can see correlations between scheduled and actual data. We can also see relationships between wind and capacity as well as between offline capacity and scheduled capacity.
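For a sense of what sits behind a correlation section, here is a small pandas sketch. The column names and numbers are invented for illustration, and the choice of Pearson and Spearman is an assumption about which measures are most useful to compare.

```python
import pandas as pd

# Hypothetical schedule, actual, and offline capacity values (assumptions).
data = pd.DataFrame(
    {
        "scheduled_mw": [100.0, 200.0, 300.0, 400.0, 500.0],
        "actual_mw": [110.0, 190.0, 310.0, 390.0, 520.0],
        "offline_capacity_mw": [50.0, 40.0, 30.0, 20.0, 10.0],
    }
)

pearson = data.corr(method="pearson")    # linear relationships
spearman = data.corr(method="spearman")  # monotonic (rank-based) relationships

# Rank the other variables by the strength of their link to the actuals.
strongest = pearson["actual_mw"].drop("actual_mw").abs().sort_values(ascending=False)

print(pearson)
print(spearman)
print(strongest)
```

Comparing the two matrices is a quick way to spot relationships that are monotonic but not linear.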
The other sections of the profile report include deeper analysis around missing data and snapshots of the first and last rows.
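If you want to reproduce those last two views by hand, a pandas-only sketch might look like the following. The data and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with gaps in both columns (values are assumptions).
data = pd.DataFrame(
    {
        "wind_mw": [100.0, float("nan"), 120.0, float("nan"), 140.0, 150.0],
        "load_mw": [900.0, 950.0, float("nan"), 1050.0, 1100.0, 1150.0],
    }
)

missing_counts = data.isnull().sum()               # gaps per column
rows_with_gaps = data[data.isnull().any(axis=1)]   # rows missing at least one value

# Do the gaps overlap? Overlapping gaps can point to a shared upstream cause.
both_missing = (data["wind_mw"].isnull() & data["load_mw"].isnull()).sum()

first_rows = data.head(2)  # snapshot of the first rows
last_rows = data.tail(2)   # snapshot of the last rows

print(missing_counts)
print(rows_with_gaps)
print(both_missing)
print(first_rows)
print(last_rows)
```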
Once you’ve run a profile of the data, it’s much easier to identify where you might need to review inputs to improve your models.
More Examples
With over 1,400 data types in DataSignals, the use cases for this type of EDA are truly endless. But here are a few examples of other data science projects that you could apply pandas_profiling to:
- Reviewing data drivers associated with DART spreads
- Analyzing indicators of real-time pricing events
- Identifying variables that result in stronger constraint binding intervals
- Any data prep for an econometric model
If you want to see an example of pandas_profiling in action, let us know. We’re happy to share more examples of this using our DataSignals API!