Your data science model is only as good as the data you use. You can get better, cleaner data by using DataSignals to streamline everything from data corrections to standardizing over reporting intervals. But what about the data not published by the ISO? And how do your own insight and knowledge fit into data science models? This is where feature engineering comes in!
Feature engineering is the process of creating additional variables in your analysis based on your expertise and knowledge of the energy markets. A quick example: many of the ISOs don’t publish net load, but net load is often just as important as, if not more important than, each of the variables that make it up (actual load and renewable generation). Other good examples of important user-driven features you can create include lagged variables, leading variables, statistical summaries, binning/categorical feature creation, normalization, and log transformation.
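As a rough illustration of the net load and lagged-variable ideas above, here is a minimal Python sketch using only the standard library. The sample values, the variable names, and the choice to treat renewable generation as wind plus solar are assumptions made for this example, not data or definitions from any ISO.

```python
# Hypothetical hourly values in MW; real series would come from your own data source.
load_mw = [42000.0, 45000.0, 51000.0, 56000.0]
wind_mw = [9000.0, 7500.0, 6000.0, 4000.0]
solar_mw = [0.0, 2000.0, 5000.0, 3000.0]


def net_load(load, wind, solar):
    """Net load = actual load minus renewable generation (assumed here: wind + solar)."""
    return [l - w - s for l, w, s in zip(load, wind, solar)]


def lag(series, periods=1):
    """Shift a series later by `periods` intervals, padding the start with None."""
    n = len(series)
    periods = min(periods, n)
    return [None] * periods + list(series[: n - periods])


net = net_load(load_mw, wind_mw, solar_mw)
net_lag1 = lag(net, 1)  # previous interval's net load, usable as a model input

for current, previous in zip(net, net_lag1):
    print(current, previous)
```

A leading variable can be built the same way by shifting in the other direction, which is useful when a forecast value for a later interval is available at decision time.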
Here is an ERCOT-specific application of feature engineering to improve our understanding of the ERCOT price adder. We started our analysis by bringing in three series from the API: load, online capacity, and the price adder.
We can run some quick statistical analyses to summarize our data and build very basic histograms to see how each series is distributed.
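A quick summary and histogram of this kind could look something like the following Python sketch. The adder values are made up for illustration, and `summarize` and `histogram` are hypothetical helper names rather than part of any Yes Energy tool.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a numeric series."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def histogram(values, bins=5):
    """Count values in `bins` equal-width buckets spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant series
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # put the max in the last bin
        counts[idx] += 1
    return counts


# Hypothetical adder values in $/MWh: mostly zero with a few large spikes.
adder = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 2.0, 10.0, 100.0]

summary = summarize(adder)
counts = histogram(adder, bins=5)
print(summary)
print(counts)
```

A mean well above the median, with most observations in the first bin, is one simple sign of a right-skewed series.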
We can see similar patterns in online capacity and load. Also noticeable is how right-skewed the adder appears. But at this point, the data isn’t actionable. We still don’t know how changes in data drivers, like load or online capacity, affect whether or not the adder appears in a given market interval. To start to understand this, we can look for a simple linear correlation between two of our series, load and online capacity, and then shade the points by the adder.
What can we learn?
There is a relationship between load and online capacity.
It is a negative correlation, but maybe not as strong as we initially suspected based on just the histograms.
There is a lot of noise when load is lower and capacity is higher.
The adder appears to hit at some level where load is high and online capacity is low.
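The strength and direction of a linear relationship like this one can also be checked numerically. The sketch below computes a Pearson correlation coefficient by hand in Python; the load and online capacity values are hypothetical and chosen only to show a negative relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Hypothetical interval values in GW.
load_gw = [40.0, 45.0, 50.0, 55.0, 60.0]
online_capacity_gw = [30.0, 27.0, 26.0, 18.0, 14.0]

r = pearson(load_gw, online_capacity_gw)
print(round(r, 3))
```

A value near -1 indicates a strong negative linear relationship, while a value closer to 0 suggests the kind of noise noted above.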
We still need to dig further into this data to understand when to expect adders on the ERCOT LMP. For that, let’s start by establishing a new feature or variable: specifically, a boolean indicator for whether or not the adder bound. This will allow us to set thresholds for alerts and to identify binding intervals early.
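One way to sketch such an indicator in Python is shown below. It assumes that "bound" means the adder was strictly greater than zero in the interval; that definition, the threshold parameter, and the sample values are assumptions for illustration.

```python
def adder_bound_flags(adder_values, threshold=0.0):
    """Flag each interval where the adder exceeded `threshold` (assumed: > 0 means bound)."""
    return [value > threshold for value in adder_values]


# Hypothetical adder values in $/MWh for five market intervals.
adder = [0.0, 0.0, 3.25, 0.0, 41.8]

flags = adder_bound_flags(adder)
share_bound = sum(flags) / len(flags)  # True counts as 1, False as 0

print(flags)
print(share_bound)
```

Raising `threshold` above zero would instead flag only intervals where the adder was large enough to matter for a given use case.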
Now we have a strong starting point for analyzing when the adder will hit:
Load needs to be above 55 GW, and
Online capacity needs to be below 15 GW.
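To get a feel for how well a two-condition rule like this lines up with the boolean indicator, it can be scored against historical intervals. The Python sketch below uses the 55 GW and 15 GW thresholds from the text; the interval data and the `evaluate` helper are hypothetical.

```python
LOAD_THRESHOLD_GW = 55.0
CAPACITY_THRESHOLD_GW = 15.0


def rule_fires(load_gw, online_capacity_gw):
    """True when load is above and online capacity is below the thresholds."""
    return load_gw > LOAD_THRESHOLD_GW and online_capacity_gw < CAPACITY_THRESHOLD_GW


def evaluate(intervals):
    """Compare the rule with the observed adder flag for each interval."""
    result = {"hits": 0, "false_alarms": 0, "missed": 0}
    for load_gw, capacity_gw, adder_bound in intervals:
        fired = rule_fires(load_gw, capacity_gw)
        if fired and adder_bound:
            result["hits"] += 1
        elif fired and not adder_bound:
            result["false_alarms"] += 1
        elif adder_bound:
            result["missed"] += 1
    return result


# Hypothetical intervals: (load in GW, online capacity in GW, adder bound?)
intervals = [
    (58.0, 12.0, True),
    (60.0, 14.0, True),
    (56.0, 13.5, False),
    (52.0, 14.0, True),
    (45.0, 25.0, False),
    (57.0, 20.0, False),
]

scores = evaluate(intervals)
print(scores)
```

Intervals counted as missed are the ones where the adder bound even though the rule did not fire, for example when load was below the threshold.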
There are some areas where load could be lower and the adder is even more likely. To drill into this, we want to treat load as a category. For this, let’s create one more feature: load category. This feature will have five values.
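As a sketch of how a five-value category like this could be built, the Python snippet below bins load into five buckets. The bin edges and labels are hypothetical placeholders, not the values used in the analysis described here.

```python
from bisect import bisect_right

# Hypothetical bin edges in GW; four edges produce five categories.
EDGES_GW = [40.0, 45.0, 50.0, 55.0]
LABELS = ["<40", "40-45", "45-50", "50-55", ">=55"]


def load_category(load_gw):
    """Map a load value to its category; each bin includes its lower edge."""
    return LABELS[bisect_right(EDGES_GW, load_gw)]


# Hypothetical load values in GW.
loads = [38.0, 42.5, 47.0, 53.0, 58.0, 61.0]
categories = [load_category(value) for value in loads]

print(categories)
```

Once each interval carries a category label, the label can be used to split charts or summary statistics into one panel or group per category.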
Creating this as a facet, or additional partition on our data, also allows us to easily display other data drivers. Let’s bring in the ERCOT HASL (High Ancillary Service Limit) data now. In this example, we’ve updated our results to display HASL on the y-axis.
Now we can further break down our data to understand how HASL, load, and capacity relate to the adder. Based on just these charts, we can start to set thresholds for multi-conditional alerts in QuickSignals. We can even take this a step further by building machine learning models off of our data. But that’s for another blog. This is the first in a series of data science blog posts, so please subscribe to our blog to get those sent to your inbox!
If you’re a Yes Energy customer and would like to receive code samples (in R) to try this on your own, click here and we’ll send you the code. Unfortunately, the code will not work if you aren’t a Yes Energy customer, but if you’re interested in learning more, we’d love to chat! Fill out this form and we’ll be in touch!