This is the second article in a four-part series on Energized Data, written by Yes Energy’s Director of Data Products, Sonya Gustafson. In this series, Sonya explains what Energized Data is, how to create it, analyze it, and empower it within your organization.
As a reminder, Energized Data is a standard of best practices by which data is not only collected and engineered but also integrated into systems and processes throughout the organization. It requires a deeper understanding of the data and use case - leveraging technical skills as well as domain expertise. You can amplify the use of Energized Data by collaborating across the organization to understand and execute data plans.
In this article, we’ll cover the first attribute associated with Energized Data - collection and engineering. I’ve seen a number of funny internet memes that compare data engineers to magicians, wizards, and ninjas, and I’d be lying if I said those weren’t hidden elements of success. But more honestly, I believe Yes Energy is successful because we have a team of experts. Some of our team are incredibly technical, with advanced database, SQL, and Python skills, while others have years of experience with power markets. We connect the dots with expertise in product and project management. Here are more specifics on the best practices Yes Energy applies to its data.
Yes Energy follows an Agile methodology for our data development processes. This allows us to move quickly while ensuring we’re always considering the end user and the use case. Before we collect data, we identify what problems it may solve, both for our customers and ourselves. This allows us to consider not only what we collect, but how quickly it needs to be loaded, which transformations can occur within the collection, and the downstream work that depends on how we store the data. By identifying user needs early, we can avoid situations where scope changes force us to collect the same data more than once, and we can avoid loading data without a fully formed end goal.
That’s not to say a greedy approach to data collection never works - gathering as much data as possible may yield results as well. But Energized Data requires a long-term data strategy; specifically, will you be able to maintain this data collection? For example, in one hour our collection system processed over 18,000 collection events, loading 5.7 million rows of data through 390 unique load mechanisms. We obviously don’t say no to data! But each successful automated collection requires monitors in place to ensure we’re doing right by that data (and our customers) before it lands in our systems. Haphazard data collection can create a maintenance nightmare, so be thoughtful with data collections.
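To make that concrete, here’s a minimal sketch of the kind of monitor we mean. It’s illustrative only, not Yes Energy’s production code; the collection names, freshness windows, and row-count thresholds are all hypothetical placeholders.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical record of the most recent run per collection; in a real
# system these rows would come from a load-audit table.
collection_log = {
    "ercot_rt_lmp": {"last_run": datetime(2023, 5, 1, 14, 55, tzinfo=timezone.utc), "rows": 48_500},
    "nyiso_load_forecast": {"last_run": datetime(2023, 5, 1, 9, 10, tzinfo=timezone.utc), "rows": 0},
}

# Hypothetical expectations: how fresh each feed must be, and the
# minimum plausible row count for a successful run.
expectations = {
    "ercot_rt_lmp": {"max_age": timedelta(minutes=15), "min_rows": 1_000},
    "nyiso_load_forecast": {"max_age": timedelta(hours=6), "min_rows": 100},
}

def check_collections(now: datetime) -> list[str]:
    """Return human-readable alerts for missing, stale, or thin loads."""
    alerts = []
    for name, rules in expectations.items():
        event = collection_log.get(name)
        if event is None:
            alerts.append(f"{name}: no load recorded")
        elif now - event["last_run"] > rules["max_age"]:
            alerts.append(f"{name}: stale, last run {event['last_run']:%H:%M}Z")
        elif event["rows"] < rules["min_rows"]:
            alerts.append(f"{name}: only {event['rows']} rows loaded")
    return alerts

for alert in check_collections(datetime(2023, 5, 1, 16, 0, tzinfo=timezone.utc)):
    print(alert)
```

The point isn’t the code itself - it’s that every automated load gets an explicit, checkable definition of “healthy” before it’s trusted in production.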
This leads us to the next best practice for data collections: maintaining the system and process. These are the questions we ask ourselves before productionizing data:
Do we expect the data source to change?
How will we monitor for said changes?
Will the data change and how will we capture the revisions?
Are there versions of data that should be prioritized?
What happens if this collection creates 2,500 errors? How will we fix it as quickly as the data is needed for analysis?
Are we looking for (and potentially cleaning around) gaps, outliers, and delays?
We’ve found it helps to have a team dedicated to monitoring sources and markets. They provide early warning before data breaks, so we aren’t waiting until the last minute to fix problems. We’ve also embraced data operations and grown both manual and automated quality assurance functions to keep up with the multitude of ways data can evolve over time. The old mantra of garbage in, garbage out holds very true here, so quality must be a priority for Energized Data.
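As a rough illustration of the automated side of that quality assurance, here’s a minimal sketch of the gap and outlier screens the questions above describe. It assumes a simple hourly series; production checks would be more robust (median-based outlier tests, per-source thresholds, and so on).

```python
import statistics

# A hypothetical hourly series (hour_ending -> value): hour 4 never
# arrived, and hour 6 looks like a decimal-point error.
observations = {1: 41.2, 2: 40.8, 3: 39.9, 5: 40.5, 6: 412.0, 7: 41.1, 8: 40.7}

def find_gaps(obs: dict[int, float], expected_hours: range) -> list[int]:
    """Hours we expected but never received."""
    return [h for h in expected_hours if h not in obs]

def find_outliers(obs: dict[int, float], z_threshold: float = 2.0) -> list[int]:
    """Hours whose value sits more than z_threshold standard deviations
    from the series mean - a deliberately simple screen."""
    values = list(obs.values())
    mean, stdev = statistics.mean(values), statistics.stdev(values)
    return [h for h, v in obs.items() if abs(v - mean) > z_threshold * stdev]

print("gaps:", find_gaps(observations, range(1, 9)))  # -> gaps: [4]
print("outliers:", find_outliers(observations))       # -> outliers: [6]
```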
Another value-add in data engineering to consider is how the data relates to other data. A row of data in San Diego, California, probably doesn’t need to be compared to a row of data in Portland, Maine, but it should be compared to a row of data in Los Angeles. Geospatial relationships and geographic information allow users to quickly build sets of relevant data.
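Here’s one hypothetical way to express that idea in code: use coordinates to build the set of locations worth comparing. The coordinates are approximate and the 200-mile cutoff is an arbitrary placeholder.

```python
from math import radians, sin, cos, asin, sqrt

# Approximate coordinates (latitude, longitude) for the cities above.
locations = {
    "San Diego, CA": (32.72, -117.16),
    "Los Angeles, CA": (34.05, -118.24),
    "Portland, ME": (43.66, -70.26),
}

def haversine_miles(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance between two (lat, lon) points in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(h))  # 3958.8 miles = Earth's mean radius

def comparable_locations(origin: str, max_miles: float = 200) -> list[str]:
    """Locations near enough to origin to be worth comparing."""
    base = locations[origin]
    return [name for name, coords in locations.items()
            if name != origin and haversine_miles(base, coords) <= max_miles]

print(comparable_locations("San Diego, CA"))  # -> ['Los Angeles, CA'], not Portland, ME
```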
Reduce data complexity by using hierarchical structures to formulate data relationships. Like the geographic example in the prior paragraph, we’ve learned relationship mapping helps when our customers need to solve a problem in ERCOT (the Texas power market) that they’ve already solved in NYISO (the New York power market). The customer’s time-to-insight is significantly reduced if they don’t have to “get on the ground” to learn the ins and outs of the local geography.
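A minimal sketch of what such a hierarchy might look like follows; the zone and node names are placeholders, not real market identifiers.

```python
# A hypothetical ISO -> zone -> pricing-node hierarchy. Real zone and
# node names vary by market; these are placeholders for illustration.
hierarchy = {
    "ERCOT": {"Houston Zone": ["node_h1", "node_h2"], "West Zone": ["node_w1"]},
    "NYISO": {"Zone J": ["node_j1", "node_j2"], "Zone A": ["node_a1"]},
}

def nodes_in(iso: str, zone: str | None = None) -> list[str]:
    """All pricing nodes under an ISO, or under one of its zones."""
    zones = hierarchy[iso]
    if zone is not None:
        return zones[zone]
    return [node for members in zones.values() for node in members]

# The same call shape works for any market, so an analysis written
# against ERCOT transposes to NYISO by changing one argument.
print(nodes_in("ERCOT"))            # -> ['node_h1', 'node_h2', 'node_w1']
print(nodes_in("NYISO", "Zone J"))  # -> ['node_j1', 'node_j2']
```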
We also perform repeated transformations early. One example: many weather data providers report temperatures in Celsius, and the easiest way to collect this data would be to throw it into a weather table in its originally published format. However, the bulk of our data centers on US markets, where the default unit of temperature is Fahrenheit. Storing the data in Celsius could lead to inaccurate analysis and confusion. Furthermore, if every downstream user of the data ultimately converts it to Fahrenheit, why not do that early on to create more efficiency?
Yes Energy converts it once so everyone can save time. This is also true for unit conversions (like $/MWh to $/MW-month) and other common calculations. In the next part of this series, the pipeline discussion, I’ll provide more insight into these conversions, including how to evaluate when it makes sense to store the converted value versus computing it on the fly.
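As a simple sketch of what “convert once, early” looks like, here are the two conversions mentioned above. The $/MW-month conversion multiplies by the hours in the month; a production version would also account for daylight-saving hours.

```python
import calendar

def c_to_f(celsius: float) -> float:
    """Convert Celsius to Fahrenheit once at load time, so every
    downstream user reads the unit they expect."""
    return celsius * 9 / 5 + 32

def per_mwh_to_per_mw_month(price_per_mwh: float, year: int, month: int) -> float:
    """Convert $/MWh to $/MW-month by multiplying by the hours in the
    month (ignoring daylight-saving adjustments for simplicity)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return price_per_mwh * days_in_month * 24

print(c_to_f(21.0))                            # -> 69.8
print(per_mwh_to_per_mw_month(10.0, 2023, 7))  # 744 hours in July -> 7440.0
```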
It’s important to layer in domain expertise throughout this process. Some of this comes naturally if you’re following Agile best practices on the data engineering side of the business. We bring in customers and domain experts to help create the development stories. We say “Yes” to our customers here at Yes Energy, and a big part of that is asking a lot of questions.
Some of our best practices were learned through mistakes: we’ve gotten deep into data engineering projects and realized in the final phase that our entire calculation was incorrect because we missed a component early in the design. Leveraging our team’s domain expertise earlier would have allowed us to put those requirements in the initial project spec.
This brings us to the final step in the data side of Energized Data - embrace flexibility and be ready to pivot. We specifically hire curious people who want to roll up their sleeves and solve extremely complex data problems. It takes confidence within a team to create the space and flexibility to solve problems in entirely new ways, but these nimble, creative teams deliver faster, better results.
The next step in Energized Data builds on this - finding the right tools for the job, which requires that same level of flexibility. But we’ll save that for our next blog post, Analyzing Energized Data.