This is the third article in a four-part series on Energized Data, written by Yes Energy’s Director of Data Products, Sonya Gustafson. In this series, Sonya explains what Energized Data is, how to create it, analyze it, and empower it within your organization.
In the previous article, Creating Energized Data, I talked about data engineering. Having data that’s ready for analysis is the foundation of building an Energized Data organization. Collecting and creating data is only part of the process though – solving problems requires so much more than data or even information. To solve problems, you must have the ability to discover and use the right tool for the job.
I’m an avid fly fisherwoman and bringing the right tool to the job is analogous to bringing the right equipment to catch the targeted species. I’m not going to float a boat capable of crossing an ocean to fish in a tiny, high mountain lake. I won’t wear wool-lined waders on my next saltwater fly fishing trip either.
The same is true for data analysis. We aren’t going to use Excel to crunch a few billion rows of data. Our best out-of-the-box reporting tool probably isn’t a data science library in R or Python either. That’s not to say these shouldn’t be tools in our toolkit, though – with the right data pipeline, we can find a solution for each of our data problems.
What's a Data Pipeline?
Let’s start by defining the pipeline. A data pipeline is a set of tools and processes used to automate the movement and transformation of data between a source system and a target repository.
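As a rough sketch of that definition in code, here’s what a minimal pipeline might look like, assuming a hypothetical source database, table, and target bucket (pandas and SQLAlchemy stand in for the extract and load steps):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical source system: a relational database holding raw collections.
source = create_engine("postgresql://user:password@source-host/market_data")

# Extract the raw rows landed by the collection engines.
raw = pd.read_sql("SELECT node_id, observed_at, price FROM raw_prices", source)

# Transform: standardize timestamps and drop incomplete observations.
clean = (
    raw.dropna(subset=["price"])
       .assign(observed_at=lambda df: pd.to_datetime(df["observed_at"], utc=True))
)

# Load into the target repository (partitioned Parquet files standing in
# for a data lake or warehouse landing zone).
clean.to_parquet("s3://example-bucket/clean/prices/", partition_cols=["node_id"])
```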
Since we already covered some of this in our Creating Energized Data article, we’ll focus on moving data from the system where it’s stored to the location where it will be analyzed. That’s not to say the storage system itself isn’t incredibly valuable. Given the volume of data we collect at Yes Energy, a transactional, relational database is a critical first landing spot for data: keys allow us to enforce data quality standards, and our collection engines can be optimized for the lowest-latency collection. This massive database also serves up the data and analytics in many of our solutions. But there will be specific use cases where the data needs to be made more intuitive and the analysis more performant, and sometimes that requires moving data to another system.
When we’re first getting to know our customers, we like to ask a number of questions to understand the problems they are trying to solve, the data they expect to use, the tools they are comfortable using, and the technology their organization supports.
Using an API
One of the fastest and easiest ways to get our customers up and running with automated data analysis is through Application Programming Interfaces (APIs). Creating these APIs takes a decent amount of programming on the development side, but they also have the lowest barrier to entry for our customers because we can deeply embed our domain expertise, data standardization, and data best practices in each URL or endpoint. APIs don’t always require a complex data pipeline – they can often be built on a company’s primary storage system, like a relational database.
We often select APIs because so many tools can be built on top of them with little additional work. It can be as simple as installing and calling a library. We can support everything from Python and R to PowerBI and Matlab with our API. And of course, as the name suggests, additional applications can be layered on top of the API. We lean towards recommending API solutions when customers ask us about reporting, automation, data exploration, and data science. Since APIs are so fast and easy to get started with, they can also be enormously useful for ad-hoc analyses. APIs can also help when you don’t want to store large quantities of data, because you can query what you need when you need it.
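As a rough illustration, with a placeholder endpoint, parameters, and API key rather than any specific provider’s URL, pulling data through an API into a dataframe can be just a few lines of Python:

```python
import pandas as pd
import requests

# Placeholder endpoint and query parameters; substitute your provider's
# actual URL, authentication, and arguments.
ENDPOINT = "https://api.example.com/v1/prices"
params = {
    "nodes": "HUB_A,HUB_B",
    "start": "2024-01-01",
    "end": "2024-01-31",
    "api_key": "YOUR_API_KEY",
}

response = requests.get(ENDPOINT, params=params, timeout=60)
response.raise_for_status()

# Most data APIs return JSON that drops straight into a DataFrame for
# reporting, exploration, or feeding a model.
prices = pd.DataFrame(response.json())
print(prices.head())
```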
The Downside of APIs
There’s a caveat to APIs – when you’re doing really big data analysis, they aren’t always the most efficient way to use information in bulk. Let’s say you’re working on a problem that requires a data frame of 25 data types, each with five years of history, across hundreds of locations. APIs often require data to be split into monthly, quarterly, or yearly extracts, and they may or may not be able to handle the flexibility of each data type in a single call. This can create a situation where you’re spending a lot of time downloading data. In these scenarios, going straight to bulk data and big data solutions works better, and we often recommend using data in the cloud, where you can better stream, store, and analyze this volume of data.
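To see why, here’s a hedged sketch of what a bulk pull through a monthly-extract API ends up looking like (the endpoint and parameters are the same placeholders as above); even a single data type takes 60 round trips before you multiply by data types and locations:

```python
import pandas as pd
import requests

# Illustrative only: walking five years of history one month at a time
# means 60 requests per data type. The endpoint and parameters are
# placeholders, not a real API.
ENDPOINT = "https://api.example.com/v1/prices"
months = pd.date_range("2019-01-01", "2023-12-01", freq="MS")

frames = []
for start in months:
    end = start + pd.offsets.MonthEnd(1)
    resp = requests.get(
        ENDPOINT,
        params={"nodes": "HUB_A,HUB_B",
                "start": start.strftime("%Y-%m-%d"),
                "end": end.strftime("%Y-%m-%d")},
        timeout=60,
    )
    resp.raise_for_status()
    frames.append(pd.DataFrame(resp.json()))

history = pd.concat(frames, ignore_index=True)
```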
Using a Data Lake
We’ve established a pipeline to move our clean, engineered data from our relational database to a data lake. Data lakes can show up at a variety of stages in a pipeline – sometimes a data lake is where all raw data is stored and each business unit runs its own transformations from the lake. Other times, a data lake is a landing spot for data after some level of engineering and processing, which is incredibly helpful if you have common transformations you need to leverage across use cases and business units. This is the direction we took at Yes Energy: the data can be used immediately and is available in bulk, as you’d expect from a typical data lake. Layering data engineering into a data lake improves data quality, but it’s not always the fastest solution, so understanding latency needs is important when building pipelines and determining where a lake fits into your architecture.
A data lake works well for the biggest data problems, but it mostly solves collection and storage needs – the volume of data a lake can serve up will easily overwhelm local systems. Python and R have a number of libraries that can read data directly from a data lake, and they are optimized for the big data volumes you would expect to find there; Apache Spark is one of our favorites for analysis. You can also access data in a lake with other programming languages like Scala and Java. Consider layering cloud computing or virtual machines with the lake to amp up performance and scale the analysis along with the volume of data. AWS and other cloud computing platforms even have data pipeline tools that can act as another layer of ETL (or ELT) if you want to fully automate this type of workflow. Data lakes often require a higher level of technical expertise to make them useful. Thousands of files in a storage bucket aren’t always intuitive to use without the right set of tools and programming skills.
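For example, a minimal PySpark sketch, assuming a cluster with access to a hypothetical bucket and column layout, can read the lake’s Parquet files and aggregate them without ever pulling the full volume onto a laptop:

```python
from pyspark.sql import SparkSession, functions as F

# Assumes a Spark cluster with access to the lake; the bucket, paths,
# and column names below are hypothetical.
spark = SparkSession.builder.appName("lake-analysis").getOrCreate()

# Spark reads the lake's partitioned Parquet files in parallel, so the
# analysis scales with the cluster instead of a single workstation.
prices = spark.read.parquet("s3a://example-bucket/clean/prices/")

daily_avg = (
    prices.groupBy("node_id", F.to_date("observed_at").alias("trade_date"))
          .agg(F.avg("price").alias("avg_price"))
)

daily_avg.write.mode("overwrite").parquet("s3a://example-bucket/marts/daily_avg/")
```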
One final place where a data lake can fit into your ecosystem is to quickly populate additional, internal data stores. If your analysis requires highly customized or on-premises storage because of security and data governance requirements (or it will be blended with data that needs to be siloed from some users), consider sourcing common data from a data lake before layering in your private pipelines. The key is to reduce the redundancy of common tasks while also creating the flexibility required for highly secure data. This can work well within the electric utility and government sectors.
Data warehouses are often built on top of data lakes, and this may be another step in your data pipeline. We often prefer to replicate from a database to our warehouse to reduce the latency of writing to and reading from the lake (and to enforce keys). This gets data to our warehouse in real time. Warehouses can also serve as both the storage mechanism and the analysis platform, and they often have the look and feel of a relational database. Warehouses are available in the cloud, which means they can quickly scale up and out. Need more storage? Not a problem. Add compute? Also not an issue. These are literally one click of the mouse away. We prefer Snowflake’s data warehouse – it allows intuitive access through SQL, and its processing engine is stellar, even with the Cartesian-style join problems we commonly find in electric power market analysis.
Dashboard and reporting tools like Tableau and PowerBI connect seamlessly to many warehouses, including Snowflake, so dashboards can be built with a simple drag-and-drop interface. We also recommend Snowflake for machine learning experts who prefer SQL for creating dataframes – Snowflake quickly connects to a variety of downstream tools through easy-to-install drivers and libraries. We know setting up a data warehouse can sometimes be cumbersome, so we built a market data system that can be immediately shared with any of our customers, allowing them to completely remove themselves from the data and system management side of the equation. That said, there is endless customization available, so if you want something unique and proprietary, a data warehouse affords you that benefit too. The barrier to getting started on Snowflake is often lower than with traditional relational databases.
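As a sketch of that last workflow, with placeholder credentials, table, and column names, the Snowflake Python connector turns a SQL query into a pandas dataframe in a few lines:

```python
import snowflake.connector

# Connection details are placeholders; use your own account, credentials,
# warehouse, and database names.
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="ANALYTICS_WH",
    database="MARKET_DATA",
    schema="PUBLIC",
)

# Build a modeling dataframe with plain SQL; the table and columns
# here are hypothetical.
cur = conn.cursor()
cur.execute("""
    SELECT node_id, trade_date, AVG(price) AS avg_price
    FROM daily_prices
    WHERE trade_date >= DATEADD(year, -5, CURRENT_DATE)
    GROUP BY node_id, trade_date
""")
features = cur.fetch_pandas_all()
```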
This article has been all about storage and the analysis you can build on top of it, but don’t forget the additional transformations you may make in the pipeline. Your pipeline may allow for additional segmentation or include materializations to support aggregations, and the landing spot for the data may be the best place to run that analysis. Some analyses are highly customized for an individual user, but the process to produce them is the same. In these cases, we create standardized functions at some point in the pipeline to calculate the customized results.
One example specifically pertinent to Yes Energy’s customer base is correlating outages to constraints and running back-cast analyses over specific look-back intervals. Where the function lives in the pipeline varies based on the customization, the tool, and how often it will run. If several of your end users need the same outputs from a Python machine learning model, position that Python step early in the pipeline so you aren’t creating redundant steps. If an OLAP data warehouse performs 30x better than your OLTP database, use the warehouse. If you only run that process once a quarter, skip the warehouse if that reduces the data hops.
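A heavily simplified, hypothetical sketch of what such a standardized function might look like (the column names and join logic are illustrative, not our production calculation):

```python
import pandas as pd

def lookback_correlation(outages: pd.DataFrame,
                         constraints: pd.DataFrame,
                         lookback_days: int = 30) -> pd.Series:
    """Correlate hourly outage MW with constraint shadow prices over a
    configurable look-back interval. Column names are illustrative; a
    production version would also handle locations, time zones, and
    data quality flags."""
    merged = outages.merge(constraints, on=["constraint_id", "hour_ending"])
    cutoff = merged["hour_ending"].max() - pd.Timedelta(days=lookback_days)
    window = merged[merged["hour_ending"] >= cutoff]
    # One correlation per constraint over the look-back window.
    return (
        window.groupby("constraint_id")
              .apply(lambda g: g["outage_mw"].corr(g["shadow_price"]))
    )
```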
The end goal with any pipeline is to deliver the data where it’s needed. Energize the data so it’s as useful as possible, as quickly as necessary, and so there are no hurdles to implementing solutions. Flexibility is empowering, so consider a variety of solutions; each hop and hurdle can become a breakpoint or bottleneck. Monitoring, business protocols, and redundancy can combine to create the most energized solution for your business.
Are you interested in exploring these solutions with the Yes Energy team? Sign up for a complimentary consultation or demo.
Sonya Gustafson is the Director of Data Products at Yes Energy. In this role, Sonya works closely with our customers to identify the new data we should collect and the next technology our customers require to utilize data. She's passionate about data engineering and creating actionable data from the data we collect. Sonya's an avid eco-tourist who loves experiencing new places with her fly rod.
To receive our latest blog insights, subscribe to our blog!