EMS architecture: when local control beats the cloud round trip
tl;dr. An energy management system that round-trips every decision through the cloud is going to disappoint someone, eventually, in a way that costs real money. The right split between edge and cloud is not a religious question, it is a latency budget question, and once you frame it that way the architecture writes itself. Here is the framework I use, with the trade-offs we have actually had to make at Pstryk and where I would push back on the gridX-style hybrid model that has become the default.
If your EMS depends on the internet to keep the lights on, it is not an EMS, it is a dashboard with delusions of grandeur. That sentence is unfair to a lot of products on the market right now, and it is also basically correct.
An energy management system has a job that is fundamentally different from most software. It is not deciding which content to recommend, or which ad to serve, or which row in a table to update. It is deciding when to run a heat pump, when to charge an EV, when to discharge a battery, and how to ride through a sudden voltage drop on the local feeder. Some of those decisions can wait a few minutes. Some of them cannot wait a few hundred milliseconds. Architecting an EMS as if all decisions live on the same time axis is the most common and most expensive mistake I see in the field.
The latency budget framing
Before you decide what runs where, you should write down what each decision actually has to do, and how fast.
There is a decision class that has to happen in under one hundred milliseconds. Voltage and frequency response, hardware safety interlocks, anything where a delay would either damage equipment or cause it to disconnect from the grid. This class belongs at the edge. Not the regional edge, not the city edge, the actual physical device edge, because a round trip from a home connection to anywhere off-site typically starts around twenty milliseconds and, with no guaranteed upper bound, ends nowhere good.
There is a decision class that has to happen in under five seconds. Local control loops, response to user actions in the app, integration with a local protocol like Modbus or OCPP. This class can technically live in the cloud, and many products do put it there, but you pay for that choice in user experience and in any moment of internet flakiness. We keep this class on the edge as well, with the cloud as a configuration source rather than a control source.
There is a decision class that has to happen in under five minutes. Dynamic tariff optimization, smart charging schedules, pre-heating decisions, demand response signals. This class can sit in the cloud comfortably, with the edge holding a recent cache of the schedule so it keeps working when the connection drops.
There is a decision class that can happen on a one-hour or one-day cadence. Forecast generation, market participation, fleet learning, billing reconciliation. This class is cloud-native and there is no good reason to push it down.
Once you map your features into those four classes, the architecture is mostly determined. The mistake people make is skipping this exercise and deciding the architecture by where their team is most comfortable working, which usually means everything ends up in the cloud and the edge becomes a passive telemetry pipe. That works until you have a customer with a flaky 4G connection and a heat pump that turns off every time their router restarts.
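The mapping exercise can be written as a few lines of code, which makes it easy to review alongside the feature list. What follows is a minimal sketch, not the tooling we run: the thresholds come from the four classes above, while the names `Placement`, `place`, and the example features with their deadlines are hypothetical.

```python
from enum import Enum


class Placement(Enum):
    DEVICE_EDGE = "device edge"
    EDGE_CLOUD_CONFIG = "edge, configured from cloud"
    CLOUD_EDGE_CACHE = "cloud, schedule cached on edge"
    CLOUD = "cloud"


# (deadline upper bound in seconds, placement), ordered from tightest to loosest.
# Bounds follow the classes in the text: under 100 ms, under 5 s, under 5 min.
LATENCY_CLASSES = [
    (0.1, Placement.DEVICE_EDGE),
    (5.0, Placement.EDGE_CLOUD_CONFIG),
    (300.0, Placement.CLOUD_EDGE_CACHE),
]


def place(deadline_s: float) -> Placement:
    """Return where a decision should run, given how fast it must complete."""
    if deadline_s <= 0:
        raise ValueError("deadline must be positive")
    for upper_bound_s, placement in LATENCY_CLASSES:
        if deadline_s < upper_bound_s:
            return placement
    # Anything slower than five minutes is treated as cloud-native.
    return Placement.CLOUD


# Hypothetical feature list; the deadlines are illustrative, not measured.
FEATURES = {
    "frequency_response": 0.05,
    "surplus_solar_heat_pump_ramp": 2.0,
    "dynamic_tariff_schedule": 120.0,
    "load_forecast": 3600.0,
}

if __name__ == "__main__":
    for name, deadline_s in FEATURES.items():
        print(f"{name}: {place(deadline_s).value}")
```

The useful part is less the function than the table it forces you to fill in: every feature gets an explicit deadline, and a disagreement about where it runs becomes a disagreement about a number.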
The hybrid model and where I push back on it
The default architecture in 2026 for serious EMS products is the gridX-style hybrid: a local Linux gateway that handles protocol translation, local control loops, and a fail-safe ride-through, plus a cloud platform that handles forecasting, optimization, and market integration. We use the same shape at Pstryk, broadly, and I think it is correct for most cases.
What I will push back on is the assumption that the gateway should always be a separate piece of hardware. For a portion of the residential market, the gateway functionality can live inside the inverter, the heat pump controller, or the EV charger that the customer already has, if those devices expose enough control surface and run modern enough firmware. Adding another box to a customer's electrical cabinet is a real cost, both in installation and in support, and the box-free path is worth taking when the hardware ecosystem allows it. We are not yet in a world where this works most of the time, but the boundary is moving every year.
The other place I push back is on what I call gateway-as-a-microservice-cluster. Some teams treat the local gateway as a tiny version of their cloud, with five or six containerized services running on a Raspberry Pi, complete with an internal message bus and a local database. This is engineering over-fit. The gateway should do as little as possible, run as a small number of processes, and be replaceable in twenty minutes by a field technician who has never seen the inside of it. Every additional service on that box is a maintenance cost amortized over thousands of devices in the field, and field maintenance is expensive in a way that cloud maintenance is not.
What belongs on the edge
The edge owns three categories of work and almost nothing else.
Protocol translation is the obvious one. Modbus TCP for inverters and meters, OCPP 1.6J or 2.0.1 for chargers, SG-Ready signaling or EEBus for heat pumps, the Shelly local API for relays, and increasingly Matter for newer integrations. The cloud should never speak any of these directly. The gateway translates them into a uniform internal model and forwards normalized telemetry up. The cloud sends down structured commands that the gateway translates back into the local protocol.
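To make the idea of a uniform internal model concrete, here is a small sketch of the telemetry direction only. The `Telemetry` type, the function names, and the inverter register layout are all hypothetical. The OCPP side assumes a payload shaped like an OCPP 1.6 MeterValues request, and the Modbus side assumes the registers have already been read by whatever client the gateway uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Telemetry:
    """Normalized reading forwarded to the cloud, whatever the source protocol."""
    asset_id: str
    power_w: float  # active power in watts


# Hypothetical register layout for an imaginary inverter: address -> (field, scale).
# Real devices document their own maps; this matches no specific product.
INVERTER_REGISTERS = {40001: ("power_w", 10.0)}


def from_modbus(asset_id: str, registers: dict) -> Telemetry:
    """Convert raw holding-register values into the internal model."""
    values = {}
    for address, (field_name, scale) in INVERTER_REGISTERS.items():
        values[field_name] = registers[address] * scale
    return Telemetry(asset_id=asset_id, **values)


def from_ocpp_meter_values(asset_id: str, payload: dict) -> Telemetry:
    """Extract active import power from a MeterValues-shaped payload."""
    for meter_value in payload["meterValue"]:
        for sample in meter_value["sampledValue"]:
            if sample.get("measurand") == "Power.Active.Import":
                watts = float(sample["value"])  # OCPP carries values as strings
                if sample.get("unit") == "kW":
                    watts *= 1000.0
                return Telemetry(asset_id=asset_id, power_w=watts)
    raise ValueError("no Power.Active.Import sample in payload")
```

Both functions return the same type, so nothing above the gateway needs to know which protocol a reading came from. The command direction would mirror this, with one structured command type translated back into each local protocol.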
Local control loops are the second category. If the customer's solar production exceeds their consumption by a kilowatt, the heat pump should ramp up within seconds, not minutes, and that decision belongs on the gateway. If a battery hits its lower state-of-charge limit, the discharge command needs to stop now, not after the next cloud sync. The cloud sets the policy. The edge runs the loop.
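The split between policy and loop can be sketched in a few lines. This is an illustration under assumptions, not our production loop: the type names, the 15 percent state-of-charge floor, and the boolean command model are hypothetical, and the one-kilowatt threshold is taken from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Set by the cloud and cached on the gateway. Values are illustrative."""
    surplus_threshold_w: float = 1000.0  # solar surplus that triggers a heat pump boost
    min_soc_pct: float = 15.0            # battery discharge floor


@dataclass(frozen=True)
class Reading:
    solar_w: float
    load_w: float
    battery_soc_pct: float


@dataclass(frozen=True)
class Commands:
    heat_pump_boost: bool
    battery_discharge_allowed: bool


def control_step(policy: Policy, reading: Reading) -> Commands:
    """One pass of the local loop. Uses only local data, so it needs no network."""
    surplus_w = reading.solar_w - reading.load_w
    return Commands(
        heat_pump_boost=surplus_w >= policy.surplus_threshold_w,
        battery_discharge_allowed=reading.battery_soc_pct > policy.min_soc_pct,
    )
```

The gateway would call `control_step` every second or so with fresh readings. A real loop would also want hysteresis or a minimum run time so the heat pump does not toggle when the surplus hovers around the threshold, which this sketch leaves out.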
The fail-safe contract is the third and most important category. What does the system do when the cloud is unreachable? The answer has to be defined before you ship the first device, written down, and tested in a room with the cable physically unplugged. The default fail-safe in our system is to fall back to a static schedule based on the customer's tariff and usage pattern, with all controllable loads set to safe defaults. Battery discharge stops at the conservative minimum. Heat pump runs on its native control. EV charging proceeds at a default rate or pauses, depending on user preference. None of this is dramatic, but the fact that it is defined and tested is the difference between an EMS and a toy.
What belongs in the cloud
The cloud owns the work that benefits from scale and the work that requires data the edge cannot have.
Forecasting belongs in the cloud, both load forecasting for the individual customer and price forecasting for the market. These models train on data from the entire fleet plus external feeds like weather and market prices. There is no version of this that runs sensibly on a per-customer gateway.
Optimization belongs in the cloud, in most cases. Solving a daily schedule for a customer's battery, EV, and heat pump with a tariff curve and a forecast is a five-second job for a small linear program in a cloud worker. Pushing that solver to the edge is possible and some teams do it, but the cost in maintenance and the loss of fleet learning make the cloud the right home for now.
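For a feel of what the scheduling job looks like, here is a deliberately reduced version. It is not the linear program described above. It is a greedy stand-in for the simplest case, a single deferrable load such as an EV, with a fixed energy target and a per-slot charging limit, where filling the cheapest slots first gives the minimum cost. The function name and the numbers are hypothetical.

```python
def cheapest_charging_plan(prices, energy_kwh, max_kwh_per_slot):
    """Return the kWh to charge in each slot, filling the cheapest slots first.

    prices: price per kWh for each slot, in any consistent currency unit.
    energy_kwh: total energy the load must receive over the horizon.
    max_kwh_per_slot: the most energy the charger can deliver in one slot.
    """
    if energy_kwh > max_kwh_per_slot * len(prices):
        raise ValueError("not enough slots to deliver the requested energy")
    plan = [0.0] * len(prices)
    remaining = energy_kwh
    # Visit slots from cheapest to most expensive.
    for slot in sorted(range(len(prices)), key=lambda i: prices[i]):
        if remaining <= 0:
            break
        plan[slot] = min(max_kwh_per_slot, remaining)
        remaining -= plan[slot]
    return plan


if __name__ == "__main__":
    # Four slots, 15 kWh needed, at most 10 kWh per slot.
    print(cheapest_charging_plan([0.30, 0.10, 0.20, 0.50], 15.0, 10.0))
```

Once a battery that can both charge and discharge, a heat pump with thermal constraints, and a forecast enter the picture, the assets interact and the greedy shortcut stops being valid. That is the point at which a proper solver in a cloud worker earns its place.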
Market integration belongs entirely in the cloud. Balancing market participation, dynamic tariff settlement, virtual power plant aggregation, all of these involve API calls to grid operators and market platforms that the edge has no business making directly. The cloud aggregates the fleet's flexibility, makes the market call, and pushes the resulting commitments down as schedules.
Fleet learning, observability, billing, customer-facing apps, all of this is cloud work and unremarkable. The interesting part is the boundary between cloud and edge, not the cloud itself.
Hardware reality
If you are picking gateway hardware in 2026, your shortlist is shorter than it used to be and the trade-offs are clearer.
The pragmatic default is a turnkey industrial gateway built on the Raspberry Pi Compute Module 4. The Seeed Studio reComputer R1000 is a good example: CM4-based, a few hundred dollars, with isolated RS485, dual Ethernet, and native Modbus and BACnet support, built specifically for smart building and energy management. The ModBerry 500 CM4, from the Polish vendor Techbase, sits in the same bracket with a longer industrial heritage, DIN rail mounting, and rugged enclosure options, which matters when your devices are going into electrical cabinets that have to pass inspection. Buying one of these gets you the industrial I/O, isolation, and ruggedization out of the box rather than as a hardware project, and for most EMS gateways that is exactly the right call.
If you genuinely need vision or heavier ML inference at the edge, that is a different and more expensive category. You step up to a Jetson-class device rather than stretching a protocol gateway to do work it was not built for. Most EMS gateways do not need this, so be honest about whether yours does before you pay for it.
The roll-your-own option is a bare CM4 carrier board with a custom Linux image. It is still the price-performance leader on paper, with the caveat that you will spend real engineering time on the image and the lifecycle tooling rather than buying a finished product. Whether that trade is worth it depends on the size of your fleet. Below a few thousand devices, buy. Above ten thousand, the build math starts to flip.
The hardware decision you should not optimize aggressively is power consumption. The gateway is going to draw two to five watts continuously. The heat pump it controls draws between 200 and 5,000 watts when running. Optimizing the gateway's power budget at the cost of features or reliability is the wrong trade.
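The arithmetic behind that claim is short enough to check. The sketch below uses the wattage figures from the paragraph above; the function names are hypothetical, and it assumes a constant draw, which is a simplification.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_kwh(continuous_watts: float) -> float:
    """Energy used in a year by a load that draws constant power."""
    return continuous_watts * HOURS_PER_YEAR / 1000.0


def hours_to_match(gateway_watts: float, heat_pump_watts: float) -> float:
    """Hours of heat pump runtime that consume the gateway's annual energy."""
    return annual_kwh(gateway_watts) * 1000.0 / heat_pump_watts


if __name__ == "__main__":
    print(annual_kwh(2.0), annual_kwh(5.0))
    print(hours_to_match(5.0, 5000.0))
```

A gateway drawing 2 to 5 W uses roughly 17.5 to 43.8 kWh per year. A heat pump running at 5,000 W gets through the upper figure in under nine hours, so a single avoidable hour of heat pump runtime matters more than months of gateway power tuning.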
The fail-safe contract, written down
I said earlier that the fail-safe behavior has to be defined before you ship. Here is the format we use, and I would recommend something similar for any team building an EMS.
For each controllable asset, write down the safe default behavior under three loss-of-cloud conditions: cloud unreachable for under five minutes, for between five minutes and one hour, and for over one hour. The behavior can differ across those bands, and often should. A heat pump might continue running its last received schedule for an hour and then fall back to the manufacturer's native control. A battery might continue its last schedule for five minutes and then move to a conservative idle state. An EV charger might complete the in-progress session at a default rate and refuse new sessions until the cloud returns.
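The contract also works as data the gateway can load, which keeps the document and the code from drifting apart. Below is a minimal sketch that encodes the three example assets from the paragraph above. The band boundaries come from the text, while the behavior labels and function names are hypothetical.

```python
from enum import Enum


class Band(Enum):
    UNDER_5_MIN = "under 5 min"
    UNDER_1_H = "5 min to 1 h"
    OVER_1_H = "over 1 h"


def band_for(outage_s: float) -> Band:
    """Map the time since the last cloud contact to a contract band."""
    if outage_s < 5 * 60:
        return Band.UNDER_5_MIN
    if outage_s < 60 * 60:
        return Band.UNDER_1_H
    return Band.OVER_1_H


# Behaviors mirror the examples in the text; the labels are hypothetical.
CONTRACT = {
    "heat_pump": {
        Band.UNDER_5_MIN: "follow_last_schedule",
        Band.UNDER_1_H: "follow_last_schedule",
        Band.OVER_1_H: "native_control",
    },
    "battery": {
        Band.UNDER_5_MIN: "follow_last_schedule",
        Band.UNDER_1_H: "conservative_idle",
        Band.OVER_1_H: "conservative_idle",
    },
    "ev_charger": {
        Band.UNDER_5_MIN: "finish_session_refuse_new",
        Band.UNDER_1_H: "finish_session_refuse_new",
        Band.OVER_1_H: "finish_session_refuse_new",
    },
}


def failsafe_action(asset: str, outage_s: float) -> str:
    """Look up what an asset should do after outage_s seconds without the cloud."""
    return CONTRACT[asset][band_for(outage_s)]


def incomplete_assets(contract: dict) -> list:
    """Return assets that do not define a behavior for every band."""
    return [asset for asset, bands in contract.items() if set(bands) != set(Band)]
```

Running `incomplete_assets(CONTRACT)` in CI is a cheap way to make sure a newly added device model cannot ship without a behavior for every band.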
The document itself is two pages. The fact that it exists, has been reviewed, and has been tested in a controlled disconnection drill is what separates an EMS from a glorified telemetry pipeline. Test it on a schedule, not once. The first time you run the drill you will almost certainly find assets that behave differently from the spec, and that is the point of running it. You fix the spec where it was wrong and the code where it was wrong until the two match, and then you keep testing, because firmware updates and new device models will quietly break the contract again when you are not looking.
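The drill itself reduces to comparing what the spec says with what the devices did. A sketch of that comparison follows, assuming both sides are recorded as plain mappings from asset to outage band to a behavior label. The function name, band keys, and labels are hypothetical.

```python
def drill_mismatches(spec: dict, observed: dict) -> list:
    """Return (asset, band, expected, actual) for every deviation from the spec.

    Both arguments map asset -> band -> behavior label. A case missing from
    `observed` is reported with actual=None, so untested cases surface too.
    """
    mismatches = []
    for asset, bands in spec.items():
        for band, expected in bands.items():
            actual = observed.get(asset, {}).get(band)
            if actual != expected:
                mismatches.append((asset, band, expected, actual))
    return mismatches


if __name__ == "__main__":
    spec = {"battery": {"under_5_min": "follow_last_schedule",
                        "over_1_h": "conservative_idle"}}
    observed = {"battery": {"under_5_min": "follow_last_schedule",
                            "over_1_h": "follow_last_schedule"}}
    for row in drill_mismatches(spec, observed):
        print(row)
```

An empty result is the pass condition. Anything else is either a bug in the code or a mistake in the spec, and the drill does not tell you which, so someone still has to decide.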
Where I have changed my mind
Two years ago I would have argued for keeping more logic in the cloud and treating the edge as a thin translation layer. The last eighteen months at Pstryk have moved me steadily toward more edge.
Part of that move is that residential internet connections in Poland and the rest of CEE are good but not perfect, and even occasional disconnections add up to noticeable user experience degradation once you spread them across a whole fleet. Part of it is that the local control loops we initially ran in the cloud turned out to consume cloud resources out of proportion to their value. And part of it is that the edge hardware in 2026 is genuinely cheap and reliable enough to run more sophisticated logic than I assumed when we started.
I have not moved all the way to a full edge-first religion. The cloud still owns forecasting, optimization, and market integration, and I do not see that changing. But the boundary has shifted, and I would now tell a founder building a new EMS to start by writing the latency budget and the fail-safe contract, not by picking a cloud provider.
If you are building one, the work I do at Pstryk on this exact architecture is the kind of thing I help climate and energy founders with on the side, through the consulting page. The first thirty days of an engagement is usually the latency budget exercise, often with a different conclusion than the founder expected, which is the point.