Introducing Sentinel
One key challenge with RPCs is data quality. Blockchains carry financial information, so returning incorrect data can have serious consequences.
However, how exactly do you ensure that the data from an RPC node is correct in the first place?
Before we answer that question, we need to understand how blockchains work. An RPC node is simply a node that is part of a consensus-based distributed network called a blockchain. Whether data is “correct” can only be judged relative to other nodes on the network; you cannot determine the correctness of an RPC response in isolation. This is challenging for individual RPC users to cross-check, as you’d need access to nodes at a global scale (especially since consensus is achieved across the globe).
Over the past few months we ran into various issues around this theme, such as:
- Not knowing which nodes were synced to the tip of the chain versus which ones were lagging
- Nodes that were listed as being archive nodes not actually being archive nodes
- Not knowing whether a null result for a transaction is valid, or whether the request needs to be retried multiple times, adding latency
- Being confident in our aggregation of providers returning correct information (we have thousands of nodes from tens of providers)
Given the scale of this challenge, we wanted to ensure we created a robust solution that we could offer as part of our service. That’s why we built Sentinel.
Our Considerations
When designing Sentinel there were a few issues we wanted to solve, but it primarily came down to the following:
- At the time of a user’s request, is correct data being returned?
- Is the node returning data synced up to the tip of the chain?
However, we also had to take the following into consideration:
- Not adding extra latency to the user’s request
- Keeping the system cost-effective, as RPC prices vary significantly across providers
- Matching the system’s pace to each chain (some chains produce a block every 250ms while others produce one every 10s)
- Considering edge case scenarios when determining consensus in data sampling rounds
- Ensuring accurate hash comparison checks between differing response formats from providers
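The last consideration matters because two providers can return semantically identical responses in different shapes (different key order, "0x01" versus "0x1", mixed-case hex). Below is a minimal sketch of normalizing a response before hashing it; the rules and function names are illustrative, not our exact implementation. (In Ethereum JSON-RPC, quantities are minimal-length hex while data fields are fixed-length, so stripping leading zeros uniformly is a simplification — it still yields consistent comparisons as long as both sides are normalized the same way.)

```python
import hashlib
import json

def normalize(value):
    """Recursively normalize a JSON-RPC response value: lowercase hex
    strings and strip leading zeros so e.g. "0x01" and "0x1" compare equal."""
    if isinstance(value, str) and value.startswith("0x"):
        body = value[2:].lower().lstrip("0")
        return "0x" + (body or "0")
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value

def response_hash(result):
    """Hash after normalization, serializing with sorted keys so key
    order differences between providers don't produce false mismatches."""
    return hashlib.sha256(
        json.dumps(normalize(result), sort_keys=True).encode()
    ).hexdigest()
```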
The Solution
Ultimately, we landed on two types of checks as the bedrock of our system.
Replay Checks
Every 10,000th request on a given chain-method combination (e.g. eth_call on Ethereum) is replayed in parallel: the param.name and param.values are sent to three random nodes in our inventory. The results are then compared to test whether the result served to the user matches what the random nodes reached consensus on. Nodes that are out of consensus receive a strike. After three strikes, a node is disqualified and put on cooldown. We will talk more about how it comes back later on.
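A sketch of the comparison step under these rules (the quorum value and function names are illustrative, and real responses would first be normalized as discussed above):

```python
import json
from collections import Counter

def canonicalize(result):
    # Serialize with sorted keys so key order across providers
    # does not affect the comparison.
    return json.dumps(result, sort_keys=True)

def replay_verdicts(served_result, replayed, quorum=2):
    """Compare the result served to the user against results replayed
    to a random sample of nodes. `replayed` is a list of (node_id, result).
    Returns (served_ok, out_of_consensus_nodes); served_ok is None if
    the sample reached no consensus, in which case no action is taken."""
    counts = Counter(canonicalize(r) for _, r in replayed)
    consensus, votes = counts.most_common(1)[0]
    if votes < quorum:
        return None, []  # no consensus: take no action against any node
    out = [node for node, r in replayed if canonicalize(r) != consensus]
    return canonicalize(served_result) == consensus, out
```

Nodes returned in `out` would each receive a strike, feeding into the three-strike disqualification described above.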
Lag Checks
Our secondary concern is determining whether nodes are synced to the tip of the chain. To do this, during our replay checks we call eth_blockNumber on a random subset of nodes and record how much each node lags behind the consensus. Nodes that are a few seconds behind receive a strike. Nodes that are very delayed are blacklisted immediately, as we believe there is no reason for a node to be that far out of sync. To determine how many seconds a node is behind, we track the 30-day average block time of every chain.
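In code, the lag estimate and verdict might look like the sketch below. The strike and blacklist thresholds here are hypothetical placeholders, not our production values:

```python
def lag_seconds(node_block, consensus_block, avg_block_time_s):
    """Estimate how far behind a node is, in seconds, using the
    chain's 30-day average block time."""
    return max(0, consensus_block - node_block) * avg_block_time_s

def lag_verdict(node_block, consensus_block, avg_block_time_s,
                strike_after_s=5.0, blacklist_after_s=60.0):
    """A few seconds behind earns a strike; severely delayed nodes
    are blacklisted immediately. Thresholds are illustrative."""
    lag = lag_seconds(node_block, consensus_block, avg_block_time_s)
    if lag >= blacklist_after_s:
        return "blacklist"
    if lag >= strike_after_s:
        return "strike"
    return "ok"
```

Note that using the chain's average block time is what lets the same check adapt to a 250ms chain and a 10s chain: one block behind means very different things on each.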
For both of our checks, when nodes are disqualified after receiving enough strikes they are put into a “staging” arena where they have to prove their correctness in consecutive rounds. If they pass, they are added back into the mix of nodes. This algorithm and ruleset may change over time but this is how it works at the time of writing.
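The strike-and-staging lifecycle can be sketched as a small state machine. The thresholds (three strikes to disqualify, three consecutive clean rounds to requalify) are illustrative of the ruleset described above, which may change over time:

```python
class NodeRecord:
    """Tracks strikes and the staging/cooldown state for one node."""

    def __init__(self, strikes_to_disqualify=3, rounds_to_requalify=3):
        self.strikes = 0
        self.clean_rounds = 0
        self.state = "active"
        self.strikes_to_disqualify = strikes_to_disqualify
        self.rounds_to_requalify = rounds_to_requalify

    def record_round(self, passed):
        if self.state == "active":
            if passed:
                return
            self.strikes += 1
            if self.strikes >= self.strikes_to_disqualify:
                # Disqualified: node must prove itself in staging.
                self.state = "staging"
                self.clean_rounds = 0
        elif self.state == "staging":
            if passed:
                self.clean_rounds += 1
                if self.clean_rounds >= self.rounds_to_requalify:
                    # Passed consecutive rounds: back into rotation.
                    self.state = "active"
                    self.strikes = 0
            else:
                # Rounds must be consecutive: one failure resets progress.
                self.clean_rounds = 0
```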
Some minor details for the nerds:
- If there is not enough activity on a chain/route, checks run on a one-hour timer instead, so chains with little traffic are still checked regularly
- If we don’t have enough nodes to determine consensus then no action is taken against a node
- If two nodes return errors and one gives a valid result, we deem the two nodes that gave errors to be out of consensus
- If nodes error out without a valid 200 response, this does not count against them, as in production we simply retry against other nodes
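The edge-case rules above can be sketched as a single verdict function. The response encoding, verdict names, and three-node minimum are illustrative assumptions, not our exact internals:

```python
import json
from collections import Counter

def judge_round(responses, min_nodes=3):
    """`responses` maps a node id to ("ok", body) for a 200 response,
    where body is the JSON-RPC payload containing "result" or "error",
    or ("transport_error", None) for a failed request (timeout, non-200).
    Returns a verdict per node: 'pass', 'strike', or 'no_action'."""
    # Transport errors never count against a node: production simply
    # retries the request against other nodes.
    verdicts = {node: "no_action" for node in responses}
    usable = {n: body for n, (kind, body) in responses.items() if kind == "ok"}
    if len(usable) < min_nodes:
        return verdicts  # not enough nodes to determine consensus
    valid = {n: body for n, body in usable.items() if "result" in body}
    if not valid:
        return verdicts
    # A node that returned a JSON-RPC error while a peer produced a
    # valid result is deemed out of consensus.
    for n in usable:
        if n not in valid:
            verdicts[n] = "strike"
    # Majority vote among the valid results.
    counts = Counter(json.dumps(b["result"], sort_keys=True) for b in valid.values())
    winner = counts.most_common(1)[0][0]
    for n, body in valid.items():
        verdicts[n] = "pass" if json.dumps(body["result"], sort_keys=True) == winner else "strike"
    return verdicts
```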
All of the above comes baked into the RouteMesh system at no additional cost to our users, so you can be confident that the data you’re being served is correct. In addition, when faults occur we work closely with the providers to resolve them and help improve their systems. If you’re a provider, you will soon be able to see your node status through our provider dashboard.
Feel free to reach out to us if you would like to ask us any questions or clarify technical assumptions we’ve made.