Incident Report: Yoitsu's Downtime on June 9th, 2022

Horo
I rebuilt my validator node, Yoitsu, because I lost the node's private key during this incident.

Details of this incident

All times below are in UTC+8.

At about 11am, a Discord bot alerted me that my validator node was missing a lot of blocks.

curl -sS http://localhost:26657/net_info | jq -r '.result.n_peers' returned 0, which meant my node could not connect to any peer at that time. So I planned to move the virtual server to another location to try to mitigate the problem.
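The peer check above can be turned into a small recurring check. The following is a minimal sketch, assuming curl and jq are installed and the RPC endpoint listens on localhost:26657 as in the command above. The function names, the RPC_URL variable, and the threshold of 1 are illustrative choices, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch: warn when the local node has too few peers.
# RPC_URL and the threshold are assumptions; adjust them for your setup.

RPC_URL="${RPC_URL:-http://localhost:26657}"

# Print the current peer count as reported by the node's RPC endpoint.
get_peer_count() {
  curl -sS "$RPC_URL/net_info" | jq -r '.result.n_peers'
}

# Succeed if COUNT is a non-negative integer >= MIN, fail otherwise.
peers_ok() {
  local count="$1" min="$2"
  case "$count" in
    ''|*[!0-9]*) return 1 ;;  # empty or non-numeric output counts as a failure
  esac
  [ "$count" -ge "$min" ]
}

# Example use (for instance from cron):
#   peers_ok "$(get_peer_count)" 1 || echo "WARNING: node has no peers" >&2
```

Treating empty or non-numeric output as a failure means an unreachable RPC endpoint also raises the warning, rather than passing silently.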

My validator node is hosted on Linode, which supports migrating a virtual server to a different data center, although it does not support moving externally mounted block storage along with it.

To save time, I didn't transfer the ~/.liked/data directory and tried to use state sync to catch up on blocks. However, I encountered "context deadline exceeded" errors every time:

cosmovisor[3370]: 12:27PM ERR error on light block request from witness, removing... error="post failed: Post \"https://fotan-node-2.like.co:443/rpc/\": context deadline exceeded" module=light primary={}
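The log line above shows a request to a light-client witness timing out. Before retrying state sync, it may help to check whether each configured RPC server answers at all. The sketch below is only an illustration: it assumes curl is installed and that the server exposes the standard /status endpoint under its RPC base URL. The function names and the 10-second timeout are hypothetical choices, and a passing check does not guarantee that state sync will succeed.

```shell
#!/usr/bin/env bash
# Sketch: check that RPC servers intended for state sync respond in time.

# Succeed if BASE_URL/status answers within TIMEOUT seconds (default 10).
rpc_reachable() {
  local base="${1%/}" timeout="${2:-10}"   # strip one trailing slash, if any
  curl -sS --fail --max-time "$timeout" -o /dev/null "$base/status"
}

# Check every URL given as an argument; report the ones that fail.
# Returns non-zero if at least one server is unreachable.
check_rpc_servers() {
  local url failed=0
  for url in "$@"; do
    if ! rpc_reachable "$url" 10; then
      echo "unreachable: $url" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example use, with the witness URL taken from the log above:
#   check_rpc_servers "https://fotan-node-2.like.co:443/rpc/"
```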

Since state sync kept failing, I cleared the old ~/.liked directory in order to catch up using nnkken's snapshot instead, but I only saved the ~/.liked/keyring-file directory beforehand. As a result, I lost the validator node's private key, which was stored in ~/.liked/config.

After sync_info.catching_up turned to false in my node's status, I noticed that my node had no voting power and BigDipper was still showing that my node was missing blocks. So I checked the logs:

Jun 09 10:11:12 localhost cosmovisor[2485]: 10:11AM INF This node is not a validator addr=353558D7C7D69DF83A6C9D37BB8204B38561217C module=consensus pubKey=cEwyDK/M1mJ+fJHXASe……

In addition, ~/liked tendermint show-address returned an address different from the one registered for my validator. I realised that I had lost my node's private key, so I announced the incident in the #mainnet-validators channel on Discord and started to create a new node.
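The identity check described above can also be scripted, so that a key mismatch is noticed right after a node is restored. This is a minimal sketch, assuming curl and jq are installed and that the local RPC /status response carries the node's hex validator address in result.validator_info.address. EXPECTED_ADDR is a hypothetical placeholder for an address recorded while the validator was healthy, and the function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: compare the address the node signs with against a recorded one.

RPC_URL="${RPC_URL:-http://localhost:26657}"

# Print the hex validator address the running node reports.
node_validator_address() {
  curl -sS "$RPC_URL/status" | jq -r '.result.validator_info.address'
}

# Succeed if both addresses are non-empty and equal, ignoring letter case.
same_address() {
  local a b
  a="$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
  b="$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')"
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example use (EXPECTED_ADDR is a placeholder you must set yourself):
#   same_address "$(node_validator_address)" "$EXPECTED_ADDR" \
#     || echo "WARNING: node is running with a different validator key" >&2
```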

Lessons learned from this incident

  • It is not enough to back up the node operator's private key; we should also back up the node's own private key.

Follow-up actions
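A backup step along these lines, run before wiping the node's home directory, would have preserved the key. The sketch below is an illustration only: it assumes the usual Tendermint layout, where the validator key is config/priv_validator_key.json and the P2P identity is config/node_key.json under the node home. The function name and the destination path are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: copy the node's key material out of the node home before wiping it.
# File names follow the usual Tendermint layout (an assumption here).

# Usage: backup_node_keys NODE_HOME DEST_DIR
backup_node_keys() {
  local home="$1" dest="$2"
  # Refuse to continue if the validator key is missing from the source.
  if [ ! -f "$home/config/priv_validator_key.json" ]; then
    echo "missing $home/config/priv_validator_key.json" >&2
    return 1
  fi
  mkdir -p "$dest/config" || return 1
  chmod 700 "$dest" || return 1   # keep the backup private to this user
  cp -p "$home/config/priv_validator_key.json" "$dest/config/" || return 1
  # The P2P node key and the keyring are copied when present.
  if [ -f "$home/config/node_key.json" ]; then
    cp -p "$home/config/node_key.json" "$dest/config/" || return 1
  fi
  if [ -d "$home/keyring-file" ]; then
    cp -Rp "$home/keyring-file" "$dest/" || return 1
  fi
  # Verify that the copied validator key is byte-identical to the source.
  cmp -s "$home/config/priv_validator_key.json" \
         "$dest/config/priv_validator_key.json"
}

# Example use (the destination is a placeholder):
#   backup_node_keys "$HOME/.liked" "$HOME/liked-key-backup" && echo "backup OK"
```

Only delete the old directory after the function has returned successfully, and keep the backup somewhere other than the server being rebuilt.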

Suggestions

  • There may be many reports of state sync failures. It may be necessary to test this mechanism, even if it is not frequently used for synchronization.
  • Alternatively, we could expand the documentation to cover how to enable state sync on more nodes, which would improve the robustness of this feature.


CC BY-NC-ND 2.0