Incident Report: Yoitsu Validator Outage on June 9th, 2022
Details of this incident
All times below are in UTC+8.
At about 11am, I noticed from the Discord bot that my validator node was missing a lot of blocks.
curl -sS http://localhost:26657/net_info | jq -r '.result.n_peers'
returned 0, which meant my node could not connect to any peer at that time. So I planned to move the virtual server to another location to try to mitigate the problem.
My validator node is hosted on Linode, which supports migrating a virtual server to a different datacenter, although it does not support moving externally mounted block storage along with it.
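The peer check above can be wrapped into a small health check. This is only a minimal sketch: the function names, the `RPC_URL` and `MIN_PEERS` variables, and the 5 second timeout are assumptions for illustration and do not come from the original setup.

```shell
#!/bin/sh
# Sketch: warn when a Tendermint node has too few peers.
# RPC_URL and MIN_PEERS are illustrative defaults, not values from the report.
RPC_URL="${RPC_URL:-http://localhost:26657}"
MIN_PEERS="${MIN_PEERS:-1}"

# Print the current peer count using the same net_info query (needs curl and jq).
fetch_peer_count() {
    curl -sS --max-time 5 "$RPC_URL/net_info" | jq -r '.result.n_peers'
}

# Succeed only if the count is a whole number greater than or equal to the minimum.
peers_ok() {
    count="$1"
    min="${2:-$MIN_PEERS}"
    case "$count" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as unhealthy
    esac
    [ "$count" -ge "$min" ]
}

# Example (not run automatically):
#   peers_ok "$(fetch_peer_count)" || echo "WARNING: node has too few peers" >&2
```

Run from cron or a systemd timer, a check like this could flag a peerless node before it starts missing blocks.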
To save time, I didn't transfer the ~/.liked/data directory, and tried to use state sync to catch up on blocks instead. However, I encountered "context deadline exceeded" errors every time:
cosmovisor[3370]: 12:27PM ERR error on light block request from witness, removing... error="post failed: Post \"https://fotan-node-2.like.co:443/rpc/\": context deadline exceeded" module=light primary={}
So I decided to clear the old ~/.liked directory and use nnkken's snapshot to catch up, but I only saved the ~/.liked/keyring-file directory beforehand. As a result, I lost the validator node's private key, which was stored in ~/.liked/config.
After sync_info.catching_up turned to false in my node's status, I noticed that my node had no voting power and that BigDipper was still showing my node as missing blocks. So I checked the logs:
Jun 09 10:11:12 localhost cosmovisor[2485]: 10:11AM INF This node is not a validator addr=353558D7C7D69DF83A6C9D37BB8204B38561217C module=consensus pubKey=cEwyDK/M1mJ+fJHXASe……
Running ~/liked tendermint show-address also returned an address different from my validator's operator address. I realised that I had lost my node's private key, so I announced the incident in the #mainnet-validators channel on Discord and started to set up a new node.
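The log line above is a useful signal to check right after restoring or rebuilding a node. A minimal sketch of such a check follows; the function name and the systemd unit name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: detect the "This node is not a validator" message in node logs read from stdin.
# Returns 0 if the message is present, non-zero otherwise.
reports_not_validator() {
    grep -q 'This node is not a validator'
}

# Example (the unit name "liked" is hypothetical, adjust to the real service):
#   journalctl -u liked --since "10 minutes ago" | reports_not_validator \
#       && echo "WARNING: node is running without a validator key" >&2
```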
Lessons learned from this incident
- Backing up the node operator's private key is not enough; we should also back up the node's own private key.
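To make this lesson concrete, here is a minimal sketch of a backup step to run before wiping the node's home directory. The file names `priv_validator_key.json` and `node_key.json` are the Tendermint defaults and are an assumption here, since the report only says the key lived in ~/.liked/config. The function name and the backup destination are hypothetical.

```shell
#!/bin/sh
# Sketch: copy validator key material to a backup directory before wiping the node home.
# File names are Tendermint/Cosmos SDK defaults (an assumption, not taken from the report).
# Intended for a fresh, empty backup directory.

backup_node_keys() {
    node_home="$1"     # e.g. "$HOME/.liked"
    backup_dir="$2"    # hypothetical destination, ideally on another disk or machine

    # Refuse to continue if the consensus key is missing.
    if [ ! -f "$node_home/config/priv_validator_key.json" ]; then
        echo "priv_validator_key.json not found under $node_home/config" >&2
        return 1
    fi

    mkdir -p "$backup_dir/config" || return 1
    chmod 700 "$backup_dir"

    cp -p "$node_home/config/priv_validator_key.json" "$backup_dir/config/" || return 1

    # node_key.json (the p2p identity) is optional but cheap to keep.
    if [ -f "$node_home/config/node_key.json" ]; then
        cp -p "$node_home/config/node_key.json" "$backup_dir/config/" || return 1
    fi

    # The keyring holds the operator key, which was preserved in this incident.
    if [ -d "$node_home/keyring-file" ]; then
        cp -Rp "$node_home/keyring-file" "$backup_dir/" || return 1
    fi
}

# Example (not run automatically):
#   backup_node_keys "$HOME/.liked" "$HOME/liked-key-backup" && echo "keys backed up"
```

The function stops early when the consensus key is absent, so a wipe script that calls it first can refuse to proceed without a backup.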
Follow-up actions
- I wrote an article on Matters to inform delegators: https://matters.news/@kenookamihoro/294363-续-备份-like-coin-验证人节点的二三事-bafyreiaoo63h4txf2tw2aepdyzr5gkui4djacvn7pcxvjyoux7f2g4n2n4
- I built a new validator node (https://dao.like.co/validators/likevaloper1r4sv5ea8mhd7q2cp566sh5zvkwg8xf3xwgw6uw) and changed the old node's information to alert current delegators.
- I contacted the CDC and tried to have the Community Delegations redelegated to my new node, but they replied that this was impossible because they had redelegated to my validator less than 21 days earlier.
Suggestions
- There may be many reports of state sync failures. It may be worth testing this mechanism, even if it is not frequently used for synchronization.
- Alternatively, we could expand the documentation to cover how to enable state sync on more nodes, which would improve the robustness of this feature.