Troubleshooting Validator Node Restart Issues

jessysaurusrex · November 19, 2022, 12:17am

This morning, two node operators reported in Discord that they had problems restarting their nodes. Since then, Agoric core devs have been digging into potential causes. The impact on the network appears small, and the affected nodes are back online, but we will keep investigating until we have diagnosed the root cause.

For now, the Agoric team recommends that validators who encounter restart issues restore their nodes from a snapshot provider on the network like Polkachu or kjnodes.

If you are a validator who runs into this problem or any other sort of consensus failure, please send us a copy of your node’s state (e.g. via s3, Google Drive, Dropbox) so that we can troubleshoot the issue.

To share your state, please run the following command after the node shuts down but before you delete the state (e.g. to replace it with a snapshot), and then send us the tarball that it creates:
tar czf ~/agoric-swingset-state.tar.gz ~/.agoric/data/ag-cosmos-chain-state
- This file should be about 250MB and takes about a minute to create. It does not contain any private data (the contents should be mostly the same across all nodes).
If you happen to collect slogfiles (i.e. the SLOGFILE= environment variable is set), please compress them and send them to us as well.
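
The two steps above can be combined into one helper. What follows is only a minimal sketch, not an official Agoric tool: the function name `collect_state`, its two optional arguments, and the output name `agoric-slogfile.gz` are assumptions made for illustration. The default paths are the ones from the command above, and it assumes SLOGFILE points at a single regular file.

```shell
#!/bin/sh
# Sketch: bundle the node state (and the slogfile, if one is configured).
# collect_state and agoric-slogfile.gz are hypothetical names, not Agoric tooling.
# Usage: collect_state [STATE_DIR] [OUT_DIR]
collect_state() {
  state_dir="${1:-$HOME/.agoric/data/ag-cosmos-chain-state}"
  out_dir="${2:-$HOME}"

  # Refuse to run if the state is already gone (e.g. replaced by a snapshot).
  if [ ! -d "$state_dir" ]; then
    echo "state directory not found: $state_dir" >&2
    return 1
  fi

  # Same archive as the tar command in the post.
  tar czf "$out_dir/agoric-swingset-state.tar.gz" "$state_dir" || return 1

  # If slog collection is enabled, write a compressed copy and leave the
  # original slogfile untouched.
  if [ -n "${SLOGFILE:-}" ] && [ -f "$SLOGFILE" ]; then
    gzip -c "$SLOGFILE" > "$out_dir/agoric-slogfile.gz" || return 1
  fi
}
```

Calling `collect_state` with no arguments after the node has stopped writes the tarball to your home directory, just like the command above. Pass a state directory and an output directory if your setup differs.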

Finally, whether you have problems or not, we’re interested in how long your validator was running before the restart. One avenue we are investigating is the possibility that runtime affects the probability of the bug being triggered, so please let us know the last time you restarted your node by adding a comment to this thread.

The Agoric team is working on the issue in GitHub and will update this discussion and keep engaging with validators in Discord as we continue our bug hunt.

Colinka · November 19, 2022, 9:03am

Hello, I investigated my sentry nodes. They had all been running without a restart since the last upgrade at block 7179263. I restarted my nodes one by one, and there were no problems.

Craci_BwareLabs · November 21, 2022, 12:45pm

Ours is working fine as well.

dckc · December 19, 2022, 7:52pm

As noted in #6588, we have identified the root cause and provided a fix that has been verified by a few affected validators.

For any validators who are affected by this going forward, we prepared a release including the fix:

Release pismoB · Agoric/agoric-sdk 08ca9d4
