Troubleshooting Validator Node Restart Issues

This morning, two node operators reported in Discord that they had trouble restarting their nodes. Since that initial discussion, Agoric core devs have been digging into potential causes. Although the impact on the network appears small (and the affected nodes are back online), we will continue investigating until we have diagnosed the root cause.

For now, the Agoric team recommends that validators who encounter restart issues restore their nodes from a snapshot provider on the network like Polkachu or kjnodes.

If you are a validator that runs into this problem or any other sort of consensus failure, please send us a copy of your node’s state so that we can troubleshoot the issue (e.g. via S3, Google Drive, or Dropbox).

  • To share your state, please run the following command after the node shuts down, but before you delete the state (e.g. to replace it with a snapshot) and then send us the tarball that it creates:
    tar czf ~/agoric-swingset-state.tar.gz ~/.agoric/data/ag-cosmos-chain-state

    • This file should be about 250MB and takes about a minute to create. It does not contain any private data (the contents should be mostly the same across all nodes).
  • If you happen to collect slogfiles (i.e. the SLOGFILE= environment variable is set), please compress them and send them to us as well.
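
The steps above could be wrapped in a small script, sketched here under a few assumptions. The state path and tarball name come from the command above. The function names `archive_state` and `compress_slog` are made up for illustration, and the default process name `agd` is an assumption, so adjust it if your node runs under a different name. Unlike the one-line command, this variant uses `tar -C` so the archive stores paths relative to the data directory.

```shell
#!/bin/sh
# Sketch only: archive the chain state after the node has stopped and
# before the state is deleted or replaced with a snapshot.

archive_state() {
  state_dir=$1            # e.g. "$HOME/.agoric/data/ag-cosmos-chain-state"
  out=$2                  # e.g. "$HOME/agoric-swingset-state.tar.gz"
  proc_name=${3:-agd}     # ASSUMPTION: name of the node process

  # Refuse to run while the node process is still alive.
  if pgrep -x "$proc_name" >/dev/null 2>&1; then
    echo "error: $proc_name is still running; stop the node first" >&2
    return 1
  fi

  if [ ! -d "$state_dir" ]; then
    echo "error: state directory not found: $state_dir" >&2
    return 1
  fi

  # -C stores paths relative to the parent directory of the state dir.
  tar czf "$out" -C "$(dirname "$state_dir")" "$(basename "$state_dir")" || return 1
  ls -lh "$out"
}

compress_slog() {
  # Only relevant if SLOGFILE is set and points at an existing file.
  if [ -n "${SLOGFILE:-}" ] && [ -f "$SLOGFILE" ]; then
    gzip -c "$SLOGFILE" > "$SLOGFILE.gz" && ls -lh "$SLOGFILE.gz"
  fi
}

# Example invocation (uncomment to run):
# archive_state "$HOME/.agoric/data/ag-cosmos-chain-state" \
#               "$HOME/agoric-swingset-state.tar.gz" && compress_slog
```

The process check is only a guard against archiving a live node; it does not replace confirming for yourself that the node has fully shut down.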

Finally, whether you have problems or not, we’re interested in how long your validator was running before the restart. One of the avenues we are investigating is the possibility that uptime affects the probability of the bug being triggered, so please let us know the last time you restarted your node by adding a comment to this thread.
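
If you are unsure when your node was last started, one way to look it up is to ask `ps` for the process start time. This is a rough sketch that assumes a Linux host with the procps `ps` and `pgrep` tools and a node process named `agd`; both the process name and the helper name `node_uptime` are assumptions for illustration. It reports on the currently running process, so run it before you restart.

```shell
#!/bin/sh
# Sketch only: print when a running process started and how long it has run.

node_uptime() {
  proc_name=${1:-agd}     # ASSUMPTION: name of the node process

  # -o selects the oldest matching process if there are several.
  pid=$(pgrep -o -x "$proc_name") || {
    echo "no running process named $proc_name" >&2
    return 1
  }

  # lstart = start timestamp; etime = elapsed time as [[dd-]hh:]mm:ss
  echo "pid $pid started: $(ps -o lstart= -p "$pid")"
  echo "pid $pid elapsed: $(ps -o etime= -p "$pid" | tr -d ' ')"
}

# Example invocation (uncomment to run):
# node_uptime agd
```

If your node runs under a service manager, the service's own start timestamp may be a more reliable source than the process table.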

The Agoric team is working on the issue on GitHub, and will update this discussion and keep engaging with validators in Discord as we continue our bug hunt. :beetle:

Hello, I checked my sentry nodes. They had all been running without a restart since the last upgrade at block 7179263. I restarted my nodes one by one, and there were no problems.

Ours is working fine as well.

As noted in #6588, we have identified the root cause and provided a fix that has been verified by a few affected validators.

For any validators who are affected by this going forward, we prepared a release including the fix: