[Solved] What should I do in this emergency?

Question

In MongoDB replication, the primary node writes events to the oplog that describe the writes as they occur. Secondary nodes copy these events and replay them so that the data on the secondary is logically the same as that on the primary.

A delayed secondary node will copy the events from the primary as soon as they are written, and buffer them locally for the delay period.

A delayed secondary node must be configured with priority 0, so it is never eligible to become primary.
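To make that concrete, here is a rough sketch of what the member entry for a delayed secondary looks like in the replica set configuration, written as a Python dict. The field names `priority`, `hidden` and `secondaryDelaySecs` are the real configuration fields (the delay field was called `slaveDelay` before MongoDB 5.0); the helper functions and the host name are made up for illustration.

```python
def delayed_member(member_id: int, host: str, delay_secs: int) -> dict:
    """Build a replica set member document for a delayed secondary."""
    if delay_secs <= 0:
        raise ValueError("delay_secs must be positive")
    return {
        "_id": member_id,
        "host": host,                      # hypothetical host name
        "priority": 0,                     # required for delayed members
        "hidden": True,                    # recommended: keeps clients off stale data
        "secondaryDelaySecs": delay_secs,  # "slaveDelay" before MongoDB 5.0
    }


def can_become_primary(member: dict) -> bool:
    """A member with priority 0 can never be elected primary."""
    return member.get("priority", 1) > 0


member = delayed_member(1, "mongo2.example.net:27017", 3600)
print(can_become_primary(member))  # False
```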

The cluster as you’ve described it is already fragile and not highly available. If there is any problem with the existing primary, there is no other node that can accept writes, and no other node eligible to become primary.

That cluster will also take at least an hour to replicate any write, so any write that waits for acknowledgement from the secondary (for example, with a write concern of w: 2 or "majority") will appear to take a full hour to complete.

In the scenario in the question, the primary has dropped the databases and written those events to the oplog, and the secondary node has likely copied them already. This means that if you just turn off the primary, the secondary node will still happily drop the databases when it applies those events in an hour.

Also note that the secondary node has not applied any of the writes that occurred in the last hour.
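If it helps to picture the timing, here is a toy model of a delayed secondary in plain Python. It is not how MongoDB is implemented; the class names and timestamps are invented, and it only illustrates the point that events are copied immediately while applying them waits for the delay.

```python
from dataclasses import dataclass, field


@dataclass
class OplogEntry:
    ts: int   # time the primary wrote the event, in seconds
    op: str   # e.g. "insert" or "dropDatabase"


@dataclass
class DelayedSecondary:
    delay_secs: int
    buffered: list = field(default_factory=list)  # copied, not yet applied
    applied: list = field(default_factory=list)   # already replayed

    def copy_from_primary(self, entries):
        # Events are copied right away; only applying them is delayed.
        self.buffered.extend(entries)

    def apply_due(self, now: int):
        # Apply every buffered event that is at least delay_secs old.
        due = [e for e in self.buffered if e.ts + self.delay_secs <= now]
        self.buffered = [e for e in self.buffered if e.ts + self.delay_secs > now]
        self.applied.extend(due)


secondary = DelayedSecondary(delay_secs=3600)
secondary.copy_from_primary([
    OplogEntry(0, "insert"),
    OplogEntry(1000, "insert"),
    OplogEntry(4000, "dropDatabase"),
])

# 100 seconds after the drop: it is buffered but not applied,
# and neither is the insert from the last hour.
secondary.apply_due(now=4100)
print([e.op for e in secondary.applied])   # ['insert']
print([e.op for e in secondary.buffered])  # ['insert', 'dropDatabase']

# Leave the node running for the full delay and the drop is replayed too.
secondary.apply_due(now=7600)
print([e.op for e in secondary.applied])   # ['insert', 'insert', 'dropDatabase']
```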

If you stop the secondary node as soon as you realize the drop has occurred, you can prevent it from applying those operations. Then immediately make a new backup of the secondary to protect the data, reconfigure the secondary as a standalone node, start it up, extract the data, and re-insert it into the primary node.
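For the last step (extract and re-insert), the usual tools are `mongodump` and `mongorestore`, but it can also be done from a script. Below is a minimal sketch of the idea in Python. The `copy_database` function is hypothetical; it assumes both arguments behave like `pymongo.MongoClient` objects and only relies on `list_collection_names()`, `find()` and `insert_many()`. It copies documents only, so indexes would have to be recreated separately.

```python
def copy_database(source_client, target_client, db_name, batch_size=1000):
    """Copy every collection of db_name from source to target.

    Both clients are assumed to behave like pymongo.MongoClient objects.
    Returns the number of documents copied per collection.
    """
    copied = {}
    source_db = source_client[db_name]
    target_db = target_client[db_name]
    for name in source_db.list_collection_names():
        if name.startswith("system."):
            continue  # skip internal collections
        count = 0
        batch = []
        for doc in source_db[name].find():
            batch.append(doc)
            if len(batch) >= batch_size:
                # Insert in batches to keep memory use bounded.
                target_db[name].insert_many(batch)
                count += len(batch)
                batch = []
        if batch:
            target_db[name].insert_many(batch)
            count += len(batch)
        copied[name] = count
    return copied
```

You would call it with one client connected to the standalone node started from the old secondary's data files and one connected to the primary, once per dropped database. Run it against a copy of the data first if you can, since the backup of the secondary is the only remaining copy.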

Accepted Answer