
How to avoid split brain in MariaDB galera replication

CYBERQUEST MariaDB Galera Replication

Split-brain in a database cluster such as MariaDB Galera Cluster refers to a situation where two or more nodes start accepting writes independently of each other, leading to data inconsistencies. Preventing split-brain scenarios is crucial for the integrity and reliability of CYBERQUEST's database in a multi-master setup.

Recommendations to avoid split-brain in a CYBERQUEST MariaDB Galera DB cluster:

  1. Use a Minimum of Three Nodes: Having at least three nodes allows the cluster to reach quorum, which determines the primary component during a network partition. With only two nodes, each half can believe it is primary after a partition. If you run only two data nodes, add at least one lightweight read-only "witness" node so quorum can still be reached.

  2. Use the 'pc.weight' Parameter: This parameter can be used to give more voting power to specific nodes. If you know a particular node has more reliable network connectivity or is more trusted, you can give it more weight in the cluster decisions.

  3. Enable wsrep_auto_increment_control: This is enabled by default. It helps avoid auto-increment collisions when there are simultaneous writes on multiple nodes.

  4. Use gcache.size: This parameter sets the size of the retained write-set cache. A sufficiently large cache lets a node that fell out of sync rejoin without a full state snapshot transfer.

  5. Avoid Using Asynchronous Replication in Parallel: Combining Galera's synchronous replication with writable asynchronous replicas can reintroduce the data divergence Galera is designed to prevent.

  6. Watch the Network: Unreliable network connections are a common cause of split-brain. Regularly check and maintain the physical and virtual network components.

  7. Use Dedicated Replication Interfaces: Assign dedicated network interfaces for replication traffic so that cluster communication is isolated from client load.

  8. Use Galera Arbitrator (GARBD): If you are running an even number of nodes, consider using Galera Arbitrator. It's a lightweight process that can act as an additional vote during network partitions but doesn't store data.

  9. Tune Failure Detection Parameters: evs.suspect_timeout adjusts the time after which a non-responding node is suspected to be down, and evs.inactive_timeout adjusts the time it takes to declare a node inactive. Proper tuning can help in avoiding premature node evictions.

  10. Monitor the Cluster: Use monitoring tools to regularly check the health and status of nodes, ensuring they are in sync. Address any node outages or network interruptions quickly.

  11. Test Scenarios: Regularly simulate network partitions, node failures, and other disruptions in a staging environment to understand how your cluster reacts and to fine-tune settings.

  12. Educate Your Team: Ensure your team understands the importance of not writing to multiple nodes when the cluster is partitioned.

  13. Documentation and Procedures: Have clear documentation about what to do in the event of a network partition or node failure, and regularly review and update these procedures.
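The three-node recommendation above can be sketched as a my.cnf fragment. The hostnames, cluster name, and provider library path below are placeholders and will differ per installation:

```ini
# Illustrative galera.cnf fragment for each of three nodes.
# Hostnames, cluster name, and library path are placeholders.
[galera]
wsrep_on=ON
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_name="cyberquest_cluster"
# Listing all three nodes lets any two form a quorum (2 of 3)
# if the third is partitioned away.
wsrep_cluster_address="gcomm://cq-db1,cq-db2,cq-db3"
```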
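As a sketch of the pc.weight tuning, the best-connected node could be given an extra vote (the default weight is 1; the value 2 here is only an example). Note that if you set several provider options, they all go in one semicolon-separated wsrep_provider_options string:

```ini
# my.cnf on the best-connected node only — pc.weight=2 is an example value.
[galera]
# This node contributes 2 votes to quorum calculations instead of 1.
wsrep_provider_options="pc.weight=2"
```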
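wsrep_auto_increment_control is on by default; stating it explicitly in the configuration documents the intent:

```ini
[galera]
# ON by default: Galera adjusts auto_increment_increment and
# auto_increment_offset per node so concurrent multi-master
# inserts never generate the same auto-increment value.
wsrep_auto_increment_control=ON
```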
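A gcache sized for the expected outage window might look like the following; 2G is an illustrative value, to be sized according to your write volume and tolerated downtime:

```ini
[galera]
# A larger gcache lets a briefly absent node rejoin via an incremental
# state transfer (IST) instead of a full state snapshot transfer (SST).
wsrep_provider_options="gcache.size=2G"
```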
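Binding cluster traffic to a dedicated replication interface could be sketched as follows (10.0.1.11 is a placeholder address on the replication network):

```ini
[galera]
# Address on the dedicated replication NIC (placeholder).
wsrep_node_address=10.0.1.11
# Make group communication listen on that interface only.
wsrep_provider_options="gmcast.listen_addr=tcp://10.0.1.11:4567"
```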
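A minimal Galera Arbitrator setup, assuming a Debian-style service configuration (the file location and hostnames are placeholders; RHEL-family systems typically use /etc/sysconfig/garb instead):

```ini
# /etc/default/garb — run garbd on a third machine that stores no data.
# It only votes in quorum decisions for the two data nodes listed here.
GALERA_NODES="cq-db1:4567 cq-db2:4567"
GALERA_GROUP="cyberquest_cluster"
```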
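The failure-detection timeouts discussed above (evs.suspect_timeout, evs.inactive_timeout) live in the same provider-options string; the values below are the Galera defaults, expressed as ISO 8601 durations, shown as a starting point:

```ini
[galera]
# Suspect a silent node after 5 s, declare it inactive after 15 s
# (the defaults). Raise both on flaky networks to avoid premature
# evictions, at the cost of slower failure detection.
wsrep_provider_options="evs.suspect_timeout=PT5S;evs.inactive_timeout=PT15S"
```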

Finally, while MariaDB Galera Cluster offers synchronous replication, which inherently reduces the risk of split-brain compared to asynchronous replication methods, it's always important to understand the configuration and operational aspects of any technology to ensure it behaves as expected in various scenarios.