otrack et al.: Thank you and congratulations! It's gratifying to see the wheels of research make progress.
My appreciation of formal and machine-checked proofs has grown since we wrote the original EPaxos paper; I was delighted at the time at the degree to which Iulian was able to specify the protocol in TLA+, but now in hindsight wish we (or a later student) had made the push to get the recovery part formalized as well, so perhaps we'd have found these issues a decade ago. Kudos for finding and fixing it.
Have you yourselves considered formalizing your changes to the protocol in TLA+? I wonder if the advances the formal folks have made over the last decade or so would ease this task. Or, perhaps better yet -- one could imagine a joint protocol+implementation verification in a system like Ironfleet or Verus, which would be tremendously cool and also probably a person-year of work. :)
Edited to add: This would probably make a great masters thesis project. If y'all are not already planning on going there, I might drop the idea to Bryan Parno and see if we find someone one of these years who would be interested in verifying/implementing your fixed version in Verus. Let me know (or if we start down the path I'll reach out).
I can’t speak for the authors, but I have been lucky enough to be collaborating with them on behalf of the Apache Cassandra project, to refine and prove the correctness of the Accord protocol - a derivative of EPaxos we have integrated into the database.
It would be fantastic if such a project could be pursued for this variant, which has the distinction of being the only “real world” implementation.
Either way, thank you for the original EPaxos paper - it has been a privilege to convert its intuitions into a practical system.
One of the big gaps in Raft is that it’s hard to manage leader election on a heterogeneous network. Everyone has or knows a story about the node in the tiny branch office we keep for the CTO’s nephew, or for that engineer who decided to move to Colorado and would quit if he couldn’t work from there, getting elected leader and the whole system limping to a halt.
In the case of Raft, I think it would benefit from an instant-runoff election process, where three nodes are nominated and everyone votes on which one has the best visibility.
At the very least I can see a way to use latency to determine who to vote for, to manage a fast election instead of timeouts and retries.
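To sketch the idea (all names here are invented, not from Raft): each voter could measure round-trip latency to the nominated candidates and vote for the one with the best overall visibility, e.g. the lowest median RTT:

    # Hypothetical sketch: vote for the nominee with the best "visibility",
    # approximated here as the lowest median round-trip time to its peers.
    from statistics import median

    def pick_vote(nominees, rtt_ms):
        # rtt_ms: {nominee: [measured RTTs from that nominee to each peer]}
        return min(nominees, key=lambda n: median(rtt_ms[n]))

    print(pick_vote(["a", "b", "c"],
                    {"a": [5, 7, 9], "b": [40, 200, 350], "c": [6, 8, 120]}))
    # -> "a"; the branch-office node "b" never wins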
Isn’t that multi-Paxos? Paxos is leaderless.
Very odd opening sentence.
Lamport simply calls his protocol "Paxos" to refer to both the single‑decree and multi‑decree versions. This is also the case in his other works, e.g., "Fast Paxos" and "Generalized Paxos." The term "Multi‑Paxos" is a later community/industry shorthand for the repeated or optimized use of single‑decree Paxos.
“Paxos” is a term that can mean many different things, so it’s better not to get too attached to any one meaning, especially across different contexts.
Multi-Paxos is commonly used (especially in industry) as shorthand for multi-decree Paxos (in contrast to single-decree Paxos), but “Paxos” most often refers to the family of protocols, all of which are typically implemented with a leader. It is confusing, of course, because single-decree Paxos is used to implement EPaxos (and its derivatives).
It’s worth noting also that Lamport is (supposedly) on the record as having intended “Paxos” to refer to the protocol incorporating the leader optimisation.
In practice, almost every implementation of Paxos uses multi-paxos. Even the "Paxos Made Simple" paper notes:
> In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm.
because otherwise you don't have a mechanism for ordering; the more basic Paxos protocol only discusses how to arrive at consensus on a single proposal, not how to assign instance numbers to proposals in a reasonable way that preserves ordering.
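To illustrate (a sketch with invented names, not from the paper): the leader's ordering job in multi-Paxos amounts to handing out consecutive instance numbers, with each slot then decided by its own run of single-decree Paxos:

    # Sketch: the leader orders commands by assigning consecutive slots;
    # each (slot, command) pair is then settled by one single-decree
    # Paxos instance.
    class Leader:
        def __init__(self):
            self.next_slot = 0

        def assign_slot(self, command):
            slot = self.next_slot
            self.next_slot += 1
            return slot, command

    leader = Leader()
    print(leader.assign_slot("set x=1"))  # (0, 'set x=1')
    print(leader.assign_slot("set y=2"))  # (1, 'set y=2')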
* As others have pointed out, Paxos is leaderless. Electing a leader is a performance trick (it reduces contention/retries), not a correctness trick - even if you want to order your events.
* EPaxos appears to relax ordering as long as the clients can declare their event-dependencies.
Q1) If I withdraw from ATM 1 and someone else withdraws from ATM 2, we are independent consumers - so how do we possibly coordinate which withdrawal depends on the other?
Q2) Assuming that's not a problem, how do I get the ability to replay events? If the nodes don't care about order (beyond constraints), how can I re-read events 1-100, suffer a node outage, and resume reading events 101-200 from a replacement node?
The two commands affect the same account balance, so they don't commute, so these commands conflict. Every EPaxos worker is required to be able to determine whether any two commands are conflicting, in this case it would be something like:
    def do_commands_conflict(c1, c2):
        # read(c) / write(c) are assumed to return the sets of keys
        # that command c reads / writes.
        return (len(write(c1) & read(c2)) > 0
                or len(write(c2) & read(c1)) > 0
                or len(write(c1) & write(c2)) > 0)
Whenever an EPaxos node learns about a new command, it compares it to the commands that it already knows about. If it conflicts with any current commands, then it gains a dependency on them (see Figure 3, "received PreAccept"). So the commands race; the first node to learn about both of them is going to determine the dependency order [in some cases, two nodes will disagree on the order that the conflicting commands were received -- this is what the "Slow Path" is for].
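A rough sketch of that PreAccept-time dependency computation, reusing do_commands_conflict from above (known_commands and the instance-ID shape are invented for illustration):

    # Sketch: a replica attaches the new command to every known
    # conflicting command. known_commands: {instance_id: command}.
    def compute_deps(new_cmd, known_commands):
        return {inst_id
                for inst_id, cmd in known_commands.items()
                if do_commands_conflict(cmd, new_cmd)}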
The clients don't coordinate this; the EPaxos nodes choose the order. The cluster as a whole guarantees linearizability. Roughly, this means that there's at least one possible ordering of client requests that would produce the observed behavior; if two clients send requests concurrently, there's no guarantee of who goes first.
(in particular, the committed dependency graph is durable, even though it's arbitrary, so in the event of a failure/restart, all of the nodes will agree on the dependency graph, which means that they'll always apply non-commuting commands in the same order)
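To make "apply non-commuting commands in the same order" concrete, here's a toy sketch of executing the committed graph deterministically. (It assumes the graph is acyclic; real EPaxos handles dependency cycles by executing strongly connected components as a unit, ordered by sequence number.)

    # Sketch: every replica runs this over the same committed graph,
    # so conflicting commands execute in the same order everywhere.
    # deps: {instance_id: set of instance_ids it depends on}
    def execution_order(deps):
        order, done = [], set()
        while len(done) < len(deps):
            # run commands whose dependencies have all executed;
            # sorting makes tie-breaking deterministic across replicas
            ready = sorted(i for i in deps if i not in done and deps[i] <= done)
            order.extend(ready)
            done.update(ready)
        return order

    print(execution_order({"a": set(), "b": {"a"}, "c": {"a"}}))
    # -> ['a', 'b', 'c'] on every replica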
I'm not sure I understand Q1 - that's exactly the point: If you withdraw _from your account_ and customer B withdraws from _their_ account, then the two events are unrelated and can be executed in either order (and, in fact, replicas would still have the same state even if some executed AB and some BA).
The replay is part of what the authors fixed in the original protocol. I believe (though I need to read their protocol in more detail on Monday) that the intuition for this is that when there's an outage and you bring a new node online, the system commits a Nop operation that conflicts with everything. This effectively creates a synchronization barrier that forces re-reading all of the previous commits.
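If that reading is right, the barrier could be as simple as a no-op that conflicts with everything; a hypothetical sketch (not the authors' code), reusing the read/write-set predicate from upthread:

    # Sketch: the Nop picks up dependencies on everything committed
    # before it, and everything after it depends on the Nop.
    NOP = object()

    def conflicts(c1, c2):
        if c1 is NOP or c2 is NOP:
            return True  # the barrier rule
        return do_commands_conflict(c1, c2)  # normal read/write-set rule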
But I'm confused about the phrasing of your question, because the actor isn't clear when you say "I re-read events 1-100" -- which actor is "I"? Remember that a client of the system doesn't read "events"; it performs operations, such as "read the value of variable X". In other words, clients perform operations that observe _state_, and the goal of the algorithm is to ensure that the state at the nodes is consistent according to a specific definition of consistency.
So if a client is performing operations that involve a replacement node, the client contacts the node to read the state, and the node is responsible for synchronizing with the state as defined by the graph of operations conflicting with the part of the state requested by the client, which will include _all_ operations prior to the replacement of the node due to the no-op.
> I'm not sure I understand Q1 - that's exactly the point: If you withdraw _from your account_ and customer B withdraws from _their_ account
Same account.
> the actor isn't clear here when you say "I re-read events 1-100" -- which actor is "I"?
The fundamental purpose of Paxos is that different actors will come to a consensus. If different actors see different facts, no consensus was reached, and Paxos wasn't necessary.
If it's the same account, the two operations will have the same dependencies, and thus the system will be forced to order them the same at all replicas.
Hypothetically, let's say there's a synchronized quantum every 60 seconds. Order of operations might not matter if transactions within that window do not touch any account referenced by other transactions.
However, every withdrawal is also a deposit. If Z withdraws from Y, and Y withdraws from X, and X also withdraws from Z, there's a related path.
Order also matters if any account along the chain would reach an 'overdraft' state. The profitable thing for banks to do would be to synchronously deduct all the withdrawals first, then apply the deposits, to maximize the overdraft fees. A kinder approach would be the inverse: assume all payments succeed and then go after the sources. Specifying the order of applied operations, including aborts, in the case of failures is important.
Those transfers would be represented as having dependencies on both accounts they touch, and so would be forced to be ordered.
Transfer(a, b, $50)
and
Transfer(b, c, $50)
are conflicting operations. They don't commute because of the possibility that b could overdraft. So the programmer would need to list (a, b) as the dependencies of the first transaction and (b, c) as the second. Doing so would prevent concurrent submission of these transactions from being executed on the fast path.
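Concretely, in the read/write-set framing from upthread, both transfers write account b, so their write sets intersect and the conflict check fires:

    # Sketch: each transfer writes both balances it touches.
    t1_writes = {"a", "b"}   # Transfer(a, b, $50)
    t2_writes = {"b", "c"}   # Transfer(b, c, $50)
    print(len(t1_writes & t2_writes) > 0)  # True: shared account b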
Between their two questions, I'm guessing what they're more directly getting at is: if events 100 and 101 can be reordered, what's the guarantee that reconnecting doesn't end up giving you event 100 twice and skipping 101?
[Edit, rereading] Shortened down, just this part is probably it:
> which will include _all_ operations prior to the replacement of the node due to the no-op.
Sounds like a graph merge, not actually a replay.
A cited paper's title is "There is more consensus in Egalitarian parliaments."
Are terms like "democracy" and "parliament" common terms in distributed computing theory? Or are these intentionally clickbaity/humorous paper titles?
The original Paxos paper was titled "The Part-Time Parliament", and was explained -- I'm serious here -- not as a distributed systems protocol, but as a discussion of how legislators on a Greek island could pass decrees despite wandering in and out of the chamber (that's Lamport's framing). It set the stage for a series of papers using that theme. We continued the theme when picking the title for the EPaxos paper, and these folks built on that. So yeah, it's a bit of a thing specifically in the paxos literature.
And wait until I tell you about the Byzantine Generals Problem. :-)
This is exactly how bitcoin works.
Every 10 minutes the network elects a leader to sort & order transactions and also throw out fraudulent transactions.
If he fails to do this, he is not allowed to claim his block reward (technically the "coinbase" transaction).
I keep telling people the future of politics is markets & Blockchains.
It's hard to explain comprehensively, and what's strange is that no one has written a thorough book on the topic.
I am happy there are people actually writing such material on this topic.
Albeit it's a bit too technical.
Computer science is the future of politics & governance. (I don't think AI is all that useful here; rather, distributed systems are.)
> I keep telling people the future of politics is markets & Blockchains.
I hope that you don't mean just things related to cryptocurrencies, because as soon as you demand monetary investment for something, it ceases to be democratic.
>> Egalitarian Paxos introduced an alternative, leaderless approach...
> Every 10 minutes the network elects a leader to...
From that it sounds like it is completely different to how Bitcoin works. Bitcoin "elects" a leader node once every so often and this paper claims its protocol does not have a leader node. It is pretty easy to imagine a day passing in the Bitcoin world where one node is in control of all the transactions for that day with no ability for any other peer miner to have any influence at all in what transactions end up in the blockchain.
That just sounds like robber barons with extra steps?
The network does not elect a leader. That is a mischaracterization of the PoW process.
It's not like you are hashing based on your public key or something and then you get to sign a block afterwards. You have to commit to a block template before every hash. And the miner is decided randomly, weighted by hashrate.
Imagine applying this to anything else. The group with the most (extremely specialized) computer power just gets to decide everything?
“Hi everyone, I’m here to excitedly talk about the hyper-capitalist-hellscape I’d like to sell you all! Wait, why are you all leaving?”
There is a recurring trend of interpreting democracy to mean "leaderless consensus-based decision-making", which really doesn't work and never has. That's why Occupy and pretty much every other similar bottom-up movement failed: leaders are necessary. People follow other people, not algorithms or groups.
"Making democracy work" should be about training better leaders and getting them into the system.
"Reducing the influence of money" is fairly inconsistent with what money is. If anyone can influence anything in any way then having money is going to help them do it.
What you need is a way to reduce corruption, i.e. create a structure where diverting public funds to special interests or passing laws that limit competition can be vetoed by someone with the right structural incentives to actually prevent it.
But it seems pretty obviously not very good at any real executive action. Which is, again, by design.
I think the EU federation is pretty good, but I feel very dumbfounded every time dumb decisions that do not benefit member states are made; too much empathy too early, I guess.
I wouldn't say so. The first years of the largest war in Europe since WWII have shown that a leaderless EU is incapable of making important decisions crucial to its own survival, as a fallen Ukraine would have led to a divided EU where many countries would be governed by authoritarian fascist regimes, such as the one in Hungary led by Orban.
Occupy did not fail; it successfully shifted the entire national political conversation of the United States toward considerations of the class warfare being waged by the wealthy against the general population, in ways that have continued to echo publicly in campaigns and policy discussion ever since.