This reminds me of one of the most interesting bugs I've faced: I was responsible for developing the component that provided away market data to the core trading system of a major US exchange (which allows the trading system to determine whether an order should be matched in-house or routed to another exchange with a better price).
Throughputs were in the multiple tens of thousands of transactions per second and latencies were in single digit milliseconds (in later years these would drop to double digit microseconds, but that's a different story). Components were written in C++, running on Linux. The machine that ran my component and the trading engine were neighbors in a LAN.
We put my component through a full battery of performance tests, and for a while, we seemed to be meeting the numbers. Then one day, with absolutely zero code changes from my end or the trading engine's end, the latency numbers collapsed. We checked the hardware configs and the rate at which the latest test was run. Both identical.
It took, I think, several days to solve the mystery: in the latest test run, we had added one extra away market to a list of 7 or 8 markets for which my component provided market data to the trading system. We had added markets before without an issue. It's a negligible change to the market data message size, because it only adds a few bytes: market ID, best bid price & quantity, best offer price & quantity. In no way should such a small change result in a disproportionate collapse in the latency numbers. It took a while for us to realize that before the addition of these few bytes, our market data message (a binary packed format), neatly fit into a single ethernet frame. Those extra few bytes pushed it over the 1600 (or 1500?) mark and caused all market data message frames (which were the bulk of messages on the system, next to orders), to fragment. The frame fragmentation and reassembly overhead was enough to clog up the pipes at the rates we were pumping data.
In the short run, I think we managed to do some tweaks and get the message back under 1600 bytes (by omitting markets that did not have a current bid/offer, rather than sending NULLs). I can't recall what we did in the long run.
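A rough sketch of how thin that margin is, in Python. It assumes the standard 1500-byte Ethernet MTU and UDP over IPv4 without options; the story does not say which transport was used, so the header sizes and function names here are illustrative assumptions only.

```python
# Assumptions (not from the story): UDP over IPv4, no IP options, 1500-byte MTU.
ETHERNET_IP_MTU = 1500
IPV4_HEADER = 20
UDP_HEADER = 8


def max_unfragmented_payload(mtu=ETHERNET_IP_MTU):
    """Largest application message that still fits in a single frame."""
    return mtu - IPV4_HEADER - UDP_HEADER


def headroom(message_len, mtu=ETHERNET_IP_MTU):
    """Bytes to spare before fragmentation starts (negative means fragmenting)."""
    return max_unfragmented_payload(mtu) - message_len


print(max_unfragmented_payload())  # 1472
print(headroom(1472))              # 0: exactly fills the frame
print(headroom(1480))              # -8: a few extra bytes and every message fragments
```

Under these assumptions a message that sits right at the limit has zero headroom, so adding one more small record is enough to turn every message into two packets.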
“You had an MTU problem. You enable jumbo frames. Now you have two MTU problems”
Unless you control the entire set of possible paths (can be many!) and set all the MTUs to match well, this (while maybe on the surface helping with the problem, depending on many things) can set one up with a nasty footgun, whereby a black hole will show up at the most terrible moment of high traffic. See my PMTUD/PLPMTUD rant elsewhere in this thread.
Given this is a trading system where application latencies are measured in microseconds, the default would be to assume that jumbo frames are totally a valid approach.
MTU discovery would be so much easier if the default behavior was truncate and forward when encountering an oversized packet. The endpoints can then just compare the bytes received against the size encoded inside of the packet to trivially detect truncation and thus get the inbound MTU size.
This allows you to do MTU discovery as an endpoint protocol with all the authentication benefits that provides and allows you to send a single large probe packet to precisely identify the MTU size. It would also allow you to immediately and transparently identify MTU reductions due to route changes or any other such cause instead of packets just randomly blackholing or getting responses from unknown, unauthenticated endpoints.
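A toy model of the idea in Python, purely as a sketch: the 4-byte length field, the function names, and the list of per-hop MTUs are all made up for illustration, and real hops do not behave this way today.

```python
import struct

# Hypothetical wire format: a 4-byte big-endian total-length field, then padding.
HEADER = struct.Struct("!I")


def build_probe(total_len):
    """Build a probe packet that declares its own total length."""
    if total_len < HEADER.size:
        raise ValueError("probe must be large enough to hold the length field")
    return HEADER.pack(total_len) + bytes(total_len - HEADER.size)


def truncate_and_forward(packet, hop_mtus):
    """Model hops that cut oversized packets down instead of dropping them."""
    for mtu in hop_mtus:
        packet = packet[:mtu]
    return packet


def observed_mtu(received):
    """None if the packet arrived whole, else the inbound MTU revealed by truncation."""
    (declared,) = HEADER.unpack_from(received)
    return None if len(received) >= declared else len(received)


arrived = truncate_and_forward(build_probe(9000), [9000, 1500, 1400, 4000])
print(observed_mtu(arrived))  # 1400
```

A single oversized probe is enough here because the smallest hop determines how many bytes survive.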
Truncation for a dedicated probe packet type: you lose the information it's a probe when you go through a tunnel of some sort (VPN, L2TP, IPsec, MPLS, VPLS, VXLAN, PBB, q-in-q, whatever). You're also dealing with different layers e.g. a client could send an L3 packet probe and now you're expecting a layer 2 PBB/q-in-q node to recognize IP packet types and treat them specially (layering violation).
Truncation for all packet types: data in transit can occasionally get split for other reasons. Right now that's just made into loss, if we had built every protocol layer on the idea it should forward anyways then any instances of this type of loss also become MTU renegotiations, at best. At worst we're having to forward generally corrupted packets which can cause all sorts of other problems. It'd be another layering violation to require that e.g. an L2 switch must adjust the UDP checksum when it's intentionally truncating a packet, but that'd be the only way to avoid that. Tunnels (particularly secure) are also tricky here (you need to run multiple separate layers of this continuously to avoid truncation information not propagating to the right endpoints). It also doesn't allow for truly unidirectional protocols e.g. a UDP video stream as there is no allowance for out of session signaling to be possible.
The above is for "if we have started networking day 1 with this plan in mind". There are of course additional problems given we didn't. I'm also not sure I follow how allowing any intermediate node to truncate a packet is any more authenticated.
The (still ugly) beauty of using PMTUD-style approach over truncation or probe+notification is it doesn't try to make assumptions about how anything in the middle could ever work for the rest of time, and that makes it both simple (despite sounding like a lot of work) and reliable. You and your peer just exchange packets until you find the biggest size that fits (or that you care to check for) and you're off! MTU changes due to a path change? No problem, it's just part of your "I had a connection and the other side seems to have stopped responding. How do I attempt to continue" logic (be that retry a new session or attempt to be smart about it). It also plays nice with the ICMP too large messages - if they are there you can choose to listen, if they are not it still "just works".
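The "exchange packets until you find the biggest size that fits" part can be sketched as a binary search. This is a minimal Python sketch, not any stack's actual algorithm: `probe_ok` is a hypothetical callback standing in for "send a padded probe of this size and wait for the peer's acknowledgement", and the search assumes the floor is known to work and that success is monotonic in size.

```python
from typing import Callable


def search_path_mtu(probe_ok: Callable[[int], bool], floor: int, ceiling: int) -> int:
    """Largest size in [floor, ceiling] for which probe_ok(size) succeeds.

    Assumes probe_ok(floor) is True and that any size below a working size
    also works.
    """
    lo, hi = floor, ceiling
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if probe_ok(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


# Simulated path whose narrowest hop passes at most 1438 bytes (made-up number).
print(search_path_mtu(lambda size: size <= 1438, floor=1280, ceiling=9000))  # 1438
```

A real implementation would also have to retry probes, since a lost probe can mean congestion rather than an MTU limit.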
Or, like the article says, safe minimums can be more practical.
From the pragmatic standpoint: manually hard coding a safe minimum is the only approach which consistently works.
PMTUD somehow missed that packet networks ditching the OOB mechanisms of circuit switched networks was a good thing. By adding an OOB mechanism of attempted MTU discovery. Unauthenticated.
Yes, matching the 5-tuple from the original payload somewhat helps against the obvious security problem with this. (It was a fun 3-4 years while it was being added to systems across the ‘net while everyone was blocking the ICMP outright to avoid the exploitation. The burps of that one might still find in some security guidelines)
But the number of network admins who understand what they have to configure in their ACLs, and why, is scarily small compared to the overall pool size.
Here’s another hurdle: for about two decades, to generate ICMP you have to punt the packet from hardware forwarding to the slow path. Which gets rate-limited. Which gives one a fantastic way to create extremely entertaining and hard to debug problems: a single misbehaving or malicious flow can disable the ICMP generation for everyone else.
Make hardware that can do it in fast path ? Even if you don’t punt - you still have to rate-limit to prevent the unauthenticated amplification attack (28 bytes of added headers is not comparable with some of the DNS or NTP scenarios, but not great anyway)
So - practically speaking, it can’t be relied on, other than a source for great stories.
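To make the rate-limit starvation concrete, here is a small deterministic simulation in Python. It is only a sketch: the shared token bucket, the 10 replies/s rate, the burst of 10, and the traffic pattern are all invented for illustration and do not describe any particular router.

```python
def icmp_replies(events, rate_per_s, burst):
    """Count ICMP replies per flow under one shared token bucket.

    events: (time_ms, flow) tuples in time order, one per oversized packet.
    Tokens are tracked in thousandths so the arithmetic stays in integers.
    """
    capacity = burst * 1000
    tokens = capacity
    last_ms = 0
    replies = {}
    for time_ms, flow in events:
        # rate_per_s tokens per second equals rate_per_s milli-tokens per millisecond
        tokens = min(capacity, tokens + (time_ms - last_ms) * rate_per_s)
        last_ms = time_ms
        replies.setdefault(flow, 0)
        if tokens >= 1000:
            tokens -= 1000
            replies[flow] += 1
    return replies


# One flow sends an oversized packet every millisecond for a second;
# the other sends a single oversized packet at t = 550 ms.
noisy = [(ms, "noisy") for ms in range(1000)]
victim = [(550, "victim")]

shared = icmp_replies(sorted(noisy + victim), rate_per_s=10, burst=10)
alone = icmp_replies(victim, rate_per_s=10, burst=10)
print(shared)  # {'noisy': 19, 'victim': 0}
print(alone)   # {'victim': 1}
```

In this toy setup the quiet flow gets its reply when it is alone on the box and gets nothing once the noisy flow has drained the shared bucket.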
PLPMTUD is a little better, in a sense that it attempts to limit itself to inband probes, but then there is the delicate dance of loss customarily being used to signal the congestion.
So this mechanism isn’t too reliable either, in very painful ways for the poor soul on call dealing with the outcomes. Ask me how I know.. ;-)
Now, let’s add to this the extremely pragmatic and evil hack that is the TCP MSS clamping, coming back from the first PPPoE days; which makes just enough of the eyeball traffic work to make this a “small problem with unimportant traffic that no one cares for anyway”.
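For reference, the arithmetic behind the clamp, as a short Python sketch. The header sizes assume no IP or TCP options, and the function names are hypothetical; a real middlebox rewrites the MSS option in the SYN rather than calling anything like this.

```python
TCP_HEADER = 20                # without options
IP_HEADER = {4: 20, 6: 40}     # IPv4 without options, IPv6 fixed header
PPPOE_OVERHEAD = 8             # 6-byte PPPoE header + 2-byte PPP protocol ID


def mss_for_mtu(link_mtu, ip_version=4):
    """Largest TCP segment payload that fits in one packet on this link."""
    return link_mtu - IP_HEADER[ip_version] - TCP_HEADER


def clamp(advertised_mss, link_mtu, ip_version=4):
    """What a clamping middlebox would rewrite the SYN's MSS option to."""
    return min(advertised_mss, mss_for_mtu(link_mtu, ip_version))


pppoe_mtu = 1500 - PPPOE_OVERHEAD        # 1492
print(clamp(1460, pppoe_mtu))            # 1452
print(clamp(1440, pppoe_mtu, 6))         # 1432
print(clamp(1200, pppoe_mtu))            # 1200, already small enough
```

This only helps TCP, which is why everything else still hits the underlying MTU problem.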
So yes, safe minimums are a practical solution.
Until one starts to build the tunnels, that is. A wireguard tunnel inside IPSec tunnel. Because policy. Inside VXLAN tunnel inside another IPSec tunnel, because SD-WAN. Which traverses NAT64, because transition and address scarcity.
At which point the previously safe minimums might not be safe anymore and we are back to square 1. I suspect when folks start running QUIC over wireguard/ipsec/vxlan + IPv6 en masse we will learn that (surprise!) 1200 was not a safe value after all.
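A back-of-the-envelope version of that stack in Python. Every per-layer byte count below is an illustrative assumption: real ESP overhead in particular depends on cipher, padding, and NAT traversal, so treat the figures as round placeholders rather than measurements.

```python
def inner_mtu(link_mtu, overheads):
    """MTU left for the innermost packet after every layer takes its bytes."""
    remaining = link_mtu
    for _layer, cost in overheads:
        remaining -= cost
    return remaining


# Illustrative assumptions only, not measured values.
STACK = [
    ("IPsec ESP tunnel (outer)", 70),             # assumed round figure
    ("VXLAN over IPv6", 70),                      # 40 IPv6 + 8 UDP + 8 VXLAN + 14 inner Ethernet
    ("IPsec ESP tunnel (inner)", 70),             # assumed round figure
    ("WireGuard over IPv6", 80),                  # 40 IPv6 + 8 UDP + 32 WireGuard
    ("NAT64 header growth (IPv4 to IPv6)", 20),   # 40-byte header replaces a 20-byte one
]

# QUIC expects a 1200-byte UDP payload to get through; as an IPv6 packet that is:
QUIC_MIN_IPV6_PACKET = 1200 + 8 + 40

left = inner_mtu(1500, STACK)
print(left)                          # 1190
print(left >= QUIC_MIN_IPV6_PACKET)  # False
```

With these made-up but plausible numbers a plain 1500-byte underlay no longer carries a minimum-size QUIC datagram through the whole stack.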
So, with this in mind, I posit it’s nice to attempt to at least fantasize about the universe where MTU determination would be done entirely inline, even if hypothetical - if we had the benefit of today’s hindsight and could time travel - could we have made it better ?
P.s. unidirectional protocols could be taken care of by fountain codes not unlike the I-, P- and B- frames in video world, with similar trade offs, moreover, I feel the unequal probability of loss depending on a place in the packet might allow for some interesting tricks.
Agree wholeheartedly on the pragmatic standpoint of just using minimums.
With regard to the problems of out of band signaling in plain PMTUD I fully agree with all your well stated points, doubly so on PLPMTUD! PLPMTUD is my preferred variation of PMTUD and I was glad to see the datagram form utilized in QUIC (especially since it's really a generic secure network tunneling protocol, not just the HTTP variant). I'm also glad QUIC's security model naturally got rid of MSS clamping... it was somewhat pragmatic in one view... but concerning/problematic in others :D. Of course it's not like TCP/mss clamping have exactly gone away though :/.
Also fully agree on both PLPMTUD still not being as reliable/fast as one would like (though I still think it's the best of the options) + safe minimums never seeming to stay "safe". At least IPv6 attempted to hedge this by putting pressure on network admins, saying "everyone is expecting 1280". Of course... we all know that doesn't mean every client ends up with 1280, particularly if they are doing their own VPN tunnel or something, but at least it gives us network guys an extra wall of "well, the standard says we need to allow expectation of 1280 and the rate of bad things which happen will be much higher below that".
You seem to have some really neat perspectives on networking, do you mind if I ask about what you do/where you got your experience? I came up through the customer side and eventually over time morphed my way into NOS development at some network OEMs and it feels like I run into fewer and fewer folks who deal with the lower layers of networking as time has gone on. I think the most "fun" parts are trying to design overlay/tunneling systems which are hardware compatible with existing ASICs or protocols but are able to squeeze some more cleverness out of the usage (or, as you put it, if we had the benefit of today’s hindsight and could time travel - could we have made it better). The area I'd say I've been least involved in, but would like to, is anything to do with time sensitive networking or lossless ethernet use cases.
This works great until there is an app that is expecting 1280 and there is an operator that gives you 1280, and you have to run this app over an encrypted Geneve tunnel that attempts to add half a kilobyte of metadata :-). RADIUS with EAP or DHCP with a bunch of options can be a good example of a user app like this. Unfortunately this is a real-world problem.
The smaller mismatch but nonetheless painful is the 20 byte difference between IPv4 and IPv6 header sizes. It trips up every NAT64 deployment.
> where you got your experience?
A long path along all the OSI layers :-). Fiber and UTP networks install between ~95 and 2000. CCIE R&S#5423 in ‘99 and from 2000 almost 10 years in TAC and one of the first CCIE in Europe. Then some years working on IPv6 transition. Large scale IPv6 WiFi. Some folks know me by “happy eyeballs”; some by a “nats are good” YouTube video (scariest thing it’s still funny a decade later). These days - relops at fd.io VPP + internal CI/CD pipeline for a bunch of projects using VPP; and as a side gig - full-cycle automation of the switched fleet (~500 boxes) at #CLEUR installations. One of the recent fun projects was [0] - probably industry first of this scale, for an event network: more than 15K WiFi clients on IPv6Mostly. Though we were benefitting from work of a lot of folks that pushed the standardization and did smaller/more controlled deployments, specifically to shout huge thanks to Jen Linkova and Ondřej Caletka.
If you like low level network stuff, you might like VPP - and given it’s Apache licensed, pretty easy to use it for your own project.
One minor Ethernet MTU thing I would change with a time machine is to have the network header portion of the MTU be more like 802.11. I.e. instead of sized exactly to the headers of the day it intentionally was larger to allow variation over time. It wouldn't really do anything for most of the MTU concerns discussed here or for clients but I think it would have been helpful for the evolution of wired protocols.
Happy eyeballs! Yes, I loved that one! I was always a huge IPv6 nerd as well, though I didn't get started until shortly after that. The "nats are good" video isn't ringing any bells but if you have a link I'd definitely give it a watch as it sounds right up my humour alley.
Unfortunately all of that Cisco affiliation means we are forever blood enemies and can never speak again... ;). I kid, I came up through the Nortel heritage originally so I'm bound by contract to make such statements.
I've heard great things about the Fast Data Project, I'll definitely have to look into it some before the Oblivion remake comes out :). Maybe after this current project at work I'll finally get to mess with software based dataplanes properly.
It was great running into you here, I hope to catch you around more now that I know to look!
L2 is “relatively simple” in a sense that it’s usually under the same administrative control; unlike with L3. And even then, if you have a look at all the complexity involved in maintaining interop in the wireless space… it’s amazing it works as well as it does, with so much functionality being conditional.
> I came up through the Nortel heritage originally
My networking cradle is Netware 4.1, and in those times it was a zoo of protocols anyway. I really liked conceptually the elegance of Nortel management being SNMP-first. Makes me smile hearing all these “API-first!” claims today.
> It was great running into you here
Indeed, nice to meet you too ! :-)
I do a fair bit of lurking. Yesterday was a bit of an anomaly since the whole “truncation as a means to do PMTUD” was a subject of my idle ponder for more than a decade, so it struck a chord :-)
Data in transit is almost never split for reasons other than fragmentation to avoid MTU problems. Any such split necessarily defines a fragmentation and reconstruction protocol so it still "preserves" the original send length information needed for truncation detection. If they have gone truly crazy and implemented an entire stream protocol transparently backing their flows then their transparent inner point-to-point layer would need to be aware of truncation in much the same way it would need to be aware of MTU limits anyways.
Forwarding generally corrupted packets should not be a problem unless your middleboxes are aggressively engaging in layering violations. From the perspective of a middlebox that is not engaging in layering violations you just have headers with blobs of data. Truncating the blob of data is basically uninteresting; at most you recalculate your integrity tags at your appropriate layer. You do not and should not recompute anything at higher layers. Furthermore, your endpoints must already be robust to blobs of garbage that pass your integrity tag checking because it is trivial for malicious actors to send you blobs of garbage with correctly calculated integrity tags. And, even if you were fully isolated, you can still get correlated bit errors that result in a correct integrity tag despite payload bit errors. Every client implementation that is not grossly incompetent must already be robust to getting garbage. You only get problems when your middleboxes start mucking around and trying to be too smart and violating your point-point transport abstraction.
You still get unidirectional protocols because you should manage truncation information out-of-band of any of your protocols. UDP or any other protocol should not communicate back to the sender that truncation happened. You do that some other way or even do not bother to do it at all. This is extra channel information that you can choose to communicate to let the other endpoint know about channel properties to make better data encoding decisions. You can transmit that in-band, out-of-band, on a different protocol, whatever. This is a higher level property of the communication channel between you and the other side.
Truncation is better authenticated because the packet reaches the other, known, authenticated endpoint who is the entity who can inform you, over an authenticated channel, that the transport channel has problems. You do not get nonsense like ICMP too large messages which come from unknown, unauthenticated entities. Furthermore, truncated messages can still be authenticated as long as you authentication tag the base header which should never be in the truncated section (you still need to have a minimum MTU below which you should always reject, but that number is small and much smaller than existing MTUs).
> Data in transit is almost never split for reasons other than fragmentation to avoid MTU problems
Fragmentation is a specific (unrelated) term, it's not interchangeable with a split. You can have (depending on the protocols involved):
- A runt due to a collision
- A link drop during transmit
- A problem during cut-through type transport
You can do various things to combat some of these (such as fragment-free instead of cut-through in collision domains) but you can't guarantee that every PHY that IP ends up riding over can or should avoid these constraints.
> Forwarding generally corrupted packets should not be a problem unless your middleboxes are aggressively engaging in layering violations. From the perspective of a middlebox that is not engaging in layering violations you just have headers with blobs of data.
If "delivery of something somewhere" is your only definition of a problem, perhaps :p.
> Furthermore, your endpoints must already be robust to blobs of garbage that pass your integrity tag checking because it is trivial for malicious actors to send you blobs of garbage with correctly calculated integrity tags.
Not only the endpoints to garbage in the data payloads but equally the gear to garbage in the network headers. Be it full authentication or just error detection, you don't want to just forward things with a corrupted network header and hope it doesn't cause an issue or security violation. Things like CRCs or HMACs are done per layer precisely for this kind of reason, going to truncation requires dropping that safe handling.
> Every client implementation
As a side note: the concerns have less to do with the clients, they have full context and control of their sessions in software land, with few of the concerns that come with being the physical transport layer. Most of these considerations need to be thought through from the perspective of the intermediate boxes doing the transport/truncation instead.
> You still get unidirectional protocols because you should manage truncation information out-of-band of any of your protocols
Unidirectional protocols cannot be expected to punt directionality to a separate session. In general, any time the answer to a network conundrum (such as the two generals) sounds as easy as "just move that to a separate channel which has the information" you have either duplicated the problem in that channel or added functionality which might not be physically available (or directionally available for security use case reasons, or scalably available for multicast, or something else for a use case that isn't 'inside out' from what might pop in mind as a 'standard' session).
> Truncation is better authenticated because the packet reaches the other, known, authenticated endpoint who is the entity who can inform you, over an authenticated channel, that the transport channel has problems.
I'm still not sure I follow - how is the message between endpoints still authenticated if middleboxes can modify the bytes, breaking an HMAC and/or CRC (if any), and it still gets delivered? Having authenticated an endpoint exists at an address you've sent a packet to before does not automatically authenticate any packet which arrives.
You also skipped over any of the implications for network tunnels (secure/insecure) - is MTU discovery just not supposed to work in those use cases?
I think you can absolutely make a domain specific protocol which is happy to use truncation for MTU discovery, I just don't think anything which is supposed to be as universally usable as IP can.
Your first point appears to be about physical layer concerns. My suggestion was not meant to operate at that level. The proposed model assumes the physical layer guarantees point-point delivery of a distinct packet between adjacent nodes in the network with MTU limits manifesting as either discarding or rejecting the trailing portions of the packet.
I said there were no “problems” if there are no layering violations because you argued that recalculating checksums would be a layering violation. Either we say layering violations are unacceptable at which point my argument stands. Or we say layering violations are par for the course and you can just recalculate the checksums if you need to.
Unidirectional protocols with no back channel must assume the network channel parameters such as MTU. Adding truncation information which can be picked up at a different layer is just strictly more information you can feed into your protocol if it is designed to handle that. You can just not use it and act as if truncation is dropping if you want to. This is just strictly more data you can use for decisions.
You can still get authenticated transport in the presence of truncation if your protocol generates an authentication tag for the “original” length and puts it at the start of the message. Then you can authenticate the length field and verify truncation otherwise you can drop it.
I did not bother with tunnels because I do not see how it is a distinct problem. Tunnels already need to figure out how to manage their MTUs. Either the tunnel is transparently managing how it fragments data and can be enhanced to support truncation (though it does not need to, it can just drop truncation/malformed as they currently do) or it tells tunnel parameters to the endpoints so that the endpoints keep themselves in bound at which point the endpoints can detect whatever the MTU of the tunnel is.
And again, you can always just ignore truncated packets and act as if they are malformed which everybody already does. This is strictly more functionality which does not require changing all existing systems which can be used to support more efficient MTU discovery by systems and networks that supported it. And if they do not, you just fallback to the current, crusty way.
> Your first point appears to be about physical layer concerns. My suggestion was not meant to operate at that level.
The proposal doesn't operate at that level but it must be compatible with the operations of that level. I.e. that the physical layer can also cause truncation of layers riding on top of it needs to be accounted for in the way those upper layers consider what truncation means. The same is true for possible intermediate layers (which sorta aligns with the later conversations regarding tunnels, which are basically just more complicated forms of intermediate layers).
> The proposed model assumes the physical layer guarantees point-point delivery of a distinct packet between adjacent nodes in the network with MTU limits manifesting as either discarding or rejecting the trailing portions of the packet.
Then the proposed model isn't applicable to IP since an upper layer protocol cannot make guarantees about the behavior of lower level protocols it may be transported on.
In addition, discarding trailing portions of the packet still results in the aforementioned problems with consistency checks and forwarding behavior limitations for lower level layers which did abide by this behavior.
> Unidirectional protocols with no back channel
One cannot guarantee bidirectional protocols will be able/allowed to form a back channel either, I just used unidirectional as a more clear-cut example.
> You can just not use it and act as if truncation is dropping if you want to. This is just strictly more data you can use for decisions.
Well sure, the same is true of the ICMP method or an active probing method. The concern is less with sessions you don't care to PMTUD in the first place and more with how the truncation design affects the designs of such other use cases.
> You can still get authenticated transport in the presence of truncation if your protocol generates an authentication tag for the “original” length and puts it at the start of the message. Then you can authenticate the length field and verify truncation otherwise you can drop it.
I totally agree one can include an HMAC tag in your client<->client protocol to validate unmodified packets are authentic. This is regardless of whether truncation, ICMP packet too big, active PMTUD probing, or any other method is in place as, to this point, this is only about validating delivered packets which did fit in the MTU.
What isn't clicking is when a truncated message arrives how a (now invalid) HMAC helps you authenticate if this packet was completely spoofed by a malicious actor or really truncated by a middlebox. All you know is it was supposed to be longer and now something claims it needs to be shorter, how do you know that's not because of the same malicious actor who was supposed to be sending the fake ICMP packet too big rather than a middlebox really trying to signal the packet truly needed to be truncated?
> I did not bother with tunnels because I do not see how it is a distinct problem.
As highlighted earlier, tunnels may either encapsulate other protocols or encapsulate protocols which are expecting truncation. If the only things which existed in the world were client network interfaces it wouldn't be a problem, once more network devices become involved then you have to consider the impact on those too. The main thing to keep in mind is very few network middleboxes or tunnel protocols have the ability to do fragmentation on behalf of tunneled data, particularly if they are hardware based or based on protocols without such a feature (such as Ethernet) since this eats up TONS of hardware to do so (especially at high speeds). E.g. take an IPv6 VXLAN tunnel of an Ethernet frame on a 400 Gbps interface, how is a pure L3 intermediate carrier router doing truncation supposed to know not to update the UDP (a layer up the stack) checksum so the truncated Ethernet payload actually gets delivered to the client destination from the egress VTEP? It's not even that the egress VTEP needs some way to signal to the ingress VTEP how much the truncation was, it's that the original client which was VXLAN encapsulated by the ingressing VTEP needs its packet delivered to the remote client so the remote client can see the truncation and re-negotiate (in band or out of band) with the client to send smaller frames. This signaling will not occur because of the aforementioned UDP checksum being broken by an intermediate router. Just removing all checksums and allowing all modifications to headers and delivering whatever arrives would create not only high incidences of the propagation of deformed traffic but also security risks.
This brings us back to the example of secure tunnels, like IPsec, which have the same problem but in a much more succinct form. All parts of the payload of an IPsec tunnel are basically random noise after you truncate it, so there is no way to even attempt to consider sending the truncated payload to the intended destination. It's not the responsibility of the IPsec encapsulator to perform the encapsulation and the IPsec receiver usually doesn't have a path to communicate with the original client (not that it even knows who that is).
If you redesign everything about how network tunneling works under some severe limitations and assumptions then it may be possible to solve some (or maybe all if I can figure out what I'm missing regarding authentication of packets claiming MTU changes) of these problems but I'm not sure I could ever see the set of requirements needed as easier than the other MTU approaches. That doesn't necessarily mean I think there is an overall perfect answer at all, just that I think PMTUD and its variants are definitely the easier path.
I just do not understand the problems you are stating. Let me present a concrete example.
We have A <-> B <-> C. A wishes to transmit a packet of 0x1000 bytes containing an Ethernet, IPv4, and then bespoke protocol, P, which is a header containing a length, MAC on the length + header, MAC on entire packet, encrypted payload, in that order. A then prepares transmit descriptors pointing at the packet and with size 0x1000 bytes.
C prepares receive descriptors pointing to buffers with a maximum capacity of size 0x1000 bytes per packet. B prepares receive descriptors pointing to buffers with a maximum capacity of size 0x500 (1280) bytes per packet.
A transmits the packet to B. The physical coding layer transmits the bytes terminating in the FCS. B receives bytes and does a running computation of the FCS. Upon reaching 0x500 bytes, it stops storing data into memory, stores the current FCS into memory, then continues receiving the data and computing the FCS until the data stream ends. Upon determining that the FCS matches, it marks the descriptor as valid for consumption and stores that the descriptor contains 0x500 bytes of data. The transmit engine of B then configures a transmit descriptor pointing at the packet and with size 0x500 bytes.
C then receives the 0x500 byte packet from B and observes that the FCS matches the 0x500 byte FCS and marks the descriptor as valid for consumption and stores that the descriptor contains 0x500 bytes of data. C then processes the packet observing that the P header indicates a length of 0x1000 bytes, but only 0x500 bytes are available. It attempts to authenticate the P header MAC using a secret known only to A and C. As the truncation only hit the encrypted payload at the tail, the P header MAC and the header data it is authenticating have not been modified by the truncation process. As such, C is able to use the higher layer secret it shares with A to successfully authenticate the header data and determine that the header containing a length field with the value of 0x1000 bytes could have only been written by A and has not been tampered with. It then rejects the rest of the packet, but stores that the inbound MTU is only 0x500 bytes.
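The receive side of that walk-through can be sketched in Python as follows. This is a simplified stand-in for protocol P, not a faithful implementation: it keeps only the length field and a header MAC (HMAC-SHA256 is an assumed choice), leaves out the whole-packet MAC and the encryption, and ignores the Ethernet and IPv4 layers. All names are hypothetical.

```python
import hashlib
import hmac
import struct

LEN = struct.Struct("!I")   # hypothetical 4-byte big-endian length field
TAG_LEN = 32                # HMAC-SHA256 output size
HEADER_LEN = LEN.size + TAG_LEN


def seal(key, payload):
    """A's side: prefix the payload with its total length and a MAC over that length."""
    length = LEN.pack(HEADER_LEN + len(payload))
    tag = hmac.new(key, length, hashlib.sha256).digest()
    return length + tag + payload


def inspect(key, received):
    """C's side: authenticate the header, then compare declared and received length."""
    if len(received) < HEADER_LEN:
        raise ValueError("shorter than the minimum needed to hold the header")
    length, tag = received[:LEN.size], received[LEN.size:HEADER_LEN]
    expected = hmac.new(key, length, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("header MAC mismatch")
    (declared,) = LEN.unpack(length)
    state = "intact" if len(received) >= declared else "truncated"
    return state, declared


key = b"k" * 32
packet = seal(key, bytes(0x1000 - HEADER_LEN))
print(inspect(key, packet[:0x500]))  # ('truncated', 4096)
```

What the header MAC establishes here is that A declared a 0x1000-byte packet; it says nothing by itself about which node shortened it, which is the point the replies go on to argue about.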
In this process one can only show they are unable to authenticate the packet's length matches the length the header said it should have been i.e. you can only authenticate that nobody tried to claim the MTU should change. You have not provided any authentication to the parts of the message signaling the MTU is now supposed to be 0x500.
The authentication header can only help you authenticate when the MTU stayed the same as expected during delivery, it cannot help you authenticate the signals claiming MTU was supposed to be something else as those modifications, inherently, do not come from nodes partaking in the authentication header. The malicious middleman could falsely truncate a single packet to 0x500 bytes just as easily as they could falsely create an ICMP packet claiming the MTU is 0x500 bytes, in both cases the only thing you know for sure is "someone is trying to claim that last packet was too big".
With IPv4, clearing the DF bit in all egress packets and hacking on top of QUIC could give just enough of a wiggle room to make it possible to explore this between a pair of cooperating hosts even in today’s Internet.
Anti-DDoS middle boxes will be almost certainly unhappy with lone fragments and UDP in general, so it’s a bit of a thorny path.
The big question is what to do with IPv6, since the intermediary nodes will only drop. This bit unfortunately makes the whole exercise pretty theoretical, but it can be fun nonetheless to explore.
Feel free to contact me at my github userid at gmail, if this is a topic of interest.
Most carrier/enterprise/hardware IPv4 routers, particularly those on the internet, will not actually perform IPv4 fragmentation on behalf of the client traffic even though it's allowed by the IPv4 standard. Typically fragmentation is reserved for boxes which already have another reason to care about it (such as needing to NAT or inspect the packets) or the client endpoints themselves. I.e. the internet will (sparing security middleboxes) allow arbitrary IPv4 fragments through but it won't typically turn an 8000 byte packet into 6 fragments to fit through a 1500 byte MTU limitation on behalf of the clients. E.g. if you send a 1500 byte IPv4 ping without DF set to a cellular modem or someone with a DSL modem using PPPoE it'll almost always get dropped by the carrier rather than fragmented.
Of course nothing is stopping you from labbing it up at home. Firewalls and software routers can usually be made to do refragmentation.
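To make the fragment arithmetic above concrete, here is a minimal sketch. It assumes a 20 byte IPv4 header with no options, and the function name is made up for illustration. Every fragment except the last has to carry a multiple of 8 payload bytes, which is why the per-fragment payload gets rounded down.

```python
import math


def ipv4_fragment_count(packet_len: int, mtu: int, header_len: int = 20) -> int:
    """Number of IPv4 fragments needed for a packet of packet_len bytes.

    Assumes every fragment carries the same header_len (no IP options).
    """
    if packet_len <= mtu:
        return 1
    payload = packet_len - header_len
    # Fragment offsets are counted in 8-byte units, so all fragments
    # except the last carry a multiple of 8 payload bytes.
    per_fragment = (mtu - header_len) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any payload")
    return math.ceil(payload / per_fragment)


print(ipv4_fragment_count(8000, 1500))  # 6
print(ipv4_fragment_count(1500, 1492))  # 2 (1492 is the usual PPPoE MTU)
```

So the 1500 byte ping over PPPoE would only need two fragments, if the carrier box were willing to produce them.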
Of course, on the carrier boxes the fragmentation is also not done inline, so its behavior will depend on the aggressiveness of the CoPP configuration, and will be subject to the same pitfalls as the ICMP packet too big generation.
Thanks for keeping me straight here!
Based on the admittedly old study at [0] seems like some carriers just don’t bother to fragment, indeed - but by far not all of them.
Firewalls might do virtual reassembly, so the trick with the initial fragment won’t fly there.
This MTU subject is interesting for me because I have a little work in progress experiment: https://gerrit.fd.io/r/c/vpp/+/41914/1/src/plugins/pvti/pvti... (the code itself is already in, but still has a few crashy bugs, and I need to make it not suck performance-wise). That is my attempt to revisit the issue of MTU for the tunnel use case. The thesis is that keeping the 5-tuple will make “chunking”/“de-chunking” much, much simpler on the endpoints of the tunnel.
The source of inspiration was a very practical setup at [1], which, while looking horrible in theory (locally fragmented GRE over L2TP), actually gives decent performance with a 1500-byte end to end MTU over the tunnel.
The open question is which inner MTU will be sane, taking into account the increased probability of loss with bigger inner MTU… intuitively seems like something like ~2.5K should just double the loss probability (because it’s 2x packets) and might be a workable compromise in 2025….
One could also do the same trick over QUIC, of course, but I wanted something tiny and easier to experiment with - and the ability to go over IPSec or wireguard as well as a secured underlay.
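The "2x packets roughly doubles the loss" intuition can be checked with a few lines. This is only a sketch under the assumption that each outer packet is lost independently with the same probability; the function name and the example numbers are made up for illustration and say nothing about how the PVTI plugin actually chunks.

```python
def inner_packet_loss(outer_loss: float, chunks: int) -> float:
    """Probability that an inner packet split into `chunks` outer packets
    is lost, assuming independent loss per outer packet and no recovery."""
    return 1.0 - (1.0 - outer_loss) ** chunks


for chunks in (1, 2, 3):
    print(chunks, inner_packet_loss(0.01, chunks))
```

For small loss rates the result stays close to `chunks * outer_loss`, so two chunks at 1% outer loss gives just under 2%.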
Very interesting! It's like the best of the fragment-pre-encrypt world (everything appears as single packet 5 tuples to middleboxes) and fragment-post-encrypt world (transported packet data remains untouched) debate seen on IPsec deployments.
Like you mention you could do this under QUIC but then you'd be hamstrung to some of the design mandates such as encryption. This is way better as it's just datagrams doing your one goal - hiding that you're transporting fragments.
And how do you tell the difference between cut off packets, and a mtu drop? What about crcs / frame checks? Do you regenerate the frames? Do you do this at routed interfaces? What if there's just layer 2 only involved?
> And how do you tell the difference between cut off packets, and a mtu drop?
You don't, apart from enforcing a bare-minimum MTU for sanity's sake. If your jumbo-size packets are getting randomly cut off by a middlebox, then they probably aren't stable at that size anyway.
Packets do not get “cut-off” normally. That is kind of the point. Some protocols allow transparent fragmentation, but the fragments need to encode enough information for reconstruction, so you can still detect “less data received than encoded on send”.
You do not need bit error detection because you literally truncated the packet. The data is already lost. But in the process you learned it was due to MTU limits which is very useful. Protocols are already required to be robust to garbage that fails bit error detection anyways, so it is not “required” to always have valid integrity tags. You could transparently re-encode bit error detection on the truncated packet if you so desire to ensure data integrity of the “MTU resulted in truncation” packet that you are now forwarding, but again, not necessary.
Any end-to-end protocol that encodes the intended data size in-band can use this technique across truncating transport layers. And any protocol which does so already requires implementations to not blindly trust the in-band value otherwise you get trivial buffer overflows. So, all non-grossly insecure client implementations should already be able to safely handle MTU truncation if they received it (they would just not be able to use that for MTU discovery until they are updated). The only thing you need is routers to truncate instead of drop and then you can slowly update client implementations to take advantage of the new feature since this middlebox change should not break any existing implementations unless they are inexcusably insecure.
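A minimal sketch of what the receiving side of this could look like, assuming a made-up header (a 2 byte total length followed by an HMAC-SHA256 over that length field, keyed with a secret shared by the two endpoints). None of this is an existing protocol, and a real header MAC would also need to cover something like a sequence number to stop replays. Note it only shows that the claimed length came from the sender; it says nothing about who did the truncating.

```python
import hashlib
import hmac
import struct

TAG_LEN = 32              # HMAC-SHA256 output size
HEADER_LEN = 2 + TAG_LEN  # hypothetical header: length field + header MAC


def build_packet(secret: bytes, payload: bytes) -> bytes:
    """Prefix the payload with an authenticated total-length header."""
    length_field = struct.pack("!H", HEADER_LEN + len(payload))
    tag = hmac.new(secret, length_field, hashlib.sha256).digest()
    return length_field + tag + payload


def classify(secret: bytes, data: bytes):
    """Return ("ok", n), ("truncated", n) or ("reject", None)."""
    if len(data) < HEADER_LEN:
        return ("reject", None)  # below the sanity minimum
    length_field, tag = data[:2], data[2:HEADER_LEN]
    expected = hmac.new(secret, length_field, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return ("reject", None)  # header was not written by the peer
    (claimed_len,) = struct.unpack("!H", length_field)
    if len(data) == claimed_len:
        return ("ok", claimed_len)
    if len(data) < claimed_len:
        # Never read past len(data); just record how much arrived.
        return ("truncated", len(data))
    return ("reject", None)      # more bytes than the sender claimed


secret = b"shared between A and C"
packet = build_packet(secret, bytes(0x1000 - HEADER_LEN))
print(classify(secret, packet))          # ('ok', 4096)
print(classify(secret, packet[:0x500]))  # ('truncated', 1280)
```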
I don’t think you understand what it normally looks like if you start forwarding damaged frames like this, because you can’t tell the difference. That was the point.
I literally have no idea what you are talking about. You can send garbage packets that conform to no known protocol on the internet. You can get more bit errors or perfect bit errors that make your bit error detection pass while still forwarding corrupt payloads. Transport protocols and channels must be and are robust to this.
“Damaged” frames and frame integrity only matter if you need the contents of the entire packet to remain intact. Which you explicitly do not when truncating.
The only new problem that arises is that maybe the in-band length information or headers get corrupted resulting in misinterpreting the truncation that actually occurred. And again, you already need to be robust to garbage. And you can just change my proposal to recompute the integrity tag on the truncated data if you think that really matters.
PMTU just doesn't feel reliable to me because of poorly behaved boxes in the middle. The worst offender I've had to deal with was AWS Transit Gateway, which just doesn't bother sending ICMP too big messages. The second worst offender is, IMO (data center and ISP) routers that generate ICMP replies in their CPU, meaning large packets hit a rate limited exception punt path out of the switch ASIC over to the cheapest CPU they could find to put in the box. If too many people are hitting that path at the same time, (maybe) no reply for you.
More rare cases, but really frustrating to debug was when we had an L2 switch in the path with lower MTU than the routers it was joining together. Without an IP level stack, there is no generation of ICMP messages and that thing just ate larger packets. The even stranger case was when there was a Linux box doing forwarding that had segment offload left on. It was taking in several 1500 byte TCP packets from one side, smashing them into ~9000 byte monsters, and then tried to send those over a VPNish network interface that absolutely couldn't handle that. Even if the network in the middle bothered to generate the ICMP too big message, the source would have been thoroughly confused because it never sent anything over 1500.
> The even stranger case was when there was a Linux box doing forwarding that had segment offload left on. It was taking in several 1500 byte TCP packets from one side, smashing them into ~9000 byte monsters, and then tried to send those over a VPNish network interface that absolutely couldn't handle that. Even if the network in the middle bothered to generate the ICMP too big message, the source would have been thoroughly confused because it never sent anything over 1500.
This is an old Linux tcp offloading bug; large receive offload smooshes the inbound packet, then it's too big to forward.
I had to track down the other side of this. FreeBSD used to resend the whole send queue if it got a too big message, even if the size did not change. Sending all at once made it pretty likely for the broken forwarder to get packets close enough to do LRO, which resulted in large enough packet sending to show up as network problems.
I don't remember where the forwarder seemed to be, somewhere far away, IIRC.
> PMTU just doesn't feel reliable to me because of poorly behaved boxes in the middle. The worst offender I've had to deal with was AWS Transit Gateway, which just doesn't bother sending ICMP too big messages.
Because UDP is only a very thin layer, each layer on top (e.g., QUIC) has to implement PLPMTUD; although, recently the IETF standardised a way to extend UDP to have options, and PLPMTUD is also specified for that: https://datatracker.ietf.org/doc/draft-ietf-tsvwg-udp-option...
> The speed of light in glass or fiber-optic cable is significantly slower, at approximately 194,865 kilometers per second. The speed of voltage propagation in copper is 224,844 kilometres per second.
If I understand correctly, the speed of light in an electrical cable doesn't depend on the metal that carries current, but instead depends on the dielectric materials (plastic, air, etc.) between the two conductors?
If I’m interpreting what you’re asking correctly, yes. The velocity factor of a cable doesn’t depend on the metal it’s made of but rather the insulator material and the geometry of the cable.
For fibre the velocity factor depends on the refraction index of the fibre.
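For reference, the two speeds quoted from the article work out to round velocity factors. A quick sketch using the defined speed of light in vacuum; the function names and the 1000 km example distance are just for illustration.

```python
C_VACUUM_KM_S = 299_792.458  # speed of light in vacuum, km/s


def velocity_factor(speed_km_s: float) -> float:
    """Propagation speed as a fraction of c."""
    return speed_km_s / C_VACUUM_KM_S


def refractive_index(speed_km_s: float) -> float:
    """Effective refractive index implied by a propagation speed."""
    return C_VACUUM_KM_S / speed_km_s


def one_way_delay_ms(distance_km: float, speed_km_s: float) -> float:
    """One-way propagation delay in milliseconds."""
    return distance_km / speed_km_s * 1000.0


print(round(velocity_factor(194_865), 2))   # 0.65 (fibre figure from the quote)
print(round(velocity_factor(224_844), 2))   # 0.75 (copper figure from the quote)
print(round(refractive_index(194_865), 2))  # 1.54
print(round(one_way_delay_ms(1000, 194_865), 2))  # 5.13
```

So the quoted numbers are simply 0.65c and 0.75c, which fits the point that they describe a particular fibre and a particular cable construction rather than glass or copper as such.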
Huh? Maybe I'm completely misreading the question, but when they say fiber-optic cable, they do mean optic. It's not an "electrical cable"; there is no metal needed in optic communication cables (perhaps for stiffness or whatnot, but not for the communication)
The site specifies a base font size of 12px. The better practice is to not specify a base font size at all, just taking it from the user's web browser instead. Then, the web designer should specify every other font size and box dimension as a scaled version of the base font size, using units like em/rem/%, not px.
It's the same size as HN: 12px. HN looks larger to me for some reason, but I can't figure out why: when I overlay a quote someone posted here over the website with half transparency in GIMP, the text is clearly the same height. Some letters are wider, some narrower, but the final length of the 8 words I sampled is 360px on HN vs. 358px on that website (so differences basically cancel out)
This is on Firefox/Debian, in case that means something for installed fonts. I see that site's CSS specifies Verdana and Arial, names that sound windowsey to me but I have no idea if my system has (analogous versions to) those
Is there any convenient way to tell linux distributions that the local subnet can handle 9k jumbos (or whatever) but that anything routed out must be 1500?
I currently have this solved by just sticking hosts on two vlans, one that has the default route and another that only has the jumbo capable hosts. ... but this seems kinda stupid.
The efficiency argument applies to private flows mostly. In terms of overall network traffic, the huge majority takes place between peers that share a local or private network. Internetworking as such has a relatively small share of total flows. So large frame sizes are beneficial in the context where they are also not problematic, and path MTU discovery is not beneficial in the context where it has many drawbacks. It seems as though the current state is pretty much optimal.
If you've ever tried to enable jumbo packets on your LAN, you'd soon learn that it causes lots of problems.
First, every L2 "dumb" switch that doesn't support your jumbogram size just silently drops the packet, which is no good.
Then, you have to figure out what size of jumbogram every device on your network supports, and select the minimum. In many cases, you'll have clients that don't support it at all.
And I hope all your OSes support setting an MTU per route, and you enjoy setting special routes on all of your clients, since Path MTU discovery, even where it is enabled and supported, at the very least adds latency to every connection, if it even works at all.
And god help you once you try to scale up your sweet jumbo frame solution. Plenty of routers have strict ICMP rate limits either imposed in software or hardware (because ICMP may be handled in an anemic CPU). So those ICMP fragmentation needed packets aren't reliably returned to your clients. It's even worse if your ISP doesn't block jumbograms outright. You will soon learn which of your ISP's peerings do or don't support jumbograms and whether they do or don't emit or forward ICMP.
The only advisable way to use jumbo frames is if you are running a datacenter and you have a group of machines that can be properly configured for route-based MTU and that would benefit from jumbo frames, and every piece of hardware you buy is carefully specced to support it.
Throughputs were in the multiple tens of thousands of transactions per second and latencies were in single digit milliseconds (in later years these would drop to double digit microseconds, but that's a different story). Components were written in C++, running on Linux. The machine that ran my component and the trading engine were neighbors in a LAN.
We put my component through a full battery of performance tests, and for a while, we seemed to be meeting the numbers. Then one day, with absolutely zero code changes from my end or the trading engine's end, the latency numbers collapsed. We checked the hardware configs and the rate at which the latest test was run. Both identical.
It took, I think, several days to solve the mystery: in the latest test run, we had added one extra away market to a list of 7 or 8 markets for which my component provided market data to the trading system. We had added markets before without an issue. It's a negligible change to the market data message size, because it only adds a few bytes: market ID, best bid price & quantity, best offer price & quantity. In no way should such a small change result in a disproportionate collapse in the latency numbers. It took a while for us to realize that before the addition of these few bytes, our market data message (a binary packed format), neatly fit into a single ethernet frame. Those extra few bytes pushed it over the 1600 (or 1500?) mark and caused all market data message frames (which were the bulk of messages on the system, next to orders), to fragment. The frame fragmentation and reassembly overhead was enough to clog up the pipes at the rates we were pumping data.
In the short run, I think we managed to do some tweaks and get the message back under 1600 bytes (by omitting markets that did not have a current bid/offer, rather than sending NULLs). I can't recall what we did in the long run.
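A rough sketch of the kind of size check that would have caught this early. Everything here is assumed for illustration: the record layout, the field widths, UDP over IPv4 as the transport, and the 1260 bytes standing in for the rest of the message are all hypothetical, not the real format.

```python
import struct

ETHERNET_MTU = 1500
IPV4_HEADER = 20  # assumed: no IP options
UDP_HEADER = 8    # assumed transport; the original may have differed
PAYLOAD_BUDGET = ETHERNET_MTU - IPV4_HEADER - UDP_HEADER  # 1472 bytes

# Hypothetical packed record: market id, bid price, bid qty, offer price, offer qty.
QUOTE = struct.Struct("<HqIqI")  # 26 bytes, no padding


def encode_quotes(quotes, skip_empty: bool) -> bytes:
    """Pack (market_id, bid, offer) tuples; bid/offer are (price, qty) or None."""
    out = bytearray()
    for market_id, bid, offer in quotes:
        if skip_empty and bid is None and offer is None:
            continue  # omit the market instead of sending NULLs
        bid_px, bid_qty = bid or (0, 0)
        offer_px, offer_qty = offer or (0, 0)
        out += QUOTE.pack(market_id, bid_px, bid_qty, offer_px, offer_qty)
    return bytes(out)


def fits_in_one_frame(fixed_part_len: int, quotes_blob: bytes) -> bool:
    """True if the whole message fits in a single unfragmented frame."""
    return fixed_part_len + len(quotes_blob) <= PAYLOAD_BUDGET


# Eight live markets plus one with no current bid/offer.
quotes = [(i, (10_000, 5), (10_100, 7)) for i in range(8)] + [(8, None, None)]
print(fits_in_one_frame(1260, encode_quotes(quotes, skip_empty=False)))  # False
print(fits_in_one_frame(1260, encode_quotes(quotes, skip_empty=True)))   # True
```

An assertion like this in the encoder's tests turns a multi-day latency mystery into a failing build.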
Unless you control the entire set of possible paths (can be many!) and set all the MTUs to match well, this (while maybe on the surface helping with the problem, depending on many things) can set one up with a nasty footgun, whereby a black hole will show up at the most terrible moment of high traffic. See my PMTUD/PLPMTUD rant elsewhere in this thread.
This allows you to do MTU discovery as an endpoint protocol with all the authentication benefits that provides and allows you to send a single large probe packet to precisely identify the MTU size. It would also allow you to immediately and transparently identify MTU reductions due to route changes or any other such cause instead of packets just randomly blackholing or getting responses from unknown, unauthenticated endpoints.
Truncation for all packet types: data in transit can occasionally get split for other reasons. Right now that's just made into loss, if we had built every protocol layer on the idea it should forward anyways then any instances of this type of loss also become MTU renegotiations, at best. At worst we're having to forward generally corrupted packets which can cause all sorts of other problems. It'd be another layering violation to require that e.g. an L2 switch must adjust the UDP checksum when it's intentionally truncating a packet, but that'd be the only way to avoid that. Tunnels (particularly secure) are also tricky here (you need to run multiple separate layers of this continuously to avoid truncation information not propagating to the right endpoints). It also doesn't allow for truly unidirectional protocols e.g. a UDP video stream as there is no allowance for out of session signaling to be possible.
The above is for "if we have started networking day 1 with this plan in mind". There are of course additional problems given we didn't. I'm also not sure I follow how allowing any intermediate node to truncate a packet is any more authenticated.
The (still ugly) beauty of using PMTUD-style approach over truncation or probe+notification is it doesn't try to make assumptions about how anything in the middle could ever work for the rest of time, and that makes it both simple (despite sounding like a lot of work) and reliable. You and your peer just exchange packets until you find the biggest size that fits (or that you care to check for) and you're off! MTU changes due to a path change? No problem, it's just part of your "I had a connection and the other side seems to have stopped responding. How do I attempt to continue" logic (be that retry a new session or attempt to be smart about it). It also plays nice with the ICMP too large messages - if they are there you can choose to listen, if they are not it still "just works".
Or, like the article says, safe minimums can be more practical.
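The "exchange packets until you find the biggest size that fits" part can be sketched as a binary search. This is a simplification: `probe` is a hypothetical callable that sends a padded packet of the given size and reports whether the peer acknowledged it, and the sketch assumes a probe only fails because of size. Real implementations have to retry, since a probe can also be lost to congestion.

```python
from typing import Callable


def find_path_mtu(probe: Callable[[int], bool], low: int = 1280, high: int = 9000) -> int:
    """Largest size in [low, high] for which probe(size) succeeds.

    Assumes that if a size gets through, every smaller size does too.
    """
    if not probe(low):
        raise RuntimeError("even the floor size did not get through")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid        # mid fits, search above it
        else:
            high = mid - 1   # mid was dropped, search below it
    return low


# Simulated path whose smallest hop carries at most 1420 bytes.
print(find_path_mtu(lambda size: size <= 1420))  # 1420
```

With the default bounds that is at most 14 probes including the floor check, and the same loop can be re-run whenever the peer goes quiet after a path change.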
PMTUD somehow missed that packet networks ditching the OOB mechanisms of circuit switched networks was a good thing. By adding an OOB mechanism of attempted MTU discovery. Unauthenticated.
Yes, matching the 5-tuple from the original payload somewhat helps against the obvious security problem with this. (It was a fun 3-4 years while it was being added to systems across the ‘net while everyone was blocking the ICMP outright to avoid the exploitation. The burps of that one might still find in some security guidelines)
But the number of network admins who understand what they have to configure in their ACLs, and why, is scarily small compared to the overall pool size.
Here’s another hurdle: for about two decades, to generate ICMP you have to punt the packet from hardware forwarding to the slow path. Which gets rate-limited. Which gives one a fantastic way to create extremely entertaining and hard to debug problems: a single misbehaving or malicious flow can disable the ICMP generation for everyone else.
Make hardware that can do it in fast path ? Even if you don’t punt - you still have to rate-limit to prevent the unauthenticated amplification attack (28 bytes of added headers is not comparable with some of the DNS or NTP scenarios, but not great anyway)
So - practically speaking, it can’t be relied on, other than a source for great stories.
PLPMTUD is a little better, in a sense that it attempts to limit itself to inband probes, but then there is the delicate dance of loss customarily being used to signal the congestion.
So this mechanism isn’t too reliable either, in very painful ways for the poor soul on call dealing with the outcomes. Ask me how I know.. ;-)
Now, let’s add to this the extremely pragmatic and evil hack that is the TCP MSS clamping, coming back from the first PPPoE days; which makes just enough of the eyeball traffic work to make this a “small problem with unimportant traffic that no one cares for anyway”.
So yes, safe minimums are a practical solution.
Until one starts to build the tunnels, that is. A wireguard tunnel inside IPSec tunnel. Because policy. Inside VXLAN tunnel inside another IPSec tunnel, because SD-WAN. Which traverses NAT64, because transition and address scarcity.
At which point the previously safe minimums might not be safe anymore and we are back to square 1. I suspect when folks start running QUIC over wireguard/ipsec/vxlan + IPv6 en masse we will learn that (surprise!) 1200 was not a safe value after all.
So, with this in mind, I posit it’s nice to attempt to at least fantasize about the universe where MTU determination would be done entirely inline, even if hypothetical - if we had the benefit of today’s hindsight and could time travel - could we have made it better ?
P.s. unidirectional protocols could be taken care of by fountain codes not unlike the I-, P- and B- frames in video world, with similar trade offs, moreover, I feel the unequal probability of loss depending on a place in the packet might allow for some interesting tricks.
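To put rough numbers on the tunnel stacking described above, here is a back-of-the-envelope sketch. The per-layer overheads are assumptions for illustration: the IPsec ESP figure in particular varies with cipher, mode and padding, and the WireGuard and VXLAN figures are for an IPv4 outer header.

```python
# (description, assumed overhead in bytes), outermost layer first
LAYERS = [
    ("NAT64, IPv4 to IPv6 header growth", 20),
    ("outer IPsec ESP tunnel (assumed)", 70),
    ("VXLAN over IPv4", 50),
    ("inner IPsec ESP tunnel (assumed)", 70),
    ("WireGuard over IPv4", 60),
]

QUIC_MIN_UDP_PAYLOAD = 1200  # smallest datagram size QUIC expects to work
IPV6_HEADER = 40
UDP_HEADER = 8


def inner_mtu(path_mtu: int, layers) -> int:
    """MTU left for the innermost packet after every encapsulation layer."""
    remaining = path_mtu
    for _name, overhead in layers:
        remaining -= overhead
    return remaining


available = inner_mtu(1500, LAYERS)
needed = QUIC_MIN_UDP_PAYLOAD + UDP_HEADER + IPV6_HEADER
print(available)            # 1230
print(needed)               # 1248
print(needed <= available)  # False
```

With these assumed overheads, a 1500 byte path leaves 1230 bytes inside the innermost tunnel, which is already 18 bytes short of a minimum-size QUIC datagram over IPv6.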
With regard to the problems of out of band signaling in plain PMTUD I fully agree with all your well stated points, doubly so on PLPMTUD! PLPMTUD is my preferred variation of PMTUD and I was glad to see the datagram form utilized in QUIC (especially since it's really a generic secure network tunneling protocol, not just the HTTP variant). I'm also glad QUIC's security model naturally got rid of MSS clamping... it was somewhat pragmatic in one view... but concerning/problematic in others :D. Of course it's not like TCP/mss clamping have exactly gone away though :/.
Also fully agree on both PLPMTUD still not being as reliable/fast as one would like (though I still think it's the best of the options) + safe minimums never seeming to stay "safe". At least IPv6 attempted to hedge this by putting pressure on network admins, saying "everyone is expecting 1280". Of course... we all know that doesn't mean every client ends up with 1280, particularly if they are doing their own VPN tunnel or something, but at least it gives us network guys an extra wall of "well, the standard says we need to allow expectation of 1280 and the rate of bad things which happen will be much higher below that".
You seem to have some really neat perspectives on networking, do you mind if I ask about what you do/where you got your experience? I came up through the customer side and eventually over time morphed my way into NOS development at some network OEMs and it feels like I run into fewer and fewer folks who deal with the lower layers of networking as time has gone on. I think the most "fun" parts are trying to design overlay/tunneling systems which are hardware compatible with existing ASICs or protocols but are able to squeeze some more cleverness out of the usage (or, as you put it, if we had the benefit of today’s hindsight and could time travel - could we have made it better). The area I'd say I've been least involved in, but would like to, is anything to do with time sensitive networking or lossless ethernet use cases.
This works great until there is an app that is expecting 1280 and there is an operator that gives you 1280, and you have to run this app over an encrypted GENÈVE tunnel that attempts to add half a kilobyte of metadata :-). RADIUS with EAP or DHCP with a bunch of options can be a good example of a user app like this. Unfortunately this is a real-world problem.
The smaller mismatch but nonetheless painful is the 20 byte difference between IPv4 and IPv6 header sizes. It trips up every NAT64 deployment.
> where you got your experience?
A long path along all the OSI layers :-). Fiber and UTP networks install between ~95 and 2000. CCIE R&S#5423 in ‘99 and from 2000 almost 10 years in TAC and one of the first CCIE in Europe. Then some years working on IPv6 transition. Large scale IPv6 WiFi. Some folks know me by “happy eyeballs”; some by a “nats are good” YouTube video (scariest thing it’s still funny a decade later). These days - relops at fd.io VPP + internal CI/CD pipeline for a bunch of projects using VPP; and as a side gig - full-cycle automation of the switched fleet (~500 boxes) at #CLEUR installations. One of the recent fun projects was [0] - probably industry first of this scale, for an event network: more than 15K WiFi clients on IPv6Mostly. Though we were benefitting from work of a lot of folks that pushed the standardization and did smaller/more controlled deployments, specifically to shout huge thanks to Jen Linkova and Ondřej Caletka.
If you like low level network stuff, you might like VPP - and given it’s Apache licensed, pretty easy to use it for your own project.
[0] https://www.ietf.org/proceedings/122/slides/slides-122-iepg-...
One minor Ethernet MTU thing I would change with a time machine is to have the network header portion of the MTU be more like 802.11. I.e. instead of sized exactly to the headers of the day it intentionally was larger to allow variation over time. It wouldn't really do anything for most of the MTU concerns discussed here or for clients but I think it would have been helpful for the evolution of wired protocols.
Happy eyeballs! Yes, I loved that one! I was always a huge IPv6 nerd as well, though I didn't get started until shortly after that. The "nats are good" video isn't ringing any bells but if you have a link I'd definitely give it a watch as it sounds right up my humour alley.
Unfortunately all of that Cisco affiliation means we are forever blood enemies and can never speak again... ;). I kid, I came up through the Nortel heritage originally so I'm bound by contract to make such statements.
I've heard great things about the Fast Data Project, I'll definitely have to look into it some before the Oblivion remake comes out :). Maybe after this current project at work I'll finally get to mess with software based dataplanes properly.
It was great running into you here, I hope to catch you around more now that I know to look!
L2 is “relatively simple” in the sense that it’s usually under the same administrative control, unlike L3. And even then, if you have a look at all the complexity involved in maintaining the interop in the wireless space… it’s amazing it works as well as it does, with so much functionality being conditional.
> "nats are good" video isn't ringing any bells
https://youtu.be/v26BAlfWBm8?feature=shared - it was a bit of a meme back at the time in making the “X fanboy” videos.
> I came up through the Nortel heritage originally
My networking cradle is Netware 4.1, and in those times it was a zoo of protocols anyway. I really liked conceptually the elegance of Nortel management being SNMP-first. Makes me smile hearing all these “API-first!” claims today.
> It was great running into you here
Indeed, nice to meet you too ! :-)
I do a fair bit of lurking. Yesterday was a bit of an anomaly since the whole “truncation as a means to do PMTUD” has been a subject of my idle pondering for more than a decade, so it struck a chord :-)
Data in transit is almost never split for reasons other than fragmentation to avoid MTU problems. Any such split necessarily defines a fragmentation and reconstruction protocol so it still "preserves" the original send length information needed for truncation detection. If they have gone truly crazy and implemented an entire stream protocol transparently backing their flows then their transparent inner point-to-point layer would need to be aware of truncation in much the same way it would need to be aware of MTU limits anyways.
Forwarding generally corrupted packets should not be a problem unless your middleboxes are aggressively engaging in layering violations. From the perspective of a middlebox that is not engaging in layering violations you just have headers with blobs of data. Truncating the blob of data is basically uninteresting; at most you recalculate your integrity tags at your appropriate layer. You do not and should not recompute anything at higher layers. Furthermore, your endpoints must already be robust to blobs of garbage that pass your integrity tag checking because it is trivial for malicious actors to send you blobs of garbage with correctly calculated integrity tags. And, even if you were fully isolated, you can still get correlated bit errors that result in a correct integrity tag despite payload bit errors. Every client implementation that is not grossly incompetent must already be robust to getting garbage. You only get problems when your middleboxes start mucking around and trying to be too smart and violating your point-point transport abstraction.
You still get unidirectional protocols because you should manage truncation information out-of-band of any of your protocols. UDP or any other protocol should not communicate back to the sender that truncation happened. You do that some other way or even do not bother to do it at all. This is extra channel information that you can choose to communicate to let the other endpoint know about channel properties to make better data encoding decisions. You can transmit that in-band, out-of-band, on a different protocol, whatever. This is a higher level property of the communication channel between you and the other side.
Truncation is better authenticated because the packet reaches the other, known, authenticated endpoint who is the entity who can inform you, over an authenticated channel, that the transport channel has problems. You do not get nonsense like ICMP too large messages which come from unknown, unauthenticated entities. Furthermore, truncated messages can still be authenticated as long as you authentication tag the base header which should never be in the truncated section (you still need to have a minimum MTU below which you should always reject, but that number is small and much smaller than existing MTUs).
Fragmentation is a specific (unrelated) term, it's not interchangeable with a split. You can have (depending on the protocols involved):
- A runt due to a collision
- A link drop during transmit
- A problem during cut-through type transport
You can do various things to combat some of these (such as fragment-free instead of cut-through in collision domains) but you can't guarantee every phy IP ends up riding over can or should avoid these constraints.
> Forwarding generally corrupted packets should not be a problem unless your middleboxes are aggressively engaging in layering violations. From the perspective of a middlebox that is not engaging in layering violations you just have headers with blobs of data.
If "delivery of something somewhere" is your only definition of a problem, perhaps :p.
> Furthermore, your endpoints must already be robust to blobs of garbage that pass your integrity tag checking because it is trivial for malicious actors to send you blobs of garbage with correctly calculated integrity tags.
Not only the endpoints to garbage in the data payloads but equally the gear to garbage in the network headers. Be it full authentication or just error detection, you don't want to just forward things with a corrupted network header and hope it doesn't cause an issue or security violation. Things like CRCs or HMACs are done per layer precisely for this kind of reason, going to truncation requires dropping that safe handling.
> Every client implementation
As a side note: the concerns have less to do with the clients; they have full context and control of their sessions in software land, with few of the concerns that come with being the physical transport layer. Most of these considerations need to be thought through from the perspective of the intermediate boxes doing the transport/truncation instead.
> You still get unidirectional protocols because you should manage truncation information out-of-band of any of your protocols
Unidirectional protocols cannot be expected to punt directionality to a separate session. In general, any time the answer to a network conundrum (such as the two generals) sounds as easy as "just move that to a separate channel which has the information" you have either duplicated the problem in that channel or added functionality which might not be physically available (or directionally available for security use case reasons, or scalably available for multicast, or something else for a use case that isn't 'inside out' from what might pop in mind as a 'standard' session).
> Truncation is better authenticated because the packet reaches the other, known, authenticated endpoint who is the entity who can inform you, over an authenticated channel, that the transport channel has problems.
I'm still not sure I follow - how is the message between endpoints still authenticated if middleboxes can modify the bytes, breaking an HMAC and/or CRC (if any), and it still gets delivered? Having authenticated that an endpoint exists at an address you've sent a packet to before does not automatically authenticate any packet which arrives.
You also skipped over any of the implications for network tunnels (secure/insecure) - is MTU discovery just not supposed to work in those use cases?
I think you can absolutely make a domain specific protocol which is happy to use truncation for MTU discovery, I just don't think anything which is supposed to be as universally usable as IP can.
I said there were no “problems” if there are no layering violations because you argued that recalculating checksums would be a layering violation. Either we say layering violations are unacceptable at which point my argument stands. Or we say layering violations are par for the course and you can just recalculate the checksums if you need to.
Unidirectional protocols with no back channel must assume the network channel parameters such as MTU. Adding truncation information which can be picked up at a different layer is just strictly more information you can feed into your protocol if it is designed to handle that. You can just not use it and act as if truncation is dropping if you want to. This is just strictly more data you can use for decisions.
You can still get authenticated transport in the presence of truncation if your protocol generates an authentication tag for the “original” length and puts it at the start of the message. Then you can authenticate the length field and verify truncation; otherwise you can drop it.
I did not bother with tunnels because I do not see how it is a distinct problem. Tunnels already need to figure out how to manage their MTUs. Either the tunnel is transparently managing how it fragments data and can be enhanced to support truncation (though it does not need to, it can just drop truncation/malformed as they currently do) or it tells tunnel parameters to the endpoints so that the endpoints keep themselves in bound at which point the endpoints can detect whatever the MTU of the tunnel is.
And again, you can always just ignore truncated packets and act as if they are malformed, which everybody already does. This is strictly more functionality which does not require changing all existing systems and which can be used to support more efficient MTU discovery by systems and networks that support it. And if they do not, you just fall back to the current, crusty way.
The proposal doesn't operate at that level but it must be compatible with the operations of that level. I.e. that the physical layer can also cause truncation of layers riding on top of it needs to be accounted for in the way those upper layers consider what truncation means. The same is true for possible intermediate layers (which sorta aligns with the later conversations regarding tunnels, which are basically just more complicated forms of intermediate layers).
> The proposed model assumes the physical layer guarantees point-point delivery of a distinct packet between adjacent nodes in the network with MTU limits manifesting as either discarding or rejecting the trailing portions of the packet.
Then the proposal isn't applicable to IP, since an upper layer protocol cannot make guarantees about the behavior of lower level protocols it may be transported on.
In addition, discarding trailing portions of the packet still results in the aforementioned problems with consistency checks and forwarding behavior limitations for lower level layers which did abide by this behavior.
> Unidirectional protocols with no back channel
One cannot guarantee bidirectional protocols will be able/allowed to form a back channel either, I just used unidirectional as a more clear-cut example.
> You can just not use it and act as if truncation is dropping if you want to. This is just strictly more data you can use for decisions.
Well sure, the same is true of the ICMP method or an active probing method. The concern is less with sessions you don't care to PMTUD in the first place and more with how the truncation design affects the designs of such other use cases.
> You can still get authenticated transport in the presence of truncation if your protocol generates an authentication tag for the “original” length and puts it at the start of the message. Then you can authenticate the length field and verify truncation otherwise you can drop it.
I totally agree one can include an HMAC tag in your client<->client protocol to validate unmodified packets are authentic. This is regardless of whether truncation, ICMP packet too big, active PMTUD probing, or any other method is in place as, to this point, this is only about validating delivered packets which did fit in the MTU.
What isn't clicking is when a truncated message arrives how a (now invalid) HMAC helps you authenticate if this packet was completely spoofed by a malicious actor or really truncated by a middlebox. All you know is it was supposed to be longer and now something claims it needs to be shorter, how do you know that's not because of the same malicious actor who was supposed to be sending the fake ICMP packet too big rather than a middlebox really trying to signal the packet truly needed to be truncated?
> I did not bother with tunnels because I do not see how it is a distinct problem.
As highlighted earlier, tunnels may either encapsulate other protocols or encapsulate protocols which are expecting truncation. If the only things which existed in the world were client network interfaces it wouldn't be a problem; once more network devices become involved you have to consider the impact on those too. The main thing to keep in mind is that very few network middleboxes or tunnel protocols have the ability to do fragmentation on behalf of tunneled data, particularly if they are hardware based or based on protocols without such a feature (such as Ethernet), since this eats up TONS of hardware to do so (especially at high speeds). E.g. take an IPv6 VXLAN tunnel of an Ethernet frame on a 400 Gbps interface: how is a pure L3 intermediate carrier router doing truncation supposed to know not to update the UDP (a layer up the stack) checksum so the truncated Ethernet payload actually gets delivered to the client destination from the egress VTEP? It's not even that the egress VTEP needs some way to signal to the ingress VTEP how much the truncation was, it's that the original client which was VXLAN encapsulated by the ingressing VTEP needs its packet delivered to the remote client so the remote client can see the truncation and re-negotiate (in band or out of band) with the client to send smaller frames. This signaling will not occur because of the aforementioned UDP checksum being broken by an intermediate router. Just removing all checksums and allowing all modifications to headers and delivering whatever arrives would create not only high incidences of the propagation of deformed traffic but also security risks.
This brings us back to the example of secure tunnels, like IPsec, which have the same problem but in a much more succinct form. All parts of the payload of an IPsec tunnel are basically random noise after you truncate it, so there is no way to even attempt to consider sending the truncated payload to the intended destination. It's not the responsibility of the IPsec encapsulator to perform the encapsulation and the IPsec receiver usually doesn't have a path to communicate with the original client (not that it even knows who that is).
If you redesign everything about how network tunneling works under some severe limitations and assumptions then it may be possible to solve some (or maybe all, if I can figure out what I'm missing regarding authentication of packets claiming MTU changes) of these problems, but I'm not sure I could ever see the set of requirements needed as easier than the other MTU approaches. That doesn't necessarily mean I think there is an overall perfect answer at all, just that I think PMTUD and its variants are definitely the easier path.
We have A <-> B <-> C. A wishes to transmit a packet of 0x1000 bytes containing an Ethernet header, an IPv4 header, and then a bespoke protocol, P, which is a header containing a length, MAC on the length + header, MAC on entire packet, encrypted payload, in that order. A then prepares transmit descriptors pointing at the packet and with size 0x1000 bytes.
C prepares receive descriptors pointing to buffers with a maximum capacity of size 0x1000 bytes per packet. B prepares receive descriptors pointing to buffers with a maximum capacity of size 0x500 (1280) bytes per packet.
A transmits the packet to B. The physical coding layer transmits the bytes terminating in the FCS. B receives bytes and does a running computation of the FCS. Upon reaching 0x500 bytes, it stops storing data into memory, stores the current FCS into memory, then continues receiving the data and computing the FCS until the data stream ends. Upon determining that the FCS matches, it marks the descriptor as valid for consumption and stores that the descriptor contains 0x500 bytes of data. The transmit engine of B then configures a transmit descriptor pointing at the packet and with size 0x500 bytes.
C then receives the 0x500 byte packet from B and observes that the FCS matches the 0x500 byte FCS and marks the descriptor as valid for consumption and stores that the descriptor contains 0x500 bytes of data. C then processes the packet, observing that the P header indicates a length of 0x1000 bytes, but only 0x500 bytes are available. It attempts to authenticate the P header MAC using a secret known only to A and C. As the truncation only hit the encrypted payload at the tail, the P header MAC and the header data it is authenticating have not been modified by the truncation process. As such, C is able to use the higher layer secret it shares with A to successfully authenticate the header data and determine that the header containing a length field with the value of 0x1000 bytes could have only been written by A and has not been tampered with. It then rejects the rest of the packet, but stores that the inbound MTU is only 0x500 bytes.
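A minimal Python sketch of C's side of this walkthrough, to make the check concrete. The layout is an assumption for illustration only: a hypothetical 4-byte length field followed by an HMAC-SHA256 tag over that field, then the payload. The whole-packet MAC, the encryption, and the Ethernet/IPv4 headers from the description are left out, and the names `build_packet` and `receive` are made up here.

```python
import hashlib
import hmac
import struct

HEADER_FMT = "!I"                      # hypothetical P header: 4-byte total length
HEADER_LEN = struct.calcsize(HEADER_FMT)
TAG_LEN = hashlib.sha256().digest_size  # 32-byte HMAC-SHA256 tag over the header


def build_packet(key: bytes, payload: bytes) -> bytes:
    """A's side: length header, then a MAC over that header, then the payload."""
    total_len = HEADER_LEN + TAG_LEN + len(payload)
    header = struct.pack(HEADER_FMT, total_len)
    tag = hmac.new(key, header, hashlib.sha256).digest()
    return header + tag + payload


def receive(key: bytes, data: bytes):
    """C's side: returns ("complete" | "truncated" | "reject", observed size)."""
    # Below this size the header MAC itself is cut off, so nothing can be verified.
    if len(data) < HEADER_LEN + TAG_LEN:
        return ("reject", None)
    header = data[:HEADER_LEN]
    tag = data[HEADER_LEN:HEADER_LEN + TAG_LEN]
    expected = hmac.new(key, header, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return ("reject", None)
    (claimed_len,) = struct.unpack(HEADER_FMT, header)
    if len(data) > claimed_len:
        return ("reject", None)         # longer than A said it sent
    if len(data) < claimed_len:
        return ("truncated", len(data))  # header is authentic, tail is missing
    return ("complete", claimed_len)
```

With a 0x1000-byte packet cut to 0x500 bytes, `receive` returns `("truncated", 0x500)`. Note what the check establishes: the length field was written by the holder of the shared key. It says nothing about which node shortened the packet.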
The authentication header can only help you authenticate when the MTU stayed the same as expected during delivery; it cannot help you authenticate the signals claiming the MTU was supposed to be something else, as those modifications, inherently, do not come from nodes partaking in the authentication header. The malicious middleman could falsely truncate a single packet to 0x500 bytes just as easily as they could falsely create an ICMP packet claiming the MTU is 0x500 bytes; in both cases the only thing you know for sure is "someone is trying to claim that last packet was too big".
Anti-DDoS middle boxes will be almost certainly unhappy with lone fragments and UDP in general, so it’s a bit of a thorny path.
The big question is what to do with IPv6, since the intermediary nodes will only drop. This bit unfortunately makes the whole exercise pretty theoretical, but it can be fun nonetheless to explore.
Feel free to contact me at my github userid at gmail, if this is a topic of interest.
Of course nothing is stopping you from labbing it up at home. Firewalls and software routers can usually be made to do refragmentation.
Thanks for keeping me straight here!
Based on the admittedly old study at [0], it seems like some carriers just don’t bother to fragment, indeed - but by far not all of them.
Firewalls might do virtual reassembly, so the trick with the initial fragment won’t fly there.
This MTU subject is interesting for me because I have a little work in progress experiment: https://gerrit.fd.io/r/c/vpp/+/41914/1/src/plugins/pvti/pvti... (the code itself is already in, but has a few crashy bugs still and I need to make it not suck performance wise, but that is my attempt to revisit the issue of MTU for the tunnel use case). The thesis is that keeping the 5-tuple will make “chunking”/“de-chunking” much, much simpler on the endpoints of the tunnel.
The source of inspiration was a very practical setup at [1], which is, while looking horrible in theory (locally fragmented GRE over L2TP), actually gives a decent performance with 1500-byte end to end MTU over the tunnel.
The open question is which inner MTU will be sane, taking into account the increased probability of loss with bigger inner MTU… intuitively seems like something like ~2.5K should just double the loss probability (because it’s 2x packets) and might be a workable compromise in 2025….
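A quick way to sanity-check that intuition, as a rough Python sketch. It assumes each outer packet is lost independently with the same probability (real loss is often bursty, so this is only a first approximation) and that an inner packet is lost if any of its chunks is lost. The 1400-byte outer payload is an illustrative figure, not taken from the setup above.

```python
import math


def inner_loss(p_outer: float, inner_mtu: int, outer_payload: int) -> float:
    """Probability that an inner packet of inner_mtu bytes is lost when it is
    carried in ceil(inner_mtu / outer_payload) independently-lost chunks."""
    chunks = math.ceil(inner_mtu / outer_payload)
    return 1.0 - (1.0 - p_outer) ** chunks


# Example: 1% outer loss, 2500-byte inner packets over 1400-byte chunks (2 chunks).
print(inner_loss(0.01, 2500, 1400))
```

For small loss rates the result is close to the chunk count times the outer loss rate, which matches the "2x packets, 2x loss" estimate.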
One could also do the same trick over QUIC, of course, but I wanted something tiny and easier to experiment with - and the ability to go over IPSec or wireguard as well as a secured underlay.
[0] https://labs.ripe.net/author/emileaben/ripe-atlas-packet-siz...
[1] https://github.com/ayourtch/linode-ipv6-tunnel
Like you mention you could do this under QUIC but then you'd be hamstrung to some of the design mandates such as encryption. This is way better as it's just datagrams doing your one goal - hiding that you're transporting fragments.
OTOH, I heard folks calling to banish the “no messing with a flow within 5-tuple” principle, so my hack may not have an overly long shelf life.
You don't, apart from enforcing a bare-minimum MTU for sanity's sake. If your jumbo-size packets are getting randomly cut off by a middlebox, then they probably aren't stable at that size anyway.
You do not need bit error detection because you literally truncated the packet. The data is already lost. But in the process you learned it was due to MTU limits which is very useful. Protocols are already required to be robust to garbage that fails bit error detection anyways, so it is not “required” to always have valid integrity tags. You could transparently re-encode bit error detection on the truncated packet if you so desire to ensure data integrity of the “MTU resulted in truncation” packet that you are now forwarding, but again, not necessary.
Any end-to-end protocol that encodes the intended data size in-band can use this technique across truncating transport layers. And any protocol which does so already requires implementations to not blindly trust the in-band value otherwise you get trivial buffer overflows. So, all non-grossly insecure client implementations should already be able to safely handle MTU truncation if they received it (they would just not be able to use that for MTU discovery until they are updated). The only thing you need is routers to truncate instead of drop and then you can slowly update client implementations to take advantage of the new feature since this middlebox change should not break any existing implementations unless they are inexcusably insecure.
“Damaged” frames and frame integrity only matter if you need the contents of the entire packet to remain intact. Which you explicitly do not when truncating.
The only new problem that arises is that maybe the in-band length information or headers get corrupted resulting in misinterpreting the truncation that actually occurred. And again, you already need to be robust to garbage. And you can just change my proposal to recompute the integrity tag on the truncated data if you think that really matters.
Ugh. I don't understand this. Especially passive PMTUD should just be rolled out everywhere. On Linux it still defaults to disabled! https://sourcegraph.com/search?q=context%3Aglobal+repo%3A%5E...
A rarer case, but really frustrating to debug, was when we had an L2 switch in the path with a lower MTU than the routers it was joining together. Without an IP level stack, there is no generation of ICMP messages and that thing just ate larger packets. The even stranger case was when there was a Linux box doing forwarding that had segment offload left on. It was taking in several 1500 byte TCP packets from one side, smashing them into ~9000 byte monsters, and then trying to send those over a VPNish network interface that absolutely couldn't handle that. Even if the network in the middle bothered to generate the ICMP too big message, the source would have been thoroughly confused because it never sent anything over 1500.
This is an old Linux tcp offloading bug; large receive offload smooshes the inbound packet, then it's too big to forward.
I had to track down the other side of this. FreeBSD used to resend the whole send queue if it got a too big message, even if the size did not change. Sending all at once made it pretty likely for the broken forwarder to get packets close enough to do LRO, which resulted in large enough packet sending to show up as network problems.
I don't remember where the forwarder seemed to be, somewhere far away, IIRC.
Passive PMTUD does NOT depend on ICMP messages.
So it is not compatible with anycast, for instance, which is massively used everywhere
In the end, having no answer is better than having a most likely wrong answer
Because UDP is only a very thin layer, each layer on top (e.g., QUIC) has to implement PLPMTUD; although, recently IETF standardised a way to extend UDP to have options and PLPMTUD is also specified for that: https://datatracker.ietf.org/doc/draft-ietf-tsvwg-udp-option...
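For anyone wondering what "implement PLPMTUD" boils down to, here is a rough Python sketch of just the size search. It assumes a hypothetical `send_probe(size)` callback, supplied by the layer on top of UDP, that sends a padded probe of that size and returns True only if the peer acknowledged it. It also assumes that if a size gets through, every smaller size does too. Real implementations (RFC 8899 for datagram transports) add timers, retries, and re-validation on top of this; the 1280 and 9000 defaults are just illustrative bounds.

```python
from typing import Callable


def probe_path_mtu(send_probe: Callable[[int], bool],
                   base: int = 1280, ceiling: int = 9000) -> int:
    """Binary-search the largest size in [base, ceiling] that send_probe confirms.

    No ICMP is involved: the only signal is whether a probe was acknowledged.
    """
    if not send_probe(base):
        raise RuntimeError("even the base size was not acknowledged")
    lo, hi = base, ceiling           # invariant: lo is confirmed, sizes > hi failed
    while lo < hi:
        mid = (lo + hi + 1) // 2     # round up so the loop always makes progress
        if send_probe(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


# Simulated path that silently drops anything above 1500 bytes.
print(probe_path_mtu(lambda size: size <= 1500))
```

A lost probe costs a timeout rather than a retransmission of user data, which is why probes are usually sent alongside normal traffic instead of replacing it.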
At 10Gbps it would take 3.4 seconds just to serialize the frame.
[1] https://docs.broadcom.com/doc/957608-PB1
I get the impression that the standard still allows hubs to exist, but that you just don't see them in practice.
I would be interested if anyone has ever used a 100mbit hub.
If I understand correctly, the speed of light in an electrical cable doesn't depend on the metal that carries current, but instead depends on the dielectric materials (plastic, air, etc.) between the two conductors?
For fibre the velocity factor depends on the refraction index of the fibre.
This part?
Can’t we accept to start a change that may take a decade or more to go forward? Instead of not starting that change.
Related reading: https://joshcollinsworth.com/blog/never-use-px-for-font-size
This is on Firefox/Debian, in case that means something for installed fonts. I see that site's CSS specifies Verdana and Arial, names that sound windowsey to me but I have no idea if my system has (analogous versions to) those
I currently have this solved by just sticking hosts on two vlans, one that has the default route and another that only has the jumbo capable hosts. ... but this seems kinda stupid.
See "mtu" option in ip-route(8):
* https://man.archlinux.org/man/ip-route.8.en#mtu
The BSDs also have an "-mtu" option in route(8):
* https://man.freebsd.org/cgi/man.cgi?route(8)
* https://man.openbsd.org/route
First, every L2 "dumb" switch that doesn't support your jumbogram size just silently drops the packet, which is no good.
Then, you have to figure out what size of jumbogram every device on your network supports, and select the minimum. In many cases, you'll have clients that don't support it at all.
And I hope all your OSes support setting an MTU per route, and you enjoy setting special routes on all of your clients, since Path MTU discovery, even where it is enabled and supported, at the very least adds latency to every connection, if it even works at all.
And god help you once you try to scale up your sweet jumbo frame solution. Plenty of routers have strict ICMP rate limits either imposed in software or hardware (because ICMP may be handled in an anemic CPU). So those ICMP fragmentation needed packets aren't reliably returned to your clients. It's even worse if your ISP doesn't block jumbograms outright. You will soon learn which of your ISP's peerings do or don't support jumbograms and whether they do or don't emit or forward ICMP.
The only advisable way to use jumbo frames is if you are running a datacenter and you have a group of machines that can be properly configured for route-based MTU and that would benefit from jumbo frames, and every piece of hardware you buy is carefully specced to support it.