
Re: [8023-CMSG] Server/NIC analogy



Hi,
I felt Norm's messages were lacking the bridge between the people and
the ideas involved. This is where we should focus, or we will never get
there. I believe the Ethernet standards body should not try to teach the
zillions of engineers in the industry how to architect their switches
and routers. This is a very "religious" issue, as we have just
experienced. Clearly, 802.3 can deal only with the link and the MAC.
We should try to understand the merit (and there is merit) of providing
two things that are within the scope of 802.3:
1 - Directed flow control messages across the link. That is, improve the
PAUSE frame message, within the same scope that 802.3x has.
2 - Evaluate the merit of introducing a BW control (shaping) mechanism
in the MAC, similar to some proprietary implementations.
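To make point 1 concrete, here is a rough sketch of what a "directed"
PAUSE might carry, next to the standard 802.3x frame. The standard
fields (destination 01-80-C2-00-00-01, MAC Control EtherType 0x8808,
opcode 0x0001, one pause_time in quanta) are from 802.3x; the per-class
variant's opcode and payload layout are invented here purely to
illustrate pausing one traffic class at a time:

```python
import struct

PAUSE_DA = bytes.fromhex("0180c2000001")   # 802.3x reserved multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001

def pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Standard 802.3x PAUSE: one pause_time applies to the whole link."""
    payload = struct.pack("!HH", PAUSE_OPCODE, quanta)
    frame = PAUSE_DA + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE) + payload
    return frame + bytes(60 - len(frame))   # pad to minimum frame size

def directed_pause_frame(src_mac: bytes, quanta_per_class: list) -> bytes:
    """Hypothetical 'directed' variant: a made-up opcode, an enable
    vector, and one pause_time per traffic class instead of one per link."""
    assert len(quanta_per_class) == 8
    payload = struct.pack("!HH", 0x0101, 0x00FF) + struct.pack("!8H", *quanta_per_class)
    frame = PAUSE_DA + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE) + payload
    return frame + bytes(60 - len(frame))
```

The point of the sketch is only that the frame format change is tiny;
the hard questions are about what the receiver of such a frame must do.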
If there is enough merit to justify a new standard within IEEE 802.3,
we have a go. If not, then let's put our effort elsewhere.
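Point 2, the shaping mechanism, can be pictured as nothing more than a
token bucket in the MAC transmit path. A minimal sketch, with the class
name and interface invented for illustration (not taken from any
proprietary implementation):

```python
class TokenBucketShaper:
    """Minimal token-bucket shaper, the kind of BW control a MAC
    might apply per port or per traffic class (illustrative only)."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate = rate_bps / 8.0      # refill rate in bytes per second
        self.burst = burst_bytes        # bucket depth
        self.tokens = float(burst_bytes)
        self.last = 0.0

    def try_send(self, frame_len: int, now: float) -> bool:
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= frame_len:
            self.tokens -= frame_len
            return True
        return False                    # caller must hold the frame and retry
```

A MAC would consult try_send() before each transmission; a False return
means the frame waits, which is exactly the shaping behavior in question.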
I hope you will be able to excuse the simplification I have introduced,
which I found important for the stage we are at.

Thanks

Gadi


-----Original Message-----
From: Norman Finn [mailto:nfinn@CISCO.COM]
Sent: Wednesday, June 09, 2004 12:33
To: STDS-802-3-CM@LISTSERV.IEEE.ORG
Subject: Re: [8023-CMSG] Server/NIC analogy

Gary,

McAlpine, Gary L wrote:
 >
 > I think this discussion is off on a tangent.

One can reasonably claim that you're the one who's off on a tangent.
One man's tangent is another man's heart of the argument.  You keep
saying, "we're just ..." and "we're only ..." and "we're simply ..." and
failing to acknowledge our "but you're ..." arguments.  Specifically:

You want back pressure at some granularity finer than whole links.  The
heart of the argument, which you did not address in your last message,
is: "At exactly what granularity do you want to exert back pressure?"

The answer to that question is, inevitably, "flows".  (I have no problem
that "flows" are relatively undefined; we dealt with that in Link
Aggregation.)  Per-flow back pressure is the "but you're ..." argument.
Hugh Barrass's comments boil down to exactly this point.  You want to
have per-flow back pressure.

The "per-something Pause" suggestions have mentioned VLANs and priority
levels as the granularity.  The use of only 8 priority levels, and thus
only 8 flows, is demonstrably insufficient in any system with more than
9 ports.  For whatever granularity you name, you require at least one
queue in each endstation transmitter for each flow in which that
transmitter participates.  Unfortunately, this O(n) problem in the
endstations is an O(n**2) problem in the switch.  A simple-minded switch
architecture requires one queue per flow on each inter-switch trunk
port, which means O(n**2) queues per trunk port.  The construction of
switches that handle back-pressured flows without requiring O(n**2)
queues per inter-switch port has been quite thoroughly explored by ATM
and Fibre Channel, to name two.  It is *not* an easy problem to solve.
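The O(n**2) arithmetic is easy to check.  Assuming the worst case, in
which a trunk carries a flow between every ordered pair of endstations
and each flow needs its own queue (helper name hypothetical):

```python
def queues_per_trunk_port(n_endstations: int, flows_per_pair: int = 1) -> int:
    # One queue per flow; worst case, the trunk carries a flow between
    # every ordered pair of endstations.
    return n_endstations * (n_endstations - 1) * flows_per_pair

for n in (8, 64, 256):
    print(n, queues_per_trunk_port(n))
# 8 endstations need 56 queues per trunk port, 64 need 4032, and 256
# need 65280; past 64 endstations the flow count already exceeds the
# 4096 values a 12-bit tag can name.
```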

At the scale of one switch, one flow per port, and only a few ports, as
Ben suggests, it is easy and quite convenient to ignore the o(n**2)
factor, and assume that the per-link back pressure protocol is the whole
problem.  Unfortunately, as you imply in your e-mail below, the trivial
case of a one-switch "network" is insufficient.  As soon as you scale
the system up to even "a few hops", as you suggest, the number of ports
has grown large enough to stress even a 12-bit tag per link.
Furthermore, to assume that a given pair of physical ports will never
want to have multiple flows, e.g. between different processes in the
CPUs, is to deny the obvious.

In other words, implementing per-flow back pressure, even in networks
with a very small number of switches, very quickly requires very
sophisticated switch architectures.

For a historical example, just look at Fibre Channel.  It started with
very similar goals, and very similar scaling expectations, to what
you're talking about, here.  (The physical size was different because of
the technology of the day, but the number of ports and flows was quite
similar.)  Fibre Channel switches are now quite sophisticated, because
the problem they are solving becomes extraordinarily difficult even for
relatively small networks.

Summary:

This project, as described by its proponents, is per-flow switching.
It is not the job of 802.3 to work on switching based even on MAC
address, much less per-flow switching.  It is essential that anyone who
desires to work on per-flow switching in 802 or any forum become
familiar with what the real problems are, and what solutions exist.

-- Norm

> ... There are assumptions being made here that are off-base. We need
> to focus our attention on what it is we are trying to enable with new
> standards. (My numbered items are responses to Hugh's numbered items.)
>
> 1. If what we are trying to enable are single-stage interconnects for
> backplanes, then wrt the IEEE standards, we're done. We just need to
> get good implementations of NICs and switches using 802.3x (rate
> control, not XON/XOFF) to meet the requirements (e.g. good enough
> throughput, low latency, low latency variation, no loss due to
> congestion). But ... single-stage interconnects are not very
> interesting to people who want to construct larger interconnects to
> tie multiple racks with multiple shelves of blades together into a
> single system.
>
> 2. (Putting on my server hat) We're NOT asking for IEEE to provide
> end-to-end congestion management mechanisms. If IEEE can simply
> standardize some tweaks to the current 802.3 (& 802.1) standards to
> support better congestion visibility at layer 2 and better methods of
> reacting to congestion at layer 2 (more selective rate control and no
> frame drops), then the rest can be left up to the upper layers. There
> are methods that can be implemented in layer 2 that don't prohibit
> scalability. Scalability may be limited to a few hops, but that is all
> that is needed.
>
> 3. The assumption in item 3 is not entirely true. There are
> relationships (that can be automatically discovered or configured)
> that can be exploited for significantly improved layer 2 congestion
> control.
>
> 4. For backpressure to work, it neither requires congestion to be
> pushed all the way back to the source nor does it require the
> backpressuring device to accurately predict the future. From the
> layer 2 perspective, the source may be a router. So back pressure
> only needs to be pushed up to the upper layers (which could be a
> source endpoint or a router). Also, the backpressuring device simply
> needs to know its own state of congestion and be able to convey clues
> to that state to the surrounding devices. We don't need virtual
> circuits to be supported at layer 2 to get "good enough" congestion
> control.
>
> 5. From an implementation perspective, I believe the queues can go
> either in the MAC or the bridge, depending on the switch
> implementation. (Am I wrong? I haven't seen anything in the interface
> between the bridge and the MAC that would force the queues to be in
> the bridge.) IMO, where they go should NOT be dictated by either
> 802.1 or 802.3. The interface between the bridge and the MAC should
> be defined to enable the queues to be placed where most appropriate
> for the switch architecture. In fact, a switch could be implemented
> such that frame payloads bypass the bridge and the bridge deals only
> with the task of routing frame handles from MAC receivers to one or
> more MAC transmitters (Do the 802.1 standards prevent such a design?).
>
> As far as the IETF standards go, they don't seem to rely on layer 2
> to drop frames (although we don't yet have a clear answer on this).
> If a router gets overwhelmed, it will drop packets. But if it
> supports ECN, it can start forwarding ECN notices before becoming
> overwhelmed. I think the jury is still out on whether the upper
> layers (in a confined network) would work better with layer 2
> backpressure or layer 2 drops. From a datacenter server perspective,
> there is no doubt in my mind that backpressure would be preferable to
> drops.
>
> Gary