# 100G SERDES Power Study 

Phil Sun, Credo

IEEE 802.3ck Task Force

## Introduction

- The 100 Gb/s SERDES power challenge and lower-power solutions have been presented in earlier contributions.
- sun 3ck 01a 0518 introduced "balanced lower-power EQ", a training protocol, and silicon test results.
- healey 3ck 01b 0718 pointed out that "extensions to TX FFE" can improve margin while keeping C2M power low.
- welch 3ck adhoc 01081518 concluded that the power budget for the C2M interface is very small for some future modules.
- lim 3ck 01b 0718 showed that 8 FFE taps may be needed for C2M and that SERDES power may be a concern.
- This contribution discusses 100G SERDES power for different SERDES architectures.
- Power optimization and shrink may be very different for each design and each process. PAM4 SERDES requires better linearity, bandwidth, and noise control than NRZ. This contribution summarizes recent papers on PAM4 SERDES and predicts the power of 100G SERDES by scaling clock frequency.


## Major Blocks of a Typical SERDES



- The high-power blocks are the TX driver, RX FFE/DFE, PLL/clock buffers, and CTLE. Some SERDES also have an ADC.
- FFE and DFE may be implemented in the analog or digital domain, depending on whether there is a high-precision ADC.


## SERDES Structure with "Balanced EQ"

- "Balanced EQ" is proposed to move part of the equalization from RX to TX to save power.
- For C2M, module RX is CTLE only and host has extended TX FFE. There are two possible structures based on Module TX:

1. Asymmetric structure: module has short TX FFE (e.g. 4 taps with 2 pre). Host has full RX.
2. Symmetric structure: module has extended TX FFE. Host RX does not have long FFE/DFE.

|  | Module TX | Module RX | Host TX | Host RX |
| :--- | :--- | :--- | :--- | :--- |
| Asymmetric <br> Balanced EQ | Short FFE <br> (e.g. 4 taps) | CTLE only | Extended FFE <br> (e.g. 11 taps) | Full RX |
| Symmetric <br> Balanced EQ | Extended FFE <br> (e.g. 11 taps) | CTLE only | Extended FFE <br> (e.g. 11 taps) | Shorter <br> Equalizer |
| Traditional <br> Structure | Short FFE <br> (e.g. 4 taps) | CTLE + FFE/DFE <br> with 8 post cursors | Regular TX FFE <br> (e.g. 6 taps) | Full RX |

Equalization Configuration (assuming 2 pre and 8 post cursors for C2M)

## PAM4 SERDES Power Survey - TX

| Reference | [1] Dickson <br> ISSCC 2017 | [2] Frans <br> JSSC 2017 | [3] Im <br> ISSCC 2017 | [4] Upadhyaya <br> ISSCC 2018 | [5] Wang <br> ISSCC 2018 | [6] Depaoli <br> ISSCC 2018 | [7] Menolfi <br> ISSCC 2018 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Technology | 14 nm | 16 nm | 16 nm | 16 nm | 16 nm | 28 nm | 14 nm |
| Data Rate (Gb/s) | 56 | 56 | 56 | 56 | 63.375 | 64 | 112 |
| TX | Voltage driver <br> FFE taps: 3 <br> Resolution: 30 slices | Current driver <br> FFE taps: 3 <br> Resolution: 5b for each tap | - | Voltage driver <br> FFE taps: 4 <br> Resolution: 78-90 slices | Voltage driver <br> FFE taps: 3 <br> Resolution: 33 slices with half cells | Voltage driver <br> FFE taps: 4 <br> Resolution: 72 slices | DAC <br> FFE taps: 8 <br> Resolution: 8 bit |
| TX Power (mW) | 101 | 140 | - | - | 89.7 | 135 | 264 <br> (including 34 for 8-tap FIR) |
| TX Power Scaled to 106.25 Gb/s (mW) | 192 | 266 | - | - | 150 | 224 | 250 <br> (including 32 for 8-tap FIR) |

Most of the data rates listed are close to 56 Gb/s. For the same structure, power will almost double at 112 Gb/s, since the majority of circuit power scales with clock rate/bandwidth.

- Dynamic power is proportional to $C V^{2} f_{\mathrm{ck}}$.
- There are 4 voltage-mode drivers. Resolution and the number of taps are among the major contributors to the power difference. 2.5% resolution and at least 4 TX FFE taps are assumed for 100G C2M (healey 3ck 01b 0718). The resolution and number of FFE taps of [1] and [5] would need to be increased for this application, which results in higher power.
- [7] is an early 112G TX design with a high-precision DAC. Power usually improves with time.
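The scaled row of the TX survey table follows from the linear data-rate scaling described above. A minimal Python sketch of that arithmetic is shown here; the function and variable names are illustrative, and the only assumption is that power scales in proportion to data rate, with no process or architecture change.

```python
# First-order assumption: TX power scales linearly with data rate
# (dynamic power ~ C * V^2 * f_ck), with no change of process or architecture.
TARGET_GBPS = 106.25

# (reference, data rate in Gb/s, published TX power in mW) from the survey table
TX_SURVEY = [
    ("[1] Dickson", 56.0, 101.0),
    ("[2] Frans", 56.0, 140.0),
    ("[5] Wang", 63.375, 89.7),
    ("[6] Depaoli", 64.0, 135.0),
    ("[7] Menolfi", 112.0, 264.0),
]


def scale_power(power_mw, rate_gbps, target_gbps=TARGET_GBPS):
    """Scale a published power number in proportion to data rate."""
    return power_mw * target_gbps / rate_gbps


scaled_tx = {ref: scale_power(p, r) for ref, r, p in TX_SURVEY}

for ref, mw in scaled_tx.items():
    print(f"{ref}: about {mw:.0f} mW at {TARGET_GBPS} Gb/s")
```

Note that [7] is scaled down slightly, since its published rate (112 Gb/s) is above 106.25 Gb/s.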


## Traditional Voltage vs. DAC Drivers



- "Traditional" TX FFE structure
  - time-delayed data streams $x[k-n]$
  - variable-weight sub-drivers with weights $c_n$
  - limited number of taps
- DAC-based TX FFE structure
  - digital FFE implementation $\rightarrow$ digital sample $y[k]$
  - sample bit data streams
  - fixed binary-weight sub-drivers
  - suitable for a larger number of taps
  - maximum flexibility in number of taps and weights
- The summation circuit of the FFE is in the analog domain for a traditional voltage-mode driver, and in the digital domain for a DAC-based TX.
- Traditional voltage-mode driver power scales up quickly with resolution (and the number of taps).
- The DAC-based TX is becoming popular because of its flexibility in the number of FFE taps and weights.
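To illustrate the DAC-based structure, here is a minimal Python sketch in which the FFE sum is computed digitally and then quantized to a DAC code. It is not taken from any of the surveyed designs: the function name, the tap values, the 6-bit resolution, and the full-scale normalization are all assumptions chosen for illustration.

```python
PAM4_LEVELS = (-3, -1, 1, 3)


def tx_ffe_dac_codes(symbols, taps, n_pre, dac_bits=6):
    """Digital TX FFE followed by quantization to unsigned DAC codes.

    Computes y[k] = sum_n taps[n] * x[k + n_pre - n], so taps[n_pre] is the
    main cursor, earlier entries are precursors, later entries postcursors.
    Symbols outside the sequence are treated as zero.
    """
    full_scale = max(abs(v) for v in PAM4_LEVELS) * sum(abs(c) for c in taps)
    max_code = 2 ** dac_bits - 1
    codes = []
    for k in range(len(symbols)):
        y = 0.0
        for n, c in enumerate(taps):
            idx = k + n_pre - n
            if 0 <= idx < len(symbols):
                y += c * symbols[idx]
        # Map [-full_scale, +full_scale] onto [0, max_code] (assumed mapping).
        codes.append(round((y + full_scale) / (2 * full_scale) * max_code))
    return codes


# Hypothetical 4-tap example: 2 precursors, main cursor, 1 postcursor.
example_taps = [-0.05, -0.10, 0.70, -0.15]
example_symbols = [3, -3, 1, -1, 3, 3]
example_codes = tx_ffe_dac_codes(example_symbols, example_taps, n_pre=2)
print(example_codes)
```

In this sketch, adding taps only lengthens the `taps` list in the digital sum, while the DAC itself is unchanged, which is the flexibility the DAC-based bullets refer to.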


## PAM4 SERDES Power Survey

| Reference | [1] Dickson <br> ISSCC 2017 | [2] Frans <br> JSSC 2017 | [3] Im <br> ISSCC 2017 | [4] Upadhyaya <br> ISSCC 2018 | [5] Wang <br> ISSCC 2018 | [6] Depaoli <br> ISSCC 2018 | [7] Menolfi <br> ISSCC 2018 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Technology | 14 nm | 16 nm | 16 nm | 16 nm | 16 nm | 28 nm | 14 nm |
| Data Rate (Gb/s) | 56 | 56 | 56 | 56 | 63.375 | 64 | 112 |
| RX EQ | TX only | CTLE <br> 24-tap FFE <br> 1-tap DFE <br> ADC based | CTLE <br> 10-tap direct-feedback DFE | CTLE <br> 14-tap FFE <br> 1-tap DFE | CTLE | CTLE | TX only |
| ADC Res (bits) | - | 8 | Non-ADC | 7 <br> 3 if FFE/DFE off | 6 <br> 2 for easy channels | Non-ADC | - |
| RX Power (mW) | - | 370 <br> (DSP power not included) | 230 | - | 100 for 8.6 dB channel <br> 184.9 for 13.6 dB channel <br> 283.9 for 29.5 dB channel <br> (FFE, deserializer, PLL, CDR not included) | 180 <br> 126 if scaled for 56G and 16 nm\*\* |  |
| Total Power (mW) | - | 510 <br> (DSP power not included) | 350\* | 545 (PMA 325, digital 220) for high-loss channel <br> 360 w/o FFE/DFE (PMA 295, digital 65) | 189.7 for 8.6 dB channel <br> 274.6 for 13.6 dB channel <br> 373.6 for 29.5 dB channel <br> (FFE, deserializer, PLL, CDR not included) | 315 <br> 221 if scaled for 56G and 16 nm\*\* |  |
| Total Power at 106.25 Gb/s (mW) | - |  | 664\* | 1034 <br> 683 w/o FFE/DFE | 360 for 8.6 dB channel <br> 460.2 for 13.6 dB channel <br> 709 for 2b 29.5 dB channel <br> (FFE, deserializer, PLL, CDR not included) | 419 for 16 nm |  |

- \* [3] total power is around 350 mW, assuming a 120 mW TX.
- \*\* Assuming 20% power saving from 28 nm to 16 nm (possibly +/-10% estimation error for one full node).
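The [6] entries in the table combine a data-rate scaling with the assumed process saving. The Python sketch below reproduces that arithmetic; the helper names are illustrative, and the 10% and 30% cases are one possible reading of the "+/-10% estimation error" in the footnote.

```python
def scale_rate(power_mw, from_gbps, to_gbps):
    """Linear scaling of power with data rate."""
    return power_mw * to_gbps / from_gbps


def scale_node(power_mw, saving=0.20):
    """Apply an assumed fractional power saving for one full process node."""
    return power_mw * (1.0 - saving)


# [6]: 28 nm, 64 Gb/s, RX 180 mW, total 315 mW (survey table values)
rx_56g_16nm = scale_node(scale_rate(180.0, 64.0, 56.0))
total_56g_16nm = scale_node(scale_rate(315.0, 64.0, 56.0))
total_106g_16nm = scale_rate(total_56g_16nm, 56.0, 106.25)

print(f"RX at 56G, 16 nm:       {rx_56g_16nm:.1f} mW")
print(f"Total at 56G, 16 nm:    {total_56g_16nm:.1f} mW")
print(f"Total at 106.25G, 16 nm: {total_106g_16nm:.1f} mW")

# Sensitivity of the 106.25G estimate to the assumed node saving.
for saving in (0.10, 0.20, 0.30):
    est = scale_rate(scale_node(scale_rate(315.0, 64.0, 56.0), saving), 56.0, 106.25)
    print(f"saving {saving:.0%}: about {est:.0f} mW")
```

The results land within about 1 mW of the table entries (126, 221, and 419 mW); the small differences come from intermediate rounding.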


## PAM4 SERDES Power Survey Summary

- Some of the latest receiver architectures published in ISSCC and JSSC are listed: CTLE only, direct-feedback DFE, and ADC-based.
- On average, TX power is about 110 mW at 53.125 Gb/s and 220 mW at 106.25 Gb/s.
- [4] and [5] show that ADC-based receiver power can be reduced (by about 350 mW at 106.25 Gb/s for [4]) by turning off RX FFE/DFE. Put differently, enabling RX FFE/DFE increases SERDES power by about 51%. Because the same design can be used for both long-reach and short-reach links with optimized power, design cost is reduced.
- Can receiver FFE/DFE be turned off for C2M channels?
  - sun nea 01a 0517 shows that TX FIR effectively cancels bad reflections for a 33 dB channel.
  - sun 3ck 01a 0518 shows that the channel output eye is wide open for a 14 dB channel with extended TX FIR. No RX FFE/DFE is needed.
  - twombly 3ck 01a 0718 shows good performance on a 30 dB channel by extending TX FIR. Only a 3-tap FFE and DFE are used on the RX side to deal with material loss.
  - healey 3ck 01b 0718 compared the performance of TX and RX FFE, and concluded that extended TX FFE can improve link margin and increase the loss budget while keeping a CTLE-only receiver.
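As a worked check on the 350 mW and 51% figures quoted in this summary, take the two total-power entries at 106.25 Gb/s from the survey table (1034 mW with RX FFE/DFE enabled, 683 mW without):

$$
1034 - 683 = 351 \ \mathrm{mW}, \qquad \frac{1034 - 683}{683} \approx 0.51
$$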


## 106.25 Gb/s C2M SERDES Power - 8 Post Cursors

| Architecture | Balanced EQ <br> (1. Asymmetric, 2. Symmetric) | 3. Analog DFE\*\* | 4. ADC Based |
| :---: | :---: | :---: | :---: |
| Equalization | TX: FIR (2/4 taps for asymmetric structure, 2/11 taps for symmetric structure) <br> RX: CTLE | TX: FIR (2/4) <br> RX: CTLE, with DFE taps | TX: FIR (2/4) <br> RX: CTLE, 6-bit ADC, 8-postcursor digital FFE |
| TX Power\* (mW) | 196 <br> 224 (symmetric structure) | 196 | 196 |
| RX Power (mW) | 239 <br> (by scaling [6]) | 436 <br> (by scaling [3]; power of 2 DFE tail taps is very low) | 498 <br> (310 by scaling [5] front end for 13.6 dB channel; 108 for FFE by scaling FIR of [7] for 6b input; 80 for PLL, deserializer and CDR) |
| Relative Total Power (mW) | 0 (435 as baseline for asymmetric) <br> 28 (463 for symmetric) | 197 <br> (total 632) | 259 <br> (total 694) |
| Power Difference for 2x400G Module C2M at 106.25G (mW) | 0 for asymmetric (total 3480) <br> 224 for symmetric (total 3704) | 1,576 <br> (total 5056) | 2,072 <br> (total 5552) |
| Projection with 30% reduction (mW)\*\*\* | 0 for asymmetric (total 305) <br> 19 for symmetric (total 324) | 137 (total 442) | 181 (total 486) |

Power of the different SERDES structures is derived from the survey results. 8 postcursor taps are assumed.

\* Assuming 180 mW for a 6-bit DAC, based on feedback from the ad hoc meeting. TX FIR is 4 mW per tap based on [7].

The asymmetric structure adds 28 mW of power on the switch (0.9 W for 32 ports) in exchange for the lowest module power. The symmetric structure enables close to the lowest-power RX for both module and host.

\*\* DFE tap 1 timing is tight. It is assumed that it can be implemented in other, power-equivalent ways for C2M performance.

Total power ratio for architectures 1, 2, 3, and 4 is 1 : 1.06 : 1.45 : 1.57.

\*\*\* Brave projection for future nodes with design improvements.
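The per-lane totals and module-level differences in the table can be rebuilt from the stated assumptions (180 mW DAC, 4 mW per TX FIR tap, and the scaled RX numbers). The Python sketch below does that bookkeeping. The variable names are illustrative, and the 8-lane count for a 2x400G module is inferred from the table totals (e.g. 8 x 435 = 3480).

```python
DAC_MW = 180.0          # assumed 6-bit DAC power (footnote *)
FIR_MW_PER_TAP = 4.0    # assumed TX FIR power per tap, based on [7]
LANES_PER_MODULE = 8    # inferred: 2x400G module = 8 lanes of 100G
SHRINK = 0.30           # projected reduction for future nodes (footnote ***)


def tx_power(n_taps):
    """TX power as DAC power plus per-tap FIR power."""
    return DAC_MW + FIR_MW_PER_TAP * n_taps


# name: (TX power in mW, RX power in mW)
architectures = {
    "asymmetric balanced EQ": (tx_power(4), 239.0),
    "symmetric balanced EQ": (tx_power(11), 239.0),
    "analog DFE": (tx_power(4), 436.0),
    "ADC based": (tx_power(4), 310.0 + 108.0 + 80.0),
}

totals = {name: tx + rx for name, (tx, rx) in architectures.items()}
baseline = totals["asymmetric balanced EQ"]
module_delta = {name: LANES_PER_MODULE * (t - baseline) for name, t in totals.items()}

for name, total in totals.items():
    print(
        f"{name}: {total:.0f} mW/lane, "
        f"+{total - baseline:.0f} mW/lane, "
        f"+{module_delta[name]:.0f} mW/module, "
        f"ratio {total / baseline:.2f}, "
        f"after shrink {total * (1 - SHRINK):.0f} mW/lane"
    )
```

Computed directly from these totals, the ADC-based ratio comes out at about 1.60, slightly above the 1.57 quoted in the text. The other ratios match.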

## Module Power Budget - 8 Postcursor Taps



Figure: 2x400GBase DR4, Gen 1, excluding electrical I/O

- Lowest max power (excluding electrical I/O): ~9.9 W; power available for electrical I/O: ~5.1 W
- Highest max power (excluding electrical I/O): ~16.8 W; power available for electrical I/O: ~ -1.8 W

- welch 3ck adhoc 01081518 analyzed the power budget for electrical I/O. Power available for C2M is 5.1 W in the best case, and -1.8 W in the worst case. Average is 3.45 W.
- "Balanced EQ" is close to the average power budget. Direct-feedback DFE is at the edge of the best-case budget, but DFE error propagation may be a problem for the C2M interface.
- "Balanced EQ" needs extra logic for adaptive tuning. If the management network is used for this purpose, the extra logic is mainly for register access and its power should be small.


## C2M SERDES Power - 5 Post Cursors

- Besides the implementations in the survey table, an FFE with a few taps can also be implemented in the analog domain.
- Assuming 5 FFE postcursors are enough (by tightening the channel or relaxing the pre-FEC BER target), the power ratio of C2M with asymmetric TX FFE, symmetric TX FFE, and analog RX FFE is about 1.00 : 1.04 : 1.40. FFE power could be lower at the cost of larger area, etc. In that case, the power ratio of these three architectures is estimated to be about 1.00 : 1.04 : 1.30.
- TX FIR has 4 or 11 taps depending on whether there is RX FFE. The TX in this survey is different from [7]: its tail taps are assumed to have fewer bits than the major taps, and TX power is also lower.


Figure panels: C2M Power with Asymmetric Extended TX; C2M Power with Symmetric Extended TX; C2M Power with RX FFE


## Module Power Budget - 5 Postcursor Taps



- If 5 postcursor taps are needed, the 16 nm analog-FFE-based low-power architecture meets a budget between the best case and the average.


## Analog FFE Based Architecture

- The delay of an analog FFE is usually implemented by buffers and passive/active LC delay lines [10, 11, 12]. Circuit distortion is a challenge if too many FFE taps are required. The main tap will be distorted if there are precursors, which becomes a problem especially for PAM4.

| Reference | [10] Momtaz <br> JSSC 2010 | [11] Chen <br> JSSC 2012 | [12] Mammei <br> JSSC 2014 |
| :---: | :---: | :---: | :---: |
| Technology | 65 nm CMOS | 65 nm CMOS | 28 nm LP CMOS |
| Signaling | 40 Gb/s NRZ | 40 Gb/s NRZ | 25 Gb/s NRZ |
| FFE taps | 7, T/2 (3.5 UI) | 3, T (3 UI) | 7, 3/4 T (5.25 UI) |
| FFE Power <br> (mW) | 65 | - | 90 |
| Chip Power <br> (mW) | 80 | 655 | - |
| Application | Repeater | CDR | CDR |

- If [12] is scaled to 53.125 GBd NRZ on 16 nm (assuming 20% process shrink, with probably 10% estimation error), FFE power would be 153 mW. Higher power is expected for 106.25 Gb/s PAM4, but it is hard to estimate without actual implementations.
- ghiasi 3ck 020918 derives FFE power from [10]. But [10] is optimized for a single-lane repeater and is not suitable for multilane chips [12].
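The 153 mW estimate for [12] can be reproduced with the same linear rate scaling plus the assumed process saving. The Python sketch below shows the arithmetic; the function name is illustrative, and treating the "10% estimation error" as a 10% to 30% range for the node saving is an assumption.

```python
def scale_ffe_power(power_mw, from_gbd, to_gbd, node_saving):
    """Scale FFE power linearly with baud rate, then apply a node saving."""
    return power_mw * (to_gbd / from_gbd) * (1.0 - node_saving)


# [12]: 90 mW FFE at 25 Gb/s NRZ on 28 nm, scaled to 53.125 GBd NRZ on 16 nm
nominal_mw = scale_ffe_power(90.0, 25.0, 53.125, node_saving=0.20)
low_mw = scale_ffe_power(90.0, 25.0, 53.125, node_saving=0.30)
high_mw = scale_ffe_power(90.0, 25.0, 53.125, node_saving=0.10)

print(f"nominal: {nominal_mw:.0f} mW (range {low_mw:.0f} to {high_mw:.0f} mW)")
```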


## Analog FFE Based Architecture Cont'd




Delay cell, die size, and eye diagram of [10]

- [10] achieved very low power using this structure for a 7-tap T/2 FFE on 65 nm. This design is well optimized for a single-lane repeater with NRZ signaling.
- Inductors are extensively used for low power at cost of large die size.
- As it is for NRZ, device nonlinearity is tolerated and signal swing is very small.
- Coupling caused by inductors is less problematic for a single-lane repeater which has no complicated clock circuits.


## Analog FFE Based Architecture Cont'd

- A long FFE (e.g. 8 post taps) is very difficult to implement with this structure, even on the latest process.
  - If we need 8 postcursor taps, 9 UI coverage is needed. (The 7-tap T/2 FFE of [10] covers 3.5 UI.)
  - [10] was published 8 years ago, and the industry is still experimenting with different architectures for low power. This can also be observed in publications.
- For 106.25 Gb/s PAM4 C2M for multi-lane modules, new challenges may result in much higher power compared to [10].
  - PAM4 can tolerate much less device nonlinearity and noise.
  - Inductors can be used to save power, but need to be controlled to avoid very large die size and coupling issues. Inductor size does not scale with process.
  - Delay needs to increase from 12.5 ps to 18.8 ps. Simply adjusting the transconductance amplifier will result in low delay cell bandwidth and degrade performance. More inductors may be needed for this purpose regardless of process.
  - [10] FFE bandwidth is 20 GHz with a delay cell bandwidth of 41 GHz. To keep the same performance for 106.25 Gb/s PAM4, a bandwidth increase of more than 30% is likely needed.
  - Signal swing needs to be greatly increased. As a consequence, device nonlinearity becomes more challenging.
- It can be very misleading to estimate 106.25 Gb/s PAM4 C2M power based on [10].
  - An actual implementation is needed to quantify the power increase and check performance related to linearity, noise, or other challenges.
  - Area and coupling issues are problematic for multilane applications.
  - Power shrink for this type of circuit can be poor. Power scaling across multiple process nodes may result in a huge estimation error (e.g. for two generations, assuming 10% versus 30% power shrink per generation results in a 65% estimation difference).
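Several of the numbers on this slide follow from simple arithmetic, sketched in Python below. The helper names are illustrative. The delay figures assume T/2 spacing at 40 Gb/s NRZ for [10] and T spacing at 53.125 GBd for the target, which is one reading consistent with the 12.5 ps and 18.8 ps quoted above.

```python
def compounded_power(shrink_per_node, nodes):
    """Remaining power fraction after a fixed shrink per process generation."""
    return (1.0 - shrink_per_node) ** nodes


def ui_ps(baud_gbd):
    """Unit interval in picoseconds for a given baud rate in GBd."""
    return 1000.0 / baud_gbd


# Two generations at 10% vs 30% shrink per generation.
pessimistic = compounded_power(0.10, 2)   # about 0.81 of original power
optimistic = compounded_power(0.30, 2)    # about 0.49 of original power
spread = pessimistic / optimistic - 1.0   # relative estimation difference

# Delay per FFE tap (assumed spacing: T/2 for [10], T for the target).
delay_ref_ps = ui_ps(40.0) / 2
delay_target_ps = ui_ps(53.125)

# Baud-rate ratio, used here as a proxy for the required bandwidth increase.
bandwidth_ratio = 53.125 / 40.0

print(f"estimation difference over two nodes: {spread:.0%}")
print(f"tap delay: {delay_ref_ps:.1f} ps -> {delay_target_ps:.1f} ps")
print(f"baud-rate increase: {bandwidth_ratio - 1.0:.0%}")
```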


## Conclusions

- The number of EQ taps impacts architecture choices.
- If 8 postcursor taps are needed, the power of balanced EQ, analog DFE, and ADC-based SERDES is compared. The ratio is 1 : 1.45 : 1.57.
- If 5 postcursor taps are needed, the analog-FFE-based architecture appears to be more power efficient than the other RX equalization structures. The power ratio of balanced EQ and analog-FFE-based SERDES is 1 : 1.3.
- For 16 nm SERDES with 8 postcursor taps, 2x400G module power is 1.6 W to 2.1 W lower by using "balanced EQ". The power difference is 1.1 W to 1.5 W after a 30% power shrink for newer technology.


## References

[1] T. O. Dickson, et al., "A 1.8pJ/b 56Gb/s PAM-4 Transmitter with Fractionally Spaced FFE in 14nm CMOS," ISSCC, pp. 118-119, Feb. 2017.
[2] Y. Frans, et al., "A 56-Gb/s PAM4 Wireline Transceiver Using a 32-Way Time-Interleaved SAR ADC in 16-nm FinFET," IEEE JSSC, vol. 52, no. 4, pp. 1101-1110, Apr. 2017.
[3] J. Im, et al., "A 40-to-56Gb/s PAM-4 Receiver with 10-Tap Direct Decision-Feedback Equalization in 16nm FinFET", ISSCC, pp. 114-115, Feb. 2017.
[4] P. Upadhyaya, et al., "A Fully Adaptive 19-to-56Gb/s PAM-4 Wireline Transceiver with a Configurable ADC in 16nm FinFET", ISSCC, pp. 108-109, Feb. 2018.
[5] L. Wang, et al., "A 64Gb/s PAM-4 Transceiver Utilizing an Adaptive Threshold ADC in 16nm FinFET", ISSCC, pp. 110-111, Feb. 2018.
[6] E. Depaoli, et al., "A 4.9pJ/b 16-to-64Gb/s PAM-4 VSR Transceiver in 28nm FDSOI CMOS", ISSCC, pp. 112-113, Feb. 2018.
[7] C. Menolfi, et al., "A 112Gb/s 2.6pJ/b 8-Tap FFE PAM-4 SST TX in 14nm CMOS", ISSCC, pp. 103-104, Feb. 2018.
[8] http://www.ieee802.org/3/100GEL/public/18_03/farjadrad_100GEL_01a_0318.pdf
[9] http://www.ieee802.org/3/ad_hoc/ngrates/public/17_05/sun_nea_01a_0517.pdf
[10] A. Momtaz, "An 80 mW 40 Gb/s 7-Tap T/2-Spaced Feed-Forward Equalizer in 65 nm CMOS", IEEE JSSC, vol. 45, no. 3, Mar. 2010.
[11] M. Chen et al., "A Fully-Integrated 40-Gb/s Transceiver in 65-nm CMOS Technology", IEEE JSSC, vol. 47, no. 3, pp. 627-640, Mar. 2012.
[12] E. Mammei et al., "Analysis and Design of a Power-Scalable Continuous-Time FIR Equalizer for 10 Gb/s to 25 Gb/s Multi-Mode Fiber EDC in 28 nm LP CMOS", IEEE JSSC, vol. 49, no. 12, pp. 3130-3140, Dec. 2014.

## Thanks!

