Authors: cros-fuzzing-ext@google.com
Status: Final
Last updated: 2021-08-31
EVALUATED HARDWARE AND SOFTWARE
Issue #1: Lack of a bounds check on LRO RXD’s next_desc_ptr
#1.1: OOB buff_ring reads overflow into device-controlled DMA buffers
#1.2: OOB writes to driver and other kernel data structures
#1.3: Use of uninitialized rxdata
Issue #2: Lack of extra consistency checks for EOP while gathering RX frags
#2.1: OOB write outside of packet buffers
#2.2: Use of uninitialized frag[0]
Issue #3: Lack of a bounds check on TXD hw_head
#3.1: Double kfree of socket buffers
The atlantic (net/ethernet/aquantia/atlantic) Linux kernel driver was evaluated for security vulnerabilities at the device/driver interface. This was done as a prerequisite for allowlisting this device driver to bind against associated USB4 or Thunderbolt devices on Google Chromebook consumer devices.
We found three separate root causes that can manifest as various security weaknesses. We believe that all of them can be patched easily, without requiring any intrusive changes to the driver.
This is the general threat model that we operate under. Note that some of the blocks below may not be relevant to the specific device/driver that this report covers.
Our threat model treats devices and peripherals as untrusted. Though somewhat unconventional, this is a real concern due to factors such as the hot-pluggable, externally exposed nature of USB4/Thunderbolt ports and the possibility of malicious or compromised peripheral firmware.
It’s worth highlighting that IOMMU is a necessary but not sufficient mitigation for us. Device drivers can have a rich attack surface even via the limited subset of memory that is mapped as accessible to the device.
We pick out the driver’s DMA surface as the main concern. DMA transactions performed by the device can change the contents of mapped DMA buffers at arbitrary points in time. This can lead to situations where the data changes under the driver's feet while it is processing a DMA buffer, creating a much larger and more subtle attack surface than interfaces where the data is either first copied out or unmapped from the device. A secondary concern is MMIO reads from device BARs, which present similar issues but generally expose a smaller attack surface.
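To illustrate the "data changes under the driver's feet" hazard, here is a minimal sketch of the snapshot-then-validate pattern that avoids double fetches from DMA-visible memory. The struct and field names are hypothetical, chosen only to mirror the RX descriptor fields discussed later in this report; this is not code from the driver.

```c
#include <stdint.h>

/* Hypothetical descriptor layout; field names are illustrative only. */
struct rx_desc {
	uint16_t pkt_len;
	uint16_t next_desc_ptr;
};

/*
 * Snapshot the descriptor out of DMA-visible memory exactly once, then
 * validate and use only the local copy. Validating a shared field and
 * then re-reading it would give the device a window to change the value
 * between the check and the use.
 */
static uint16_t clamped_len(const volatile struct rx_desc *dma_desc,
			    uint16_t frame_max)
{
	struct rx_desc snap;

	/* Single fetch per field from DMA-visible memory. */
	snap.pkt_len = dma_desc->pkt_len;
	snap.next_desc_ptr = dma_desc->next_desc_ptr;

	return snap.pkt_len > frame_max ? frame_max : snap.pkt_len;
}
```

Every later use sees the same value that was validated, regardless of concurrent device writes to the live descriptor.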
Our methodology is a combination of VM-introspection-based fuzzing and manual code inspection. Fuzzing uses the Kernel Fuzzer for Xen Project; we use it both to identify driver sites that parse DMA’d buffers and to perform the actual coverage-assisted fuzzing.
As with any fuzzing, the quality of the results depends heavily on domain expertise. We do not claim this report is comprehensive in any way, and we invite closer engagement with driver domain experts.
Driver: atlantic
Linux kernel versions: 5.4 (LTS)
Although a distro-specific (Chrome OS) kernel was evaluated in this exercise, the distro does not carry any private patches for this driver. Vulnerabilities were also confirmed to be present in upstream kernel version v5.14 by visual code inspection.
Host system: Intel x86_64
Device model: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] (rev 02) via a Cable Matters Thunderbolt 3 to 10 Gb Ethernet Adapter
Caveats: We were unable to source A0, B1, or atl2 revision devices and therefore lack coverage of the corresponding hw_atl code.
# | Title | CVSS score |
1 | Lack of a bounds check on RXD next_desc_ptr | 6.0 |
2 | Lack of a loop iteration bound while gathering RX frags | 5.7 |
3 | Lack of a bounds check on TX hw_head | 5.7 |
CVSS:3.1/AV:P/AC:H/PR:N/UI:N/S:U/C:L/I:H/A:H
rxd_wb->next_desc_ptr comes from DMA accessible memory -- from an RX descriptor ring owned by the device. This is used as an index into RX buff_ring via buff->next without any further validation.
static int hw_atl_b0_hw_ring_rx_receive(struct aq_hw_s *self,
					struct aq_ring_s *ring)
{
	/* ... */
	} else {
		buff->len = rxd_wb->pkt_len > AQ_CFG_RX_FRAME_MAX ?
			AQ_CFG_RX_FRAME_MAX : rxd_wb->pkt_len;

		if (HW_ATL_B0_RXD_WB_STAT2_RSCCNT & rxd_wb->status) {
			/* LRO */
			buff->next = rxd_wb->next_desc_ptr;
			++ring->stats.rx.lro_packets;
		} else {
			/* jumbo */
			buff->next = aq_ring_next_dx(ring, ring->hw_head);
			++ring->stats.rx.jumbo_packets;
		}
	}
	/* ... */
}
It is later used as an index into the driver-internal buff_ring without any bounds check.
int aq_ring_rx_clean(struct aq_ring_s *self,
		     struct napi_struct *napi,
		     int *work_done,
		     int budget)
{
	/* ... */
		if (!buff->is_eop) {
			buff_ = buff;
			do {
				next_ = buff_->next,
				buff_ = &self->buff_ring[next_];
				is_rsc_completed =
					aq_ring_dx_in_range(self->sw_head,
							    next_,
							    self->hw_head);

				if (unlikely(!is_rsc_completed))
					break;

				buff->is_error |= buff_->is_error;
				buff->is_cso_err |= buff_->is_cso_err;
			} while (!buff_->is_eop);

			if (!is_rsc_completed) {
				err = 0;
				goto err_exit;
			}
			if (buff->is_error || buff->is_cso_err) {
				buff_ = buff;
				do {
					next_ = buff_->next,
					buff_ = &self->buff_ring[next_];

					buff_->is_cleaned = true;
				} while (!buff_->is_eop);

				++self->stats.rx.errors;
				continue;
			}
		}
	/* ... */
		if (!buff->is_eop) {
			buff_ = buff;
			i = 1U;
			do {
				next_ = buff_->next,
				buff_ = &self->buff_ring[next_];

				dma_sync_single_range_for_cpu(
						aq_nic_get_dev(self->aq_nic),
						buff_->rxdata.daddr,
						buff_->rxdata.pg_off,
						buff_->len,
						DMA_FROM_DEVICE);
				skb_add_rx_frag(skb, i++,
						buff_->rxdata.page,
						buff_->rxdata.pg_off,
						buff_->len,
						AQ_CFG_RX_FRAME_MAX);
				page_ref_inc(buff_->rxdata.page);
				buff_->is_cleaned = 1;

				buff->is_ip_cso &= buff_->is_ip_cso;
				buff->is_udp_cso &= buff_->is_udp_cso;
				buff->is_tcp_cso &= buff_->is_tcp_cso;
				buff->is_cso_err |= buff_->is_cso_err;
			} while (!buff_->is_eop);
		}
	/* ... */
}
Here’s an example virtual address map observed on a 2 CPU machine with default ring settings.
Alloc | Virtual address | size | [D]MA / [I]nternal |
RX 1 dx_ring | 0xffff888063e40000 | 0x8000 | D |
TX 1 buff_ring | 0xffff888063e00000 | 0x30000 | I |
RX 1 buff_ring | 0xffff888063de0000 | 0x18000 | I |
TX 1 dx_ring | 0xffff888063dd0000 | 0x10000 | D |
Vec 1 | 0xffff888063dc9000 | 0x9b8 | I |
RX 0 dx_ring | 0xffff888063dc0000 | 0x8000 | D |
TX 0 buff_ring | 0xffff888063d80000 | 0x30000 | I |
RX 0 buff_ring | 0xffff888063d60000 | 0x18000 | I |
TX 0 dx_ring | 0xffff888063d50000 | 0x10000 | D |
Vec 0 | 0xffff888063d4c000 | 0x9b8 | I |
The range and alignment of buff_ is limited by sizeof(next_desc_ptr) and sizeof(buff_ring[0]). The possible range of OOB values for buff_ is buff_ring + X * 48 for X between self->size and UINT16_MAX.
However, the kernel allocates dx_rings right around the same location. dx_rings are DMA mapped and device controlled. This allows the device to influence a skb_add_rx_frag call with device controlled struct page*, offset, and size.
In the above example, the RX 0 overflow extends a bit beyond RX 1 dx_ring. RX 1 overflow extends into unknown territory.
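The reach of the overflow can be checked against the example map above. The 48-byte stride is from this report; the in-bounds entry count (2048) is our inference from the 0x18000-byte buff_ring size. The sketch below tests whether an OOB index lands inside a target region; the helper name is ours.

```c
#include <stdint.h>
#include <stdbool.h>

#define BUFF_STRIDE	48ULL	/* sizeof(struct aq_ring_buff_s), per this report */
#define RING_ENTRIES	(0x18000ULL / BUFF_STRIDE)	/* 2048; inferred */

/* Addresses and sizes copied from the example map above. */
#define RX0_BUFF_RING	0xffff888063d60000ULL
#define RX1_DX_RING	0xffff888063e40000ULL
#define RX1_DX_SIZE	0x8000ULL

/* Does OOB index x (RING_ENTRIES <= x <= UINT16_MAX) land in [lo, hi)? */
static bool oob_index_hits(uint64_t base, uint64_t x, uint64_t lo, uint64_t hi)
{
	uint64_t addr = base + x * BUFF_STRIDE;

	return addr >= lo && addr < hi;
}
```

With a 16-bit next_desc_ptr, the maximum reach is 65535 * 48 bytes, roughly 3 MiB past the start of buff_ring, which comfortably covers the inter-allocation gaps in the map above.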
Important caveat: We have not found a way for the device to discover valid struct page* addresses by itself. With such an assist, however, this primitive would offer an attacker a potent capability.
With IP forwarding, we hypothesize the device could construct a preamble for the packet such that it gets forwarded back via the same interface, thereby exfiltrating page contents. Another opportunity would be to route the packet to a cooperating userspace application.
Similarly, buff_->is_cleaned = true allows the device to set one specific bit in every 48 bytes. With our observed alignment and packing, this is known to overlap with the is_cleaned bits of descriptors from other TX and RX rings, which are benign. The more interesting overlaps are with fields of aq_vec_s such as aq_ring_s::buff_ring, aq_ring_s::size, aq_ring_s::page_order, etc. We did not evaluate these further, both because of the 1-bit restriction and because the root cause is shared with the reads above and has a trivial fix.
Even with an in-bounds buff_, it is possible that next_ lies beyond sw_tail, in which case its rxdata has not yet been initialized.
Check that next_desc_ptr is inside the boundaries of buff_ring and points to an active RXD.
CVSS:3.1/AV:P/AC:L/PR:N/UI:N/S:U/C:N/I:H/A:H
While assembling LRO and jumbo packets, aq_ring_rx_clean routinely relies on a single device-controlled signal, buff->is_eop, to terminate its loops.
static int hw_atl_b0_hw_ring_rx_receive(struct aq_hw_s *self,
					struct aq_ring_s *ring)
{
	/* ... */
	if (HW_ATL_B0_RXD_WB_STAT2_EOP & rxd_wb->status) {
		buff->len = rxd_wb->pkt_len %
			AQ_CFG_RX_FRAME_MAX;
		buff->len = buff->len ?
			buff->len : AQ_CFG_RX_FRAME_MAX;
		buff->next = 0U;
		buff->is_eop = 1U;
	} else {
		/* ... */
aq_ring_rx_clean repeatedly calls skb_add_rx_frag with an incrementing value of i until the device signals buff->is_eop.
int aq_ring_rx_clean(struct aq_ring_s *self,
		     struct napi_struct *napi,
		     int *work_done,
		     int budget)
{
	/* ... */
			do {
				next_ = buff_->next,
				buff_ = &self->buff_ring[next_];

				dma_sync_single_range_for_cpu(
						aq_nic_get_dev(self->aq_nic),
						buff_->rxdata.daddr,
						buff_->rxdata.pg_off,
						buff_->len,
						DMA_FROM_DEVICE);
				skb_add_rx_frag(skb, i++,
						buff_->rxdata.page,
						buff_->rxdata.pg_off,
						buff_->len,
						AQ_CFG_RX_FRAME_MAX);
				page_ref_inc(buff_->rxdata.page);
				buff_->is_cleaned = 1;

				buff->is_ip_cso &= buff_->is_ip_cso;
				buff->is_udp_cso &= buff_->is_udp_cso;
				buff->is_tcp_cso &= buff_->is_tcp_cso;
				buff->is_cso_err |= buff_->is_cso_err;
			} while (!buff_->is_eop);
	/* ... */
}
static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
					struct page *page, int off, int size)
{
	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

	/*
	 * Propagate page pfmemalloc to the skb if we can. The problem is
	 * that not all callers have unique ownership of the page but rely
	 * on page_is_pfmemalloc doing the right thing(tm).
	 */
	frag->bv_page = page;
	frag->bv_offset = off;
	skb_frag_size_set(frag, size);

	page = compound_head(page);
	if (page_is_pfmemalloc(page))
		skb->pfmemalloc = true;
}
The device can use this to overflow skb_shared_info::frags, a fixed-size array of MAX_SKB_FRAGS entries. Furthermore, per Issue #1, page and off can also be influenced by the device via a crafted buff_.
A descriptor with !buff->is_eop and buff->len <= AQ_CFG_RX_HDR_SIZE causes the driver to skip the skb_add_rx_frag call for i == 0 while still adding subsequent frags starting at i == 1, leaving frags[0] uninitialized.
int aq_ring_rx_clean(struct aq_ring_s *self,
		     struct napi_struct *napi,
		     int *work_done,
		     int budget)
{
	/* ... */
		hdr_len = buff->len;
		if (hdr_len > AQ_CFG_RX_HDR_SIZE)
			hdr_len = eth_get_headlen(skb->dev,
						  aq_buf_vaddr(&buff->rxdata),
						  AQ_CFG_RX_HDR_SIZE);

		memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
		       ALIGN(hdr_len, sizeof(long)));

		if (buff->len - hdr_len > 0) {
			skb_add_rx_frag(skb, 0, buff->rxdata.page,
					buff->rxdata.pg_off + hdr_len,
					buff->len - hdr_len,
					AQ_CFG_RX_FRAME_MAX);
			page_ref_inc(buff->rxdata.page);
		}

		if (!buff->is_eop) {
			buff_ = buff;
			i = 1U;
			do {
				next_ = buff_->next,
				buff_ = &self->buff_ring[next_];

				dma_sync_single_range_for_cpu(
						aq_nic_get_dev(self->aq_nic),
						buff_->rxdata.daddr,
						buff_->rxdata.pg_off,
						buff_->len,
						DMA_FROM_DEVICE);
				skb_add_rx_frag(skb, i++,
						buff_->rxdata.page,
						buff_->rxdata.pg_off,
						buff_->len,
						AQ_CFG_RX_FRAME_MAX);
				page_ref_inc(buff_->rxdata.page);
				buff_->is_cleaned = 1;

				buff->is_ip_cso &= buff_->is_ip_cso;
				buff->is_udp_cso &= buff_->is_udp_cso;
				buff->is_tcp_cso &= buff_->is_tcp_cso;
				buff->is_cso_err |= buff_->is_cso_err;
			} while (!buff_->is_eop);
		}
	/* ... */
}
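The frag-0 skip follows directly from the hdr_len arithmetic in the snippet above, which the sketch below mirrors. The 256-byte header size stands in for AQ_CFG_RX_HDR_SIZE; the exact value does not matter for the argument. For simplicity we model eth_get_headlen as returning its cap (in reality it may return less, which only makes the skip more likely).

```c
#include <stdbool.h>

#define RX_HDR_SIZE 256U	/* stands in for AQ_CFG_RX_HDR_SIZE */

/* Mirrors the hdr_len computation above: would frag 0 ever be filled? */
static bool frag0_filled(unsigned int len)
{
	unsigned int hdr_len = len;

	if (hdr_len > RX_HDR_SIZE)
		hdr_len = RX_HDR_SIZE;	/* modeling eth_get_headlen's cap */

	/* The driver only calls skb_add_rx_frag(skb, 0, ...) if this holds. */
	return len - hdr_len > 0;
}
```

Whenever len <= RX_HDR_SIZE the entire payload is memcpy'd into the skb head and frag 0 is never written, yet a non-EOP descriptor still starts the gather loop at i == 1.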
The device can easily cause the kernel to busy-loop indefinitely by never emitting an is_eop frag.
int aq_ring_rx_clean(struct aq_ring_s *self,
		     struct napi_struct *napi,
		     int *work_done,
		     int budget)
{
	/* ... */
		if (!buff->is_eop) {
			buff_ = buff;
			do {
				next_ = buff_->next,
				buff_ = &self->buff_ring[next_];
				is_rsc_completed =
					aq_ring_dx_in_range(self->sw_head,
							    next_,
							    self->hw_head);

				if (unlikely(!is_rsc_completed))
					break;

				buff->is_error |= buff_->is_error;
				buff->is_cso_err |= buff_->is_cso_err;
			} while (!buff_->is_eop);
	/* ... */
}
RECOMMENDATION:
Limit all affected loops to process at most MAX_SKB_FRAGS frags (accounting for any frags added outside the loop). Check that is_eop is consistent with the buffer length and any other relevant status bits.
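A sketch of the bounded gather loop follows. This is our illustration, not the driver's code: the arrays model the device-controlled is_eop and next fields, MAX_SKB_FRAGS is hard-coded to a typical upstream value, and the frag bookkeeping stands in for skb_add_rx_frag.

```c
#define MAX_SKB_FRAGS 17	/* typical upstream value; assumed here */

/*
 * Walk a device-linked chain of RX buffers, refusing to add more frags
 * than skb_shared_info::frags can hold. Frag 0 is consumed before the
 * loop (as in aq_ring_rx_clean), so the walk starts at index 1.
 * Returns the total frag count (including frag 0), or -1 if the chain
 * did not terminate in time and the packet should be dropped.
 */
static int gather_frags(const unsigned char *is_eop,
			const unsigned short *next, unsigned short cur)
{
	int i = 1;	/* frag 0 handled before the loop */

	for (;;) {
		if (i >= MAX_SKB_FRAGS)
			return -1;	/* chain too long: drop the packet */
		/* skb_add_rx_frag(skb, i, ...) would go here */
		i++;
		if (is_eop[cur])
			return i;
		cur = next[cur];
	}
}
```

The key property is that the loop bound no longer depends on any device-controlled signal; is_eop only decides success, never iteration count.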
CVSS:3.1/AV:P/AC:L/PR:N/UI:N/S:U/C:N/I:H/A:H
TX ring’s hw_head is read directly from the device via MMIO without further scrutiny. The device can influence it to be any value between 0x0 and 0x1fff.
static int hw_atl_b0_hw_ring_tx_head_update(struct aq_hw_s *self,
					    struct aq_ring_s *ring)
{
	int err = 0;
	unsigned int hw_head_ = hw_atl_tdm_tx_desc_head_ptr_get(self,
								ring->idx);

	if (aq_utils_obj_test(&self->flags, AQ_HW_FLAG_ERR_UNPLUG)) {
		err = -ENXIO;
		goto err_exit;
	}
	ring->hw_head = hw_head_;
	err = aq_hw_err_from_flags(self);

err_exit:
	return err;
}
u32 hw_atl_rdm_rx_desc_head_ptr_get(struct aq_hw_s *aq_hw, u32 descriptor)
{
	return aq_hw_read_reg_bit(aq_hw, HW_ATL_RDM_DESCDHD_ADR(descriptor),
				  HW_ATL_RDM_DESCDHD_MSK,
				  HW_ATL_RDM_DESCDHD_SHIFT);
}
Unexpected values of hw_head can cause aq_ring_tx_clean to call dev_kfree_skb_any a second time on already-cleaned parts of the ring.
bool aq_ring_tx_clean(struct aq_ring_s *self)
{
	/* ... */
	for (budget = AQ_CFG_TX_CLEAN_BUDGET;
	     budget && self->sw_head != self->hw_head; budget--) {
		struct aq_ring_buff_s *buff = &self->buff_ring[self->sw_head];

		/* ... */

		if (unlikely(buff->is_eop)) {
			++self->stats.rx.packets;
			self->stats.tx.bytes += buff->skb->len;

			dev_kfree_skb_any(buff->skb);
		}
		buff->pa = 0U;
		buff->eop_index = 0xffffU;
		self->sw_head = aq_ring_next_dx(self, self->sw_head);
	}

	return !!budget;
}
It’s worth noting that aq_ring_next_dx does successfully prevent buff itself from stepping outside the ring.
Validate that hw_head lies within the bounds of sw_head and sw_tail.
Also, please inspect and apply fixes to the A0, B1, and atl2 code as applicable.
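One possible shape for the recommended hw_head validation is sketched below. The helper name and the wrap-aware window test are ours; in the driver, a failing check would presumably be handled like the AQ_HW_FLAG_ERR_UNPLUG path in hw_atl_b0_hw_ring_tx_head_update (i.e. return -ENXIO rather than update ring->hw_head).

```c
#include <stdbool.h>

/*
 * A freshly read hw_head is plausible only if it lies within the window
 * of descriptors currently posted to hardware, i.e. between sw_head and
 * sw_tail (inclusive, with wraparound). Anything else should be treated
 * as a device error rather than handed to aq_ring_tx_clean.
 */
static bool tx_hw_head_valid(unsigned int hw_head, unsigned int sw_head,
			     unsigned int sw_tail)
{
	return sw_head <= sw_tail
		? (hw_head >= sw_head && hw_head <= sw_tail)
		: (hw_head >= sw_head || hw_head <= sw_tail);
}
```

Rejecting out-of-window values prevents sw_head from chasing a rewound hw_head back through entries whose skbs were already freed.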