GPU zos3 #191

Open
opened 2025-02-24 10:56:08 +00:00 by thabeta · 6 comments
Owner

Some VMs deployed with GPU models end up in a crash loop, resulting in the VM being removed by zos.

Author
Owner

One of the farmers moved a 4090 into devnet to check.

However, debugging wasn't possible because:

  • there is no access to that machine, as it's not using a public IP, unlike the nodes in freefarm in the DC

Mitigation:

  • stop removing the logs of decommissioned VMs
  • add an RMB call to expose these logs
Author
Owner

Jan is also going to make a machine available for debugging purposes.

thabeta added this to the tfgrid_3_16 project 2025-02-24 11:11:30 +00:00
thabeta self-assigned this 2025-02-24 11:11:36 +00:00
thabeta added the Issue label 2025-02-24 11:11:44 +00:00
Author
Owner

the crash

cloud-hypervisor: 1.531966s: <vmm> ERROR:pci/src/vfio.rs:1682 -- Could not unmap mmio region from vfio container: invalid dma unmap size
Error booting VM: VmBoot(DeviceManager(VfioMapRegion(DmaMap(IommuDmaMap(Error(22)))))

Maxime has an idea for a fix, but it needs more investigation.
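
For reading the boot failure above: a minimal sketch, assuming the `22` inside `Error(22)` is a Linux errno value as returned by the failing VFIO DMA map ioctl (the thread does not confirm this). The helper name `decode_errno` is made up for illustration.

```python
import errno
import os
import re

# Last line of the crash output quoted in this thread.
LOG_LINE = (
    "Error booting VM: "
    "VmBoot(DeviceManager(VfioMapRegion(DmaMap(IommuDmaMap(Error(22)))))"
)


def decode_errno(line):
    """Return (code, symbolic name, message) for the first Error(N) in a log line.

    Assumption: N is a raw errno value. Returns None when no Error(N) is found.
    """
    match = re.search(r"Error\((\d+)\)", line)
    if match is None:
        return None
    code = int(match.group(1))
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, os.strerror(code)


if __name__ == "__main__":
    print(decode_errno(LOG_LINE))
```

Under that assumption the code maps to `EINVAL`, meaning the kernel rejected one of the arguments of the mapping request. That would be consistent with the "invalid dma unmap size" message on the preceding log line.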

[Two screenshots attached: attachments/49d464cf-3c18-4447-86a3-831ae84a54ef, attachments/c2fbe648-d860-4c46-a335-9eec1971da8c]
Author
Owner

It now seems that an audio controller is in the way, according to @delandtj.
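
One common way an audio function gets in the way of GPU passthrough is that the GPU's HDMI audio controller shares an IOMMU group with the GPU while still being bound to a host driver. Whether that is the cause here is an assumption; the thread does not say. A minimal sketch for checking it on a node follows. It uses the standard Linux sysfs layout, and the function names and the example PCI address are made up for illustration.

```python
import os
from pathlib import Path


def iommu_group_members(pci_addr, sysfs_root="/sys"):
    """List every PCI function sharing an IOMMU group with pci_addr.

    Reads <sysfs_root>/bus/pci/devices/<addr>/iommu_group/devices and, for
    each member, its PCI class code and currently bound driver (if any).
    """
    devices = Path(sysfs_root) / "bus" / "pci" / "devices"
    group_dir = devices / pci_addr / "iommu_group" / "devices"
    members = []
    for entry in sorted(group_dir.iterdir()):
        addr = entry.name
        # The class file holds a value such as "0x030000" (VGA controller).
        class_code = int((devices / addr / "class").read_text().strip(), 16)
        driver_link = devices / addr / "driver"
        driver = None
        if driver_link.is_symlink():
            driver = Path(os.readlink(driver_link)).name
        members.append({
            "addr": addr,
            "class": class_code,
            # Base class 0x04, subclass 0x03 is a PCI audio device.
            "is_audio": (class_code >> 8) == 0x0403,
            "driver": driver,
        })
    return members


def blockers(members):
    """Members still bound to a driver other than vfio-pci."""
    return [m for m in members if m["driver"] not in (None, "vfio-pci")]


if __name__ == "__main__":
    # Hypothetical address; replace with the GPU's address from lspci.
    for member in iommu_group_members("0000:01:00.0"):
        print(member)
```

If `blockers` returns the audio function, that function would also need to be unbound from its host driver or bound to `vfio-pci` before the group can be handed to the VM.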

Owner

Update:

  • Jan found the issue and will work on the required code changes in the following days
  • Starting Sunday, devs can start implementing and publishing the code changes
  • After that, all Nvidia 4090 GPUs should work on the grid
    • To be confirmed, but that's the goal

We will give more info when possible.

Owner

@thabeta Any update here? I've been told by Jan that it's fixed and should go to mainnet soon. Ideally this is fixed next week so that we can proceed with the 3.16 release, close this project, and then start planning 3.17 with stakeholders. Thanks!

Reference: tfgrid/circle_product_management#191