Demystifying Artificial Intelligence and Machine Learning Infrastructure for a Network Engineer
Cisco’s presentation at AI Field Day 5, led by Paresh Gupta and Nicholas Davidson, focused on demystifying AI/ML infrastructure for network engineers, particularly in the context of building and managing GPU clusters for AI workloads. Paresh, a technical marketing leader, began by explaining the challenges of setting up a GPU cluster, emphasizing the importance of inter-GPU networking and how Cisco’s Nexus 9000 Series switches address these challenges. He highlighted the complexity of cabling and configuring such clusters, which can take weeks to set up, but with Cisco’s validated solutions, the process can be streamlined to just eight hours. Paresh also discussed the importance of non-blocking, non-over-subscribed network designs, such as the « Rails Optimized » design used by Nvidia and the « Fly » design by Intel, which ensure efficient communication between GPUs during distributed AI training tasks.
The presentation also delved into the technical aspects of inter-GPU communication, particularly the need for collective communication protocols like all-reduce and reduce-scatter, which allow GPUs to synchronize their states during parallel processing. Paresh explained how Cisco’s network designs, such as the use of dynamic load balancing and static pinning, help optimize the flow of data between GPUs, reducing congestion and improving performance. He also touched on the importance of creating a lossless network using priority-based flow control to avoid packet loss, which can significantly delay AI training jobs. Cisco’s Nexus Dashboard plays a crucial role in monitoring and detecting anomalies, such as packet loss or congestion, ensuring that the network operates efficiently.
Nicholas Davidson, a machine learning engineer at Cisco, then shared his experience of building a generative AI (GenAI) application using the on-premises GPU cluster managed by Paresh. He explained how the infrastructure allowed him to train models on Cisco’s private data, which could not be moved to the cloud due to security concerns. By leveraging the GPU cluster, Nicholas was able to reduce training times from days to hours, processing billions of tokens in a fraction of the time it would have taken using cloud-based resources. He also demonstrated how the AI model, integrated with Cisco’s Nexus Dashboard, could provide real-time insights and anomaly detection for network engineers, showcasing the practical benefits of having an on-prem AI/ML infrastructure.
Recorded live in San Francisco, CA on September 11, 2024. Watch the entire presentation at https://techfieldday.com/appearance/cisco-presents-at-ai-field-day-5/ or visit https://TechFieldDay.com/event/aifd5/ or https://Cisco.com/ai for more information.
Views : 1158
network engineer