Réseaux

Demystifying Artificial Intelligence and Machine Learning Infrastructure for a Network Engineer

Cisco’s presentation at AI Field Day 5, led by Paresh Gupta and Nicholas Davidson, focused on demystifying AI/ML infrastructure for network engineers, particularly in the context of building and managing GPU clusters for AI workloads. Paresh, a technical marketing leader, began by explaining the challenges of setting up a GPU cluster, emphasizing the importance of inter-GPU networking and how Cisco’s Nexus 9000 Series switches address these challenges. He highlighted the complexity of cabling and configuring such clusters, which can take weeks to set up, but with Cisco’s validated solutions, the process can be streamlined to just eight hours. Paresh also discussed the importance of non-blocking, non-over-subscribed network designs, such as the « Rails Optimized » design used by Nvidia and the « Fly » design by Intel, which ensure efficient communication between GPUs during distributed AI training tasks.

The presentation also delved into the technical aspects of inter-GPU communication, particularly the need for collective communication protocols like all-reduce and reduce-scatter, which allow GPUs to synchronize their states during parallel processing. Paresh explained how Cisco’s network designs, such as the use of dynamic load balancing and static pinning, help optimize the flow of data between GPUs, reducing congestion and improving performance. He also touched on the importance of creating a lossless network using priority-based flow control to avoid packet loss, which can significantly delay AI training jobs. Cisco’s Nexus Dashboard plays a crucial role in monitoring and detecting anomalies, such as packet loss or congestion, ensuring that the network operates efficiently.

Nicholas Davidson, a machine learning engineer at Cisco, then shared his experience of building a generative AI (GenAI) application using the on-premises GPU cluster managed by Paresh. He explained how the infrastructure allowed him to train models on Cisco’s private data, which could not be moved to the cloud due to security concerns. By leveraging the GPU cluster, Nicholas was able to reduce training times from days to hours, processing billions of tokens in a fraction of the time it would have taken using cloud-based resources. He also demonstrated how the AI model, integrated with Cisco’s Nexus Dashboard, could provide real-time insights and anomaly detection for network engineers, showcasing the practical benefits of having an on-prem AI/ML infrastructure.

Recorded live in San Francisco, CA on September 11, 2024. Watch the entire presentation at https://techfieldday.com/appearance/cisco-presents-at-ai-field-day-5/ or visit https://TechFieldDay.com/event/aifd5/ or https://Cisco.com/ai for more information.

Views : 1158
network engineer

Source by Tech Field Day

Mourad ELGORMA

Fondateur de summarynetworks, passionné des nouvelles technologies et des métiers de Réseautique , Master en réseaux et système de télécommunications. ,j’ai affaire à Pascal, Delphi, Java, MATLAB, php …Connaissance du protocole TCP / IP, des applications Ethernet, des WLAN …Planification, installation et dépannage de problèmes de réseau informatique……Installez, configurez et dépannez les périphériques Cisco IOS. Surveillez les performances du réseau et isolez les défaillances du réseau. VLANs, protocoles de routage (RIPv2, EIGRP, OSPF.)…..Manipuler des systèmes embarqués (matériel et logiciel ex: Beaglebone Black)…Linux (Ubuntu, kali, serveur Mandriva Fedora, …). Microsoft (Windows, Windows Server 2003). ……Paquet tracer, GNS3, VMware Workstation, Virtual Box, Filezilla (client / serveur), EasyPhp, serveur Wamp,Le système de gestion WORDPRESS………Installation des caméras de surveillance ( technologie hikvision DVR………..). ,

Laisser un commentaire

Votre adresse e-mail ne sera pas publiée. Les champs obligatoires sont indiqués avec *