Building AI Agents with Multimodal Models: NVIDIA DLI Workshop for Academia

Ready to build cutting-edge AI that understands the world through more than just text? Join our hands-on workshop and learn how to build neural network agents that can see, read, and reason across multiple data types! We’ll explore advanced techniques like data fusion, OCR, and NVIDIA's powerful AI Blueprints to tackle real-world challenges in robotics, healthcare, and beyond.

We'll start with a robotics use case, apply those principles to supercharge Large Language Models (LLMs), and finish by orchestrating a team of models to work together seamlessly. You can find the full workshop description here: https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+C-FX-17+V1

Who Is This For

This certification workshop is completely free for academic staff and students. A valid academic email address is required to access the NVIDIA DLI compute environment. If you are in industry, please contact [email protected] to request a quote for you or your team.

Register

Please fill in the registration form using your current institutional email address: https://forms.gle/YEETAidJqUzEkNS56

The access code for the NVIDIA DLI platform will be sent to your academic email address.

What You Will Learn

🧠 Data Fusion Mastery: Learn how early, late, and intermediate fusion differ, and use them to combine camera, LiDAR, and other data types.
📄 PDF & Document AI: Learn to extract and process text from PDFs using Optical Character Recognition (OCR).
🌐 Agent Orchestration: Understand how to make multiple AI models collaborate to solve complex problems.
🪜 NVIDIA AI Blueprints: Get hands-on with the Video Search and Summarization (VSS) blueprint to build powerful applications.
🗣️ Vision-Language Models: Turn a standard Language Model into a Vision Language Model (VLM) that can process images and documents.

Agenda

Part 1: Early & Late Fusion (1.0 hr)

Fuse camera and LiDAR data to predict object positions.
Prep various data types for your neural networks.
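
To make the contrast in Part 1 concrete, here is a minimal sketch, not workshop material. It assumes each sensor has already been reduced to a short feature vector, and it uses a plain weighted sum as a stand-in for a trained network. All names, weights, and feature values are hypothetical.

```python
from typing import Sequence


def linear(weights: Sequence[float], features: Sequence[float]) -> float:
    """Stand-in for a trained model: a single weighted sum."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features))


def early_fusion(camera, lidar, weights):
    """Concatenate the features first, then run ONE model on the joint vector."""
    fused = list(camera) + list(lidar)
    return linear(weights, fused)


def late_fusion(camera, lidar, camera_weights, lidar_weights):
    """Run one model PER modality, then average the two predictions."""
    camera_pred = linear(camera_weights, camera)
    lidar_pred = linear(lidar_weights, lidar)
    return 0.5 * (camera_pred + lidar_pred)


# Made-up features for one scene (illustrative values only).
camera_features = [0.2, 0.4]
lidar_features = [1.0, 3.0, 5.0]

early = early_fusion(camera_features, lidar_features, [1.0, 1.0, 0.1, 0.1, 0.1])
late = late_fusion(camera_features, lidar_features, [1.0, 1.0], [0.1, 0.1, 0.1])
print(f"early fusion prediction: {early:.2f}")
print(f"late fusion prediction:  {late:.2f}")
```

The trade-off the sketch hints at: early fusion lets a single model learn interactions between modalities, while late fusion keeps each modality's model independent and only merges their outputs.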

Part 2: Intermediate Fusion (1.0 hr)

Dive into the theory of multimodal model architecture.
Train a Contrastive Pretraining model and create a vector database.
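
As a rough illustration of the vector database idea in Part 2, the sketch below stores a few embeddings in memory and retrieves the closest one by cosine similarity. It assumes a contrastively trained model has already placed matching items from different modalities close together. The `TinyVectorStore` class, the keys, and the embedding values are all hypothetical; a real vector database would add indexing and persistence.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store with brute-force similarity search."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, key: str, vector: Sequence[float]) -> None:
        self._items.append((key, list(vector)))

    def search(self, query: Sequence[float], k: int = 1) -> List[str]:
        """Return the keys of the k stored vectors most similar to the query."""
        ranked = sorted(
            self._items, key=lambda item: cosine(query, item[1]), reverse=True
        )
        return [key for key, _ in ranked[:k]]


store = TinyVectorStore()
store.add("image_of_cube", [0.9, 0.1, 0.0])
store.add("image_of_sphere", [0.1, 0.9, 0.0])
store.add("image_of_torus", [0.0, 0.2, 0.9])

# A made-up embedding from another modality (e.g. a LiDAR scan of a cube).
lidar_query = [0.8, 0.2, 0.1]
best = store.search(lidar_query, k=1)
print(best)
```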

Part 3: Cross-modal Projection (2.0 hrs)

Transform an LLM into a Vision Language Model (VLM).
Process PDFs like a pro with OCR tools.
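
One common way to read "cross-modal projection" is a small learned layer that maps an image embedding into the same space as the LLM's token embeddings, so the image can be fed to the LLM as if it were an extra token. The sketch below shows that idea under toy assumptions: the dimensions, weights, and embeddings are invented, and the workshop's actual architecture may differ.

```python
from typing import List, Sequence


def project(
    image_embedding: Sequence[float],
    weight: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Linear projection: one output value per row of the weight matrix."""
    return [
        sum(w * x for w, x in zip(row, image_embedding)) + b
        for row, b in zip(weight, bias)
    ]


# Hypothetical 2-dim output of a vision encoder.
image_embedding = [1.0, 2.0]

# Hypothetical projection from 2 dims to the LLM's 3-dim embedding space.
weight = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
bias = [0.0, 0.0, 0.5]

image_token = project(image_embedding, weight, bias)

# Hypothetical 3-dim embeddings for two text tokens of the prompt.
text_tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# The projected image is prepended so the LLM sees one uniform sequence.
llm_input = [image_token] + text_tokens
print(llm_input)
```

In practice the projection weights are trained while the vision encoder and the LLM are often kept frozen, which is what makes this approach comparatively cheap.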

Part 4: Model Orchestration (2.0 hrs)

Analyze video with Cosmos Nemotron.
Use the VSS Blueprint to find answers in video content.

Part 5: Final Assessment (1.0 hr)

Put your new skills to the test by converting a pre-trained model to accept a new data type.