Nvidia has announced the launch of Nemotron 3 Nano Omni, an open multimodal AI model designed to unify vision, audio and language processing within a single system.
The model aims to address the current limitations faced by agentic systems, which typically use separate models for different modalities, causing increased latency and fragmented context during AI operations.
Companies including Applied Scientific Intelligence, Aible, Foxconn, Eka Care, H Company, Palantir and Pyler have already begun integrating Nemotron 3 Nano Omni into their solutions.
Further evaluations are underway at organisations such as Dell Technologies, K-Dense, Docusign, Lila, Infosys, Oracle, and Zefr.
Nvidia’s new model integrates vision and audio encoders via a 30B-A3B hybrid mixture-of-experts architecture, enabling faster and more efficient inference.
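As a rough illustration of why such a design can be cheaper to run, the sketch below assumes that "30B-A3B" follows the common naming convention of roughly 30 billion total parameters with roughly 3 billion active per token. The article does not confirm that reading. The figure of 2 FLOPs per active parameter is a general rule of thumb for a forward pass, not an Nvidia number, and the function names are purely illustrative.

```python
# Back-of-the-envelope sketch, not a benchmark.
# Assumption: "30B-A3B" means ~30B total parameters, ~3B active per token.
TOTAL_PARAMS = 30e9   # assumed total parameter count
ACTIVE_PARAMS = 3e9   # assumed parameters used per token

def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that take part in each token."""
    return active / total

def approx_forward_flops(params_used: float) -> float:
    """Rule of thumb: about 2 FLOPs per used parameter per token."""
    return 2 * params_used

moe_flops = approx_forward_flops(ACTIVE_PARAMS)    # only active experts run
dense_flops = approx_forward_flops(TOTAL_PARAMS)   # hypothetical dense model of equal size

print(f"Active share of parameters: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.0%}")
print(f"Approx. compute vs. dense model of equal size: {moe_flops / dense_flops:.0%}")
```

Under these assumptions each token touches only about a tenth of the weights, although the full set of weights still has to be held in memory. This estimate is separate from the throughput figures Nvidia reports.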
The unified approach allows AI agents to process video, audio, image and text data simultaneously. This is reported to result in up to nine times higher throughput compared to existing open multimodal models offering similar functionality.
According to Nvidia, this leads to reduced operational costs and improved scalability, enabling the deployment of responsive yet efficient AI agents.
The Nemotron 3 Nano Omni model is equipped for use cases such as computer interaction, document analysis and media understanding.
For instance, in customer service or finance, the model can handle screen recordings, call audio, and the analysis of complex documents, minimising the need for separate systems to manage each data type.
The model can also work alongside other Nemotron models, such as Nemotron 3 Super and Nemotron 3 Ultra, as well as proprietary solutions, to support AI workflows involving complex planning and document intelligence.
H Company recently implemented Nemotron 3 Nano Omni in a computer usage agent, using the model’s high native resolution support to navigate and interpret graphical user interfaces in experiments on the OSWorld benchmark.
H Company CEO Gautier Cloix said: “By building on Nemotron 3 Nano Omni, our agents can rapidly interpret full HD screen recordings — something that wasn’t practical before.”
In enterprise environments, the model enables the evaluation and interpretation of documents, tables, charts and mixed media inputs, supporting regulatory or compliance analysis.
The model is provided with open weights, datasets and training methods, allowing organisations to adapt and deploy it according to their own requirements while complying with data localisation or regulatory standards.
Developers can use the Nvidia NeMo toolkit to customise, evaluate, and optimise the model for specific industry needs.
The Nemotron family, including Nano, Super and Ultra variants, has seen over 50 million downloads in the last year, and the Omni variant now extends the family’s capabilities into new domains.
Nemotron 3 Nano Omni is available as an Nvidia NIM microservice through platforms such as Hugging Face, OpenRouter and build.Nvidia.com, and is supported by a range of Nvidia Cloud Partners and inference platforms.
The architecture allows deployment across a variety of computing environments, from local systems and edge devices to large-scale cloud data centres.
Earlier this month, Cadence broadened its technology collaboration with Nvidia to advance agentic AI, physics-based simulation and digital twin applications in engineering and system design.
