Introduction
The Wallaroo AI Inference Engine delivers high performance for both batch and real-time inference. Its architecture features an optimized, multi-threaded network server and an orchestration layer that manages incoming requests, prepares data, and executes models. Wallaroo offers flexible deployment and easy scaling and orchestration of Agentic AI inference microservices across GPU architectures and Arm-based architectures such as Ampere, all through simple configurations.
Additionally, the Wallaroo AI Inference Engine, written in Rust, comes with built-in autoscaling capabilities that adjust resource utilization through pre-configured triggers based on real-time demand, ensuring optimal inference latency and throughput while maximizing utilization of the underlying hardware (e.g. Ampere® Altra® or AmpereOne® processors) with built-in dynamic and concurrent batching capabilities.
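To give a rough sense of the configuration-driven approach, the sketch below builds a deployment configuration with replica autoscaling and per-replica resources. It assumes the Wallaroo Python SDK's deployment configuration builder; the exact method names (e.g. replica_autoscale_min_max, autoscale_cpu_utilization) and the values shown are assumptions and may vary by SDK version.

```python
import wallaroo

# Resource and autoscaling settings are expressed as a declarative configuration.
# Method names are assumptions based on the Wallaroo Python SDK and may vary by version.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=1, maximum=5)   # scale replicas with real-time demand
    .autoscale_cpu_utilization(75)                     # trigger: target CPU utilization (%)
    .cpus(4)                                           # vCPUs per replica (e.g. Ampere cores)
    .memory("8Gi")                                     # memory per replica
    .build()
)

# The configuration is then applied when deploying a pipeline, e.g.:
#   pipeline.deploy(deployment_config=deployment_config)
```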
Wallaroo offers superior inference performance for critical Agentic AI use cases on Arm®-based hardware, both in the cloud and at the edge. Compared to bespoke or non-specialized AI inference implementations, Wallaroo significantly reduces unnecessary inference overhead on Ampere, as seen in the benchmarks below.
The Wallaroo AI Control Plane
The Wallaroo AI Control Plane is also designed to simplify and automate the entire AI production lifecycle, with seamless integration into existing AI tools and ecosystems, so teams can launch AI models to production faster and maintain Agentic AI applications in an “AI factory” fashion on Arm-based architectures with Ampere.
The Wallaroo AI Control Plane comprises LLM operations capabilities spanning automated AI inference microservice packaging and deployment, ongoing model management with full governance, and observability.
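As a minimal sketch of that workflow, the example below uploads a model artifact, composes it into a deployable pipeline, and pulls recent inference logs. It assumes the Wallaroo Python SDK; the model name, file path, framework value, and log call parameters are illustrative and may differ by SDK version.

```python
import wallaroo
from wallaroo.framework import Framework

# Connect to the Wallaroo instance (assumes credentials are already configured).
wl = wallaroo.Client()

# Packaging & deployment: upload a model artifact; Wallaroo packages it as an
# inference microservice. The model name, path, and framework are illustrative.
model = wl.upload_model("summarizer", "./models/summarizer.onnx", framework=Framework.ONNX)

# Compose and deploy a pipeline around the packaged model.
pipeline = wl.build_pipeline("summarizer-service")
pipeline.add_model_step(model)
pipeline.deploy()

# Observability: pull recent inference logs for monitoring and governance.
recent_logs = pipeline.logs(limit=100)
```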
With Wallaroo, enterprises can overcome the production challenges mentioned above to easily operationalize LLMs for agentic AI applications and put in place measures such as RAG (Retrieval-Augmented Generation), complemented by Wallaroo’s LLM Listeners, to help ensure AI application efficacy and to uphold standards of security, privacy, and compliance by avoiding outputs containing toxicity, hallucinations, and the like.
RAG is a widely adopted method for tailoring LLM outputs with proprietary data. For LLM output validation with RAG, Wallaroo offers native capabilities for deploying and managing embedding models from sources such as HuggingFace directly on Ampere CPUs, eliminating complex hardware or infrastructure manipulations.
By implementing RAG with an authoritative data source, organizations enhance the reliability of the Agentic AI application, ensuring that generated outputs are relevant to the user’s context and far less prone to hallucinations. Typically, organizations leverage pre-trained LLMs and enhance responses using pertinent business information, avoiding the need for extensive fine-tuning. AI engineers frequently employ RAG to improve model relevance in dynamic data landscapes due to its adaptability and straightforward implementation. A crucial initial step in implementing RAG is determining the vector embeddings strategy.
Embeddings offer a means of data representation by transforming input data, such as PDFs and text documents, into numerical vectors that capture semantic relationships. These vectors are stored in and retrieved from a vector database, and a specialized AI model is required to generate the embeddings used for retrieval. Embedding models are often small (less than 1 GB), making Arm-based architecture on Ampere CPUs a cost- and energy-efficient infrastructure choice, combined with the Wallaroo control plane’s capability for packaging and orchestrating the multi-step inference pipelines that are critical for Agentic AI use cases.
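The sketch below illustrates the embedding and retrieval step on CPU, assuming the sentence-transformers library; the model name, documents, and query are illustrative, and in production the vectors would live in a vector database rather than in memory.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# A small (<1 GB) embedding model runs comfortably on CPU.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Wallaroo packages models as inference microservices.",
    "Embedding models map text to numerical vectors.",
]

# Transform text into dense vectors that capture semantic relationships.
doc_vectors = model.encode(documents, normalize_embeddings=True)

# Retrieval: find the document closest to the query in embedding space.
query_vector = model.encode(["How are models deployed?"], normalize_embeddings=True)
scores = doc_vectors @ query_vector.T
best_match = documents[int(np.argmax(scores))]
print(best_match)
```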
RAG LLMs with Wallaroo for Agentic AI applications
While RAG is useful for ensuring context-appropriate agentic AI applications, monitoring the health and reliability of Agentic AI applications and LLMs in production is crucial to maintaining a tight feedback loop as application usage grows. To that effect, Wallaroo’s LLM Listeners™ offer a suite of tools that automatically scan Agentic AI application inferences for unwanted results, enabling AI developers to proactively address production issues.
Fully compatible with Arm-based architecture, the Wallaroo LLM Listeners™ can be orchestrated to generate real-time monitoring reports and metrics that show how the LLM or the overall Agentic AI application is behaving and whether it remains effective in production, allowing AI teams to iterate quickly without impacting the bottom line.
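To illustrate the listener pattern in the abstract, the self-contained sketch below wraps a scoring model as a post-inference check that flags unwanted outputs and produces a report a monitoring system could aggregate. This is not the Wallaroo LLM Listener API; the wrapper, the naive stand-in scorer, and the threshold are all hypothetical, and in practice the scorer would be a small classifier (e.g. for toxicity or hallucination detection).

```python
from typing import Callable, Iterable

def make_listener(score_fn: Callable[[str], float], threshold: float = 0.5):
    """Wrap a scoring model (e.g. a toxicity classifier) as a listener over LLM outputs."""
    def listen(llm_outputs: Iterable[str]):
        report = []
        for text in llm_outputs:
            score = score_fn(text)
            # Flag any response whose score crosses the configured threshold.
            report.append({"text": text, "score": score, "flagged": score >= threshold})
        return report
    return listen

# Hypothetical stand-in scorer; a real listener would call a classifier model here.
naive_toxicity = lambda text: 1.0 if "idiot" in text.lower() else 0.0

listener = make_listener(naive_toxicity, threshold=0.5)
print(listener(["Thanks, the refund has been processed.", "You are an idiot."]))
```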
Safeguard Agentic AI applications with Wallaroo’s LLM Listener™