Selecting a local model or hardware setup for Intella Assist depends on your environment, the types and sizes of items in your case, and the complexity of the tasks you run. Different models behave differently depending on whether items are short or long and whether tasks are simple or complex. Because deployments vary widely, Vound does not recommend specific models or hardware.
This article explains what Intella Assist, and especially Intella Assist Tasks, requires, and how to evaluate models in a way that fits your setup.
What are Intella Assist's Requirements?
A.) OpenAI-compatible API
Intella Assist integrates with models that expose the same API format as OpenAI (POST /v1/chat/completions).
The runtime does not matter — the model may run on:
- Ollama
- vLLM
- LM Studio
- llama.cpp
- Local Docker deployments
- Cloud instances
...as long as the API follows the OpenAI Chat Completions standard.
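For reference, here is a minimal sketch of such a request from the client side, using the openai Python package. The endpoint URL, API key, and model name are placeholders for your own deployment; the URL shown matches Ollama's default OpenAI-compatible endpoint, but any compliant server works the same way.

from openai import OpenAI

# Point the standard OpenAI client at a local OpenAI-compatible server.
# Base URL, API key, and model name are placeholders, not required values.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # e.g. Ollama's OpenAI-compatible endpoint
    api_key="not-needed-for-local",        # many local runtimes ignore the key
)

response = client.chat.completions.create(
    model="mistral-small",  # whichever model your runtime serves
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize this email in one sentence: ..."},
    ],
)
print(response.choices[0].message.content)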
B.) Structured output (JSON)
Intella Assist Tasks require models that support OpenAI-style Structured output, specifically:
response_format: { "type": "json_schema" }
The model must be able to generate responses that follow the provided JSON Schema exactly.
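As a rough illustration, the sketch below requests a json_schema response format from the same placeholder endpoint. The schema is a made-up example for entity extraction, not one of Intella Assist's actual task schemas.

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed-for-local")

# An illustrative schema; Intella Assist supplies its own schemas at runtime.
schema = {
    "type": "object",
    "properties": {
        "people": {"type": "array", "items": {"type": "string"}},
        "organizations": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["people", "organizations"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="mistral-small",  # placeholder model name
    messages=[{"role": "user", "content": "List the people and organizations in: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "entity_extraction", "schema": schema, "strict": True},
    },
)

# The reply must be JSON that validates against the schema exactly.
print(json.loads(response.choices[0].message.content))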
Gateway Compatibility (e.g., LiteLLM)
If your large language model does not natively expose an OpenAI-compatible API, you may optionally use a gateway such as LiteLLM.
LiteLLM is:
- open-source (MIT licensed)
- a lightweight translation layer
- capable of exposing a unified OpenAI-style API
- able to route requests to various backends
- capable of normalizing responses to the OpenAI format
This is just an example; we are not affiliated with LiteLLM.
Gateways are optional and used at your discretion.
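From Intella Assist's point of view, a gateway is simply another OpenAI-compatible endpoint. Below is a minimal sketch assuming a LiteLLM proxy is already running locally on its default port 4000 with a model alias registered; the port, alias, and API key are assumptions to replace with your own gateway configuration.

from openai import OpenAI

# The client talks to the gateway; the gateway translates the request and
# routes it to whichever backend the alias "case-model" is mapped to.
client = OpenAI(
    base_url="http://localhost:4000/v1",  # assumed LiteLLM proxy address
    api_key="sk-anything",                # whatever key the proxy accepts
)

response = client.chat.completions.create(
    model="case-model",  # alias defined in the gateway configuration
    messages=[{"role": "user", "content": "Reply with the single word OK."}],
)
print(response.choices[0].message.content)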
Why We Don’t Recommend Specific Models or Hardware
There is no single “best” model for all environments. Performance depends on many factors:
- Task type (summarization, extraction, categorization, redaction, etc.)
- Clarity and size of the task definition
- Required context window size (we recommend at least 32k tokens)
- Hardware (GPU type, VRAM, ...)
- Quantization level (Q2, Q4, Q5, Q6, Q8, FP16, etc.)
- Inference runtime (Ollama vs vLLM vs llama.cpp)
- Latency and throughput expectations
- Quality expectations
- Legal and ethical considerations (transparency of the training process, including what data was used to train the model, bias mitigation strategies, ...)
Because all these variables differ across customer deployments, fixed recommendations often do not match real-world usage.
We also do not provide specific model or hardware recommendations because quality is not objectively measurable. Performance metrics such as speed or latency can be benchmarked, but the “quality” of the generated output is inherently subjective and varies with the task, prompt wording, domain, and reviewer expectations.
In addition, Intella Assist Tasks are used for a very broad range of use cases, from simple extractions to complex analyses, and this diversity makes it impossible to predict which model will perform best in every scenario. A model that excels for one organization's workload may not produce satisfactory results for another.
Understanding Practical Trade-offs
Try Models in the Cloud First
If you do not yet have suitable hardware, evaluating models in the cloud is the safest way to understand real-world performance before investing in equipment.
Examples of platforms that let you select models and hardware (examples only; we do not endorse or partner with these services):
- RunPod
- Vast.ai
- Lambda.ai
These platforms allow experimentation with different GPU types, quantizations, and model families.
What About Specific Models?
We don't recommend specific models. However, for internal evaluation purposes we have tested compact models such as Mistral Small 3.2 to understand baseline capabilities. In our tests, models of this size could handle simple tasks and small items, but they also showed clear limitations, including a higher tendency to hallucinate and to fail on more complex cases. This is an example, not a recommendation. Different environments have very different performance expectations, so choosing the model remains your decision.
Already Running a Model?
If your model supports an OpenAI-compatible API and structured output, you can integrate it with Intella Assist immediately.
If the results are slower or less accurate than expected, you can revisit your setup to improve either performance or quality.
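Before pointing Intella Assist at an endpoint, a quick stand-alone check of both requirements can save time. The sketch below uses the same placeholder endpoint and model as the earlier examples; substitute your own values.

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed-for-local")
MODEL = "mistral-small"  # placeholder model name

# 1. Plain chat completion: confirms the OpenAI-compatible API is reachable.
plain = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Reply with the single word OK."}],
)
print("Chat API reachable:", bool(plain.choices[0].message.content))

# 2. json_schema response format: confirms structured output is honored.
structured = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "answer",
            "schema": {
                "type": "object",
                "properties": {"answer": {"type": "integer"}},
                "required": ["answer"],
                "additionalProperties": False,
            },
            "strict": True,
        },
    },
)
parsed = json.loads(structured.choices[0].message.content)
print("Structured output honored:", isinstance(parsed.get("answer"), int))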
Improving Speed vs Improving Quality
Depending on what you want to optimize, different strategies apply.
Conclusion
Choosing a model or hardware for Intella Assist requires balancing speed, quality, context size, and task complexity. Because deployments vary widely, we cannot prescribe a single “best” option. Testing in the cloud helps determine what works reliably before committing to hardware.
As long as your model supports an OpenAI-compatible API and returns structured output, it can work with Intella Assist.
Important Notes
References to third-party tools, services, or models (such as LiteLLM, RunPod, Vast.ai, Lambda.ai, or any model family) are provided only as examples.
We do not certify, endorse, or guarantee their performance or compatibility.
Model and hardware performance depend on configuration, task design, and runtime environment.
Customers are responsible for selecting, configuring, and maintaining the infrastructure and models they use with Intella Assist.