Intella Assist: Choosing a Local Model and Hardware

Selecting a local model or hardware setup for Intella Assist depends on your environment, the types and sizes of items in your case, and the complexity of the tasks you run. A model that handles short items and simple tasks well may struggle with long items or complex tasks. Because deployments vary widely, Vound does not recommend specific models or hardware.

This article explains what Intella Assist — especially Intella Assist Tasks — requires, and how to evaluate models in a way that fits your setup.

What are Intella Assist's Requirements?


Intella Assist can work with any large language model that supports:  

A.) An OpenAI-compatible Chat Completions API

Intella Assist integrates with models that expose the same API format as OpenAI (POST /v1/chat/completions). 

The runtime does not matter — the model may run on:

  • Ollama

  • vLLM

  • LM Studio

  • llama.cpp

  • Local Docker deployments

  • Cloud instances

  • ...

... as long as the API follows the OpenAI Chat Completions standard.
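
For illustration, the sketch below shows the kind of request Intella Assist sends to such an endpoint. It is written in Python and assumes a runtime that is already serving an OpenAI-compatible API at http://localhost:11434/v1 (Ollama's default); the base URL, model name, and API key are placeholders that depend on your own deployment.

# Minimal sketch of an OpenAI-style Chat Completions request.
# The base URL, model name, and API key below are placeholders; adjust
# them to match your runtime (Ollama, vLLM, LM Studio, llama.cpp, ...).
import requests

BASE_URL = "http://localhost:11434/v1"   # example: Ollama's default endpoint
MODEL = "mistral-small"                  # placeholder model name

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": "Bearer unused"},  # most local runtimes ignore the key
    json={
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Say hello in one sentence."},
        ],
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])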


B.) Structured output (JSON)


Intella Assist Tasks require models that support OpenAI-style Structured output, specifically:

response_format: { "type": "json_schema" }

The model must be able to generate responses that follow the provided JSON Schema exactly.
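
As a sketch of what such a request looks like, the Python example below sends a Chat Completions call with a json_schema response format. The endpoint, model name, and schema are placeholders, and whether the schema is honored exactly depends on the runtime and model you deploy.

# Sketch of a Chat Completions request using an OpenAI-style JSON Schema
# response format. Endpoint, model name, and schema are placeholders.
import json
import requests

schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "language": {"type": "string"},
    },
    "required": ["summary", "language"],
    "additionalProperties": False,
}

response = requests.post(
    "http://localhost:11434/v1/chat/completions",   # placeholder endpoint
    json={
        "model": "mistral-small",                    # placeholder model
        "messages": [
            {"role": "user", "content": "Summarize the following email in English: ..."},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "item_summary", "strict": True, "schema": schema},
        },
    },
    timeout=120,
)
response.raise_for_status()

# If structured output is supported, the reply parses as JSON matching the schema.
result = json.loads(response.json()["choices"][0]["message"]["content"])
print(result["summary"], "-", result["language"])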


Gateway Compatibility (e.g., LiteLLM)  


If your large language model does not natively expose an OpenAI-compatible API, you can use a gateway such as LiteLLM.

LiteLLM is:

  • open-source (MIT licensed)

  • a lightweight translation layer

  • capable of exposing a unified OpenAI-style API

  • capable of routing requests to various backends

  • capable of normalizing responses to the OpenAI format

This is just an example; we are not affiliated with LiteLLM.
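
To illustrate what a gateway provides: once it is running and exposing its OpenAI-style endpoint, a client simply points its base URL at the gateway instead of the model server, and the client code stays the same regardless of the backend. The sketch below uses the openai Python package with a placeholder gateway URL and model alias; how to start the gateway and register backends is described in the gateway's own documentation.

# Sketch: a client talks to the gateway exactly as if it were an
# OpenAI-compatible model server. URL, key, and model alias are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",  # placeholder: wherever your gateway listens
    api_key="unused",                     # many local setups ignore the key
)

reply = client.chat.completions.create(
    model="local-llm",  # placeholder: the alias the gateway maps to a backend model
    messages=[{"role": "user", "content": "Reply with the word OK."}],
)
print(reply.choices[0].message.content)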


Gateways are optional and used at your discretion.


Why We Don’t Recommend Specific Models or Hardware  


There is no single “best” model for all environments. Performance depends on many factors:

  • Task type (summarization, extraction, categorization, redaction, etc.)

  • Clarity and size of the task definition

  • Required context window size (we recommend at least 32k tokens)

  • Hardware (GPU type, VRAM, ... )

  • Quantization level (Q2, Q4, Q5, Q6, Q8, FP16, etc.)

  • Inference runtime (Ollama vs vLLM vs llama.cpp)

  • Latency and throughput expectations

  • Quality expectations

  • Legal and ethical considerations (transparency of the training process incl. what data was used to train the model, bias mitigation strategies, …)

Because all these variables differ across customer deployments, fixed recommendations often do not match real-world usage.

We also do not provide specific model or hardware recommendations because quality is not objectively measurable. Performance metrics such as speed or latency can be benchmarked, but the “quality” of the generated output is inherently subjective and varies with the task, prompt wording, domain, and reviewer expectations.

In addition, Intella Assist Tasks are used for a very broad range of use cases — from simple extractions to complex analyses — and this diversity makes it impossible to predict which model will perform best in every scenario. A model that excels for one organization’s workload may not produce satisfactory results for another.


Understanding Practical Trade-offs


General tendencies across model sizes:

Smaller models (7B–8B)
- Faster responses
- Lower accuracy
- Not recommended

Mid-sized models (14B–30B)
- Balance between speed and reasoning quality
- Often a reasonable choice for short and predictable tasks

Large models (70B+)
- Strongest reasoning
- Highest hardware requirements and slower responses
- Better multilingual capabilities

Side note:
A larger context window lets the model process longer documents, but it also increases memory use.

Important:
Regardless of model size, quality improves significantly when instructions are clear and the task is well-defined.


Try Models in the Cloud First  


If you do not yet have suitable hardware, evaluating models in the cloud is the safest way to understand real-world performance before investing in equipment.


Examples of platforms that let you select models and hardware (examples only; we do not endorse or partner with these services):  

  • RunPod

  • Vast.ai

  • Lambda.ai

These platforms allow experimentation with different GPU types, quantizations, and model families.


What About Specific Models?  


We don't recommend specific models. However, for internal evaluation purposes we have tested compact models such as Mistral Small 3.2 to understand baseline capabilities. In our tests, models of this size could handle simple tasks and small items, but they also showed clear limitations, including a higher tendency to hallucinate and to fail on more complex cases. This is an example, not a recommendation. Different environments have very different performance expectations, so choosing the model remains your decision.


Already Running a Model?  


If your model supports an OpenAI-compatible API and returns Structured output, you can integrate it with Intella Assist immediately.

If the results are slower or less accurate than expected, you can revisit your setup to improve either performance or quality.


Improving Speed vs Improving Quality  


Depending on what you want to optimize, different strategies apply.  


If you want to speed things up:
- Pick a smaller or more aggressively quantized model
- Simplify the task so the model produces shorter output
- Use faster hardware

If you want to improve output quality:
- Use a larger or higher-precision model
- Clarify or refine the task definition, focusing it on a single topic
- Increase the context window for longer documents

Conclusion  


Choosing a model or hardware for Intella Assist requires balancing speed, quality, context size, and task complexity. Because deployments vary widely, we cannot prescribe a single “best” option. Testing in the cloud helps determine what works reliably before committing to hardware.


As long as your model supports an OpenAI-compatible API and returns structured output, it can work with Intella Assist.


Important Notes


References to third-party tools, services, or models (such as LiteLLM, RunPod, Vast.ai, Lambda.ai, or any model family) are provided only as examples.

We do not certify, endorse, or guarantee their performance or compatibility.

Model and hardware performance depend on configuration, task design, and runtime environment.

Customers are responsible for selecting, configuring, and maintaining the infrastructure and models they use with Intella Assist.