Vound Colorado, Ltd. Knowledge Base - Intella Assist: Running Large Language Models Locally

Disclaimer:

This guide is provided for informational purposes only and is intended as a starting point for integrating local Large Language Models (LLMs) with Intella Connect/Investigator. We do not guarantee the performance quality or hardware specifications of any models mentioned, including the `Mistral` model. Users should verify the suitability of these models and evaluate their own hardware capabilities independently.

Before committing to dedicated hardware for running LLMs, we strongly recommend conducting thorough research on the following aspects:
- The quality of the model.
- The hardware requirements necessary to support the model.
- Testing the model's compatibility and performance with your Intella product by leveraging trial plans from cloud providers. This approach allows for cost-effective evaluation without the need for immediate hardware investment.
- If opting for a quantized model, consider how quantization may impact its quality.

Please note that the selection of the model and the associated hardware is your responsibility. We cannot be held accountable for the performance of your chosen model or for any hardware requirements.

Intella Assist is designed to function seamlessly with Large Language Models (LLMs). To enable it, integration with an LLM is essential.

Although integrating with the OpenAI API (specifically, `gpt-4o`) yields the most effective results at the moment, it may not be the preferred option for all clients. In response to diverse client needs, we provide guidance on utilizing local LLMs, ensuring that all data remains within your infrastructure.

This guide will help you set up the `Mistral Small 3` model, which is available under the Apache 2.0 license:

https://mistral.ai/en/news/mistral-small-3

Our internal testing showed that this model offers a good balance between hardware requirements and performance quality at the time of writing this knowledge base article. It is important to note that any model compatible with the OpenAI API and supporting a context size of at least 32,000 tokens may be theoretically used, provided that it produces quality results .

The tool we will be using for installation, management, and operation of LLM models is called Ollama.

Please be aware that based on our evaluations, `gpt-4o` remains the only model capable of supporting Intella Assist's facet functionality. Consequently, the Intella Assist facet will be disabled when using an alternative model. The Intella Assist panel in the Previewer does work with alternative models.

Installation on Windows

1. Visit https://ollama.com/, download and install Ollama

Note:

Ensure you use version v0.6.8 or later.

2.  Set the OLLAMA_HOST environment variable

setx OLLAMA_HOST "0.0.0.0:11434" /M

This exposes Ollama on all interfaces (port 11434) so other machines can reach it; otherwise it’s limited to localhost.

(Reopen the terminal, or sign out/in, for the change to take effect.)

3. Download the model:

For this tutorial, we are selecting 8-bit quantized version of the model, which demands fewer hardware resources at a slightly reduced quality.

Note:

The recommended VRAM requirement for this version of the model is reportedly 32 GB. Possible hardware setup, to run such model, is either dual GPU setup of RTX 3090 / RTX 4090 or Tesla T40 (this is just example - not recommendation).

View the available Mistral small 3 models at:

https://ollama.com/library/mistral-small:24b-instruct-2501-q8_0

To download the 8-bit quantized version of the Mistral model:

ollama run mistral-small:24b-instruct-2501-q8_0

4. Increase the Context Window Size

By default, Ollama uses a context window size of 2048 tokens, which may be insufficient and does not fully utilize the larger context window that Mistral can support.

To increase the context window, you cannot modify the original Mistral model directly. Instead, you need to create a new model based on it. Follow the procedure below:

Step 1: Create a Modelfile

Create a new file named `mistral-intella-modelfile.olm`. To set the context size to 32k tokens, add the following content:

FROM mistral-small:24b-instruct-2501-q8_0

PARAMETER num_ctx 32768

Keep the following in mind:

- Maximum Limit: Do not exceed the original model's maximum context window.

- System Resources: The maximum context window depends on your machine’s available RAM. If you exceed it, you will be notified.

Step 2: Create the New Model

Run the following command to create the new model:

ollama create mistral-intella -f mistral-intella-modelfile.olm

This command will create a new model named `mistral-intella` with a 32k context window.

Step 3: Verify the Model

To confirm that the model was successfully created, run:

ollama list

Ensure that the `mistral-intella` model appears in the list.

5. Configure Intella Connect/Investigator:

Go to Admin Dashboard -> Settings -> Intella Assist and set it up like this:

Note:

Default OLLAMA port is used in this case. Also don't forget to append "/v1" to the endpoint URL like depicted above.

Note

API key needs to be entered - despite it's not required in this particular case.

6. Make sure that everything is set up properly by pressing "Test integration"

Installation on Linux / WSL2 (Ubuntu)

Note:

For installations using Windows Subsystems, we recommend opting for WSL2 over WSL1 due to its substantial performance benefits.

Note:

Local models are compatible with Intella Connect/Investigator from version 2.7.1 onwards.

1. Ensure your Linux system is up to date:

sudo apt update

sudo apt upgrade

2.  Set the OLLAMA_HOST environment variable:

echo 'export OLLAMA_HOST="0.0.0.0:11434"' >> ~/.bashrc

source ~/.bashrc

That exposes Ollama on all interfaces (port 11434) so other machines can reach it; otherwise it’s limited to localhost.

3. Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Note:

Ensure you use version v0.6.8 or later.

4. Download the model:

For this tutorial, we are selecting 8-bit quantized version of the model, which demands fewer hardware resources at a slightly reduced quality.

Note:

View the available Mistral small 3 models at:

https://ollama.com/library/mistral-small:24b-instruct-2501-q8_0

To download the 8-bit quantized version of the Mistral model:

ollama run mistral-small:24b-instruct-2501-q8_0

5. Increase the Context Window Size

By default, Ollama uses a context window size of 2048 tokens, which may be insufficient and does not fully utilize the larger context window that Mistral can support.

To increase the context window, you cannot modify the original Mistral model directly. Instead, you need to create a new model based on it. Follow the procedure below:

Step 1: Create a Modelfile

Create a new file named `mistral-intella-modelfile.olm`. To set the context size to 32k tokens, add the following content:

FROM mistral-small:24b-instruct-2501-q8_0

PARAMETER num_ctx 32768

Keep the following in mind:

- Maximum Limit: Do not exceed the original model's maximum context window.

- System Resources: The maximum context window depends on your machine’s available RAM. If you exceed it, you will be notified.

Step 2: Create the New Model

Run the following command to create the new model:

ollama create mistral-intella -f mistral-intella-modelfile.olm

This command will create a new model named `mistral-intella` with a 32k context window.

Step 3: Verify the Model

To confirm that the model was successfully created, run:

ollama list

Ensure that the `mistral-intella` model appears in the list.

6. Configure Intella Connect/Investigator:

Go to Admin Dashboard -> Settings -> Intella Assist and set it up like this:

Note:

Default OLLAMA port is used in this case. Also don't forget to append "/v1" to the endpoint URL like depicted above.

Note

API key needs to be entered - despite it's not required in this particular case.

7. Make sure that everything is set up properly by pressing "Test integration"