Running Large Language Models Locally

Disclaimer:

This guide is provided for informational purposes only and is intended as a starting point for integrating local Large Language Models (LLMs) with Intella Connect/Investigator. We make no guarantees about the performance, output quality, or hardware requirements of any models mentioned, including the `Mixtral` model. Users should independently verify the suitability of these models and evaluate their own hardware capabilities.

Before committing to dedicated hardware for running LLMs, we strongly recommend conducting thorough research on the following aspects:
- The quality of the model.
- The hardware requirements necessary to support the model.
- Testing the model's compatibility and performance with your Intella product by leveraging trial plans from cloud providers. This approach allows for cost-effective evaluation without the need for immediate hardware investment.
- If opting for a quantized model, consider how quantization may impact its quality.

Please note that the selection of the model and the associated hardware is your responsibility. We cannot be held accountable for the performance of your chosen model or for the hardware it requires.



Intella Assist is designed to function seamlessly with Large Language Models (LLMs). To enable it, integration with an LLM is essential.

Although integrating with the OpenAI API (specifically, `gpt-4o`) yields the most effective results at the moment, it may not be the preferred option for all clients. In response to diverse client needs, we provide guidance on utilizing local LLMs, ensuring that all data remains within your infrastructure.

This guide will help you set up the `mistralai/Mixtral-8x7B-Instruct-v0.1` model, which is available under the Apache 2.0 license at: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

Our internal testing showed that, at the time of writing this knowledge base article, this model offers a good balance between hardware requirements and output quality. Note that any model compatible with the OpenAI API and supporting a context size of at least 16,000 tokens can theoretically be used, provided that it produces results of sufficient quality.

The tool we will use to install, manage, and run LLMs is called Ollama.

Please be aware that based on our evaluations, `gpt-4o` remains the only model capable of supporting Intella Assist's facet functionality. Consequently, the Intella Assist facet will be disabled when using an alternative model. The Intella Assist panel in the Previewer does work with alternative models.

Installation on Windows


1. Visit https://ollama.com/, then download and install Ollama.
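
   After the installer finishes, you can confirm that the Ollama CLI is available by opening a Command Prompt or PowerShell window and running the following command (the exact version string will differ on your machine):
   `ollama --version`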

2. Download the model:

  For this tutorial, we are selecting the quantized (5-bit quantization) version of the model, which requires fewer hardware resources at the cost of slightly reduced quality.

  Note:
  The maximum VRAM requirement for this version of the model is reportedly 34.73 GB. Based on available reports, dual-GPU configurations (e.g., RTX 3090, RTX 4090, or Tesla T40) are expected to run it at approximately 50 tokens per second.
 
  View the available Mixtral models at: https://ollama.com/library/mixtral
 
  To download the 5-bit quantized version of the Mixtral model:  
  `ollama pull mixtral:8x7b-instruct-v0.1-q5_K_M`
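
  The download is large (in the order of tens of gigabytes), so it can take a while. Once it completes, you can verify that the model is available locally with:
  `ollama list`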

3. Increase the Context Window Size

   By default, Ollama uses a context window size of 2048 tokens, which is too small for Intella Assist and does not make use of the larger context window that Mixtral supports.
 
   To increase the context window, you cannot modify the original Mixtral model directly. Instead, you need to create a new model based on it. Follow the procedure below:
 
   Step 1: Create a Modelfile
 
   Create a new file named `mixtral-intella-modelfile.olm`. To set the context size to 32k tokens, add the following content:
 
   ```
   FROM mixtral:8x7b-instruct-v0.1-q5_K_M
   PARAMETER num_ctx 32768
   ```
 
   Keep the following in mind:
    - Maximum Limit: Do not exceed the original model's maximum context window.
    - System Resources: The maximum context window depends on your machine’s available RAM. If you exceed it, you will be notified.
 
   Step 2: Create the New Model
 
   Run the following command to create the new model:
 
   ```
   ollama create mixtral-intella -f mixtral-intella-modelfile.olm
   ```
 
   This command will create a new model named `mixtral-intella` with a 32k context window.
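
   If you want to double-check that the parameter was applied, recent Ollama versions can print the Modelfile of the newly created model; the output should contain the `num_ctx 32768` line (the `--modelfile` flag is assumed to be available in your Ollama version):
   `ollama show mixtral-intella --modelfile`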
 
   Step 3: Verify the Model
 
   To confirm that the model was successfully created, run:
   ```
   ollama list
   ```
   
    Ensure that the `mixtral-intella` model appears in the list.
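
     Optionally, give the model a quick smoke test from the command line; the first response can take a while because the model has to be loaded into memory first:
     `ollama run mixtral-intella "Reply with a one-sentence greeting."`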

4. Configure Intella Connect/Investigator:

    Go to Admin Dashboard -> Settings -> Intella Assist and configure the integration there, using `mixtral-intella` as the model name.

    Note:
    The default Ollama port (11434) is used in this case. Also, don't forget to append "/v1" to the endpoint URL, e.g. http://localhost:11434/v1.

    Note:
    An API key needs to be entered, even though it is not actually required by Ollama in this particular case.

5. Make sure that everything is set up properly by pressing "Test integration".
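
   If the test fails, it can help to check the OpenAI-compatible endpoint outside of Intella first. Recent Windows versions ship `curl.exe`, so from a Command Prompt you can list the models that Ollama exposes (this assumes the default port and that your Ollama version provides the `/v1/models` route); `mixtral-intella` should appear in the JSON response:
   `curl.exe http://localhost:11434/v1/models`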

Installation on Linux / WSL2 (Ubuntu)


Note:
For installations using the Windows Subsystem for Linux, we recommend opting for WSL2 over WSL1 due to its substantial performance benefits.

Note:
Local models are compatible with Intella Connect/Investigator from version 2.7.1 onwards.

1. Ensure your Linux system is up to date:
  ```
  sudo apt update      
  sudo apt upgrade      
  ```
 
2. Install Ollama:

  `curl -fsSL https://ollama.com/install.sh | sh`
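
  On most distributions, the install script also sets Ollama up as a background service; on WSL2 without systemd you may need to start it manually with `ollama serve`. You can confirm that the CLI works and that the server is listening on the default port (the second command should answer with "Ollama is running"):
  ```
  ollama --version
  curl http://localhost:11434/
  ```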
 
3. Download the model:

  For this tutorial, we are selecting the quantized (5-bit quantization) version of the model, which requires fewer hardware resources at the cost of slightly reduced quality.

  Note:
  The maximum VRAM requirement for this version of the model is reportedly 34.73 GB. Based on available reports, dual-GPU configurations (e.g., RTX 3090, RTX 4090, or Tesla T40) are expected to run it at approximately 50 tokens per second.
 
  View the available Mixtral models at: https://ollama.com/library/mixtral
 
  To download the 5-bit quantized version of the Mixtral model, proceed with this command:  
  `ollama pull mixtral:8x7b-instruct-v0.1-q5_K_M`

4. Increase the Context Window Size

   By default, Ollama uses a context window size of 2048 tokens, which is too small for Intella Assist and does not make use of the larger context window that Mixtral supports.
 
   To increase the context window, you cannot modify the original Mixtral model directly. Instead, you need to create a new model based on it. Follow the procedure below:
 
   Step 1: Create a Modelfile
 
   Create a new file named `mixtral-intella-modelfile.olm`. To set the context size to 32k tokens, add the following content:
 
   ```
   FROM mixtral:8x7b-instruct-v0.1-q5_K_M
   PARAMETER num_ctx 32768
   ```
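
   If you prefer to create the file from the shell rather than a text editor, the following command writes the same two lines:
   `printf 'FROM mixtral:8x7b-instruct-v0.1-q5_K_M\nPARAMETER num_ctx 32768\n' > mixtral-intella-modelfile.olm`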
 
   Keep the following in mind:
    - Maximum Limit: Do not exceed the original model's maximum context window.
    - System Resources: The maximum context window depends on your machine’s available RAM. If you exceed it, you will be notified.
 
   Step 2: Create the New Model
 
   Run the following command to create the new model:
 
   ```
   ollama create mixtral-intella -f mixtral-intella-modelfile.olm
   ```
 
   This command will create a new model named `mixtral-intella` with a 32k context window.
 
   Step 3: Verify the Model
 
   To confirm that the model was successfully created, run:
   ```
   ollama list
   ```
   
    Ensure that the `mixtral-intella` model appears in the list.
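
     After running a prompt (for example with `ollama run mixtral-intella`), you can also check whether the model was offloaded to the GPU: `ollama ps` (available in recent Ollama versions) lists the loaded models and whether they run on GPU or CPU, and `nvidia-smi` (if NVIDIA drivers are installed) shows VRAM usage:
     ```
     ollama ps
     nvidia-smi
     ```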
   
5. Configure Intella Connect/Investigator:

    Go to Admin Dashboard -> Settings -> Intella Assist and configure the integration there, using `mixtral-intella` as the model name.

    Note:
    The default Ollama port (11434) is used in this case. Also, don't forget to append "/v1" to the endpoint URL, e.g. http://localhost:11434/v1.

    Note:
    An API key needs to be entered, even though it is not actually required by Ollama in this particular case.

6. Make sure that everything is set up properly by pressing "Test integration".
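
   If the integration test fails, you can query the OpenAI-compatible endpoint directly to rule out problems on the Ollama side. The request below assumes the default port; the API key is just a placeholder, since Ollama does not validate it:
   ```
   curl http://localhost:11434/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer placeholder-key" \
     -d '{
       "model": "mixtral-intella",
       "messages": [{"role": "user", "content": "Reply with a one-sentence greeting."}]
     }'
   ```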