Proximity searches - A better understanding

There is also a webinar on using proximity searches in Intella here

In some cases we are provided with examples of the syntax that the customer has used. In most of these cases the syntax is very complex, and it is often incorrect.

Some customers ask us whether the syntax is correct, or ask why their proximity search is not working. This is something that we cannot answer on an individual basis. The point of this document is to provide information and examples to help our customers to get a better understanding of the proximity search syntax so that they can create the correct search syntax for the search that they want to perform.

Note: Most of the information here applies to all versions of Intella that support proximity searching. However, there are a few known limitations.

  • There is a limitation with hit highlighting in versions prior to 1.9.1
  • There is a limitation with running phrases within proximity searches prior to version 2.5.0
We recommend that you update to the latest version to take advantage of all of the features that are available with proximity searches.

What is a Proximity Search?

Proximity searches are search syntax specifically crafted to find items based on words that are within a specified maximum distance from each other in the item’s text. For example, if I wanted to find all items that have the words 'desktop' and 'application' within 10 words of each other then I would use the following:

"desktop application"~10

A proximity search differs from a phrase search in that it does not matter whether 'desktop' comes before or after 'application' in the text. For example, documents containing either of the passages of text below will be responsive to the proximity search above.

"You must turn on your desktop computer before you can open an application."
"I have copied the shortcut for the application onto the desktop."

Here is an example of a proximity search being used in Intella.


Using the Correct Proximity Search Syntax

As mentioned above, we receive proximity search syntax from customers. We often see search strings such as the examples below:

  1. (Baxter Jason) ~20 (article) OR (paper) OR (presentation) OR (public) OR (report)
  2. "national OR fire OR service"~30 (truck) OR (department)
  3. reading w2 glasses

These examples have been sanitized and shortened; however, the original search strings contained several lines of OR statements. This makes the search string complex, cumbersome, prone to errors and difficult to troubleshoot.


Example 1
If we look at the first example above, we can see immediately that there are several issues which make this syntax incorrect. One issue is that the terms to be searched are not encased in double quotes. Another issue is that the distance (~20 in this case) is not at the end of the proximity search syntax, as there are several OR statements after this number. The user manual shows a basic example of the syntax: "desktop application"~10. Note that the structure is to have two (or more) search terms encased in double quotes, followed by the number of words that the terms must be within.

The proximity string can be made more useful for larger queries by adding more search terms. The additional search terms need to be separated by the OR operator and encased in parentheses. For example, the first example above could be rewritten this way: 
"(Baxter OR Jason) (article OR paper OR presentation OR public OR report)"~20

Because the user is looking for one of two terms within 20 words of one of several other terms, we have grouped the keywords by placing them in parentheses and separating the terms with the OR operator. Note that all of the search terms are still encased in double quotes, followed by the number of words that the terms must be within. This syntax will return any items where Baxter or Jason is within 20 words of article, paper, presentation, public or report.
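If you build search strings like this often, a small script can assemble them so that the quotes, parentheses and distance always end up in the right place. The Python sketch below is only an illustration and is not part of Intella; the function names or_group and proximity_query are made up for this example.

```python
def or_group(terms):
    """Join terms with OR; add parentheses when there is more than one term."""
    terms = [t.strip() for t in terms if t.strip()]
    if not terms:
        raise ValueError("each group needs at least one term")
    return terms[0] if len(terms) == 1 else "(" + " OR ".join(terms) + ")"


def proximity_query(left_terms, right_terms, distance):
    """Return "<left group> <right group>"~distance as a single string."""
    if distance < 1:
        raise ValueError("distance must be a positive number of words")
    # Everything sits inside one pair of double quotes; ~N comes last.
    return f'"{or_group(left_terms)} {or_group(right_terms)}"~{distance}'


query = proximity_query(
    ["Baxter", "Jason"],
    ["article", "paper", "presentation", "public", "report"],
    20,
)
print(query)
# "(Baxter OR Jason) (article OR paper OR presentation OR public OR report)"~20
```

The printed string can be pasted into the search field as it is.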


Example 2
Again we see that there are issues with the search syntax in example 2. This time double quotes are used; however, they do not encase all of the search terms. Also, we see a similar trend to example 1, where several search terms are placed within parentheses and separated by the OR operator. We see a lot of samples like this and wonder whether this format of proximity search has come from another tool.

 

The way I read this example is as follows: Find all items that have national, fire, or service within 30 words of truck or department. The syntax can be rewritten this way: 
"(national OR fire OR service) (truck OR department)"~30

Again we use the parentheses to group the search terms into the two groups and make sure that all terms are encased in double quotes.


Example 3

This is an issue that we see a lot. Many other tools use the w (within) and the n (near) operators, e.g. reading w2 glasses. This syntax does not work with Intella. Assuming w2 is meant as "within two words", the Intella equivalent would be "reading glasses"~2.
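The three mistakes shown in the examples above (terms outside the double quotes, a distance that is not at the end, and w/n operators from other tools) can be spotted with a simple check before a search string is run. The Python sketch below is a rough heuristic written for this article, not Intella's own parser, and the function name check_proximity_syntax is hypothetical. It only looks at the overall shape of the string, so it can raise false alarms, for example on a genuine search term such as n95.

```python
import re

# A well-formed query is one double-quoted body followed by ~N and nothing else.
WELL_FORMED = re.compile(r'^"[^"]+"~\d+$')
# Operators such as w2, w/2 or n5 that come from other tools.
FOREIGN_OPERATOR = re.compile(r"\b[wn]/?\d+\b", re.IGNORECASE)


def check_proximity_syntax(query):
    """Return a list of likely problems; an empty list means none were spotted."""
    query = query.strip()
    problems = []
    if FOREIGN_OPERATOR.search(query):
        problems.append("uses a w/n style operator from another tool")
    if "~" not in query:
        problems.append("has no ~N distance")
    elif not re.search(r"~\d+$", query):
        problems.append("the ~N distance is not at the end")
    if not WELL_FORMED.match(query):
        problems.append("the terms are not all inside one pair of double quotes")
    return problems


for q in [
    '"national OR fire OR service"~30 (truck) OR (department)',
    "reading w2 glasses",
    '"(national OR fire OR service) (truck OR department)"~30',
]:
    print(q, "->", check_proximity_syntax(q) or "no problems spotted")
```

An empty result does not prove that the search returns what you intended; it only means that the string follows the basic structure described in this article.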


Using phrases within proximity searches

Phrases within proximity searches are supported in Intella from version 2.5.0 onwards. A phrase is marked by placing single quotes around it. Here are a few examples.

  1. The search below is used to find evidence of suspected stolen documents. The search finds the phrase chemical formula within 50 words of either copied OR attached. Note that in this example, the phrase 'chemical formula' has single quotes to encase the phrase.
    "'chemical formula' (copied OR attached)"~50


  2. The search below is used to locate a fraudulent invoice. The names of the two firms are searched, and these are highlighted in the document. Again the company names are encased in single quotes.
    "'abc engineering' 'bobs construction'"~20

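The two phrase examples above can also be produced by a script that adds the single quotes whenever a term contains more than one word. The Python sketch below is an illustration only; the names quote_phrase, group and proximity_query are made up for this example. The examples above only show phrases on their own, so placing a phrase inside an OR group, which this sketch would allow, should be treated as an untested assumption.

```python
def quote_phrase(term):
    """Wrap multi-word terms in single quotes; leave single words unchanged."""
    term = term.strip()
    return f"'{term}'" if " " in term else term


def group(terms):
    """Join terms with OR; add parentheses when there is more than one term."""
    parts = [quote_phrase(t) for t in terms]
    return parts[0] if len(parts) == 1 else "(" + " OR ".join(parts) + ")"


def proximity_query(left_terms, right_terms, distance):
    """Return "<left group> <right group>"~distance as a single string."""
    return f'"{group(left_terms)} {group(right_terms)}"~{distance}'


print(proximity_query(["chemical formula"], ["copied", "attached"], 50))
# "'chemical formula' (copied OR attached)"~50
print(proximity_query(["abc engineering"], ["bobs construction"], 20))
# "'abc engineering' 'bobs construction'"~20
```

Terms that contain an apostrophe of their own are not handled by this sketch.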


Limitations

  • In the past we have seen very long proximity search strings where the syntax contained over 40 words separated by the OR operator. Constructing complex proximity searches like this is not recommended. Even if the syntax is correct, 40 words in a proximity search make the search string complex, cumbersome, prone to errors and difficult to troubleshoot.
  • We have also received extremely long search syntax where all search terms contained wildcards. Such long and complex queries with many wildcards are known to have very poor performance, especially for hit highlighting in the Previewer window.

Workarounds

There are a couple of methods one could use to manage complex proximity searches that contain a large number of search terms separated by the OR operator. The first method is to break down the search string. The second method is to use keyword lists.

Breaking down the search string

A complex search string can be broken down into several shorter proximity search strings. The shorter search strings can then be placed into a keyword list. E.g. 

"Baxter article"~20
"Baxter paper"~20
"Baxter presentation"~20
"Baxter public"~20
"Baxter report"~20

Intella will be able to process the list of shorter proximity searches more efficiently than one large complex search string. 

With a small amount of Excel work, you can create a keyword list that includes all of your shortened proximity searches in a single list.
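If you prefer a script to a spreadsheet, the same list can be generated by pairing every term from the first group with every term from the second group. The Python sketch below is one possible way to do this; the function name split_proximity_search is made up for this example, and it assumes that the keyword list is saved as a plain text file with one search per line.

```python
from itertools import product


def split_proximity_search(left_terms, right_terms, distance):
    """Return one short proximity search per (left term, right term) pair."""
    return [
        f'"{left} {right}"~{distance}'
        for left, right in product(left_terms, right_terms)
    ]


lines = split_proximity_search(
    ["Baxter"],
    ["article", "paper", "presentation", "public", "report"],
    20,
)
keyword_list = "\n".join(lines)
print(keyword_list)
# "Baxter article"~20
# "Baxter paper"~20
# "Baxter presentation"~20
# "Baxter public"~20
# "Baxter report"~20
```

Adding Jason to the first group doubles the output to ten lines, so the list grows quickly when both groups are large.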

Using keyword lists
The idea behind using keyword lists is to reduce the number of items that your proximity search needs to search across. Two keyword lists can be created, one list which contains the search terms in the left group of a proximity search, and a second list which contains all the other terms in the right group, e.g. 

Keyword list 1    Keyword list 2
Baxter            article
Jason             paper
                  presentation
                  public
                  report

Next, run the two keyword lists and Tag the overlapping cluster. This cluster will contain the items that have search terms from both keyword lists. 

Set this Tag as a Require search and run the proximity search. This provides faster searching as you are not searching over the entire dataset. However, be aware that hit highlighting can still be slow or have issues if the proximity search is complex and contains many wildcards.
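The reason this works is that an item can only match the proximity search if it contains at least one term from each keyword list, so restricting the search to the overlapping cluster does not lose any results. The Python sketch below is a toy model of that idea using a handful of made-up items; it is not how Intella is implemented, and the way Intella counts the distance between words may differ from the simple position difference used here.

```python
import re


def tokens(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_any(text, terms):
    """True if the text contains at least one of the terms."""
    words = set(tokens(text))
    return any(t.lower() in words for t in terms)


def within(text, left_terms, right_terms, distance):
    """True if any left term is at most `distance` positions from any right term."""
    words = tokens(text)
    left = {t.lower() for t in left_terms}
    right = {t.lower() for t in right_terms}
    left_pos = [i for i, w in enumerate(words) if w in left]
    right_pos = [i for i, w in enumerate(words) if w in right]
    return any(abs(i - j) <= distance for i in left_pos for j in right_pos)


list_1 = ["Baxter", "Jason"]
list_2 = ["article", "paper", "presentation", "public", "report"]

# Hypothetical items, invented for this illustration.
items = {
    "mail-1": "Baxter will send the report tomorrow",
    "mail-2": "Jason is on holiday",
    "mail-3": "The presentation went well",
    "mail-4": "Jason wrote a long note. " + "filler " * 30 + "It was not a paper.",
}

# Step 1: the overlapping cluster, i.e. items with a term from both lists.
overlap = [k for k, t in items.items()
           if contains_any(t, list_1) and contains_any(t, list_2)]

# Step 2: the proximity check, run on the overlap only.
hits = [k for k in overlap if within(items[k], list_1, list_2, 20)]

# For comparison: the same check run over every item.
all_hits = [k for k, t in items.items() if within(t, list_1, list_2, 20)]

print(overlap)           # ['mail-1', 'mail-4']
print(hits)              # ['mail-1']
print(hits == all_hits)  # True
```

In this toy data the more expensive proximity check runs on two items instead of four and returns the same result.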



Updated June 2022