There is also a webinar on using proximity searches in Intella here.
In some cases customers send us examples of the syntax they have used. In most of these cases the syntax is very complex, and it is often incorrect.
Some customers ask us whether their syntax is correct, or why their proximity search is not working. This is something that we cannot answer on an individual basis. The point of this document is to provide information and examples that help customers better understand the proximity search syntax, so that they can create the correct syntax for the search they want to perform.
Note: Most of the information here applies to all versions of Intella that support proximity searching. However, there are a few known limitations:
- There is a limitation with hit highlighting in versions prior to 1.9.1
- There is a limitation with running phrases within proximity searches prior to version 2.5.0
What is a Proximity Search?
A proximity search is search syntax crafted to find items based on words that occur within a specified maximum distance of each other in the item’s text. For example, to find all items that have the words 'desktop' and 'application' within 10 words of each other, I would use the following:
"desktop application"~10
A proximity search differs from a phrase search in that it does not matter whether 'desktop' appears before or after 'application' in the text. For example, documents containing either of the passages below will be responsive to the proximity search above.
"You must turn on your desktop computer before you can open an application."
"I have copied the shortcut for the application onto the desktop."
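To make the idea concrete, here is a minimal Python sketch of an order-independent "within N words" check. It is only an illustration of the concept, not how Intella evaluates proximity searches: the function name words_within, the simple word tokenizer, and the way distance is counted (difference in word positions) are all assumptions made for this example.

```python
import re

def words_within(text, first, second, max_distance):
    """Return True if `first` and `second` occur within `max_distance`
    words of each other, in either order (illustrative definition only)."""
    # Split the text into lowercase word tokens.
    tokens = re.findall(r"\w+", text.lower())
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    # Order does not matter, so compare the absolute difference in position.
    return any(
        abs(a - b) <= max_distance
        for a in first_positions
        for b in second_positions
    )

passage_1 = "You must turn on your desktop computer before you can open an application."
passage_2 = "I have copied the shortcut for the application onto the desktop."

print(words_within(passage_1, "desktop", "application", 10))  # True
print(words_within(passage_2, "desktop", "application", 10))  # True
```

Both passages match even though the two words appear in opposite order, which is the difference from a phrase search described above.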
Here is an example of a proximity search being used in Intella.
Using the Correct Proximity Search Syntax
As mentioned above, we receive proximity search syntax from customers. We often see search strings such as the examples below:
- (Baxter Jason) ~20 (article) OR (paper) OR (presentation) OR (public) OR (report)
- "national OR fire OR service"~30 (truck) OR (department)
- reading w2 glasses
These examples have been sanitized and shortened; the original search strings contained several lines of OR statements. This makes a search string complex, cumbersome, prone to errors and difficult to troubleshoot.
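The way I read the first example is as follows: find all items that have Baxter or Jason within 20 words of article, paper, presentation, public, or report. The syntax can be rewritten this way:
"(Baxter OR Jason) (article OR paper OR presentation OR public OR report)"~20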
The way I read the second example is as follows: find all items that have national, fire, or service within 30 words of truck or department. The syntax can be rewritten this way:
"(national OR fire OR service) (truck OR department)"~30
Again, we use parentheses to group the search terms into two groups, and make sure that all terms are enclosed in a single pair of double quotes.
The third example shows an issue that we see a lot. Many other tools use the w (within) and n (near) operators, e.g. reading w2 glasses. This syntax does not work in Intella.
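As a rough aid for catching the mistakes shown above, the Python sketch below checks whether a search string has the same overall shape as the rewritten examples: one double-quoted expression followed by ~ and a number. This is only a shape check based on the examples in this document. It is not a parser for Intella's query syntax, and the name has_proximity_shape is made up for this illustration.

```python
import re

# Shape of the rewritten examples in this document:
# a single double-quoted expression, then ~ and a whole number.
PROXIMITY_SHAPE = re.compile(r'^"[^"]+"~\d+$')

def has_proximity_shape(query):
    """Return True if the whole string looks like "..."~N (shape check only)."""
    return bool(PROXIMITY_SHAPE.match(query.strip()))

examples = [
    '(Baxter Jason) ~20 (article) OR (paper) OR (presentation) OR (public) OR (report)',
    '"national OR fire OR service"~30 (truck) OR (department)',
    'reading w2 glasses',
    '"(Baxter OR Jason) (article OR paper OR presentation OR public OR report)"~20',
    '"(national OR fire OR service) (truck OR department)"~30',
]

for query in examples:
    print(has_proximity_shape(query), query)
```

The three customer examples fail the check and the two rewritten versions pass. A string that passes can still be wrong in other ways, for example through unbalanced parentheses, so treat this as a first filter only.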
Phrases within proximity searches are supported in Intella version 2.5.0 and later. To use a phrase, enclose it in single quotes. Here are a few examples.
- The search below is used to find evidence of suspected stolen documents. The search finds the phrase chemical formula within 50 words of either copied or attached. Note that in this example, the phrase 'chemical formula' is enclosed in single quotes.
  "'chemical formula' (copied OR attached)"~50
- The search below is used to locate a fraudulent invoice. The names of the two firms are searched, and these are highlighted in the document. Again, the company names are enclosed in single quotes.
  "'abc engineering' 'bobs construction'"~20
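If you build these search strings often, a small helper can keep the quoting and grouping consistent. The Python sketch below is one possible approach, not an Intella feature: the function names are hypothetical, and it assumes exactly two groups of terms, with any multi-word term treated as a phrase (which requires Intella 2.5.0 or later).

```python
def format_term(term):
    """Wrap multi-word phrases in single quotes; leave single words as-is."""
    term = term.strip()
    return f"'{term}'" if len(term.split()) > 1 else term

def format_group(terms):
    """Join alternatives with OR inside parentheses; a lone term needs none."""
    formatted = [format_term(t) for t in terms]
    if len(formatted) == 1:
        return formatted[0]
    return "(" + " OR ".join(formatted) + ")"

def build_proximity_query(group_a, group_b, distance):
    """Enclose both groups in one pair of double quotes, followed by ~distance."""
    return f'"{format_group(group_a)} {format_group(group_b)}"~{distance}'

print(build_proximity_query(["chemical formula"], ["copied", "attached"], 50))
print(build_proximity_query(["abc engineering"], ["bobs construction"], 20))
print(build_proximity_query(["national", "fire", "service"], ["truck", "department"], 30))
```

The three calls reproduce the 'chemical formula', company name, and national/fire/service search strings shown in this document.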
- In the past we have seen very long proximity search strings where the syntax contained over 40 words separated by the OR operator. Constructing complex proximity searches like this is not recommended. Even if the syntax is correct, 40 words in a proximity search makes the search string complex, cumbersome, prone to errors and difficult to troubleshoot.
- We have also received extremely long search syntax where all search terms contained wildcards. Such long and complex queries with many wildcards are known to have very poor performance, especially for hit highlighting in the Previewer window.
There are a couple of methods one could use to manage complex proximity searches that contain a large number of search terms separated by the OR operator. The first method is to break down the search string. The second method is to use keyword lists.
A complex search string can be broken down into several shorter proximity search strings. The shorter search strings can then be placed into a keyword list.
Intella will be able to process the list of shorter proximity searches more efficiently than one large complex search string.
With a small amount of Excel work you can create a keyword list that includes all of your shortened proximity searches in a single list.
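As an alternative to doing this by hand in Excel, the breakdown can be scripted. The Python sketch below splits the two groups of terms into smaller chunks and writes one short proximity search per chunk pair. Together the short searches cover the same term combinations as the original long search. The function name, the chunk size, and the assumption that a keyword list is a plain text file with one search per line are illustrative choices, so check them against your own version of Intella.

```python
def split_proximity_query(left_terms, right_terms, distance, chunk_size=5):
    """Break one long proximity search into several shorter ones.

    Every chunk of left-hand terms is paired with every chunk of
    right-hand terms, so no term combination is lost.
    """
    def chunks(items, size):
        return [items[i:i + size] for i in range(0, len(items), size)]

    def group(terms):
        # A single term needs no parentheses or OR.
        return terms[0] if len(terms) == 1 else "(" + " OR ".join(terms) + ")"

    queries = []
    for left in chunks(left_terms, chunk_size):
        for right in chunks(right_terms, chunk_size):
            queries.append(f'"{group(left)} {group(right)}"~{distance}')
    return queries

queries = split_proximity_query(
    ["national", "fire", "service"],
    ["truck", "department"],
    distance=30,
    chunk_size=2,
)

# One search per line, ready to paste into a keyword list file.
keyword_list_text = "\n".join(queries)
print(keyword_list_text)
```

A smaller chunk size gives shorter individual searches but more lines in the keyword list, so the right value depends on how many terms you start with.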
[Example: Keyword list 1 and Keyword list 2]
Next, run the two keyword lists and Tag the overlapping cluster. This cluster will contain the items that have search terms from both keyword lists.
Set this Tag as a Require search and run the proximity search. This provides faster searching as you are not searching over the entire dataset. However, be aware that hit highlighting can still be slow or have issues if the proximity search is complex and contains many wildcards.
Updated June 2022