Icon Full Text Indexing

Introduction
DBISAM provides the ability to index string and memo fields so that they may be quickly searched for a given word or words. This is known as full text indexing since it results in the indexing of every word in every column that is specified as part of the full text index for the table. This whole process is controlled by full text indexing parameters that are defined as part of the table structure when creating or altering the structure of tables, as well as events in the TDBISAMEngine component for customizing the full text indexing. Please see the Customizing the Engine for more information.

Information Full text indexing and searching is always case-insensitive in DBISAM. This means that words are always compared without regard for case, however the include and space character full text indexing parameters are compared on an exact character basis when parsing the text to be indexed or searched.

Text Indexing Parameters
The three parameters that control the full text indexing behavior for a given table are:

ParameterDescription
Stop Words ListThe stop words list is a list of words that are to be excluded from the full text index. These words are usually very common words and excluding them from the full text index can result in tremendous space savings for the physical index. The default stop words for a table are as follows:

A
AN
AND
BE
FOR
HOW
IN
IS
IT
OF
ON
OR
THAT
THE
THIS
TO
WAS
WHAT
WHEN
WHICH
WHY
WILL

The stop words list is always case-insensitive, as is the full text indexing in general.
Space CharactersThe space characters specify which characters in the ANSI character set are to be used for word separator characters. These characters usually consist of any character below the ordinal value of 33 and other separators such as backslashes (\)and commas (,). The default space characters for a table are as follows:

Characters 1 through 32
*+,-./:;<=>\
Include CharactersThe include characters specify which characters in the ANSI character set are to be included in the words that are finally used for the full text index. These characters usually consist of all alphanumeric characters as well as all high character values in the ANSI character set that are used by non-English languages for accented characters and other diacritically-marked characters. The default include characters for a table are as follows:

0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ_
abcdefghijklmnopqrstuvwxyz





Information You must alter the structure of a table in order to change any of these parameters.

Performing a Text Search
DBISAM includes the filter and SQL TEXTSEARCH function in order to take advantage of the full text index and also as a general-purpose, brute-force, word search function when searching on string or memo columns that are not part of the full text index. The TEXTSEARCH function accepts a list of words in a search string constant and a column name as its two parameters. The following is an example of using the TEXTSEARCH function in an expression filter:

begin
   with MyTable do
      begin
      Filter:='TEXTSEARCH('+QuotedStr('DATABASE QUERY SPEED')+
              ' IN TextBody)';
      Filtered:=True;
      end;
end;

In the above example, if the TextBody column is included as part of the indexed fields that make up the full text index then the filter will execute very quickly. If the column is not part of the full text index, then the filter will be forced to resort to a brute-force scan of the TextBody column for every record in the table. To further explain how the text searching works, let's break down the previous process (assuming an optimized text search):

1) DBISAM parses the search string constant "DATABASE QUERY SPEED" into three words (DATABASE, QUERY, and SPEED) using the space characters and include characters specified for the table, which by default would allow for the space character ( ) as the word separator in this case. If there happened to be a backslash in the search string such as "DATABASE C:\TEMP" then the search string would be parsed into three words "DATABASE C TEMP". This is because the default full text indexing space characters include the colon (:) and the backslash (\).

2) DBISAM takes the list of words created from the text indexing parameters and performs a case-insensitive search of the stop words for the table to see if any of the words exist in the stop words. If one or more does, then they are ignored when performing the actual search.

3) Finally, DBISAM searches the full text index and builds a bitmap for each word indicating which records satisfy the search for that particular word. These bitmaps are ANDed together and the resultant bitmap is used for filtering the records in the table. This process is very similar to a what happends with a normal optimized filter expression or SQL WHERE clause.

Information DBISAM only executes an AND search for the multiple words in the search string. If you want to execute an OR search for multiple words you should split up the operation into multiple TEXTSEARCH calls as the following example illustrates:

begin
   with MyTable do
      begin
      Filter:='TEXTSEARCH('+QuotedStr('DATABASE')+
              ' IN TextBody) OR '+
              'TEXTSEARCH('+QuotedStr('QUERY')+
              ' IN TextBody) OR '+
              'TEXTSEARCH('+QuotedStr('SPEED')+
              ' IN TextBody)';
      Filtered:=True;
      end;
end;

This will give the desired results of returning all records where either DATABASE, QUERY, or SPEED are found in the TextBody column.

You can also specify partial-word searches using the asterisk as a trailing wildcard character. The following is an example of using the TEXTSEARCH function in an expression filter with a partial-word search string:

begin
   with MyTable do
      begin
      Filter:='TEXTSEARCH('+QuotedStr('DATA*')+
              ' IN TextBody)';
      Filtered:=True;
      end;
end;

Information You can mix and match partial words and whole words in the same search string.

If you wish to find out what words were searched on in a TEXTSEARCH function or just want to test various text indexing parameters, you can do so using the TDBISAMEngine BuildWordList method.

Retrieving the Number of Occurrences
DBISAM includes the filter and SQL TEXTOCCURS function in order to calculate the number of times a specific list of words in a given search string occurs in a string or memo column. This function is always a brute-force function and should not be used for the bulk of the filtering or searching but rather for ranking purposes or something similar after the search has been completed using the TEXTSEARCH function. The TEXTOCCURS function accepts a list of words in a search string constant and a column name as its two parameters. The following is an example of using the TEXTOCCURS function in an SQL SELECT statement:

SELECT GroupNo, No, TEXTOCCURS('DATABASE QUERY SPEED' IN TextBody) AS NumOccurs
FROM article
WHERE TEXTSEARCH('DATABASE QUERY SPEED' IN TextBody)
ORDER BY 3 DESC

As you can see, the TEXTOCCURS function is being used to provide ranking of the results by the number of times the search words occur in the TextBody column after the bulk of the search was already handled by the optimized TEXTSEARCH function.
Image