Icon Text Indexing

ElevateDB provides the ability to index CHAR, VARCHAR, or CLOB columns so that they may be quickly searched for a given word or words. This is known as text indexing since it results in the indexing of every word in a specified column.

The following image illustrates the general architecture of the text indexing in ElevateDB:

Image

You can use the CREATE TEXT INDEX statement to create a new text index on a given column. When creating a text index, you may specify the CHAR, VARCHAR, or CLOB column to index, the indexed word length, optionally another CHAR or VARCHAR column to use as a text filter type indicator to ElevateDB, and optionally a specific word generator module to use for generating the actual words that are added to the index.

Specifying the Indexed Column
Each text index can index one, and only one, CHAR, VARCHAR, or CLOB column. By default, ElevateDB always uses the collation of the column that is being indexed and modifies it so that all comparisons are case-insensitive. However, if a collation is explicitly specified along with the indexed column, then ElevateDB will use that collation instead of a case-insensitive version of the column being indexed. For example, the following text index overrides the default collation of the Notes column (ANSI_CI) so that the text index uses a case-sensitive version:

CREATE TEXT INDEX "Notes" ON "Customer"
(Notes COLLATE ANSI)
INDEXED WORD LENGTH 20

Information It is generally recommended that you always use a case-insensitive collation with any text index in order to reduce the size of the text index and to make searching easier.

Specifying the Indexed Word Length
The indexed word length controls how long each index key is in the actual text index. It does not affect which words are indexed in any way. However, if a word that is being indexed is longer than the indexed word length specified when the text index was created, then the word will be truncated to the indexed word length. If the indexed word length is not specified, then the default indexed word length of 30 characters is used. You should try to keep the indexed word length as small as possible in order to minimize the size of the text index.

Information The minimum word length indexed by the default word generator is 3 characters. Any word smaller than 3 characters will not be included in the text index.

Specifying a Filter Type Column
The contents of a filter type column indicate to ElevateDB what type of data is in the column being indexed. This means that you can store text with various types of formatting in the same column and still have the text index only index the non-formatting information. The filter type indicator is used to look up the applicable text filter in the defined Text Filters in the current Configuration Database. If a matching text filter is found, then the text to be indexed is first passed to the text filter before being passed on to the word generator (see below). If a matching text filter is not found, then the text is passed on directly to the word generator without being filtered.

For example, suppose that you have a column in your table called Notes and a column called TypeOfNotes. The Notes column may contain either plain text, HTML-formatted text, or RTF-formatted text, and the type of text is indicated by the TypeOfNotes column, which will contain either a NULL (plain text), 'HTML' (HTML Text), or 'RTF' (RTF Text) value in each row. In addition, you have defined two text filters that use external modules to parse out all non-formatting text and return it to ElevateDB for use in the word generation:

CREATE TEXT FILTER HTMLFilter
TYPE 'HTML'
MODULE HTMLTextFilterModule
DESCRIPTION 'HTML Text Filter'

CREATE TEXT FILTER RTFFilter
TYPE 'RTF'
MODULE RTFTextFilterModule
DESCRIPTION 'RTF Text Filter'

Whenever the Notes column is updated, the appropriate text filter will be called with the new contents of the Notes column, and the filtered text that is returned will be passed on to the word generation process.

Please see your product-specific manual for information on creating external modules that can implement text filtering.

Specifying a Word Generator
By default, ElevateDB will use the following parameters when parsing and generating words from text:

Space Characters

Space characters are used to determine which characters should be treated as whitespace. Word breaks always occur at any character that is considered whitespace.

#0..#47,#58..#64,
#91..#96,#123..#130,
#132..#137,#139,#141,
#143..#153,#155,#157,
#160..#191,#215,#247

All numeric values represent the ordinal character value in the 256 characters of the Windows ANSI Code Page 1252 character set.

Include Characters

Include characters are used to determine which characters should be included in the words that are generated. Any character that isn't an include character or space character is simply ignored.

'A'..'Z',
'a'..'z',#131,
#138,#140,#142,
#154,#156,#158..#159,
#192..#214,#216..#246,
#248..#255

All numeric values represent the ordinal character value in the 256 characters of the Windows ANSI Code Page 1252 character set.

Stop Words

Stop words are words that are so common in most text that they provide no value in terms of narrowing the search process and increase the size of the text index. Stop words are sometimes also referred to as noise words.

'ABOUT','ABOVE','AFAIK','ALL','ALONG','ALSO','ALTHOUGH','AND','ARE','ARENT',
'BECAUSE','BEEN','BTW','BUT','CAN','CANNOT','CANT','COULD','COULDNT','DID',
'DIDNT','DOES','DOESNT','DUH','EITHER','ETC','EVEN','EVER','FOR','FROM',
'FURTHERMORE','FYI','GET','GETS','GOT','GOTTEN','HAD','HADNT','HARDLY',
'HAS','HASNT','HAVING','HENCE','HER','HERE','HERS','HEREBY','HEREIN',
'HEREOF','HEREON','HERETO','HEREWITH','HIM','HIS','HOW','HOWEVER','IMHO','IMO',
'INTO','ISNT','ITS','LOL','MINE','NOR','NOT','ONTO','OTHER','OTOH','OUR',
'OURS','OUT','OVER','REALLY','ROTFL','SAID','SAME','SHE','SHOULD','SHOULDNT',
'SINCE','SOMEWHAT','SUCH','THAN','THAT','THATLL','THATS','THE','THEIR',
'THEIRS','THEM','THEN','THERE','THEREBY','THEREFORE','THEREFROM',
'THEREIN','THEREOF','THEREON','THERETO','THEREWITH','THESE','THEY',
'THEYLL','THEYRE','THIS','THOSE','THROUGH','THROUGHOUT','THUS','TIA','TOO',
'UNDER','UNTIL','UNTO','UPON','VERY','WAS','WASNT','WERE','WERENT','WHAT',
'WHEN','WHERE','WHEREBY','WHEREIN','WHETHER','WHICH','WHILE','WHO','WHOM',
'WHOS','WHOSE','WHY','WITH','WITHIN','WITHOUT','WONT','WOULD','WOULDNT',
'YOU','YOULL','YOUR','YOURE','YOURS'

In order to override the default word generation, one must specify a different word generator when creating a text index.

Please see your product-specific manual for information on creating external modules that can implement custom word generation.

Performing a Text Index Search
ElevateDB includes CONTAINS, DOES NOT CONTAIN, CONTAINS ANY, and DOES NOT CONTAIN ANY operators for searching a text index for a series of words. The difference between CONTAINS and CONTAINS ANY (and their negatives) is that the CONTAINS operator performs an ANDed search of all specified search words, while the CONTAINS ANY operator performs an ORed search of all specified search words. The following is an example of using the CONTAINS operator to search for the word 'Development':

SELECT *
FROM Customer
WHERE Notes CONTAINS 'Development'

Information The CONTAINS, CONTAINS ANY, DOES NOT CONTAIN, and DOES NOT CONTAIN ANY operators can only be used with columns that have been indexed with a text index. Using them with a non-text-indexed column will result in an error.

If multiple search words are specified with the CONTAINS or DOES NOT CONTAIN operators, then ElevateDB will return all rows that contain all of the search words. If you want to return all rows that contain only some of the search words, then you will need to use the CONTAINS ANY or DOES NOT CONTAIN ANY operators. For example, if you want to return all rows that contain either the word 'Development' or 'Vacation' in the Notes column, then you would use the following SELECT statement:

SELECT *
FROM Customer
WHERE Notes CONTAINS ANY 'Development Vacation'

You can also specify partial-word searches by using an asterisk (*) anywhere in any of the search words. The following is an example of using the DOES NOT CONTAIN operator to find all rows that don't contain any version of the word 'Develop' in the Notes column:

SELECT *
FROM Customer
WHERE Notes DOES NOT CONTAIN 'Develop*'

The following is an example of using the CONTAINS operator to find all rows that contain 'invest' in any words in the Notes column:

SELECT *
FROM Customer
WHERE Notes CONTAINS '*invest*'

You can mix and match search words with wildcards and whole search words in the same search string.
Image