Cache DB는 2007버전에서 부터 (영어) 자연어 검색을 포함한 텍스트에 대한 검색 기능을 지원하고 있습니다. 비트맵 인덱스를 사용하여 어떠한 텍스트에 대해서도 1초 이내의 속도를 제공한다고 하니, 고려해 볼만 하겠군요.
관련하여, 아래와 같은 기능 정리가 있습니다. 우선, 영문으로 공유!
Free Text Search
Support has been added for indexing and searching textual data, specifically:
A new %Text data type.
A new SQL selection operator, %CONTAINS.
Language-specific parsers for English, Spanish, French, Italian, German, Japanese, and Portuguese.
Support for multi-word ("n-gram") indices. For instance, with an n-gram length of 1, all individual words are indexed. With an n-gram length of 2, all word pairs that occur together are also indexed.
Support for "stemming" to map multiple forms of a word (go, goes, going, went, …) to a common root.
Noise word filtering, to eliminate common words (a, and, the, …) from an index.
-----------
Cache SQL 사용하기 - 데이터베이스에 쿼리하기,
Using Free-text Search
Caché supports what is called “free-text search,” which includes support for:
Stemming
Multiple-word searches (also called n-grams )
Automatic classification
Dictionary management
This feature enables SQL to support full text indexing, and also enables SQL to index and reference individual elements of a collection without projecting the collection property as a child table. While the underlying mechanisms that support collection indexing and full text indexing are closely related, text retrieval has many special properties, and therefore special classes and SQL features have been provided for text retrieval.
For details on the underlying classes that support these features, see the %Text.Text class.
Full Text Indexing and Text Retrieval through SQL
The %Library.Text class and the %Text package has been provided to index text and to search textual data with SQL. To use the feature on an existing %String property, change %String to %Text and set the LANGUAGECLASS parameter. For English text, the declaration would be as follows:
Property myDocument As %Text ( LANGUAGECLASS="%Text.English" );
Index myFullTextIndex On myDocument(KEYS);
LANGUAGECLASS specifies the name of a helper class that provides the necessary interface to SQL and to the indexer so that efficient text indexing and text search may be carried out. The default MAXLEN for %Text is 256 bytes.
Once the %Text property has been declared and optionally indexed, you can issue full text queries using the SQL %CONTAINS operator, as follows:
SELECT plaintiff, legalBrief FROM transcript
WHERE plaintiff = 'John Doe' AND
legalBrief %CONTAINS ('neighbor', 'tree', 'roof')
The query above returns all transcripts where the plaintiff is 'John Doe' and where the terms 'neighbor' AND 'tree' AND 'roof' are in its legalBrief.
Caché includes language-specific parsers in the %Text package for English, Spanish, French, Italian, German, Japanese, Portuguese. While easy to use, the language specific subclasses can be configured to perform many sophisticated functions, such as word stemming to map various forms of the same word into a common root (block, blocks, blocking, and so on), or to support multi-word phrases (n-grams), or to filter out noise words, or to perform automatic text classification (for example, for junk-mail filtering), as well as other features.
Multi-word strings may be specified to %CONTAINS, even if the type class is not configured to support n-grams of the full length of the query. For example, the following predicate may be specified even if the Text class is configured to store only individual words:
SELECT author FROM famousQuotations WHERE
document %CONTAINS('put all your eggs in one basket',
'but watch that basket!')
The query above would return all authors where the documents contain exactly the specified phrases, even if the text class represents the document only with single words (NGRAMLEN=1). When the pattern is longer than the n-gram length as in the case above, SQL filters the rows with the "[" operator. Because the "[" operator is case-sensitive, the %CONTAINS operator is also case-sensitive on patterns longer than NGRAMLEN. An implication is that NGRAMLEN can also affect which rows get returned, as in the following case:
mission %CONTAINS('Fortune 5')
If NGRAMLEN >= 2, then only the rows containing 'Fortune 5' are returned.
If NGRAMLEN < 2, then rows containing 'Fortune 5', 'Fortune 50', 'Fortune 500', and so on may be returned, since all of these patterns satisfy the "[" test.
See the class documentation of the %Text.Text class and its language-specific subclasses (such as %Text.English) for more information about the capabilities of the Text interface.
Collection Indexing and Querying Collections through SQL
Collections may be referenced from the SQL WHERE clause with a FOR clause:
FOR SOME %ELEMENT(collectionRef) [AS label] (predicate)
where predicate may contain one reference to the pseudo-columns %KEY, %VALUE, or both. A few examples should help to clarify how the FOR SOME %ELEMENT clause may be used. The following returns the name and the list of FavoriteColors for each person whose FavoriteColors include 'Red'.
A list collection is considered a special case of an array collection that has sequential numeric keys 1, 2, and so on. Array collections may have arbitrary non-null keys:
FOR SOME (children) (%Key = 'betty' AND %Value > 5)
In addition to the built-in list and array collection types, generalized collections may be created by providing a BuildValueArray class method for any property. The BuildValueArray class method transforms the value of a property into a local array, where each subscript of the array is a %KEY and the value is the corresponding %VALUE.
In addition to simple selections on the %KEY or %VALUE, it is also possible to logically connect two collections, as in the following example:
FOR SOME %ELEMENT(flavors) AS f
(f.%VALUE IN ('Chocolate', 'Vanilla') AND
FOR SOME %ELEMENT(toppings) AS t
(t.%VALUE = 'Butterscotch' AND
f.%KEY = t.%KEY))
This example has two collections: flavors and toppings, that are positionally related through their key. The query qualifies a row that has chocolate or vanilla specified as an element of flavors, and that also has butterscotch listed as the corresponding topping, where the correspondence is established through the %KEY.
Usage Notes and Restrictions
FOR SOME %ELEMENT may only appear in the WHERE clause.
%CONTAINS may only appear in the WHERE clause.
%KEY and/or %VALUE may only appear in a FOR predicate.
Any particular %KEY or %VALUE may be referenced only once.
%KEY and %VALUE may not appear in an outer join.
%KEY and %VALUE may not appear in a value expression (only in a predicate).
Streams longer than the maximum strength of a string require the use of the %CONTAINSTERM predicate rather than the %CONTAINS predicate. For information on the maximum length of a string, see the section “Long Strings” in the article InterSystems Product Miscellany.
Streams longer than the maximum strength of a string only support the use of %SIMILARITY on indexed fields. For information on the maximum length of a string, see the section “Long Strings” in the article InterSystems Product Miscellany.