Class: Ferret::Analysis::StemFilter

Summary

A StemFilter takes a term and transforms it according to the Snowball stemming algorithm. Note: the input to the stemming filter must already be in lower case, so a LowerCaseFilter or a lowercasing Tokenizer must come earlier in the TokenStream chain (that is, be wrapped by the StemFilter) for this to work properly.

Available algorithms and encodings

  Algorithm       Algorithm Pseudonyms       Encoding
  ----------------------------------------------------------------
   "danish",     | "da", "dan"              | "ISO_8859_1", "UTF_8"
   "dutch",      | "dut", "nld"             | "ISO_8859_1", "UTF_8"
   "english",    | "en", "eng"              | "ISO_8859_1", "UTF_8"
   "finnish",    | "fi", "fin"              | "ISO_8859_1", "UTF_8"
   "french",     | "fr", "fra", "fre"       | "ISO_8859_1", "UTF_8"
   "german",     | "de", "deu", "ge", "ger" | "ISO_8859_1", "UTF_8"
   "hungarian",  | "hu", "hun"              | "ISO_8859_1", "UTF_8"
   "italian",    | "it", "ita"              | "ISO_8859_1", "UTF_8"
   "norwegian",  | "nl", "no"               | "ISO_8859_1", "UTF_8"
   "porter",     |                          | "ISO_8859_1", "UTF_8"
   "portuguese", | "por", "pt"              | "ISO_8859_1", "UTF_8"
   "romanian",   | "ro", "ron", "rum"       | "ISO_8859_2", "UTF_8"
   "russian",    | "ru", "rus"              | "KOI8_R",     "UTF_8"
   "spanish",    | "es", "esl"              | "ISO_8859_1", "UTF_8"
   "swedish",    | "sv", "swe"              | "ISO_8859_1", "UTF_8"
   "turkish",    | "tr", "tur"              |               "UTF_8"

New Stemmers

The following stemmers have recently been added. Please try them out:

  * Hungarian
  * Romanian
  * Turkish
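
The newer stemmers are also the ones whose encodings differ most from the rest of the table (Romanian uses ISO_8859_2, Turkish is UTF_8 only), so it can help to check a name and encoding pair before building a filter. The following is a minimal sketch in plain Ruby, not part of Ferret: the STEMMERS constant and both helper methods are hypothetical names, and the rows are simply transcribed from the table above.

```ruby
# Rows transcribed from the table above:
# canonical name => [pseudonyms, supported encodings].
# STEMMERS and the helpers below are illustrative, not Ferret API.
STEMMERS = {
  "danish"     => [%w[da dan],        %w[ISO_8859_1 UTF_8]],
  "dutch"      => [%w[dut nld],       %w[ISO_8859_1 UTF_8]],
  "english"    => [%w[en eng],        %w[ISO_8859_1 UTF_8]],
  "finnish"    => [%w[fi fin],        %w[ISO_8859_1 UTF_8]],
  "french"     => [%w[fr fra fre],    %w[ISO_8859_1 UTF_8]],
  "german"     => [%w[de deu ge ger], %w[ISO_8859_1 UTF_8]],
  "hungarian"  => [%w[hu hun],        %w[ISO_8859_1 UTF_8]],
  "italian"    => [%w[it ita],        %w[ISO_8859_1 UTF_8]],
  "norwegian"  => [%w[nl no],         %w[ISO_8859_1 UTF_8]],
  "porter"     => [[],                %w[ISO_8859_1 UTF_8]],
  "portuguese" => [%w[por pt],        %w[ISO_8859_1 UTF_8]],
  "romanian"   => [%w[ro ron rum],    %w[ISO_8859_2 UTF_8]],
  "russian"    => [%w[ru rus],        %w[KOI8_R UTF_8]],
  "spanish"    => [%w[es esl],        %w[ISO_8859_1 UTF_8]],
  "swedish"    => [%w[sv swe],        %w[ISO_8859_1 UTF_8]],
  "turkish"    => [%w[tr tur],        %w[UTF_8]]
}.freeze

# Maps a canonical name or pseudonym to the canonical algorithm name.
# Returns nil when the name is not in the table.
def canonical_algorithm(name)
  key = name.to_s
  return key if STEMMERS.key?(key)
  found = STEMMERS.find { |_canonical, (aliases, _encodings)| aliases.include?(key) }
  found && found.first
end

# True when the table lists the encoding for the given algorithm.
def encoding_supported?(name, encoding)
  canonical = canonical_algorithm(name)
  return false if canonical.nil?
  STEMMERS[canonical][1].include?(encoding)
end
```

For instance, canonical_algorithm("hun") returns "hungarian", and encoding_supported?("tr", "ISO_8859_1") returns false because the table lists only UTF_8 for Turkish.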

Example

To use this filter with other analyzers, you'll want to write an Analyzer class that sets up the TokenStream chain as you want it. To combine it with a StandardTokenizer and a LowerCaseFilter, for example, you'd write an analyzer like this:

  class MyAnalyzer < Analyzer
    def token_stream(field, str)
      return StemFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)))
    end
  end

  "debate debates debated debating debater"
    => ["debat", "debat", "debat", "debat", "debat"]
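
To see why the lowercasing step has to sit inside the stemming step, here is a toy model of the chain in plain Ruby. It is only a sketch: none of the classes below belong to Ferret, and ToyStemFilter strips a handful of suffixes case-sensitively rather than running the real Snowball algorithm. Each stage wraps the previous one and exposes a next method that returns the next token text or nil.

```ruby
# Toy model of a token-stream chain. These classes are hypothetical
# stand-ins, not Ferret's Tokenizer, LowerCaseFilter, or StemFilter.
class ToyTokenizer
  def initialize(str)
    @words = str.split
  end

  # Returns the next token text, or nil when the input is exhausted.
  def next
    @words.shift
  end
end

class ToyLowerCaseFilter
  def initialize(sub_ts)
    @sub_ts = sub_ts
  end

  def next
    text = @sub_ts.next
    text && text.downcase
  end
end

class ToyStemFilter
  # Assumption: a tiny, case-sensitive suffix list, NOT the Snowball rules.
  SUFFIXES = %w[ing ed es er s e].freeze

  def initialize(sub_ts)
    @sub_ts = sub_ts
  end

  def next
    text = @sub_ts.next
    return nil if text.nil?
    # Strip the first matching suffix, keeping a stem of at least 3 characters.
    suffix = SUFFIXES.find { |s| text.end_with?(s) && text.length > s.length + 2 }
    suffix ? text[0...-suffix.length] : text
  end
end

# Collects every token text from a stream into an array.
def drain(ts)
  out = []
  while (text = ts.next)
    out << text
  end
  out
end

lowered = ToyStemFilter.new(ToyLowerCaseFilter.new(ToyTokenizer.new("Debating DEBATED")))
raw     = ToyStemFilter.new(ToyTokenizer.new("Debating DEBATED"))
p drain(lowered)  # => ["debat", "debat"]
p drain(raw)      # => ["Debat", "DEBATED"]
```

Without the lowercasing stage, the two forms of the same word end up as different terms, which is the failure the note in the summary warns about.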

Attributes

token_stream: TokenStream to be filtered
algorithm:    The algorithm (or language) to use (default: "english")
encoding:     The encoding of the data (default: "UTF-8")

Public Class Methods


StemFilter.new(token_stream) → token_stream
StemFilter.new(token_stream,
               algorithm="english",
               encoding="UTF-8") → token_stream

Creates a StemFilter that uses a Snowball stemmer (thank you Martin Porter) to stem words. You can optionally specify the algorithm (default: "english") and encoding (default: "UTF-8").

token_stream: TokenStream to be filtered
algorithm:    The algorithm (or language) to use (default: "english")
encoding:     The encoding of the data (default: "UTF-8")
/* 
 *  call-seq:
 *     StemFilter.new(token_stream) -> token_stream
 *     StemFilter.new(token_stream,
 *                    algorithm="english",
 *                    encoding="UTF-8") -> token_stream
 *
 *  Create a StemFilter which uses a snowball stemmer (thank you Martin
 *  Porter) to stem words. You can optionally specify the algorithm (default:
 *  "english") and encoding (default: "UTF-8").
 *
 *  token_stream:: TokenStream to be filtered
 *  algorithm::    The algorithm (or language) to use
 *  encoding::     The encoding of the data (default: "UTF-8")
 */
static VALUE
frt_stem_filter_init(int argc, VALUE *argv, VALUE self) 
{
    VALUE rsub_ts, ralgorithm, rcharenc;
    char *algorithm = "english";
    char *charenc = NULL;
    TokenStream *ts;
    rb_scan_args(argc, argv, "12", &rsub_ts, &ralgorithm, &rcharenc);
    ts = frt_get_cwrapped_rts(rsub_ts);
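    /* Cases fall through on purpose: three arguments set both the encoding
     * and the algorithm, two arguments set only the algorithm. */
    switch (argc) {
        case 3: charenc = rs2s(rb_obj_as_string(rcharenc));
                /* fall through */
        case 2: algorithm = rs2s(rb_obj_as_string(ralgorithm));
    }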
    ts = stem_filter_new(ts, algorithm, charenc);
    object_add(&(TkFilt(ts)->sub_ts), rsub_ts);

    Frt_Wrap_Struct(self, &frt_tf_mark, &frt_tf_free, ts);
    object_add(ts, self);
    if (((StemFilter *)ts)->stemmer == NULL) {
        /* charenc is NULL when no encoding was given; never pass NULL to %s */
        rb_raise(rb_eArgError, "No stemmer could be found with the encoding "
                 "%s and the language %s",
                 charenc ? charenc : "(default)", algorithm);
    }
    return self;
}
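
The argument handling in the C function above can be modelled in plain Ruby to make the defaults easier to follow. This is a hedged sketch, not Ferret code: stem_filter_args and KNOWN_STEMMERS are hypothetical names, the lookup holds only two rows from the algorithm table, and treating a missing encoding as UTF_8 is an assumption of this model. What it does mirror from the source is the one required plus two optional arguments, the conversion of both optional arguments with to_s (the Ruby-level equivalent of rb_obj_as_string), and the ArgumentError raised when no stemmer matches.

```ruby
# Hypothetical stand-in for the stemmer lookup; two rows from the table.
KNOWN_STEMMERS = {
  "english" => %w[ISO_8859_1 UTF_8],
  "russian" => %w[KOI8_R UTF_8]
}.freeze

# Models the argument handling only and returns [algorithm, charenc].
def stem_filter_args(*args)
  # rb_scan_args with "12": one required argument, up to two optional ones.
  unless (1..3).cover?(args.length)
    raise ArgumentError, "expected 1 to 3 arguments, got #{args.length}"
  end

  _token_stream, ralgorithm, rcharenc = args
  algorithm = "english"   # default set in the C code
  charenc   = nil         # NULL in the C code

  # Same effect as the switch with fallthrough: 3 args set both values,
  # 2 args set only the algorithm.
  charenc   = rcharenc.to_s   if args.length == 3
  algorithm = ralgorithm.to_s if args.length >= 2

  # Assumption of this model: a missing encoding is looked up as UTF_8.
  encodings = KNOWN_STEMMERS[algorithm]
  unless encodings && encodings.include?(charenc || "UTF_8")
    raise ArgumentError, "No stemmer could be found with the encoding " \
                         "#{charenc} and the language #{algorithm}"
  end
  [algorithm, charenc]
end

p stem_filter_args(:ts)                      # => ["english", nil]
p stem_filter_args(:ts, :russian, "KOI8_R")  # => ["russian", "KOI8_R"]
```

Because the optional arguments go through to_s, a symbol such as :russian is accepted in this model just like the string "russian".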