Class: Ferret::Analysis::LetterTokenizer

Summary

A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as matched by the regular expression _/[[:alpha:]]+/_, where [:alpha:] matches all alphabetic characters in the current locale.

Example

  "Dave's résumé, at http://www.davebalmain.com/ 1234"
    => ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
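The splitting rule can be approximated in plain Ruby, without Ferret installed. This is only a minimal sketch, not Ferret's implementation: `letter_tokens` is a hypothetical helper, and it assumes a UTF-8 string, for which Ruby's `[[:alpha:]]` is Unicode-aware rather than driven by the C locale.

```ruby
# Hypothetical helper (not part of Ferret): approximates the LetterTokenizer
# rule by collecting maximal runs of adjacent letters.
def letter_tokens(text)
  # String#scan returns every non-overlapping match of the pattern, so
  # apostrophes, digits, punctuation, and whitespace all act as separators.
  text.scan(/[[:alpha:]]+/)
end

tokens = letter_tokens("Dave's résumé, at http://www.davebalmain.com/ 1234")
p tokens
```

The digits in "1234" yield no token at all, because no letter run occurs in them.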

Public Class Methods


LetterTokenizer.new(lower = true) → tokenizer

Create a new LetterTokenizer which optionally downcases tokens. Downcasing is done according to the current locale.

lower: set to false if you don't wish to downcase tokens
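The effect of the lower argument can be illustrated with a plain-Ruby approximation. This is a sketch under stated assumptions, not Ferret's code: `letter_tokens_with_case` is a hypothetical helper that mirrors the documented `lower = true` default, and it uses Ruby's `String#downcase` (Unicode-aware from Ruby 2.4) instead of locale-based downcasing.

```ruby
# Hypothetical helper (not part of Ferret): letter tokenization with an
# optional downcasing step, mirroring the documented `lower` argument.
def letter_tokens_with_case(text, lower = true)
  tokens = text.scan(/[[:alpha:]]+/)
  # Downcase each token only when requested; otherwise keep the original case.
  lower ? tokens.map(&:downcase) : tokens
end

p letter_tokens_with_case("Dave's CV")         # tokens are downcased
p letter_tokens_with_case("Dave's CV", false)  # original case is kept
```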
/*
 *  call-seq:
 *     LetterTokenizer.new(lower = true) -> tokenizer
 *
 *  Create a new LetterTokenizer which optionally downcases tokens. Downcasing
 *  is done according to the current locale.
 *
 *  lower:: set to false if you don't wish to downcase tokens
 */
static VALUE
frt_letter_tokenizer_init(int argc, VALUE *argv, VALUE self) 
{
    TS_ARGS(false);
#ifndef POSH_OS_WIN32
    if (!frt_locale) frt_locale = setlocale(LC_CTYPE, "");
#endif
    return get_wrapped_ts(self, rstr, mb_letter_tokenizer_new(lower));
}