Class: Ferret::Analysis::LetterTokenizer
Summary
A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as matched by the regular expression _/[[:alpha:]]+/_, where [:alpha:] matches every alphabetic character in the current locale.
Example
"Dave's résumé, at http://www.davebalmain.com/ 1234"
=> ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
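The rule above can be approximated in plain Ruby, without Ferret, by scanning for the same regular expression. This is only a sketch under stated assumptions: it assumes a UTF-8 string, where Ruby's [[:alpha:]] matches Unicode letters, whereas the tokenizer itself depends on the process locale. The helper name letter_tokens is hypothetical and not part of Ferret.

```ruby
# Hypothetical helper, not part of Ferret: approximates the tokenizer's
# rule by collecting maximal runs of adjacent letters.
def letter_tokens(text)
  # String#scan returns every non-overlapping match of the pattern,
  # so apostrophes, punctuation, and digits act as token boundaries.
  text.scan(/[[:alpha:]]+/)
end

p letter_tokens("Dave's résumé, at http://www.davebalmain.com/ 1234")
```

Because digits are not letters, "1234" yields no token at all, and the apostrophe splits "Dave's" into two tokens.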
Public Class Methods
LetterTokenizer.new(lower = true) → tokenizer
Create a new LetterTokenizer which optionally downcases tokens. Downcasing is done according to the current locale.
| lower: | set to false if you don't wish to downcase tokens |
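The effect of the lower flag can be sketched in plain Ruby. This is an illustration, not Ferret's implementation: the function name sketch_letter_tokens is hypothetical, its default mirrors the documented signature, and String#downcase stands in for the locale-aware downcasing that the tokenizer performs.

```ruby
# Hypothetical stand-in, not part of Ferret: splits text into runs of
# letters and optionally downcases each token.
def sketch_letter_tokens(text, lower = true)
  tokens = text.scan(/[[:alpha:]]+/)
  # Assumption: String#downcase approximates locale-based downcasing.
  lower ? tokens.map(&:downcase) : tokens
end

p sketch_letter_tokens("Dave's résumé")         # downcased tokens
p sketch_letter_tokens("Dave's résumé", false)  # original case preserved
```

Passing false keeps tokens exactly as they appear in the input, which matters when case carries meaning, for example in acronyms or proper nouns.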
/*
* call-seq:
* LetterTokenizer.new(lower = true) -> tokenizer
*
* Create a new LetterTokenizer which optionally downcases tokens. Downcasing
* is done according to the current locale.
*
* lower:: set to false if you don't wish to downcase tokens
*/
static VALUE
frt_letter_tokenizer_init(int argc, VALUE *argv, VALUE self)
{
    TS_ARGS(false);
#ifndef POSH_OS_WIN32
    if (!frt_locale) frt_locale = setlocale(LC_CTYPE, "");
#endif
    return get_wrapped_ts(self, rstr, mb_letter_tokenizer_new(lower));
}