Class: Ferret::Analysis::StandardTokenizer
Summary
The standard tokenizer is an advanced tokenizer that tokenizes most words correctly and also recognizes tokens such as email addresses, web addresses, and phone numbers.
Example
"Dave's résumé, at http://www.davebalmain.com/ 1234"
=> ["Dave's", "résumé", "at", "http://www.davebalmain.com", "1234"]
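The token boundaries in the example above can be approximated with a small regular-expression scanner in plain Ruby. This is only a rough sketch for illustration, not Ferret's implementation: the names SKETCH_TOKEN and sketch_tokenize and the lower: keyword are hypothetical, the patterns are simplified assumptions, and phone numbers are not handled.

```ruby
# Illustrative approximation only; this is NOT how Ferret tokenizes.
# SKETCH_TOKEN and sketch_tokenize are hypothetical names.
SKETCH_TOKEN = %r{
    [a-z][a-z0-9+.\-]*://[\w.\-]+(?:/[\w.\-~%]+)*   # URL; a trailing slash is left out
  | [\w.+\-]+@[\w\-]+(?:\.[\w\-]+)+                 # email address
  | [[:alpha:]]+(?:'[[:alpha:]]+)*                  # word, allowing inner apostrophes
  | [[:digit:]]+                                    # run of digits
}xi

# Returns the matched tokens as an array of strings.
# Pass lower: true to downcase each token (uses Ruby's String#downcase,
# which is not locale-dependent).
def sketch_tokenize(text, lower: false)
  tokens = text.scan(SKETCH_TOKEN)
  lower ? tokens.map(&:downcase) : tokens
end

p sketch_tokenize("Dave's résumé, at http://www.davebalmain.com/ 1234")
```

The alternation order matters in this sketch: the URL and email patterns come first so that the plain word pattern does not split them at punctuation.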
Public Class Methods
StandardTokenizer.new(lower = true) → tokenizer
Create a new StandardTokenizer which optionally downcases tokens. Downcasing is done according to the current locale.
lower: set to false if you don't wish to downcase tokens
/*
 * call-seq:
 *    StandardTokenizer.new(lower = true) -> tokenizer
 *
 * Create a new StandardTokenizer which optionally downcases tokens.
 * Downcasing is done according to the current locale.
 *
 * lower:: set to false if you don't wish to downcase tokens
 */
static VALUE
frt_standard_tokenizer_init(VALUE self, VALUE rstr)
{
#ifndef POSH_OS_WIN32
    if (!frt_locale) frt_locale = setlocale(LC_CTYPE, "");
#endif
    return get_wrapped_ts(self, rstr, mb_standard_tokenizer_new());
}