Class: Ferret::Analysis::RegExpTokenizer
Summary
A tokenizer that recognizes tokens based on a regular expression passed to the constructor. Most tokenizers can be built with this class.
Example
Below is an example of a simple LetterTokenizer implemented with a RegExpTokenizer. Here a token is a sequence of alphabetic characters, and tokens are separated by one or more non-alphabetic characters.
# of course you would add more than just é
RegExpTokenizer.new(input, /[[:alpha:]é]+/)
"Dave's résumé, at http://www.davebalmain.com/ 1234"
=> ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
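The matching rule in the example above can be approximated in plain Ruby without loading Ferret. This is only a sketch: `regexp_tokens` is a hypothetical helper built on `String#scan`, not part of the Ferret API, and it returns bare strings rather than a token stream. It assumes the regular expression contains no capture groups.

```ruby
# Pure-Ruby approximation of the token matching shown above.
# regexp_tokens is a hypothetical helper, not a Ferret method.
# Assumes regexp has no capture groups, so scan returns whole matches.
def regexp_tokens(input, regexp = /[[:alpha:]]+/)
  input.scan(regexp)
end

tokens = regexp_tokens("Dave's résumé, at http://www.davebalmain.com/ 1234",
                       /[[:alpha:]é]+/)
p tokens
# => ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
```

The digits and punctuation never match the pattern, so they act as separators and do not appear in the result.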
Constants
| Name | Value |
|---|---|
| REGEXP | rtoken_re |
Public Class Methods
RegExpTokenizer.new(input, /[[:alpha:]]+/)
Create a new tokenizer based on a regular expression.
| input: | text to tokenize |
| regexp: | regular expression used to recognize tokens in the input |
/*
* call-seq:
* RegExpTokenizer.new(input, /[[:alpha:]]+/)
*
* Create a new tokenizer based on a regular expression
*
* input:: text to tokenize
* regexp:: regular expression used to recognize tokens in the input
*/
static VALUE
frt_rets_init(int argc, VALUE *argv, VALUE self)
{
VALUE rtext, regex, proc;
TokenStream *ts;
rb_scan_args(argc, argv, "11&", &rtext, &regex, &proc);
ts = rets_new(rtext, regex, proc);
Frt_Wrap_Struct(self, &frt_rets_mark, &frt_rets_free, ts);
object_add(ts, self);
return self;
}
Public Instance Methods
tokenizer.text → text
Get the text being tokenized by the tokenizer.
/*
* call-seq:
* tokenizer.text -> text
*
* Get the text being tokenized by the tokenizer.
*/
static VALUE
frt_rets_get_text(VALUE self)
{
TokenStream *ts;
GET_TS(ts, self);
return RETS(ts)->rtext;
}
tokenizer.text = text → text
Set the text to be tokenized by the tokenizer. The tokenizer gets reset to tokenize the text from the beginning.
/*
* call-seq:
* tokenizer.text = text -> text
*
* Set the text to be tokenized by the tokenizer. The tokenizer gets reset to
* tokenize the text from the beginning.
*/
static VALUE
frt_rets_set_text(VALUE self, VALUE rtext)
{
TokenStream *ts;
GET_TS(ts, self);
rb_hash_aset(object_space, ((VALUE)ts)|1, rtext);
StringValue(rtext);
RETS(ts)->rtext = rtext;
RETS(ts)->curr_ind = 0;
return rtext;
}
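The reset behaviour of the setter (storing the new text and setting `curr_ind` back to 0) can be illustrated with a small pure-Ruby model. This is a sketch under assumptions, not Ferret's implementation: the class `SketchRegExpTokenizer` and its `next_token` method are hypothetical names, and tokens are returned as plain strings.

```ruby
# Hypothetical pure-Ruby model of the text / text= behaviour described above.
# It does not use Ferret; the class and method names are illustrative only.
class SketchRegExpTokenizer
  attr_reader :text

  def initialize(text, regexp = /[[:alpha:]]+/)
    @regexp = regexp
    self.text = text
  end

  # Assigning new text rewinds the scan position, mirroring curr_ind = 0.
  def text=(new_text)
    @text = String(new_text)
    @pos = 0
  end

  # Return the next matching substring, or nil when the text is exhausted.
  def next_token
    m = @regexp.match(@text, @pos)
    return nil unless m
    # Step past the match; move one extra character on an empty match
    # so the scan always makes progress.
    @pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
    m[0]
  end
end

t = SketchRegExpTokenizer.new("one two")
first = t.next_token        # consumes "one"
t.text = "three four"       # replaces the text and restarts from index 0
after_reset = t.next_token  # reads from the start of the new text
```

After the assignment, scanning starts again at the beginning of the new string, so the remaining token of the old text is never returned.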