java.lang.Object
  moj.lang.GenericTokenizer
public class GenericTokenizer
GenericTokenizer is a wrapper class that aims to extend the usability of
BreakIterator. It does so by removing the token boundaries that BreakIterator
sets wherever an abbreviation from a supplied abbreviation list is encountered.
It also provides convenience methods for converting a String of text
into an array of String tokens, which can be the characters,
words, sentences or vocabulary of the text.
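
The boundary removal described above can be illustrated with the standard java.text.BreakIterator. The following is a minimal sketch of the idea, not GenericTokenizer's actual implementation: the class AbbreviationAwareSentences and its methods are hypothetical, and the sketch assumes that an abbreviation is matched as the last space-delimited word in front of a boundary.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: illustrates the idea only, not the library's code.
final class AbbreviationAwareSentences {

    // Split text into sentences, ignoring any boundary that directly
    // follows one of the supplied abbreviations.
    static List<String> split(String text, Set<String> abbreviations, Locale locale) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(locale);
        boundaries.setText(text);

        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            pending.append(text, start, end);
            // Keep accumulating while the segment ends with an abbreviation.
            if (!endsWithAbbreviation(pending.toString(), abbreviations)) {
                sentences.add(pending.toString().trim());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            sentences.add(pending.toString().trim());
        }
        return sentences;
    }

    // Assumption: the abbreviation is the last space-delimited word.
    static boolean endsWithAbbreviation(String segment, Set<String> abbreviations) {
        String trimmed = segment.trim();
        String lastWord = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        return abbreviations.contains(lastWord);
    }

    public static void main(String[] args) {
        Set<String> abbreviations = new HashSet<>(Arrays.asList("Dr.", "e.g."));
        String text = "Dr. Smith went home. He slept.";
        for (String sentence : split(text, abbreviations, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

Merging segments after the fact keeps the sketch independent of how a particular BreakIterator implementation treats a given abbreviation: the result is the same whether or not the iterator reports a boundary after it.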
| Constructor Summary | |
|---|---|
| GenericTokenizer() | Create a new GenericTokenizer with English locale. |
| GenericTokenizer(java.util.Locale where) | Create a new GenericTokenizer with the supplied locale. |

| Method Summary | | |
|---|---|---|
| void | addAbbreviation(java.lang.String abbreviation) | Add an abbreviation to the tokenizer's internal list of abbreviations. |
| int | addAbbreviations(java.lang.String abbreviationFile) | Add a set of abbreviations from a file containing exactly one abbreviation per line. |
| void | addAbbreviations(java.lang.String[] abbreviations) | Add a set of abbreviations from an array containing exactly one abbreviation per element. |
| java.lang.String[] | getCharacters(java.lang.String text) | Tokenize the supplied String into character tokens and return them in a String array. |
| java.util.Locale | getLocale() | Get the Locale to which the tokenizer should conform. |
| java.lang.String[] | getSentences(java.lang.String text) | Tokenize the supplied String into sentence tokens and return them in a String array. |
| java.lang.String[] | getVocabulary(java.lang.String text) | Extract the vocabulary of the supplied String and return it (alphabetically) sorted in an array. Punctuation tokens are removed. |
| java.lang.String[] | getVocabulary(java.lang.String text, boolean removePunctuation) | Extract the vocabulary of the supplied String and return it (alphabetically) sorted in an array. |
| java.lang.String[] | getWords(java.lang.String text) | Tokenize the supplied String into words and punctuation tokens and return them in a String array. |
| java.lang.String[] | getWords(java.lang.String text, boolean removePunctuation) | Tokenize the supplied String into word level tokens and return them in a String array. |
| static void | main(java.lang.String[] args) | Usage: moj.lang.GenericTokenizer <file> (<token type>), where <file> is the file to tokenize and <token type> is one of [s]entences, [w]ords, [v]ocabulary, [c]haracters or [l]inebreak. |
| java.util.Locale | setLocale(java.util.Locale where) | Set the Locale to which the tokenizer should conform. |
| java.lang.String[] | wrapText(java.lang.String text, int width) | Wrap the supplied String at a certain width by inserting line breaks into the text at proper places without breaking non-hyphenated words. |

| Methods inherited from class java.lang.Object |
|---|
| equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |

| Constructor Detail |
|---|

public GenericTokenizer()
Create a new GenericTokenizer with English locale.

public GenericTokenizer(java.util.Locale where)
Create a new GenericTokenizer with the supplied locale.
Parameters:
where - which language settings to use

| Method Detail |
|---|

public java.util.Locale getLocale()
Get the Locale to which the tokenizer should conform.
Returns:
the Locale

public java.util.Locale setLocale(java.util.Locale where)
Set the Locale to which the tokenizer should conform.
Parameters:
where - Locale to be set
Returns:
the Locale

public java.lang.String[] wrapText(java.lang.String text, int width)
Wrap the supplied String at a certain width by inserting
line breaks into the text at proper places without breaking non-hyphenated
words. The formatted string is returned in a String array
where each element is a line.
Parameters:
text - the text to wrap at a certain width
width - the maximum width when breaking the text into lines

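
The wrapping behaviour described for wrapText can be approximated with a greedy line filler on top of BreakIterator.getLineInstance. This is a minimal sketch under stated assumptions, not the library's implementation: the class WrapSketch and its wrap method are hypothetical, the input is assumed to contain no line breaks of its own, and a single word longer than width is placed on its own line rather than being split.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: a greedy wrap, not the library's code.
final class WrapSketch {

    // Fill each line with as many breakable segments as fit into width.
    static String[] wrap(String text, int width, Locale locale) {
        BreakIterator breaks = BreakIterator.getLineInstance(locale);
        breaks.setText(text);

        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int start = breaks.first();
        for (int end = breaks.next(); end != BreakIterator.DONE;
                start = end, end = breaks.next()) {
            String segment = text.substring(start, end);
            // Surrounding whitespace does not count towards the width.
            int candidateLength = (current + segment).trim().length();
            if (candidateLength > width && current.length() > 0) {
                lines.add(current.toString().trim());
                current.setLength(0);
            }
            current.append(segment);
        }
        if (current.length() > 0) {
            lines.add(current.toString().trim());
        }
        return lines.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String line : wrap("the quick brown fox jumps", 10, Locale.ENGLISH)) {
            System.out.println(line);
        }
    }
}
```

Because the line iterator only reports legal break opportunities, words are never cut at arbitrary positions; the sketch simply decides which of those opportunities to use.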
public java.lang.String[] getCharacters(java.lang.String text)
Tokenize the supplied String into character tokens and
return them in a String array.
Parameters:
text - the text to tokenize into characters
Returns:
the character tokens of text

public java.lang.String[] getWords(java.lang.String text)
Tokenize the supplied String into words and punctuation
tokens and return them in a String array.
Parameters:
text - the text to tokenize into words
Returns:
the word and punctuation tokens of text

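
Character and word tokenization of this kind can be sketched with a single loop over the boundaries of a BreakIterator. The class TokenSketch and its methods are hypothetical names used for illustration; dropping whitespace-only tokens is an assumption of this sketch, and abbreviation handling is left out.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: shows the boundary loop only, not the library's code.
final class TokenSketch {

    // Cut text at every boundary the iterator reports and
    // drop pieces that consist of whitespace only (an assumption).
    static String[] tokenize(BreakIterator boundaries, String text) {
        boundaries.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String token = text.substring(start, end);
            if (!token.trim().isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens.toArray(new String[0]);
    }

    static String[] characters(String text, Locale locale) {
        return tokenize(BreakIterator.getCharacterInstance(locale), text);
    }

    static String[] words(String text, Locale locale) {
        return tokenize(BreakIterator.getWordInstance(locale), text);
    }

    public static void main(String[] args) {
        for (String token : words("Hi, there", Locale.ENGLISH)) {
            System.out.println(token);
        }
    }
}
```

Passing the iterator in as a parameter lets the same loop serve every token type; only the factory method (character, word, sentence or line instance) changes.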
public java.lang.String[] getWords(java.lang.String text, boolean removePunctuation)
Tokenize the supplied String into word level tokens and
return them in a String array.
Parameters:
text - the text to tokenize into word level tokens
removePunctuation - removes punctuation before tokenizing if true
Returns:
the word level tokens of text

public java.lang.String[] getSentences(java.lang.String text)
Tokenize the supplied String into sentence tokens and return
them in a String array.
Parameters:
text - the text to tokenize into sentences
Returns:
the sentence tokens of text

public java.lang.String[] getVocabulary(java.lang.String text)
Extract the vocabulary of the supplied String and return it (alphabetically) sorted
in an array. Punctuation tokens are removed.
Parameters:
text - the text we want the vocabulary of
Returns:
the sorted vocabulary of text

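
Vocabulary extraction amounts to collecting the distinct word tokens of a text in sorted order. The sketch below is one way to do that under stated assumptions and is not the library's implementation: VocabularySketch is a hypothetical class, punctuation is filtered after tokenizing rather than before, tokens are not case-folded, and TreeSet's natural String ordering stands in for "alphabetical" (a java.text.Collator would give locale-aware ordering).

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper: distinct, sorted word tokens; not the library's code.
final class VocabularySketch {

    static String[] vocabulary(String text, boolean removePunctuation, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        // A sorted set removes duplicates and keeps the tokens ordered.
        SortedSet<String> vocabulary = new TreeSet<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = text.substring(start, end);
            if (token.trim().isEmpty()) {
                continue; // skip whitespace between words
            }
            if (removePunctuation && !containsLetterOrDigit(token)) {
                continue; // skip tokens made of punctuation only
            }
            vocabulary.add(token);
        }
        return vocabulary.toArray(new String[0]);
    }

    static boolean containsLetterOrDigit(String token) {
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        for (String entry : vocabulary("Hello, world! Hello.", true, Locale.ENGLISH)) {
            System.out.println(entry);
        }
    }
}
```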
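public java.lang.String[] getVocabulary(java.lang.String text, boolean removePunctuation)
Extract the vocabulary of the supplied String and return it (alphabetically) sorted
in an array.
Parameters:
text - the text we want the vocabulary of
removePunctuation - removes punctuation before tokenizing if true
Returns:
the sorted vocabulary of text

public void addAbbreviation(java.lang.String abbreviation)
Add an abbreviation to the tokenizer's internal list of abbreviations.
Parameters:
abbreviation - abbreviation to add

public void addAbbreviations(java.lang.String[] abbreviations)
Add a set of abbreviations from an array containing exactly one abbreviation per element.
Parameters:
abbreviations - abbreviations to be added to the tokenizer

public int addAbbreviations(java.lang.String abbreviationFile)
Add a set of abbreviations from a file containing exactly one abbreviation per line.
Parameters:
abbreviationFile - full path to the file containing abbreviations to be added to the tokenizer

public static void main(java.lang.String[] args)
Usage: moj.lang.GenericTokenizer <file> (<token type>)
<file> : file to tokenize
<token type> : ([s]entences|[w]ords|[v]ocabulary|[c]haracters|[l]inebreak)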
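
Loading an abbreviation file with exactly one abbreviation per line, as addAbbreviations(java.lang.String abbreviationFile) expects, might look like the sketch below. Everything in it is an assumption made for illustration: the class AbbreviationFileSketch is hypothetical, the file is read as UTF-8, blank lines are skipped, and the returned int is taken to be the number of newly added abbreviations, which the documentation above does not state.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: not the library's code.
final class AbbreviationFileSketch {

    // Read one abbreviation per line and return how many were newly added
    // (the meaning of the return value is an assumption).
    static int addAll(BufferedReader reader, Set<String> abbreviations) {
        int added = 0;
        for (String line : reader.lines().collect(Collectors.toList())) {
            String abbreviation = line.trim();
            if (!abbreviation.isEmpty() && abbreviations.add(abbreviation)) {
                added++;
            }
        }
        return added;
    }

    // Open the file at the given path (assumed UTF-8) and delegate to addAll.
    static int addFromFile(String abbreviationFile, Set<String> abbreviations) {
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get(abbreviationFile), StandardCharsets.UTF_8)) {
            return addAll(reader, abbreviations);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Separating the reader-based method from the path-based one keeps the parsing logic usable with in-memory input, for example a java.io.StringReader.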