java.lang.Object
  moj.lang.GenericTokenizer
public class GenericTokenizer
GenericTokenizer is a wrapper class that tries to extend BreakIterator's usability. It does so by removing token boundaries set by BreakIterator where an abbreviation from a supplied abbreviation list is encountered.
It also provides handy methods for converting a String of text to an array of String tokens, which can be either the characters, words, sentences or the vocabulary of the text.
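The boundary-removal idea described above can be illustrated with a minimal sketch built directly on java.text.BreakIterator. It is not the GenericTokenizer implementation: the class name AbbreviationAwareSentences, the method split, and the rule "merge a sentence candidate with the next one when its last whitespace-delimited token is a listed abbreviation" are all assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of moj.lang.
class AbbreviationAwareSentences {

    // Split text into sentences, ignoring boundaries that directly follow
    // a token found in the abbreviation set.
    static List<String> split(String text, Set<String> abbreviations, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            String candidate = text.substring(start, end);
            // Skip this boundary if more text follows and the candidate
            // ends with a known abbreviation; the candidate keeps growing.
            if (end < text.length() && endsWithAbbreviation(candidate, abbreviations)) {
                continue;
            }
            sentences.add(candidate.trim());
            start = end;
        }
        return sentences;
    }

    // True if the last whitespace-delimited token of s is in the set.
    static boolean endsWithAbbreviation(String s, Set<String> abbreviations) {
        String trimmed = s.trim();
        String lastToken = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        return abbreviations.contains(lastToken);
    }

    public static void main(String[] args) {
        Set<String> abbreviations = new HashSet<>(Arrays.asList("Dr."));
        String text = "Dr. Smith arrived. He sat down.";
        for (String sentence : split(text, abbreviations, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

The sketch stores abbreviations with their trailing period; whether GenericTokenizer expects the period in the abbreviation list is not stated in this documentation.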
Constructor Summary | |
---|---|
GenericTokenizer()
Create a new GenericTokenizer with English locale. |
|
GenericTokenizer(java.util.Locale where)
Create a new GenericTokenizer with the supplied locale. |
Method Summary | |
---|---|
void |
addAbbreviation(java.lang.String abbreviation)
Add abbreviation to the tokenizer's internal list of abbreviations. |
int |
addAbbreviations(java.lang.String abbreviationFile)
Add a set of abbreviations from a file containing exactly one abbreviation per line. |
void |
addAbbreviations(java.lang.String[] abbreviations)
Add a set of abbreviations from an array containing exactly one abbreviation per element. |
java.lang.String[] |
getCharacters(java.lang.String text)
Tokenize the supplied String into character tokens and
return them in a String array. |
java.util.Locale |
getLocale()
Get the Locale to which the tokenizer should conform. |
java.lang.String[] |
getSentences(java.lang.String text)
Tokenize the supplied String into sentence tokens and return
them in a String array. |
java.lang.String[] |
getVocabulary(java.lang.String text)
Extract the vocabulary of the supplied String and return it alphabetically sorted in an array. |
java.lang.String[] |
getVocabulary(java.lang.String text,
boolean removePunctuation)
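Extract the vocabulary of the supplied String, optionally removing punctuation, and return it alphabetically sorted in an array. |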
java.lang.String[] |
getWords(java.lang.String text)
Tokenize the supplied String into words and punctuation
tokens and return them in a String array. |
java.lang.String[] |
getWords(java.lang.String text,
boolean removePunctuation)
Tokenize the supplied String into word level tokens and
return them in a String array. |
static void |
main(java.lang.String[] args)
Usage: moj.lang.GenericTokenizer <file> (<token type>) <file> : file to tokenize <token type> : ([s]entences|[w]ords|[v]ocabulary|[c]haracters|[l]inebreak) |
java.util.Locale |
setLocale(java.util.Locale where)
Set the Locale to which the tokenizer should conform. |
java.lang.String[] |
wrapText(java.lang.String text,
int width)
Wrap the supplied String at a certain width by inserting
line breaks into the text at proper places without breaking non-hyphenated
words. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public GenericTokenizer()
Create a new GenericTokenizer with English locale.
public GenericTokenizer(java.util.Locale where)
Create a new GenericTokenizer with the supplied locale.
where - which language settings we want to use.
Method Detail |
---|
public java.util.Locale getLocale()
Get the Locale to which the tokenizer should conform.
Locale
public java.util.Locale setLocale(java.util.Locale where)
Set the Locale to which the tokenizer should conform.
where - Locale to be set
Locale
public java.lang.String[] wrapText(java.lang.String text, int width)
Wrap the supplied String at a certain width by inserting line breaks into the text at proper places without breaking non-hyphenated words. The formatted string is returned in a String array where each element is a line.
text - the text we want to have wrapped at a certain width
width - the maximum width when breaking the text into lines
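As an illustration of the wrapping behaviour described above, the following sketch greedily fills lines using BreakIterator.getLineInstance, which only offers break opportunities after whitespace and hyphens. It is an assumption about how such wrapping could work, not the actual wrapText code; the class name LineWrapSketch and method wrap are hypothetical, and the input is assumed to contain no existing line breaks.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of moj.lang.
class LineWrapSketch {

    // Greedy wrap: append chunks to the current line until the next chunk
    // would exceed the width, then start a new line.
    static String[] wrap(String text, int width, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // A chunk is typically one word plus its trailing whitespace.
            String chunk = text.substring(start, end);
            if (line.length() > 0 && line.length() + chunk.trim().length() > width) {
                lines.add(line.toString().trim());
                line.setLength(0);
            }
            line.append(chunk);
        }
        if (line.length() > 0) {
            lines.add(line.toString().trim());
        }
        return lines.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String line : wrap("the quick brown fox", 10, Locale.ENGLISH)) {
            System.out.println(line);
        }
    }
}
```

A single chunk longer than the width is placed on a line of its own and overflows; how wrapText treats that case is not documented here.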
public java.lang.String[] getCharacters(java.lang.String text)
Tokenize the supplied String into character tokens and return them in a String array.
text - the text we want to have tokenized into characters
text
public java.lang.String[] getWords(java.lang.String text)
Tokenize the supplied String into words and punctuation tokens and return them in a String array.
text - the text we want to have tokenized into words
text
public java.lang.String[] getWords(java.lang.String text, boolean removePunctuation)
Tokenize the supplied String into word level tokens and return them in a String array.
text - the text we want to have tokenized into word level tokens
removePunctuation - removes punctuation before tokenizing if true
text
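The two getWords variants can be approximated with BreakIterator.getWordInstance, as in the sketch below. The names WordTokenSketch and words are hypothetical. Note one deliberate difference: the sketch drops punctuation tokens after tokenizing, whereas the documentation above says the class removes punctuation before tokenizing, so results may differ for text such as contractions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of moj.lang.
class WordTokenSketch {

    // Return word-level tokens; whitespace-only tokens are always dropped,
    // tokens not starting with a letter or digit are dropped on request.
    static String[] words(String text, boolean removePunctuation, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (token.trim().isEmpty()) {
                continue; // the iterator reports whitespace runs as tokens too
            }
            if (removePunctuation && !Character.isLetterOrDigit(token.codePointAt(0))) {
                continue; // assumption: "punctuation" means no leading letter or digit
            }
            tokens.add(token);
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", words("Hello, world!", false, Locale.ENGLISH)));
        System.out.println(String.join("|", words("Hello, world!", true, Locale.ENGLISH)));
    }
}
```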
public java.lang.String[] getSentences(java.lang.String text)
Tokenize the supplied String into sentence tokens and return them in a String array.
text - the text we want to have tokenized into sentences
text
public java.lang.String[] getVocabulary(java.lang.String text)
Extract the vocabulary of the supplied String and return it (alphabetically) sorted in an array. Punctuation tokens are removed.
text - the text we want the vocabulary of
text
public java.lang.String[] getVocabulary(java.lang.String text, boolean removePunctuation)
Extract the vocabulary of the supplied String and return it (alphabetically) sorted in an array.
text - the text we want the vocabulary of
removePunctuation - removes punctuation before tokenizing if true
text
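A sorted vocabulary of the kind described above can be sketched by collecting word tokens into a TreeSet, which both removes duplicates and keeps the entries ordered. This is an illustrative assumption, not the getVocabulary implementation: the names VocabularySketch and vocabulary are hypothetical, tokens are compared case-sensitively, and the ordering is String's natural order rather than a locale-specific collation.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of moj.lang.
class VocabularySketch {

    // Collect the distinct word-level tokens of text in sorted order.
    static String[] vocabulary(String text, boolean removePunctuation, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        SortedSet<String> vocab = new TreeSet<>(); // deduplicates and sorts
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (token.trim().isEmpty()) {
                continue; // skip whitespace runs
            }
            if (removePunctuation && !Character.isLetterOrDigit(token.codePointAt(0))) {
                continue; // assumption: filter punctuation after tokenizing
            }
            vocab.add(token);
        }
        return vocab.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String text = "the cat saw the dog.";
        System.out.println(String.join(" ", vocabulary(text, true, Locale.ENGLISH)));
    }
}
```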
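public void addAbbreviation(java.lang.String abbreviation)
Add abbreviation to the tokenizer's internal list of abbreviations.
abbreviation - abbreviation to add
public void addAbbreviations(java.lang.String[] abbreviations)
Add a set of abbreviations from an array containing exactly one abbreviation per element.
abbreviations - abbreviations to be added to the tokenizer
public int addAbbreviations(java.lang.String abbreviationFile)
Add a set of abbreviations from a file containing exactly one abbreviation per line.
abbreviationFile - full path to file containing abbreviations to be added to the tokenizer
public static void main(java.lang.String[] args)
Usage: moj.lang.GenericTokenizer <file> (<token type>)
<file> : file to tokenize
<token type> : ([s]entences|[w]ords|[v]ocabulary|[c]haracters|[l]inebreak)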