moj.lang
Class GenericTokenizer

java.lang.Object
  extended by moj.lang.GenericTokenizer

public class GenericTokenizer
extends java.lang.Object

GenericTokenizer is a wrapper class that tries to extend BreakIterator's usability. It does so by removing token boundaries set by BreakIterator where an abbreviation from a supplied abbreviation list is encountered. It also provides convenience methods for converting a String of text to an array of String tokens, which can be the characters, words, sentences or vocabulary of the text.
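
The boundary removal described above can be illustrated with a minimal sketch built directly on java.text.BreakIterator. This is not the source of GenericTokenizer; the class name AbbreviationAwareSplitter and its merging rule (join a segment with the next one when it ends in a listed abbreviation) are assumptions made for illustration only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class, not part of moj.lang.
final class AbbreviationAwareSplitter {
    private final Set<String> abbreviations = new HashSet<>();
    private final Locale locale;

    AbbreviationAwareSplitter(Locale locale) {
        this.locale = locale;
    }

    void addAbbreviation(String abbreviation) {
        abbreviations.add(abbreviation);
    }

    String[] getSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            pending.append(text, start, end);
            String candidate = pending.toString().trim();
            // Assumed rule: a boundary right after a known abbreviation is
            // dropped, so the segment is kept open and joined with the next one.
            if (!endsWithAbbreviation(candidate)) {
                if (!candidate.isEmpty()) {
                    sentences.add(candidate);
                }
                pending.setLength(0);
            }
        }
        String rest = pending.toString().trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences.toArray(new String[0]);
    }

    private boolean endsWithAbbreviation(String segment) {
        for (String abbreviation : abbreviations) {
            if (segment.endsWith(abbreviation)) {
                int before = segment.length() - abbreviation.length() - 1;
                // Only match whole tokens, e.g. "Dr." but not "XDr."
                if (before < 0 || Character.isWhitespace(segment.charAt(before))) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With "Dr." registered, the text "He met Dr. Smith today. They talked." yields two sentences in this sketch, whether or not the underlying BreakIterator places a boundary after "Dr.".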

Version:
2007-Aug-09
Author:
Martin Hassel

Constructor Summary
GenericTokenizer()
          Create a new GenericTokenizer with English locale.
GenericTokenizer(java.util.Locale where)
          Create a new GenericTokenizer with the supplied locale.
 
Method Summary
 void addAbbreviation(java.lang.String abbreviation)
          Add abbreviation to the tokenizer's internal list of abbreviations.
 int addAbbreviations(java.lang.String abbreviationFile)
          Add a set of abbreviations from a file containing exactly one abbreviation per line.
 void addAbbreviations(java.lang.String[] abbreviations)
          Add a set of abbreviations from an array containing exactly one abbreviation per element.
 java.lang.String[] getCharacters(java.lang.String text)
          Tokenize the supplied String into character tokens and return them in a String array.
 java.util.Locale getLocale()
          Get the Locale to which the tokenizer should conform.
 java.lang.String[] getSentences(java.lang.String text)
          Tokenize the supplied String into sentence tokens and return them in a String array.
 java.lang.String[] getVocabulary(java.lang.String text)
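          Extract the vocabulary, i.e. all unique word tokens, from the supplied String and return them (alphabetically) sorted in an array.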
 java.lang.String[] getVocabulary(java.lang.String text, boolean removePunctuation)
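          Extract the vocabulary, i.e. all unique tokens ("words"), from the supplied String and return them (alphabetically) sorted in an array.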
 java.lang.String[] getWords(java.lang.String text)
          Tokenize the supplied String into words and punctuation tokens and return them in a String array.
 java.lang.String[] getWords(java.lang.String text, boolean removePunctuation)
          Tokenize the supplied String into word level tokens and return them in a String array.
static void main(java.lang.String[] args)
          Usage: moj.lang.GenericTokenizer <file> (<token type>)
<file> : file to tokenize
<token type> : ([s]entences|[w]ords|[v]ocabulary|[c]haracters|[l]inebreak)
 java.util.Locale setLocale(java.util.Locale where)
          Set the Locale to which the tokenizer should conform.
 java.lang.String[] wrapText(java.lang.String text, int width)
          Wrap the supplied String at a certain width by inserting line breaks into the text at proper places without breaking non-hyphenated words.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

GenericTokenizer

public GenericTokenizer()
Create a new GenericTokenizer with English locale.


GenericTokenizer

public GenericTokenizer(java.util.Locale where)
Create a new GenericTokenizer with the supplied locale.

Parameters:
where - which language settings we want to use.

Method Detail

getLocale

public java.util.Locale getLocale()
Get the Locale to which the tokenizer should conform.

Returns:
the currently set Locale

setLocale

public java.util.Locale setLocale(java.util.Locale where)
Set the Locale to which the tokenizer should conform.

Parameters:
where - Locale to be set
Returns:
the previously set Locale

wrapText

public java.lang.String[] wrapText(java.lang.String text,
                                   int width)
Wrap the supplied String at a certain width by inserting line breaks into the text at proper places without breaking non-hyphenated words. The formatted string is returned in a String array where each element is a line.

Parameters:
text - the text we want to have wrapped at a certain width
width - the maximum width when breaking the text into lines
Returns:
an array with all new lines
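
As a rough illustration of the wrapping behaviour described above, the following sketch greedily fills lines using the break opportunities reported by BreakIterator.getLineInstance. It is an assumption about one plausible approach, not the actual wrapText implementation; the class LineWrapSketch and the extra locale parameter are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class, not part of moj.lang.
final class LineWrapSketch {
    static String[] wrap(String text, int width, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String chunk = text.substring(start, end);
            // Trailing whitespace of the chunk does not count toward the width.
            boolean overflows = line.length() + chunk.trim().length() > width;
            if (line.length() > 0 && overflows) {
                lines.add(line.toString().trim());
                line.setLength(0);
            }
            // A single chunk longer than width is kept whole rather than split.
            line.append(chunk);
        }
        String last = line.toString().trim();
        if (!last.isEmpty()) {
            lines.add(last);
        }
        return lines.toArray(new String[0]);
    }
}
```

In this sketch, wrapping "the quick brown fox jumps" at width 10 gives the three lines "the quick", "brown fox" and "jumps".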

getCharacters

public java.lang.String[] getCharacters(java.lang.String text)
Tokenize the supplied String into character tokens and return them in a String array.

Parameters:
text - the text we want to have tokenized into characters
Returns:
an array with all characters in text
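
Character tokens are not necessarily single Java char values. The sketch below assumes the tokens follow BreakIterator.getCharacterInstance boundaries, which keep a base letter and its combining marks together; whether getCharacters behaves exactly this way is an assumption, and CharacterSketch is a hypothetical name.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class, not part of moj.lang.
final class CharacterSketch {
    static String[] getCharacters(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        List<String> characters = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each span is one user-perceived character, possibly several chars long.
            characters.add(text.substring(start, end));
        }
        return characters.toArray(new String[0]);
    }
}
```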

getWords

public java.lang.String[] getWords(java.lang.String text)
Tokenize the supplied String into words and punctuation tokens and return them in a String array.

Parameters:
text - the text we want to have tokenized into words
Returns:
an array with all word and punctuation tokens in text

getWords

public java.lang.String[] getWords(java.lang.String text,
                                   boolean removePunctuation)
Tokenize the supplied String into word level tokens and return them in a String array.

Parameters:
text - the text we want to have tokenized into word level tokens
removePunctuation - if true, punctuation is removed before tokenizing
Returns:
an array with all words in text

getSentences

public java.lang.String[] getSentences(java.lang.String text)
Tokenize the supplied String into sentence tokens and return them in a String array.

Parameters:
text - the text we want to have tokenized into sentences
Returns:
an array with all sentences in text

getVocabulary

public java.lang.String[] getVocabulary(java.lang.String text)
Extract the vocabulary, i.e. all unique word tokens from the supplied String and return them (alphabetically) sorted in an array. Punctuation tokens are removed.

Parameters:
text - the text we want the vocabulary of
Returns:
an array with all unique words in text

getVocabulary

public java.lang.String[] getVocabulary(java.lang.String text,
                                        boolean removePunctuation)
Extract the vocabulary, i.e. all unique tokens ("words"), from the supplied String and return them (alphabetically) sorted in an array.

Parameters:
text - the text we want the vocabulary of
removePunctuation - if true, punctuation is removed before tokenizing
Returns:
an array with all unique tokens ("words") in text
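
The relationship between word tokens and the vocabulary can be sketched as follows. This is a minimal approximation under stated assumptions, not the class's own code: VocabularySketch is a hypothetical name, punctuation is filtered after tokenizing rather than before, a token counts as punctuation when it contains no letter or digit, and sorting uses String's natural ordering (so uppercase sorts before lowercase) instead of a locale-aware collation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical illustration class, not part of moj.lang.
final class VocabularySketch {
    static String[] getWords(String text, boolean removePunctuation, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (token.trim().isEmpty()) {
                continue; // whitespace between words is not a token
            }
            if (removePunctuation && !containsLetterOrDigit(token)) {
                continue; // assumed definition of a punctuation token
            }
            tokens.add(token);
        }
        return tokens.toArray(new String[0]);
    }

    static String[] getVocabulary(String text, boolean removePunctuation, Locale locale) {
        // TreeSet removes duplicates and keeps the tokens in natural String order.
        TreeSet<String> unique = new TreeSet<>();
        for (String token : getWords(text, removePunctuation, locale)) {
            unique.add(token);
        }
        return unique.toArray(new String[0]);
    }

    private static boolean containsLetterOrDigit(String token) {
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }
}
```

For the input "the cat and the hat." with removePunctuation set to true, this sketch returns "and", "cat", "hat", "the"; with false, the "." token is included and sorts first.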

addAbbreviation

public void addAbbreviation(java.lang.String abbreviation)
Add abbreviation to the tokenizer's internal list of abbreviations.

Parameters:
abbreviation - abbreviation to add

addAbbreviations

public void addAbbreviations(java.lang.String[] abbreviations)
Add a set of abbreviations from an array containing exactly one abbreviation per element.

Parameters:
abbreviations - abbreviations to be added to the tokenizer

addAbbreviations

public int addAbbreviations(java.lang.String abbreviationFile)
Add a set of abbreviations from a file containing exactly one abbreviation per line.

Parameters:
abbreviationFile - full path to file containing abbreviations to be added to the tokenizer
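
The one-abbreviation-per-line file format can be read as in the sketch below. The meaning of the int return value is not documented here, so the sketch assumes it is the number of newly added abbreviations; the UTF-8 encoding, the skipping of blank lines, the Reader overload and the class name AbbreviationListSketch are likewise assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration class, not part of moj.lang.
final class AbbreviationListSketch {
    private final Set<String> abbreviations = new LinkedHashSet<>();

    // Reads the file at the given path; UTF-8 is an assumed encoding.
    int addAbbreviations(String abbreviationFile) {
        try (Reader reader = Files.newBufferedReader(
                Paths.get(abbreviationFile), StandardCharsets.UTF_8)) {
            return addAbbreviations(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumed helper: one abbreviation per line, blank lines ignored.
    int addAbbreviations(Reader source) {
        List<String> lines = new BufferedReader(source).lines().collect(Collectors.toList());
        int added = 0;
        for (String line : lines) {
            String abbreviation = line.trim();
            if (!abbreviation.isEmpty() && abbreviations.add(abbreviation)) {
                added++; // count only entries that were not already present
            }
        }
        return added;
    }

    boolean contains(String abbreviation) {
        return abbreviations.contains(abbreviation);
    }
}
```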

main

public static void main(java.lang.String[] args)
Usage: moj.lang.GenericTokenizer <file> (<token type>)
<file> : file to tokenize
<token type> : ([s]entences|[w]ords|[v]ocabulary|[c]haracters|[l]inebreak)