GenericTokenizer

moj.lang
Class GenericTokenizer

java.lang.Object
  moj.lang.GenericTokenizer

public class GenericTokenizer
extends java.lang.Object

GenericTokenizer is a wrapper class that tries to extend BreakIterator's usability. It does so by removing token boundaries set by BreakIterator wherever an abbreviation from a supplied abbreviation list is encountered. It also provides convenience methods for converting a String of text to an array of String tokens, which can be the characters, words, sentences, or vocabulary of the text.

Version:: 2007-Aug-09
Author:: Martin Hassel
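
The boundary removal described in the class description might be approximated as follows. This is a minimal sketch built directly on `java.text.BreakIterator`, not the GenericTokenizer source: the class name `AbbreviationAwareSplitter`, the whitespace-based lookup of the last token, and the rule "drop a boundary when the text before it ends with a listed abbreviation" are all assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for illustration; not the GenericTokenizer source.
final class AbbreviationAwareSplitter {
    private final Set<String> abbreviations = new HashSet<>();
    private final Locale locale;

    AbbreviationAwareSplitter(Locale locale) {
        this.locale = locale;
    }

    void addAbbreviation(String abbreviation) {
        abbreviations.add(abbreviation);
    }

    String[] getSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            pending.append(text, start, end);
            String candidate = pending.toString().trim();
            // Assumption: a boundary is dropped when the text before it
            // ends with a listed abbreviation, so the chunks are merged.
            if (candidate.isEmpty() || endsWithAbbreviation(candidate)) {
                continue;
            }
            sentences.add(candidate);
            pending.setLength(0);
        }
        // Keep whatever is left if the text itself ends with an abbreviation.
        String rest = pending.toString().trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences.toArray(new String[0]);
    }

    private boolean endsWithAbbreviation(String chunk) {
        String lastToken = chunk.substring(chunk.lastIndexOf(' ') + 1);
        return abbreviations.contains(lastToken);
    }

    public static void main(String[] args) {
        AbbreviationAwareSplitter splitter = new AbbreviationAwareSplitter(Locale.ENGLISH);
        splitter.addAbbreviation("Dr.");
        for (String sentence : splitter.getSentences("Dr. Smith arrived. He sat down.")) {
            System.out.println(sentence);
        }
    }
}
```

One limitation of this sketch: an abbreviation that genuinely ends a sentence is also merged with the sentence that follows it.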

Constructor Summary
`GenericTokenizer()` Create a new `GenericTokenizer` with English locale.
`GenericTokenizer(java.util.Locale where)` Create a new `GenericTokenizer` with the supplied locale.

Method Summary
`void`	`addAbbreviation(java.lang.String abbreviation)` Add abbreviation to the tokenizer's internal list of abbreviations.
`int`	`addAbbreviations(java.lang.String abbreviationFile)` Add a set of abbreviations from a file containing exactly one abbreviation per line.
`void`	`addAbbreviations(java.lang.String[] abbreviations)` Add a set of abbreviations from an array containing exactly one abbreviation per element.
`java.lang.String[]`	`getCharacters(java.lang.String text)` Tokenize the supplied `String` into character tokens and return them in a `String` array.
`java.util.Locale`	`getLocale()` Get the `Locale` to which the tokenizer should conform.
`java.lang.String[]`	`getSentences(java.lang.String text)` Tokenize the supplied `String` into sentence tokens and return them in a `String` array.
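`java.lang.String[]`	`getVocabulary(java.lang.String text)` Extract the vocabulary, i.e. all unique word tokens, from the supplied `String` and return them alphabetically sorted in an array.
`java.lang.String[]`	`getVocabulary(java.lang.String text, boolean removePunctuation)` Extract the vocabulary, i.e. all unique tokens ("words"), from the supplied `String` and return them alphabetically sorted in an array.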
`java.lang.String[]`	`getWords(java.lang.String text)` Tokenize the supplied `String` into words and punctuation tokens and return them in a `String` array.
`java.lang.String[]`	`getWords(java.lang.String text, boolean removePunctuation)` Tokenize the supplied `String` into word level tokens and return them in a `String` array.
`static void`	`main(java.lang.String[] args)` Usage: moj.lang.GenericTokenizer <file> (<token type>) <file> : file to tokenize <token type> : ([s]entences\|[w]ords\|[v]ocabulary\|[c]haracters\|[l]inebreak)
`java.util.Locale`	`setLocale(java.util.Locale where)` Set the `Locale` to which the tokenizer should conform.
`java.lang.String[]`	`wrapText(java.lang.String text, int width)` Wrap the supplied `String` at a certain width by inserting line breaks into the text at proper places without breaking non-hyphenated words.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

GenericTokenizer

public GenericTokenizer()

Create a new GenericTokenizer with English locale.

GenericTokenizer

public GenericTokenizer(java.util.Locale where)

Create a new GenericTokenizer with the supplied locale.

Parameters:: where - which language settings we want to use.

Method Detail

getLocale

public java.util.Locale getLocale()

Get the Locale to which the tokenizer should conform.

Returns:: the currently set Locale

setLocale

public java.util.Locale setLocale(java.util.Locale where)

Set the Locale to which the tokenizer should conform.

Parameters:: where - Locale to be set
Returns:: the previously set Locale

wrapText

public java.lang.String[] wrapText(java.lang.String text,
                                   int width)

Wrap the supplied String at a certain width by inserting line breaks into the text at proper places without breaking non-hyphenated words. The formatted string is returned in a String array where each element is a line.

Parameters:: text - the text we want to have wrapped at a certain width; width - the maximum width when breaking the text into lines
Returns:: an array with all new lines
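
A rough illustration of the wrapping behaviour described above, assuming a greedy fill over the break opportunities reported by `BreakIterator.getLineInstance`. The class name `LineWrapSketch`, the extra `locale` parameter, and the greedy strategy are assumptions; the actual `wrapText` implementation is not shown in this documentation and may differ.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the GenericTokenizer source.
final class LineWrapSketch {
    static String[] wrapText(String text, int width, Locale locale) {
        BreakIterator it = BreakIterator.getLineInstance(locale);
        it.setText(text);
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // A segment is the text between two legal line-break positions,
            // usually a word plus its trailing whitespace.
            String segment = text.substring(start, end);
            // Trailing whitespace does not count toward the visible width.
            int visibleLength = line.length() + segment.trim().length();
            if (line.length() > 0 && visibleLength > width) {
                lines.add(line.toString().trim());
                line.setLength(0);
            }
            line.append(segment);
        }
        if (line.length() > 0) {
            lines.add(line.toString().trim());
        }
        return lines.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String line : wrapText("the quick brown fox jumps", 10, Locale.ENGLISH)) {
            System.out.println(line);
        }
    }
}
```

In this sketch a single word longer than `width` is placed on its own line and overflows, since words are never split.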

getCharacters

public java.lang.String[] getCharacters(java.lang.String text)

Tokenize the supplied String into character tokens and return them in a String array.

Parameters:: text - the text we want to have tokenized into characters
Returns:: an array with all characters in text
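
Since the class wraps BreakIterator, character tokenization could plausibly look like the sketch below, which walks the boundaries of `BreakIterator.getCharacterInstance`. The class name `CharacterTokens` and the explicit `locale` parameter are assumptions for illustration only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the GenericTokenizer source.
final class CharacterTokens {
    static String[] getCharacters(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        List<String> characters = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each boundary pair delimits one user-perceived character.
            characters.add(text.substring(start, end));
        }
        return characters.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", getCharacters("a b", Locale.ENGLISH)));
    }
}
```

Compared with `String.toCharArray()`, this approach works on user-perceived characters rather than UTF-16 code units, which matters for text containing combining marks or surrogate pairs.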

getWords

public java.lang.String[] getWords(java.lang.String text)

Tokenize the supplied String into words and punctuation tokens and return them in a String array.

Parameters:: text - the text we want to have tokenized into words
Returns:: an array with all words in text

getWords

public java.lang.String[] getWords(java.lang.String text,
                                   boolean removePunctuation)

Tokenize the supplied String into word level tokens and return them in a String array.

Parameters:: text - the text we want to have tokenized into word level tokens; removePunctuation - removes punctuation before tokenizing if true
Returns:: an array with all words in text
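
The two `getWords` variants can be pictured with the following sketch over `BreakIterator.getWordInstance`. It is an approximation under stated assumptions: the class name `WordTokens` and the `locale` parameter are hypothetical, whitespace segments are always dropped, and punctuation is filtered after tokenizing, whereas the documentation above says the real method removes punctuation before tokenizing.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the GenericTokenizer source.
final class WordTokens {
    static String[] getWords(String text, boolean removePunctuation, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (token.trim().isEmpty()) {
                continue; // the word iterator also reports runs of whitespace
            }
            // Assumption: a token with no letter or digit counts as punctuation.
            if (removePunctuation && !hasLetterOrDigit(token)) {
                continue;
            }
            words.add(token);
        }
        return words.toArray(new String[0]);
    }

    private static boolean hasLetterOrDigit(String token) {
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", getWords("Hello, world!", false, Locale.ENGLISH)));
        System.out.println(String.join("|", getWords("Hello, world!", true, Locale.ENGLISH)));
    }
}
```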

getSentences

public java.lang.String[] getSentences(java.lang.String text)

Tokenize the supplied String into sentence tokens and return them in a String array.

Parameters:: text - the text we want to have tokenized into sentences
Returns:: an array with all sentences in text

getVocabulary

public java.lang.String[] getVocabulary(java.lang.String text)

Extract the vocabulary, i.e. all unique word tokens, from the supplied String and return them (alphabetically) sorted in an array. Punctuation tokens are removed.

Parameters:: text - the text we want the vocabulary of
Returns:: an array with all unique words in text

getVocabulary

public java.lang.String[] getVocabulary(java.lang.String text,
                                        boolean removePunctuation)

Extract the vocabulary, i.e. all unique tokens ("words"), from the supplied String and return them (alphabetically) sorted in an array.

Parameters:: text - the text we want the vocabulary of; removePunctuation - removes punctuation before tokenizing if true
Returns:: an array with all unique tokens ("words") in text
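
Vocabulary extraction as described above amounts to collecting word tokens into a sorted set. The sketch below shows one way to do that; the class name `VocabularySketch`, the `locale` parameter, and the use of `TreeSet` natural ordering (which sorts by UTF-16 code unit, so uppercase precedes lowercase) are assumptions. The documentation does not say whether the real method folds case or uses locale-aware collation.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch; not the GenericTokenizer source.
final class VocabularySketch {
    static String[] getVocabulary(String text, boolean removePunctuation, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        // A TreeSet keeps the tokens unique and sorted in one step.
        SortedSet<String> vocabulary = new TreeSet<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (token.trim().isEmpty()) {
                continue; // skip whitespace segments
            }
            // Assumption: a token with no letter or digit counts as punctuation.
            if (removePunctuation && token.codePoints().noneMatch(Character::isLetterOrDigit)) {
                continue;
            }
            vocabulary.add(token);
        }
        return vocabulary.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", getVocabulary("the cat and the dog.", true, Locale.ENGLISH)));
    }
}
```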

addAbbreviation

public void addAbbreviation(java.lang.String abbreviation)

Add abbreviation to the tokenizer's internal list of abbreviations.

Parameters:: abbreviation - abbreviation to add

addAbbreviations

public void addAbbreviations(java.lang.String[] abbreviations)

Add a set of abbreviations from an array containing exactly one abbreviation per element.

Parameters:: abbreviations - abbreviations to be added to the tokenizer

addAbbreviations

public int addAbbreviations(java.lang.String abbreviationFile)

Add a set of abbreviations from a file containing exactly one abbreviation per line.

Parameters:: abbreviationFile - full path to file containing abbreviations to be added to the tokenizer
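
Loading a one-abbreviation-per-line file could be sketched as below. Several details are assumptions, because the documentation above does not state them: the meaning of the `int` return value (taken here to be the number of newly added abbreviations), the UTF-8 encoding, the skipping of blank lines, and the wrapping of I/O errors in an unchecked exception. The class name `AbbreviationFileLoader` and the `target` parameter are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the GenericTokenizer source.
final class AbbreviationFileLoader {
    // Assumption: the returned count is the number of abbreviations newly added.
    static int addAbbreviations(String abbreviationFile, Set<String> target) {
        int added = 0;
        try {
            for (String line : Files.readAllLines(Paths.get(abbreviationFile), StandardCharsets.UTF_8)) {
                String abbreviation = line.trim();
                // Skip blank lines and entries that are already present.
                if (!abbreviation.isEmpty() && target.add(abbreviation)) {
                    added++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return added;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: AbbreviationFileLoader <abbreviation file>");
            return;
        }
        Set<String> abbreviations = new HashSet<>();
        System.out.println(addAbbreviations(args[0], abbreviations) + " abbreviations loaded");
    }
}
```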

main

public static void main(java.lang.String[] args)

Usage: moj.lang.GenericTokenizer <file> (<token type>)
<file> : file to tokenize
<token type> : ([s]entences|[w]ords|[v]ocabulary|[c]haracters|[l]inebreak)