FrequencyList

moj.lang
Class FrequencyList<Entry>

java.lang.Object
  moj.lang.FrequencyList<Entry>

public class FrequencyList<Entry>
extends java.lang.Object

FrequencyList holds a list of tokens (index terms) and their respective term and document frequencies in the texts indexed (or 'added') so far. Sublists can be extracted based on frequency thresholds.
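To make the two counts concrete, the sketch below tallies term frequency (total occurrences of a token) and document frequency (number of added texts that contain the token) for two small tokenized texts. The class `TfDfSketch` and the `{tf, df}` layout of the `int[]` values are assumptions made for illustration; they are not taken from the `moj.lang` source.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in, not the moj.lang implementation.
class TfDfSketch {
    // token -> {term frequency, document frequency} (assumed layout)
    final Map<String, int[]> counts = new TreeMap<>();
    int documents = 0;

    // Adds one tokenized text and returns the number of tokens counted.
    int addText(String[] text) {
        Set<String> seenInThisText = new HashSet<>();
        for (String token : text) {
            int[] c = counts.computeIfAbsent(token, k -> new int[2]);
            c[0]++;                                 // every occurrence raises TF
            if (seenInThisText.add(token)) {
                c[1]++;                             // first occurrence in this text raises DF
            }
        }
        documents++;
        return text.length;
    }

    public static void main(String[] args) {
        TfDfSketch list = new TfDfSketch();
        list.addText(new String[] {"a", "rose", "is", "a", "rose"});
        list.addText(new String[] {"a", "thorn"});
        for (Map.Entry<String, int[]> e : list.counts.entrySet()) {
            System.out.println(e.getKey() + " tf=" + e.getValue()[0] + " df=" + e.getValue()[1]);
        }
    }
}
```

In this sketch each call to `addText` is treated as one document, which matches the description of texts being "added" one at a time.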

Version:: 2007-Oct-02
Author:: Martin Hassel

Nested Class Summary
`class`	`FrequencyList.CompareAlphabetically` For ordering words alphabetically.
`class`	`FrequencyList.CompareByDF` For ordering words from lowest to highest document frequency.
`class`	`FrequencyList.CompareByDFfalling` For ordering words from highest to lowest document frequency.
`class`	`FrequencyList.CompareByTF` For ordering words from lowest to highest term frequency.
`class`	`FrequencyList.CompareByTFfalling` For ordering words from highest to lowest term frequency.

Constructor Summary
`FrequencyList()` Creates a new empty `FrequencyList`.

Method Summary
`int`	`addText(java.lang.String[] text)` Adds tokenized text to the `FrequencyList` and updates the term and document frequencies for each encountered token.
`int`	`addText(java.lang.String[] text, int mintokenlen, int maxtokenlen, boolean prefixonly)` Adds tokenized text to the `FrequencyList` and updates the term and document frequencies for each encountered token/prefix.
`FrequencyList<java.util.Map.Entry<java.lang.String,int[]>>`	`filterByDFinside(int lowerBound, int upperBound)` Extract sublist containing only the index terms with a document frequency between `lowerBound` and `upperBound`.
`FrequencyList<java.util.Map.Entry<java.lang.String,int[]>>`	`filterByDFoutside(int lowerBound, int upperBound)` Extract sublist containing only the index terms with a document frequency lower than `lowerBound` or higher than `upperBound`.
`FrequencyList<java.util.Map.Entry<java.lang.String,int[]>>`	`filterByTFinside(int lowerBound, int upperBound)` Extract sublist containing only the index terms with a term frequency between `lowerBound` and `upperBound`.
`FrequencyList<java.util.Map.Entry<java.lang.String,int[]>>`	`filterByTFoutside(int lowerBound, int upperBound)` Extract sublist containing only the index terms with a term frequency lower than `lowerBound` or higher than `upperBound`.
`int`	`getDF(java.lang.String word)` Returns the document frequency for the given word.
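`int`	`getDocumentCount()` Returns the number of documents represented in the frequency list.
`int`	`getHighestDF()` Returns the highest document frequency in the frequency list.
`int`	`getHighestTF()` Returns the highest term frequency in the frequency list.
`int`	`getLowestDF()` Returns the lowest document frequency in the frequency list.
`int`	`getLowestTF()` Returns the lowest term frequency in the frequency list.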
`int`	`getTF(java.lang.String word)` Returns the term frequency for the given word.
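`java.util.Set<java.lang.String>`	`getUniqueWords()` Returns a `Set` containing each unique word represented in the frequency list.
`int`	`getWordCount()` Returns the number of words represented in the frequency list.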
`void`	`getWordFrequency(java.lang.String[] words, int[] tf, int[] df)` Stores index terms and their corresponding term and document frequencies in the parallel array parameters.
`void`	`getWordFrequency(java.lang.String[] words, int[] tf, int[] df, java.util.Comparator<java.util.Map.Entry<java.lang.String,int[]>> comp)` Stores index terms and their corresponding term and document frequencies in the parallel array parameters.
`static void`	`main(java.lang.String[] args)` Usage: `moj.lang.FrequencyList <file> (<minimum token length>)`, where `<file>` is the file to build frequency counts on and `<minimum token length>` is the minimum length for a token for it to be counted.
`java.lang.String`	`toString()` Returns the frequency statistics sorted alphabetically.
`java.lang.String`	`toString(java.util.Comparator<java.util.Map.Entry<java.lang.String,int[]>> comp)` Returns the frequency statistics using the supplied `Comparator`.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Constructor Detail

FrequencyList

public FrequencyList()

Creates a new empty FrequencyList.

Method Detail

filterByTFinside

public FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByTFinside(int lowerBound,
                                                                                   int upperBound)

Extract sublist containing only the index terms with a term frequency between lowerBound and upperBound.

Parameters:: lowerBound - lowest acceptable term frequency; upperBound - highest acceptable term frequency
Returns:: a new FrequencyList containing the sublist
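The "inside" filter can be pictured as a range check on each entry. The sketch below is a stand-in that works on a plain `Map<String, int[]>` rather than on a real `FrequencyList`; the class name, the `{tf, df}` array layout, and the inclusive treatment of both bounds (suggested by the wording "lowest acceptable" and "highest acceptable") are assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in, not the moj.lang implementation.
class InsideFilterSketch {
    // Keeps entries whose term frequency lies in [lowerBound, upperBound].
    // value[0] is assumed to hold the term frequency.
    static Map<String, int[]> filterByTFinside(Map<String, int[]> counts,
                                               int lowerBound, int upperBound) {
        Map<String, int[]> sub = new TreeMap<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            int tf = e.getValue()[0];
            if (tf >= lowerBound && tf <= upperBound) {
                // Copy the counts so the sublist is independent of the source.
                sub.put(e.getKey(), e.getValue().clone());
            }
        }
        return sub;
    }

    public static void main(String[] args) {
        Map<String, int[]> counts = new TreeMap<>();
        counts.put("the", new int[] {40, 5});
        counts.put("rose", new int[] {3, 2});
        counts.put("thorn", new int[] {1, 1});
        System.out.println(filterByTFinside(counts, 2, 10).keySet());
    }
}
```

A filter like this is a common way to drop both very rare tokens and very frequent function words in one pass.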

filterByDFinside

public FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByDFinside(int lowerBound,
                                                                                   int upperBound)

Extract sublist containing only the index terms with a document frequency between lowerBound and upperBound.

Parameters:: lowerBound - lowest acceptable document frequency; upperBound - highest acceptable document frequency
Returns:: a new FrequencyList containing the sublist

filterByTFoutside

public FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByTFoutside(int lowerBound,
                                                                                    int upperBound)

Extract sublist containing only the index terms with a term frequency lower than lowerBound or higher than upperBound.

Parameters:: lowerBound - lower term frequency boundary; upperBound - higher term frequency boundary
Returns:: a new FrequencyList containing the sublist

filterByDFoutside

public FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByDFoutside(int lowerBound,
                                                                                    int upperBound)

Extract sublist containing only the index terms with a document frequency lower than lowerBound or higher than upperBound.

Parameters:: lowerBound - lower document frequency boundary; upperBound - higher document frequency boundary
Returns:: a new FrequencyList containing the sublist
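The "outside" filters are the complement of a range check: an entry is kept when its frequency is strictly lower than `lowerBound` or strictly higher than `upperBound`. The sketch below illustrates this for document frequency on a plain `Map<String, int[]>`; the class name and the assumption that `value[1]` holds the document frequency are illustrative only.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in, not the moj.lang implementation.
class OutsideFilterSketch {
    // Keeps entries whose document frequency is < lowerBound or > upperBound.
    // value[1] is assumed to hold the document frequency.
    static Map<String, int[]> filterByDFoutside(Map<String, int[]> counts,
                                                int lowerBound, int upperBound) {
        Map<String, int[]> sub = new TreeMap<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            int df = e.getValue()[1];
            if (df < lowerBound || df > upperBound) {
                sub.put(e.getKey(), e.getValue().clone());
            }
        }
        return sub;
    }

    public static void main(String[] args) {
        Map<String, int[]> counts = new TreeMap<>();
        counts.put("the", new int[] {40, 5});
        counts.put("rose", new int[] {3, 2});
        counts.put("thorn", new int[] {1, 1});
        System.out.println(filterByDFoutside(counts, 2, 4).keySet());
    }
}
```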

getHighestTF

public int getHighestTF()

Returns:: the highest term frequency in the frequency list

getLowestTF

public int getLowestTF()

Returns:: the lowest term frequency in the frequency list

getHighestDF

public int getHighestDF()

Returns:: the highest document frequency in the frequency list

getLowestDF

public int getLowestDF()

Returns:: the lowest document frequency in the frequency list

getDocumentCount

public int getDocumentCount()

Returns:: the number of documents represented in the frequency list

getWordCount

public int getWordCount()

Returns:: the number of words represented in the frequency list

getUniqueWords

public java.util.Set<java.lang.String> getUniqueWords()

Returns:: a Set containing each unique word represented in the frequency list

getTF

public int getTF(java.lang.String word)

Returns the term frequency for the given word. If the word is not in the frequency list then a frequency of zero is returned.

Parameters:: word - the word (token) to get the term frequency for
Returns:: the term frequency for the given word

getDF

public int getDF(java.lang.String word)

Returns the document frequency for the given word. If the word is not in the frequency list then a frequency of zero is returned.

Parameters:: word - the word (token) to get the document frequency for
Returns:: the document frequency for the given word
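Both lookups return zero rather than failing when the word is unknown. A minimal sketch of that behaviour, using a hypothetical `LookupSketch` class with an assumed `{tf, df}` array per word:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, not the moj.lang implementation.
class LookupSketch {
    // word -> {term frequency, document frequency} (assumed layout)
    private final Map<String, int[]> counts = new HashMap<>();

    void put(String word, int tf, int df) {
        counts.put(word, new int[] {tf, df});
    }

    // Returns 0 for words that were never added.
    int getTF(String word) {
        int[] c = counts.get(word);
        return c == null ? 0 : c[0];
    }

    // Returns 0 for words that were never added.
    int getDF(String word) {
        int[] c = counts.get(word);
        return c == null ? 0 : c[1];
    }
}
```

Callers therefore cannot distinguish "never seen" from a stored count of zero through these two methods alone; `getUniqueWords()` can be consulted if the distinction matters.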

addText

public int addText(java.lang.String[] text)

Adds tokenized text to the FrequencyList and updates the term and document frequencies for each encountered token.

Parameters:: text - tokenized text to be added
Returns:: number of tokens (words) added to FrequencyList

addText

public int addText(java.lang.String[] text,
                   int mintokenlen,
                   int maxtokenlen,
                   boolean prefixonly)

Adds tokenized text to the FrequencyList and updates the term and document frequencies for each encountered token/prefix.

Parameters:: text - tokenized text to be added; mintokenlen - minimum length a token must have to be added to the list; maxtokenlen - maximum length a token may have to be added to the list (if maxtokenlen < mintokenlen, no upper limit applies); prefixonly - if true, save only the first maxtokenlen characters of each token
Returns:: number of tokens (words) added to FrequencyList
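One possible reading of how the three length parameters interact is sketched below. The helper `normalize` is hypothetical; in particular, skipping over-long tokens when `prefixonly` is false and truncating them when it is true is an interpretation of the parameter descriptions, not confirmed behaviour.

```java
// Hypothetical helper illustrating one reading of the parameters.
class TokenLengthSketch {
    // Returns the form of the token that would be counted, or null if it is skipped.
    static String normalize(String token, int mintokenlen, int maxtokenlen, boolean prefixonly) {
        if (token.length() < mintokenlen) {
            return null;                          // too short to be counted
        }
        if (maxtokenlen < mintokenlen) {
            return token;                         // documented case: no upper limit
        }
        if (token.length() <= maxtokenlen) {
            return token;                         // within both limits
        }
        // Over-long token: keep its prefix or drop it (assumed behaviour).
        return prefixonly ? token.substring(0, maxtokenlen) : null;
    }

    public static void main(String[] args) {
        System.out.println(normalize("frequency", 3, 5, true));
        System.out.println(normalize("frequency", 3, 5, false));
        System.out.println(normalize("frequency", 3, 0, false));
    }
}
```

Counting prefixes instead of whole tokens is a crude form of stemming: inflected forms that share their first `maxtokenlen` characters collapse into one index term.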

getWordFrequency

public void getWordFrequency(java.lang.String[] words,
                             int[] tf,
                             int[] df)

Stores index terms and their corresponding term and document frequencies in the parallel array parameters. The arrays will be filled with values until full, or with at most getUniqueWords().size() values. The index terms are sorted alphabetically.

Parameters:: words - array to be filled with the unique words that were encountered in the source document(s); tf - array to be filled with the term frequencies of the words at corresponding index in the words array; df - array to be filled with the document frequencies of the words at corresponding index in the words array
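The "filled until full, or with at most `getUniqueWords().size()` values" rule can be sketched as a bounded copy into three parallel arrays. The helper below is a stand-in operating on a `TreeMap<String, int[]>` with an assumed `{tf, df}` layout; unlike the documented method it returns the number of slots written, purely for convenience. Note that `TreeMap` orders keys by `String.compareTo`, which may differ from the library's notion of alphabetical order.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in, not the moj.lang implementation.
class ParallelArraySketch {
    // Copies entries in key order into the parallel arrays and
    // stops when the shortest array is full or the entries run out.
    static int fill(TreeMap<String, int[]> counts, String[] words, int[] tf, int[] df) {
        int limit = Math.min(words.length, Math.min(tf.length, df.length));
        int i = 0;
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            if (i >= limit) {
                break;
            }
            words[i] = e.getKey();
            tf[i] = e.getValue()[0];
            df[i] = e.getValue()[1];
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> counts = new TreeMap<>();
        counts.put("thorn", new int[] {1, 1});
        counts.put("rose", new int[] {3, 2});
        counts.put("a", new int[] {3, 2});
        String[] words = new String[2];
        int[] tf = new int[2];
        int[] df = new int[2];
        int filled = fill(counts, words, tf, df);
        for (int i = 0; i < filled; i++) {
            System.out.println(words[i] + " tf=" + tf[i] + " df=" + df[i]);
        }
    }
}
```

With a real `FrequencyList`, sizing all three arrays to `getUniqueWords().size()` before the call would ensure that every index term fits.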

getWordFrequency

public void getWordFrequency(java.lang.String[] words,
                             int[] tf,
                             int[] df,
                             java.util.Comparator<java.util.Map.Entry<java.lang.String,int[]>> comp)

Stores index terms and their corresponding term and document frequencies in the parallel array parameters, in the order given by the supplied Comparator.

Parameters:: words - array to be filled with the unique words that were encountered in the source document(s); tf - array to be filled with the term frequencies of the words at corresponding index in the words array; df - array to be filled with the document frequencies of the words at corresponding index in the words array; comp - Comparator denoting how the parallel arrays should be ordered
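A comparator of the required type, `Comparator<Map.Entry<String, int[]>>`, might look like the sketch below, which orders entries from highest to lowest term frequency in the spirit of the nested `CompareByTFfalling` class. The `{tf, df}` layout of the `int[]` value and the alphabetical tie-break are assumptions, and `ComparatorSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not the moj.lang implementation.
class ComparatorSketch {
    // Highest term frequency first; ties broken by key order (assumed).
    static final Comparator<Map.Entry<String, int[]>> BY_TF_FALLING = (a, b) -> {
        int c = Integer.compare(b.getValue()[0], a.getValue()[0]);
        return c != 0 ? c : a.getKey().compareTo(b.getKey());
    };

    // Returns the words of the map in the order imposed by the comparator.
    static List<String> order(Map<String, int[]> counts,
                              Comparator<Map.Entry<String, int[]>> comp) {
        List<Map.Entry<String, int[]>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(comp);
        List<String> words = new ArrayList<>();
        for (Map.Entry<String, int[]> e : entries) {
            words.add(e.getKey());
        }
        return words;
    }
}
```

The same comparator type is accepted by `toString(Comparator)`, so one ordering can serve both the array export and the printed statistics.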

toString

public java.lang.String toString()

Returns the frequency statistics sorted alphabetically.

Overrides:: toString in class java.lang.Object

Returns:: A String containing the sorted entries

toString

public java.lang.String toString(java.util.Comparator<java.util.Map.Entry<java.lang.String,int[]>> comp)

Returns the frequency statistics using the supplied Comparator.

Parameters:: comp - Comparator to use for sorting entries
Returns:: A String containing the sorted entries

main

public static void main(java.lang.String[] args)

Usage: moj.lang.FrequencyList <file> (<minimum token length>)
<file> : file to build frequency counts on
<minimum token length> : minimum length for a token for it to be counted
