moj.lang
Class FrequencyList<Entry>

java.lang.Object
  extended by moj.lang.FrequencyList<Entry>

public class FrequencyList<Entry>
extends java.lang.Object

FrequencyList holds a list of tokens (index terms) and their respective term and document frequencies in the texts indexed (or 'added') so far. Sublists can be extracted based on frequency thresholds.
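The bookkeeping described above can be illustrated with a minimal sketch. It is not the moj.lang.FrequencyList source: the class name TfDfSketch and its fields are hypothetical, and storing the counts as a String-to-int[] map is an assumption suggested only by the Map.Entry<String,int[]> type that appears in the method signatures.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for illustration; not the actual FrequencyList implementation.
class TfDfSketch {
    // word -> {term frequency, document frequency} (assumed layout)
    final Map<String, int[]> counts = new TreeMap<>();
    int documents = 0;

    // Adds one tokenized document and returns the number of tokens counted.
    int addText(String[] text) {
        Set<String> seenInThisDocument = new HashSet<>();
        for (String token : text) {
            int[] c = counts.computeIfAbsent(token, k -> new int[2]);
            c[0]++;                               // every occurrence raises TF
            if (seenInThisDocument.add(token)) {
                c[1]++;                           // first occurrence per document raises DF
            }
        }
        documents++;
        return text.length;
    }

    // Unknown words yield a frequency of zero.
    int getTF(String word) {
        int[] c = counts.get(word);
        return c == null ? 0 : c[0];
    }

    int getDF(String word) {
        int[] c = counts.get(word);
        return c == null ? 0 : c[1];
    }
}
```

In this sketch each addText call is treated as one document, so a token repeated within a single call raises its term frequency several times but its document frequency only once.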

Version:
2007-Oct-02
Author:
Martin Hassel

Nested Class Summary
 class FrequencyList.CompareAlphabetically
          For ordering words alphabetically.
 class FrequencyList.CompareByDF
          For ordering words from lowest to highest document frequency.
 class FrequencyList.CompareByDFfalling
          For ordering words from highest to lowest document frequency.
 class FrequencyList.CompareByTF
          For ordering words from lowest to highest term frequency.
 class FrequencyList.CompareByTFfalling
          For ordering words from highest to lowest term frequency.
 
Constructor Summary
FrequencyList()
          Creates a new empty FrequencyList.
 
Method Summary
 int addText(java.lang.String[] text)
          Adds tokenized text to the FrequencyList and updates the term and document frequencies for each encountered token.
 int addText(java.lang.String[] text, int mintokenlen, int maxtokenlen, boolean prefixonly)
          Adds tokenized text to the FrequencyList and updates the term and document frequencies for each encountered token/prefix.
 FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByDFinside(int lowerBound, int upperBound)
          Extract sublist containing only the index terms with a document frequency between lowerBound and upperBound.
 FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByDFoutside(int lowerBound, int upperBound)
          Extract sublist containing only the index terms with a document frequency lower than lowerBound or higher than upperBound.
 FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByTFinside(int lowerBound, int upperBound)
          Extract sublist containing only the index terms with a term frequency between lowerBound and upperBound.
 FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByTFoutside(int lowerBound, int upperBound)
          Extract sublist containing only the index terms with a term frequency lower than lowerBound or higher than upperBound.
 int getDF(java.lang.String word)
          Returns the document frequency for the given word.
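 int getDocumentCount()
          Returns the number of documents represented in the frequency list.
 int getHighestDF()
          Returns the highest document frequency in the frequency list.
 int getHighestTF()
          Returns the highest term frequency in the frequency list.
 int getLowestDF()
          Returns the lowest document frequency in the frequency list.
 int getLowestTF()
          Returns the lowest term frequency in the frequency list.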
 int getTF(java.lang.String word)
          Returns the term frequency for the given word.
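 java.util.Set<java.lang.String> getUniqueWords()
          Returns a Set containing each unique word represented in the frequency list.
 int getWordCount()
          Returns the number of words represented in the frequency list.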
 void getWordFrequency(java.lang.String[] words, int[] tf, int[] df)
          Stores index terms and their corresponding term and document frequencies in the parallel array parameters.
 void getWordFrequency(java.lang.String[] words, int[] tf, int[] df, java.util.Comparator<java.util.Map.Entry<java.lang.String,int[]>> comp)
          Stores index terms and their corresponding term and document frequencies in the parallel array parameters.
static void main(java.lang.String[] args)
          Usage: moj.lang.FrequencyList <file> (<minimum token length>)
<file> : file to build frequency counts on
<minimum token length> : minimum length for a token for it to be counted
 java.lang.String toString()
          Returns the frequency statistics sorted alphabetically.
 java.lang.String toString(java.util.Comparator<java.util.Map.Entry<java.lang.String,int[]>> comp)
          Returns the frequency statistics sorted using the supplied Comparator.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

FrequencyList

public FrequencyList()
Creates a new empty FrequencyList.

Method Detail

filterByTFinside

public FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByTFinside(int lowerBound,
                                                                                   int upperBound)
Extracts a sublist containing only the index terms whose term frequency lies between lowerBound and upperBound, both of which are acceptable values.

Parameters:
lowerBound - lowest acceptable term frequency
upperBound - highest acceptable term frequency
Returns:
a new FrequencyList containing the sublist

filterByDFinside

public FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByDFinside(int lowerBound,
                                                                                   int upperBound)
Extracts a sublist containing only the index terms whose document frequency lies between lowerBound and upperBound, both of which are acceptable values.

Parameters:
lowerBound - lowest acceptable document frequency
upperBound - highest acceptable document frequency
Returns:
a new FrequencyList containing the sublist

filterByTFoutside

public FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByTFoutside(int lowerBound,
                                                                                    int upperBound)
Extract sublist containing only the index terms with a term frequency lower than lowerBound or higher than upperBound.

Parameters:
lowerBound - lower term frequency boundary
upperBound - higher term frequency boundary
Returns:
a new FrequencyList containing the sublist

filterByDFoutside

public FrequencyList<java.util.Map.Entry<java.lang.String,int[]>> filterByDFoutside(int lowerBound,
                                                                                    int upperBound)
Extract sublist containing only the index terms with a document frequency lower than lowerBound or higher than upperBound.

Parameters:
lowerBound - lower document frequency boundary
upperBound - higher document frequency boundary
Returns:
a new FrequencyList containing the sublist
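The four filter methods differ only in which frequency they test and in whether they keep the terms inside or outside the range. The sketch below shows the two predicates on a plain map. It is a hypothetical illustration, not the library code: FilterSketch, its method names, and the index convention (0 for term frequency, 1 for document frequency) are assumptions. The inside variant treats both bounds as inclusive, following the "lowest/highest acceptable" parameter wording, and the outside variant uses strict comparisons, following "lower than" and "higher than".

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of moj.lang.
class FilterSketch {
    static final int TF = 0; // assumed position of the term frequency
    static final int DF = 1; // assumed position of the document frequency

    // Keeps entries whose frequency f satisfies lower <= f <= upper.
    static Map<String, int[]> filterInside(Map<String, int[]> counts, int index,
                                           int lower, int upper) {
        Map<String, int[]> out = new TreeMap<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            int f = e.getValue()[index];
            if (f >= lower && f <= upper) {
                out.put(e.getKey(), e.getValue().clone()); // copy so the sublist is independent
            }
        }
        return out;
    }

    // Keeps entries whose frequency f satisfies f < lower or f > upper.
    static Map<String, int[]> filterOutside(Map<String, int[]> counts, int index,
                                            int lower, int upper) {
        Map<String, int[]> out = new TreeMap<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            int f = e.getValue()[index];
            if (f < lower || f > upper) {
                out.put(e.getKey(), e.getValue().clone());
            }
        }
        return out;
    }
}
```

Under these assumptions the inside and outside results for the same bounds are disjoint and together cover the whole list.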

getHighestTF

public int getHighestTF()
Returns:
the highest term frequency in the frequency list

getLowestTF

public int getLowestTF()
Returns:
the lowest term frequency in the frequency list

getHighestDF

public int getHighestDF()
Returns:
the highest document frequency in the frequency list

getLowestDF

public int getLowestDF()
Returns:
the lowest document frequency in the frequency list

getDocumentCount

public int getDocumentCount()
Returns:
the number of documents represented in the frequency list

getWordCount

public int getWordCount()
Returns:
the number of words represented in the frequency list

getUniqueWords

public java.util.Set<java.lang.String> getUniqueWords()
Returns:
a Set containing each unique word represented in the frequency list

getTF

public int getTF(java.lang.String word)
Returns the term frequency for the given word. If the word is not in the frequency list then a frequency of zero is returned.

Parameters:
word - the word (token) to get the term frequency for
Returns:
the term frequency for the given word

getDF

public int getDF(java.lang.String word)
Returns the document frequency for the given word. If the word is not in the frequency list then a frequency of zero is returned.

Parameters:
word - the word (token) to get the document frequency for
Returns:
the document frequency for the given word

addText

public int addText(java.lang.String[] text)
Adds tokenized text to the FrequencyList and updates the term and document frequencies for each encountered token.

Parameters:
text - tokenized text to be added
Returns:
number of tokens (words) added to FrequencyList

addText

public int addText(java.lang.String[] text,
                   int mintokenlen,
                   int maxtokenlen,
                   boolean prefixonly)
Adds tokenized text to the FrequencyList and updates the term and document frequencies for each encountered token/prefix.

Parameters:
text - tokenized text to be added
mintokenlen - minimum length of a token for it to be added to the list
maxtokenlen - maximum length of a token for it to be added to the list; if maxtokenlen < mintokenlen, no maximum length is applied
prefixonly - if true, save only the first maxtokenlen characters of each token
Returns:
number of tokens (words) added to FrequencyList
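How the three extra parameters might interact is sketched below. This is one possible reading of the parameter descriptions, not the documented behaviour: the class TokenLengthSketch and the method normalize are hypothetical, and the sketch assumes that an over-long token is truncated to its first maxtokenlen characters when prefixonly is true and skipped otherwise.

```java
// Hypothetical helper for illustration; not part of moj.lang.
class TokenLengthSketch {
    // Returns the form of the token that would be indexed, or null if the token is skipped.
    static String normalize(String token, int mintokenlen, int maxtokenlen, boolean prefixonly) {
        if (token.length() < mintokenlen) {
            return null;                          // too short to be counted
        }
        boolean unlimited = maxtokenlen < mintokenlen; // no upper limit in this case
        if (unlimited || token.length() <= maxtokenlen) {
            return token;                         // within limits, keep unchanged
        }
        // Over-long token: keep the prefix (assumed) or skip it (assumed).
        return prefixonly ? token.substring(0, maxtokenlen) : null;
    }
}
```

With these assumptions, normalize("indexing", 3, 5, true) yields the prefix "index", so several inflected forms of a word can share one entry.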

getWordFrequency

public void getWordFrequency(java.lang.String[] words,
                             int[] tf,
                             int[] df)
Stores index terms and their corresponding term and document frequencies in the parallel array parameters. The arrays will be filled with values until full, or with at most getUniqueWords().size() values. The index terms are sorted alphabetically.

Parameters:
words - array to be filled with the unique words that were encountered in the source document(s)
tf - array to be filled with the term frequencies of the words at corresponding index in the words array
df - array to be filled with the document frequencies of the words at corresponding index in the words array
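The "filled until full, or with at most getUniqueWords().size() values" rule can be pictured as follows. The sketch is hypothetical: ParallelArraySketch and fill are illustrative names, the count layout {tf, df} is an assumption, and "alphabetically" is approximated by the natural String ordering of a TreeMap. Unlike the documented method, which returns void, the sketch returns the number of slots written so the result is easy to check.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of moj.lang.
class ParallelArraySketch {
    // Copies words and their frequencies into the caller's arrays in key order.
    // Returns how many positions were written.
    static int fill(Map<String, int[]> counts, String[] words, int[] tf, int[] df) {
        // Stop at the shortest array so no index goes out of bounds.
        int limit = Math.min(words.length, Math.min(tf.length, df.length));
        int i = 0;
        for (Map.Entry<String, int[]> e : new TreeMap<>(counts).entrySet()) {
            if (i >= limit) {
                break;                 // arrays are full
            }
            words[i] = e.getKey();
            tf[i] = e.getValue()[0];   // assumed: index 0 holds the term frequency
            df[i] = e.getValue()[1];   // assumed: index 1 holds the document frequency
            i++;
        }
        return i;
    }
}
```

Sizing all three arrays from getUniqueWords().size() would receive every entry; smaller arrays receive only the first entries in sort order.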

getWordFrequency

public void getWordFrequency(java.lang.String[] words,
                             int[] tf,
                             int[] df,
                             java.util.Comparator<java.util.Map.Entry<java.lang.String,int[]>> comp)
Stores index terms and their corresponding term and document frequencies in the parallel array parameters. The arrays will be filled with values until full, or with at most getUniqueWords().size() values. The index terms are sorted according to the comparator comp.

Parameters:
words - array to be filled with the unique words that were encountered in the source document(s)
tf - array to be filled with the term frequencies of the words at corresponding index in the words array
df - array to be filled with the document frequencies of the words at corresponding index in the words array
comp - Comparator denoting how the parallel arrays should be ordered

toString

public java.lang.String toString()
Returns the frequency statistics sorted alphabetically.

Overrides:
toString in class java.lang.Object
Returns:
A String containing the sorted entries

toString

public java.lang.String toString(java.util.Comparator<java.util.Map.Entry<java.lang.String,int[]>> comp)
Returns the frequency statistics sorted using the supplied Comparator.

Parameters:
comp - Comparator to use for sorting entries
Returns:
A String containing the sorted entries
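A comparator over Map.Entry<String,int[]> values, in the spirit of the nested CompareByTFfalling class, might look like the sketch below. Everything here is hypothetical: ComparatorSketch, BY_TF_FALLING, and format are illustrative names, the {tf, df} layout is an assumption, and the tab-separated output is an invented format, since this page does not specify what the returned String looks like.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of moj.lang.
class ComparatorSketch {
    // Orders entries from highest to lowest term frequency (assumed at index 0).
    static final Comparator<Map.Entry<String, int[]>> BY_TF_FALLING =
            (a, b) -> Integer.compare(b.getValue()[0], a.getValue()[0]);

    // Renders one "word<TAB>tf<TAB>df" line per entry, in comparator order.
    static String format(Map<String, int[]> counts, Comparator<Map.Entry<String, int[]>> comp) {
        List<Map.Entry<String, int[]>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(comp); // stable sort: ties keep the map's iteration order
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, int[]> e : entries) {
            sb.append(e.getKey()).append('\t')
              .append(e.getValue()[0]).append('\t')
              .append(e.getValue()[1]).append('\n');
        }
        return sb.toString();
    }
}
```

Passing a different comparator, for example one that compares index 1 in ascending order, would give an ordering analogous to CompareByDF.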

main

public static void main(java.lang.String[] args)
Usage: moj.lang.FrequencyList <file> (<minimum token length>)
<file> : file to build frequency counts on
<minimum token length> : minimum length for a token for it to be counted