moj.lang
Class StopList

java.lang.Object
  extended by moj.lang.StopList

public class StopList
extends java.lang.Object

A StopList holds a set of index terms that are to be stripped from a text (i.e. "stopped") before it is processed further. It also supports thresholds in the form of minimum and maximum allowed lengths of index terms, as well as a minimum number of tokens ("words") that a document must contain in order to be processed.

These thresholds can be set in a properties file. The properties file can contain any of the following items:
stoplist_file = <stopword file>
shortest_word = <length in characters>
longest_word = <length in characters>
minimum_words_per_file = <length in words>
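As an illustration of the key/value layout above, the following minimal sketch reads the four documented keys with java.util.Properties. It does not use the real SLprops class; the Thresholds holder, the parse helper, and the fallback values used when a key is absent are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class StopListPropertiesSketch {

    // Hypothetical holder for the four documented keys (not the real SLprops).
    static final class Thresholds {
        final String stoplistFile;
        final int shortestWord;
        final int longestWord;
        final int minimumWordsPerFile;

        Thresholds(String stoplistFile, int shortestWord, int longestWord, int minimumWordsPerFile) {
            this.stoplistFile = stoplistFile;
            this.shortestWord = shortestWord;
            this.longestWord = longestWord;
            this.minimumWordsPerFile = minimumWordsPerFile;
        }
    }

    // Parses properties text; the defaults for missing keys are assumptions.
    static Thresholds parse(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Thresholds(
                p.getProperty("stoplist_file"),
                intValue(p, "shortest_word", 0),
                intValue(p, "longest_word", Integer.MAX_VALUE),
                intValue(p, "minimum_words_per_file", 0));
    }

    // Reads an integer property, trimming trailing whitespace that Properties keeps.
    private static int intValue(Properties p, String key, int fallback) {
        String raw = p.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Thresholds t = parse(
                "stoplist_file = stopwords.txt\n"
              + "shortest_word = 3\n"
              + "longest_word = 25\n"
              + "minimum_words_per_file = 10\n");
        System.out.println(t.stoplistFile + " " + t.shortestWord + " "
                + t.longestWord + " " + t.minimumWordsPerFile);
    }
}
```

Since any of the items may be omitted, the sketch treats every key as optional; how StopList itself handles a missing key is not stated here.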

The file containing the list of stop words should have one stop word per line. If several columns are used for different data, the stop words should be in the first column (tab, "\t", is used as the column separator).
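The file layout described above can be read roughly as follows. This is a sketch under stated assumptions, not the loader StopList uses: the readFirstColumn helper is hypothetical, and skipping blank lines and trimming surrounding whitespace are choices made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

public class StopWordFileSketch {

    // Collects the first tab-separated column of every non-blank line.
    static Set<String> readFirstColumn(String fileContents) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader in = new BufferedReader(new StringReader(fileContents))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Limit 2: only the first tab matters, the rest is other data.
                String first = line.split("\t", 2)[0].trim();
                if (!first.isEmpty()) {
                    words.add(first);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Two lines carry extra columns (e.g. counts); one has a single column.
        String contents = "the\t120\nand\nof\t80\tfunction word\n";
        System.out.println(readFirstColumn(contents));
    }
}
```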

Version:
2009-April-21
Author:
Martin Hassel

Constructor Summary
StopList()
          Creates a StopList using the default properties file 'StopList.properties'
StopList(SLprops properties)
          Creates a StopList using the given stop list properties
StopList(java.lang.String propertiesFile)
          Creates a StopList using the properties file propertiesFile
 
Method Summary
 void addStopWord(java.lang.String stopword)
          Adds a stop word to the StopList's internal list of stop words.
 void addStopWords(java.lang.String[] stopwords)
          Add a set of stop words from an array containing exactly one stop word per element.
 SLprops getProperties()
          Gets the Properties for the StopList
 java.util.Set<java.lang.String> getStopWords()
          Return the current Set of stop words to be removed from a text.
static void main(java.lang.String[] args)
          Usage: moj.lang.StopList <file>
<file> : file to remove stopwords from
(<properties>) : properties file

The properties file can contain any of the following items:
stoplist_file = <stopword file>
shortest_word = <length in characters>
longest_word = <length in characters>
minimum_words_per_file = <length in words>
 java.lang.String removeStopWords(java.lang.String text)
          Removes stop words in StopList from the String text
 java.lang.String[] removeStopWords(java.lang.String[] text)
          Removes stop words in StopList from the String array text.
 java.lang.String[] removeStopWords(java.lang.String[] text, boolean verbose)
          Removes stop words in StopList from the String array text and, optionally, outputs the number of removed words to System.out.
 java.lang.String removeStopWords(java.lang.String text, boolean verbose)
          Removes stop words in StopList from the String text and, optionally, outputs the number of removed words to System.out
 java.util.Set<java.lang.String> setStopWords(java.util.HashSet<java.lang.String> stopwords)
          Sets the set of stop words to the provided HashSet.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

StopList

public StopList()
Creates a StopList using the default properties file 'StopList.properties'


StopList

public StopList(SLprops properties)
Creates a StopList using the given stop list properties


StopList

public StopList(java.lang.String propertiesFile)
Creates a StopList using the properties file propertiesFile

Method Detail

getProperties

public SLprops getProperties()
Gets the Properties for the StopList

Returns:
the Properties for the StopList

setStopWords

public java.util.Set<java.lang.String> setStopWords(java.util.HashSet<java.lang.String> stopwords)
Sets the set of stop words to the provided HashSet.

Parameters:
stopwords - HashSet containing the stop words to be removed from a text.
Returns:
a Set containing the previous set of stop words.
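The return-the-previous-set contract makes it possible to swap in a temporary stop list and restore the old one afterwards. The sketch below imitates that contract with a hypothetical stand-in class, since the internals of StopList are not shown here; only the method signatures are taken from this page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SwapStopWordsSketch {

    // Stand-in state; the real StopList may store its words differently.
    private Set<String> stopWords = new HashSet<>();

    // Mirrors the documented signature: installs the new set, returns the old one.
    Set<String> setStopWords(HashSet<String> replacement) {
        Set<String> previous = stopWords;
        stopWords = replacement;
        return previous;
    }

    Set<String> getStopWords() {
        return stopWords;
    }

    public static void main(String[] args) {
        SwapStopWordsSketch list = new SwapStopWordsSketch();
        list.setStopWords(new HashSet<>(Arrays.asList("the", "and")));

        // Temporarily use a different set, keeping the old one for later.
        Set<String> saved = list.setStopWords(new HashSet<>(Arrays.asList("och", "att")));

        // The parameter type is HashSet while the return type is Set,
        // so the saved set is copied into a HashSet before restoring it.
        list.setStopWords(new HashSet<>(saved));
        System.out.println(list.getStopWords().contains("the"));
    }
}
```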

getStopWords

public java.util.Set<java.lang.String> getStopWords()
Return the current Set of stop words to be removed from a text.

Returns:
a Set containing the current set of stop words.

addStopWord

public void addStopWord(java.lang.String stopword)
Adds a stop word to the StopList's internal list of stop words.

Parameters:
stopword - stop word to add

addStopWords

public void addStopWords(java.lang.String[] stopwords)
Add a set of stop words from an array containing exactly one stop word per element.

Parameters:
stopwords - stop words to be added to the StopList

removeStopWords

public java.lang.String removeStopWords(java.lang.String text)
Removes stop words in StopList from the String text

Parameters:
text - the text which is to have stop words removed
Returns:
text with stop words removed

removeStopWords

public java.lang.String removeStopWords(java.lang.String text,
                                        boolean verbose)
Removes stop words in StopList from the String text and, optionally, outputs the number of removed words to System.out

Parameters:
text - the text which is to have stop words removed
verbose - output the number of removed words to System.out (true/false)
Returns:
text with stop words removed

removeStopWords

public java.lang.String[] removeStopWords(java.lang.String[] text)
Removes stop words in StopList from the String array text.

Parameters:
text - the text which is to have stop words removed
Returns:
text with stop words removed

removeStopWords

public java.lang.String[] removeStopWords(java.lang.String[] text,
                                          boolean verbose)
Removes stop words in StopList from the String array text and, optionally, outputs the number of removed words to System.out. All processed words are transformed to lower case and some cleanup is attempted (i.e. removing non-alphanumeric characters) before they are checked against the filtering criteria (e.g. inclusion in the list of stop words, word length constraints, etc.).

Parameters:
text - the text which is to have stop words removed
verbose - output the number of removed words to System.out (true/false)
Returns:
text with stop words removed
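The processing order described for this method (lower-casing, cleanup, then filtering) might look roughly like the sketch below. It is not the actual implementation: the filter helper and its parameters are hypothetical, the exact cleanup rule is a guess at "removing non-alphanumeric characters", returning the normalized form of each kept word is an assumption, and the minimum_words_per_file threshold is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class RemoveStopWordsSketch {

    // Normalizes each token, then drops stop words and words outside the length bounds.
    static String[] filter(String[] tokens, Set<String> stopWords,
                           int shortest, int longest, boolean verbose) {
        List<String> kept = new ArrayList<>();
        int removed = 0;
        for (String token : tokens) {
            // Step 1: lower case. Step 2: strip anything that is not a letter or digit.
            String word = token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]", "");
            boolean badLength = word.length() < shortest || word.length() > longest;
            // Step 3: apply the filtering criteria to the cleaned word.
            if (word.isEmpty() || badLength || stopWords.contains(word)) {
                removed++;
            } else {
                kept.add(word);
            }
        }
        if (verbose) {
            System.out.println("Removed " + removed + " words");
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a"));
        String[] result = filter(new String[] {"The", "Quick,", "brown", "fox", "a"},
                stop, 2, 10, true);
        System.out.println(Arrays.toString(result));
    }
}
```

Checking the cleaned word rather than the raw token is what lets "The" match the stop word "the" and "Quick," pass as "quick".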

main

public static void main(java.lang.String[] args)
Usage: moj.lang.StopList <file>
<file> : file to remove stopwords from
(<properties>) : properties file

The properties file can contain any of the following items:
stoplist_file = <stopword file>
shortest_word = <length in characters>
longest_word = <length in characters>
minimum_words_per_file = <length in words>