java.lang.Object
  moj.lang.StopList
public class StopList
A StopList holds a set of index terms that are to be stripped
away from a text (i.e. "stopped") before it is further processed. It also
supports thresholds in the form of minimum and maximum allowed lengths of
index terms, as well as a minimum number of tokens ("words") that a document
must hold in order to be processed.
These thresholds can be set in a properties file. The properties file can
contain any of the following items:
stoplist_file = <stopword file>
shortest_word = <length in characters>
longest_word = <length in characters>
minimum_words_per_file = <length in words>
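
As an illustration of the properties format above, the following minimal sketch reads the four keys with the standard java.util.Properties class. It is not the moj.lang implementation: the class StopListPropertiesSketch, the Settings holder (a stand-in for SLprops, whose API is not shown here), and the fallback values used when a key is absent are all assumptions made for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class StopListPropertiesSketch {

    /** Hypothetical holder for the parsed settings; not the real SLprops class. */
    static final class Settings {
        final String stoplistFile;
        final int shortestWord;
        final int longestWord;
        final int minimumWordsPerFile;

        Settings(String stoplistFile, int shortestWord, int longestWord, int minimumWordsPerFile) {
            this.stoplistFile = stoplistFile;
            this.shortestWord = shortestWord;
            this.longestWord = longestWord;
            this.minimumWordsPerFile = minimumWordsPerFile;
        }
    }

    /** Reads the four documented keys; the fallback values are assumptions. */
    static Settings load(Reader in) {
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Settings(
                props.getProperty("stoplist_file"), // null when the key is absent
                intValue(props, "shortest_word", 0),
                intValue(props, "longest_word", Integer.MAX_VALUE),
                intValue(props, "minimum_words_per_file", 0));
    }

    private static int intValue(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        // Properties keeps trailing whitespace in values, so trim before parsing.
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        String sample = "stoplist_file = stopwords.txt\n"
                + "shortest_word = 3\n"
                + "longest_word = 20\n"
                + "minimum_words_per_file = 50\n";
        Settings s = load(new StringReader(sample));
        System.out.println(s.stoplistFile + " " + s.shortestWord + " "
                + s.longestWord + " " + s.minimumWordsPerFile);
    }
}
```
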
The file containing the list of stop words should have one stop word per
line. If several columns are used for different data, the stop words should
be in the first column (tab "\t" is used as the column separator).
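
The stop word file layout described above could be parsed along the following lines. This is a sketch under stated assumptions, not the code StopList uses: the class and method names are hypothetical, and skipping blank lines and trimming surrounding whitespace are choices made for this example only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

public class StopWordFileSketch {

    /** Collects the first tab-separated column of every non-blank line. */
    static Set<String> readStopWords(Reader source) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Only the first column holds the stop word; later columns are ignored.
                String word = line.split("\t", 2)[0].trim();
                if (!word.isEmpty()) { // assumption: blank first columns are skipped
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Two lines carry an extra column, one line holds only the stop word.
        String sample = "the\t1045\nand\t980\nof\n";
        System.out.println(readStopWords(new StringReader(sample))); // prints [the, and, of]
    }
}
```
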
| Constructor Summary | |
|---|---|
| StopList() | Creates a StopList using the default properties file 'StopList.properties' |
| StopList(SLprops properties) | Creates a StopList using the stop list properties properties |
| StopList(java.lang.String propertiesFile) | Creates a StopList using the properties file propertiesFile |
| Method Summary | |
|---|---|
| void | addStopWord(java.lang.String stopword) - Add stop word to the StopList's internal list of stop words. |
| void | addStopWords(java.lang.String[] stopwords) - Add a set of stop words from an array containing exactly one stop word per element. |
| SLprops | getProperties() - Gets the Properties for the StopList |
| java.util.Set<java.lang.String> | getStopWords() - Return the current Set of stop words to be removed from a text. |
| static void | main(java.lang.String[] args) - Usage: moj.lang.StopList <file>, where <file> is the file to remove stopwords from and (<properties>) is the properties file. The properties file can contain any of the following items: stoplist_file = <stopword file>, shortest_word = <length in characters>, longest_word = <length in characters>, minimum_words_per_file = <length in words> |
| java.lang.String | removeStopWords(java.lang.String text) - Removes stop words in StopList from the String text |
| java.lang.String[] | removeStopWords(java.lang.String[] text) - Removes stop words in StopList from the String array text. |
| java.lang.String[] | removeStopWords(java.lang.String[] text, boolean verbose) - Removes stop words in StopList from the String array text and, optionally, outputs the number of removed words to System.out. |
| java.lang.String | removeStopWords(java.lang.String text, boolean verbose) - Removes stop words in StopList from the String text and outputs the number of removed words to System.out |
| java.util.Set<java.lang.String> | setStopWords(java.util.HashSet<java.lang.String> stopwords) - Sets the set of stop words to the provided HashSet. |
| Methods inherited from class java.lang.Object |
|---|
| equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public StopList()
Creates a StopList using the default properties file 'StopList.properties'
public StopList(SLprops properties)
Creates a StopList using the stop list properties properties
public StopList(java.lang.String propertiesFile)
Creates a StopList using the properties file propertiesFile
| Method Detail |
|---|
public SLprops getProperties()
Gets the Properties for the StopList
Returns:
Properties for the StopList
public java.util.Set<java.lang.String> setStopWords(java.util.HashSet<java.lang.String> stopwords)
Sets the set of stop words to the provided HashSet.
Parameters:
stopwords - HashSet containing the stop words to be removed from a text.
Returns:
Set containing the previous set of stop words.
public java.util.Set<java.lang.String> getStopWords()
Return the current Set of stop words to be removed from a text.
Returns:
Set containing the current set of stop words.
public void addStopWord(java.lang.String stopword)
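Add stop word to the StopList's internal list of stop words.
Parameters:
stopword - stop word to add
public void addStopWords(java.lang.String[] stopwords)
Add a set of stop words from an array containing exactly one stop word per element.
Parameters:
stopwords - stop words to be added to the StopList
public java.lang.String removeStopWords(java.lang.String text)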
Removes stop words in StopList from the String text
Parameters:
text - the text which is to have stop words removed
Returns:
text with stop words removed
public java.lang.String removeStopWords(java.lang.String text, boolean verbose)
Removes stop words in StopList from the String text
and outputs the number of removed words to System.out
Parameters:
text - the text which is to have stop words removed
verbose - output the number of removed words to System.out (true/false)
Returns:
text with stop words removed
public java.lang.String[] removeStopWords(java.lang.String[] text)
Removes stop words in StopList from the String
array text.
Parameters:
text - the text which is to have stop words removed
Returns:
text with stop words removed
public java.lang.String[] removeStopWords(java.lang.String[] text, boolean verbose)
Removes stop words in StopList from the String
array text and, optionally, outputs the number of removed words
to System.out. All processed words are transformed to lower case
and some cleanup is attempted (i.e. removing non-alphanumeric characters)
before they are checked against the filtering criteria (e.g. inclusion in
the list of stop words, word length constraints, etc.).
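
The processing order described for removeStopWords(String[], boolean), lower-casing, cleanup, then filtering, can be sketched as follows. This is a stand-alone illustration and not the StopList source: the class StopFilterSketch, its filter method and parameters, the cleanup regular expression, the inclusive length bounds, and the wording of the verbose message are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopFilterSketch {

    /**
     * Hypothetical filter: normalises each token, then drops it if it is empty,
     * a stop word, or outside the inclusive [shortest, longest] length range.
     */
    static String[] filter(String[] tokens, Set<String> stopWords,
                           int shortest, int longest, boolean verbose) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Lower-case first, then strip every character that is not a letter or digit.
            String word = token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]", "");
            boolean lengthOk = word.length() >= shortest && word.length() <= longest;
            if (!word.isEmpty() && lengthOk && !stopWords.contains(word)) {
                kept.add(word);
            }
        }
        if (verbose) {
            // Assumed message format; the real output of StopList is not documented here.
            System.out.println("Removed " + (tokens.length - kept.size()) + " words");
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a"));
        String[] text = {"The", "Quick,", "fox", "a", "jumps!"};
        // With lengths limited to 2..5 characters, "The" and "a" are removed.
        System.out.println(Arrays.toString(filter(text, stopWords, 2, 5, true)));
    }
}
```

Keeping the normalisation step ahead of the stop word lookup means the stop list only needs lower-case, punctuation-free entries, which matches the order given in the description.
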
Parameters:
text - the text which is to have stop words removed
verbose - output the number of removed words to System.out (true/false)
Returns:
text with stop words removed
public static void main(java.lang.String[] args)