java.lang.Object
  moj.lang.StopList
public class StopList
A StopList
holds a set of index terms that are to be stripped
away from a text (i.e. "stopped") before it is further processed. It also
supports thresholds in the form of minimum and maximum allowed lengths of index
terms, as well as a minimum number of tokens ("words") that a document must
hold in order to be processed.
These thresholds can be set in a properties file. The properties file can
contain any of the following items:
stoplist_file = <stopword file>
shortest_word = <length in characters>
longest_word = <length in characters>
minimum_words_per_file = <length in words>
The file containing the list of stop words should have one stop word per
line. If several columns are used for different data, the stop words should be
in the first column (tab "\t" is used as the column separator).
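The configuration format described above can be sketched with the standard java.util.Properties class. This is a minimal illustration and not the StopList implementation: the class StopListConfigSketch, its helper methods, the fallback values and the sample file name stopwords.txt are all assumptions made for the example. Only the four property keys and the "first tab-separated column" rule come from the description above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

/** Hypothetical helper; shows one way to read the documented settings. */
final class StopListConfigSketch {

    /** Parses "key = value" lines in the standard Java properties format. */
    static Properties parseProperties(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Reads an integer setting, using the (assumed) fallback when the key is absent. */
    static int intSetting(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    /** Keeps the first tab-separated column of every non-empty line. */
    static Set<String> parseStopWords(String fileContents) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : fileContents.split("\\R")) {
            String firstColumn = line.split("\t", 2)[0].trim();
            if (!firstColumn.isEmpty()) {
                words.add(firstColumn);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Properties props = parseProperties(String.join("\n",
                "stoplist_file = stopwords.txt",
                "shortest_word = 2",
                "longest_word = 25",
                "minimum_words_per_file = 10"));
        System.out.println("stop word file: " + props.getProperty("stoplist_file"));
        System.out.println("shortest word:  " + intSetting(props, "shortest_word", 1));
        System.out.println("longest word:   " + intSetting(props, "longest_word", Integer.MAX_VALUE));
        System.out.println("minimum words:  " + intSetting(props, "minimum_words_per_file", 0));

        // Second column (here a made-up frequency count) is ignored.
        System.out.println(parseStopWords("the\t120\nand\t98\nof"));
    }
}
```

Any key may be left out of the properties text; the sketch then falls back to the supplied default, which mirrors the "any of the following items" wording but is otherwise a guess at the behaviour.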
Constructor Summary | |
---|---|
StopList() | Creates a StopList using the default properties file 'StopList.properties'. |
StopList(SLprops properties) | Creates a StopList using the stop list properties given in properties. |
StopList(java.lang.String propertiesFile) | Creates a StopList using the properties file propertiesFile. |
Method Summary | |
---|---|
void | addStopWord(java.lang.String stopword) - Adds a stop word to the StopList's internal list of stop words. |
void | addStopWords(java.lang.String[] stopwords) - Adds a set of stop words from an array containing exactly one stop word per element. |
SLprops | getProperties() - Gets the Properties for the StopList. |
java.util.Set<java.lang.String> | getStopWords() - Returns the current Set of stop words to be removed from a text. |
static void | main(java.lang.String[] args) - Usage: moj.lang.StopList <file>; <file>: file to remove stop words from; (<properties>): properties file. The properties file can contain any of the following items: stoplist_file = <stopword file>; shortest_word = <length in characters>; longest_word = <length in characters>; minimum_words_per_file = <length in words>. |
java.lang.String | removeStopWords(java.lang.String text) - Removes stop words in StopList from the String text. |
java.lang.String[] | removeStopWords(java.lang.String[] text) - Removes stop words in StopList from the String array text. |
java.lang.String[] | removeStopWords(java.lang.String[] text, boolean verbose) - Removes stop words in StopList from the String array text and, optionally, outputs the number of removed words to System.out. |
java.lang.String | removeStopWords(java.lang.String text, boolean verbose) - Removes stop words in StopList from the String text and outputs the number of removed words to System.out. |
java.util.Set<java.lang.String> | setStopWords(java.util.HashSet<java.lang.String> stopwords) - Sets the set of stop words to the provided HashSet. |
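The bookkeeping methods summarized above (setStopWords, getStopWords, addStopWord, addStopWords) can be illustrated with a small stand-in class. This is a sketch under assumptions, not the StopList source: the class StopWordStoreSketch is hypothetical, and whether the real class stores the supplied HashSet directly or copies it is not stated in this documentation. The one documented detail it mirrors is that setStopWords returns the previous set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the stop word bookkeeping of a stop list. */
final class StopWordStoreSketch {
    private Set<String> stopWords = new HashSet<>();

    /** Replaces the stop word set and hands back the set that was in use before. */
    Set<String> setStopWords(HashSet<String> replacement) {
        Set<String> previous = stopWords;
        stopWords = replacement; // assumption: the supplied set is stored as-is
        return previous;
    }

    /** Returns the set currently used for filtering. */
    Set<String> getStopWords() {
        return stopWords;
    }

    /** Adds a single stop word. */
    void addStopWord(String stopword) {
        stopWords.add(stopword);
    }

    /** Adds every element of the array as one stop word. */
    void addStopWords(String[] stopwords) {
        for (String word : stopwords) {
            addStopWord(word);
        }
    }

    public static void main(String[] args) {
        StopWordStoreSketch store = new StopWordStoreSketch();
        store.addStopWords(new String[] {"the", "and"});

        // Swap in a new list and keep the old one around, e.g. to restore it later.
        Set<String> previous = store.setStopWords(new HashSet<>(Arrays.asList("of")));
        store.addStopWord("a");

        System.out.println("previous: " + previous);
        System.out.println("current:  " + store.getStopWords());
    }
}
```

Returning the previous set makes it straightforward to swap in a temporary stop list and restore the original afterwards.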
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public StopList()
Creates a StopList using the default properties file 'StopList.properties'.

public StopList(SLprops properties)
Creates a StopList using the stop list properties given in properties.

public StopList(java.lang.String propertiesFile)
Creates a StopList using the properties file propertiesFile.
Method Detail |
---|
public SLprops getProperties()
Gets the Properties for the StopList.
Returns:
the Properties for the StopList

public java.util.Set<java.lang.String> setStopWords(java.util.HashSet<java.lang.String> stopwords)
Sets the set of stop words to the provided HashSet.
Parameters:
stopwords - HashSet containing the stop words to be removed from a text.
Returns:
Set containing the previous set of stop words.

public java.util.Set<java.lang.String> getStopWords()
Returns the current Set of stop words to be removed from a text.
Returns:
Set containing the current set of stop words.

public void addStopWord(java.lang.String stopword)
Adds a stop word to the StopList's internal list of stop words.
Parameters:
stopword - stop word to add

public void addStopWords(java.lang.String[] stopwords)
Adds a set of stop words from an array containing exactly one stop word per element.
Parameters:
stopwords - stop words to be added to the StopList

public java.lang.String removeStopWords(java.lang.String text)
Removes stop words in StopList from the String text.
Parameters:
text - the text which is to have stop words removed
Returns:
text with stop words removed

public java.lang.String removeStopWords(java.lang.String text, boolean verbose)
Removes stop words in StopList from the String text and outputs the number of removed words to System.out.
Parameters:
text - the text which is to have stop words removed
verbose - output the number of removed words to System.out (true/false)
Returns:
text with stop words removed

public java.lang.String[] removeStopWords(java.lang.String[] text)
Removes stop words in StopList from the String array text.
Parameters:
text - the text which is to have stop words removed
Returns:
text with stop words removed

public java.lang.String[] removeStopWords(java.lang.String[] text, boolean verbose)
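Removes stop words in StopList from the String array text and, optionally, outputs the number of removed words to System.out.
All processed words are transformed to lower case
and some cleanup is attempted (e.g. removing non-alphanumeric characters)
before they are checked against the filtering criteria (e.g. inclusion in
the list of stop words, word length constraints etc.).
Parameters:
text - the text which is to have stop words removed
verbose - output the number of removed words to System.out (true/false)
Returns:
text with stop words removed

public static void main(java.lang.String[] args)
Usage: moj.lang.StopList <file>
<file> : file to remove stop words from
(<properties>) : properties file
The properties file can contain any of the following items:
stoplist_file = <stopword file>
shortest_word = <length in characters>
longest_word = <length in characters>
minimum_words_per_file = <length in words>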
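
The processing order that removeStopWords(java.lang.String[], boolean) describes (lower-casing, cleanup, then the filtering criteria) can be sketched as follows. This is an illustration under assumptions and not the StopList source: the class StopFilterSketch and its filter method are hypothetical, cleanup is assumed to mean stripping every character that is not a letter or digit, the length limits are assumed to be inclusive, and the cleaned lower-case form is assumed to be what is returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the documented filtering order. */
final class StopFilterSketch {

    static String[] filter(String[] tokens, Set<String> stopWords,
                           int shortest, int longest, boolean verbose) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // 1. Lower-case, then strip non-alphanumeric characters (assumed cleanup rule).
            String word = token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
            // 2. Apply the filtering criteria: emptiness, length limits, stop word list.
            if (word.isEmpty()) continue;
            if (word.length() < shortest || word.length() > longest) continue;
            if (stopWords.contains(word)) continue;
            kept.add(word);
        }
        if (verbose) {
            System.out.println("Removed " + (tokens.length - kept.size()) + " words");
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "over"));
        String[] text = {"The", "quick,", "fox!", "jumps", "over", "a", "dog"};
        String[] result = filter(text, stopWords, 2, 10, true);
        System.out.println(String.join(" ", result));
    }
}
```

Cleaning each token before the stop word lookup is what lets "The" and "fox!" match entries stored in plain lower case; a document-level check against minimum_words_per_file would be applied separately and is left out of the sketch.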