java.lang.Object
  moj.ri.RandomIndex
public class RandomIndex
A Random Index is an index that contextually indexes the texts that are fed to it. Not only is it possible to ask the RandomIndex for the term frequency for a specific index term, but also for its semantically closest relatives (based upon co-occurrence statistics and random labels).
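The idea of indexing terms by co-occurrence statistics and random labels can be illustrated with a small stand-alone sketch. It is not the `moj.ri` implementation: the class `RandomIndexingSketch`, the ternary (+1/-1) label scheme, the way the seed is combined with the word, and the unweighted symmetric window are all assumptions made for illustration. Only the figures 1800, 8 and 3 are taken from the no-argument constructor description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Illustrative only: a toy random-indexing model, not the moj.ri.RandomIndex implementation. */
final class RandomIndexingSketch {
    static final int DIM = 1800;  // dimensionality named for the no-argument constructor
    static final int DEGREE = 8;  // number of non-zero entries per random label
    static final int WINDOW = 3;  // context words considered on each side

    private final Map<String, int[]> contexts = new HashMap<>();

    /** Sparse random label: DEGREE positions set to +1 or -1 (assumed scheme). */
    static int[] label(String word, int seed) {
        // Assumption: seed and word together seed the generator, so labels are reproducible.
        Random rnd = new Random(31L * seed + word.hashCode());
        int[] v = new int[DIM];
        int placed = 0;
        while (placed < DEGREE) {
            int pos = rnd.nextInt(DIM);
            if (v[pos] == 0) {
                v[pos] = rnd.nextBoolean() ? 1 : -1;
                placed++;
            }
        }
        return v;
    }

    /** Adds the labels of all neighbours inside the window to each focus word's context vector. */
    void addText(String[] tokens, int seed) {
        for (int i = 0; i < tokens.length; i++) {
            int[] ctx = contexts.computeIfAbsent(tokens[i], w -> new int[DIM]);
            int from = Math.max(0, i - WINDOW);
            int to = Math.min(tokens.length - 1, i + WINDOW);
            for (int j = from; j <= to; j++) {
                if (j == i) continue; // the focus word is not its own context
                int[] lab = label(tokens[j], seed);
                for (int d = 0; d < DIM; d++) ctx[d] += lab[d];
            }
        }
    }

    /** Cosine similarity between the context vectors of two indexed words (0.0 if either is unknown). */
    double similarity(String a, String b) {
        int[] x = contexts.get(a), y = contexts.get(b);
        if (x == null || y == null) return 0.0;
        double dot = 0, nx = 0, ny = 0;
        for (int d = 0; d < DIM; d++) {
            dot += (double) x[d] * y[d];
            nx += (double) x[d] * x[d];
            ny += (double) y[d] * y[d];
        }
        return (nx == 0 || ny == 0) ? 0.0 : dot / Math.sqrt(nx * ny);
    }

    public static void main(String[] args) {
        RandomIndexingSketch ri = new RandomIndexingSketch();
        ri.addText("the cat drinks milk".split(" "), 42);
        ri.addText("the dog drinks water".split(" "), 42);
        // Words that share context words end up with similar context vectors.
        System.out.println(ri.similarity("cat", "dog"));
    }
}
```

Words that occur in similar contexts accumulate similar sums of labels, which is what makes a "semantically closest relatives" query possible in a model of this kind.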
Constructor Summary | |
---|---|
RandomIndex()
Create a new RandomIndex of RandomLabels with a dimensionality of 1800, a degree of initial randomness of 8 and a window size for contextual updates of 3 (i.e. |
|
RandomIndex(int dimensionality,
int randomDegree,
int seed,
int leftWindowSize,
int rightWindowSize,
WeightingScheme weightingScheme)
Create a new RandomIndex of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates. |
Method Summary | |
---|---|
boolean |
addRandomLabel(RandomLabel label)
Add an existing RandomLabel to the RandomIndex. |
int |
addText(java.lang.String text)
Add text to the Random Index and contextually update all words in, and added to, the index; tokens are separated by whitespace. |
int |
addText(java.lang.String[] text)
Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state). |
int |
addText(java.lang.String text,
java.lang.String pattern)
Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state); tokens are separated according to the supplied pattern. |
int |
addVocabulary(java.util.Set<java.lang.String> vocabulary)
Adds words (index terms) to the restricted vocabulary. |
void |
addVocabulary(java.lang.String[] words)
Adds words (index terms) to the restricted vocabulary. |
boolean |
contains(RandomLabel label)
Returns true if this RandomIndex contains the given RandomLabel. |
boolean |
contains(java.lang.String word)
Returns true if this RandomIndex contains the given word (index term). |
java.util.Set<java.util.Map.Entry<java.lang.String,RandomLabel>> |
entrySet()
Returns a set view of the mappings contained in this RandomIndex. |
void |
finalize()
Shuts down the thread pool, used for scheduling thread (re)usage when adding texts to the RandomIndex, as soon as all threads updating the RandomIndex have finished running. |
boolean |
finishedUpdating()
Checks if all threads updating the RandomIndex have finished running; only then is the RandomIndex guaranteed to be up to date. |
boolean |
getAllLowerCase()
Gets the state of the RandomIndex denoting the wish to henceforth automagically convert all words (index terms) to all lower case. |
int |
getDimensionality()
Gets the dimensionality of RandomLabels in the RandomIndex. |
boolean |
getDocumentLabels()
Gets the state of the RandomIndex denoting the wish to henceforth index on document level rather than word context level. |
long |
getDocumentsIndexed()
Gets the number of documents indexed by the Random Index. |
int |
getLeftWindowSize()
Gets the size of the context window to the left used when updating context labels in RandomLabels in the RandomIndex. |
int |
getMaxNumThreads()
Returns the maximum number of concurrent threads reserved for RI when adding text to the index. |
int |
getRandomDegree()
Gets the random degree of RandomLabels in the RandomIndex. |
RandomLabel |
getRandomLabel(java.lang.String word)
Get the RandomLabel for the given word if it exists in the RandomIndex. |
int |
getRightWindowSize()
Gets the size of the context window to the right used when updating context labels in RandomLabels in the RandomIndex. |
int |
getSeed()
Gets the seed used to create RandomLabels in the RandomIndex. |
long |
getSleepFactor()
Returns the number of milliseconds a thread should sleep when trying to update a RandomLabel that is already undergoing updating by another thread. |
boolean |
getUnaryLabels()
Gets the state of the RandomIndex denoting the wish to henceforth use unary labels, rather than random. |
java.util.Set<java.lang.String> |
getVocabulary()
Gets the Set denoting the current restricted vocabulary. |
WeightingScheme |
getWeightingScheme()
Gets the name (class) of the WeightingScheme used by the RandomIndex when updating context labels in RandomLabels. |
long |
getWordsIndexed()
Gets the number of words indexed by the Random Index. |
boolean |
isEmpty()
Returns true if this RandomIndex contains no elements. |
boolean |
isPurged()
Note: Updating the RandomIndex after labels have been removed from the index may result in new labels being generated for tokens that have already been encountered. |
java.util.Set<java.lang.String> |
keySet()
Returns a set view of the keys contained in this RandomIndex. |
void |
pruneIndex(long minTF,
long maxTF)
Prunes RandomLabels with a term frequency lower than minTF or higher than maxTF by setting their context vectors to null. |
void |
pruneIndex(java.lang.String[] words)
Prunes the RandomLabels associated with the given words by setting their context vectors to null. |
int |
purgeIndex()
Removes the RandomLabels associated with tokens previously pruned. |
int |
purgeIndex(long minTF,
long maxTF)
Removes RandomLabels with a term frequency lower than minTF or higher than maxTF from the RandomIndex. |
int |
purgeIndex(java.lang.String[] words)
Removes the RandomLabels associated with the given words from the RandomIndex. |
void |
revokePurgedState()
Revoking the RandomIndex from purged state makes it possible to further add data to the index even though labels influencing the data have been removed from the index. |
boolean |
setAllLowerCase(boolean allLowerCase)
Sets the RandomIndex to henceforth automagically convert all words (index terms) to all lower case and returns the previous state. |
boolean |
setDocumentLabels(boolean documentLabels)
Sets the RandomIndex to henceforth index on document level instead of the usual word context level. |
int |
setMaxNumThreads(int maxNumThreads)
The maximum number of concurrent threads that should be reserved for RI for adding text to the index. |
long |
setSleepFactor(long millis)
The number of milliseconds a thread should sleep when trying to update a RandomLabel that is already undergoing updating by another thread. |
boolean |
setUnaryLabels(boolean unaryLabels)
Sets the RandomIndex to henceforth use unary labels, incremented from seed, and returns the previous state. |
java.util.Set<java.lang.String> |
setVocabulary(java.util.Set<java.lang.String> vocabulary)
Sets the restricted vocabulary to the supplied Set and returns the old vocabulary set. In order to reset the vocabulary you can supply an empty Set; the RandomLabels with pruned context vectors will however stay pruned (i.e. |
int |
size()
Returns the number of Random Labels in the Random Index. |
java.lang.String |
toString()
String representation of RandomIndex. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
public RandomIndex(int dimensionality, int randomDegree, int seed, int leftWindowSize, int rightWindowSize, WeightingScheme weightingScheme)
Create a new RandomIndex of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates.
Parameters:
dimensionality - The dimensionality all RandomLabels in the RandomIndex should have; this cannot be altered at a later stage.
randomDegree - The number of random values all RandomLabels in the RandomIndex initially should have.
seed - A seed for each label's random generator. This seed, in combination with the word, makes it very likely that the created random label is "unique" yet reproducible.
leftWindowSize - The maximum number of words behind the focus word to include in the context window when updating a label.
rightWindowSize - The maximum number of words in front of the focus word to include in the context window when updating a label.
weightingScheme - A visiting object defining the weighting of the context labels to the left and to the right of the focus word.
public RandomIndex()
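How `leftWindowSize`, `rightWindowSize` and a weighting of context positions could interact is sketched below. The `Weighting` interface and the inverse-distance scheme are hypothetical stand-ins; the page does not show the interface of the real `WeightingScheme` class, so nothing here should be read as its API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: selecting and weighting context positions around a focus word. */
final class ContextWindowSketch {
    /** Hypothetical stand-in for a weighting scheme: maps a signed offset from the focus word to a weight. */
    interface Weighting {
        double weight(int offset);
    }

    /** Example scheme (an assumption): the weight decays as 1/|offset|. */
    static final Weighting INVERSE_DISTANCE = offset -> 1.0 / Math.abs(offset);

    /** Lists "token@offset=weight" for every context position around tokens[focus]. */
    static List<String> context(String[] tokens, int focus, int left, int right, Weighting w) {
        List<String> out = new ArrayList<>();
        int from = Math.max(0, focus - left);                 // clip at the start of the text
        int to = Math.min(tokens.length - 1, focus + right);  // clip at the end of the text
        for (int j = from; j <= to; j++) {
            if (j == focus) continue;
            int offset = j - focus;
            out.add(tokens[j] + "@" + offset + "=" + w.weight(offset));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = "a b c d e f g".split(" ");
        // Focus word "d" (index 3), two words to the left, one to the right.
        System.out.println(context(tokens, 3, 2, 1, INVERSE_DISTANCE));
        // prints [b@-2=0.5, c@-1=1.0, e@1=1.0]
    }
}
```

Separate left and right sizes allow an asymmetric window, for example one that looks further back than ahead.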
Method Detail |
---|
public java.util.Set<java.lang.String> keySet()
Returns a set view of the keys contained in this RandomIndex. The set is backed by the RandomIndex, so changes to the RandomIndex are reflected in the set, and vice-versa. If the RandomIndex is modified while an iteration over the set is in progress, the results of the iteration are undefined.
Returns:
a set view of the keys contained in this RandomIndex.
public java.util.Set<java.util.Map.Entry<java.lang.String,RandomLabel>> entrySet()
Returns a set view of the mappings contained in this RandomIndex. Each element in the returned set is a Map.Entry. The set is backed by the RandomIndex, so changes to the RandomIndex are reflected in the set, and vice-versa. If the RandomIndex is modified while an iteration over the set is in progress, the results of the iteration are undefined.
Returns:
a set view of the mappings contained in this RandomIndex.
public boolean addRandomLabel(RandomLabel label)
Add an existing RandomLabel to the RandomIndex.
Parameters:
label - RandomLabel to be added.
Returns:
true upon success, else false.
public boolean setUnaryLabels(boolean unaryLabels)
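Sets the RandomIndex to henceforth use unary labels, incremented from seed, and returns the previous state.
Parameters:
unaryLabels - boolean value denoting the wish to henceforth use unary labels, rather than random, or not.
Returns:
true or false.
public boolean getUnaryLabels()
Gets the state of the RandomIndex denoting the wish to henceforth use unary labels, rather than random.
Returns:
true or false.
public boolean setDocumentLabels(boolean documentLabels)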
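Sets the RandomIndex to henceforth index on document level instead of the usual word context level.
Parameters:
documentLabels - boolean value denoting the wish to henceforth index on document level or not.
Returns:
true or false.
public boolean getDocumentLabels()
Gets the state of the RandomIndex denoting the wish to henceforth index on document level rather than word context level.
Returns:
true or false.
public boolean setAllLowerCase(boolean allLowerCase)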
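Sets the RandomIndex to henceforth automagically convert all words (index terms) to all lower case and returns the previous state.
Parameters:
allLowerCase - boolean value denoting the wish to have all index terms henceforth converted to lower case or not.
Returns:
true or false.
public boolean getAllLowerCase()
Gets the state of the RandomIndex denoting the wish to henceforth automagically convert all words (index terms) to all lower case.
Returns:
true or false.
public int getDimensionality()
Gets the dimensionality of RandomLabels in the RandomIndex.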
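public int getRandomDegree()
Gets the random degree of RandomLabels in the RandomIndex.
public int getSeed()
Gets the seed used to create RandomLabels in the RandomIndex.
public int getLeftWindowSize()
Gets the size of the context window to the left used when updating context labels in RandomLabels in the RandomIndex.
public int getRightWindowSize()
Gets the size of the context window to the right used when updating context labels in RandomLabels in the RandomIndex.
public WeightingScheme getWeightingScheme()
Gets the name (class) of the WeightingScheme used by the RandomIndex when updating context labels in RandomLabels.
public long getWordsIndexed()
Gets the number of words indexed by the Random Index.
public long getDocumentsIndexed()
Gets the number of documents indexed by the Random Index.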
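public int setMaxNumThreads(int maxNumThreads)
The maximum number of concurrent threads that should be reserved for RI for adding text to the index.
Parameters:
maxNumThreads - maximum number of threads reserved for RI
public int getMaxNumThreads()
Returns the maximum number of concurrent threads reserved for RI when adding text to the index.
public long setSleepFactor(long millis)
The number of milliseconds a thread should sleep when trying to update a RandomLabel that is already undergoing updating by another thread.
Parameters:
millis - the new sleep factor
public long getSleepFactor()
Returns the number of milliseconds a thread should sleep when trying to update a RandomLabel that is already undergoing updating by another thread.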
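The sleep factor describes a thread backing off while another thread is updating the same label. One way such a retry loop might look is sketched below with `java.util.concurrent.locks.ReentrantLock`. The class `SleepFactorSketch` and the use of `tryLock` are assumptions; the page does not say how `RandomIndex` actually guards its labels.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only: retrying an update with a fixed sleep while another thread holds the lock. */
final class SleepFactorSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final long sleepMillis; // plays the role of the sleep factor
    private long value;             // stands in for a label's context vector

    SleepFactorSketch(long sleepMillis) {
        this.sleepMillis = sleepMillis;
    }

    /** Retries until the lock is free, sleeping sleepMillis between attempts. */
    void update(long delta) {
        try {
            while (!lock.tryLock()) {
                Thread.sleep(sleepMillis); // the label is busy: back off and retry
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return; // give up on interruption; the lock is not held here
        }
        try {
            value += delta;
        } finally {
            lock.unlock();
        }
    }

    long value() {
        lock.lock();
        try {
            return value;
        } finally {
            lock.unlock();
        }
    }

    /** Runs the given number of threads, each applying `updates` increments, and returns the total. */
    static long runDemo(int threads, int updates) {
        SleepFactorSketch label = new SleepFactorSketch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int k = 0; k < updates; k++) label.update(1);
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return label.value();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(2, 1000)); // no update is lost, so this prints 2000
    }
}
```

A larger sleep value reduces busy retrying under contention at the cost of latency for each blocked update.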
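public boolean finishedUpdating()
Checks if all threads updating the RandomIndex have finished running; only then is the RandomIndex guaranteed to be up to date.
true before finalize() has been called.
Returns:
true if all threads have finished running, otherwise false
public void finalize()
Shuts down the thread pool, used for scheduling thread (re)usage when adding texts to the RandomIndex, as soon as all threads updating the RandomIndex have finished running.
Overrides:
finalize in class java.lang.Object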
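The relation between shutting down the pool and checking for completion can be illustrated with the standard `java.util.concurrent.ExecutorService`. That `RandomIndex` uses this class is an assumption, since the page only speaks of "the thread pool". `UpdatePoolSketch` and its methods are hypothetical names that mirror `addText`, `finalize` and `finishedUpdating`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: asynchronous updates on a pool, then shutdown and a completion check. */
final class UpdatePoolSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final AtomicLong wordsIndexed = new AtomicLong();

    /** Queues one text for indexing on the pool (the "indexing" here only counts tokens). */
    void addText(String[] tokens) {
        pool.execute(() -> wordsIndexed.addAndGet(tokens.length));
    }

    /** Stops accepting new texts; already queued updates still run to completion. */
    void shutDown() {
        pool.shutdown();
    }

    /** True only after shutDown() was called and every queued update has completed. */
    boolean finishedUpdating() {
        return pool.isTerminated();
    }

    /** Blocks up to the given time for the pool to terminate; returns false on timeout or interrupt. */
    boolean awaitUpdates(long millis) {
        try {
            return pool.awaitTermination(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    long wordsIndexed() {
        return wordsIndexed.get();
    }

    public static void main(String[] args) {
        UpdatePoolSketch index = new UpdatePoolSketch();
        index.addText(new String[] {"a", "b", "c"});
        System.out.println(index.finishedUpdating()); // false: the pool has not been shut down
        index.shutDown();
        index.awaitUpdates(5_000);
        System.out.println(index.finishedUpdating() + " " + index.wordsIndexed());
    }
}
```

With an `ExecutorService`, `isTerminated()` is never true unless `shutdown()` or `shutdownNow()` was called first, so reading counters before that point may observe a partially updated index.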
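public int addText(java.lang.String[] text)
Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state).
Parameters:
text - a text tokenized on word level, where each element of text contains one token.
public int addText(java.lang.String text, java.lang.String pattern)
Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state); tokens are separated according to the supplied pattern.
Parameters:
text - a text that is to be tokenized according to the supplied pattern.
pattern - a pattern according to which the text string is to be split into tokens.
public int addText(java.lang.String text)
Add text to the Random Index and contextually update all words in, and added to, the index; tokens are separated by whitespace.
Parameters:
text - a text that is to be tokenized according to the default pattern, which is "\\s".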
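The default pattern `"\\s"` matches a single whitespace character. If the text is split with `String.split` (an assumption; the page does not name the mechanism), consecutive whitespace then yields empty tokens. The sketch below shows the difference from `"\\s+"`; whether `RandomIndex` discards empty tokens is not stated here.

```java
import java.util.Arrays;

/** Illustrative only: how the choice of split pattern affects the resulting tokens. */
final class TokenizeSketch {
    /** Splits the text on the given regular expression, as String.split does. */
    static String[] tokenize(String text, String pattern) {
        return text.split(pattern);
    }

    public static void main(String[] args) {
        String text = "to be  or\tnot"; // note the double space between "be" and "or"
        System.out.println(Arrays.toString(tokenize(text, "\\s")));  // [to, be, , or, not]
        System.out.println(Arrays.toString(tokenize(text, "\\s+"))); // [to, be, or, not]
    }
}
```

Passing `"\\s+"` to the two-argument `addText` would be one way to avoid empty tokens in texts with irregular spacing.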
public RandomLabel getRandomLabel(java.lang.String word)
Get the RandomLabel for the given word if it exists in the RandomIndex. If the RandomIndex does not contain a corresponding RandomLabel, return a zero sized RandomLabel.
Parameters:
word - the word in the RandomIndex that we want the corresponding RandomLabel for.
Returns:
RandomLabel corresponding to the given word if it exists in the RandomIndex; if not, a zero sized RandomLabel is returned.
public java.lang.String toString()
String representation of RandomIndex.
Overrides:
toString in class java.lang.Object
public boolean isEmpty()
Returns true if this RandomIndex contains no elements.
Returns:
true if this RandomIndex contains no elements, otherwise false.
public int size()
Returns the number of Random Labels in the Random Index.
public boolean contains(java.lang.String word)
Returns true if this RandomIndex contains the given word (index term).
Parameters:
word - The word whose existence in the RandomIndex is to be determined.
Returns:
true if this RandomIndex contains the word, otherwise false.
public boolean contains(RandomLabel label)
Returns true if this RandomIndex contains the given RandomLabel.
Parameters:
label - The RandomLabel whose existence in the RandomIndex is to be determined.
Returns:
true if this RandomIndex contains the RandomLabel, otherwise false.
public void pruneIndex(java.lang.String[] words)
Prunes the RandomLabels associated with the given words by setting their context vectors to null. This can be used to reduce memory use and file size if there are a lot of tokens that are only interesting as context to other tokens, and not as actual index terms themselves.
Parameters:
words - a String array of words whose RandomLabels are to be pruned.
public void pruneIndex(long minTF, long maxTF)
Prunes RandomLabels with a term frequency lower than minTF or higher than maxTF by setting their context vectors to null. This can be used to reduce memory use and file size if there are a lot of tokens that are only interesting as context to other tokens, and not as actual index terms themselves.
Parameters:
minTF - lower frequency threshold
maxTF - higher frequency threshold
public int purgeIndex()
Removes the RandomLabels associated with tokens previously pruned.
Returns:
the number of RandomLabels removed
public int purgeIndex(java.lang.String[] words)
Removes the RandomLabels associated with the given words from the RandomIndex.
Parameters:
words - a String array of words whose RandomLabels are to be removed.
Returns:
the number of RandomLabels removed
public int purgeIndex(long minTF, long maxTF)
Removes RandomLabels with a term frequency lower than minTF or higher than maxTF from the RandomIndex.
Parameters:
minTF - lower frequency threshold
maxTF - higher frequency threshold
Returns:
the number of RandomLabels removed
public boolean isPurged()
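Note: Updating the RandomIndex after labels have been removed from the index may result in new labels being generated for tokens that have already been encountered.
Returns:
true if labels have been removed from the index, if not false
public void revokePurgedState()
Revoking the RandomIndex from purged state makes it possible to further add data to the index even though labels influencing the data have been removed from the index.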
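The difference between pruning (context vector set to null, entry kept) and purging (entry removed, index marked as purged) can be sketched with a plain map. `PrunePurgeSketch` and its `Label` record are hypothetical, and so is the rule that the purged flag is set only when at least one label was actually removed.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Illustrative only: pruning keeps an entry without its context vector, purging removes it. */
final class PrunePurgeSketch {
    /** Hypothetical label record: a term frequency plus a context vector that may be pruned to null. */
    static final class Label {
        final long tf;
        int[] context;

        Label(long tf) {
            this.tf = tf;
            this.context = new int[4];
        }
    }

    private final Map<String, Label> index = new HashMap<>();
    private boolean purged;

    void put(String word, long tf) {
        index.put(word, new Label(tf));
    }

    /** Prune: keep the entry (it can still act as context for others) but drop its context vector. */
    void pruneIndex(long minTF, long maxTF) {
        for (Label l : index.values()) {
            if (l.tf < minTF || l.tf > maxTF) l.context = null;
        }
    }

    /** Purge: remove previously pruned entries entirely and return how many were removed. */
    int purgeIndex() {
        int removed = 0;
        for (Iterator<Label> it = index.values().iterator(); it.hasNext();) {
            if (it.next().context == null) {
                it.remove();
                removed++;
            }
        }
        if (removed > 0) purged = true; // assumption: the flag follows actual removals
        return removed;
    }

    boolean isPurged() { return purged; }

    void revokePurgedState() { purged = false; }

    int size() { return index.size(); }

    public static void main(String[] args) {
        PrunePurgeSketch ri = new PrunePurgeSketch();
        ri.put("the", 1000);
        ri.put("cat", 12);
        ri.put("zyx", 1);
        ri.pruneIndex(2, 500);               // "the" and "zyx" lose their context vectors
        System.out.println(ri.size());       // 3: pruned entries are still present
        System.out.println(ri.purgeIndex()); // 2: pruned entries are removed
        System.out.println(ri.isPurged());   // true
    }
}
```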
public void addVocabulary(java.lang.String[] words)
Adds words (index terms) to the restricted vocabulary.
RandomLabels with pruned context vectors (i.e. contextvector = null).
Parameters:
words - String array with words to be added
public int addVocabulary(java.util.Set<java.lang.String> vocabulary)
Adds words (index terms) to the restricted vocabulary.
RandomLabels with pruned context vectors (i.e. contextvector = null).
Parameters:
vocabulary - Set with words to be added
public java.util.Set<java.lang.String> setVocabulary(java.util.Set<java.lang.String> vocabulary)
Sets the restricted vocabulary to the supplied Set and returns the old vocabulary set. In order to reset the vocabulary you can supply an empty Set; the RandomLabels with pruned context vectors will however stay pruned (i.e. contextvector = null).
Parameters:
vocabulary - Set of words (index terms) denoting the desired restricted vocabulary
Returns:
the old vocabulary Set (possibly empty)
public java.util.Set<java.lang.String> getVocabulary()
Gets the Set denoting the current restricted vocabulary.
Returns:
the Set denoting the current restricted vocabulary (possibly empty)