moj.ri
Class RandomIndex

java.lang.Object
  extended by moj.ri.RandomIndex
All Implemented Interfaces:
java.io.Serializable
Direct Known Subclasses:
SparseDistributedMemory

public class RandomIndex
extends java.lang.Object
implements java.io.Serializable

A Random Index is an index that contextually indexes the texts that are fed to it. Not only is it possible to ask the RandomIndex for the term frequency for a specific index term, but also for its semantically closest relatives (based upon co-occurrence statistics and random labels).

Version:
2008-Dec-12
Author:
Martin Hassel
See Also:
Serialized Form

Constructor Summary
RandomIndex()
          Create a new RandomIndex of RandomLabels with a dimensionality of 1800, a degree of initial randomness of 8 and a window size for contextual updates of 3 (i.e. three words look-behind and look-ahead).
RandomIndex(int dimensionality, int randomDegree, int seed, int leftWindowSize, int rightWindowSize, WeightingScheme weightingScheme)
          Create a new RandomIndex of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates.
 
Method Summary
 boolean addRandomLabel(RandomLabel label)
          Add an existing RandomLabel to the RandomIndex.
 int addText(java.lang.String text)
          Add text to the Random Index and contextually update all words in, and added to, the index. Tokens are separated by white-space.
 int addText(java.lang.String[] text)
          Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state).
 int addText(java.lang.String text, java.lang.String pattern)
          Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state). Tokens are separated according to the supplied pattern.
 int addVocabulary(java.util.Set<java.lang.String> vocabulary)
          Adds words (index terms) to the restricted vocabulary.
 void addVocabulary(java.lang.String[] words)
          Adds words (index terms) to the restricted vocabulary.
 boolean contains(RandomLabel label)
          Returns true if this RandomIndex contains the given RandomLabel.
 boolean contains(java.lang.String word)
          Returns true if this RandomIndex contains the given word (index term).
 java.util.Set<java.util.Map.Entry<java.lang.String,RandomLabel>> entrySet()
          Returns a set view of the mappings contained in this RandomIndex.
 void finalize()
          Shuts down the thread pool, used for scheduling thread (re)usage when adding texts to the RandomIndex, as soon as all threads updating the RandomIndex have finished running.
 boolean finishedUpdating()
          Checks if all threads updating the RandomIndex have finished running; only then is the RandomIndex guaranteed to be up to date.
 boolean getAllLowerCase()
          Gets the state of the RandomIndex denoting the wish to henceforth automagically convert all words (index terms) to all lower case.
 int getDimensionality()
          Gets the dimensionality of RandomLabels in the RandomIndex.
 boolean getDocumentLabels()
          Gets the state of the RandomIndex denoting the wish to henceforth index on document level rather than word context level.
 long getDocumentsIndexed()
          Gets the number of documents indexed by the Random Index.
 int getLeftWindowSize()
          Gets the size of the context window to the left used when updating context labels in RandomLabels in the RandomIndex.
 int getMaxNumThreads()
          Returns the maximum number of concurrent threads reserved for RI when adding text to the index.
 int getRandomDegree()
          Gets the random degree of RandomLabels in the RandomIndex.
 RandomLabel getRandomLabel(java.lang.String word)
          Get the RandomLabel for the given word if it exists in the RandomIndex.
 int getRightWindowSize()
          Gets the size of the context window to the right used when updating context labels in RandomLabels in the RandomIndex.
 int getSeed()
          Gets the seed used to create RandomLabels in the RandomIndex.
 long getSleepFactor()
          Returns the number of milliseconds a thread should sleep when trying to update a RandomLabel that is already undergoing updating by another thread.
 boolean getUnaryLabels()
          Gets the state of the RandomIndex denoting the wish to henceforth use unary labels, rather than random.
 java.util.Set<java.lang.String> getVocabulary()
          Gets the Set denoting the current restricted vocabulary.
 WeightingScheme getWeightingScheme()
          Gets the name (class) of the WeightingScheme used by the RandomIndex when updating context labels in RandomLabels in the RandomIndex.
 long getWordsIndexed()
          Gets the number of words indexed by the Random Index.
 boolean isEmpty()
          Returns true if this RandomIndex contains no elements.
 boolean isPurged()
          Checks whether the RandomIndex is in a purged state, i.e. whether labels have been removed from the index.
 java.util.Set<java.lang.String> keySet()
          Returns a set view of the keys contained in this RandomIndex.
 void pruneIndex(long minTF, long maxTF)
          Prunes RandomLabels with a term frequency lower than minTF or higher than maxTF by setting their context vectors to null.
 void pruneIndex(java.lang.String[] words)
          Prunes the RandomLabels associated with the given words by setting their context vectors to null.
 int purgeIndex()
          Removes the RandomLabels associated with previously pruned tokens.
 int purgeIndex(long minTF, long maxTF)
          Removes RandomLabels with a term frequency lower than minTF or higher than maxTF from the RandomIndex.
 int purgeIndex(java.lang.String[] words)
          Removes the RandomLabels associated with the given words from the RandomIndex.
 void revokePurgedState()
          Revoking the RandomIndex from purged state makes it possible to further add data to the index even though labels influencing the data have been removed from the index.
 boolean setAllLowerCase(boolean allLowerCase)
          Sets the RandomIndex to henceforth automagically convert all words (index terms) to all lower case and returns the previous state.
 boolean setDocumentLabels(boolean documentLabels)
          Sets the RandomIndex to henceforth index on document level instead of the usual word context level.
 int setMaxNumThreads(int maxNumThreads)
          Sets the maximum number of concurrent threads that should be reserved for RI when adding text to the index and returns the previous value.
 long setSleepFactor(long millis)
          Sets the number of milliseconds a thread should sleep when trying to update a RandomLabel that is already being updated by another thread and returns the previous value.
 boolean setUnaryLabels(boolean unaryLabels)
          Sets the RandomIndex to henceforth use unary labels, incremented from seed, and returns the previous state.
 java.util.Set<java.lang.String> setVocabulary(java.util.Set<java.lang.String> vocabulary)
          Sets the restricted vocabulary to the supplied Set and returns the old vocabulary set.
In order to reset the vocabulary you can supply an empty Set; the RandomLabels with pruned context vectors will, however, stay pruned (i.e. contextvector = null).
 int size()
          Returns the number of Random Labels in the Random Index.
 java.lang.String toString()
          String representation of RandomIndex.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

RandomIndex

public RandomIndex(int dimensionality,
                   int randomDegree,
                   int seed,
                   int leftWindowSize,
                   int rightWindowSize,
                   WeightingScheme weightingScheme)
Create a new RandomIndex of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates.

Parameters:
dimensionality - The dimensionality all RandomLabels in the RandomIndex should have; this cannot be altered at a later stage.
randomDegree - The number of random values all RandomLabels in the RandomIndex initially should have.
seed - a seed for each label's random generator. This seed, in combination with the word, makes it very likely that the created random label is "unique" yet reproducible.
leftWindowSize - The maximum number of words behind the focus word to include in the context window when updating a label.
rightWindowSize - The maximum number of words in front of the focus word to include in the context window when updating a label.
weightingScheme - a visiting object defining the weighting of the context labels to the left and to the right of the focus word.
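
The parameters dimensionality, randomDegree and seed can be illustrated with a minimal sketch of how a reproducible sparse label might be generated. This is not the moj.ri implementation: the class RandomLabelSketch, the use of +1/-1 values and the way the seed is mixed with the word's hash code are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch; not the moj.ri implementation. */
final class RandomLabelSketch {

    /**
     * Builds a vector of the given dimensionality with randomDegree
     * non-zero entries, alternating +1 and -1.
     * Assumes 0 <= randomDegree <= dimensionality.
     */
    static int[] randomLabel(String word, int dimensionality, int randomDegree, int seed) {
        // Assumption: seed and word are mixed into a single generator seed,
        // so the same (seed, word) pair always yields the same label.
        Random rng = new Random(31L * seed + word.hashCode());
        int[] label = new int[dimensionality];
        int placed = 0;
        while (placed < randomDegree) {
            int position = rng.nextInt(dimensionality);
            if (label[position] == 0) {              // skip positions already used
                label[position] = (placed % 2 == 0) ? 1 : -1;
                placed++;
            }
        }
        return label;
    }

    public static void main(String[] args) {
        int[] first = randomLabel("index", 1800, 8, 123);
        int[] second = randomLabel("index", 1800, 8, 123);
        // Same word and seed: the two labels are identical.
        System.out.println(Arrays.equals(first, second));
    }
}
```

The values 1800, 8 and 123 are the defaults documented for the no-argument constructor.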

RandomIndex

public RandomIndex()
Create a new RandomIndex of RandomLabels with a dimensionality of 1800, a degree of initial randomness of 8 and a window size for contextual updates of 3 (i.e. three words look-behind and look-ahead). The seed for randomisation is set to 123.

Method Detail

keySet

public java.util.Set<java.lang.String> keySet()
Returns a set view of the keys contained in this RandomIndex. The set is backed by the RandomIndex, so changes to the RandomIndex are reflected in the set, and vice-versa. If the RandomIndex is modified while an iteration over the set is in progress, the results of the iteration are undefined.

Returns:
a set view of the keys contained in this RandomIndex.
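
The "backed by" behaviour described above matches that of java.util.Map views. The sketch below uses a plain HashMap as a stand-in for the RandomIndex (an assumption; the internal storage of RandomIndex is not documented here) to show that a key set obtained earlier reflects later additions, and that copying the set gives a stable snapshot to iterate over.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch using HashMap as a stand-in for the index. */
final class BackedViewSketch {

    /** Size seen through a key-set view that was taken before the last insertion. */
    static int sizeSeenThroughEarlierView() {
        Map<String, int[]> index = new HashMap<>();
        index.put("random", new int[4]);
        Set<String> keys = index.keySet();      // a live view, not a copy
        index.put("index", new int[4]);         // added after the view was taken
        return keys.size();                     // the view sees both keys
    }

    /** Size of a snapshot that was copied before the last insertion. */
    static int sizeOfEarlierSnapshot() {
        Map<String, int[]> index = new HashMap<>();
        index.put("random", new int[4]);
        List<String> snapshot = new ArrayList<>(index.keySet()); // independent copy
        index.put("index", new int[4]);
        return snapshot.size();                 // the copy is unaffected
    }

    public static void main(String[] args) {
        System.out.println(sizeSeenThroughEarlierView()); // 2
        System.out.println(sizeOfEarlierSnapshot());      // 1
    }
}
```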

entrySet

public java.util.Set<java.util.Map.Entry<java.lang.String,RandomLabel>> entrySet()
Returns a set view of the mappings contained in this RandomIndex. Each element in the returned set is a Map.Entry. The set is backed by the RandomIndex, so changes to the RandomIndex are reflected in the set, and vice-versa. If the RandomIndex is modified while an iteration over the set is in progress, the results of the iteration are undefined.

Returns:
a set view of the mappings contained in this RandomIndex.

addRandomLabel

public boolean addRandomLabel(RandomLabel label)
Add an existing RandomLabel to the RandomIndex. Will only succeed if the word (index term) associated with the label does not yet exist in the RandomIndex, the label has the same dimensionality as is set for the RandomIndex and the RandomIndex is not in a purged state.

Parameters:
label - RandomLabel to be added.
Returns:
true upon success, else false.

setUnaryLabels

public boolean setUnaryLabels(boolean unaryLabels)
Sets the RandomIndex to henceforth use unary labels, incremented from seed, and returns the previous state. Index terms already in the index will not be changed.

Parameters:
unaryLabels - boolean value denoting the wish to henceforth use unary labels, rather than random, or not.
Returns:
the previous state, i.e. either true or false.

getUnaryLabels

public boolean getUnaryLabels()
Gets the state of the RandomIndex denoting the wish to henceforth use unary labels, rather than random.

Returns:
the current state, i.e. either true or false.

setDocumentLabels

public boolean setDocumentLabels(boolean documentLabels)
Sets the RandomIndex to henceforth index on document level instead of the usual word context level. Index terms already in the index will not be changed.

Parameters:
documentLabels - boolean value denoting the wish to henceforth index on document level or not.
Returns:
the previous state, i.e. either true or false.

getDocumentLabels

public boolean getDocumentLabels()
Gets the state of the RandomIndex denoting the wish to henceforth index on document level rather than word context level.

Returns:
the current state, i.e. either true or false.

setAllLowerCase

public boolean setAllLowerCase(boolean allLowerCase)
Sets the RandomIndex to henceforth automagically convert all words (index terms) to all lower case and returns the previous state. Index terms already in the index will not be changed.

Parameters:
allLowerCase - boolean value denoting the wish to have all index terms henceforth converted to lower case or not.
Returns:
the previous state, i.e. either true or false.
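
Because the setters return the previous state, a setting can be changed temporarily and restored afterwards. The following is a minimal stand-alone sketch of that pattern; PreviousStateSketch is a hypothetical class that only mirrors the documented signatures of setAllLowerCase and getAllLowerCase.

```java
/** Hypothetical stand-in that mirrors the documented setter/getter pair. */
final class PreviousStateSketch {
    private boolean allLowerCase = false;

    /** Stores the new state and returns the state it replaced. */
    boolean setAllLowerCase(boolean allLowerCase) {
        boolean previous = this.allLowerCase;
        this.allLowerCase = allLowerCase;
        return previous;
    }

    boolean getAllLowerCase() {
        return allLowerCase;
    }

    public static void main(String[] args) {
        PreviousStateSketch index = new PreviousStateSketch();
        boolean saved = index.setAllLowerCase(true);   // switch on, remember the old state
        // ... add the texts that should be lower-cased here ...
        index.setAllLowerCase(saved);                  // restore the old state
        System.out.println(index.getAllLowerCase());
    }
}
```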

getAllLowerCase

public boolean getAllLowerCase()
Gets the state of the RandomIndex denoting the wish to henceforth automagically convert all words (index terms) to all lower case.

Returns:
the current state, i.e. either true or false.

getDimensionality

public int getDimensionality()
Gets the dimensionality of RandomLabels in the RandomIndex.

Returns:
the dimensionality of RandomLabels in the RandomIndex.

getRandomDegree

public int getRandomDegree()
Gets the random degree of RandomLabels in the RandomIndex.

Returns:
the random degree of RandomLabels in the RandomIndex.

getSeed

public int getSeed()
Gets the seed used to create RandomLabels in the RandomIndex.

Returns:
the seed used to create RandomLabels in the RandomIndex.

getLeftWindowSize

public int getLeftWindowSize()
Gets the size of the context window to the left used when updating context labels in RandomLabels in the RandomIndex.

Returns:
the size of the context window to the left.

getRightWindowSize

public int getRightWindowSize()
Gets the size of the context window to the right used when updating context labels in RandomLabels in the RandomIndex.

Returns:
the size of the context window to the right.

getWeightingScheme

public WeightingScheme getWeightingScheme()
Gets the name (class) of the WeightingScheme used by the RandomIndex when updating context labels in RandomLabels in the RandomIndex.

Returns:
the name (class) of the WeightingScheme used by the RandomIndex.

getWordsIndexed

public long getWordsIndexed()
Gets the number of words indexed by the Random Index.

Returns:
the number of words indexed by the Random Index.

getDocumentsIndexed

public long getDocumentsIndexed()
Gets the number of documents indexed by the Random Index.

Returns:
the number of documents indexed by the Random Index.

setMaxNumThreads

public int setMaxNumThreads(int maxNumThreads)
Sets the maximum number of concurrent threads that should be reserved for RI when adding text to the index and returns the previous value.

Parameters:
maxNumThreads - maximum number of threads reserved for RI
Returns:
the previously set maximum number of threads reserved for RI

getMaxNumThreads

public int getMaxNumThreads()
Returns the maximum number of concurrent threads reserved for RI when adding text to the index.

Returns:
the maximum number of threads currently reserved for RI

setSleepFactor

public long setSleepFactor(long millis)
Sets the number of milliseconds a thread should sleep when trying to update a RandomLabel that is already being updated by another thread and returns the previous value.

Parameters:
millis - the new sleep factor
Returns:
the previously set sleep factor

getSleepFactor

public long getSleepFactor()
Returns the number of milliseconds a thread should sleep when trying to update a RandomLabel that is already undergoing updating by another thread.

Returns:
the current sleep factor

finishedUpdating

public boolean finishedUpdating()
Checks if all threads updating the RandomIndex have finished running; only then is the RandomIndex guaranteed to be up to date. This will never return true before finalize() has been called.

Returns:
true if all threads have finished running, otherwise false

finalize

public void finalize()
Shuts down the thread pool, used for scheduling thread (re)usage when adding texts to the RandomIndex, as soon as all threads updating the RandomIndex have finished running. This should be called, for example, before saving the index.

Overrides:
finalize in class java.lang.Object
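
The relationship between finalize() and finishedUpdating() resembles that between shutdown() and isTerminated() on a java.util.concurrent.ExecutorService, where isTerminated() is never true unless shutdown was called first. The sketch below shows that pattern with a plain ExecutorService. Whether RandomIndex uses an ExecutorService internally is an assumption, and ShutdownSketch is a hypothetical class.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of "shut down, then poll until finished". */
final class ShutdownSketch {

    /** Runs the given number of update tasks and returns how many completed. */
    static int runAndWait(int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger updates = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> updates.incrementAndGet()); // stands in for one index update
        }
        pool.shutdown();                    // accept no new tasks; queued tasks still run
        try {
            while (!pool.isTerminated()) {  // analogous to polling finishedUpdating()
                pool.awaitTermination(10, TimeUnit.MILLISECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return updates.get();               // safe to read or save results now
    }

    public static void main(String[] args) {
        System.out.println(runAndWait(100));
    }
}
```

With the real class, the corresponding sequence would presumably be to call finalize() after the last addText call and then wait until finishedUpdating() returns true before saving the index.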

addText

public int addText(java.lang.String[] text)
Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state). The words updated/added are contextually "coloured" according to the set window size and weighting scheme.

Parameters:
text - a text tokenized on word level where each element in the list text contains one token.
Returns:
number of words (or index terms) added/updated.
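
The contextual "colouring" can be sketched as follows: for every focus word, the labels of the words inside the left and right windows are added to the focus word's context vector. This is a simplified illustration, not the moj.ri code. The class ContextWindowSketch, the toy two-entry labels and the constant weight of 1 (in place of a WeightingScheme) are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch of window-based context updates. */
final class ContextWindowSketch {

    /** Toy label: one +1 and one -1 at two distinct positions. Requires dim >= 2. */
    static int[] label(String word, int dim) {
        Random rng = new Random(word.hashCode());
        int plus = rng.nextInt(dim);
        int minus = (plus + 1 + rng.nextInt(dim - 1)) % dim; // never equals plus
        int[] v = new int[dim];
        v[plus] = 1;
        v[minus] = -1;
        return v;
    }

    /** Adds the labels of all neighbours inside the window to each focus word. */
    static Map<String, int[]> index(String[] tokens, int left, int right, int dim) {
        Map<String, int[]> context = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            int[] focus = context.computeIfAbsent(tokens[i], w -> new int[dim]);
            int from = Math.max(0, i - left);                  // words behind the focus word
            int to = Math.min(tokens.length - 1, i + right);   // words in front of it
            for (int j = from; j <= to; j++) {
                if (j == i) {
                    continue;                                  // skip the focus word itself
                }
                int[] neighbour = label(tokens[j], dim);
                for (int d = 0; d < dim; d++) {
                    focus[d] += neighbour[d];                  // assumed constant weight of 1
                }
            }
        }
        return context;
    }

    public static void main(String[] args) {
        String[] text = {"a", "random", "index", "indexes", "texts"};
        Map<String, int[]> context = index(text, 3, 3, 16);
        System.out.println(Arrays.toString(context.get("index")));
    }
}
```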

addText

public int addText(java.lang.String text,
                   java.lang.String pattern)
Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state). Tokens are separated according to the supplied pattern. The words updated/added are contextually "coloured" according to the set window size and weighting scheme.

Parameters:
text - a text that is to be tokenized according to the supplied pattern.
pattern - a pattern according to which the text string is to be split into tokens.
Returns:
number of words (index terms) added/updated.

addText

public int addText(java.lang.String text)
Add text to the Random Index and contextually update all words in, and added to, the index. Tokens are separated by white-space. The words updated/added are contextually "coloured" according to the set window size and weighting scheme.

Parameters:
text - a text that is to be tokenized according to the default pattern which is "\\s".
Returns:
number of words (index terms) added/updated.
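
The default pattern "\\s" matches a single white-space character. Assuming the pattern is applied as a standard Java regular expression split (an assumption; this page does not show the tokenizer), consecutive white-space characters produce empty tokens, whereas "\\s+" does not. TokenizeSketch is a hypothetical helper that makes the difference visible.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical sketch of pattern-based tokenization. */
final class TokenizeSketch {

    /** Splits the text around matches of the given regular expression. */
    static String[] tokenize(String text, String pattern) {
        return Pattern.compile(pattern).split(text);
    }

    public static void main(String[] args) {
        String text = "random  index";  // two spaces between the words
        // Splitting on a single white-space character leaves an empty token in the middle.
        System.out.println(Arrays.toString(tokenize(text, "\\s")));
        // Splitting on runs of white-space yields only the two words.
        System.out.println(Arrays.toString(tokenize(text, "\\s+")));
    }
}
```

Whether the library discards such empty tokens is not stated here, so passing an explicit pattern such as "\\s+" to addText(String, String) may be the safer choice for texts with irregular spacing.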

getRandomLabel

public RandomLabel getRandomLabel(java.lang.String word)
Get the RandomLabel for the given word if it exists in the RandomIndex. If the RandomIndex does not contain a corresponding RandomLabel, a zero-sized RandomLabel is returned.

Parameters:
word - the word in the RandomIndex that we want the corresponding RandomLabel for.
Returns:
the RandomLabel corresponding to the given word if it exists in the RandomIndex; otherwise a zero-sized RandomLabel.

toString

public java.lang.String toString()
String representation of RandomIndex.

Overrides:
toString in class java.lang.Object
Returns:
a string containing string representations of RandomLabels where each label is separated by a newline (the string also ends in newline).

isEmpty

public boolean isEmpty()
Returns true if this RandomIndex contains no elements.

Returns:
true if this RandomIndex contains no elements, otherwise false.

size

public int size()
Returns the number of Random Labels in the Random Index.

Returns:
the number of Random Labels in the Random Index.

contains

public boolean contains(java.lang.String word)
Returns true if this RandomIndex contains the given word (index term).

Parameters:
word - The word whose existence in the RandomIndex is to be determined.
Returns:
true if this RandomIndex contains the word, otherwise false.

contains

public boolean contains(RandomLabel label)
Returns true if this RandomIndex contains the given RandomLabel.

Parameters:
label - The RandomLabel whose existence in the RandomIndex is to be determined.
Returns:
true if this RandomIndex contains the RandomLabel, otherwise false.

pruneIndex

public void pruneIndex(java.lang.String[] words)
Prunes the RandomLabels associated with the given words by setting their context vectors to null. This can be used to reduce memory usage and file size if there are a lot of tokens that are only interesting as context to other tokens, and not as actual index terms themselves.

Parameters:
words - a string vector of words whose RandomLabels are to be pruned.

pruneIndex

public void pruneIndex(long minTF,
                       long maxTF)
Prunes RandomLabels with a term frequency lower than minTF or higher than maxTF by setting their context vectors to null. This can be used to reduce memory usage and file size if there are a lot of tokens that are only interesting as context to other tokens, and not as actual index terms themselves.

Parameters:
minTF - lower frequency threshold
maxTF - higher frequency threshold

purgeIndex

public int purgeIndex()
Removes the RandomLabels associated with previously pruned tokens.

Returns:
the number of RandomLabels removed

purgeIndex

public int purgeIndex(java.lang.String[] words)
Removes the RandomLabels associated with the given words from the RandomIndex.

Parameters:
words - a string vector of words whose RandomLabels are to be removed.
Returns:
the number of RandomLabels removed

purgeIndex

public int purgeIndex(long minTF,
                      long maxTF)
Removes RandomLabels with a term frequency lower than minTF or higher than maxTF from the RandomIndex.

Parameters:
minTF - lower frequency threshold
maxTF - higher frequency threshold
Returns:
the number of RandomLabels removed
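
The difference between pruning and purging can be sketched on a plain map: pruning keeps the entry but drops its context vector, while purging removes the entry altogether. PruneVsPurgeSketch and its nested Label class are hypothetical stand-ins, not the moj.ri types.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch contrasting pruning with purging. */
final class PruneVsPurgeSketch {

    /** Minimal stand-in for a label with a term frequency and a context vector. */
    static final class Label {
        final long termFrequency;
        int[] contextVector;

        Label(long termFrequency, int dim) {
            this.termFrequency = termFrequency;
            this.contextVector = new int[dim];
        }
    }

    static boolean outside(Label label, long minTF, long maxTF) {
        return label.termFrequency < minTF || label.termFrequency > maxTF;
    }

    /** Pruning: the label stays in the index, only its context vector is dropped. */
    static void prune(Map<String, Label> index, long minTF, long maxTF) {
        for (Label label : index.values()) {
            if (outside(label, minTF, maxTF)) {
                label.contextVector = null;
            }
        }
    }

    /** Purging: the label itself is removed; returns the number removed. */
    static int purge(Map<String, Label> index, long minTF, long maxTF) {
        int before = index.size();
        index.values().removeIf(label -> outside(label, minTF, maxTF));
        return before - index.size();
    }

    /** Purging of previously pruned labels, i.e. those with a null context vector. */
    static int purgePruned(Map<String, Label> index) {
        int before = index.size();
        index.values().removeIf(label -> label.contextVector == null);
        return before - index.size();
    }

    public static void main(String[] args) {
        Map<String, Label> index = new HashMap<>();
        index.put("rare", new Label(1, 4));
        index.put("useful", new Label(5, 4));
        index.put("the", new Label(100, 4));
        prune(index, 2, 50);
        System.out.println(index.size());        // still 3 entries after pruning
        System.out.println(purgePruned(index));  // 2 pruned labels removed
    }
}
```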

isPurged

public boolean isPurged()
Checks whether the RandomIndex is in a purged state, i.e. whether labels have been removed from the index. Note: Updating the RandomIndex after labels have been removed from the index may result in new labels being generated for tokens that have already been encountered. This will lead to incorrectly updated context vectors and unreliable co-occurrence data.

Returns:
true if labels have been removed from the index, otherwise false

revokePurgedState

public void revokePurgedState()
Revoking the RandomIndex from purged state makes it possible to further add data to the index even though labels influencing the data have been removed from the index. This is NOT recommended, and after revocation it is entirely up to the user to remember if an index has been previously purged or not!


addVocabulary

public void addVocabulary(java.lang.String[] words)
Adds words (index terms) to the restricted vocabulary. Words not in the restricted vocabulary will have RandomLabels with pruned context vectors (i.e. contextvector = null).

Parameters:
words - String array with words to be added

addVocabulary

public int addVocabulary(java.util.Set<java.lang.String> vocabulary)
Adds words (index terms) to the restricted vocabulary. Words not in the restricted vocabulary will have RandomLabels with pruned context vectors (i.e. contextvector = null).

Parameters:
vocabulary - Set with words to be added

setVocabulary

public java.util.Set<java.lang.String> setVocabulary(java.util.Set<java.lang.String> vocabulary)
Sets the restricted vocabulary to the supplied Set and returns the old vocabulary set.
In order to reset the vocabulary you can supply an empty Set; the RandomLabels with pruned context vectors will, however, stay pruned (i.e. contextvector = null).

Parameters:
vocabulary - Set of words (index terms) denoting the desired restricted vocabulary
Returns:
previous vocabulary Set (possibly empty)
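
The effect of a restricted vocabulary can be sketched as follows: every token still receives an entry, but only words in the vocabulary get a context vector. VocabularySketch is a hypothetical class, and treating an empty vocabulary as "no restriction" is an assumption inferred from the remark above about resetting with an empty Set.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of indexing with a restricted vocabulary. */
final class VocabularySketch {

    /** Creates one entry per distinct token; out-of-vocabulary words get a null vector. */
    static Map<String, int[]> index(String[] tokens, Set<String> vocabulary, int dim) {
        Map<String, int[]> context = new HashMap<>();
        for (String token : tokens) {
            // Assumption: an empty vocabulary means that no restriction applies.
            boolean keep = vocabulary.isEmpty() || vocabulary.contains(token);
            if (!context.containsKey(token)) {
                context.put(token, keep ? new int[dim] : null); // null = pruned context vector
            }
        }
        return context;
    }

    public static void main(String[] args) {
        String[] text = {"the", "random", "index"};
        Set<String> vocabulary = new HashSet<>(Arrays.asList("random", "index"));
        Map<String, int[]> context = index(text, vocabulary, 8);
        System.out.println(context.containsKey("the"));   // the word is still present
        System.out.println(context.get("the") == null);   // but has no context vector
    }
}
```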

getVocabulary

public java.util.Set<java.lang.String> getVocabulary()
Gets the Set denoting the current restricted vocabulary.

Returns:
current vocabulary Set (possibly empty)