RandomIndex

moj.ri
Class RandomIndex

java.lang.Object
  moj.ri.RandomIndex

All Implemented Interfaces:: java.io.Serializable

Direct Known Subclasses:: SparseDistributedMemory

public class RandomIndex
extends java.lang.Object
implements java.io.Serializable

A Random Index is an index that contextually indexes the texts that are fed to it. Not only is it possible to ask the RandomIndex for the term frequency for a specific index term, but also for its semantically closest relatives (based upon co-occurrence statistics and random labels).
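The idea of indexing by co-occurrence statistics and random labels can be illustrated with a small, self-contained sketch. This is not the moj.ri implementation: the class `RandomIndexingSketch`, its fields and methods, the reduced dimensionality of 64 and the way the seed is combined with the word are all assumptions made for illustration, and no weighting scheme is applied. Each word receives a sparse random label, and the context vector of a word is the sum of the labels of its neighbours inside the window.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Illustrative only: a toy random index, not the moj.ri implementation.
class RandomIndexingSketch {
    static final int DIM = 64;    // hypothetical; the default constructor documents 1800
    static final int DEGREE = 8;  // number of non-zero entries per random label
    static final int SEED = 123;

    final Map<String, int[]> labels = new HashMap<>();    // fixed random labels
    final Map<String, int[]> contexts = new HashMap<>();  // accumulated context vectors

    // Sparse label with DEGREE entries set to +1 or -1, reproducible per word (assumed scheme).
    static int[] randomLabel(String word) {
        Random rnd = new Random(SEED ^ word.hashCode());
        int[] v = new int[DIM];
        int placed = 0;
        while (placed < DEGREE) {
            int pos = rnd.nextInt(DIM);
            if (v[pos] == 0) {
                v[pos] = rnd.nextBoolean() ? 1 : -1;
                placed++;
            }
        }
        return v;
    }

    // Adds the labels of all neighbours inside the window to each focus word's context vector.
    void addText(String[] tokens, int left, int right) {
        for (String t : tokens) {
            labels.computeIfAbsent(t, RandomIndexingSketch::randomLabel);
            contexts.computeIfAbsent(t, k -> new int[DIM]);
        }
        for (int i = 0; i < tokens.length; i++) {
            int[] ctx = contexts.get(tokens[i]);
            int from = Math.max(0, i - left);
            int to = Math.min(tokens.length - 1, i + right);
            for (int j = from; j <= to; j++) {
                if (j == i) continue;  // the focus word is not its own context
                int[] lab = labels.get(tokens[j]);
                for (int d = 0; d < DIM; d++) ctx[d] += lab[d];
            }
        }
    }

    // Cosine similarity between two context vectors; 0 if either vector is all zeros.
    static double cosine(int[] a, int[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int d = 0; d < a.length; d++) {
            dot += a[d] * b[d];
            na += a[d] * a[d];
            nb += b[d] * b[d];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        RandomIndexingSketch ri = new RandomIndexingSketch();
        ri.addText("the cat drinks milk".split("\\s"), 3, 3);
        ri.addText("the dog drinks milk".split("\\s"), 3, 3);
        // "cat" and "dog" occur with the same neighbours, so their context vectors coincide.
        System.out.println(cosine(ri.contexts.get("cat"), ri.contexts.get("dog")));
    }
}
```

Words that occur in similar contexts end up with similar context vectors, which is what makes a query for the closest relatives of a term possible.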

Version:: 2008-Dec-12
Author:: Martin Hassel
See Also:: Serialized Form

Constructor Summary
`RandomIndex()` Create a new RandomIndex of RandomLabels with a dimensionality of 1800, a degree of initial randomness of 8 and a window size for contextual updates of 3 (i.e. three words look-behind and look-ahead).
`RandomIndex(int dimensionality, int randomDegree, int seed, int leftWindowSize, int rightWindowSize, WeightingScheme weightingScheme)` Create a new RandomIndex of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates.

Method Summary
`boolean`	`addRandomLabel(RandomLabel label)` Add an existing RandomLabel to the RandomIndex.
`int`	`addText(java.lang.String text)` Add text to the Random Index and contextually update all words in, and added to, the index; tokens are separated by white-space.
`int`	`addText(java.lang.String[] text)` Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state).
`int`	`addText(java.lang.String text, java.lang.String pattern)` Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state); tokens are separated according to the supplied pattern.
`int`	`addVocabulary(java.util.Set<java.lang.String> vocabulary)` Adds words (index terms) to the restricted vocabulary.
`void`	`addVocabulary(java.lang.String[] words)` Adds words (index terms) to the restricted vocabulary.
`boolean`	`contains(RandomLabel label)` Returns `true` if this RandomIndex contains the given RandomLabel.
`boolean`	`contains(java.lang.String word)` Returns `true` if this RandomIndex contains the given word (index term).
`java.util.Set<java.util.Map.Entry<java.lang.String,RandomLabel>>`	`entrySet()` Returns a set view of the mappings contained in this `RandomIndex`.
`void`	`finalize()` Shuts down the thread pool, used for scheduling thread (re)usage when adding texts to the RandomIndex, as soon as all threads updating the RandomIndex have finished running.
`boolean`	`finishedUpdating()` Checks if all threads updating the RandomIndex have finished running; only then is the RandomIndex guaranteed to be up to date.
`boolean`	`getAllLowerCase()` Gets the state of the RandomIndex denoting the wish to henceforth automagically convert all words (index terms) to all lower case.
`int`	`getDimensionality()` Gets the dimensionality of RandomLabels in the RandomIndex.
`boolean`	`getDocumentLabels()` Gets the state of the RandomIndex denoting the wish to henceforth index on document level rather than word context level.
`long`	`getDocumentsIndexed()` Gets the number of documents indexed by the Random Index.
`int`	`getLeftWindowSize()` Gets the size of the context window to the left used when updating context labels in RandomLabels in the RandomIndex.
`int`	`getMaxNumThreads()` Returns the maximum number of concurrent threads reserved for RI when adding text to the index.
`int`	`getRandomDegree()` Gets the random degree of RandomLabels in the RandomIndex.
`RandomLabel`	`getRandomLabel(java.lang.String word)` Get the `RandomLabel` for the given `word` if it exists in the `RandomIndex`.
`int`	`getRightWindowSize()` Gets the size of the context window to the right used when updating context labels in RandomLabels in the RandomIndex.
`int`	`getSeed()` Gets the seed used to create RandomLabels in the RandomIndex.
`long`	`getSleepFactor()` Returns the number of milliseconds a thread should sleep when trying to update a RandomLabel that is already undergoing updating by another thread.
`boolean`	`getUnaryLabels()` Gets the state of the RandomIndex denoting the wish to henceforth use unary labels, rather than random.
`java.util.Set<java.lang.String>`	`getVocabulary()` Gets the `Set` denoting the current restricted vocabulary.
`WeightingScheme`	`getWeightingScheme()` Gets the name (class) of the WeightingScheme used by the RandomIndex when updating context labels in RandomLabels in the RandomIndex.
`long`	`getWordsIndexed()` Gets the number of words indexed by the Random Index.
`boolean`	`isEmpty()` Returns `true` if this RandomIndex contains no elements.
`boolean`	`isPurged()` Returns `true` if labels have been removed from the index. Note: Updating the RandomIndex after labels have been removed from the index may result in new labels being generated for tokens that have already been encountered.
`java.util.Set<java.lang.String>`	`keySet()` Returns a set view of the keys contained in this `RandomIndex`.
`void`	`pruneIndex(long minTF, long maxTF)` Prunes RandomLabels with a term frequency lower than `minTF` or higher than `maxTF` by setting their context vectors to `null`.
`void`	`pruneIndex(java.lang.String[] words)` Prunes the RandomLabels associated with the given words by setting their context vectors to `null`.
`int`	`purgeIndex()` Removes the `RandomLabels` associated with tokens previously pruned.
`int`	`purgeIndex(long minTF, long maxTF)` Removes `RandomLabels` with a term frequency lower than `minTF` or higher than `maxTF` from the `RandomIndex`.
`int`	`purgeIndex(java.lang.String[] words)` Removes the `RandomLabels` associated with the given words from the `RandomIndex`.
`void`	`revokePurgedState()` Revoking the RandomIndex from purged state makes it possible to further add data to the index even though labels influencing the data have been removed from the index.
`boolean`	`setAllLowerCase(boolean allLowerCase)` Sets the RandomIndex to henceforth automagically convert all words (index terms) to all lower case and returns the previous state.
`boolean`	`setDocumentLabels(boolean documentLabels)` Sets the RandomIndex to henceforth index on document level instead of the usual word context level.
`int`	`setMaxNumThreads(int maxNumThreads)` The maximum number of concurrent threads that should be reserved for RI for adding text to the index.
`long`	`setSleepFactor(long millis)` The number of milliseconds a thread should sleep when trying to update a RandomLabel that is already undergoing updating by another thread.
`boolean`	`setUnaryLabels(boolean unaryLabels)` Sets the RandomIndex to henceforth use unary labels, incremented from seed, and returns the previous state.
`java.util.Set<java.lang.String>`	`setVocabulary(java.util.Set<java.lang.String> vocabulary)` Sets the restricted vocabulary to the supplied `Set` and returns the old vocabulary set. In order to reset the vocabulary you can supply an empty `Set`; the `RandomLabel`s with pruned context vectors will however stay pruned (i.e. contextvector = null).
`int`	`size()` Returns the number of Random Labels in the Random Index.
`java.lang.String`	`toString()` String representation of RandomIndex.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Constructor Detail

RandomIndex

public RandomIndex(int dimensionality,
                   int randomDegree,
                   int seed,
                   int leftWindowSize,
                   int rightWindowSize,
                   WeightingScheme weightingScheme)

Create a new RandomIndex of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates.

Parameters:: dimensionality - The dimensionality all RandomLabels in the RandomIndex should have; this cannot be altered at a later stage.; randomDegree - The number of random values all RandomLabels in the RandomIndex should initially have.; seed - a seed for each label's random generator. This seed, in combination with the word, makes it very likely that the created random label is "unique" yet reproducible.; leftWindowSize - The maximum number of words behind the focus word to include in the context window when updating a label.; rightWindowSize - The maximum number of words in front of the focus word to include in the context window when updating a label.; weightingScheme - a visiting object defining the weighting of the context labels to the left and to the right of the focus word.

RandomIndex

public RandomIndex()

Create a new RandomIndex of RandomLabels with a dimensionality of 1800, a degree of initial randomness of 8 and a window size for contextual updates of 3 (i.e. three words look-behind and look-ahead). The seed for randomisation is set to 123.

Method Detail

keySet

public java.util.Set<java.lang.String> keySet()

Returns a set view of the keys contained in this RandomIndex. The set is backed by the RandomIndex, so changes to the RandomIndex are reflected in the set, and vice-versa. If the RandomIndex is modified while an iteration over the set is in progress, the results of the iteration are undefined.

Returns:: a set view of the keys contained in this RandomIndex.
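What "backed by" means in practice can be shown with `java.util.HashMap`, whose `keySet()` documents the same view contract. That RandomIndex stores its labels in such a map is an assumption; the class `BackedKeySetSketch` and its methods are hypothetical and only demonstrate the view semantics.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: uses a plain HashMap as a stand-in for the index.
class BackedKeySetSketch {
    // Returns the size of one and the same key set view before and after a put on the map.
    static int[] sizesAroundPut() {
        Map<String, Integer> index = new HashMap<>();
        index.put("cat", 1);
        Set<String> keys = index.keySet();  // a live view, not a copy
        int before = keys.size();
        index.put("dog", 1);                // change the map ...
        int after = keys.size();            // ... and the existing view reflects it
        return new int[] {before, after};
    }

    // Removing a key through the view removes the mapping from the map as well.
    static boolean removeThroughView() {
        Map<String, Integer> index = new HashMap<>();
        index.put("cat", 1);
        index.keySet().remove("cat");
        return index.isEmpty();
    }

    public static void main(String[] args) {
        int[] sizes = sizesAroundPut();
        System.out.println(sizes[0] + " -> " + sizes[1]);
        System.out.println(removeThroughView());
    }
}
```

A caller that needs a stable snapshot while texts are still being added can copy the view, for example into a new `HashSet`, before iterating.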

entrySet

public java.util.Set<java.util.Map.Entry<java.lang.String,RandomLabel>> entrySet()

Returns a set view of the mappings contained in this RandomIndex. Each element in the returned set is a Map.Entry. The set is backed by the RandomIndex, so changes to the RandomIndex are reflected in the set, and vice-versa. If the RandomIndex is modified while an iteration over the set is in progress, the results of the iteration are undefined.

Returns:: a set view of the mappings contained in this RandomIndex.

addRandomLabel

public boolean addRandomLabel(RandomLabel label)

Add an existing RandomLabel to the RandomIndex. Will only succeed if the word (index term) associated with the label does not yet exist in the RandomIndex, the label has the same dimensionality as is set for the RandomIndex and the RandomIndex is not in a purged state.

Parameters:: label - RandomLabel to be added.
Returns:: true upon success, else false.

setUnaryLabels

public boolean setUnaryLabels(boolean unaryLabels)

Sets the RandomIndex to henceforth use unary labels, incremented from seed, and returns the previous state. Index terms already in the index will not be changed.

Parameters:: unaryLabels - boolean value denoting the wish to henceforth use unary labels, rather than random, or not.
Returns:: the previous state, i.e. either true or false.

getUnaryLabels

public boolean getUnaryLabels()

Gets the state of the RandomIndex denoting the wish to henceforth use unary labels, rather than random.

Returns:: the current state, i.e. either true or false.

setDocumentLabels

public boolean setDocumentLabels(boolean documentLabels)

Sets the RandomIndex to henceforth index on document level instead of the usual word context level. Index terms already in the index will not be changed.

Parameters:: documentLabels - boolean value denoting the wish to henceforth index on document level or not.
Returns:: the previous state, i.e. either true or false.

getDocumentLabels

public boolean getDocumentLabels()

Gets the state of the RandomIndex denoting the wish to henceforth index on document level rather than word context level.

Returns:: the current state, i.e. either true or false.

setAllLowerCase

public boolean setAllLowerCase(boolean allLowerCase)

Sets the RandomIndex to henceforth automagically convert all words (index terms) to all lower case and returns the previous state. Index terms already in the index will not be changed.

Parameters:: allLowerCase - boolean value denoting the wish to have all index terms henceforth converted to lower case or not.
Returns:: the previous state, i.e. either true or false.
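The setters above share one pattern: they change a flag for future additions, leave existing index terms alone and return the previous state. A minimal sketch of that pattern follows; the class `LowerCaseToggleSketch`, its `add` method and the initial value `false` are assumptions for illustration and are not taken from RandomIndex.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a flag setter that returns the previous state.
class LowerCaseToggleSketch {
    private boolean allLowerCase = false;              // assumed initial state
    final Set<String> terms = new LinkedHashSet<>();   // keeps insertion order

    // Stores the new state and hands back the old one.
    boolean setAllLowerCase(boolean value) {
        boolean previous = allLowerCase;
        allLowerCase = value;
        return previous;
    }

    // Only terms added after the flag was set are converted.
    void add(String word) {
        terms.add(allLowerCase ? word.toLowerCase(Locale.ROOT) : word);
    }

    public static void main(String[] args) {
        LowerCaseToggleSketch index = new LowerCaseToggleSketch();
        index.add("Cat");
        boolean previous = index.setAllLowerCase(true);
        index.add("Dog");
        System.out.println(previous + " " + index.terms);
    }
}
```

Returning the previous state lets a caller switch a setting temporarily and restore it afterwards without a separate getter call.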

getAllLowerCase

public boolean getAllLowerCase()

Gets the state of the RandomIndex denoting the wish to henceforth automagically convert all words (index terms) to all lower case.

Returns:: the current state, i.e. either true or false.

getDimensionality

public int getDimensionality()

Gets the dimensionality of RandomLabels in the RandomIndex.

Returns:: the dimensionality of RandomLabels in the RandomIndex.

getRandomDegree

public int getRandomDegree()

Gets the random degree of RandomLabels in the RandomIndex.

Returns:: the random degree of RandomLabels in the RandomIndex.

getSeed

public int getSeed()

Gets the seed used to create RandomLabels in the RandomIndex.

Returns:: the seed used to create RandomLabels in the RandomIndex.

getLeftWindowSize

public int getLeftWindowSize()

Gets the size of the context window to the left used when updating context labels in RandomLabels in the RandomIndex.

Returns:: the size of the context window to the left.

getRightWindowSize

public int getRightWindowSize()

Gets the size of the context window to the right used when updating context labels in RandomLabels in the RandomIndex.

Returns:: the size of the context window to the right.

getWeightingScheme

public WeightingScheme getWeightingScheme()

Gets the name (class) of the WeightingScheme used by the RandomIndex when updating context labels in RandomLabels in the RandomIndex.

Returns:: the name (class) of the WeightingScheme used by the RandomIndex.

getWordsIndexed

public long getWordsIndexed()

Gets the number of words indexed by the Random Index.

Returns:: the number of words indexed by the Random Index.

getDocumentsIndexed

public long getDocumentsIndexed()

Gets the number of documents indexed by the Random Index.

Returns:: the number of documents indexed by the Random Index.

setMaxNumThreads

public int setMaxNumThreads(int maxNumThreads)

The maximum number of concurrent threads that should be reserved for RI for adding text to the index.

Parameters:: maxNumThreads - maximum number of threads reserved for RI
Returns:: the previously set maximum number of threads reserved for RI

getMaxNumThreads

public int getMaxNumThreads()

Returns the maximum number of concurrent threads reserved for RI when adding text to the index.

Returns:: the currently set maximum number of threads reserved for RI

setSleepFactor

public long setSleepFactor(long millis)

The number of milliseconds a thread should sleep when trying to update a RandomLabel that is already undergoing updating by another thread.

Parameters:: millis - the new sleep factor
Returns:: the previously set sleep factor
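One way to picture the sleep factor is a thread that tries to lock a label, and sleeps for the configured number of milliseconds whenever another thread holds it. The sketch below assumes a `ReentrantLock` per label; how RandomIndex actually guards its labels is not documented here, and `SleepRetrySketch` and `updateWithSleep` are hypothetical names.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: sleep-and-retry around a per-label lock (assumed mechanism).
class SleepRetrySketch {
    // Tries to take the label's lock, sleeping sleepMillis between attempts.
    // Returns how often the thread had to sleep, or -1 if it was interrupted.
    static int updateWithSleep(ReentrantLock labelLock, long sleepMillis, Runnable update) {
        int sleeps = 0;
        while (!labelLock.tryLock()) {
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt for the caller
                return -1;
            }
            sleeps++;
        }
        try {
            update.run();
        } finally {
            labelLock.unlock();
        }
        return sleeps;
    }

    public static void main(String[] args) {
        int[] counter = {0};
        int sleeps = updateWithSleep(new ReentrantLock(), 10L, () -> counter[0]++);
        System.out.println(sleeps + " sleeps, counter = " + counter[0]);
    }
}
```

Under this reading, a small sleep factor retries quickly at the cost of more wake-ups, while a large one wastes less CPU but may delay updates of heavily shared labels.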

getSleepFactor

public long getSleepFactor()

Returns the number of milliseconds a thread should sleep when trying to update a RandomLabel that is already undergoing updating by another thread.

Returns:: the currently set sleep factor

finishedUpdating

public boolean finishedUpdating()

Checks if all threads updating the RandomIndex have finished running; only then is the RandomIndex guaranteed to be up to date. This will never return true before finalize() has been called.

Returns:: true if all threads have finished running, otherwise false

finalize

public void finalize()

Shuts down the thread pool, used for scheduling thread (re)usage when adding texts to the RandomIndex, as soon as all threads updating the RandomIndex have finished running. Should be done e.g. before saving the index.

Overrides:: finalize in class java.lang.Object
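The relationship between finalize() and finishedUpdating() resembles that between `shutdown()` and `isTerminated()` on a `java.util.concurrent.ExecutorService`, where `isTerminated()` is never true unless a shutdown was requested first. The sketch below assumes such a pool; whether RandomIndex is built on an `ExecutorService` is not stated in this documentation, and `ShutdownSketch` is a hypothetical name.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: assumes a java.util.concurrent thread pool behind the index.
class ShutdownSketch {
    // Returns {terminated before shutdown?, terminated after shutdown?, all updates ran?}.
    static boolean[] run() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        AtomicInteger updates = new AtomicInteger();
        for (int i = 0; i < 4; i++) {
            pool.execute(() -> updates.incrementAndGet());  // stands in for one label update
        }
        boolean before = pool.isTerminated();  // false: no shutdown has been requested yet
        pool.shutdown();                       // no new tasks; submitted tasks still run
        boolean finished = false;
        try {
            finished = pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new boolean[] {before, finished && pool.isTerminated(), updates.get() == 4};
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
}
```

For RandomIndex this suggests the order: add all texts, call finalize(), wait until finishedUpdating() returns true, then save or query the index.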

addText

public int addText(java.lang.String[] text)

Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state). The words updated/added are contextually "coloured" according to the set window size and weighting scheme.

Parameters:: text - a text tokenized on word level where each element in the list text contains one token.
Returns:: number of words (or index terms) added/updated.

addText

public int addText(java.lang.String text,
                   java.lang.String pattern)

Add text to the Random Index and contextually update all words in, and added to, the index (if the index is not in a purged state); tokens are separated according to the supplied pattern. The words updated/added are contextually "coloured" according to the set window size and weighting scheme.

Parameters:: text - a text that is to be tokenized according to the supplied pattern.; pattern - a pattern according to which the text string is to be split into tokens.
Returns:: number of words (index terms) added/updated.

addText

public int addText(java.lang.String text)

Add text to the Random Index and contextually update all words in, and added to, the index; tokens are separated by white-space. The words updated/added are contextually "coloured" according to the set window size and weighting scheme.

Parameters:: text - a text that is to be tokenized according to the default pattern which is "\\s".
Returns:: number of words (index terms) added/updated.
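The default pattern `"\\s"` matches a single white-space character. If the text is split with `String.split`, which is an assumption about the implementation, consecutive white-space characters produce empty tokens. The hypothetical helper below compares the default pattern with `"\\s+"`, which could be supplied through the two-argument addText.

```java
import java.util.Arrays;

// Illustrative only: shows how the splitting pattern affects tokenization.
class WhitespaceSplitSketch {
    // Splits the text with java.lang.String.split (assumed tokenization mechanism).
    static String[] tokenize(String text, String pattern) {
        return text.split(pattern);
    }

    public static void main(String[] args) {
        String text = "to  be";  // two spaces between the words
        System.out.println(Arrays.toString(tokenize(text, "\\s")));
        System.out.println(Arrays.toString(tokenize(text, "\\s+")));
    }
}
```

With `"\\s"` the two adjacent spaces yield an empty token between "to" and "be"; with `"\\s+"` they are treated as one separator. Normalizing white-space beforehand, or passing `"\\s+"` as the pattern, avoids indexing empty strings if the implementation behaves this way.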

getRandomLabel

public RandomLabel getRandomLabel(java.lang.String word)

Get the RandomLabel for the given word if it exists in the RandomIndex. If the RandomIndex does not contain a corresponding RandomLabel, a zero-sized RandomLabel is returned.

Parameters:: word - the word in the RandomIndex that we want the corresponding RandomLabel for.
Returns:: the RandomLabel corresponding to the given word if it exists in the RandomIndex; otherwise a zero-sized RandomLabel.





toString

public java.lang.String toString()

String representation of RandomIndex.

Overrides:: toString in class java.lang.Object
Returns:: a string containing string representations of RandomLabels where each label is separated by a newline (the string also ends in newline).

isEmpty

public boolean isEmpty()

Returns true if this RandomIndex contains no elements.

Returns:: true if this RandomIndex contains no elements, otherwise false.

size

public int size()

Returns the number of Random Labels in the Random Index.

Returns:: the number of Random Labels in the Random Index.

contains

public boolean contains(java.lang.String word)

Returns true if this RandomIndex contains the given word (index term).

Parameters:: word - The word whose existence in the RandomIndex is to be determined.
Returns:: true if this RandomIndex contains the word, otherwise false.

contains

public boolean contains(RandomLabel label)

Returns true if this RandomIndex contains the given RandomLabel.

Parameters:: label - The RandomLabel whose existence in the RandomIndex is to be determined.
Returns:: true if this RandomIndex contains the RandomLabel, otherwise false.

pruneIndex

public void pruneIndex(java.lang.String[] words)

Prunes the RandomLabels associated with the given words by setting their context vectors to null. This can be used to reduce memory usage and file size if there are a lot of tokens that are only interesting as context to other tokens, and not as actual index terms themselves.

Parameters:: words - a string array of words whose RandomLabels are to be pruned.

pruneIndex

public void pruneIndex(long minTF,
                       long maxTF)

Prunes RandomLabels with a term frequency lower than minTF or higher than maxTF by setting their context vectors to null. This can be used to reduce memory usage and file size if there are a lot of tokens that are only interesting as context to other tokens, and not as actual index terms themselves.

Parameters:: minTF - lower frequency threshold; maxTF - higher frequency threshold

purgeIndex

public int purgeIndex()

Removes the RandomLabels associated with tokens previously pruned.

Returns:: the number of RandomLabels removed

purgeIndex

public int purgeIndex(java.lang.String[] words)

Removes the RandomLabels associated with the given words from the RandomIndex.

Parameters:: words - a string array of words whose RandomLabels are to be removed.
Returns:: the number of RandomLabels removed

purgeIndex

public int purgeIndex(long minTF,
                      long maxTF)

Removes RandomLabels with a term frequency lower than minTF or higher than maxTF from the RandomIndex.

Parameters:: minTF - lower frequency threshold; maxTF - higher frequency threshold
Returns:: the number of RandomLabels removed
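The difference between pruning and purging can be sketched with a plain map: pruning keeps the entry but drops its context vector, purging removes the entry altogether and puts the index in a purged state. The classes `PrunePurgeSketch` and `Label` below are hypothetical stand-ins, not the moj.ri `RandomIndex` and `RandomLabel`, and the way the purged flag is set is an assumption.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Illustrative only: prune keeps entries without context vectors, purge removes entries.
class PrunePurgeSketch {
    // Hypothetical stand-in for a RandomLabel: a term frequency plus a context vector.
    static class Label {
        final long tf;
        int[] contextVector = new int[4];
        Label(long tf) { this.tf = tf; }
    }

    final Map<String, Label> index = new HashMap<>();
    boolean purged = false;

    // Prune: labels outside [minTF, maxTF] stay in the index but lose their context vector.
    void pruneIndex(long minTF, long maxTF) {
        for (Label label : index.values()) {
            if (label.tf < minTF || label.tf > maxTF) {
                label.contextVector = null;
            }
        }
    }

    // Purge: previously pruned labels are removed; returns how many were removed.
    int purgeIndex() {
        int removed = 0;
        Iterator<Label> it = index.values().iterator();
        while (it.hasNext()) {
            if (it.next().contextVector == null) {
                it.remove();
                removed++;
            }
        }
        if (removed > 0) purged = true;  // assumed: removal marks the index as purged
        return removed;
    }

    public static void main(String[] args) {
        PrunePurgeSketch ri = new PrunePurgeSketch();
        ri.index.put("the", new Label(100));
        ri.index.put("cat", new Label(5));
        ri.index.put("xyzzy", new Label(1));
        ri.pruneIndex(2, 50);
        System.out.println(ri.index.size() + " labels after pruning");
        System.out.println(ri.purgeIndex() + " labels purged");
    }
}
```

A pruned label can still serve as context for other words because its entry remains in the index; a purged label cannot, which is why adding text to a purged index is discouraged.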





isPurged

public boolean isPurged()

Returns true if labels have been removed from the index. Note: Updating the RandomIndex after labels have been removed from the index may result in new labels being generated for tokens that have already been encountered. This will lead to incorrectly updated context vectors and unreliable co-occurrence data.

Returns:: true if labels have been removed from the index, otherwise false.

revokePurgedState

public void revokePurgedState()

Revoking the RandomIndex from purged state makes it possible to further add data to the index even though labels influencing the data have been removed from the index. This is NOT recommended, and after revocation it is entirely up to the user to remember if an index has been previously purged or not!

addVocabulary

public void addVocabulary(java.lang.String[] words)

Adds words (index terms) to the restricted vocabulary. Words not in the restricted vocabulary will have RandomLabels with pruned context vectors (i.e. contextvector = null).

Parameters:: words - String array with words to be added

addVocabulary

public int addVocabulary(java.util.Set<java.lang.String> vocabulary)

Adds words (index terms) to the restricted vocabulary. Words not in the restricted vocabulary will have RandomLabels with pruned context vectors (i.e. contextvector = null).

Parameters:: vocabulary - Set with words to be added

setVocabulary

public java.util.Set<java.lang.String> setVocabulary(java.util.Set<java.lang.String> vocabulary)

Sets the restricted vocabulary to the supplied Set and returns the old vocabulary set.

In order to reset the vocabulary you can supply an empty Set; the RandomLabels with pruned context vectors will however stay pruned (i.e. contextvector = null).

Parameters:: vocabulary - Set of words (index terms) denoting the desired restricted vocabulary
Returns:: previous vocabulary Set (possibly empty)
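The restricted vocabulary can be thought of as a filter deciding which words keep a context vector. The sketch below is a hypothetical stand-in, not the moj.ri code: the class `VocabularySketch`, the method `keepsContextVector` and the reading that an empty vocabulary means "no restriction" are assumptions derived from the reset behaviour described above.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: a restricted vocabulary as a filter on context vectors.
class VocabularySketch {
    private Set<String> vocabulary = new HashSet<>();

    // Swaps in a copy of the new vocabulary and hands back the old one.
    Set<String> setVocabulary(Set<String> newVocabulary) {
        Set<String> previous = vocabulary;
        vocabulary = new HashSet<>(newVocabulary);
        return previous;
    }

    // Assumption: an empty vocabulary restricts nothing.
    boolean keepsContextVector(String word) {
        return vocabulary.isEmpty() || vocabulary.contains(word);
    }

    public static void main(String[] args) {
        VocabularySketch ri = new VocabularySketch();
        ri.setVocabulary(Set.of("cat"));
        System.out.println(ri.keepsContextVector("cat"));
        System.out.println(ri.keepsContextVector("dog"));
    }
}
```

Note that resetting the vocabulary in this sketch only affects words seen afterwards, matching the documented behaviour that already pruned labels stay pruned.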





getVocabulary

public java.util.Set<java.lang.String> getVocabulary()

Gets the Set denoting the current restricted vocabulary.

Returns:: current vocabulary Set (possibly empty)
  