moj.ri
Class SparseDistributedMemory

java.lang.Object
  extended by moj.ri.RandomIndex
      extended by moj.ri.SparseDistributedMemory
All Implemented Interfaces:
java.io.Serializable

public class SparseDistributedMemory
extends RandomIndex

SparseDistributedMemory extends RandomIndex with saving and loading of the RandomIndex. The index is saved as a very large XML file, so saving and loading directly to and from a zip-compressed archive is also provided. This usually gives a compression rate of about 98% without loss of speed, but be prepared for the conversion of all the byte and float vectors to strings to take time. SparseDistributedMemory also extends RandomIndex with some useful functions for extracting different types of ordered subsets of the RandomIndex.
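
The effect of zip-compressing a repetitive XML file can be illustrated with the JDK's own java.util.zip classes. The following is a minimal sketch, not the code used by save/saveCompressed: the class name ZipXmlSketch, the entry name and the in-memory byte arrays are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of moj.ri.
class ZipXmlSketch {

    /** Compress an XML string into a single-entry zip archive held in memory. */
    static byte[] zip(String entryName, String xml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(xml.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the archive is complete.
        return bytes.toByteArray();
    }

    /** Read the first entry of a zip archive back into a string. */
    static String unzip(byte[] archive) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            if (in.getNextEntry() == null) {
                throw new IllegalArgumentException("archive contains no entries");
            }
            ByteArrayOutputStream xml = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            // read() returns -1 at the end of the current entry.
            while ((read = in.read(buffer)) != -1) {
                xml.write(buffer, 0, read);
            }
            return new String(xml.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Vectors written out as text contain long runs of repeated characters, which is why such files tend to compress well.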

Version:
2009-April-21
Author:
Martin Hassel
See Also:
Serialized Form

Constructor Summary
SparseDistributedMemory()
          Create a new SparseDistributedMemory of RandomLabels with a dimensionality of 1800, a degree of initial randomness of 8 and a window size for contextual updates of 3 (i.e. three words look-behind and look-ahead).
SparseDistributedMemory(int dimensionality, int randomDegree, int seed, int leftWindowSize, int rightWindowSize, WeightingScheme weightingScheme)
          Create a new SparseDistributedMemory of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates.
 
Method Summary
 int addTextFromFile(java.lang.String filename)
          Adds text from a text file to the RandomIndex.
 java.lang.String[] getCorrelations(java.lang.String word)
          Generate a set of "semantic relatives" for a given word.
 java.lang.String[] getCorrelations(java.lang.String word, int setSize)
          Generate a set of "semantic relatives" for a given word.
 java.lang.String[] getCorrelations(java.lang.String word, int setSize, long minTermFrequency, long maxTermFrequency)
          Generate a set of "semantic relatives" for a given word.
 java.lang.String[] getCorrelations(java.lang.String word, int setSize, long minTermFrequency, long maxTermFrequency, java.util.HashSet<java.lang.String> restrictedResultSet)
          Generate a set of "semantic relatives" for a given word.
 float[] getDocumentVector(java.lang.String[] document, java.util.Map<java.lang.String,java.lang.Number> weights, boolean idfWeighting)
          Get a Document Vector, with the same dimensionality as the RandomLabels in the SparseDistributedMemory, representing all words in document that are present in the SDM.
 java.lang.String[] getTfIDfRank(int setSize)
          Generate the setSize best "descriptors" (i.e. the index terms with the highest information value according to tf*idf).
 int load(java.lang.String filename)
          Load RandomIndex from file.
 int load(java.lang.String filename, java.util.Set<java.lang.String> vocabulary)
          Load RandomIndex from file.
 int loadCompressed(java.lang.String filename)
          Load RandomIndex from compressed file (i.e. zip archive).
 int loadCompressed(java.lang.String filename, java.util.Set<java.lang.String> vocabulary)
          Load RandomIndex from compressed file (i.e. zip archive).
 int save(java.lang.String filename)
          Saves the RandomIndex to the given filename with the extension ".xml" added.
 int saveCompressed(java.lang.String filename)
          Saves the RandomIndex to the given filename as a zip-file with the extension ".xml.zip" added.
 
Methods inherited from class moj.ri.RandomIndex
addRandomLabel, addText, addText, addText, addVocabulary, addVocabulary, contains, contains, entrySet, finalize, finishedUpdating, getAllLowerCase, getDimensionality, getDocumentLabels, getDocumentsIndexed, getLeftWindowSize, getMaxNumThreads, getRandomDegree, getRandomLabel, getRightWindowSize, getSeed, getSleepFactor, getUnaryLabels, getVocabulary, getWeightingScheme, getWordsIndexed, isEmpty, isPurged, keySet, pruneIndex, pruneIndex, purgeIndex, purgeIndex, purgeIndex, revokePurgedState, setAllLowerCase, setDocumentLabels, setMaxNumThreads, setSleepFactor, setUnaryLabels, setVocabulary, size, toString
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

SparseDistributedMemory

public SparseDistributedMemory(int dimensionality,
                               int randomDegree,
                               int seed,
                               int leftWindowSize,
                               int rightWindowSize,
                               WeightingScheme weightingScheme)
Create a new SparseDistributedMemory of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates.

Parameters:
dimensionality - The dimensionality all RandomLabels in the SparseDistributedMemory should have. This cannot be altered at a later stage.
randomDegree - The number of random values all RandomLabels in the SparseDistributedMemory should initially have.
seed - a seed for each label's random generator. This seed, in combination with the word, makes it very likely that the created random label is unique yet reproducible.
leftWindowSize - The maximum number of words behind the focus word to include in the context window when updating a label.
rightWindowSize - The maximum number of words in front of the focus word to include in the context window when updating a label.
weightingScheme - a visiting object defining the weighting of the context labels to the left and to the right of the focus word.
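
How a seed combined with the word can give labels that are reproducible and very likely unique can be sketched as follows. This is only an illustration under stated assumptions: the class RandomLabelSketch, the use of +1/-1 as the random values and the way the seed is mixed with the word's hash code are all hypothetical and not taken from the RandomIndex implementation.

```java
import java.util.Random;

// Hypothetical helper, not part of moj.ri.
class RandomLabelSketch {

    /**
     * Build a sparse label of length 'dimensionality' in which exactly
     * 'randomDegree' positions hold +1 or -1 and all others hold 0.
     * Seeding the generator from (seed, word) is an assumption made for
     * this example; it makes the label reproducible for the same inputs.
     */
    static byte[] label(String word, int dimensionality, int randomDegree, int seed) {
        if (randomDegree > dimensionality) {
            throw new IllegalArgumentException("randomDegree exceeds dimensionality");
        }
        byte[] vector = new byte[dimensionality];
        Random random = new Random(31L * seed + word.hashCode());
        int placed = 0;
        while (placed < randomDegree) {
            int position = random.nextInt(dimensionality);
            if (vector[position] == 0) { // skip positions that are already set
                vector[position] = (byte) (random.nextBoolean() ? 1 : -1);
                placed++;
            }
        }
        return vector;
    }
}
```

With the default values from the no-argument constructor, label("cat", 1800, 8, seed) would give a vector of 1800 elements with 8 non-zero entries, and calling it again with the same arguments gives the same vector.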

SparseDistributedMemory

public SparseDistributedMemory()
Create a new SparseDistributedMemory of RandomLabels with a dimensionality of 1800, a degree of initial randomness of 8 and a window size for contextual updates of 3 (i.e. three words look-behind and look-ahead).

Method Detail

load

public int load(java.lang.String filename)
Load RandomIndex from file. If the non-compressed file is not found, it tries to load the compressed file at the same location.

Parameters:
filename - the path and filename (without extension, '.xml' will be added) that the RandomIndex is to be loaded from.
Returns:
number of RandomLabels loaded.
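
The fallback from the plain file to the compressed file can be pictured as a simple lookup order. The sketch below only illustrates that order. The class IndexFileResolverSketch and its method are hypothetical, and the existence check is passed in as a parameter so the logic can be tried without touching the file system.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical helper, not part of moj.ri.
class IndexFileResolverSketch {

    /**
     * Return filename + ".xml" if it exists, otherwise filename + ".xml.zip"
     * if that exists, otherwise an empty Optional.
     */
    static Optional<Path> resolve(String filename, Predicate<Path> exists) {
        Path plain = Paths.get(filename + ".xml");
        if (exists.test(plain)) {
            return Optional.of(plain);
        }
        Path zipped = Paths.get(filename + ".xml.zip");
        if (exists.test(zipped)) {
            return Optional.of(zipped);
        }
        return Optional.empty();
    }
}
```

Against a real directory the check could be supplied as java.nio.file.Files::exists, for example resolve("myIndex", Files::exists).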

load

public int load(java.lang.String filename,
                java.util.Set<java.lang.String> vocabulary)
Load RandomIndex from file. If the non-compressed file is not found, it tries to load the compressed file at the same location.

Parameters:
filename - the path and filename (without extension, '.xml' will be added) that the RandomIndex is to be loaded from.
vocabulary - the set of words to load from the RandomIndex, or null if all words are to be loaded into memory.
Returns:
number of RandomLabels loaded.

loadCompressed

public int loadCompressed(java.lang.String filename)
Load RandomIndex from compressed file (i.e. zip archive).

Parameters:
filename - the path and filename (without extension, '.xml.zip' will be added) that the RandomIndex is to be loaded from.
Returns:
number of RandomLabels loaded.

loadCompressed

public int loadCompressed(java.lang.String filename,
                          java.util.Set<java.lang.String> vocabulary)
Load RandomIndex from compressed file (i.e. zip archive).

Parameters:
filename - the path and filename (without extension, '.xml.zip' will be added) that the RandomIndex is to be loaded from.
vocabulary - the set of words to load from the RandomIndex, or null if all words are to be loaded into memory.
Returns:
number of RandomLabels loaded.

save

public int save(java.lang.String filename)
Saves the RandomIndex to the given filename with the extension ".xml" added.

Parameters:
filename - file with full path to save the RandomIndex to.
Returns:
number of RandomLabels that were saved.

saveCompressed

public int saveCompressed(java.lang.String filename)
Saves the RandomIndex to the given filename as a zip-file with the extension ".xml.zip" added.

Parameters:
filename - file with full path to save the compressed RandomIndex to.
Returns:
number of RandomLabels that were saved.

addTextFromFile

public int addTextFromFile(java.lang.String filename)
Adds text from a text file to the RandomIndex, i.e. the text in the text file is read and words, in the order they are encountered in the text, are added to the RandomIndex if they aren't already represented in the index and contextually updated if they are already present.

Parameters:
filename - name, with full path, of the text file whose text is to be added to the RandomIndex.
Returns:
number of words (index terms) added/updated.
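
A contextual update looks at the words around each focus word, bounded by the left and right window sizes given to the constructor. The sketch below shows only how such a window could be cut out of a token sequence. The class ContextWindowSketch is hypothetical, and the actual update of the labels, including the weighting scheme, is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of moj.ri.
class ContextWindowSketch {

    /**
     * Collect up to leftWindowSize words before and up to rightWindowSize
     * words after the focus position, clipped at the text boundaries.
     */
    static List<String> context(String[] tokens, int focus,
                                int leftWindowSize, int rightWindowSize) {
        List<String> neighbours = new ArrayList<>();
        int start = Math.max(0, focus - leftWindowSize);
        int end = Math.min(tokens.length - 1, focus + rightWindowSize);
        for (int i = start; i <= end; i++) {
            if (i != focus) { // the focus word is not part of its own context
                neighbours.add(tokens[i]);
            }
        }
        return neighbours;
    }
}
```

With a window of 3 on both sides, a focus word in the middle of a text gets six neighbours, while the first word of the text gets only the three words that follow it.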

getCorrelations

public java.lang.String[] getCorrelations(java.lang.String word,
                                          int setSize,
                                          long minTermFrequency,
                                          long maxTermFrequency,
                                          java.util.HashSet<java.lang.String> restrictedResultSet)
Generate a set of "semantic relatives" for a given word. The returned String array contains the setSize closest index terms sorted by cosine, together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty (that is, isEmpty() == true), a zero-sized String array is returned.

Parameters:
word - index term we want to generate a set of semantic relatives for.
setSize - number of desired members of the generated set of semantic relatives.
minTermFrequency - minimum TermFrequency required for a semantic relative to be included in the set.
maxTermFrequency - maximum TermFrequency allowed for a semantic relative to be included in the set.
restrictedResultSet - HashSet with index terms that are allowed to end up in the generated set of semantic relatives.
Returns:
a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise it returns a String array of size SparseDistributedMemory.size().
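
Ranking index terms by cosine against the vector of the given word can be sketched as below. This is a simplified illustration, not the library's code: the class CorrelationSketch and the Map of float vectors are assumptions, the term frequency limits and restrictedResultSet are left out, and only the terms are returned, without the distance values that getCorrelations adds.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of moj.ri.
class CorrelationSketch {

    /** Cosine of the angle between two vectors of equal length (0 if either is all zeros). */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** The setSize terms with the highest cosine to 'word'; empty if the word is unknown. */
    static String[] closest(String word, int setSize, Map<String, float[]> vectors) {
        float[] target = vectors.get(word);
        if (target == null) {
            return new String[0];
        }
        List<String> candidates = new ArrayList<>(vectors.keySet());
        // Negate the cosine so that the most similar terms sort first.
        candidates.sort(Comparator.comparingDouble((String w) -> -cosine(target, vectors.get(w))));
        int size = Math.max(0, Math.min(setSize, candidates.size()));
        return candidates.subList(0, size).toArray(new String[0]);
    }
}
```

In this sketch the query word itself stays among the candidates, so the result has min(setSize, size of the map) entries, in line with the array sizes described above. Whether the library also keeps the query word in its result is not stated here.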

getCorrelations

public java.lang.String[] getCorrelations(java.lang.String word,
                                          int setSize,
                                          long minTermFrequency,
                                          long maxTermFrequency)
Generate a set of "semantic relatives" for a given word. The returned String array contains the setSize closest index terms sorted by cosine, together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty (that is, isEmpty() == true), a zero-sized String array is returned.

Parameters:
word - index term we want to generate a set of semantic relatives for.
setSize - number of desired members of the generated set of semantic relatives.
minTermFrequency - minimum TermFrequency required for a semantic relative to be included in the set.
maxTermFrequency - maximum TermFrequency allowed for a semantic relative to be included in the set.
Returns:
a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise it returns a String array of size SparseDistributedMemory.size().

getCorrelations

public java.lang.String[] getCorrelations(java.lang.String word,
                                          int setSize)
Generate a set of "semantic relatives" for a given word. The returned String array contains the setSize closest index terms sorted by cosine, together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty (that is, isEmpty() == true), a zero-sized String array is returned.

Parameters:
word - index term we want to get a set of "semantic relatives" for.
setSize - number of desired members of the generated set of semantic relatives.
Returns:
a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise it returns a String array of size SparseDistributedMemory.size(). The String array contains the setSize closest index terms sorted by cosine, together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty (that is, isEmpty() == true), a zero-sized String array is returned.

getCorrelations

public java.lang.String[] getCorrelations(java.lang.String word)
Generate a set of "semantic relatives" for a given word. The returned String array contains the 10 (or fewer) closest index terms sorted by cosine, together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty (that is, isEmpty() == true), a zero-sized String array is returned.

Parameters:
word - index term we want to get a set of 10 "semantic relatives" for.
Returns:
a String array of size 10 if SparseDistributedMemory.size() >= 10, otherwise it returns a String array of size SparseDistributedMemory.size(). The String array contains the 10 (or fewer) closest index terms sorted by cosine, together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty (that is, isEmpty() == true), a zero-sized String array is returned.

getTfIDfRank

public java.lang.String[] getTfIDfRank(int setSize)
Generate the setSize best "descriptors" (i.e. the index terms with the highest information value according to tf*idf).

Parameters:
setSize - number of desired members of the generated set of top ranking index terms.
Returns:
a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise it returns a String array of size SparseDistributedMemory.size(). The String array contains the top index terms sorted by tf*idf, together with their tf*idf value ("term=value"). However, if the SparseDistributedMemory is empty (that is, isEmpty() == true), a zero-sized String array is returned.
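
A tf*idf ranking that produces "term=value" strings can be sketched as follows. The class TfIdfRankSketch and the frequency maps are hypothetical, and the formula tf * log2(N/n) is an assumption based on the Inverse Document Frequency mentioned for getDocumentVector. The library may compute or format the value differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of moj.ri.
class TfIdfRankSketch {

    /** tf * log2(N / n), with N documents in total and n documents containing the term. */
    static double tfIdf(long termFrequency, long documentFrequency, long documents) {
        double idf = Math.log((double) documents / documentFrequency) / Math.log(2);
        return termFrequency * idf;
    }

    /** The setSize terms with the highest tf*idf, each formatted as "term=value". */
    static String[] rank(int setSize, Map<String, Long> termFrequency,
                         Map<String, Long> documentFrequency, long documents) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Long> entry : termFrequency.entrySet()) {
            Long df = documentFrequency.get(entry.getKey());
            if (df != null && df > 0) { // terms without a document count are skipped
                scores.put(entry.getKey(), tfIdf(entry.getValue(), df, documents));
            }
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort((a, b) -> Double.compare(scores.get(b), scores.get(a))); // highest first
        int size = Math.max(0, Math.min(setSize, terms.size()));
        String[] result = new String[size];
        for (int i = 0; i < size; i++) {
            result[i] = terms.get(i) + "=" + scores.get(terms.get(i));
        }
        return result;
    }
}
```

A term that occurs in every document gets an idf of 0 and therefore ends up at the bottom of the ranking, however frequent it is.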

getDocumentVector

public float[] getDocumentVector(java.lang.String[] document,
                                 java.util.Map<java.lang.String,java.lang.Number> weights,
                                 boolean idfWeighting)
Get a Document Vector, with the same dimensionality as the RandomLabels in the SparseDistributedMemory, representing all words in document that are present in the SDM. This vector is weighted with the weights supplied in the <word, weight>-pairs in weights.

Parameters:
document - tokenized document where each String element represents one lexical item (i.e. a word).
weights - a Map containing <word, weight>-pairs where the weight should be an object of type Number. It is allowed to pass null to indicate no weighting should be done.
idfWeighting - true if weighting should be modified with the Inverse Document Frequency (log2 N/n). In this case the weight in the <word,weight>-pairs could be used to represent the word's Term Frequency within the document.
Returns:
a vector of floats representing the Document Vector.
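
One way to picture a document vector is as a weighted sum of the vectors of the words in the document. The sketch below follows that idea under several assumptions: the class DocumentVectorSketch, the Map of word vectors and the document frequency map are hypothetical, every token occurrence is added once, and the idf factor is taken as log2(N/n). It is not the library's implementation.

```java
import java.util.Map;

// Hypothetical helper, not part of moj.ri.
class DocumentVectorSketch {

    /**
     * Sum the vectors of all document words that are present in 'vectors'.
     * Each vector is scaled by the word's weight (1 if 'weights' is null or has
     * no entry for the word) and, if idfWeighting is set, by log2(N / n).
     */
    static float[] documentVector(String[] document, Map<String, Number> weights,
                                  boolean idfWeighting, Map<String, float[]> vectors,
                                  Map<String, Long> documentFrequency, long documents,
                                  int dimensionality) {
        float[] sum = new float[dimensionality];
        for (String word : document) {
            float[] vector = vectors.get(word);
            if (vector == null) {
                continue; // words missing from the index contribute nothing
            }
            double weight = 1.0;
            if (weights != null && weights.containsKey(word)) {
                weight = weights.get(word).doubleValue();
            }
            if (idfWeighting) {
                Long df = documentFrequency.get(word);
                if (df != null && df > 0) {
                    weight *= Math.log((double) documents / df) / Math.log(2);
                }
            }
            for (int i = 0; i < dimensionality; i++) {
                sum[i] += (float) (weight * vector[i]);
            }
        }
        return sum;
    }
}
```

Passing null for weights gives an unweighted sum, which corresponds to the null case described for the weights parameter.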