java.lang.Object
  moj.ri.RandomIndex
    moj.ri.SparseDistributedMemory

public class SparseDistributedMemory
extends RandomIndex
SparseDistributedMemory extends RandomIndex with saving and loading of the RandomIndex. The index is saved as a very large XML file, so saving and loading directly to and from a zip-compressed archive is also provided. Compression usually achieves a rate of about 98% without loss of speed, but be prepared for the conversion of all the byte and float vectors to strings to take time. SparseDistributedMemory also extends RandomIndex with functions for extracting different types of ordered subsets of the RandomIndex.
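The trade-off described above (a large XML text file that compresses very well) can be illustrated with the standard java.util.zip classes. This is only a sketch: the class name ZipXmlSketch, the entry name and the in-memory byte arrays are assumptions made for illustration, and nothing here is taken from the actual SparseDistributedMemory implementation or its XML schema.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of moj.ri.
class ZipXmlSketch {
    /** Writes the XML text as the single entry of an in-memory zip archive. */
    static byte[] zip(String entryName, String xml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(xml.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the first entry of the archive back into a string. */
    static String unzip(byte[] archive) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(archive))) {
            if (zis.getNextEntry() == null) {
                throw new IOException("empty archive");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            // read() returns -1 at the end of the current entry
            while ((n = zis.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Vectors written as text are highly repetitive, which is why a zip archive of such a file tends to be much smaller than the XML itself.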
| Constructor Summary | |
|---|---|
| `SparseDistributedMemory()` | Create a new SparseDistributedMemory of RandomLabels with a dimensionality of 1800, a degree of initial randomness of 8 and a window size for contextual updates of 3. |
| `SparseDistributedMemory(int dimensionality, int randomDegree, int seed, int leftWindowSize, int rightWindowSize, WeightingScheme weightingScheme)` | Create a new SparseDistributedMemory of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates. |
| Method Summary | |
|---|---|
| `int` | `addTextFromFile(java.lang.String filename)` Adds text from a text file to the RandomIndex. |
| `java.lang.String[]` | `getCorrelations(java.lang.String word)` Generate a set of "semantic relatives" for a given word. |
| `java.lang.String[]` | `getCorrelations(java.lang.String word, int setSize)` Generate a set of "semantic relatives" for a given word. |
| `java.lang.String[]` | `getCorrelations(java.lang.String word, int setSize, long minTermFrequency, long maxTermFrequency)` Generate a set of "semantic relatives" for a given word. |
| `java.lang.String[]` | `getCorrelations(java.lang.String word, int setSize, long minTermFrequency, long maxTermFrequency, java.util.HashSet<java.lang.String> restrictedResultSet)` Generate a set of "semantic relatives" for a given word. |
| `float[]` | `getDocumentVector(java.lang.String[] document, java.util.Map<java.lang.String,java.lang.Number> weights, boolean idfWeighting)` Get a Document Vector, with the same dimensionality as the RandomLabels in the SparseDistributedMemory, representing all words in document that are present in the SDM. |
| `java.lang.String[]` | `getTfIDfRank(int setSize)` Generate the setSize best "descriptors" (i.e. the index terms with the highest information value according to tf*idf). |
| `int` | `load(java.lang.String filename)` Load RandomIndex from file. |
| `int` | `load(java.lang.String filename, java.util.Set<java.lang.String> vocabulary)` Load RandomIndex from file. |
| `int` | `loadCompressed(java.lang.String filename)` Load RandomIndex from compressed file. |
| `int` | `loadCompressed(java.lang.String filename, java.util.Set<java.lang.String> vocabulary)` Load RandomIndex from compressed file. |
| `int` | `save(java.lang.String filename)` Saves the RandomIndex to the given filename with the extension ".xml" added. |
| `int` | `saveCompressed(java.lang.String filename)` Saves the RandomIndex to the given filename as a zip-file with the extension ".xml.zip" added. |
| Methods inherited from class java.lang.Object |
|---|
| equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Constructor Detail |
|---|
public SparseDistributedMemory(int dimensionality,
int randomDegree,
int seed,
int leftWindowSize,
int rightWindowSize,
WeightingScheme weightingScheme)
Create a new SparseDistributedMemory of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates.
Parameters:
dimensionality - the dimensionality that all RandomLabels in the SparseDistributedMemory should have; this cannot be altered at a later stage.
randomDegree - the number of random values that all RandomLabels in the SparseDistributedMemory should initially have.
seed - a seed for each label's random generator. This seed, in combination with the word, makes it very likely that the created random label is unique yet reproducible.
leftWindowSize - the maximum number of words behind the focus word to include in the context window when updating a label.
rightWindowSize - the maximum number of words in front of the focus word to include in the context window when updating a label.
weightingScheme - a visiting object defining the weighting of the context labels to the left and to the right of the focus word.

public SparseDistributedMemory()
Create a new SparseDistributedMemory of RandomLabels with a dimensionality of 1800, a degree of initial randomness of 8 and a window size for contextual updates of 3.
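The interplay of dimensionality, randomDegree and seed can be sketched as follows. The example builds a sparse label with exactly randomDegree non-zero entries, each +1 or -1, from a generator seeded by both the seed and the word. The class name RandomLabelSketch, the use of +1/-1 values and the way seed and word are mixed are assumptions made for illustration; the page does not document how RandomLabels are actually generated.

```java
import java.util.Random;

// Hypothetical helper, not part of moj.ri.
class RandomLabelSketch {
    /** Builds a sparse label that depends only on (word, seed, dimensionality, randomDegree). */
    static byte[] randomLabel(String word, int seed, int dimensionality, int randomDegree) {
        if (randomDegree > dimensionality) {
            throw new IllegalArgumentException("randomDegree must not exceed dimensionality");
        }
        // Assumption: seed and word are combined into one generator seed.
        // String.hashCode() is specified by the language, so the result is reproducible.
        Random rng = new Random(31L * seed + word.hashCode());
        byte[] label = new byte[dimensionality];
        int placed = 0;
        while (placed < randomDegree) {
            int position = rng.nextInt(dimensionality);
            if (label[position] == 0) {            // never overwrite an entry already set
                label[position] = (byte) (rng.nextBoolean() ? 1 : -1);
                placed++;
            }
        }
        return label;
    }
}
```

With the default values from the no-argument constructor, randomLabel("cat", seed, 1800, 8) would give a vector of 1800 entries of which 8 are non-zero, and calling it again with the same arguments gives the same vector.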
| Method Detail |
|---|
public int load(java.lang.String filename)
Load RandomIndex from file.
Parameters:
filename - the path and filename (without extension; '.xml' will be added) that the RandomIndex is to be loaded from.
Returns:
the number of RandomLabels loaded.

public int load(java.lang.String filename,
java.util.Set<java.lang.String> vocabulary)
Load RandomIndex from file.
Parameters:
filename - the path and filename (without extension; '.xml' will be added) that the RandomIndex is to be loaded from.
vocabulary - the set of words to load from the RandomIndex, or null if all words are to be loaded into memory.
Returns:
the number of RandomLabels loaded.

public int loadCompressed(java.lang.String filename)
Load RandomIndex from compressed file.
Parameters:
filename - the path and filename (without extension; '.xml.zip' will be added) that the RandomIndex is to be loaded from.
Returns:
the number of RandomLabels loaded.

public int loadCompressed(java.lang.String filename,
java.util.Set<java.lang.String> vocabulary)
Load RandomIndex from compressed file.
Parameters:
filename - the path and filename (without extension; '.xml.zip' will be added) that the RandomIndex is to be loaded from.
vocabulary - the set of words to load from the RandomIndex, or null if all words are to be loaded into memory.
Returns:
the number of RandomLabels loaded.

public int save(java.lang.String filename)
Saves the RandomIndex to the given filename with the extension ".xml" added.
Parameters:
filename - file, with full path, to save the RandomIndex to.
Returns:
the number of RandomLabels that were saved.

public int saveCompressed(java.lang.String filename)
Saves the RandomIndex to the given filename as a zip-file with the extension ".xml.zip" added.
Parameters:
filename - file, with full path, to save the compressed RandomIndex to.
Returns:
the number of RandomLabels that were saved.

public int addTextFromFile(java.lang.String filename)
Adds text from a text file to the RandomIndex, i.e. the text in the text file is read and words, in the order they are encountered in the text, are added to the RandomIndex if they are not already represented in the index, and contextually updated if they are already present.
Parameters:
filename - name, with full path, of the text file whose text is to be added to the RandomIndex.
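
The contextual update described above can be sketched with a tiny stand-alone example. Each word gets a fixed random label, and the context vector of a focus word accumulates the labels of the words that fall inside its left and right window. The class ContextUpdateSketch, its 16-dimensional labels with two non-zero entries, and the constant weight of 1 for every neighbour are all assumptions made for illustration; the real class delegates weighting to a WeightingScheme whose behaviour is not documented on this page.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not part of moj.ri.
class ContextUpdateSketch {
    static final int DIM = 16; // deliberately tiny, for illustration only

    /** Deterministic sparse label for a word (two non-zero entries, +1 or -1). */
    static byte[] label(String word) {
        Random rng = new Random(word.hashCode());
        byte[] v = new byte[DIM];
        int placed = 0;
        while (placed < 2) {
            int position = rng.nextInt(DIM);
            if (v[position] == 0) {
                v[position] = (byte) (rng.nextBoolean() ? 1 : -1);
                placed++;
            }
        }
        return v;
    }

    /** Adds words in text order; new words get a context vector, known words are updated. */
    static Map<String, float[]> index(String[] words, int leftWindowSize, int rightWindowSize) {
        Map<String, float[]> contexts = new HashMap<>();
        for (int i = 0; i < words.length; i++) {
            float[] context = contexts.computeIfAbsent(words[i], w -> new float[DIM]);
            int from = Math.max(0, i - leftWindowSize);
            int to = Math.min(words.length - 1, i + rightWindowSize);
            for (int j = from; j <= to; j++) {
                if (j == i) continue;              // the focus word is not its own context
                byte[] neighbour = label(words[j]);
                for (int d = 0; d < DIM; d++) {
                    context[d] += neighbour[d];    // assumption: every neighbour has weight 1
                }
            }
        }
        return contexts;
    }
}
```

For the three-word text "a b c" with a window of one word on each side, the context of "a" ends up equal to the label of "b", and the context of "b" is the sum of the labels of "a" and "c".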
public java.lang.String[] getCorrelations(java.lang.String word,
int setSize,
long minTermFrequency,
long maxTermFrequency,
java.util.HashSet<java.lang.String> restrictedResultSet)
Generate a set of "semantic relatives" for a given word. The returned String array contains the setSize closest index terms sorted by cosine together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty, that is isEmpty() == true, a zero-sized String array will be returned.
Parameters:
word - index term we want to generate a set of semantic relatives for.
setSize - number of desired members of the generated set of semantic relatives.
minTermFrequency - minimum TermFrequency required for a semantic relative to be included in the set.
maxTermFrequency - maximum TermFrequency allowed for a semantic relative to be included in the set.
restrictedResultSet - HashSet with index terms that are allowed to end up in the generated set of semantic relatives.
Returns:
a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise a String array of size SparseDistributedMemory.size().
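
The selection logic described for this overload (filter by term frequency and by an allowed set, then keep the setSize terms closest by cosine) can be sketched as below. The class CorrelationSketch, the explicit vectors and termFrequency maps, the exclusion of the query word itself and the fact that only the terms are returned (without the distance values) are assumptions made for illustration, not the behaviour of the actual method.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of moj.ri.
class CorrelationSketch {
    /** Cosine of the angle between two vectors of equal length; 0 if either is all zeros. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns at most setSize terms, most similar first; allowed == null means no restriction. */
    static String[] nearest(String word, int setSize, long minTermFrequency, long maxTermFrequency,
                            Set<String> allowed, Map<String, float[]> vectors,
                            Map<String, Long> termFrequency) {
        float[] target = vectors.get(word);
        if (target == null) {
            return new String[0];                  // unknown word or empty index
        }
        List<String> candidates = new ArrayList<>();
        for (String term : vectors.keySet()) {
            long tf = termFrequency.getOrDefault(term, 0L);
            if (term.equals(word) || tf < minTermFrequency || tf > maxTermFrequency) continue;
            if (allowed != null && !allowed.contains(term)) continue;
            candidates.add(term);
        }
        // Highest cosine first.
        candidates.sort(Comparator.comparingDouble((String t) -> cosine(target, vectors.get(t))).reversed());
        int k = Math.min(setSize, candidates.size());
        return candidates.subList(0, k).toArray(new String[0]);
    }
}
```

A maximum term frequency is a simple way to keep very common function words out of the result, while the restricted set limits the answer to a known vocabulary.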

public java.lang.String[] getCorrelations(java.lang.String word,
int setSize,
long minTermFrequency,
long maxTermFrequency)
Generate a set of "semantic relatives" for a given word. The returned String array contains the setSize closest index terms sorted by cosine together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty, that is isEmpty() == true, a zero-sized String array will be returned.
Parameters:
word - index term we want to generate a set of semantic relatives for.
setSize - number of desired members of the generated set of semantic relatives.
minTermFrequency - minimum TermFrequency required for a semantic relative to be included in the set.
maxTermFrequency - maximum TermFrequency allowed for a semantic relative to be included in the set.
Returns:
a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise a String array of size SparseDistributedMemory.size().

public java.lang.String[] getCorrelations(java.lang.String word,
int setSize)
Generate a set of "semantic relatives" for a given word. The returned String array contains the setSize closest index terms sorted by cosine together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty, that is isEmpty() == true, a zero-sized String array will be returned.
Parameters:
word - index term we want to get a set of "semantic relatives" for.
setSize - number of desired members of the generated set of semantic relatives.
Returns:
a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise a String array of size SparseDistributedMemory.size().

public java.lang.String[] getCorrelations(java.lang.String word)
Generate a set of "semantic relatives" for a given word. The returned String array contains the 10 (or fewer) closest index terms sorted by cosine together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty, that is isEmpty() == true, a zero-sized String array will be returned.
Parameters:
word - index term we want to get a set of 10 "semantic relatives" for.
Returns:
a String array of size 10 if SparseDistributedMemory.size() >= 10, otherwise a String array of size SparseDistributedMemory.size().

public java.lang.String[] getTfIDfRank(int setSize)
Generate the setSize best "descriptors" (i.e. the index terms with the highest information value according to tf*idf).
Parameters:
setSize - number of desired members of the generated set of top-ranking index terms.
Returns:
a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise a String array of size SparseDistributedMemory.size(). The String array contains the top index terms sorted by tf*idf together with their tf*idf value ("term=value"). However, if the SparseDistributedMemory is empty, that is isEmpty() == true, a zero-sized String array will be returned.
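
A tf*idf ranking that produces "term=value" strings can be sketched as follows, using idf = log2(N/n), where N is the number of documents and n the number of documents containing the term. The class TfIdfRankSketch and its explicit termFrequency, documentFrequency and totalDocuments inputs are assumptions made for illustration; the page does not say how the real class stores these counts or formats the value.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of moj.ri.
class TfIdfRankSketch {
    /**
     * Returns at most setSize strings of the form "term=value", highest tf*idf first.
     * Every term in termFrequency must have a document frequency of at least 1.
     */
    static String[] rank(int setSize, Map<String, Long> termFrequency,
                         Map<String, Integer> documentFrequency, int totalDocuments) {
        if (termFrequency.isEmpty()) {
            return new String[0];                  // mirrors the documented empty-index case
        }
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Long> e : termFrequency.entrySet()) {
            int n = documentFrequency.get(e.getKey());
            double idf = Math.log((double) totalDocuments / n) / Math.log(2); // log2(N/n)
            scored.add(new AbstractMap.SimpleEntry<String, Double>(e.getKey(), e.getValue() * idf));
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        int k = Math.min(setSize, scored.size());
        String[] result = new String[k];
        for (int i = 0; i < k; i++) {
            result[i] = scored.get(i).getKey() + "=" + scored.get(i).getValue();
        }
        return result;
    }
}
```

A term that occurs in every document gets idf = 0 and therefore drops to the bottom of the ranking however frequent it is.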

public float[] getDocumentVector(java.lang.String[] document,
java.util.Map<java.lang.String,java.lang.Number> weights,
boolean idfWeighting)
Get a Document Vector, with the same dimensionality as the RandomLabels in the SparseDistributedMemory, representing all words in document that are present in the SDM. This vector is weighted with the supplied <word, weight> pairs.
Parameters:
document - tokenized document where each String element represents one lexical item (i.e. a word).
weights - a Map containing <word, weight> pairs where the weight should be an object of type Number. null may be passed to indicate that no weighting should be done.
idfWeighting - true if the weighting should be modified with the Inverse Document Frequency, log2(N/n). In this case the weight in the <word, weight> pairs could be used to represent the word's Term Frequency within the document.
Returns:
an array of floats representing the Document Vector.
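
One plausible reading of the description above is a weighted sum of the vectors of all document words that are present in the index, optionally scaled by idf = log2(N/n). The sketch below follows that reading. The class DocumentVectorSketch, the extra index, documentFrequency, totalDocuments and dimensionality parameters, the default weight of 1 for words missing from weights, and the choice to count a repeated word once per occurrence are all assumptions made for illustration.

```java
import java.util.Map;

// Hypothetical helper, not part of moj.ri.
class DocumentVectorSketch {
    /**
     * Sums the index vectors of the document's words, each scaled by its weight.
     * With idfWeighting, every indexed word needs a document frequency of at least 1.
     */
    static float[] documentVector(String[] document, Map<String, Number> weights,
                                  boolean idfWeighting, Map<String, float[]> index,
                                  Map<String, Integer> documentFrequency,
                                  int totalDocuments, int dimensionality) {
        float[] sum = new float[dimensionality];
        for (String word : document) {
            float[] vector = index.get(word);
            if (vector == null) continue;          // words absent from the index are skipped
            double weight = 1.0;                   // assumption: default weight when none is supplied
            if (weights != null && weights.containsKey(word)) {
                weight = weights.get(word).doubleValue();
            }
            if (idfWeighting) {
                int n = documentFrequency.get(word);
                weight *= Math.log((double) totalDocuments / n) / Math.log(2); // log2(N/n)
            }
            for (int d = 0; d < dimensionality; d++) {
                sum[d] += (float) (weight * vector[d]);
            }
        }
        return sum;
    }
}
```

Passing each word's term frequency as its weight and setting idfWeighting to true gives a tf*idf-weighted sum, which matches the use suggested for the weights parameter.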