java.lang.Object
  moj.ri.RandomIndex
    moj.ri.SparseDistributedMemory
public class SparseDistributedMemory
extends RandomIndex
SparseDistributedMemory extends RandomIndex with saving and loading of the RandomIndex. The index is saved as one very large XML file, so saving to and loading from a zip-compressed archive is also provided. This usually gives a compression rate of about 98% without loss of speed, but be prepared: converting all the byte and float vectors to strings takes time. SparseDistributedMemory also extends RandomIndex with functions for extracting different types of ordered subsets of the RandomIndex.
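To illustrate why a zip archive pays off for a large, repetitive XML file, here is a minimal sketch of a zip round trip using only `java.util.zip`. It is not the library's implementation: the class `ZipXmlSketch`, its methods, and the XML layout in `main` are hypothetical, and everything is kept in memory for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipXmlSketch {

    /** Compress one XML document into an in-memory zip archive with a single entry. */
    static byte[] zip(String entryName, String xml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Read the first entry of a zip archive back into a string. */
    static String unzip(byte[] archive) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            if (zip.getNextEntry() == null) {
                throw new IOException("archive has no entries");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = zip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, highly repetitive XML standing in for a saved index.
        StringBuilder xml = new StringBuilder("<index>");
        for (int i = 0; i < 1000; i++) {
            xml.append("<label word=\"w").append(i).append("\">0 0 0 1 0 0 -1 0</label>");
        }
        xml.append("</index>");

        byte[] archive = zip("index.xml", xml.toString());
        System.out.println("raw chars: " + xml.length() + ", zipped bytes: " + archive.length);
        System.out.println("round trip ok: " + unzip(archive).equals(xml.toString()));
    }
}
```

The actual ratio depends on the data; vectors that are mostly zeros rendered as text compress very well, which is consistent with the figure quoted above.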
Constructor Summary | |
---|---|
SparseDistributedMemory() | Create a new SparseDistributedMemory of RandomLabels with a dimensionality of 1800, a degree of initial randomness of 8 and a window size for contextual updates of 3 (i.e. …) |
SparseDistributedMemory(int dimensionality, int randomDegree, int seed, int leftWindowSize, int rightWindowSize, WeightingScheme weightingScheme) | Create a new SparseDistributedMemory of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates. |
Method Summary | |
---|---|
int | addTextFromFile(java.lang.String filename) - Adds text from a text file to the RandomIndex. |
java.lang.String[] | getCorrelations(java.lang.String word) - Generate a set of "semantic relatives" for a given word. |
java.lang.String[] | getCorrelations(java.lang.String word, int setSize) - Generate a set of "semantic relatives" for a given word. |
java.lang.String[] | getCorrelations(java.lang.String word, int setSize, long minTermFrequency, long maxTermFrequency) - Generate a set of "semantic relatives" for a given word. |
java.lang.String[] | getCorrelations(java.lang.String word, int setSize, long minTermFrequency, long maxTermFrequency, java.util.HashSet<java.lang.String> restrictedResultSet) - Generate a set of "semantic relatives" for a given word. |
float[] | getDocumentVector(java.lang.String[] document, java.util.Map<java.lang.String,java.lang.Number> weights, boolean idfWeighting) - Get a Document Vector, with the same dimensionality as the RandomLabels in the SparseDistributedMemory, representing all words in document that are present in the SDM. |
java.lang.String[] | getTfIDfRank(int setSize) - Generate the setSize best "descriptors" (i.e. the index terms with the highest information value according to tf*idf). |
int | load(java.lang.String filename) - Load RandomIndex from file. |
int | load(java.lang.String filename, java.util.Set<java.lang.String> vocabulary) - Load RandomIndex from file. |
int | loadCompressed(java.lang.String filename) - Load RandomIndex from compressed file (i.e. …) |
int | loadCompressed(java.lang.String filename, java.util.Set<java.lang.String> vocabulary) - Load RandomIndex from compressed file (i.e. …) |
int | save(java.lang.String filename) - Saves the RandomIndex to the given filename with the extension ".xml" added. |
int | saveCompressed(java.lang.String filename) - Saves the RandomIndex to the given filename as a zip-file with the extension ".xml.zip" added. |
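The method details for getCorrelations describe the result as the closest index terms sorted by cosine. A minimal sketch of that kind of ranking follows, assuming each index term maps to a float vector. `CorrelationSketch`, `cosine` and `closest` are hypothetical names rather than part of moj.ri. Excluding the query word from its own result is an assumption, and the Euclidean distance and term-frequency filters are left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class CorrelationSketch {

    /** Cosine of the angle between two vectors of equal length; 0 if either is all zeros. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Names of up to setSize terms closest to word, most similar first. */
    static String[] closest(Map<String, float[]> index, String word, int setSize) {
        float[] focus = index.get(word);
        if (focus == null || setSize <= 0) {
            return new String[0];              // unknown word or empty request
        }
        List<String> terms = new ArrayList<>(index.keySet());
        terms.remove(word);                    // assumption: a word is not its own relative
        terms.sort(Comparator.comparingDouble((String t) -> cosine(focus, index.get(t))).reversed());
        int n = Math.min(setSize, terms.size());
        return terms.subList(0, n).toArray(new String[0]);
    }
}
```

As in the documented behaviour, asking for more terms than the index holds simply yields a shorter array.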
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
public SparseDistributedMemory(int dimensionality, int randomDegree, int seed, int leftWindowSize, int rightWindowSize, WeightingScheme weightingScheme)
Create a new SparseDistributedMemory of RandomLabels with the given dimensionality, degree of initial randomness and window size for contextual updates.
dimensionality - The dimensionality all RandomLabels in the SparseDistributedMemory should have. This cannot be altered at a later stage.
randomDegree - The number of random values all RandomLabels in the SparseDistributedMemory should initially have.
seed - A seed for each label's random generator. This seed, in combination with the word, makes it very likely that the created random label is unique yet reproducible.
leftWindowSize - The maximum number of words behind the focus word to include in the context window when updating a label.
rightWindowSize - The maximum number of words in front of the focus word to include in the context window when updating a label.
weightingScheme - A visiting object defining the weighting of the context labels to the left and to the right of the focus word.

public SparseDistributedMemory()
Create a new SparseDistributedMemory of RandomLabels with a dimensionality of 1800, a degree of initial randomness of 8 and a window size for contextual updates of 3 (i.e. …)
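The interplay of dimensionality, randomDegree and seed can be sketched as follows. This is an illustration under stated assumptions, not the library's RandomLabel: the class `RandomLabelSketch` is hypothetical, the +1/-1 values are an assumption, and so is the way the seed and the word are mixed into the generator seed.

```java
import java.util.Random;

final class RandomLabelSketch {

    /**
     * Build a sparse label: all zeros except randomDegree positions set to +1 or -1.
     * The same word and seed always produce the same label.
     */
    static byte[] label(String word, int dimensionality, int randomDegree, int seed) {
        if (randomDegree > dimensionality) {
            throw new IllegalArgumentException("randomDegree must not exceed dimensionality");
        }
        // Assumption of this sketch: combine seed and word via the word's hash code.
        Random random = new Random(31L * seed + word.hashCode());
        byte[] label = new byte[dimensionality];
        int placed = 0;
        while (placed < randomDegree) {
            int position = random.nextInt(dimensionality);
            if (label[position] == 0) {        // skip positions that are already set
                label[position] = (byte) (random.nextBoolean() ? 1 : -1);
                placed++;
            }
        }
        return label;
    }
}
```

Calling `RandomLabelSketch.label("memory", 1800, 8, 42)` twice returns equal arrays, which is the reproducibility property the seed parameter describes. Two different words usually get different labels, but a collision is not impossible.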
Method Detail |
---|
public int load(java.lang.String filename)
Load RandomIndex from file.
filename - the path and filename (without extension, '.xml' will be added) that the RandomIndex is to be loaded from.
Returns the number of RandomLabels loaded.

public int load(java.lang.String filename, java.util.Set<java.lang.String> vocabulary)
Load RandomIndex from file.
filename - the path and filename (without extension, '.xml' will be added) that the RandomIndex is to be loaded from.
vocabulary - the set of words to load from the RandomIndex, or null if all words are to be loaded into memory.
Returns the number of RandomLabels loaded.

public int loadCompressed(java.lang.String filename)
Load RandomIndex from compressed file (i.e. …)
filename - the path and filename (without extension, '.xml.zip' will be added) that the RandomIndex is to be loaded from.
Returns the number of RandomLabels loaded.

public int loadCompressed(java.lang.String filename, java.util.Set<java.lang.String> vocabulary)
Load RandomIndex from compressed file (i.e. …)
filename - the path and filename (without extension, '.xml.zip' will be added) that the RandomIndex is to be loaded from.
vocabulary - the set of words to load from the RandomIndex, or null if all words are to be loaded into memory.
Returns the number of RandomLabels loaded.

public int save(java.lang.String filename)
Saves the RandomIndex to the given filename with the extension ".xml" added.
filename - file with full path to save the RandomIndex to.
Returns the number of RandomLabels that were saved.

public int saveCompressed(java.lang.String filename)
Saves the RandomIndex to the given filename as a zip-file with the extension ".xml.zip" added.
filename - file with full path to save the compressed RandomIndex to.
Returns the number of RandomLabels that were saved.

public int addTextFromFile(java.lang.String filename)
Adds text from a text file to the RandomIndex, i.e. the text in the text file is read and words, in the order they are encountered in the text, are added to the RandomIndex if they aren't already represented in the index and contextually updated if they are already present.
filename - name, with full path, of the text file whose text is to be added to the RandomIndex.
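The add-or-update pass over a text, with a context window around each focus word, can be sketched as below. This is a simplified illustration: plain neighbour counts stand in for the label-vector update, which this page does not spell out, and `ContextWindowSketch`, `window` and `addText` are hypothetical names. The window sizes mirror the leftWindowSize and rightWindowSize constructor parameters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ContextWindowSketch {

    /** Words within the window around tokens[focus], clipped at the text boundaries. */
    static List<String> window(String[] tokens, int focus, int leftWindowSize, int rightWindowSize) {
        List<String> context = new ArrayList<>();
        int from = Math.max(0, focus - leftWindowSize);
        int to = Math.min(tokens.length - 1, focus + rightWindowSize);
        for (int i = from; i <= to; i++) {
            if (i != focus) {
                context.add(tokens[i]);
            }
        }
        return context;
    }

    /** For every word, count how often each neighbour appears inside its window. */
    static Map<String, Map<String, Integer>> addText(String[] tokens, int leftWindowSize, int rightWindowSize) {
        Map<String, Map<String, Integer>> index = new HashMap<>();
        for (int focus = 0; focus < tokens.length; focus++) {
            // New words get an empty entry; words already present are updated in place.
            Map<String, Integer> neighbours = index.computeIfAbsent(tokens[focus], w -> new HashMap<>());
            for (String neighbour : window(tokens, focus, leftWindowSize, rightWindowSize)) {
                neighbours.merge(neighbour, 1, Integer::sum);
            }
        }
        return index;
    }
}
```

For the tokens `the quick brown fox jumps` with a left window of 1 and a right window of 2, the context of `brown` is `quick`, `fox` and `jumps`.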
public java.lang.String[] getCorrelations(java.lang.String word, int setSize, long minTermFrequency, long maxTermFrequency, java.util.HashSet<java.lang.String> restrictedResultSet)
Generate a set of "semantic relatives" for a given word. The returned String array contains the setSize closest index terms sorted by cosine, together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty, that is isEmpty() == true, a zero-sized String array will be returned.
word - index term we want to generate a set of semantic relatives for.
setSize - number of desired members of the generated set of semantic relatives.
minTermFrequency - minimum TermFrequency required for a semantic relative to be included in the set.
maxTermFrequency - maximum TermFrequency allowed for a semantic relative to be included in the set.
restrictedResultSet - HashSet with index terms that are allowed to end up in the generated set of semantic relatives.
Returns a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise a String array of size SparseDistributedMemory.size().

public java.lang.String[] getCorrelations(java.lang.String word, int setSize, long minTermFrequency, long maxTermFrequency)
Generate a set of "semantic relatives" for a given word. The returned String array contains the setSize closest index terms sorted by cosine, together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty, that is isEmpty() == true, a zero-sized String array will be returned.
word - index term we want to generate a set of semantic relatives for.
setSize - number of desired members of the generated set of semantic relatives.
minTermFrequency - minimum TermFrequency required for a semantic relative to be included in the set.
maxTermFrequency - maximum TermFrequency allowed for a semantic relative to be included in the set.
Returns a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise a String array of size SparseDistributedMemory.size().

public java.lang.String[] getCorrelations(java.lang.String word, int setSize)
Generate a set of "semantic relatives" for a given word. The returned String array contains the setSize closest index terms sorted by cosine, together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty, that is isEmpty() == true, a zero-sized String array will be returned.
word - index term we want to get a set of "semantic relatives" for.
setSize - number of desired members of the generated set of semantic relatives.
Returns a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise a String array of size SparseDistributedMemory.size().

public java.lang.String[] getCorrelations(java.lang.String word)
Generate a set of "semantic relatives" for a given word. The returned String array contains the 10 (or fewer) closest index terms sorted by cosine, together with their respective Euclidean distance to the given word. However, if the SparseDistributedMemory is empty, that is isEmpty() == true, a zero-sized String array will be returned.
word - index term we want to get a set of 10 "semantic relatives" for.
Returns a String array of size 10 if SparseDistributedMemory.size() >= 10, otherwise a String array of size SparseDistributedMemory.size().

public java.lang.String[] getTfIDfRank(int setSize)
Generate the setSize best "descriptors" (i.e. the index terms with the highest information value according to tf*idf).
setSize - number of desired members of the generated set of top ranking index terms.
Returns a String array of size setSize if setSize <= SparseDistributedMemory.size(), otherwise a String array of size SparseDistributedMemory.size(). The String array contains the top index terms sorted by tf*idf together with their tf*idf value ("term=value"). However, if the SparseDistributedMemory is empty, that is isEmpty() == true, a zero-sized String array will be returned.

public float[] getDocumentVector(java.lang.String[] document, java.util.Map<java.lang.String,java.lang.Number> weights, boolean idfWeighting)
Get a Document Vector, with the same dimensionality as the RandomLabels in the SparseDistributedMemory, representing all words in document that are present in the SDM. This vector is weighted with the weights supplied as <word, weight>-pairs in weights.
document - tokenized document where each String element represents one lexical item (i.e. a word).
weights - a Map containing <word, weight>-pairs where the weight should be an object of type Number. Passing null indicates that no weighting should be done.
idfWeighting - true if the weighting should be modified with the Inverse Document Frequency (log2(N/n)). In this case the weight in the <word, weight>-pairs could be used to represent the word's Term Frequency within the document.
Returns an array of floats representing the Document Vector.
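How the weights and the idf factor could combine into one vector is sketched below. It assumes the document vector is a weighted sum of per-word vectors, which this page does not state, and it reads N as the total number of documents and n as the number of documents containing the word. `DocumentVectorSketch` and its extra parameters (`vectors`, `documentFrequency`, `totalDocuments`, `dimensionality`) are hypothetical stand-ins for state the real class keeps internally.

```java
import java.util.Map;

final class DocumentVectorSketch {

    /** log2(N / n): N documents in total, n of them containing the word. */
    static double idf(long totalDocuments, long documentsWithWord) {
        return Math.log((double) totalDocuments / documentsWithWord) / Math.log(2);
    }

    /**
     * Sum the vectors of all known words in the document, scaled by their weight.
     * Pass null for weights to skip weighting and null for documentFrequency to skip idf.
     */
    static float[] documentVector(String[] document,
                                  Map<String, float[]> vectors,
                                  Map<String, ? extends Number> weights,
                                  Map<String, Long> documentFrequency,
                                  long totalDocuments,
                                  int dimensionality) {
        float[] result = new float[dimensionality];
        for (String word : document) {
            float[] vector = vectors.get(word);
            if (vector == null) {
                continue;                          // word is not present in the index
            }
            double weight = 1.0;                   // default when no weight is supplied
            if (weights != null && weights.get(word) != null) {
                weight = weights.get(word).doubleValue();
            }
            if (documentFrequency != null) {
                Long n = documentFrequency.get(word);
                if (n != null && n > 0) {
                    weight *= idf(totalDocuments, n);
                }
            }
            for (int i = 0; i < dimensionality; i++) {
                result[i] += (float) (weight * vector[i]);
            }
        }
        return result;
    }
}
```

In this sketch every token occurrence contributes once, so a caller who passes term frequencies as weights would want to pass each distinct word only once in `document`.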