moj.ri
Class RandomLabel

java.lang.Object
  extended by moj.ri.RandomLabel
All Implemented Interfaces:
java.io.Serializable, java.lang.Comparable<java.lang.Object>

public class RandomLabel
extends java.lang.Object
implements java.lang.Comparable<java.lang.Object>, java.io.Serializable

A RandomLabel consists of a word (or term), a term frequency count, a document frequency count, a randomly initialised label and a contextually updated (weighted window) context vector. This container is used by RandomIndex to store the index terms and their corresponding frequency and context data.

Version:
2007-Sept-13
Author:
Martin Hassel
See Also:
Serialized Form

Constructor Summary
RandomLabel()
          Creates an empty RandomLabel with a dimensionality of 0 (zero).
RandomLabel(java.lang.String word)
          Defaults the dimensionality to 1800 and the randomness to 8, i.e. four +1 and four -1.
RandomLabel(java.lang.String word, int dimensionality, int randomDegree, int seed)
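          To construct a new RandomLabel object we need the word that is to be 'labelled', the length of the label (i.e. the label's dimensionality), the randomness of the label's initial state (i.e. number of non-zero elements) and a seed.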
RandomLabel(java.lang.String word, java.lang.String metadata, long termFrequency, int docFrequency, int[] positivePositions, int[] negativePositions, float[] context)
          Construct new RandomLabel from existing data.
 
Method Summary
 int compareTo(java.lang.Object label)
          Compares this RandomLabel with the specified RandomLabel for order on basis of term frequency.
 float cosineSim(RandomLabel label)
          Calculate the cosine similarity between this RandomLabel and the given RandomLabel.
 float[] getContext()
          Get the context vector associated to the word.
 int getDimensionality()
          Get the dimensionality of the random label associated to the word.
 int getDocumentFrequency()
          Get the document frequency for the current word.
 java.lang.String getMetaData()
          Get String based meta data for the RandomLabel.
 int[] getNegativePositions()
          Get the negative positions in the random label associated to the word.
 int[] getPositivePositions()
          Get the positive positions in the random label associated to the word.
 long getTermFrequency()
          Get the term frequency for the current word.
 java.lang.String getWord()
          Get the word which the RandomLabel is associated to.
 int incrementDocumentFrequency()
          Increment the document frequency for the current word.
 long incrementTermFrequency()
          Increment the term frequency for the current word.
 void prune()
          Prunes the RandomLabel by setting the context vector to null.
 java.lang.String setMetaData(java.lang.String metadata)
          Set String based meta data for the RandomLabel.
 java.lang.String toString()
          String representation of the RandomLabel.
 boolean updateContext(RandomLabel[] leftContext, RandomLabel[] rightContext, WeightingScheme weightingScheme)
          Update this RandomLabel's context with the weighted labels of the RandomLabels in left and right context using the supplied weighting scheme (as defined by a visiting object weightingScheme).
 boolean validState()
          Tells whether the RandomLabel is in a valid state or not.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

RandomLabel

public RandomLabel(java.lang.String word,
                   int dimensionality,
                   int randomDegree,
                   int seed)
To construct a new RandomLabel object we need the word that is to be 'labelled', the length of the label (i.e. the label's dimensionality), the randomness of the label's initial state (i.e. number of non-zero elements) and a seed.

Parameters:
word - the word which the label is to be associated to.
dimensionality - the length of the label to be associated to the word.
randomDegree - the 'degree' of randomness initially applied to the label, given as the total number of non-zero elements in the initial label (an even number). It should not be greater than dimensionality; if it is, it defaults to dimensionality.
seed - a seed for the local random generator. This seed, in combination with word, makes it very likely that the created random label is "unique" yet reproducible.
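
The parameters above can be illustrated with a minimal sketch of how a sparse random label might be drawn. This is not the library's implementation: the class name RandomLabelSketch, the way word and seed are combined into one generator seed, and the rounding of randomDegree down to an even number are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of moj.ri: sketches how a sparse label of +1s and -1s could be drawn.
final class RandomLabelSketch {
    final int[] positivePositions; // indices that hold +1
    final int[] negativePositions; // indices that hold -1

    RandomLabelSketch(String word, int dimensionality, int randomDegree, int seed) {
        // Clamp to the dimensionality as the parameter description suggests;
        // rounding down to an even count is an assumption.
        int nonZeros = Math.min(randomDegree, dimensionality);
        nonZeros -= nonZeros % 2;
        // Assumption: word and seed are combined into a single generator seed,
        // so the same (word, seed) pair always yields the same label.
        Random rng = new Random(31L * word.hashCode() + seed);
        // Partial Fisher-Yates shuffle picks nonZeros distinct positions.
        int[] positions = new int[dimensionality];
        for (int i = 0; i < dimensionality; i++) {
            positions[i] = i;
        }
        for (int i = 0; i < nonZeros; i++) {
            int j = i + rng.nextInt(dimensionality - i);
            int tmp = positions[i];
            positions[i] = positions[j];
            positions[j] = tmp;
        }
        // First half of the picked positions gets +1, second half gets -1.
        positivePositions = Arrays.copyOfRange(positions, 0, nonZeros / 2);
        negativePositions = Arrays.copyOfRange(positions, nonZeros / 2, nonZeros);
    }
}
```

Calling new RandomLabelSketch("cat", 1800, 8, 123) twice gives identical position arrays of length four each, which is the "unique yet reproducible" property the seed description refers to.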

RandomLabel

public RandomLabel(java.lang.String word)
Defaults the dimensionality to 1800 and the randomness to 8, i.e. four +1 and four -1. Seed is set to 123.

Parameters:
word - the word which the label is to be associated to.

RandomLabel

public RandomLabel()
Creates an empty RandomLabel with a dimensionality of 0 (zero). Useful for padding.


RandomLabel

public RandomLabel(java.lang.String word,
                   java.lang.String metadata,
                   long termFrequency,
                   int docFrequency,
                   int[] positivePositions,
                   int[] negativePositions,
                   float[] context)
Construct new RandomLabel from existing data.

Parameters:
word - the word which the label is to be associated to.
metadata - String based meta data for the RandomLabel.
termFrequency - the term frequency for the associated word.
docFrequency - the document frequency for the associated word.
positivePositions - the positions in the RandomLabel that should hold "1"
negativePositions - the positions in the RandomLabel that should hold "-1"
context - the context vector to be associated to the word (i.e. an array representing the co-occurrence "coloring").
Method Detail

getWord

public java.lang.String getWord()
Get the word which the RandomLabel is associated to.

Returns:
the word which the RandomLabel is associated to.

getMetaData

public java.lang.String getMetaData()
Get String based meta data for the RandomLabel. This can for example be used to store part of speech tags or distributional features.

Returns:
String based meta data for the RandomLabel.

setMetaData

public java.lang.String setMetaData(java.lang.String metadata)
Set String based meta data for the RandomLabel. This can for example be used to store part of speech tags or distributional features.

Parameters:
metadata - the String based meta data to store for the RandomLabel.
Returns:
String based meta data for the RandomLabel.

getNegativePositions

public int[] getNegativePositions()
Get the negative positions in the random label associated to the word.

Returns:
the negative part of the randomly generated label associated to the word.

getPositivePositions

public int[] getPositivePositions()
Get the positive positions in the random label associated to the word.

Returns:
the positive part of the randomly generated label associated to the word.

getContext

public float[] getContext()
Get the context vector associated to the word.

Returns:
the contextually updated (weighted window) context vector associated to the word, or null if the label has been pruned.

prune

public void prune()
Prunes the RandomLabel by setting the context vector to null.


getDimensionality

public int getDimensionality()
Get the dimensionality of the random label associated to the word.

Returns:
the dimensionality (i.e. length) of the randomly generated label associated to the word.

getTermFrequency

public long getTermFrequency()
Get the term frequency for the current word.

Returns:
the term frequency, i.e. the number of contextual updates (including initialization) the RandomLabel has gone through.

incrementTermFrequency

public long incrementTermFrequency()
Increment the term frequency for the current word.

Returns:
the term frequency, i.e. the number of contextual updates (including initialization) the RandomLabel has gone through.

getDocumentFrequency

public int getDocumentFrequency()
Get the document frequency for the current word.

Returns:
the document frequency, i.e. the number of documents the word associated with this RandomLabel occurs in.

incrementDocumentFrequency

public int incrementDocumentFrequency()
Increment the document frequency for the current word.

Returns:
the document frequency, i.e. the number of documents the word associated with this RandomLabel occurs in after the increment.

validState

public boolean validState()
Tells whether the RandomLabel is in a valid state or not. A RandomLabel is typically in an invalid state during initialisation and while its context vector is being updated.

Returns:
true if in a valid state, otherwise false

updateContext

public boolean updateContext(RandomLabel[] leftContext,
                             RandomLabel[] rightContext,
                             WeightingScheme weightingScheme)
Update this RandomLabel's context with the weighted labels of the RandomLabels in the left and right context using the supplied weighting scheme (as defined by a visiting object weightingScheme). The RandomLabel of the word nearest to this one is the first element of each context array (i.e. each side of the context window), the second nearest is the second element, and so on (note: this applies to both rightContext and leftContext). All RandomLabels in leftContext and rightContext must have the same dimensionality. However, leftContext and rightContext themselves do not have to be of equal length (i.e. you can have an unbalanced context window).

Parameters:
leftContext - an array of RandomLabels where the first element represents the word closest to the left of the word whose label is being updated, the second element represents the word second closest to the left, and so on. No slot may be empty (null) as this will cause a NullPointerException.
rightContext - same as for leftContext but for the right side.
weightingScheme - a visiting object that contains the methods for calculating the weights for the left and right contexts, respectively, based upon distance to the current label.
Returns:
true upon success, otherwise false. The most common reason for failure is that not all RandomLabels have the same dimensionality.
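
The update described above can be sketched as follows. The real RandomLabel and WeightingScheme types are not shown in this document, so the Label and Weighting types below are hypothetical stand-ins, and the 1/distance weighting mentioned in the usage note is only an example. The sketch assumes each neighbour contributes its sparse label, scaled by a distance-dependent weight, to the context vector.

```java
// Hypothetical stand-ins; not the moj.ri API.
final class ContextUpdateSketch {
    /** Hypothetical weighting: maps a distance (1 = nearest neighbour) to a weight. */
    interface Weighting {
        float weight(int distance);
    }

    /** A neighbour's sparse label: indices holding +1 and indices holding -1. */
    static final class Label {
        final int dimensionality;
        final int[] positive;
        final int[] negative;

        Label(int dimensionality, int[] positive, int[] negative) {
            this.dimensionality = dimensionality;
            this.positive = positive;
            this.negative = negative;
        }
    }

    /** Adds the weighted neighbour labels to context; returns false on a dimensionality mismatch. */
    static boolean updateContext(float[] context, Label[] left, Label[] right,
                                 Weighting leftWeighting, Weighting rightWeighting) {
        // Check all neighbours first so a failed update leaves the context untouched.
        for (Label l : left) {
            if (l.dimensionality != context.length) return false;
        }
        for (Label l : right) {
            if (l.dimensionality != context.length) return false;
        }
        addWeighted(context, left, leftWeighting);
        addWeighted(context, right, rightWeighting);
        return true;
    }

    private static void addWeighted(float[] context, Label[] window, Weighting weighting) {
        for (int i = 0; i < window.length; i++) {
            // Element 0 is the nearest neighbour, so its distance is 1.
            float w = weighting.weight(i + 1);
            for (int p : window[i].positive) context[p] += w;
            for (int p : window[i].negative) context[p] -= w;
        }
    }
}
```

With a weighting such as d -> 1.0f / d, the nearest neighbour on each side contributes its label at full weight and the second nearest at half weight. The two windows may differ in length, matching the unbalanced-window note above.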

cosineSim

public float cosineSim(RandomLabel label)
Calculate the cosine similarity between this RandomLabel and the given RandomLabel.

Parameters:
label - the RandomLabel that is to be compared with this RandomLabel.
Returns:
the cosine similarity between this RandomLabel and the given RandomLabel.
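
For reference, cosine similarity between two vectors is their dot product divided by the product of their Euclidean norms. A minimal sketch over plain float arrays follows. The class name CosineSketch is hypothetical, and which vectors the library actually compares, as well as its handling of zero-length vectors, are not stated in this document. Returning 0 for a zero vector is an assumption here.

```java
// Hypothetical helper, not part of moj.ri.
final class CosineSketch {
    /** Cosine similarity: dot(a, b) / (|a| * |b|). */
    static float cosineSim(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimensionality");
        }
        // Accumulate in double to limit rounding error over long vectors.
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        // Assumption: a zero vector is treated as dissimilar to everything.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0f;
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }
}
```

Parallel vectors score close to 1, orthogonal vectors score 0 and opposite vectors score close to -1. Unlike Euclidean distance, the result does not depend on vector length.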

compareTo

public int compareTo(java.lang.Object label)
              throws java.lang.ClassCastException
Compares this RandomLabel with the specified RandomLabel for order on basis of term frequency. Returns a negative integer, zero, or a positive integer as this RandomLabel is less than, equal to, or greater than the specified RandomLabel. This gives descending term frequency when used in a sort. Note: this class has a natural ordering that is inconsistent with equals.

Specified by:
compareTo in interface java.lang.Comparable<java.lang.Object>
Parameters:
label - the RandomLabel to be compared.
Returns:
a negative integer, zero, or a positive integer as this object is less than, equal to, or greater than the specified RandomLabel.
Throws:
java.lang.ClassCastException
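
A minimal sketch of an ordering that sorts by descending term frequency, as described above. TermEntry is a hypothetical stand-in for RandomLabel, and reversing the arguments of Long.compare is only one way to obtain the descending order. Because equals is not overridden, two entries with equal frequency compare as 0 without being equal, which is what "inconsistent with equals" means.

```java
import java.util.Arrays;

// Hypothetical stand-in for RandomLabel, reduced to the fields the ordering needs.
final class TermEntry implements Comparable<TermEntry> {
    final String word;
    final long termFrequency;

    TermEntry(String word, long termFrequency) {
        this.word = word;
        this.termFrequency = termFrequency;
    }

    @Override
    public int compareTo(TermEntry other) {
        // Arguments are reversed so that larger frequencies sort first.
        return Long.compare(other.termFrequency, this.termFrequency);
    }

    public static void main(String[] args) {
        TermEntry[] entries = {
            new TermEntry("cat", 3), new TermEntry("the", 50), new TermEntry("sat", 7)
        };
        Arrays.sort(entries);
        for (TermEntry e : entries) {
            System.out.println(e.word + " " + e.termFrequency);
        }
    }
}
```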

toString

public java.lang.String toString()
String representation of the RandomLabel.

Overrides:
toString in class java.lang.Object
Returns:
a comma-separated string where the first column is the word, the second is the meta data, the third is the term frequency, and the fourth is the document frequency. Next comes a column giving the number of +1/-1 entries, followed by the positions of that many +1s and then of that many -1s. Lastly comes a column giving the dimensionality of the context vector, followed by the context vector itself (until the end of the string).
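
The column layout can be made concrete with a small sketch. This is one reading of the description, not the library's code: LabelFormatSketch is a hypothetical name, the count column is assumed to be the number of positions per sign, and the context vector is assumed to be non-null (i.e. the label has not been pruned).

```java
// Hypothetical formatter, not part of moj.ri.
final class LabelFormatSketch {
    static String format(String word, String metadata, long termFrequency, int docFrequency,
                         int[] positive, int[] negative, float[] context) {
        StringBuilder sb = new StringBuilder();
        // Columns 1-4: word, meta data, term frequency, document frequency.
        sb.append(word).append(',').append(metadata).append(',')
          .append(termFrequency).append(',').append(docFrequency);
        // Assumption: the count column holds the number of entries per sign,
        // followed by the +1 positions and then the -1 positions.
        sb.append(',').append(positive.length);
        for (int p : positive) sb.append(',').append(p);
        for (int p : negative) sb.append(',').append(p);
        // Dimensionality of the context vector, then its values until the end.
        sb.append(',').append(context.length);
        for (float v : context) sb.append(',').append(v);
        return sb.toString();
    }
}
```

Under these assumptions, a label for "cat" with meta data "NN", term frequency 3, document frequency 2, +1 at positions 5 and 9, -1 at positions 1 and 7, and context vector {0.5, -1.0, 0.0} would be written as cat,NN,3,2,2,5,9,1,7,3,0.5,-1.0,0.0.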