I am aware that there is a lot of ugly code here, but I don't really feel
like doing anything about that now. If you find any errors or inefficiencies
in the code, please feel free to let me know, and I will most likely update
the code. This tagger has been described in a paper (though the paper does
not cover the implementation details):
@inproceedings{stomp,
  author = {Jonas Sj\"{o}bergh},
  title = {Stomp, a {POS}-tagger with a different view},
  booktitle = "Proceedings of RANLP-2003",
  address = "Borovets, Bulgaria",
  year = "2003"
}
available here:
http://www.nada.kth.se/~jsh/publications/stomp03.ps

To do anything useful you need some training data for the tagger. It is
intended for part-of-speech tagging, but any other mark-up task where
context gives a useful clue should be possible (shallow parsing would be one
example). The training data should be formatted with one word (or whatever
you want to annotate) and one tag (or whatever label you use) on each line,
separated by a tab. Stomp also needs a list of all tags used in the training
data (though it is trivial to change the code to infer this from the training
data), with one tag on each line.
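
The tag list can also be produced from the training data itself. The
following is only a minimal sketch of that idea, not code taken from Stomp:
the function name collect_tags is made up for illustration, and it assumes
that every useful line has exactly the "word<TAB>tag" layout described above.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper, not part of Stomp: reads "word<TAB>tag" lines and
// returns the distinct tags, i.e. the lines a tag list file would contain.
std::set<std::string> collect_tags(std::istream& corpus) {
    std::set<std::string> tags;
    std::string line;
    while (std::getline(corpus, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);
        std::string::size_type tab = line.find('\t');
        if (tab == std::string::npos)
            continue;  // skip blank or malformed lines
        std::string tag = line.substr(tab + 1);
        if (!tag.empty())
            tags.insert(tag);
    }
    return tags;
}
```

Writing each element of the returned set on its own line gives a file in the
tag list format.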

You run the tagger like this:
stomp <directory with training data> <file to annotate>

In the directory there should be two files, one named 'corpus' containing the
training data and one named 'taglex' containing the list of tags. The file to
annotate should have white space separated tokens (i.e. "Hello (again)."
should be written like "Hello ( again ) ." or similarly). A tiny example
corpus is included.
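
If your text is not tokenized yet, a small pre-processing step can do it.
This is a rough sketch only, not something shipped with Stomp: the function
name separate_punctuation is made up, and it naively treats every ASCII
punctuation character as its own token, so it will also split things like
"don't" or "3.14" that you may want to keep together.

```cpp
#include <cctype>
#include <string>

// Hypothetical pre-processing helper, not part of Stomp: surrounds every
// ASCII punctuation character with spaces and collapses runs of white space,
// so that the result consists of white space separated tokens.
std::string separate_punctuation(const std::string& text) {
    std::string out;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::ispunct(c)) {
            if (!out.empty() && out[out.size() - 1] != ' ')
                out += ' ';
            out += text[i];
            out += ' ';
        } else if (std::isspace(c)) {
            if (!out.empty() && out[out.size() - 1] != ' ')
                out += ' ';
        } else {
            out += text[i];
        }
    }
    // Drop the single trailing space that the loop may leave behind.
    if (!out.empty() && out[out.size() - 1] == ' ')
        out.erase(out.size() - 1);
    return out;
}
```

Calling separate_punctuation("Hello (again).") returns "Hello ( again ) .".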

I used gcc 3.3 for Solaris when compiling Stomp. Any standard-compliant C++
compiler should be OK, as long as you change the typedefs for the maps (using
the standard map works, but is very slow). Any gcc after 3.0 should work,
using the hash_map provided with the compiler.
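
As an illustration of what changing the typedefs can look like, here is a
sketch under assumptions: the names WordCountMap, STOMP_USE_STD_MAP and
count_word are invented for this example, and the typedefs in the Stomp
source may be named and keyed differently. On a C++11 or later compiler,
std::unordered_map is the standard hashed container and can take the place
of the compiler-specific hash_map.

```cpp
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical typedef: pick the map type in one place so the rest of the
// code does not care which container is used.
#ifdef STOMP_USE_STD_MAP
typedef std::map<std::string, int> WordCountMap;            // ordered tree
#else
typedef std::unordered_map<std::string, int> WordCountMap;  // hashed, C++11
#endif

// Hypothetical helper showing that code using the typedef stays unchanged:
// increments and returns the count stored for a word.
int count_word(WordCountMap& counts, const std::string& word) {
    return ++counts[word];
}
```

Defining STOMP_USE_STD_MAP at compile time would switch this sketch to the
ordered std::map without touching any other code.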

Some things to do to make Stomp more useful:
* Handle numbers 
* Handle other easily recognized things (such as date expressions)
