moj.util
Class TagStripper

java.lang.Object
  extended by moj.util.TagStripper

public class TagStripper
extends java.lang.Object

A small utility class that contains methods for stripping tags from documents. It tries to retain the structure of the original text by converting line breaks, paragraphs, headings, etc. to ordinary (hard) single and double line breaks. It also strips comments and converts entities. The design of this class is heavily influenced by an example in "Web Client Programming with Java" by Elliotte Rusty Harold.

Version:
2006-Dec-08
Author:
Martin Hassel

Field Summary
static java.util.regex.Pattern aamp
           
static java.util.regex.Pattern agt
           
static java.util.regex.Pattern alt
           
static java.util.regex.Pattern aquot
           
static java.util.regex.Pattern aring
           
static java.util.regex.Pattern Aring
           
static java.util.regex.Pattern auml
           
static java.util.regex.Pattern Auml
           
static java.util.regex.Pattern ouml
           
static java.util.regex.Pattern Ouml
           
static java.util.regex.Pattern xamp
           
static java.util.regex.Pattern xaring
           
static java.util.regex.Pattern xAring
           
static java.util.regex.Pattern xauml
           
static java.util.regex.Pattern xAuml
           
static java.util.regex.Pattern xgt
           
static java.util.regex.Pattern xlt
           
static java.util.regex.Pattern xouml
           
static java.util.regex.Pattern xOuml
           
static java.util.regex.Pattern xquot
           
 
Constructor Summary
TagStripper()
           
 
Method Summary
static java.lang.String fromControlCodes(java.lang.String text)
          Replaces each C0/C1 control code in text with its abbreviated name.
static java.lang.String fromXMLentities(java.lang.String text)
          Converts XML entities (&lt; &gt; &amp; &quot;) to markup characters (< > & ").
static java.lang.String removeControlCodes(java.lang.String text)
          Replaces all C0/C1 control codes in text with whitespace.
 java.lang.String stripHTML(java.io.Reader reader)
          Strips HTML tags and comments from the given stream and returns a string while trying to retain the structure of the original text by converting HTML line breaks, paragraphs, headings, etc. to ordinary (hard) single and double line breaks.
 java.lang.String stripHTML(java.lang.String htmltext)
          Strips HTML tags and comments from the given string and returns a string while trying to retain the structure of the original text by converting HTML line breaks, paragraphs, headings, etc. to ordinary (hard) single and double line breaks.
static java.lang.String toXMLentities(java.lang.String text)
          Converts markup characters (< > & ") to XML entities (&lt; &gt; &amp; &quot;).
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

xlt

public static final java.util.regex.Pattern xlt

xgt

public static final java.util.regex.Pattern xgt

xamp

public static final java.util.regex.Pattern xamp

xquot

public static final java.util.regex.Pattern xquot

xaring

public static final java.util.regex.Pattern xaring

xauml

public static final java.util.regex.Pattern xauml

xouml

public static final java.util.regex.Pattern xouml

xAring

public static final java.util.regex.Pattern xAring

xAuml

public static final java.util.regex.Pattern xAuml

xOuml

public static final java.util.regex.Pattern xOuml

alt

public static final java.util.regex.Pattern alt

agt

public static final java.util.regex.Pattern agt

aamp

public static final java.util.regex.Pattern aamp

aquot

public static final java.util.regex.Pattern aquot

aring

public static final java.util.regex.Pattern aring

auml

public static final java.util.regex.Pattern auml

ouml

public static final java.util.regex.Pattern ouml

Aring

public static final java.util.regex.Pattern Aring

Auml

public static final java.util.regex.Pattern Auml

Ouml

public static final java.util.regex.Pattern Ouml

Constructor Detail

TagStripper

public TagStripper()

Method Detail

stripHTML

public java.lang.String stripHTML(java.lang.String htmltext)
Strips HTML tags and comments from the given string, converts HTML entities, and returns the result as a string. It tries to retain the structure of the original text by converting HTML line breaks, paragraphs, headings, etc. to ordinary (hard) single and double line breaks.

Parameters:
htmltext - the string to strip HTML from
Returns:
the given string stripped of HTML
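
The general approach described above can be illustrated with a minimal sketch. It is not the implementation of TagStripper: the class name StripHtmlSketch, the method strip and all four patterns are hypothetical, and entity conversion is left out. It assumes that a line break element becomes a single line break and that the end of a paragraph or heading becomes a double line break.

```java
import java.util.regex.Pattern;

class StripHtmlSketch {
    // Hypothetical patterns; the real class may use different rules.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern LINE_BREAK =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);
    private static final Pattern BLOCK_END =
            Pattern.compile("</(p|h[1-6])\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String strip(String html) {
        // Remove comments first so markup inside them is never interpreted.
        String text = COMMENT.matcher(html).replaceAll("");
        // A line break element becomes a single hard line break.
        text = LINE_BREAK.matcher(text).replaceAll("\n");
        // The end of a paragraph or heading becomes a double line break.
        text = BLOCK_END.matcher(text).replaceAll("\n\n");
        // Drop every remaining tag.
        text = ANY_TAG.matcher(text).replaceAll("");
        return text.trim();
    }

    public static void main(String[] args) {
        System.out.println(
                strip("<h1>Title</h1><p>First<br>second</p><!-- note --><p>Last</p>"));
    }
}
```

Removing comments before tags matters in this sketch, because a comment may contain a ">" character that would otherwise end the generic tag match too early.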

stripHTML

public java.lang.String stripHTML(java.io.Reader reader)
Strips HTML tags and comments from the given stream, converts HTML entities, and returns the result as a string. It tries to retain the structure of the original text by converting HTML line breaks, paragraphs, headings, etc. to ordinary (hard) single and double line breaks.

Parameters:
reader - the stream to strip HTML from
Returns:
the contents of the given stream, stripped of HTML, as a string
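
One way to handle a Reader is to read it to the end and then process the resulting string. The sketch below shows that idea only. The names ReaderStripSketch, readAll and strip are hypothetical, the tag removal is deliberately simplistic, and wrapping IOException in UncheckedIOException is an assumption: this page does not say how the real method reports read errors.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderStripSketch {
    // Reads the whole stream into memory.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            // Assumption: surface read errors as an unchecked exception.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Simplistic stand-in for tag stripping: removes anything between < and >.
    static String strip(Reader reader) {
        return readAll(reader).replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(strip(new StringReader("<b>bold</b> text")));
    }
}
```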

toXMLentities

public static java.lang.String toXMLentities(java.lang.String text)
Converts markup characters (< > & ") to XML entities (&lt; &gt; &amp; &quot;).

Parameters:
text - String (possibly) containing markup characters
Returns:
String with markup characters converted to XML entities
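
The conversion can be sketched with plain string replacement. This is an illustration under assumptions, not the code of TagStripper, and the names ToEntitiesSketch and escape are hypothetical. The ampersand is replaced first so that the ampersands introduced by the other replacements are not escaped a second time.

```java
class ToEntitiesSketch {
    static String escape(String text) {
        // "&" must go first; otherwise "&lt;" would turn into "&amp;lt;".
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // prints: a &lt; b &amp; &quot;c&quot;
        System.out.println(escape("a < b & \"c\""));
    }
}
```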

fromXMLentities

public static java.lang.String fromXMLentities(java.lang.String text)
Converts XML entities (&lt; &gt; &amp; &quot;) to markup characters (< > & ").

Parameters:
text - String (possibly) containing XML entities
Returns:
String with XML entities converted to markup characters
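
The reverse direction can be sketched the same way. Again this is an illustration only, with the hypothetical names FromEntitiesSketch and unescape. Here the ampersand entity is replaced last, so that text such as &amp;lt; decodes to the literal text &lt; rather than to a "<" character.

```java
class FromEntitiesSketch {
    static String unescape(String text) {
        // "&amp;" must go last; otherwise "&amp;lt;" would decode twice.
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // prints: <a> &lt; "x"
        System.out.println(unescape("&lt;a&gt; &amp;lt; &quot;x&quot;"));
    }
}
```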

removeControlCodes

public static java.lang.String removeControlCodes(java.lang.String text)
Replaces all C0/C1 control codes in text with whitespace.

Parameters:
text - String to remove C0/C1 control codes from
Returns:
String with C0/C1 control codes replaced by whitespace
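
A minimal sketch of this kind of replacement follows. It assumes that C0 means U+0000 to U+001F, that C1 means U+0080 to U+009F, and that every such character, including tab and line feed, becomes a single space. Whether the real method treats tab, line feed or DEL (U+007F) differently is not stated on this page. The names ControlSpaceSketch and replaceControls are hypothetical.

```java
class ControlSpaceSketch {
    static String replaceControls(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean c0 = c <= 0x1F;               // U+0000 to U+001F
            boolean c1 = c >= 0x80 && c <= 0x9F;  // U+0080 to U+009F
            sb.append(c0 || c1 ? ' ' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "a" + (char) 0x07 + "b" + (char) 0x85 + "c";
        // prints: a b c
        System.out.println(replaceControls(input));
    }
}
```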

fromControlCodes

public static java.lang.String fromControlCodes(java.lang.String text)
Replaces each C0/C1 control code in text with its abbreviated name.

Parameters:
text - String to replace C0/C1 control codes in
Returns:
String with C0/C1 control codes replaced by their abbreviated names
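
The idea can be sketched with a lookup table. The sketch covers only the C0 range (U+0000 to U+001F), using the standard abbreviations, and inserts each name with no delimiters. The exact output format of the real method, and its handling of C1 codes, are not stated on this page. The names ControlNameSketch and nameControls are hypothetical.

```java
class ControlNameSketch {
    // Standard abbreviations for the 32 C0 control codes, indexed by code point.
    private static final String[] C0_NAMES = {
        "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
        "BS",  "HT",  "LF",  "VT",  "FF",  "CR",  "SO",  "SI",
        "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
        "CAN", "EM",  "SUB", "ESC", "FS",  "GS",  "RS",  "US"
    };

    static String nameControls(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < C0_NAMES.length) {
                sb.append(C0_NAMES[c]);  // control code: emit its name
            } else {
                sb.append(c);            // anything else: copy unchanged
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: aBELb
        System.out.println(nameControls("a" + (char) 7 + "b"));
    }
}
```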