org.apache.uima.conceptMapper.support.tokenizer
Class OffsetTokenizer

java.lang.Object
  extended by org.apache.uima.analysis_engine.annotator.Annotator_ImplBase
      extended by org.apache.uima.analysis_engine.annotator.JTextAnnotator_ImplBase
          extended by org.apache.uima.conceptMapper.support.tokenizer.OffsetTokenizer
All Implemented Interfaces:
org.apache.uima.analysis_engine.annotator.BaseAnnotator, org.apache.uima.analysis_engine.annotator.JTextAnnotator

public class OffsetTokenizer
extends org.apache.uima.analysis_engine.annotator.JTextAnnotator_ImplBase

Simple class to tokenize a string (similar to java.util.StringTokenizer), except that this tokenizer returns TokenAnnotation objects, which, in addition to the token text string, also contain the start and end offsets of the token in the original string.

The tokenizer can optionally stem and case-normalize the tokens, and the set of characters that delimit tokens is configurable. The default stemmer is the Snowball Porter stemmer, but any stemmer may be supplied to the tokenizer as long as it implements the Stemmer interface.
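
The offset-tracking idea described above can be illustrated without UIMA. The following is a minimal sketch, not the class's actual implementation: OffsetTokenSketch and its Token type are hypothetical stand-ins for OffsetTokenizer and TokenAnnotation, and the end offset is assumed to be exclusive.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, UIMA-free sketch of tokenization that keeps character offsets. */
final class OffsetTokenSketch {

    /** Stand-in for TokenAnnotation: token text plus [start, end) offsets (end exclusive by assumption). */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /** Splits text on any character contained in delimiters and records where each token sits. */
    static List<Token> tokenize(String text, String delimiters) {
        List<Token> tokens = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= text.length(); i++) {
            // The end of the input closes an open token just like a delimiter does.
            boolean atDelimiter = i == text.length() || delimiters.indexOf(text.charAt(i)) >= 0;
            if (atDelimiter) {
                if (start >= 0) {
                    tokens.add(new Token(text.substring(start, i), start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i; // first character of a new token
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Heart attack, acute", " \t\n\r\f,")) {
            System.out.println(t);
        }
    }
}
```

Because each token carries its offsets, text.substring(start, end) on the original input always reproduces the token text, which is what lets downstream components map matches back to the document.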


Field Summary
static java.lang.String PARAM_CASE_MATCH
          Configuration parameter key/label for the case matching string
static java.lang.String PARAM_STEMMER_CLASS
          Configuration parameter key/label for the stemmer class spec
static java.lang.String PARAM_TOKEN_DELIM
          Configuration parameter key/label for the token delimiters string
 
Constructor Summary
OffsetTokenizer()
          Create a new OffsetTokenizer.
 
Method Summary
static java.lang.String doFoldCase(java.lang.String token)
           
static java.lang.String doStemming(java.lang.String token, Stemmer stemmer)
           
protected  void doTokenization(org.apache.uima.jcas.JCas jcas, java.lang.String documentText, java.lang.String delimiters)
           
protected  java.lang.String foldCase(java.lang.String token)
          If one of the case folding flags is true and the input string matches the character pattern corresponding to that flag, then convert all letters to lowercase.
protected  boolean getCaseFoldAll()
          Get case folding flag for folding all tokens.
protected  boolean getCaseFoldDigit()
          Get the case folding flag for folding tokens with at least one digit character.
protected  boolean getCaseFoldInitCap()
          Get case folding flag for folding tokens with initial cap.
protected  java.lang.String getDelim()
          Get the current list of delimiters used to separate the input string into tokens.
 Stemmer getStemmer()
           
protected  boolean getStemming()
          Get the current stemming flag.
 java.lang.String getText()
           
 void initialize(org.apache.uima.analysis_engine.annotator.AnnotatorContext annotatorContext)
          Initialize the annotator, which includes compiling regular expressions, fetching configuration parameters from the XML descriptor file, and loading the dictionary file.
 void initTokenizer(java.lang.String[] paramNames, java.lang.Object[] paramValues)
           
 TokenAnnotation newToken(org.apache.uima.jcas.JCas jcas)
           
 TokenAnnotation nextToken(org.apache.uima.jcas.JCas jcas)
           
protected  void overrideDelim(java.lang.String delim)
          Set the delimiters used to separate the input string into tokens.
 void process(org.apache.uima.jcas.JCas jcas, org.apache.uima.analysis_engine.ResultSpecification aResultSpec)
          Perform the actual analysis.
 void processAllConfigurationParameters(java.lang.String[] configParameterNames, java.lang.Object[] configParameters)
           
 void processConfigurationParameter(java.lang.String configParameterName, java.lang.Object configParameterValue)
           
protected  void setDelim(java.lang.String delim)
          Set the delimiters used to separate the input string into tokens.
 void setStemmer(Stemmer stemmer)
           
 void setText(java.lang.String text)
          Set the text to tokenize.
 boolean shouldFoldCase(java.lang.String token)
           
 boolean shouldStem()
           
protected  java.lang.String stem(java.lang.String token)
          If the stemming flag is true, then return the stemmed form of the supplied word using the Porter stemmer.
 
Methods inherited from class org.apache.uima.analysis_engine.annotator.Annotator_ImplBase
destroy, finalize, getContext, getTypeSystem, reconfigure, typeSystemInit
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.uima.analysis_engine.annotator.BaseAnnotator
destroy, reconfigure, typeSystemInit
 

Field Detail

PARAM_CASE_MATCH

public static final java.lang.String PARAM_CASE_MATCH
Configuration parameter key/label for the case matching string

See Also:
Constant Field Values

PARAM_STEMMER_CLASS

public static final java.lang.String PARAM_STEMMER_CLASS
Configuration parameter key/label for the stemmer class spec

See Also:
Constant Field Values

PARAM_TOKEN_DELIM

public static final java.lang.String PARAM_TOKEN_DELIM
Configuration parameter key/label for the token delimiters string

See Also:
Constant Field Values

Constructor Detail

OffsetTokenizer

public OffsetTokenizer()
Create a new OffsetTokenizer. Initializes the default stemmer and sets up the regular expressions for the various case folding options.

Method Detail

getText

public java.lang.String getText()
Returns:
The text being tokenized.

setText

public void setText(java.lang.String text)
Set the text to tokenize. After this method is called, the next call to nextToken returns the first token of the input string as a TokenAnnotation; the token text is available through TokenAnnotation.getText().


getStemmer

public Stemmer getStemmer()
Returns:
The stemmer.

setStemmer

public void setStemmer(Stemmer stemmer)
Parameters:
stemmer - The stemmer to set.

newToken

public TokenAnnotation newToken(org.apache.uima.jcas.JCas jcas)

nextToken

public TokenAnnotation nextToken(org.apache.uima.jcas.JCas jcas)

foldCase

protected java.lang.String foldCase(java.lang.String token)
If one of the case folding flags is true and the input string matches the character pattern corresponding to that flag, then convert all letters to lowercase.

Parameters:
token - The string to case fold
Returns:
The case folded string
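
The flag-driven behavior of foldCase can be sketched as follows. This is an illustration only: CaseFoldSketch is a hypothetical class, and the two regular expressions are assumptions standing in for whatever patterns the constructor actually compiles for the "digit" and "initial cap" flags.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical sketch of flag-controlled case folding; not the real OffsetTokenizer code. */
final class CaseFoldSketch {
    // Assumed pattern: the token contains at least one digit.
    private static final Pattern HAS_DIGIT = Pattern.compile("\\d");
    // Assumed pattern: one uppercase letter followed only by lowercase letters.
    private static final Pattern INIT_CAP = Pattern.compile("\\p{Lu}\\p{Ll}*");

    private final boolean foldAll;
    private final boolean foldDigit;
    private final boolean foldInitCap;

    CaseFoldSketch(boolean foldAll, boolean foldDigit, boolean foldInitCap) {
        this.foldAll = foldAll;
        this.foldDigit = foldDigit;
        this.foldInitCap = foldInitCap;
    }

    /** True if any enabled flag's pattern applies to the token. */
    boolean shouldFoldCase(String token) {
        return foldAll
                || (foldDigit && HAS_DIGIT.matcher(token).find())
                || (foldInitCap && INIT_CAP.matcher(token).matches());
    }

    /** Lowercases the token when a flag applies; otherwise returns it unchanged. */
    String foldCase(String token) {
        return shouldFoldCase(token) ? token.toLowerCase(Locale.ROOT) : token;
    }

    public static void main(String[] args) {
        CaseFoldSketch initCapOnly = new CaseFoldSketch(false, false, true);
        System.out.println(initCapOnly.foldCase("Heart"));
        System.out.println(initCapOnly.foldCase("DNA"));
    }
}
```

Under these assumed patterns, folding only initial-cap tokens lowercases sentence-initial words such as "Heart" while leaving acronyms such as "DNA" intact.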

doFoldCase

public static java.lang.String doFoldCase(java.lang.String token)

shouldFoldCase

public boolean shouldFoldCase(java.lang.String token)

shouldStem

public boolean shouldStem()

setDelim

protected void setDelim(java.lang.String delim)
Set the delimiters used to separate the input string into tokens. This adds the new delimiters to the base whitespace delimiters " \t\n\r\f".

Parameters:
delim - The new set of delimiters.

overrideDelim

protected void overrideDelim(java.lang.String delim)
Set the delimiters used to separate the input string into tokens. This sets the delimiters to exactly the given set. The base whitespace delimiters are not included.

Parameters:
delim - The new set of delimiters.
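
The difference between setDelim and overrideDelim can be shown with a small standalone sketch. DelimiterSketch is a hypothetical class, not the real tokenizer; it assumes each setDelim call starts again from the base whitespace set rather than accumulating across calls, and it uses java.util.StringTokenizer only to count the resulting tokens.

```java
import java.util.StringTokenizer;

/** Hypothetical sketch contrasting "add to base delimiters" with "replace delimiters". */
final class DelimiterSketch {
    /** The base whitespace delimiters named in the setDelim description. */
    static final String BASE_DELIMITERS = " \t\n\r\f";

    private String delim = BASE_DELIMITERS;

    /** Adds the given characters to the base whitespace delimiters (assumed non-cumulative). */
    void setDelim(String extra) {
        delim = BASE_DELIMITERS + extra;
    }

    /** Uses exactly the given characters; whitespace no longer separates tokens. */
    void overrideDelim(String exact) {
        delim = exact;
    }

    String getDelim() {
        return delim;
    }

    /** Counts the tokens the current delimiter set produces for the given text. */
    int countTokens(String text) {
        return new StringTokenizer(text, delim).countTokens();
    }

    public static void main(String[] args) {
        DelimiterSketch sketch = new DelimiterSketch();
        sketch.setDelim("-");
        System.out.println(sketch.countTokens("stage-IV tumor"));
        sketch.overrideDelim("-");
        System.out.println(sketch.countTokens("stage-IV tumor"));
    }
}
```

With setDelim("-") the input "stage-IV tumor" splits on both the hyphen and the space; with overrideDelim("-") only the hyphen splits it, so "IV tumor" stays together as one token.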

getDelim

protected java.lang.String getDelim()
Get the current list of delimiters used to separate the input string into tokens.

Returns:
The current list of delimiters used to separate the input string into tokens.

getStemming

protected boolean getStemming()
Get the current stemming flag.

Returns:
true if stemming is currently on, false otherwise

getCaseFoldInitCap

protected boolean getCaseFoldInitCap()
Get case folding flag for folding tokens with initial cap.

Returns:
the current value of the flag

getCaseFoldDigit

protected boolean getCaseFoldDigit()
Get the case folding flag for folding tokens with at least one digit character.

Returns:
the current value of the flag

getCaseFoldAll

protected boolean getCaseFoldAll()
Get case folding flag for folding all tokens.

Returns:
the current value of the flag.

initialize

public void initialize(org.apache.uima.analysis_engine.annotator.AnnotatorContext annotatorContext)
                throws org.apache.uima.analysis_engine.annotator.AnnotatorInitializationException,
                       org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException
Initialize the annotator, which includes compiling regular expressions, fetching configuration parameters from the XML descriptor file, and loading the dictionary file.

Specified by:
initialize in interface org.apache.uima.analysis_engine.annotator.BaseAnnotator
Overrides:
initialize in class org.apache.uima.analysis_engine.annotator.Annotator_ImplBase
Throws:
org.apache.uima.analysis_engine.annotator.AnnotatorInitializationException
org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException

processAllConfigurationParameters

public void processAllConfigurationParameters(java.lang.String[] configParameterNames,
                                              java.lang.Object[] configParameters)
                                       throws org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException
Throws:
org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException

process

public void process(org.apache.uima.jcas.JCas jcas,
                    org.apache.uima.analysis_engine.ResultSpecification aResultSpec)
             throws org.apache.uima.analysis_engine.annotator.AnnotatorProcessException
Perform the actual analysis. Iterate over the document content looking for tokens and post an annotation for each match found.

Parameters:
jcas - the current CAS to process.
aResultSpec - a specification of the result annotation that should be created by this annotator
Throws:
org.apache.uima.analysis_engine.annotator.AnnotatorProcessException
See Also:
JTextAnnotator.process(JCas, ResultSpecification)

initTokenizer

public void initTokenizer(java.lang.String[] paramNames,
                          java.lang.Object[] paramValues)
                   throws java.lang.Exception
Throws:
java.lang.Exception

doTokenization

protected void doTokenization(org.apache.uima.jcas.JCas jcas,
                              java.lang.String documentText,
                              java.lang.String delimiters)
Parameters:
jcas -
documentText -
delimiters -

processConfigurationParameter

public void processConfigurationParameter(java.lang.String configParameterName,
                                          java.lang.Object configParameterValue)
Parameters:
configParameterName -
configParameterValue -

stem

protected java.lang.String stem(java.lang.String token)
If the stemming flag is true, then return the stemmed form of the supplied word using the Porter stemmer.

Parameters:
token - the word to stem
Returns:
the original word if the stemming flag is false, otherwise the stemmed form of the word
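
The flag-guarded stemming described above, combined with a pluggable stemmer, might look like the sketch below. Everything here is an assumption for illustration: SimpleStemmer is a hypothetical single-method interface (the real Stemmer interface may declare different or additional methods), and the plural-stripping rule is a toy stand-in, not the Porter algorithm.

```java
/** Hypothetical sketch of flag-guarded stemming with a pluggable stemmer. */
final class StemSketch {

    /** Assumed shape of a stemmer; the real Stemmer interface may differ. */
    interface SimpleStemmer {
        String stem(String token);
    }

    /** Toy stemmer (NOT Porter): drops one trailing "s" from longer words that do not end in "ss". */
    static final SimpleStemmer STRIP_PLURAL_S = token ->
            token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")
                    ? token.substring(0, token.length() - 1)
                    : token;

    private final boolean stemming;
    private final SimpleStemmer stemmer;

    StemSketch(boolean stemming, SimpleStemmer stemmer) {
        this.stemming = stemming;
        this.stemmer = stemmer;
    }

    /** Returns the stemmed form when stemming is on, otherwise the token unchanged. */
    String stem(String token) {
        return stemming ? stemmer.stem(token) : token;
    }

    public static void main(String[] args) {
        StemSketch on = new StemSketch(true, STRIP_PLURAL_S);
        StemSketch off = new StemSketch(false, STRIP_PLURAL_S);
        System.out.println(on.stem("tumors"));
        System.out.println(off.stem("tumors"));
    }
}
```

Keeping the flag check in one method means callers can always pass tokens through stem and let the configuration decide whether anything changes.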

doStemming

public static java.lang.String doStemming(java.lang.String token,
                                          Stemmer stemmer)


Copyright © 2011. All Rights Reserved.