java.lang.Object
  org.apache.uima.analysis_engine.annotator.Annotator_ImplBase
    org.apache.uima.analysis_engine.annotator.JTextAnnotator_ImplBase
      org.apache.uima.conceptMapper.support.tokenizer.OffsetTokenizer
public class OffsetTokenizer
Simple class to tokenize a string (similar to java.util.StringTokenizer), except that this tokenizer returns TokenAnnotation objects, which, in addition to the token text string, also contain the start and end offsets of the token in the original string.

The tokenizer will optionally perform stemming and case normalization on the tokens, and the set of characters that delimit tokens may be specified. The default stemmer is the Snowball Porter stemmer, but any stemmer may be supplied to the tokenizer as long as it implements the Stemmer interface.
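
The idea of returning tokens together with their offsets can be illustrated with a minimal, self-contained sketch. It does not use the UIMA classes: `OffsetTokenSketch` and its nested `Token` class are hypothetical stand-ins for OffsetTokenizer and TokenAnnotation, and the exclusive end offset is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not the UIMA OffsetTokenizer or TokenAnnotation classes. */
class OffsetTokenSketch {

    /** A token plus its character offsets in the original string (end is exclusive here). */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /** Split input on any character found in delimiters, recording where each token sits. */
    static List<Token> tokenize(String input, String delimiters) {
        List<Token> tokens = new ArrayList<>();
        int tokenStart = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= input.length(); i++) {
            // The end of the input closes any open token, just like a delimiter does.
            boolean atBoundary = i == input.length()
                    || delimiters.indexOf(input.charAt(i)) >= 0;
            if (atBoundary) {
                if (tokenStart >= 0) {
                    tokens.add(new Token(input.substring(tokenStart, i), tokenStart, i));
                    tokenStart = -1;
                }
            } else if (tokenStart < 0) {
                tokenStart = i; // first character of a new token
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints Hello[0,5), UIMA[7,11) and world[12,17), one per line.
        for (Token t : tokenize("Hello, UIMA world", " ,")) {
            System.out.println(t);
        }
    }
}
```

Because each token carries its offsets, `input.substring(token.start, token.end)` recovers the token text from the original string, which a plain java.util.StringTokenizer does not allow.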
Field Summary
Modifier and Type | Field | Description |
---|---|---|
static java.lang.String | PARAM_CASE_MATCH | Configuration parameter key/label for the case matching string |
static java.lang.String | PARAM_STEMMER_CLASS | Configuration parameter key/label for the stemmer class spec |
static java.lang.String | PARAM_TOKEN_DELIM | Configuration parameter key/label for the token delimiters string |
Constructor Summary
Constructor | Description |
---|---|
OffsetTokenizer() | Create a new OffsetTokenizer. |
Method Summary
Modifier and Type | Method | Description |
---|---|---|
static java.lang.String | doFoldCase(java.lang.String token) | |
static java.lang.String | doStemming(java.lang.String token, Stemmer stemmer) | |
protected void | doTokenization(org.apache.uima.jcas.JCas jcas, java.lang.String documentText, java.lang.String delimiters) | |
protected java.lang.String | foldCase(java.lang.String token) | If one of the case folding flags is true and the input string matches the character pattern corresponding to that flag, then convert all letters to lowercase. |
protected boolean | getCaseFoldAll() | Get case folding flag for folding all tokens. |
protected boolean | getCaseFoldDigit() | Get the case folding flag for folding tokens with at least one digit character. |
protected boolean | getCaseFoldInitCap() | Get case folding flag for folding tokens with initial cap. |
protected java.lang.String | getDelim() | Get the current list of delimiters used to separate the input string into tokens. |
Stemmer | getStemmer() | |
protected boolean | getStemming() | Get the current stemming flag. |
java.lang.String | getText() | |
void | initialize(org.apache.uima.analysis_engine.annotator.AnnotatorContext annotatorContext) | Initialize the annotator, which includes compilation of regular expressions, fetching configuration parameters from XML descriptor file, and loading of the dictionary file. |
void | initTokenizer(java.lang.String[] paramNames, java.lang.Object[] paramValues) | |
TokenAnnotation | newToken(org.apache.uima.jcas.JCas jcas) | |
TokenAnnotation | nextToken(org.apache.uima.jcas.JCas jcas) | |
protected void | overrideDelim(java.lang.String delim) | Set the delimiters used to separate the input string into tokens. |
void | process(org.apache.uima.jcas.JCas jcas, org.apache.uima.analysis_engine.ResultSpecification aResultSpec) | Perform the actual analysis. |
void | processAllConfigurationParameters(java.lang.String[] configParameterNames, java.lang.Object[] configParameters) | |
void | processConfigurationParameter(java.lang.String configParameterName, java.lang.Object configParameterValue) | |
protected void | setDelim(java.lang.String delim) | Set the delimiters used to separate the input string into tokens. |
void | setStemmer(Stemmer stemmer) | |
void | setText(java.lang.String text) | Set the text to tokenize. |
boolean | shouldFoldCase(java.lang.String token) | |
boolean | shouldStem() | |
protected java.lang.String | stem(java.lang.String token) | If the stemming flag is true, then return the stemmed form of the supplied word using the Porter stemmer. |
Methods inherited from class org.apache.uima.analysis_engine.annotator.Annotator_ImplBase |
---|
destroy, finalize, getContext, getTypeSystem, reconfigure, typeSystemInit |
Methods inherited from class java.lang.Object |
---|
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface org.apache.uima.analysis_engine.annotator.BaseAnnotator |
---|
destroy, reconfigure, typeSystemInit |
Field Detail |
---|
public static final java.lang.String PARAM_CASE_MATCH
public static final java.lang.String PARAM_STEMMER_CLASS
public static final java.lang.String PARAM_TOKEN_DELIM
Constructor Detail |
---|
public OffsetTokenizer()
Create a new OffsetTokenizer. Initializes the default stemmer and sets up the regular expressions for the various case folding options.
Method Detail |
---|
public java.lang.String getText()
public void setText(java.lang.String text)
Set the text to tokenize. nextToken will return the first token from the input string as a TokenAnnotation; you can get the text by using TokenAnnotation.getText().
public Stemmer getStemmer()
public void setStemmer(Stemmer stemmer)
Parameters:
stemmer - The stemmer to set.
public TokenAnnotation newToken(org.apache.uima.jcas.JCas jcas)
public TokenAnnotation nextToken(org.apache.uima.jcas.JCas jcas)
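protected java.lang.String foldCase(java.lang.String token)
If one of the case folding flags is true and the input string matches the character pattern corresponding to that flag, then convert all letters to lowercase.
Parameters:
token - The string to case fold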
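
The flag-driven folding rule can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: `CaseFoldSketch` is a hypothetical name, and the two regular expressions are guesses at what "at least one digit" and "initial cap" might mean, since the real patterns are not shown on this page.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustrative only: flag names mirror the getters above, the patterns are assumed. */
class CaseFoldSketch {

    // Assumed pattern: the token contains at least one digit anywhere.
    private static final Pattern HAS_DIGIT = Pattern.compile("\\d");
    // Assumed pattern: one uppercase letter followed only by lowercase letters.
    private static final Pattern INIT_CAP = Pattern.compile("\\p{Lu}\\p{Ll}+");

    private final boolean foldAll;
    private final boolean foldDigit;
    private final boolean foldInitCap;

    CaseFoldSketch(boolean foldAll, boolean foldDigit, boolean foldInitCap) {
        this.foldAll = foldAll;
        this.foldDigit = foldDigit;
        this.foldInitCap = foldInitCap;
    }

    /** True if any enabled flag applies to this token. */
    boolean shouldFoldCase(String token) {
        return foldAll
                || (foldDigit && HAS_DIGIT.matcher(token).find())
                || (foldInitCap && INIT_CAP.matcher(token).matches());
    }

    /** Lowercase the token when a flag applies; otherwise return it unchanged. */
    String foldCase(String token) {
        return shouldFoldCase(token) ? token.toLowerCase(Locale.ROOT) : token;
    }

    public static void main(String[] args) {
        CaseFoldSketch initCapOnly = new CaseFoldSketch(false, false, true);
        System.out.println(initCapOnly.foldCase("Paris")); // folded: matches the initial-cap pattern
        System.out.println(initCapOnly.foldCase("NASA"));  // unchanged: all caps is not initial cap
    }
}
```

Keeping the decision (`shouldFoldCase`) separate from the transformation (`foldCase`) mirrors the split between shouldFoldCase and foldCase in the method list, although how the real class divides that work is not documented here.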
public static java.lang.String doFoldCase(java.lang.String token)
public boolean shouldFoldCase(java.lang.String token)
public boolean shouldStem()
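protected void setDelim(java.lang.String delim)
Set the delimiters used to separate the input string into tokens.
Parameters:
delim - The new set of delimiters.
protected void overrideDelim(java.lang.String delim)
Set the delimiters used to separate the input string into tokens.
Parameters:
delim - The new set of delimiters.
protected java.lang.String getDelim()
Get the current list of delimiters used to separate the input string into tokens.
protected boolean getStemming()
Get the current stemming flag.
protected boolean getCaseFoldInitCap()
Get case folding flag for folding tokens with initial cap.
protected boolean getCaseFoldDigit()
Get the case folding flag for folding tokens with at least one digit character.
protected boolean getCaseFoldAll()
Get case folding flag for folding all tokens.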
public void initialize(org.apache.uima.analysis_engine.annotator.AnnotatorContext annotatorContext) throws org.apache.uima.analysis_engine.annotator.AnnotatorInitializationException, org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException
Initialize the annotator, which includes compilation of regular expressions, fetching configuration parameters from the XML descriptor file, and loading of the dictionary file.
Specified by:
initialize in interface org.apache.uima.analysis_engine.annotator.BaseAnnotator
Overrides:
initialize in class org.apache.uima.analysis_engine.annotator.Annotator_ImplBase
Throws:
org.apache.uima.analysis_engine.annotator.AnnotatorInitializationException
org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException
public void processAllConfigurationParameters(java.lang.String[] configParameterNames, java.lang.Object[] configParameters) throws org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException
Throws:
org.apache.uima.analysis_engine.annotator.AnnotatorConfigurationException
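
The signature above takes parameter names and values as two parallel arrays. One plausible way to handle such a pair is to walk both arrays together and hand each entry to a per-parameter method, as in the sketch below. Everything in it is an assumption for illustration: `ConfigDispatchSketch`, the `example.*` key strings (the actual values of PARAM_CASE_MATCH and PARAM_TOKEN_DELIM are not given on this page), and the use of IllegalArgumentException in place of AnnotatorConfigurationException.

```java
/** Illustrative only: shows parallel-array dispatch, not the real configuration logic. */
class ConfigDispatchSketch {

    // Hypothetical key strings, chosen for this example only.
    static final String KEY_CASE_MATCH = "example.caseMatch";
    static final String KEY_TOKEN_DELIM = "example.tokenDelim";

    String caseMatch = "";
    String delimiters = " ";

    /** Walk the two arrays in step and dispatch each name/value pair. */
    void processAll(String[] names, Object[] values) {
        if (names.length != values.length) {
            throw new IllegalArgumentException(
                    "expected " + names.length + " values but got " + values.length);
        }
        for (int i = 0; i < names.length; i++) {
            processOne(names[i], values[i]);
        }
    }

    /** Apply a single parameter; unrecognized names are ignored in this sketch. */
    void processOne(String name, Object value) {
        if (KEY_CASE_MATCH.equals(name)) {
            caseMatch = requireString(name, value);
        } else if (KEY_TOKEN_DELIM.equals(name)) {
            delimiters = requireString(name, value);
        }
    }

    private static String requireString(String name, Object value) {
        if (!(value instanceof String)) {
            throw new IllegalArgumentException(name + " must be a String");
        }
        return (String) value;
    }

    public static void main(String[] args) {
        ConfigDispatchSketch config = new ConfigDispatchSketch();
        config.processAll(
                new String[] {KEY_TOKEN_DELIM, KEY_CASE_MATCH},
                new Object[] {" ,;", "ignoreall"});
        System.out.println("delimiters=[" + config.delimiters + "] caseMatch=" + config.caseMatch);
    }
}
```

Checking the array lengths up front turns a mismatched pair into an immediate, descriptive error instead of an out-of-bounds failure partway through.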
public void process(org.apache.uima.jcas.JCas jcas, org.apache.uima.analysis_engine.ResultSpecification aResultSpec) throws org.apache.uima.analysis_engine.annotator.AnnotatorProcessException
Perform the actual analysis.
Parameters:
jcas - the current CAS to process.
aResultSpec - a specification of the result annotation that should be created by this annotator
Throws:
org.apache.uima.analysis_engine.annotator.AnnotatorProcessException
See Also:
JTextAnnotator.process(JCas, ResultSpecification)
public void initTokenizer(java.lang.String[] paramNames, java.lang.Object[] paramValues) throws java.lang.Exception
Throws:
java.lang.Exception
protected void doTokenization(org.apache.uima.jcas.JCas jcas, java.lang.String documentText, java.lang.String delimiters)
Parameters:
jcas -
documentText -
delimiters -
public void processConfigurationParameter(java.lang.String configParameterName, java.lang.Object configParameterValue)
Parameters:
configParameterName -
configParameterValue -
protected java.lang.String stem(java.lang.String token)
If the stemming flag is true, then return the stemmed form of the supplied word using the Porter stemmer.
Parameters:
token - the word to stem
public static java.lang.String doStemming(java.lang.String token, Stemmer stemmer)
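
The combination of a stemming flag and a replaceable stemmer can be sketched as below. The sketch is an assumption-laden illustration: `WordStemmer` is a hypothetical stand-in for the Stemmer interface (whose methods are not listed on this page), and `NaiveSuffixStemmer` is a deliberately crude suffix stripper, not the Snowball Porter stemmer.

```java
import java.util.Locale;

/** Illustrative only: shows a flag-guarded, pluggable stemmer. */
class StemmingSketch {

    /** Hypothetical stand-in for the Stemmer interface. */
    interface WordStemmer {
        String stem(String word);
    }

    /** Crude suffix stripping for demonstration; NOT the Porter algorithm. */
    static final class NaiveSuffixStemmer implements WordStemmer {
        private static final String[] SUFFIXES = {"ing", "ed", "s"};

        @Override
        public String stem(String word) {
            for (String suffix : SUFFIXES) {
                // Keep at least three characters so short words are left alone.
                if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                    return word.substring(0, word.length() - suffix.length());
                }
            }
            return word;
        }
    }

    private final boolean stemming;
    private final WordStemmer stemmer;

    StemmingSketch(boolean stemming, WordStemmer stemmer) {
        this.stemming = stemming;
        this.stemmer = stemmer;
    }

    /** Return the stemmed token when stemming is enabled, else the token unchanged. */
    String stem(String token) {
        return (stemming && stemmer != null) ? stemmer.stem(token) : token;
    }

    public static void main(String[] args) {
        StemmingSketch on = new StemmingSketch(true, new NaiveSuffixStemmer());
        StemmingSketch off = new StemmingSketch(false, new NaiveSuffixStemmer());
        System.out.println(on.stem("walking"));  // stemming enabled
        System.out.println(off.stem("walking")); // stemming disabled, token unchanged
        // Any implementation can be plugged in, here a lambda that only lowercases.
        System.out.println(new StemmingSketch(true, w -> w.toLowerCase(Locale.ROOT)).stem("Tokens"));
    }
}
```

Because the stemmer is held behind an interface, swapping algorithms requires no change to the tokenizing code, which is the property the class description attributes to the Stemmer interface.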