weka.datagenerators.clusterers
Class BIRCHCluster

java.lang.Object
  extended by weka.datagenerators.DataGenerator
      extended by weka.datagenerators.ClusterGenerator
          extended by weka.datagenerators.clusterers.BIRCHCluster
All Implemented Interfaces:
java.io.Serializable, OptionHandler, Randomizable, RevisionHandler, TechnicalInformationHandler

public class BIRCHCluster
extends ClusterGenerator
implements TechnicalInformationHandler

Cluster data generator designed for the BIRCH System

Dataset is generated with instances in K clusters.
Instances are 2-d data points.
Each cluster is characterized by the number of data points in it, its radius and its center. The location of the cluster centers is determined by the pattern parameter. Three patterns are currently supported: grid, sine and random.
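
To make the grid pattern concrete, the following is a minimal sketch of how K two-dimensional cluster centers could be laid out on a square grid. It is an illustration only, not Weka's implementation: the class name GridCentersSketch, the method gridCenters and the spacing parameter are all hypothetical, and how the generator actually derives the spacing from the distance multiplier and the radii is not stated on this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical helper, not part of Weka. */
final class GridCentersSketch {

    /**
     * Returns k two-dimensional centers, filled row by row on a square grid
     * whose side length is ceil(sqrt(k)). The spacing between neighbouring
     * centers is an assumed input.
     */
    static List<double[]> gridCenters(int k, double spacing) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        int side = (int) Math.ceil(Math.sqrt(k));
        List<double[]> centers = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            int row = i / side;   // grid row of the i-th center
            int col = i % side;   // grid column of the i-th center
            centers.add(new double[] {col * spacing, row * spacing});
        }
        return centers;
    }

    public static void main(String[] args) {
        // Four clusters (the documented default for -k) with an assumed spacing of 4.0.
        for (double[] c : gridCenters(4, 4.0)) {
            System.out.println(c[0] + ", " + c[1]);
        }
    }
}
```

With k = 4 the centers fall on the corners of a 2 x 2 grid; for values of k that are not perfect squares the last grid row is only partly filled.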

For more information refer to:

Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. In: ACM SIGMOD International Conference on Management of Data, 103-114, 1996.

BibTeX:

 @inproceedings{Zhang1996,
    author = {Tian Zhang and Raghu Ramakrishnan and Miron Livny},
    booktitle = {ACM SIGMOD International Conference on Management of Data},
    pages = {103-114},
    publisher = {ACM Press},
    title = {BIRCH: An Efficient Data Clustering Method for Very Large Databases},
    year = {1996}
 }
 

Valid options are:

 -h
  Prints this help.
 -o <file>
  The name of the output file, otherwise the generated data is
  printed to stdout.
 -r <name>
  The name of the relation.
 -d
  Whether to print debug information.
 -S
  The seed for random function (default 1)
 -a <num>
  The number of attributes (default 10).
 -c
  Class Flag, if set, the cluster is listed in extra attribute.
 -b <range>
  The indices for boolean attributes.
 -m <range>
  The indices for nominal attributes.
 -k <num>
  The number of clusters (default 4)
 -G
  Set pattern to grid (default is random).
  This flag cannot be used at the same time as flag I.
  The pattern is random, if neither flag G nor flag I is set.
 -I
  Set pattern to sine (default is random).
  This flag cannot be used at the same time as flag G.
  The pattern is random, if neither flag G nor flag I is set.
 -N <num>..<num>
  The range of number of instances per cluster (default 1..50).
  Lower number must be between 0 and 2500,
  upper number must be between 50 and 2500.
 -R <num>..<num>
  The range of radius per cluster (default 0.1..1.4142135623730951).
  Lower number must be between 0 and SQRT(2), 
  upper number must be between SQRT(2) and SQRT(32).
 -M <num>
  The distance multiplier (default 4.0).
 -C <num>
  The number of cycles (default 4).
 -O
  Flag for input order is ORDERED. If flag is not set then 
  input order is RANDOMIZED. RANDOMIZED is currently not 
  implemented, so the input order is always ORDERED.
 -P <num>
  The noise rate in percent (default 0.0).
  Can be between 0% and 30%. (Remark: The original 
  algorithm only allows noise up to 10%.)
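
The -N and -R options take a range written as <num>..<num>. The sketch below shows one way such a value could be split and checked against the limits listed above. It is a hedged illustration rather than Weka's own parser: the class RangeOptionSketch and its methods are hypothetical, and treating the documented limits as inclusive is an assumption.

```java
/** Illustrative only: hypothetical helper, not part of Weka. */
final class RangeOptionSketch {

    /** Splits "<num>..<num>" into {lower, upper}. */
    static double[] parseRange(String value) {
        int sep = value.indexOf("..");
        if (sep < 0) {
            throw new IllegalArgumentException("Expected <num>..<num>, got: " + value);
        }
        double lower = Double.parseDouble(value.substring(0, sep));
        double upper = Double.parseDouble(value.substring(sep + 2));
        return new double[] {lower, upper};
    }

    /** Checks the -N limits: lower in [0, 2500], upper in [50, 2500] (inclusive bounds assumed). */
    static boolean isValidInstanceRange(double[] r) {
        return r[0] >= 0 && r[0] <= 2500 && r[1] >= 50 && r[1] <= 2500;
    }

    /** Checks the -R limits: lower in [0, sqrt(2)], upper in [sqrt(2), sqrt(32)] (inclusive bounds assumed). */
    static boolean isValidRadiusRange(double[] r) {
        double sqrt2 = Math.sqrt(2.0);
        return r[0] >= 0 && r[0] <= sqrt2 && r[1] >= sqrt2 && r[1] <= Math.sqrt(32.0);
    }

    public static void main(String[] args) {
        // The documented defaults for -N and -R.
        System.out.println(isValidInstanceRange(parseRange("1..50")));
        System.out.println(isValidRadiusRange(parseRange("0.1..1.4142135623730951")));
    }
}
```

Searching for the first ".." keeps decimal points intact, so a value such as 0.1..1.4142135623730951 splits at the range separator and not at the decimal point of the lower bound.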

Version:
$Revision: 1.8 $
Author:
Gabi Schmidberger (gabi@cs.waikato.ac.nz), FracPete (fracpete at waikato dot ac dot nz)
See Also:
Serialized Form

Field Summary
static int GRID
          Constant set for choice of pattern.
static int ORDERED
          Constant set for input order (option O)
static int RANDOM
          Constant set for choice of pattern.
static int RANDOMIZED
          Constant set for input order (default)
static int SINE
          Constant set for choice of pattern.
static Tag[] TAGS_INPUTORDER
          the input order tags
static Tag[] TAGS_PATTERN
          the pattern tags
 
Constructor Summary
BIRCHCluster()
          initializes the generator with default values
 
Method Summary
 Instances defineDataFormat()
          Initializes the format for the dataset produced.
 java.lang.String distMultTipText()
          Returns the tip text for this property
 Instance generateExample()
          Generate an example of the dataset.
 Instances generateExamples()
          Generate all examples of the dataset.
 Instances generateExamples(java.util.Random random, Instances format)
          Generate all examples of the dataset.
 java.lang.String generateFinished()
          Compiles documentation about the data generation after the generation process
 java.lang.String generateStart()
          Compiles documentation about the data generation before the generation process
 double getDistMult()
          Gets the distance multiplier.
 SelectedTag getInputOrder()
          Gets the input order.
 int getMaxInstNum()
          Gets the upper boundary for instances per cluster.
 double getMaxRadius()
          Gets the upper boundary for the radiuses of the clusters.
 int getMinInstNum()
          Gets the lower boundary for instances per cluster.
 double getMinRadius()
          Gets the lower boundary for the radiuses of the clusters.
 double getNoiseRate()
          Gets the percentage of noise set.
 int getNumClusters()
          Gets the number of clusters the dataset should have.
 int getNumCycles()
          Gets the number of cycles.
 java.lang.String[] getOptions()
          Gets the current settings of the datagenerator BIRCHCluster.
 boolean getOrderedFlag()
          Gets the ordered flag (option O).
 SelectedTag getPattern()
          Gets the pattern type.
 java.lang.String getRevision()
          Returns the revision string.
 boolean getSingleModeFlag()
          Gets the single mode flag.
 TechnicalInformation getTechnicalInformation()
          Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
 java.lang.String globalInfo()
          Returns a string describing this data generator.
 java.lang.String inputOrderTipText()
          Returns the tip text for this property
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
static void main(java.lang.String[] args)
          Main method for testing this class.
 java.lang.String maxInstNumTipText()
          Returns the tip text for this property
 java.lang.String maxRadiusTipText()
          Returns the tip text for this property
 java.lang.String minInstNumTipText()
          Returns the tip text for this property
 java.lang.String minRadiusTipText()
          Returns the tip text for this property
 java.lang.String noiseRateTipText()
          Returns the tip text for this property
 java.lang.String numClustersTipText()
          Returns the tip text for this property
 java.lang.String numCyclesTipText()
          Returns the tip text for this property
 java.lang.String patternTipText()
          Returns the tip text for this property
 void setDistMult(double newDistMult)
          Sets the distance multiplier.
 void setInputOrder(SelectedTag value)
          Sets the input order.
 void setMaxInstNum(int newMaxInstNum)
          Sets the upper boundary for instances per cluster.
 void setMaxRadius(double newMaxRadius)
          Sets the upper boundary for the radiuses of the clusters.
 void setMinInstNum(int newMinInstNum)
          Sets the lower boundary for instances per cluster.
 void setMinRadius(double newMinRadius)
          Sets the lower boundary for the radiuses of the clusters.
 void setNoiseRate(double newNoiseRate)
          Sets the percentage of noise set.
 void setNumClusters(int numClusters)
          Sets the number of clusters the dataset should have.
 void setNumCycles(int newNumCycles)
          Sets the number of cycles.
 void setOptions(java.lang.String[] options)
          Parses a list of options for this object.
 void setPattern(SelectedTag value)
          Sets the pattern type.
 
Methods inherited from class weka.datagenerators.ClusterGenerator
booleanColsTipText, classFlagTipText, getBooleanCols, getClassFlag, getNominalCols, getNumAttributes, nominalColsTipText, numAttributesTipText, setBooleanCols, setBooleanIndices, setClassFlag, setNominalCols, setNominalIndices, setNumAttributes
 
Methods inherited from class weka.datagenerators.DataGenerator
debugTipText, defaultOutput, formatTipText, getDatasetFormat, getDebug, getNumExamplesAct, getOutput, getRandom, getRelationName, getSeed, makeData, outputTipText, randomTipText, relationNameTipText, seedTipText, setDatasetFormat, setDebug, setOutput, setRandom, setRelationName, setSeed
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

GRID

public static final int GRID
Constant set for choice of pattern. (option G)

See Also:
Constant Field Values

SINE

public static final int SINE
Constant set for choice of pattern. (option I)

See Also:
Constant Field Values

RANDOM

public static final int RANDOM
Constant set for choice of pattern. (default)

See Also:
Constant Field Values

TAGS_PATTERN

public static final Tag[] TAGS_PATTERN
the pattern tags


ORDERED

public static final int ORDERED
Constant set for input order (option O)

See Also:
Constant Field Values

RANDOMIZED

public static final int RANDOMIZED
Constant set for input order (default)

See Also:
Constant Field Values

TAGS_INPUTORDER

public static final Tag[] TAGS_INPUTORDER
the input order tags

Constructor Detail

BIRCHCluster

public BIRCHCluster()
initializes the generator with default values

Method Detail

globalInfo

public java.lang.String globalInfo()
Returns a string describing this data generator.

Returns:
a description of the data generator suitable for displaying in the explorer/experimenter gui

getTechnicalInformation

public TechnicalInformation getTechnicalInformation()
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.

Specified by:
getTechnicalInformation in interface TechnicalInformationHandler
Returns:
the technical information about this class

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Overrides:
listOptions in class ClusterGenerator
Returns:
an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a list of options for this object.

Valid options are:

 -h
  Prints this help.
 -o <file>
  The name of the output file, otherwise the generated data is
  printed to stdout.
 -r <name>
  The name of the relation.
 -d
  Whether to print debug information.
 -S
  The seed for random function (default 1)
 -a <num>
  The number of attributes (default 10).
 -c
  Class Flag, if set, the cluster is listed in extra attribute.
 -b <range>
  The indices for boolean attributes.
 -m <range>
  The indices for nominal attributes.
 -k <num>
  The number of clusters (default 4)
 -G
  Set pattern to grid (default is random).
  This flag cannot be used at the same time as flag I.
  The pattern is random, if neither flag G nor flag I is set.
 -I
  Set pattern to sine (default is random).
  This flag cannot be used at the same time as flag G.
  The pattern is random, if neither flag G nor flag I is set.
 -N <num>..<num>
  The range of number of instances per cluster (default 1..50).
  Lower number must be between 0 and 2500,
  upper number must be between 50 and 2500.
 -R <num>..<num>
  The range of radius per cluster (default 0.1..1.4142135623730951).
  Lower number must be between 0 and SQRT(2), 
  upper number must be between SQRT(2) and SQRT(32).
 -M <num>
  The distance multiplier (default 4.0).
 -C <num>
  The number of cycles (default 4).
 -O
  Flag for input order is ORDERED. If flag is not set then 
  input order is RANDOMIZED. RANDOMIZED is currently not 
  implemented, so the input order is always ORDERED.
 -P <num>
  The noise rate in percent (default 0.0).
  Can be between 0% and 30%. (Remark: The original 
  algorithm only allows noise up to 10%.)

Specified by:
setOptions in interface OptionHandler
Overrides:
setOptions in class ClusterGenerator
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the datagenerator BIRCHCluster.

Specified by:
getOptions in interface OptionHandler
Overrides:
getOptions in class ClusterGenerator
Returns:
an array of strings suitable for passing to setOptions
See Also:
DataGenerator.removeBlacklist(String[])

setNumClusters

public void setNumClusters(int numClusters)
Sets the number of clusters the dataset should have.

Parameters:
numClusters - the new number of clusters

getNumClusters

public int getNumClusters()
Gets the number of clusters the dataset should have.

Returns:
the number of clusters the dataset should have

numClustersTipText

public java.lang.String numClustersTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getMinInstNum

public int getMinInstNum()
Gets the lower boundary for instances per cluster.

Returns:
the lower boundary for instances per cluster

setMinInstNum

public void setMinInstNum(int newMinInstNum)
Sets the lower boundary for instances per cluster.

Parameters:
newMinInstNum - new lower boundary for instances per cluster

minInstNumTipText

public java.lang.String minInstNumTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getMaxInstNum

public int getMaxInstNum()
Gets the upper boundary for instances per cluster.

Returns:
the upper boundary for instances per cluster

setMaxInstNum

public void setMaxInstNum(int newMaxInstNum)
Sets the upper boundary for instances per cluster.

Parameters:
newMaxInstNum - new upper boundary for instances per cluster

maxInstNumTipText

public java.lang.String maxInstNumTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getMinRadius

public double getMinRadius()
Gets the lower boundary for the radiuses of the clusters.

Returns:
the lower boundary for the radiuses of the clusters

setMinRadius

public void setMinRadius(double newMinRadius)
Sets the lower boundary for the radiuses of the clusters.

Parameters:
newMinRadius - new lower boundary for the radiuses of the clusters

minRadiusTipText

public java.lang.String minRadiusTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getMaxRadius

public double getMaxRadius()
Gets the upper boundary for the radiuses of the clusters.

Returns:
the upper boundary for the radiuses of the clusters

setMaxRadius

public void setMaxRadius(double newMaxRadius)
Sets the upper boundary for the radiuses of the clusters.

Parameters:
newMaxRadius - new upper boundary for the radiuses of the clusters

maxRadiusTipText

public java.lang.String maxRadiusTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getPattern

public SelectedTag getPattern()
Gets the pattern type.

Returns:
the current pattern type

setPattern

public void setPattern(SelectedTag value)
Sets the pattern type.

Parameters:
value - new pattern type

patternTipText

public java.lang.String patternTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getDistMult

public double getDistMult()
Gets the distance multiplier.

Returns:
the distance multiplier

setDistMult

public void setDistMult(double newDistMult)
Sets the distance multiplier.

Parameters:
newDistMult - new distance multiplier

distMultTipText

public java.lang.String distMultTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getNumCycles

public int getNumCycles()
Gets the number of cycles.

Returns:
the number of cycles

setNumCycles

public void setNumCycles(int newNumCycles)
Sets the number of cycles.

Parameters:
newNumCycles - new number of cycles

numCyclesTipText

public java.lang.String numCyclesTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getInputOrder

public SelectedTag getInputOrder()
Gets the input order.

Returns:
the current input order

setInputOrder

public void setInputOrder(SelectedTag value)
Sets the input order.

Parameters:
value - new input order

inputOrderTipText

public java.lang.String inputOrderTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getOrderedFlag

public boolean getOrderedFlag()
Gets the ordered flag (option O).

Returns:
true if ordered flag is set

getNoiseRate

public double getNoiseRate()
Gets the percentage of noise set.

Returns:
the percentage of noise set

setNoiseRate

public void setNoiseRate(double newNoiseRate)
Sets the percentage of noise set.

Parameters:
newNoiseRate - new percentage of noise
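
As a rough illustration of what a noise percentage means in practice, the sketch below validates a rate against the 0% to 30% limit documented for option P and converts it to a number of instances. The class NoiseRateSketch and both methods are hypothetical, and computing the noise count as a rounded share of a given instance total is an assumption; this page does not say how the generator itself applies the rate.

```java
/** Illustrative only: hypothetical helper, not part of Weka. */
final class NoiseRateSketch {

    /** Rejects rates outside the documented 0..30 percent range (inclusive bounds assumed). */
    static void checkNoiseRate(double percent) {
        if (percent < 0.0 || percent > 30.0) {
            throw new IllegalArgumentException("Noise rate must be between 0 and 30 percent: " + percent);
        }
    }

    /** Converts a noise rate in percent to a count, relative to an assumed instance total. */
    static int noiseCount(int totalInstances, double percent) {
        checkNoiseRate(percent);
        return (int) Math.round(totalInstances * percent / 100.0);
    }

    public static void main(String[] args) {
        // 10 percent noise relative to 200 instances.
        System.out.println(noiseCount(200, 10.0));
    }
}
```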

noiseRateTipText

public java.lang.String noiseRateTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getSingleModeFlag

public boolean getSingleModeFlag()
Gets the single mode flag.

Specified by:
getSingleModeFlag in class DataGenerator
Returns:
true if method generateExample can be used.

defineDataFormat

public Instances defineDataFormat()
                           throws java.lang.Exception
Initializes the format for the dataset produced.

Overrides:
defineDataFormat in class DataGenerator
Returns:
the output data format
Throws:
java.lang.Exception - data format could not be defined
See Also:
DataGenerator.defaultRelationName()

generateExample

public Instance generateExample()
                         throws java.lang.Exception
Generate an example of the dataset.

Specified by:
generateExample in class DataGenerator
Returns:
the instance generated
Throws:
java.lang.Exception - if format not defined or generating
examples one by one is not possible, because voting is chosen

generateExamples

public Instances generateExamples()
                           throws java.lang.Exception
Generate all examples of the dataset.

Specified by:
generateExamples in class DataGenerator
Returns:
the instances generated
Throws:
java.lang.Exception - if format not defined

generateExamples

public Instances generateExamples(java.util.Random random,
                                  Instances format)
                           throws java.lang.Exception
Generate all examples of the dataset.

Parameters:
random - the random number generator to use
format - the dataset format
Returns:
the instances generated
Throws:
java.lang.Exception - if format not defined

generateFinished

public java.lang.String generateFinished()
                                  throws java.lang.Exception
Compiles documentation about the data generation after the generation process

Specified by:
generateFinished in class DataGenerator
Returns:
string with additional information about generated dataset
Throws:
java.lang.Exception - no input structure has been defined

generateStart

public java.lang.String generateStart()
Compiles documentation about the data generation before the generation process

Specified by:
generateStart in class DataGenerator
Returns:
string with additional information

getRevision

public java.lang.String getRevision()
Returns the revision string.

Specified by:
getRevision in interface RevisionHandler
Returns:
the revision

main

public static void main(java.lang.String[] args)
Main method for testing this class.

Parameters:
args - should contain arguments for the data producer: