public class BIRCHCluster extends ClusterGenerator implements TechnicalInformationHandler
@inproceedings{Zhang1996,
author = {Tian Zhang and Raghu Ramakrishnan and Miron Livny},
booktitle = {ACM SIGMOD International Conference on Management of Data},
pages = {103-114},
publisher = {ACM Press},
title = {BIRCH: An Efficient Data Clustering Method for Very Large Databases},
year = {1996}
}
Valid options are:
-h Prints this help.
-o <file> The name of the output file, otherwise the generated data is printed to stdout.
-r <name> The name of the relation.
-d Whether to print debug information.
-S The seed for the random function (default 1).
-a <num> The number of attributes (default 10).
-c Class flag; if set, the cluster is listed in an extra attribute.
-b <range> The indices for boolean attributes.
-m <range> The indices for nominal attributes.
-k <num> The number of clusters (default 4)
-G Set pattern to grid (default is random). This flag cannot be used at the same time as flag I. The pattern is random, if neither flag G nor flag I is set.
-I Set pattern to sine (default is random). This flag cannot be used at the same time as flag G. The pattern is random, if neither flag G nor flag I is set.
-N <num>..<num> The range of number of instances per cluster (default 1..50). Lower number must be between 0 and 2500, upper number must be between 50 and 2500.
-R <num>..<num> The range of radius per cluster (default 0.1..1.4142135623730951). Lower number must be between 0 and SQRT(2), upper number must be between SQRT(2) and SQRT(32).
-M <num> The distance multiplier (default 4.0).
-C <num> The number of cycles (default 4).
-O Flag for input order is ORDERED. If the flag is not set, the input order is RANDOMIZED. RANDOMIZED is currently not implemented, therefore the input order is always ORDERED.
-P <num> The noise rate in percent (default 0.0). Can be between 0% and 30%. (Remark: The original algorithm only allows noise up to 10%.)
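The `-N` and `-R` options both take a `<num>..<num>` range with documented limits. The following is a minimal sketch of how such a range could be parsed and checked against those limits. The class `RangeOptionSketch` and its methods are hypothetical and not part of Weka; treating the bounds as inclusive and requiring lower <= upper are assumptions, since the option text does not say.

```java
/** Hypothetical helper (not Weka code): parses and checks "<num>..<num>" ranges. */
final class RangeOptionSketch {

    /** Splits "lo..hi" at the first ".." and returns the two trimmed parts. */
    static String[] splitRange(String text) {
        int sep = text.indexOf("..");
        if (sep < 0) {
            throw new IllegalArgumentException("Expected <num>..<num>, got: " + text);
        }
        return new String[] {text.substring(0, sep).trim(), text.substring(sep + 2).trim()};
    }

    /** -N: lower in [0, 2500], upper in [50, 2500] (inclusive bounds are an assumption). */
    static int[] instanceRange(String text) {
        String[] parts = splitRange(text);
        int lo = Integer.parseInt(parts[0]);
        int hi = Integer.parseInt(parts[1]);
        if (lo < 0 || lo > 2500) {
            throw new IllegalArgumentException("Lower number must be between 0 and 2500: " + lo);
        }
        if (hi < 50 || hi > 2500) {
            throw new IllegalArgumentException("Upper number must be between 50 and 2500: " + hi);
        }
        if (lo > hi) { // assumption: the option text does not state this rule explicitly
            throw new IllegalArgumentException("Lower number exceeds upper number");
        }
        return new int[] {lo, hi};
    }

    /** -R: lower in [0, SQRT(2)], upper in [SQRT(2), SQRT(32)] (inclusive is an assumption). */
    static double[] radiusRange(String text) {
        String[] parts = splitRange(text);
        double lo = Double.parseDouble(parts[0]);
        double hi = Double.parseDouble(parts[1]);
        double sqrt2 = Math.sqrt(2.0);
        double sqrt32 = Math.sqrt(32.0);
        if (lo < 0.0 || lo > sqrt2) {
            throw new IllegalArgumentException("Lower radius must be between 0 and SQRT(2): " + lo);
        }
        if (hi < sqrt2 || hi > sqrt32) {
            throw new IllegalArgumentException("Upper radius must be between SQRT(2) and SQRT(32): " + hi);
        }
        return new double[] {lo, hi};
    }

    public static void main(String[] args) {
        int[] n = instanceRange("1..50");
        double[] r = radiusRange("0.1..1.4142135623730951");
        System.out.println(n[0] + ".." + n[1] + " instances, radius " + r[0] + ".." + r[1]);
    }
}
```

Both documented defaults (`1..50` and `0.1..1.4142135623730951`) pass these checks; a value such as `1..3000` is rejected with an `IllegalArgumentException`.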
| Modifier and Type | Field and Description |
|---|---|
| `static int` | `GRID`: Constant set for choice of pattern. |
| `static int` | `ORDERED`: Constant set for input order (option O) |
| `static int` | `RANDOM`: Constant set for choice of pattern. |
| `static int` | `RANDOMIZED`: Constant set for input order (default) |
| `static int` | `SINE`: Constant set for choice of pattern. |
| `static Tag[]` | `TAGS_INPUTORDER`: the input order tags |
| `static Tag[]` | `TAGS_PATTERN`: the pattern tags |
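The pattern constants correspond to the `-G` and `-I` flags, which the option list describes as mutually exclusive with random as the fallback. A minimal sketch of that rule follows. The class `PatternFlagSketch` and its enum are hypothetical stand-ins; they do not use the `int` constants or `SelectedTag` values of the real class, and throwing an exception on a conflict is an assumption.

```java
/** Hypothetical sketch (not Weka code): resolves the -G / -I flags to a pattern. */
final class PatternFlagSketch {

    /** Illustrative stand-in for the GRID, SINE and RANDOM constants. */
    enum Pattern { RANDOM, GRID, SINE }

    /**
     * Applies the rule from the option text: -G and -I cannot be combined,
     * and the pattern is random when neither flag is set.
     */
    static Pattern resolve(boolean gridFlag, boolean sineFlag) {
        if (gridFlag && sineFlag) {
            throw new IllegalArgumentException("Flags -G and -I cannot be used at the same time");
        }
        if (gridFlag) {
            return Pattern.GRID;
        }
        if (sineFlag) {
            return Pattern.SINE;
        }
        return Pattern.RANDOM;
    }

    public static void main(String[] args) {
        System.out.println(resolve(false, false));
        System.out.println(resolve(true, false));
    }
}
```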
| Constructor and Description |
|---|
| `BIRCHCluster()`: initializes the generator with default values |
| Modifier and Type | Method and Description |
|---|---|
| `Instances` | `defineDataFormat()`: Initializes the format for the dataset produced. |
| `java.lang.String` | `distMultTipText()`: Returns the tip text for this property. |
| `Instance` | `generateExample()`: Generate an example of the dataset. |
| `Instances` | `generateExamples()`: Generate all examples of the dataset. |
| `Instances` | `generateExamples(java.util.Random random, Instances format)`: Generate all examples of the dataset. |
| `java.lang.String` | `generateFinished()`: Compiles documentation about the data generation after the generation process. |
| `java.lang.String` | `generateStart()`: Compiles documentation about the data generation before the generation process. |
| `double` | `getDistMult()`: Gets the distance multiplier. |
| `SelectedTag` | `getInputOrder()`: Gets the input order. |
| `int` | `getMaxInstNum()`: Gets the upper boundary for instances per cluster. |
| `double` | `getMaxRadius()`: Gets the upper boundary for the radiuses of the clusters. |
| `int` | `getMinInstNum()`: Gets the lower boundary for instances per cluster. |
| `double` | `getMinRadius()`: Gets the lower boundary for the radiuses of the clusters. |
| `double` | `getNoiseRate()`: Gets the percentage of noise set. |
| `int` | `getNumClusters()`: Gets the number of clusters the dataset should have. |
| `int` | `getNumCycles()`: Gets the number of cycles. |
| `java.lang.String[]` | `getOptions()`: Gets the current settings of the datagenerator BIRCHCluster. |
| `boolean` | `getOrderedFlag()`: Gets the ordered flag (option O). |
| `SelectedTag` | `getPattern()`: Gets the pattern type. |
| `java.lang.String` | `getRevision()`: Returns the revision string. |
| `boolean` | `getSingleModeFlag()`: Gets the single mode flag. |
| `TechnicalInformation` | `getTechnicalInformation()`: Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on. |
| `java.lang.String` | `globalInfo()`: Returns a string describing this data generator. |
| `java.lang.String` | `inputOrderTipText()`: Returns the tip text for this property. |
| `java.util.Enumeration<Option>` | `listOptions()`: Returns an enumeration describing the available options. |
| `static void` | `main(java.lang.String[] args)`: Main method for testing this class. |
| `java.lang.String` | `maxInstNumTipText()`: Returns the tip text for this property. |
| `java.lang.String` | `maxRadiusTipText()`: Returns the tip text for this property. |
| `java.lang.String` | `minInstNumTipText()`: Returns the tip text for this property. |
| `java.lang.String` | `minRadiusTipText()`: Returns the tip text for this property. |
| `java.lang.String` | `noiseRateTipText()`: Returns the tip text for this property. |
| `java.lang.String` | `numClustersTipText()`: Returns the tip text for this property. |
| `java.lang.String` | `numCyclesTipText()`: Returns the tip text for this property. |
| `java.lang.String` | `patternTipText()`: Returns the tip text for this property. |
| `void` | `setDistMult(double newDistMult)`: Sets the distance multiplier. |
| `void` | `setInputOrder(SelectedTag value)`: Sets the input order. |
| `void` | `setMaxInstNum(int newMaxInstNum)`: Sets the upper boundary for instances per cluster. |
| `void` | `setMaxRadius(double newMaxRadius)`: Sets the upper boundary for the radiuses of the clusters. |
| `void` | `setMinInstNum(int newMinInstNum)`: Sets the lower boundary for instances per cluster. |
| `void` | `setMinRadius(double newMinRadius)`: Sets the lower boundary for the radiuses of the clusters. |
| `void` | `setNoiseRate(double newNoiseRate)`: Sets the percentage of noise set. |
| `void` | `setNumClusters(int numClusters)`: Sets the number of clusters the dataset should have. |
| `void` | `setNumCycles(int newNumCycles)`: Sets the number of cycles. |
| `void` | `setOptions(java.lang.String[] options)`: Parses a list of options for this object. |
| `void` | `setPattern(SelectedTag value)`: Sets the pattern type. |
Methods inherited from class ClusterGenerator: booleanColsTipText, classFlagTipText, getBooleanCols, getClassFlag, getNominalCols, getNumAttributes, nominalColsTipText, numAttributesTipText, setBooleanCols, setBooleanIndices, setClassFlag, setNominalCols, setNominalIndices, setNumAttributes
Methods inherited from class DataGenerator: debugTipText, defaultOutput, enumToVector, formatTipText, getDatasetFormat, getDebug, getNumExamplesAct, getOutput, getRandom, getRelationName, getSeed, makeData, outputTipText, randomTipText, relationNameTipText, runDataGenerator, seedTipText, setDatasetFormat, setDebug, setOutput, setRandom, setRelationName, setSeed
public static final int GRID
public static final int SINE
public static final int RANDOM
public static final Tag[] TAGS_PATTERN
public static final int ORDERED
public static final int RANDOMIZED
public static final Tag[] TAGS_INPUTORDER
public java.lang.String globalInfo()
public TechnicalInformation getTechnicalInformation()
getTechnicalInformation in interface TechnicalInformationHandler
public java.util.Enumeration<Option> listOptions()
listOptions in interface OptionHandler
listOptions in class ClusterGenerator
public void setOptions(java.lang.String[] options) throws java.lang.Exception
setOptions in interface OptionHandler
setOptions in class ClusterGenerator
options - the list of options as an array of strings
java.lang.Exception - if an option is not supported
public java.lang.String[] getOptions()
getOptions in interface OptionHandler
getOptions in class ClusterGenerator
See also: DataGenerator.removeBlacklist(String[])
public void setNumClusters(int numClusters)
numClusters - the new number of clusters
public int getNumClusters()
public java.lang.String numClustersTipText()
public int getMinInstNum()
public void setMinInstNum(int newMinInstNum)
newMinInstNum - new lower boundary for instances per cluster
public java.lang.String minInstNumTipText()
public int getMaxInstNum()
public void setMaxInstNum(int newMaxInstNum)
newMaxInstNum - new upper boundary for instances per cluster
public java.lang.String maxInstNumTipText()
public double getMinRadius()
public void setMinRadius(double newMinRadius)
newMinRadius - new lower boundary for the radiuses of the clusters
public java.lang.String minRadiusTipText()
public double getMaxRadius()
public void setMaxRadius(double newMaxRadius)
newMaxRadius - new upper boundary for the radiuses of the clusters
public java.lang.String maxRadiusTipText()
public SelectedTag getPattern()
public void setPattern(SelectedTag value)
value - new pattern type
public java.lang.String patternTipText()
public double getDistMult()
public void setDistMult(double newDistMult)
newDistMult - new distance multiplier
public java.lang.String distMultTipText()
public int getNumCycles()
public void setNumCycles(int newNumCycles)
newNumCycles - new number of cycles
public java.lang.String numCyclesTipText()
public SelectedTag getInputOrder()
public void setInputOrder(SelectedTag value)
value - new input order
public java.lang.String inputOrderTipText()
public boolean getOrderedFlag()
public double getNoiseRate()
public void setNoiseRate(double newNoiseRate)
newNoiseRate - new percentage of noise
public java.lang.String noiseRateTipText()
public boolean getSingleModeFlag()
getSingleModeFlag in class DataGenerator
public Instances defineDataFormat() throws java.lang.Exception
defineDataFormat in class DataGenerator
java.lang.Exception - data format could not be defined
See also: DataGenerator.defaultRelationName()
public Instance generateExample() throws java.lang.Exception
generateExample in class DataGenerator
java.lang.Exception - if format not defined or generating
public Instances generateExamples() throws java.lang.Exception
generateExamples in class DataGenerator
java.lang.Exception - if format not defined
public Instances generateExamples(java.util.Random random, Instances format) throws java.lang.Exception
random - the random number generator to use
format - the dataset format
java.lang.Exception - if format not defined
public java.lang.String generateFinished() throws java.lang.Exception
generateFinished in class DataGenerator
java.lang.Exception - no input structure has been defined
public java.lang.String generateStart()
generateStart in class DataGenerator
public java.lang.String getRevision()
getRevision in interface RevisionHandler
public static void main(java.lang.String[] args)
args - should contain arguments for the data producer:
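The argument array uses the option flags documented for this class. As a rough illustration, the sketch below assembles such an array for a few of them (`-k`, `-N`, `-P`, `-G`). The class `BirchArgsSketch` and its `buildArgs` method are hypothetical and do not call Weka. The 0 to 30 noise check mirrors the documented range for `-P`, and treating that range as inclusive is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not Weka code): builds an argument array from a few settings. */
final class BirchArgsSketch {

    /**
     * Assembles arguments for the documented flags -k, -N, -P and -G.
     * All other options are left at their defaults by omitting them.
     */
    static String[] buildArgs(int numClusters, String instanceRange,
                              double noisePercent, boolean gridPattern) {
        // The option text allows a noise rate between 0% and 30%.
        if (noisePercent < 0.0 || noisePercent > 30.0) {
            throw new IllegalArgumentException("Noise rate must be between 0 and 30: " + noisePercent);
        }
        List<String> args = new ArrayList<>();
        args.add("-k");
        args.add(Integer.toString(numClusters));
        args.add("-N");
        args.add(instanceRange);
        args.add("-P");
        args.add(Double.toString(noisePercent));
        if (gridPattern) {
            args.add("-G"); // omit the flag to keep the default random pattern
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] ignored) {
        System.out.println(String.join(" ", buildArgs(4, "1..50", 5.0, true)));
    }
}
```

An array built this way has the shape that `main` and `setOptions` accept; whether a particular combination is valid is still decided by the class itself.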