public class DictionarySaver extends AbstractFileSaver implements BatchConverter, IncrementalConverter
-binary-dict Save as a binary serialized dictionary
-R <range> Specify range of attributes to act on. This is a comma separated list of attribute indices, with "first" and "last" valid values.
-V Set the attribute selection mode. If false, only selected attributes in the range will be worked on. If true, only non-selected attributes will be processed.
-L Convert all tokens to lowercase when matching against dictionary entries.
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-stopwords-handler <spec> The stopwords handler to use (default = Null)
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-P <integer> Prune the dictionary every x instances (default = 0 - i.e. no periodic pruning)
-W <integer> The number of words (per class if there is a class attribute assigned) to attempt to keep.
-M <integer> The minimum term frequency to use when pruning the dictionary (default = 1).
-O If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but are based on the documents in all the classes (even if a class attribute is set).
-sort Sort the dictionary alphabetically
-i <the input file> The input file
-o <the output file> The output file
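The interplay of -L, -M, -W and -sort can be sketched in plain Java. This is only an illustration under stated assumptions, not Weka's implementation: the class `DictionaryPruningSketch` and its methods `countTerms` and `prune` are hypothetical, tokens are split on whitespace rather than by a configurable tokenizer, per-class counting is ignored (as if -O were set), and the word limit is applied as a hard cut-off, whereas the option only promises to "attempt to keep" that many words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration only; this class is not part of Weka.
final class DictionaryPruningSketch {

    // Counts whitespace-separated tokens; lowerCase mimics the effect of -L.
    static Map<String, Integer> countTerms(List<String> documents, boolean lowerCase) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            for (String token : doc.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                String term = lowerCase ? token.toLowerCase(Locale.ROOT) : token;
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Drops terms below minTermFreq (like -M), keeps at most wordsToKeep of the
    // most frequent terms (a strict simplification of -W), and returns them
    // sorted alphabetically (like -sort).
    static SortedMap<String, Integer> prune(Map<String, Integer> counts,
                                            int minTermFreq, int wordsToKeep) {
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minTermFreq) {
                kept.add(e);
            }
        }
        // Most frequent first; ties are broken alphabetically to stay deterministic.
        kept.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        SortedMap<String, Integer> dictionary = new TreeMap<>();
        for (Map.Entry<String, Integer> e : kept.subList(0, Math.min(wordsToKeep, kept.size()))) {
            dictionary.put(e.getKey(), e.getValue());
        }
        return dictionary;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("The cat sat", "the cat ran", "A dog");
        Map<String, Integer> counts = countTerms(docs, true);
        System.out.println(prune(counts, 2, 10)); // prints {cat=2, the=2}
        System.out.println(prune(counts, 1, 3));  // prints {a=1, cat=2, the=2}
    }
}
```

In the second call the limit of three words is filled by frequency first, so "a" is kept only because it wins the alphabetical tie among the single-occurrence terms.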
Fields inherited from interface Saver: BATCH, INCREMENTAL, NONE

| Constructor and Description |
|---|
| DictionarySaver() |

| Modifier and Type | Method and Description |
|---|---|
| java.lang.String | getAttributeIndices() Gets the current range selection. |
| Capabilities | getCapabilities() Returns the Capabilities of this saver. |
| boolean | getDoNotOperateOnPerClassBasis() Get the DoNotOperateOnPerClassBasis value. |
| java.lang.String | getFileDescription() To be overridden. |
| boolean | getInvertSelection() Gets whether the supplied columns are to be processed or skipped. |
| boolean | getKeepDictionarySorted() Get whether to keep the dictionary sorted alphabetically or not. |
| boolean | getLowerCaseTokens() Gets whether the tokens are to be downcased. |
| int | getMinTermFreq() Get the MinTermFreq value. |
| long | getPeriodicPruning() Gets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size. |
| java.lang.String | getRevision() Returns the revision string. |
| boolean | getSaveBinaryDictionary() Get whether to save the dictionary as a binary serialized dictionary, rather than a plain text one. |
| Stemmer | getStemmer() Returns the current stemming algorithm, null if none is used. |
| StopwordsHandler | getStopwordsHandler() Gets the stopwords handler. |
| Tokenizer | getTokenizer() Returns the current tokenizer algorithm. |
| int | getWordsToKeep() Gets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
| java.lang.String | globalInfo() Returns a string describing this Saver. |
| static void | main(java.lang.String[] args) |
| void | resetOptions() Resets the options. |
| void | resetWriter() Sets the writer to null. |
| void | setAttributeIndices(java.lang.String rangeList) Sets which attributes are to be worked on. |
| void | setDestination(java.io.OutputStream output) Sets the destination output stream. |
| void | setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis) Set the DoNotOperateOnPerClassBasis value. |
| void | setInvertSelection(boolean invert) Sets whether selected columns should be processed or skipped. |
| void | setKeepDictionarySorted(boolean sorted) Set whether to keep the dictionary sorted alphabetically or not. |
| void | setLowerCaseTokens(boolean downCaseTokens) Sets whether the tokens are to be downcased. |
| void | setMinTermFreq(int newMinTermFreq) Set the MinTermFreq value. |
| void | setPeriodicPruning(long newPeriodicPruning) Sets the rate at which the dictionary is periodically pruned, as a percentage of the dataset size. |
| void | setSaveBinaryDictionary(boolean binary) Set whether to save the dictionary as a binary serialized dictionary, rather than a plain text one. |
| void | setStemmer(Stemmer value) Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used). |
| void | setStopwordsHandler(StopwordsHandler value) Sets the stopwords handler to use. |
| void | setTokenizer(Tokenizer value) Sets the tokenizer algorithm to use. |
| void | setWordsToKeep(int newWordsToKeep) Sets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
| void | writeBatch() Writes to a file in batch mode. To be overridden. |
| void | writeIncremental(Instance inst) Method for incremental saving. |
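How a range list passed to setAttributeIndices combines with setInvertSelection can be sketched as follows. This is a minimal plain-Java illustration, not Weka's own range handling: the class `AttributeRangeSketch` and its `select` method are hypothetical, and the sketch only accepts what the option text describes (a comma-separated list of 1-based indices plus "first" and "last"), so any other syntax is rejected.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; this class is not part of Weka.
final class AttributeRangeSketch {

    // Resolves a 1-based, comma-separated index list to 0-based attribute positions.
    // With invert == true the complement of the listed attributes is returned.
    static SortedSet<Integer> select(String rangeList, int numAttributes, boolean invert) {
        SortedSet<Integer> listed = new TreeSet<>();
        for (String part : rangeList.split(",")) {
            String token = part.trim();
            int oneBased;
            if (token.equals("first")) {
                oneBased = 1;
            } else if (token.equals("last")) {
                oneBased = numAttributes;
            } else {
                try {
                    oneBased = Integer.parseInt(token);
                } catch (NumberFormatException e) {
                    throw new IllegalArgumentException("Invalid range list: " + rangeList, e);
                }
            }
            if (oneBased < 1 || oneBased > numAttributes) {
                throw new IllegalArgumentException("Index out of range: " + token);
            }
            listed.add(oneBased - 1); // user-facing indices start at 1
        }
        if (!invert) {
            return listed;
        }
        SortedSet<Integer> complement = new TreeSet<>();
        for (int i = 0; i < numAttributes; i++) {
            if (!listed.contains(i)) {
                complement.add(i);
            }
        }
        return complement;
    }

    public static void main(String[] args) {
        System.out.println(select("first,3,last", 5, false)); // prints [0, 2, 4]
        System.out.println(select("first,3,last", 5, true));  // prints [1, 3]
    }
}
```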
Methods inherited from class AbstractFileSaver: cancel, filePrefix, getFileExtension, getFileExtensions, getOptions, getUseRelativePath, getWriter, listOptions, retrieveDir, retrieveFile, runFileSaver, setDestination, setDir, setDirAndPrefix, setEnvironment, setFile, setFilePrefix, setOptions, setUseRelativePath, useRelativePathTipText
Methods inherited from class AbstractSaver: doNotCheckCapabilitiesTipText, getDoNotCheckCapabilities, getInstances, getWriteMode, resetStructure, setDoNotCheckCapabilities, setInstances, setRetrieval, setStructure
public java.lang.String globalInfo()
@OptionMetadata(displayName="Save dictionary in binary form", description="Save as a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2)
public void setSaveBinaryDictionary(boolean binary)
Parameters: binary - true if the dictionary is to be saved as binary rather than plain text
public boolean getSaveBinaryDictionary()
public java.lang.String getAttributeIndices()
@OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4)
public void setAttributeIndices(java.lang.String rangeList)
Parameters: rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
Throws: java.lang.IllegalArgumentException - if an invalid range list is supplied
public boolean getInvertSelection()
@OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5)
public void setInvertSelection(boolean invert)
Parameters: invert - the new invert setting
public boolean getLowerCaseTokens()
@OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10)
public void setLowerCaseTokens(boolean downCaseTokens)
Parameters: downCaseTokens - should be true if only lower case tokens are to be formed.
@OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11)
public void setStemmer(Stemmer value)
Parameters: value - the configured stemming algorithm, or null
See Also: NullStemmer
public Stemmer getStemmer()
@OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12)
public void setStopwordsHandler(StopwordsHandler value)
Parameters: value - the stopwords handler; if null, Null is used
public StopwordsHandler getStopwordsHandler()
@OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13)
public void setTokenizer(Tokenizer value)
Parameters: value - the configured tokenizing algorithm
public Tokenizer getTokenizer()
public long getPeriodicPruning()
@OptionMetadata(displayName="Periodic pruning rate", description="Prune the dictionary every x instances\n(default = 0 - i.e. no periodic pruning)", commandLineParamName="P", commandLineParamSynopsis="-P <integer>", displayOrder=14)
public void setPeriodicPruning(long newPeriodicPruning)
Parameters: newPeriodicPruning - the rate at which the dictionary is periodically pruned
public int getWordsToKeep()
@OptionMetadata(displayName="Number of words to attempt to keep", description="The number of words (per class if there is a class attribute assigned) to attempt to keep.", commandLineParamName="W", commandLineParamSynopsis="-W <integer>", displayOrder=15)
public void setWordsToKeep(int newWordsToKeep)
Parameters: newWordsToKeep - the target number of words in the output vector (per class if assigned).
public int getMinTermFreq()
@OptionMetadata(displayName="Minimum term frequency", description="The minimum term frequency to use when pruning the dictionary\n(default = 1).", commandLineParamName="M", commandLineParamSynopsis="-M <integer>", displayOrder=16)
public void setMinTermFreq(int newMinTermFreq)
Parameters: newMinTermFreq - The new MinTermFreq value.
public boolean getDoNotOperateOnPerClassBasis()
@OptionMetadata(displayName="Do not operate on a per-class basis", description="If this is set, the maximum number of words and the\nminimum term frequency is not enforced on a per-class\nbasis but based on the documents in all the classes\n(even if a class attribute is set).", commandLineParamName="O", commandLineParamSynopsis="-O", commandLineParamIsFlag=true, displayOrder=17)
public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
Parameters: newDoNotOperateOnPerClassBasis - The new DoNotOperateOnPerClassBasis value.
@OptionMetadata(displayName="Sort dictionary", description="Sort the dictionary alphabetically", commandLineParamName="sort", commandLineParamSynopsis="-sort", commandLineParamIsFlag=true, displayOrder=18)
public void setKeepDictionarySorted(boolean sorted)
Parameters: sorted - true to keep the dictionary sorted
public boolean getKeepDictionarySorted()
public Capabilities getCapabilities()
getCapabilities in interface CapabilitiesHandler
getCapabilities in class AbstractSaver
See Also: Capabilities
public java.lang.String getFileDescription()
Description copied from class: AbstractFileSaver
getFileDescription in interface FileSourcedConverter
getFileDescription in class AbstractFileSaver
public void writeIncremental(Instance inst) throws java.io.IOException
Description copied from class: AbstractSaver
writeIncremental in interface Saver
writeIncremental in class AbstractSaver
Parameters: inst - the instance to be saved
Throws: java.io.IOException - if the instance cannot be written to the specified destination
public void writeBatch() throws java.io.IOException
Description copied from class: AbstractSaver
writeBatch in interface Saver
writeBatch in class AbstractSaver
Throws: java.io.IOException - if writing is not possible
public void resetOptions()
Description copied from class: AbstractFileSaver
resetOptions in class AbstractFileSaver
public void resetWriter()
Description copied from class: AbstractFileSaver
resetWriter in class AbstractFileSaver
public void setDestination(java.io.OutputStream output) throws java.io.IOException
Description copied from class: AbstractFileSaver
setDestination in interface Saver
setDestination in class AbstractFileSaver
Parameters: output - the output stream.
Throws: java.io.IOException - if the destination cannot be set
public java.lang.String getRevision()
Description copied from interface: RevisionHandler
getRevision in interface RevisionHandler
public static void main(java.lang.String[] args)