Class NGramTokenizer

  • All Implemented Interfaces:
    java.io.Serializable, java.util.Enumeration, OptionHandler, RevisionHandler

    public class NGramTokenizer
    extends CharacterDelimitedTokenizer
    Splits a string into n-grams whose sizes range from a minimum to a maximum number of words.

    Valid options are:

     -delimiters <value>
      The delimiters to use
      (default ' \r\n\t.,;:'"()?!').
     
     -max <int>
      The max size of the Ngram (default = 3).
     
     -min <int>
      The min size of the Ngram (default = 1).
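
The effect of the three options can be illustrated with a minimal, self-contained sketch. It does not use this class: `NGramSketch` and its `ngrams` helper are hypothetical names introduced only for illustration, and the largest-first ordering is an assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class NGramSketch {

    /** Hypothetical helper: builds word n-grams for every size from max down to min. */
    static List<String> ngrams(String text, String delimiters, int min, int max) {
        // Split on any delimiter character; StringTokenizer never returns empty tokens.
        List<String> words = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }

        List<String> result = new ArrayList<>();
        // Assumption: larger n-grams come first. Sizes below 1 are ignored.
        for (int n = max; n >= Math.max(1, min); n--) {
            for (int start = 0; start + n <= words.size(); start++) {
                result.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Uses the documented defaults: delimiters, min = 1, max = 3.
        String defaults = " \r\n\t.,;:'\"()?!";
        for (String gram : ngrams("the quick brown fox", defaults, 1, 3)) {
            System.out.println(gram);
        }
    }
}
```

With the defaults, the four-word input above yields two 3-grams, three 2-grams, and four 1-grams.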
     
    Version:
    $Revision: 1.4 $
    Author:
    Sebastian Germesin (sebastian.germesin@dfki.de), FracPete (fracpete at waikato dot ac dot nz)
    See Also:
    Serialized Form
    • Constructor Detail

      • NGramTokenizer

        public NGramTokenizer()
    • Method Detail

      • globalInfo

        public java.lang.String globalInfo()
        Returns a string describing the tokenizer.
        Specified by:
        globalInfo in class Tokenizer
        Returns:
        a description suitable for displaying in the explorer/experimenter gui
      • setOptions

        public void setOptions(java.lang.String[] options)
                        throws java.lang.Exception
        Parses a given list of options.

        Valid options are:

         -delimiters <value>
          The delimiters to use
          (default ' \r\n\t.,;:'"()?!').
         
         -max <int>
          The max size of the Ngram (default = 3).
         
         -min <int>
          The min size of the Ngram (default = 1).
         
        Specified by:
        setOptions in interface OptionHandler
        Overrides:
        setOptions in class CharacterDelimitedTokenizer
        Parameters:
        options - the list of options as an array of strings
        Throws:
        java.lang.Exception - if an option is not supported
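
The options array alternates flags and values. The sketch below shows that shape with a stand-in parser; `OptionSketch` is a hypothetical class, not the parser used by this library, and it does not handle escaped delimiter strings. It only mirrors the documented flags, defaults, and the rule that an unsupported option raises an exception.

```java
class OptionSketch {
    int min = 1;                                // documented default for -min
    int max = 3;                                // documented default for -max
    String delimiters = " \r\n\t.,;:'\"()?!";   // documented default for -delimiters

    /** Hypothetical parser: reads flag/value pairs and rejects unknown flags. */
    void setOptions(String[] options) throws Exception {
        for (int i = 0; i < options.length; i += 2) {
            String flag = options[i];
            boolean known = flag.equals("-min") || flag.equals("-max")
                    || flag.equals("-delimiters");
            if (!known) {
                throw new Exception("Unsupported option: " + flag);
            }
            if (i + 1 >= options.length) {
                throw new Exception("Missing value for " + flag);
            }
            String value = options[i + 1];
            if (flag.equals("-min")) {
                min = Integer.parseInt(value);
            } else if (flag.equals("-max")) {
                max = Integer.parseInt(value);
            } else {
                delimiters = value;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        OptionSketch sketch = new OptionSketch();
        sketch.setOptions(new String[] {"-min", "2", "-max", "4"});
        System.out.println(sketch.min + ".." + sketch.max);
    }
}
```

Options that are not supplied keep their defaults, so passing only `-max` leaves the minimum size at 1.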
      • getNGramMaxSize

        public int getNGramMaxSize()
        Gets the max N of the NGram.
        Returns:
        the size (N) of the NGram.
      • setNGramMaxSize

        public void setNGramMaxSize(int value)
        Sets the max size of the Ngram.
        Parameters:
        value - the size of the NGram.
      • NGramMaxSizeTipText

        public java.lang.String NGramMaxSizeTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setNGramMinSize

        public void setNGramMinSize(int value)
        Sets the min size of the Ngram.
        Parameters:
        value - the size of the NGram.
      • getNGramMinSize

        public int getNGramMinSize()
        Gets the min N of the NGram.
        Returns:
        the size (N) of the NGram.
      • NGramMinSizeTipText

        public java.lang.String NGramMinSizeTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • hasMoreElements

        public boolean hasMoreElements()
        Returns true if there are more elements available.
        Specified by:
        hasMoreElements in interface java.util.Enumeration
        Specified by:
        hasMoreElements in class Tokenizer
        Returns:
        true if there are more elements available
      • nextElement

        public java.lang.Object nextElement()
        Returns the next n-gram. Successive calls yield N-grams as well as (N-1)-grams, and so on down to 1-grams.
        Specified by:
        nextElement in interface java.util.Enumeration
        Specified by:
        nextElement in class Tokenizer
        Returns:
        the next element
      • tokenize

        public void tokenize(java.lang.String s)
        Sets the string to tokenize. Tokenization happens immediately.
        Specified by:
        tokenize in class Tokenizer
        Parameters:
        s - the string to tokenize
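
The tokenize, hasMoreElements, nextElement cycle can be sketched with a small stand-in that keeps a cursor over the split words. `MiniNGramTokenizer` is a hypothetical class written for this illustration: it splits on whitespace only, and both the cursor logic and the largest-first ordering are assumptions, not a description of this class's internals.

```java
import java.util.Arrays;
import java.util.Enumeration;
import java.util.NoSuchElementException;

class MiniNGramTokenizer implements Enumeration<String> {
    private final int min;
    private final int max;
    private String[] words = new String[0];
    private int size;      // n-gram size currently being emitted
    private int position;  // index of the first word of the next n-gram

    MiniNGramTokenizer(int min, int max) {
        this.min = Math.max(1, min);
        this.max = max;
    }

    /** Splits the string immediately and resets the cursor. */
    void tokenize(String s) {
        String trimmed = s.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        size = Math.min(max, words.length);  // sizes longer than the input yield nothing
        position = 0;
    }

    @Override
    public boolean hasMoreElements() {
        return size >= min && position + size <= words.length;
    }

    @Override
    public String nextElement() {
        if (!hasMoreElements()) {
            throw new NoSuchElementException();
        }
        String gram = String.join(" ",
                Arrays.copyOfRange(words, position, position + size));
        position++;
        if (position + size > words.length) {
            // Current size exhausted: move on to the next smaller size.
            size--;
            position = 0;
        }
        return gram;
    }

    public static void main(String[] args) {
        MiniNGramTokenizer tokenizer = new MiniNGramTokenizer(1, 2);
        tokenizer.tokenize("to be or");
        while (tokenizer.hasMoreElements()) {
            System.out.println(tokenizer.nextElement());
        }
    }
}
```

Calling `tokenize` again replaces the pending tokens, so one instance can be reused for several strings in turn.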
      • getRevision

        public java.lang.String getRevision()
        Returns the revision string.
        Returns:
        the revision
      • main

        public static void main(java.lang.String[] args)
        Runs the tokenizer with the given options and strings to tokenize. The tokens are printed to stdout.
        Parameters:
        args - the commandline options and strings to tokenize