Class ExternalSort


  • public class ExternalSort
    extends Object
    Goal: offer a generic external-memory sorting program in Java. It must be : - hackable (easy to adapt) - scalable to large files - sensibly efficient. This software is in the public domain. Usage: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt You can change the default maximal number of temporary files with the -t flag: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt -t 3 For very large files, you might want to use an appropriate flag to allocate more memory to the Java VM: java -Xms2G com/google/code/externalsorting/ExternalSort somefile.txt out.txt By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon Elsas, Christan Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan, Amit Jain, Thomas Mueller, Jerry Yang, First published: April 2010 originally posted at http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/
    • Field Detail

      • defaultcomparator

        public static Comparator<String> defaultcomparator
        default comparator between strings.
      • DEFAULTMAXTEMPFILES

        public static final int DEFAULTMAXTEMPFILES
        Default maximal number of temporary files allowed.
        See Also:
        Constant Field Values
    • Constructor Detail

      • ExternalSort

        public ExternalSort()
    • Method Detail

      • estimateAvailableMemory

        public static long estimateAvailableMemory()
        This method calls the garbage collector and then returns the free memory. This avoids problems with applications where the GC hasn't reclaimed memory and reports no available memory.
        Returns:
        available memory
      • estimateBestSizeOfBlocks

        public static long estimateBestSizeOfBlocks​(long sizeoffile,
                                                    int maxtmpfiles,
                                                    long maxMemory)
        we divide the file into small blocks. If the blocks are too small, we shall create too many temporary files. If they are too big, we shall be using too much memory.
        Parameters:
        sizeoffile - how much data (in bytes) can we expect
        maxtmpfiles - how many temporary files can we create (e.g., 1024)
        maxMemory - Maximum memory to use (in bytes)
        Returns:
        the estimate
      • main

        public static void main​(String[] args)
                         throws IOException
        Parameters:
        args - command line argument
        Throws:
        IOException - generic IO exception
      • mergeSortedFiles

        public static long mergeSortedFiles​(BufferedWriter fbw,
                                            Comparator<String> cmp,
                                            boolean distinct,
                                            List<com.google.code.externalsorting.BinaryFileBuffer> buffers)
                                     throws IOException
        This merges several BinaryFileBuffer to an output writer.
        Parameters:
        fbw - A buffer where we write the data.
        cmp - A comparator object that tells us how to sort the lines.
        distinct - Pass true if duplicate lines should be discarded.
        buffers - Where the data should be read.
        Returns:
        The number of lines sorted.
        Throws:
        IOException - generic IO exception
      • mergeSortedFiles

        public static long mergeSortedFiles​(List<File> files,
                                            File outputfile)
                                     throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files - The List of sorted Files to be merged.
        outputfile - The output File to merge the results to.
        Returns:
        The number of lines sorted.
        Throws:
        IOException - generic IO exception
      • mergeSortedFiles

        public static long mergeSortedFiles​(List<File> files,
                                            File outputfile,
                                            Comparator<String> cmp)
                                     throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files - The List of sorted Files to be merged.
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        Returns:
        The number of lines sorted.
        Throws:
        IOException - generic IO exception
      • mergeSortedFiles

        public static long mergeSortedFiles​(List<File> files,
                                            File outputfile,
                                            Comparator<String> cmp,
                                            boolean distinct)
                                     throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files - The List of sorted Files to be merged.
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        distinct - Pass true if duplicate lines should be discarded.
        Returns:
        The number of lines sorted.
        Throws:
        IOException - generic IO exception
      • mergeSortedFiles

        public static long mergeSortedFiles​(List<File> files,
                                            File outputfile,
                                            Comparator<String> cmp,
                                            Charset cs)
                                     throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files - The List of sorted Files to be merged.
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        cs - The Charset to be used for the byte to character conversion.
        Returns:
        The number of lines sorted.
        Throws:
        IOException - generic IO exception
      • mergeSortedFiles

        public static long mergeSortedFiles​(List<File> files,
                                            File outputfile,
                                            Comparator<String> cmp,
                                            Charset cs,
                                            boolean distinct)
                                     throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files - The List of sorted Files to be merged.
        distinct - Pass true if duplicate lines should be discarded.
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        cs - The Charset to be used for the byte to character conversion.
        Returns:
        The number of lines sorted.
        Throws:
        IOException - generic IO exception
        Since:
        v0.1.2
      • mergeSortedFiles

        public static long mergeSortedFiles​(List<File> files,
                                            File outputfile,
                                            Comparator<String> cmp,
                                            Charset cs,
                                            boolean distinct,
                                            boolean append,
                                            boolean usegzip)
                                     throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files - The List of sorted Files to be merged.
        distinct - Pass true if duplicate lines should be discarded.
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        cs - The Charset to be used for the byte to character conversion.
        append - Pass true if result should append to File instead of overwrite. Default to be false for overloading methods.
        usegzip - assumes we used gzip compression for temporary files
        Returns:
        The number of lines sorted.
        Throws:
        IOException - generic IO exception
        Since:
        v0.1.4
      • sort

        public static void sort​(File input,
                                File output)
                         throws IOException
        This sorts a file (input) to an output file (output) using default parameters
        Parameters:
        input - source file
        output - output file
        Throws:
        IOException - generic IO exception
      • sort

        public static void sort​(File input,
                                File output,
                                Comparator<String> cmp)
                         throws IOException
        This sorts a file (input) to an output file (output) using customized comparator
        Parameters:
        input - source file
        output - output file
        cmp - The Comparator to use to compare Strings.
        Throws:
        IOException - generic IO exception
      • sortAndSave

        public static File sortAndSave​(List<String> tmplist,
                                       Comparator<String> cmp,
                                       Charset cs,
                                       File tmpdirectory)
                                throws IOException
        Sort a list and save it to a temporary file
        Parameters:
        tmplist - data to be sorted
        cmp - string comparator
        cs - charset to use for output (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        Returns:
        the file containing the sorted data
        Throws:
        IOException - generic IO exception
      • sortAndSave

        public static File sortAndSave​(List<String> tmplist,
                                       Comparator<String> cmp,
                                       Charset cs,
                                       File tmpdirectory,
                                       boolean distinct,
                                       boolean usegzip,
                                       boolean parallel)
                                throws IOException
        Sort a list and save it to a temporary file
        Parameters:
        tmplist - data to be sorted
        cmp - string comparator
        cs - charset to use for output (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        usegzip - set to true if you are using gzip compression for the temporary files
        parallel - set to true when sorting in parallel
        Returns:
        the file containing the sorted data
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(BufferedReader fbr,
                                             long datalength)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        fbr - data source
        datalength - estimated data volume (in bytes)
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(BufferedReader fbr,
                                             long datalength,
                                             Comparator<String> cmp,
                                             boolean distinct)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        fbr - data source
        datalength - estimated data volume (in bytes)
        cmp - string comparator
        distinct - Pass true if duplicate lines should be discarded.
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(BufferedReader fbr,
                                             long datalength,
                                             Comparator<String> cmp,
                                             int maxtmpfiles,
                                             long maxMemory,
                                             Charset cs,
                                             File tmpdirectory,
                                             boolean distinct,
                                             int numHeader,
                                             boolean usegzip,
                                             boolean parallel)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        fbr - data source
        datalength - estimated data volume (in bytes)
        cmp - string comparator
        maxtmpfiles - maximal number of temporary files
        maxMemory - maximum amount of memory to use (in bytes)
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        numHeader - number of lines to preclude before sorting starts
        usegzip - use gzip compression for the temporary files
        parallel - sort in parallel
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(File file)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        file - some flat file
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        file - some flat file
        cmp - string comparator
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp,
                                             boolean distinct)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        file - some flat file
        cmp - string comparator
        distinct - Pass true if duplicate lines should be discarded.
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp,
                                             File tmpdirectory,
                                             boolean distinct,
                                             int numHeader)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - string comparator
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        numHeader - number of lines to preclude before sorting starts
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp,
                                             int maxtmpfiles,
                                             Charset cs,
                                             File tmpdirectory,
                                             boolean distinct)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - string comparator
        maxtmpfiles - maximal number of temporary files
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp,
                                             Charset cs,
                                             File tmpdirectory,
                                             boolean distinct,
                                             int numHeader)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - string comparator
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        numHeader - number of lines to preclude before sorting starts
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp,
                                             int maxtmpfiles,
                                             Charset cs,
                                             File tmpdirectory,
                                             boolean distinct,
                                             int numHeader)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - string comparator
        maxtmpfiles - maximal number of temporary files
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        numHeader - number of lines to preclude before sorting starts
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp,
                                             int maxtmpfiles,
                                             Charset cs,
                                             File tmpdirectory,
                                             boolean distinct,
                                             int numHeader,
                                             boolean usegzip)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - string comparator
        maxtmpfiles - maximal number of temporary files
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        numHeader - number of lines to preclude before sorting starts
        usegzip - use gzip compression for the temporary files
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp,
                                             int maxtmpfiles,
                                             Charset cs,
                                             File tmpdirectory,
                                             boolean distinct,
                                             int numHeader,
                                             boolean usegzip,
                                             boolean parallel)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - string comparator
        maxtmpfiles - maximal number of temporary files
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        numHeader - number of lines to preclude before sorting starts
        usegzip - use gzip compression for the temporary files
        parallel - whether to sort in parallel
        Returns:
        a list of temporary flat files
        Throws:
        IOException - generic IO exception