The Coding Slim Jim: Sorting billions of numbers

Saturday, June 5, 2010

There are several key facts to consider when sorting a massive number of numbers

Key factors to consider

* The memory usage profile and efficiency of the sort algorithm
- - sort algorithms that can run back and forth over pages while sorting 1 number are out
  - -- that means bubble sort, insertion sort, selection sort, quicksort are out
- - a sort algorithm that can naturally deal with chucked groups of data is ideal
  - -- the best candidate is merge sort. Using 4 chucks of memory 2 input and 2 output chunks
  - -- use the insertion sort when the but of items in the merge list are below 32 since is more efficient
* FileIO speed and rates
- - fileIO is slow waiting for it is inefficent create a 2 threads that:
  - -- reads in chucks of data when the input buffer drops below a threshold.
  - -- writes out chucks of data when the output buffer contains data.
  - -- these threads may need to handle multiple buffer files as well as the final in/out file.
* the number of cores and virtual cores on the system
- -- create inter/intra sorting worker threads equal to the number of virtual cores in the system so that the cpu is fully utilized.
* the physical paging sizing of memory and the allocation time and size of memory chunks
- - dont repeatedly allocate and free memory instead keep a buffer pool with chucks of memory available
- - allocate chucks of memory are below the size of a memory page.
- - allocate chucks of memory only when the buffer pool is almost empty
- - free chucks of memory if the buffer pool excess an upper limit.

Pesudo Code;
So the basic merge sort needs be to modified into a couple of ways to make it work

- First the FileIO thread will load buffers into the input buffers, these buffer have unsorted data inside chucks
- using multiple threads and if the instage buffer is running low, perform intra sort on each chuck, ie sort the data in the chucks and move it to the instage buffer pool marking them as a sequence of 1 chuck, ie mark the last chuck in the sequence with an end flag.
- using multiple threads, each thread will take wait for 2 input out of sequences from the instage buffers
- -- remove the instage sequance from the instage buffer
- -- perform inter merge between next inline 2 chucks of the input sequences until one of the sequences final chucks are full processed
- -- when an input sequences final chuck is found move all the remain input chucks over to the output sequence (note the final chuck is left marked final)
- -- once the first output chuck of the new sequance is complete the sequence can be moved in the instage buffer, The final chuck is not needed.
- -- as each input chucks merge is completed move it to the empty buffer pool.
- -- as each output chuck is completed append it to output sequence and get a new chunk from the empty buffer pool
- -- if both final chucks of the input sequences are merged the output chuck needs to be marked as final

The Coding Slim Jim