Topic: Data Stream Problem

A network streams to you 100 Mega Bytes of data at a time. In all, you will receive a total of 800 Mega Bytes. I.e. you are to receive a total of 8 such blocks, each 100 Mega Byte in size. However, you donot know when each block will arrive though the total time is bounded. Our total time bound is thirty minutes. I.e. all the 8 blocks will arrive within thirty minutes.

Your main memory(RAM) is limited to 100 Mega Byte though you have unlimited(enough) hard disk space.  Give an efficient algorithm to sort the entire 800 Megabyte of data as quick as possible(think hard real time constraints).

Note: You can assume that each element that is to be sorted is of positive integer of length 128 bits.

Edit: Now consider another scenario. Say, that you have all the 8 blocks(i.e the entire 800 Mega Bytes) on your hard disk. Would you use the same strategy as you used for the earlier case to sort it time efficiently or would you go for a new strategy? Explain.

I am a veggie coz I hate plants. I am a non-veggie coz I hate animals and I am a cannibal because I hate humans. :-)

Re: Data Stream Problem

Sort the 100Mbyte chunks in the ram, store them to the disk, then merge the sorted chunks?

Never play leap frog with a unicorn

Re: Data Stream Problem

okay. how do you merge the sorted chunks? the final list also has to be sorted. What sorting algorithm would you use?

I am a veggie coz I hate plants. I am a non-veggie coz I hate animals and I am a cannibal because I hate humans. :-)

Re: Data Stream Problem

merging sorted lists is easy. All you need is a couple or pointers. You can do it in O(n) comparisons.

Never play leap frog with a unicorn

Re: Data Stream Problem

"Note: You can assume that each element that is to be sorted is of positive integer of length 128 bits."
I assume that you meant the data to be 7 bits (0-127)

-Set up a 128 zones of 800 MB each in your hard drive. (I'm assuming you have a lot of free space)
-Set all the cells in each zone to 0
-When the first batch of data arrives, sort it in ram and write to the HD to the corresponding zone. i.e if data=0 then store in the first zone, if data=1, goes to the second zone, etc. In order to know which cells are occupied in each zone, we will use the first bit. i.e. if data=7, we store 128+7=135.
-When the last batch arrive, you will have all the data sorted, but separated by empty chunks. All we have to do is move the active cells (the ones with the first bit set to 1) to a continuous space.

Re: Data Stream Problem

thats right Cygnum. What remains in the first part of the question is what sort algorithm would you choose to sort? The second part of the question is still open.

mantelo17:: It is 128 bit positive integer. Yours,(bucket sort) would be the optimal solution if it was a 7 bit positive integer.

I am a veggie coz I hate plants. I am a non-veggie coz I hate animals and I am a cannibal because I hate humans. :-)

Re: Data Stream Problem

no takers? smile

I am a veggie coz I hate plants. I am a non-veggie coz I hate animals and I am a cannibal because I hate humans. :-)

Re: Data Stream Problem

Data wrote:

no takers? smile


Too many bits for my simple minded brain...

Re: Data Stream Problem

since the first part is almost solved, you can use a fast on the average case sort algorithm such as quick sort to sort the individual chunks and then merge them as Cygnum mentioned.

that leaves us with a much simpler question (the second part)

Edit: Now consider another scenario. Say, that you have all the 8 blocks(i.e the entire 800 Mega Bytes) on your hard disk. Would you use the same strategy as you used for the earlier case to sort it time efficiently or would you go for a new strategy? Explain.

Why you might want to try the second case differently is because hard disk edits are much slower than main memory operations.

Note: The assumption is that you donot have a solid state disk either  because its expensive or probably you might want to sort bigger sized data that would not fit into your expensive solid state disk.

I am a veggie coz I hate plants. I am a non-veggie coz I hate animals and I am a cannibal because I hate humans. :-)

Re: Data Stream Problem

okay, so here is the solution to  the second part


1. Read 100 MB of the data in main memory and sort with quicksort.

2. Write the sorted data to disk.

3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks, which now need to be merged into one single output file.


4. Read the first 10 MB of each sorted chunk into input buffers in main memory and allocate 10 MB(or whatever more is available, 20 MB in our case) for an output buffer.

5. Perform a 8-way merge and store the result in the output buffer. If the output buffer is full, write it to the final sorted file. If any of the 8 input buffers gets empty, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available.


In the conventional 2-way merge, the number of comparisons  are linear in time where as in a k-way merge, the number of comparisons is logarithmic.

Note: logarithmic as in the height of the comparison tree.

Check this example for a 4-way merge
http://code.google.com/p/kway/

Any tree structure that gives us the minimum element in each pass will do, for e.g. a min heap construction will do.
http://lcm.csa.iisc.ernet.in/dsa/node137.html

I am a veggie coz I hate plants. I am a non-veggie coz I hate animals and I am a cannibal because I hate humans. :-)