Median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value. So the median is the mean of the two middle value.
Examples:[2,3,4]
, the median is 3
[2,3]
, the median is (2 + 3) / 2 = 2.5
Design a data structure that supports the following two operations:
For example:
addNum(1) addNum(2) findMedian() -> 1.5 addNum(3) findMedian() -> 2
Credits:
Special thanks to @Louis1992 for adding this problem and creating all test cases.
Intuition
\nDo what the question says.
\nAlgorithm
\nStore the numbers in a resize-able container. Every time you need to output the median, sort the container and output the median.
\n\nComplexity Analysis
\nTime complexity: .
\nSpace complexity: linear space to hold input in a container. No extra space other than that needed (since sorting can usually be done in-place).
\nIntuition
\nKeeping our input container always sorted (i.e. maintaining the sorted nature of the container as an invariant).
\nAlgorithm
\nWhich algorithm allows a number to be added to a sorted list of numbers and yet keeps the entire list sorted? Well, for one, insertion sort!
\nWe assume that the current list is already sorted. When a new number comes, we have to add it to the list while maintaining the sorted nature of the list. This is achieved easily by finding the correct place to insert the incoming number, using a binary search (remember, the list is always sorted). Once the position is found, we need to shift all higher elements by one space to make room for the incoming number.
\nThis method would work well when the amount of insertion queries is lesser or about the same as the amount of median finding queries.
\n\nComplexity Analysis
\nTime complexity: .
\n\n\nPop quiz: Can we use a linear search instead of a binary search to find insertion position, without incurring any significant runtime penalty?
\n
Intuition
\nThe above two approaches gave us some valuable insights on how to tackle this problem. Concretely, one can infer two things:
\nBut perhaps the most important insight, which is not readily observable, is the fact that we only need a consistent way to access the median elements. Keeping the entire input sorted is not a requirement.
\n\n\nWell, if only there were a data structure which could handle our needs.
\n
As it turns out there are two data structures for the job:
\nHeaps are a natural ingredient for this dish! Adding elements to them take logarithmic order of time. They also give direct access to the maximal/minimal elements in a group.
\nIf we could maintain two heaps in the following way:
\nThis gives access to median values in the input: they comprise the top of the heaps!
\nWait, what? How?
\nIf the following conditions are met:
\nthen we can say that:
\nThen and/or are smaller than (or equal to) almost half of the elements and larger than (or equal to) the other half. That is the definition of median elements.
\nThis leads us to a huge point of pain in this approach: balancing the two heaps!
\nAlgorithm
\nTwo priority queues:
\nlo
to store the smaller half of the numbershi
to store the larger half of the numbersThe max-heap lo
is allowed to store, at worst, one more element more than the min-heap hi
. Hence if we have processed elements:
lo
is allowed to hold elements, while hi
can hold elements.This gives us the nice property that when the heaps are perfectly balanced, the median can be derived from the tops of both heaps. Otherwise, the top of the max-heap lo
holds the legitimate median.
Adding a number num
:
num
to max-heap lo
. Since lo
received a new element, we must do a balancing step for hi
. So remove the largest element from lo
and offer it to hi
.hi
might end holding more elements than the max-heap lo
, after the previous operation. We fix that by removing the smallest element from hi
and offering it to lo
.The above step ensures that we do not disturb the nice little size property we just mentioned.
\nA little example will clear this up! Say we take input from the stream [41, 35, 62, 5, 97, 108]
. The run-though of the algorithm looks like this:
Adding number 41\nMaxHeap lo: [41] // MaxHeap stores the largest value at the top (index 0)\nMinHeap hi: [] // MinHeap stores the smallest value at the top (index 0)\nMedian is 41\n=======================\nAdding number 35\nMaxHeap lo: [35]\nMinHeap hi: [41]\nMedian is 38\n=======================\nAdding number 62\nMaxHeap lo: [41, 35]\nMinHeap hi: [62]\nMedian is 41\n=======================\nAdding number 4\nMaxHeap lo: [35, 4]\nMinHeap hi: [41, 62]\nMedian is 38\n=======================\nAdding number 97\nMaxHeap lo: [41, 35, 4]\nMinHeap hi: [62, 97]\nMedian is 41\n=======================\nAdding number 108\nMaxHeap lo: [41, 35, 4]\nMinHeap hi: [62, 97, 108]\nMedian is 51.5\n
Complexity Analysis
\nTime complexity: .
\nSpace complexity: linear space to hold input in containers.
\nIntuition
\nSelf-balancing Binary Search Trees (like an AVL Tree) have some very interesting properties. They maintain the tree\'s height to a logarithmic bound. Thus inserting a new element has reasonably good time performance. The median always winds up in the root of the tree and/or one of its children. Solving this problem using the same approach as Approach #3 but using a Self-balancing BST seems like a good choice. Except the fact that implementing such a tree is not trivial and prone to errors.
\nWhy reinvent the wheel? Most languages implement a multiset
class which emulates such behavior. The only problem remains keeping track of the median elements. That is easily solved with pointers! 2
We maintain two pointers: one for the lower median element and the other for the higher median element. When the total number of elements is odd, both the pointers point to the same median element (since there is only one median in this case). When the number of elements is even, the pointers point to two consecutive elements, whose mean is the representative median of the input.
\nAlgorithm
\nTwo iterators/pointers lo_median
and hi_median
, which iterate over the data
multiset.
While adding a number num
, three cases arise:
num
and set both pointers to point to this element.The container currently holds an odd number of elements. This means that both the pointers currently point to the same element.
\nnum
is not equal to the current median element, then num
goes on either side of it. Whichever side it goes, the size of that part increases and hence the corresponding pointer is updated. For example, if num
is less than the median element, the size of the lesser half of input increases by on inserting num
. Thus it makes sense to decrement lo_median
.num
is equal to the current median element, then the action taken is dependent on how num
is inserted into data
. NOTE: In our given C++ code example, std::multiset::insert
inserts an element after all elements of equal value. Hence we increment hi_median
.The container currently holds an even number of elements. This means that the pointers currently point to consecutive elements.
\nnum
is a number between both median elements, then num
becomes the new median. Both pointers must point to it.num
increases the size of either the lesser or higher half of the input. We update the pointers accordingly. It is important to remember that both the pointers must point to the same element now.Finding the median is easy! It is simply the mean of the elements pointed to by the two pointers lo_median
and hi_median
.
A much shorter (but harder to understand), one pointer version 3 of this solution is given below:
\n\nComplexity Analysis
\nTime complexity: .
\nmultiset
scheme. 4Space complexity: linear space to hold input in container.
\nThere are so many ways around this problem, that frankly, it is scary. Here are a few more that I came across:
\nBuckets! If the numbers in the stream are statistically distributed, then it is easier to keep track of buckets where the median would land, than the entire array. Once you know the correct bucket, simply sort it find the median. If the bucket size is significantly smaller than the size of input processed, this results in huge time saving. @mitbbs8080 has an interesting implementation here.
\nReservoir Sampling. Following along the lines of using buckets: if the stream is statistically distributed, you can rely on Reservoir Sampling. Basically, if you could maintain just one good bucket (or reservoir) which could hold a representative sample of the entire stream, you could estimate the median of the entire stream from just this one bucket. This means good time and memory performance. Reservoir Sampling lets you do just that. Determining a "good" size for your reservoir? Now, that\'s a whole other challenge. A good explanation for this can be found in this StackOverflow answer.
\nSegment Trees are a great data structure if you need to do a lot of insertions or a lot of read queries over a limited range of input values. They allow us to do all such operations fast and in roughly the same amount of time, always. The only problem is that they are far from trivial to implement. Take a look at my introductory article on Segment Trees if you are interested.
\nOrder Statistic Trees are data structures which seem to be tailor-made for this problem. They have all the nice features of a BST, but also let you find the order element stored in the tree. They are a pain to implement and no standard interview would require you to code these up. But they are fun to use if they are already implemented in the language of your choice. 5
\nAnalysis written by @babhishek21.
\nPriority Queues queue out elements based on a predefined priority. They are an abstract concept and can, as such, be implemented in many different ways. Heaps are an efficient way to implement Priority Queues.\xc2\xa0\xe2\x86\xa9
\nShout-out to @pharese for this approach.\xc2\xa0\xe2\x86\xa9
\nInspired from this post by @StefanPochmann.\xc2\xa0\xe2\x86\xa9
\nHinting can reduce that to amortized constant time.\xc2\xa0\xe2\x86\xa9
\nGNU libstdc++
users are in luck! Take a look at this StackOverflow answer.\xc2\xa0\xe2\x86\xa9