Given an array of citations (each citation is a non-negative integer) of a researcher, write a function to compute the researcher's h-index.
According to the definition of h-index on Wikipedia: "A scientist has index h if h of his/her N papers have at least h citations each, and the other N − h papers have no more than h citations each."
For example, given citations = [3, 0, 6, 1, 5], which means the researcher has 5 papers in total and each of them had received 3, 0, 6, 1, 5 citations respectively. Since the researcher has 3 papers with at least 3 citations each and the remaining two with no more than 3 citations each, the h-index is 3.
Note: If there are several possible values for h, the maximum one is taken as the h-index.
Credits:
Special thanks to @jianchao.li.fighter for adding this problem and creating all test cases.
This article is for intermediate readers. It introduces the following ideas: Comparison Sort and Counting Sort.
Approach #1: Sorting

Intuition

Think geometrically. Imagine plotting a histogram where the x-axis represents the papers sorted in descending order of citations and the y-axis represents the number of citations. After sorting, the h-index is the side length of the largest square that fits under the histogram.

Figure 1. h-index from a plot of decreasing citations for papers
Algorithm

To find such a square length, we first sort the citations array in descending order. After sorting, if citations[i] > i, then papers 0 to i all have at least i + 1 citations.

Thus, to find the h-index, we search for the largest i (let's call it k) such that

citations[i] > i

and therefore the h-index is k + 1.
For example:

| i | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| sorted citations | 10 | 9 | 5 | 3 | 3 | 2 | 1 |
| citations[i] > i ? | true | true | true | false | false | false | false |
In this example, the largest i with citations[i] > i is i = 2, so k = 2. Thus

h = k + 1 = 3

Because citations[k] > k, papers 0 to k (that is, k + 1 papers) have at least k + 1 citations each, and the remaining n - k - 1 papers (from paper k + 1 to paper n - 1) have no more than k + 1 citations each. By the definition of h-index, h = k + 1.
It is also possible to find k through binary search after sorting. However, comparison sorting has a time complexity of O(n log n), which dominates the performance of the entire algorithm (linear search is O(n)). Using a binary search (O(log n)) instead of linear search won't change the asymptotic time complexity.
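As a concrete sketch, the binary-search variant can exploit the fact that the predicate citations[i] > i is monotonic on the descending-sorted array (the function name `h_index_binary` is illustrative, not from the original article):

```python
def h_index_binary(citations):
    """h-index via descending sort plus binary search for the largest i
    with cites[i] > i. A sketch, not a reference implementation."""
    cites = sorted(citations, reverse=True)
    lo, hi = 0, len(cites) - 1
    k = -1  # largest index i with cites[i] > i; -1 if no such index
    while lo <= hi:
        mid = (lo + hi) // 2
        if cites[mid] > mid:    # predicate reads: true, ..., true, false, ...
            k = mid             # candidate found; look further right
            lo = mid + 1
        else:
            hi = mid - 1
    return k + 1                # h-index is k + 1 (0 when k == -1)
```

As noted above, sorting still costs O(n log n), so the binary search only speeds up the second phase.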
Also note that we deduced the algorithm in descending order for simplicity. The sort function provided by most standard libraries is in ascending order by default. The same principle applies to both orders; in the ascending case, we simply scan the array backward.
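Putting Approach #1 together, a minimal Python sketch using the default ascending sort and a backward scan (the name `h_index_sorted` is my own):

```python
def h_index_sorted(citations):
    cites = sorted(citations)          # ascending, the library default
    n = len(cites)
    h = 0
    # Scanning from the back, the paper at index i has n - i papers at or
    # after it, each with at least cites[i] citations (array is ascending).
    for i in range(n - 1, -1, -1):
        if cites[i] >= n - i:
            h = n - i                  # n - i papers have >= n - i citations
        else:
            break                      # earlier papers cannot yield a larger h
    return h
```

The early break is safe: once cites[i] < n - i, every earlier index has a smaller citation count but a larger required threshold, so the condition can never hold again.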
Complexity Analysis

Time complexity: O(n log n). Comparison sorting dominates the time complexity.

Space complexity: O(1). Most libraries can sort in place (e.g. with heapsort), which costs O(1) extra space in the worst case.
Approach #2: Counting

Intuition

Comparison sorting algorithms have a lower bound of Ω(n log n). To achieve better performance, we need non-comparison-based sorting algorithms.
Algorithm

In Approach #1, we sort the citations to find the h-index. However, it is well known that comparison sorting algorithms such as heapsort, mergesort, and quicksort have a lower bound of Ω(n log n). The most commonly used non-comparison sorting is counting sort.
> Counting sort operates by counting the number of objects that have each distinct key value, and using arithmetic on those tallies to determine the positions of each key value in the output sequence. Its running time is linear in the number of items and the difference between the maximum and minimum keys, so it is only suitable for direct use in situations where the variation in keys is not significantly greater than the number of items.
>
> (Wikipedia)
However, in our problem the keys are the citations of each paper, which can be much larger than the number of papers n. It seems that we cannot use counting sort. The trick here is the following observation:
> Any citation larger than n can be replaced by n, and the h-index will not change after the replacement.
The reason is that the h-index is upper bounded by the total number of papers n, i.e.

h ≤ n
In the diagram, replacing citations greater than n with n is equivalent to cutting off the area above the line y = n.

Figure 2. Cutting off the area with citations more than n

Cutting that area off will not change the largest square, and therefore will not change the h-index.
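The observation is easy to sanity-check with a tiny brute force straight from the definition (`h_brute` and the sample values below are illustrative only, not part of the article's algorithm):

```python
def h_brute(citations):
    # h-index directly from the definition: the largest h such that
    # at least h papers have at least h citations each.
    n = len(citations)
    return max((h for h in range(n + 1)
                if sum(c >= h for c in citations) >= h),
               default=0)

original = [10, 9, 5, 3, 3, 2, 1]
clamped = [min(c, len(original)) for c in original]  # citations above n become n
# h_brute(original) and h_brute(clamped) agree.
```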
After we have the counts, we can obtain a sorted citations array by traversing the count array. The rest is the same as Approach #1.

But we can do even better: we don't even need the sorted citations. We can find the h-index from the paper counts directly.
To explain this, let's look at the following example:

citations = [1, 3, 2, 3, 100]
The counting results are:

| k | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| count | 0 | 1 | 1 | 2 | 0 | 1 |
| s_k | 5 | 5 | 4 | 3 | 1 | 1 |
The value s_k is defined as "the sum of all counts with citation ≥ k", i.e. "the number of papers having at least k citations". By the definition of the h-index, the largest k with s_k ≥ k is our answer.
After replacing 100 with n = 5, we have citations = [1, 3, 2, 3, 5]. Now we count the number of papers for each citation number 0 to 5: the counts are [0, 1, 1, 2, 0, 1]. The first k from right to left (5 down to 0) with s_k ≥ k is the h-index, 3.
Since we can calculate s_k on the fly while traversing the count array, we only need one pass through it, which costs O(n) time.
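The single-pass counting scheme described above might look like this in Python (a sketch under the article's assumptions; `h_index_counting` is my naming):

```python
def h_index_counting(citations):
    n = len(citations)
    count = [0] * (n + 1)
    for c in citations:
        count[min(c, n)] += 1      # citations above n are clamped to n
    s = 0                          # running s_k: papers with >= k citations
    for k in range(n, -1, -1):     # scan k from n down to 0
        s += count[k]
        if s >= k:                 # first k with s_k >= k is the h-index
            return k
    return 0                       # never reached: k = 0 always satisfies s >= 0
```

Both passes touch each element once, matching the O(n) time and O(n) space bounds below.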
Complexity Analysis

Time complexity: O(n). There are two steps. The counting part is O(n) since we traverse the citations array once and only once. The second part, finding the h-index, is also O(n) since we traverse the count array at most once. Thus, the entire algorithm is O(n).

Space complexity: O(n). We use O(n) auxiliary space to store the counts.
Is it possible to have multiple h-values?

The answer is no. One can see this intuitively from Figure 1: the dashed diagonal crosses the histogram once and only once, because the sorted bars are monotonic. It can also be proven from the definition of the h-index.