Looking into 2020’s Pull Requests: Part I
At Reviewpad we are continuously testing semantic code reviews against GitHub pull requests from our internal projects and from public repositories. To build a representative set of organizations and repositories to test against, in the past weeks we started looking for a systematic way to gauge public pull request activity.
We think that some of these results could be interesting to the community, so we decided to share them in a series of weekly posts. Our main intention is to share the data without too much interpretation -- we know that different teams and organizations follow different working methodologies, and we don't intend to categorize them in any way.
For our first post, we will look into the duration of pull requests, something we have always been curious about.
Data Selection
First of all, there’s a lot of activity around pull requests, and this activity seems to be increasing. Fortunately, there are several projects dedicated to preserving public GitHub API events. We decided to use the data from the GH Archive project (as it is the easiest to access) from January to March of 2020.
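For reference, GH Archive publishes one gzipped, newline-delimited JSON file per hour of GitHub activity. Here is a minimal sketch of reading one hourly file in Python (the URL pattern follows the GH Archive documentation; the example date is arbitrary):

```python
import gzip
import json
import urllib.request

def gharchive_events(date: str, hour: int):
    """Yield raw GitHub events from one hourly GH Archive file.

    GH Archive serves one gzipped, newline-delimited JSON file per
    hour at https://data.gharchive.org/YYYY-MM-DD-H.json.gz.
    """
    url = f"https://data.gharchive.org/{date}-{hour}.json.gz"
    with urllib.request.urlopen(url) as response:
        with gzip.open(response, mode="rt", encoding="utf-8") as lines:
            for line in lines:
                yield json.loads(line)

# Example: peek at the first event of 2020.
first = next(gharchive_events("2020-01-01", 0))
print(first["type"])
```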
As a starting point, we selected the pull request events with the closed action that were actually merged, on repositories with at least one star. In total, these amount to 2 633 731 events (January: 813 798, February: 820 324 and March: 999 609). To simplify our analysis, we excluded 2 365 duplicate close events that refer to an already-counted pull request, leading to an initial set of 2 631 366 pull requests across 283 034 projects and 153 688 organizations.
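To make these criteria concrete, here is a sketch of the kind of filter involved. The field names follow the GitHub PullRequestEvent payload; using the pull request id as the deduplication key is an assumption for illustration, not necessarily the exact rule we applied:

```python
def select_merged_prs(events):
    """Keep one closed-and-merged pull request event per pull request,
    restricted to repositories with at least one star."""
    seen_ids = set()
    for event in events:
        if event["type"] != "PullRequestEvent":
            continue
        payload = event["payload"]
        if payload["action"] != "closed":
            continue
        pr = payload["pull_request"]
        if not pr.get("merged"):          # closed without merging
            continue
        if pr["base"]["repo"]["stargazers_count"] < 1:
            continue
        if pr["id"] in seen_ids:          # duplicate close event for the same PR
            continue
        seen_ids.add(pr["id"])
        yield event
```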
We decided to look into the top 50 organizations and the top 50 projects, ranked by number of stars, among those involved in these pull requests. For projects we used the star count of the project itself, and for organizations we summed the stars of their associated projects.
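As a sketch of the ranking step, assume the selected pull requests have been flattened into a pandas DataFrame with one row per pull request; the org, project, and stars columns (and the star counts below) are purely illustrative:

```python
import pandas as pd

prs = pd.DataFrame(
    [  # (org, project, stars) -- illustrative values, not real data
        ("microsoft", "vscode", 90_000),
        ("microsoft", "TypeScript", 60_000),
        ("kubernetes", "kubernetes", 64_000),
    ],
    columns=["org", "project", "stars"],
)

# Projects ranked by their own star count.
top_projects = prs.groupby("project")["stars"].max().nlargest(50)

# Organizations ranked by the sum of their projects' stars
# (counting each project once).
top_orgs = (
    prs.drop_duplicates("project")
       .groupby("org")["stars"]
       .sum()
       .nlargest(50)
)
```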
The following two plots show the number of projects per organization and the number of stars per project. In all of these plots, the elements on the horizontal axis are presented in decreasing order of stars (meaning that microsoft is the organization with the highest number of stars).
Projects per Organization
Stars per Project
To get a sense of the number of pull requests involved, the following two plots show the pull request counts for these organizations and projects. As expected, the shape of the Pull Requests per Organization plot is similar to that of Projects per Organization.
Pull Requests per Organization
Pull Requests per Project
Pull Request Duration
Okay, now on to the analysis of pull request duration, starting with organizations.
The next plot shows the average duration, in days, of all pull requests per organization (Average Open Time). We observe that some averages are very similar across organizations even when the number of repositories is quite different, while other organizations dedicated to open source projects have much higher averages. A possible pattern seems to be that, above 8.5 days, we mostly find organizations with a high number of contributors.
Pull Request Average Open Time, per Organization
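As a sketch, the Average Open Time per organization can be computed from each pull request's creation and close timestamps; the column names below are hypothetical:

```python
import pandas as pd

# Hypothetical per-pull-request table with open/close timestamps.
prs = pd.DataFrame({
    "org": ["alpha", "alpha", "beta"],
    "created_at": pd.to_datetime(["2020-01-01", "2020-01-05", "2020-01-02"]),
    "closed_at": pd.to_datetime(["2020-01-03", "2020-01-20", "2020-01-04"]),
})

# Open time in days for each pull request.
open_days = (prs["closed_at"] - prs["created_at"]).dt.total_seconds() / 86400

# Average Open Time per organization.
average_open_time = open_days.groupby(prs["org"]).mean()
```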
Since we were also curious to see how a small percentage of the longest pull requests could affect the average, we also plotted the averages over the shortest 95% (PR Average Open Time (Ratio to 95%)) and the shortest 99% (PR Average Open Time (Ratio to 99%)) of durations. We were expecting lower averages, and it is interesting to see significant differences in some organizations. To get a better understanding of these differences, the next plot shows the ratio between the two (average / percentile average).
Pull Request Average Open Time (Percentile Ratio), per Organization
We observe that the 99% line is much smoother than the 95% line, possibly indicating that in some of these organizations a small share of pull requests spends a lot more time in the review process.
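To make the trimmed averages and the ratio concrete, here is a sketch with illustrative durations; the exact trimming rule we applied may differ:

```python
import numpy as np

def trimmed_average(durations_days, keep_fraction):
    """Average over the shortest keep_fraction of the durations."""
    durations = np.sort(np.asarray(durations_days, dtype=float))
    kept = max(1, int(len(durations) * keep_fraction))
    return durations[:kept].mean()

durations = [0.1, 0.5, 1.0, 2.0, 3.0, 40.0]  # days, illustrative only

average = float(np.mean(durations))
avg_95 = trimmed_average(durations, 0.95)  # shortest 95%
avg_99 = trimmed_average(durations, 0.99)  # shortest 99%

# Ratios plotted above: overall average / percentile average.
ratio_95 = average / avg_95
ratio_99 = average / avg_99
```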
We applied the same procedure to projects and plotted both the averages and the ratios.
Pull Request Average Open Time, per Project
Pull Request Average Open Time (Percentile Ratio), per Project
At the project level, the average number of days a pull request stays open is much higher than at the organization level. It would be interesting to see whether similarities between projects could reveal similar methodologies in teams that work on different projects.
Even though this is preliminary data, we hope it sheds a bit of light on how long review processes can take in large repositories and on the effect of a small percentage of contributions.
Stay tuned and let us know what you think about this data!