
Looking into 2020’s Pull Requests: Part II

14th April 2020 ・ By Marcelo Sousa

Hi there! In our first article, we showed how many pull requests were closed this year in 50 well-known organizations and projects, and we looked at how long those pull requests stayed open.

For our second article, we decided to do a deep dive into the merged pull requests from 283 034 starred public projects.

Data Selection

We will continue with the data from the GH Archive project from January to March of 2020, and the same initial selection of ~2.6M merged pull requests. We collected for each pull request (PR):

  • The name of the project 
  • The username of the GitHub user that merged the PR
  • The number of stars in the project
  • The duration (closed - opened time)
  • The number of comments
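
As a rough illustration of the fields listed above, one merged PR could be modelled as a small record. This is only a sketch, not the pipeline used for this article: the `MergedPR` class, its field names, the `hours_between` helper and the timestamp format are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed timestamp layout (ISO 8601, UTC, "Z" suffix); adjust if your data differs.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def hours_between(opened_at: str, closed_at: str) -> float:
    """Return the closed - opened time in hours."""
    opened = datetime.strptime(opened_at, TIMESTAMP_FORMAT)
    closed = datetime.strptime(closed_at, TIMESTAMP_FORMAT)
    return (closed - opened).total_seconds() / 3600


@dataclass(frozen=True)
class MergedPR:
    """Hypothetical record holding the five fields collected per merged PR."""
    project: str           # name of the project
    merged_by: str         # username of the GitHub user that merged the PR
    stars: int             # number of stars in the project
    duration_hours: float  # closed - opened time, in hours
    comments: int          # number of comments


# Made-up example values, for illustration only.
example = MergedPR(
    project="example-org/example-repo",
    merged_by="example-user",
    stars=4,
    duration_hours=hours_between("2020-01-01T00:00:00Z", "2020-01-02T12:00:00Z"),
    comments=0,
)
```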

To recap from last week, here’s a summary of the data:

Analysis

This week we’ll examine this PR data at the project level. For that reason, we started by aggregating the data per project. 

The number of projects is large and their activity levels vary widely. Since we wanted an overview that is easy and fun to understand for all projects, we decided to look at averages and percentiles.
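
To make the aggregation step concrete, here is a minimal sketch of grouping per project and then summarising with an average and percentiles. The helper names and the toy rows are hypothetical, and the nearest-rank percentile definition is an assumption: the article does not say which percentile method was used.

```python
from collections import defaultdict
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest value such that at least p% of the data is <= it.
    """
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def group_by_project(rows):
    """Collect (project, value) pairs into {project: [values]}."""
    grouped = defaultdict(list)
    for project, value in rows:
        grouped[project].append(value)
    return dict(grouped)


# Made-up toy data: (project, PR duration in hours).
rows = [("org/a", 1.0), ("org/a", 3.0), ("org/b", 10.0)]
grouped = group_by_project(rows)

# One value per project (here: the number of merged PRs), then summarise.
prs_per_project = [len(durations) for durations in grouped.values()]
summary = {
    "average": mean(prs_per_project),
    "p50": percentile(prs_per_project, 50),
    "p90": percentile(prs_per_project, 90),
}
```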

Stars, Pull Requests and GitHub users

The following table presents the statistics for the number of stars, pull requests and GitHub users on all projects. Since the number of stars recorded for a project can differ from one pull request to the next, we took the star count from a randomly chosen merged pull request of that project.

The average number of stars across all projects (205 stars) is higher than we were anticipating considering the number of projects. Looking at the percentiles, we can get more insight into the distribution. We observe that the median (50th percentile) is only 4 stars and at least 90% of the projects have fewer stars than the average. So, if you care about the stars on your project and your project has more than 190 stars: congratulations, you made it into the top 10% of starred public GitHub projects!

Considering that the first complete week of April was Week 15 of 2020, the average number of merged pull requests this year is actually high (9.3 in total or 0.7 merged PRs per week). On the other hand, at least 80% of the projects are below the average. A median of 2 pull requests tells us that 50% of the projects merged at most one PR every 7 weeks.
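
The weekly figures above can be reproduced with a couple of divisions. The sketch below assumes 14 elapsed weeks (the weeks of 2020 before Week 15); that count and the constant names are our reading of the paragraph, not something stated in the data.

```python
# Assumption: 14 full weeks of 2020 had elapsed before Week 15.
WEEKS_ELAPSED = 14

AVERAGE_MERGED_PRS = 9.3  # average merged PRs per project (from the article)
MEDIAN_MERGED_PRS = 2     # median merged PRs per project (from the article)

# Average merge rate per project and week.
average_per_week = AVERAGE_MERGED_PRS / WEEKS_ELAPSED

# How many weeks pass per merged PR for a project sitting exactly at the median.
weeks_per_pr_at_median = WEEKS_ELAPSED / MEDIAN_MERGED_PRS

print(f"{average_per_week:.1f} merged PRs per week on average")
print(f"one merged PR every {weeks_per_pr_at_median:.0f} weeks at the median")
```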

We know that a lot of public GitHub projects are the hard work of a small group of people. As an estimate of the team size in these projects, we consider the number of distinct users that merge pull requests. The average of 1.4 users per project suggests that many of these projects still have a single main leader. Assuming that mature open source projects have teams with more than 3 leading members, at most 5% of the public GitHub projects (~14 152 projects) would be in this category!
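
A minimal sketch of this team-size estimate, counting the distinct users that merged PRs in each project. The function name and the sample rows are made up for illustration; only the project total and the 5% figure come from the article.

```python
from collections import defaultdict

TOTAL_PROJECTS = 283_034  # number of projects in the data set (from the article)


def mergers_per_project(prs):
    """Count distinct merging users per project from (project, merged_by) pairs."""
    users = defaultdict(set)
    for project, merged_by in prs:
        users[project].add(merged_by)
    return {project: len(names) for project, names in users.items()}


# Made-up sample rows: (project, user that merged the PR).
prs = [
    ("org/a", "alice"),
    ("org/a", "alice"),
    ("org/a", "bob"),
    ("org/b", "carol"),
]
team_sizes = mergers_per_project(prs)

# 5% of all projects, the upper bound quoted for teams with more than 3 mergers.
five_percent_of_projects = round(0.05 * TOTAL_PROJECTS)
```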

Duration

Extending the results from our previous post, we computed the duration of the 2.6M pull requests and aggregated them per project. For the list of PR durations per project, we computed the average, 50th, 90th and 99th percentiles. In the following table we present the average and percentiles across all projects. The table can be hard to understand as it presents nested information. For example, the cell in the second row and the first column (0.01 hours) is the 10th percentile of 283 034 values, where each value is the average of the PR durations for individual projects.
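
Since the nesting is the tricky part, here is a small sketch of it on made-up numbers: first one statistic per project, then a statistic over those per-project values. The project names and durations are hypothetical, and the nearest-rank percentile is an assumed definition.

```python
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100 (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up PR durations in hours, keyed by project.
durations = {
    "org/a": [0.5, 1.5],
    "org/b": [10.0, 30.0, 50.0],
    "org/c": [100.0],
}

# Inner step: one value per project (here, the average PR duration).
average_per_project = {name: mean(hours) for name, hours in durations.items()}

# Outer step: statistics across projects, computed over the inner values.
inner_values = list(average_per_project.values())
average_of_averages = mean(inner_values)
p50_of_averages = percentile(inner_values, 50)
p90_of_averages = percentile(inner_values, 90)
```

The same two-step pattern applies when the inner statistic is a percentile of the PR durations of a project instead of their average.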

The average of the average durations across all projects is 362.2 hours (~15 days), which is not far off from 309.3 hours (~13 days), the average over the 50 known projects we analyzed last week.

Looking at the percentiles over the projects, we see that the durations grow exponentially from one percentile to the next. This means that some projects move much faster than others: for example, 50% of the projects have pull requests with an average duration of less than one day, while 10% of the projects have pull requests with an average duration of around one month!

Comments in Pull Requests

Finally, we analyzed the comment activity in the pull requests. GitHub has two main types of comments associated with pull requests shown in the conversation tab:

  1. Pull request comments: these are single comments on the file diff between the two branches or general comments in the pull request.
  2. Pull request review comments: these are comments on the file diff made during a pull request review and they show up grouped in the conversation tab when the review is finished.

We separated the analysis into two groups because we consider pull request review comments as a representation of the code review process and all comments as a representation of the discussion of the pull request. We were also curious about the percentage of pull requests that are merged without any comment activity, so for each project we computed the percentage of pull requests without any comment and the percentage without any review comment.
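
The per-project percentage described above can be sketched as follows. The data and names are hypothetical; the same function would be applied once to the comment counts and once to the review comment counts.

```python
from statistics import mean


def pct_without(counts):
    """Percentage of PRs in one project whose count is zero."""
    return 100 * sum(1 for count in counts if count == 0) / len(counts)


# Made-up comment counts per merged PR, keyed by project.
comments = {
    "org/a": [0, 0, 3, 1],
    "org/b": [0, 0],
    "org/c": [2],
}

# One value per project: share of PRs merged without a single comment.
pct_no_comment = {name: pct_without(counts) for name, counts in comments.items()}

# One value per project: average number of comments per PR.
average_comments = {name: mean(counts) for name, counts in comments.items()}

# Across all projects: the "average of the percentage" and "average of the average".
average_pct_no_comment = mean(list(pct_no_comment.values()))
average_of_average_comments = mean(list(average_comments.values()))
```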

In the following table, we present the average and percentiles across all projects.

We were surprised to see such low numbers in the overall comment activity on pull requests:

  • At least 50% of the public projects don’t have a single merged PR with any comment
  • The average of the average comments across all projects is 1
  • The average of the percentage of PRs across all projects without a single comment is 74%
  • Only 10% of the projects have an average higher than 2.75 comments per PR

The activity regarding review comments is even lower:

  • At least 70% of the public projects don’t have a single merged PR with any review comment
  • The average of the average review comments across all projects is 0.5
  • The average of the percentage of PRs across all public projects without a single review comment is 93%
  • Only 5% of the projects have an average higher than 2.84 review comments per PR

This data shows that the process of pull request reviews is still restricted to a very small percentage of public repositories. It will be interesting to understand how much automation is involved in this comment activity. If the presence of code reviews is an indication of high code quality, we hope that this percentage includes the mature open-source projects used by thousands of developers on a daily basis. In any case, we hope to see these numbers going up as we recompute them in the future. If you have insights on how these results generalize to private repositories, feel free to reach out to us!

Next week we’ll deep dive into a set of projects -- let us know if you have any requests!
