
Reviewpad: Synchronization with Code Hosts

25th June 2021 By Xavier Vilaça

This blog was started to foment discussion about the practice of code reviews. That’s because our goal is to change how code reviews are done. To us, code reviews are the most important thing teams can do to ensure code quality. We recommend a code review-centric philosophy of development. We are building the ultimate tool to do so, and today we are telling you a bit about how it will work.

How Reviewpad will change the way you do reviews

One of the most important features of a code host nowadays is the ability to review code, which in GitHub and GitLab takes the form of pull and merge requests respectively.

Reviewpad seeks to enhance the way code reviews are executed, according to very specific guidelines. We want, for instance, to provide the tools for Reviewpad and code host users to interact and communicate. We want users to manipulate code reviews in new ways by providing extra features. And we definitely want to provide more context about the code being displayed than a simple textual diff.

The aim, however, is not to replace code hosts, but to synchronise with them. We want to provide a unified interface for opening and tracking code reviews across multiple hosts.

We will achieve this by using three independent mechanisms:

The first one is a mechanism that extracts the code from code hosts so that Reviewpad may analyse it and provide more context about the changes. This can be achieved by using a git client to clone each repository into local storage, after which it performs frequent fetch operations to keep the code up to date.

We also have a mechanism that propagates changes made in Reviewpad back to the code hosts. This is achieved by mapping Reviewpad’s unified code review model onto the model of each code host.

Last but not least, there is a mechanism that retrieves and stores information about code reviews. This is the mechanism this article explains in depth.

You will have noticed that the first and second mechanisms are relatively simple to explain.

We use the code hosts’ remote APIs to propagate changes: whenever a user manipulates a code review (e.g., adds a comment, or opens or closes a code review), we map the action into a sequence of operations on the code host’s API and execute them as soon as possible.

Let’s focus on the last mechanism, then.

The challenge

The issue with retrieving this information about code reviews is not so much the implementation, but the task itself: the requirements are often conflicting.

Let’s look at these requirements in order.

The information we must retrieve comes in a wide variety.

First of all, we must retrieve the code’s location within the code host’s hierarchy, because the organization is always hierarchical, regardless of the code host. For instance, in GitHub, repositories belong to an organization or a user and are identified as {org/user name}/{repository name}. Things are different in GitLab: repositories can belong to users, but there are no organizations. Instead, repositories can belong to a hierarchy of projects of arbitrary depth. For instance, a repository may belong to a project A, a subproject B of A, a sub-subproject C of B, and so on.

Now, in general, a code host consists of trees of code sites. A leaf node is a repository, inner nodes are users/organizations/projects, and the root node is the code host itself. What we are setting out to do is locate code reviews in this hierarchy, matching the natural way of locating code reviews in code hosts.

Evidently, there’s plenty of other stuff to find, such as:

  • Meta-data, including title, description, reviewers, comments, or user reviews;
  • History of commits and pushes;
  • Checks that determine whether the code review is ready to be merged;

And, last but not least, access control information. This is what makes it so that users can access or manipulate only the data the code host’s rules allow them to.

Second requirement: we have limited access. This information can only be imported through the code hosts’ remote APIs, which impose request rate limits. We must figure out a way to import all of the information without reaching those limits; if we do, some requests may fail.

Number three: we want to provide context about the changes that happened during the review. Computing this analysis takes time, yet we want operations to be as fast as possible to ensure a good user experience.

The fourth requirement is that we must ensure quality of data. That means data needs to be as up-to-date as possible and consistent.

The fifth one is true of all software products: we want code complexity to be manageable so Reviewpad is easy to maintain.

To achieve this, we had to make some key design decisions.

The design decisions

The first thing we took into consideration when planning this feature was who was going to be using it. Who are Reviewpad’s users? What are their needs and expectations? We figure that Reviewpad users will spend most of their time reading code review changes, and changes outside of Reviewpad will happen relatively infrequently, such as when users:

  • push changes to the code review code;
  • add/remove/edit comments directly in the code host;
  • make changes to the code review metadata, e.g., change the description or the base branch.

This also suggests that changes to the code sites in which code reviews are located will be even rarer.

This led us to make a call: we are going to import (or compute) the majority of the information beforehand, store it in a local database, and only react to changes as they happen. Resource-intensive operations of importing mostly static information can be triggered once by system admins before normal Reviewpad usage, which allows most subsequent operations to require only light updates on the database. In the rare cases of more resource-intensive updates, such as a full code site refresh, they can be performed in the background without affecting Reviewpad usage. The idea is for these operations to finish in the window between when they start and when users next look for the new information.

Let’s look at this in more detail:

Since repositories are organized into a tree of code sites, we have chosen to allow users to import all the repositories and their code reviews under a given code site (which may be a single repository).

Let’s say, for example, that a user imports a GitHub organization. We will then import all the repositories under that organization. Moreover, we will start tracking said organization, so any subsequent change, such as an added repository, will show up in Reviewpad.

How?

Webhooks. Webhooks are essentially subscriptions to changes on repositories. Whenever a change happens on the code host’s end, Reviewpad is notified with a message, which triggers an update.

The challenges we faced

The fact that we made our decisions doesn’t mean it’s all smooth sailing. By deciding to import everything, we ran the risk of creating slow operations that would be hard to debug in the case of very large repositories. And relying on webhooks is always a risk. They can fail, or quite simply not be available. And, of course, as we mentioned, there’s a rate limit to what we can import. By doing it in bulk, we will reach it faster.

We had to tackle every single one of these issues.

The first thing we did was make sure we import all the info asynchronously. Whenever a user starts an operation, they can begin using Reviewpad immediately; in the background, the database is being populated. A tracking mechanism shows the user the current state of the import.

What about cases where a user detects an inconsistency in important information? Reviewpad will allow users to manually refresh imported info whenever needed.

These measures accomplish two things: we are no longer dependent on webhooks never failing, and we can work with repositories that don’t allow them (such as open-source repositories).

The next thing was to break up all important operations into much smaller tasks. Each task imports information regarding a subset of entities: there are tasks for retrieving the code site tree, for retrieving the list of code reviews in a repo, for importing the code review metadata, for extracting information about pushes and commits, and for synchronizing the git repository.

Tasks will obey a specific hierarchy, spawning others as needed:

Task Hierarchy

As you can see in the graph, code site retrieval is the root task. Under it, we have tasks for extracting the repositories’ metadata and their list of code reviews, as well as for synchronizing the code, among other things.

Why we made these choices

The main advantage of taking this approach is that we can improve the performance of the import operation by running tasks in parallel, all the while keeping a refined log of the import operation.

This way, instead of seeing a single flat log, we can organize it into a hierarchy of logs which follows the same structure as the task hierarchy. Each individual log will contain the main events of each task, and we will then be able to quickly trace down any problems to the exact task in which they occurred.

Our entity-refresh approach allowed us to reuse these tasks in update operations, which decreases code complexity.

This means that instead of computing the minimum required changes on the database for every update operation, we instead identify a set of entities affected by the update and retrieve all their information from scratch.

There are advantages and disadvantages to this approach, of course, especially when compared to more refined approaches that would involve calculating the deltas in the affected entities.

On the one hand, we can reuse the import tasks in every refresh operation; refreshing itself is quite simple (you just extract the latest information for every affected entity and insert it in the database, replacing the previous information if it exists); and a single refresh can quickly fix any inconsistencies.

On the other hand, we will make more requests to the code host’s API per refresh operation. Remember, the rate limit.

To ensure the limit isn’t reached, we use throttling. Whenever we have access to the rate of requests available to a user at any given point in time, we can throttle the requests on our side, controlling the rate at which requests are sent and slowing them down if necessary. This may slow down the operation, of course, and that information is not always available (such as in Bitbucket), but it is an important measure to have in place.
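A minimal token-bucket throttle in Go might look like the following (a sketch with an injected clock for determinism; Reviewpad's production throttler is certainly more elaborate):

```go
package main

import "fmt"

// bucket is a minimal token bucket: it holds up to capacity tokens,
// refilled at ratePerSec, and is consulted before each API request.
type bucket struct {
	capacity   float64
	ratePerSec float64
	tokens     float64
	lastSec    float64 // timestamp of the previous Allow call, in seconds
}

// Allow reports whether a request may be sent at time nowSec: it refills
// tokens for the elapsed time, caps them at capacity, and spends one token
// per allowed request. A disallowed request would be delayed and retried
// by the caller.
func (b *bucket) Allow(nowSec float64) bool {
	b.tokens += (nowSec - b.lastSec) * b.ratePerSec
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.lastSec = nowSec
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	b := &bucket{capacity: 2, ratePerSec: 1, tokens: 2}
	fmt.Println(b.Allow(0), b.Allow(0), b.Allow(0)) // burst of 2, then throttled
	fmt.Println(b.Allow(1))                         // one second later, refilled
}
```

Varying `ratePerSec` per operation is what lets more important operations run at a higher rate, as described below.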

In any case, what we can do is to refine tasks as much as possible so that refreshing only affects the smallest number of entities possible, minimising requests to the code hosts. We can also optimise only the most critical refresh operations to only update what is really necessary.

Some trade-offs will still be required, of course. We only enforce eventual consistency. Failures are relatively rare, and update operations are quite fast, which means information is kept up to date most of the time. Given that any inconsistencies can be quickly fixed by the manual refresh, we think not enforcing total consistency is an acceptable trade-off.

This is how we built it

We used Go as our primary language and there are four key mechanisms to the implementation.

  1. The throttling of code hosts’ API requests, which consists of controlling the rate of requests so that it never exceeds the rate limit mentioned earlier. We based our algorithm on the token bucket, which delays any request that would exceed the limit. We don’t enforce a constant rate limit, instead adapting it to our expected usage profile. We also make sure that more important operations are performed first and, thus, at a higher rate.
  2. Asynchronous Task Mechanism, which uses Go synchronisation primitives to organise the code into asynchronous tasks, specifying the hierarchical relationships and dependencies between them, and submitting them to be run in the background. We’ve also gotten the help of the front-end team to create an interface and tracking mechanism that allows us to visually track any and all tasks where errors occur.
  3. Manual refresh avoids costlier, unnecessary refreshes of code reviews and higher-level entities by letting the user trigger specific refreshes whenever they detect inconsistencies.
  4. Webhooks administration will provide users with a very simple interface to add or remove webhooks from repositories. Once added, Reviewpad will process the corresponding events using much the same mechanisms as those in manual refresh and code site import operations.

We hope you have enjoyed this deep dive into how we are building this tool. If you have, then we are quite confident you will love to explore it yourself. You can try it now for free, and we encourage you to do so!

Try Reviewpad
Disrupting how software developers collaborate