Should you consider using gRPC in your data science infrastructure?

Jeff Yu Chu
5 min read · Aug 23, 2022


REST vs. gRPC has become a popular search term in recent years

This note covers the challenge we faced in building our data science platform and why gRPC came to the table. I will also share my experiment results and the concerns that should help you decide whether gRPC fits your architectural landscape.

TL;DR

For the question in the title, I would say a definite yes if you have to send large datasets across your microservices AND your data structures (schemas) are relatively static. My experiment shows that gRPC performs around 4 times better than REST in the scenario in question.

The problem we faced

We currently run a data science platform which processes large time series datasets stored in pandas DataFrames. Recently, due to a dependency clash, we were forced to split the code that reads from and writes to our database (MongoDB) and the code that does the actual calculations into separate microservices (Kubernetes pods, to be specific).

Sending data between the two microservices can be challenging, as the DataFrames we pass back and forth tend to be very large (up to 20MB each). With our current practice, the easiest way to implement the interface would be to use REST APIs and send the DataFrames as JSON.
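For reference, here is a minimal sketch of what that JSON round trip looks like with plain pandas (orient="split" keeps the index and column labels intact):

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"open": [1.0, 2.0], "close": [1.5, 2.5]},
    index=pd.to_datetime(["2022-08-01", "2022-08-02"]),
)

# Serialize on the sending service...
payload = df.to_json(orient="split", date_format="iso")

# ...and reconstruct on the receiving service.
df_restored = pd.read_json(io.StringIO(payload), orient="split")
```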

However, sending large JSON payloads around might slow down our system significantly, and we would like to see if gRPC can alleviate that.

The proof of concept

I first saw the term gRPC in the book Designing Data-Intensive Applications, and this is my first time using it for a real-life business requirement. To make sure it's worth implementing, we had to evaluate the costs and benefits beforehand. Hence, a proof of concept was planned to verify the following:

  1. Whether our datasets are compatible with gRPC and its underlying serialization format, Protocol Buffers (protobuf), and can be implemented without much effort.
  2. Whether gRPC actually transports our data more efficiently than REST with JSON.

Note: For those who are new to protobuf, think of it as the JSON of gRPC. It's a format for packaging your data so that it can be stored and transported more efficiently.

After some brief research, we found that there were very few existing practices to reference. I wondered if that would mean a hard grind through the official documentation, but luckily it was not that complicated.

In short, developing a gRPC server/client can be done in the following steps:

  1. Define a protobuf schema for the data you are going to transport.
  2. Run code generation based on the protobuf schema, which creates classes for you in your preferred language.
  3. Use the generated classes in your server/client code to serialize/deserialize your data with protobuf.
  4. Use the generated code to build the framework of your server/client, as sketched below.
Sample schema definition from the protobuf3 documentation
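To make the steps concrete, here is a minimal sketch in Python. The dataframe.proto schema, the generated dataframe_pb2 module, and all the field names are hypothetical illustrations of the pattern, not our actual production schema:

```python
# Step 1 -- a hypothetical dataframe.proto:
#
#   syntax = "proto3";
#   message Column { repeated double values = 1; }
#   message DataFrame {
#     repeated int64 index = 1;       // pd.Timestamp as epoch nanoseconds
#     repeated string columns = 2;    // preserves column order
#     map<string, Column> data = 3;   // column name -> cell values
#   }
#
# Step 2 -- code generation, which creates dataframe_pb2.py:
#   python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. dataframe.proto

import pandas as pd

import dataframe_pb2  # hypothetical module generated in step 2


def df_to_proto(df: pd.DataFrame) -> dataframe_pb2.DataFrame:
    # Step 3 -- serialize a DataFrame into the generated message class.
    msg = dataframe_pb2.DataFrame()
    msg.index.extend(df.index.astype("int64"))  # timestamps -> epoch ns
    msg.columns.extend(df.columns)
    for col in df.columns:
        msg.data[col].values.extend(df[col].astype(float))
    return msg


def proto_to_df(msg: dataframe_pb2.DataFrame) -> pd.DataFrame:
    # ...and deserialize it back into a DataFrame on the other side.
    data = {col: list(msg.data[col].values) for col in msg.columns}
    return pd.DataFrame(data, index=pd.to_datetime(list(msg.index)))
```

Calling msg.SerializeToString() on the resulting message then gives you the compact byte string that actually travels over the wire.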

I might put together the implementation details in another note, but for now you can find the relevant instructions in the official documentation, which I've listed at the end of this note.

The comparison

With the implementation verified to be feasible, it comes down to efficiency. I therefore designed an experiment to compare the gRPC and REST architectures head to head.

First, a spoiler on the result: according to my experiment, the gRPC API performs around 4 times better than the REST API with our datasets. It also adds very little overhead compared to loading the data directly from the DB.

gRPC outperforms REST by far in my tests (numbers in seconds)

The screenshot above shows the time it takes to load the DataFrames under the different architectures. Here are a few more details on the experiment to help you understand what the results mean:

The data used:

10 pandas DataFrames with different time series data, each with ~2000 rows and ~20 columns. The indices are dates (as pd.Timestamp) and the cells are double-precision floats. These are the data shapes we process every day.
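If you want a feel for such a payload, this hypothetical frame has roughly the same shape (random data, not our actual production series):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.standard_normal((2000, 20)),                  # double-precision floats
    index=pd.date_range("2015-01-01", periods=2000),  # pd.Timestamp index
    columns=[f"series_{i}" for i in range(20)],
)
print(len(df.to_json(orient="split").encode()))  # rough JSON payload size in bytes
```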

The architectures used:

  1. As a baseline, I used a simple structure that loads the DataFrames directly from our database (MongoDB), which simulates our current landscape before splitting up the microservices.
  2. I created a gRPC server which loads the data from the database in exactly the same way as 1., serializes it into protobuf, and serves it. On the other side, a gRPC client on the same host queries the server and deserializes the protobuf back into pandas DataFrames (see the sketch after this list).
  3. A similar architecture to 2., but using FastAPI, sending the DataFrames as JSON and reconstructing them from that on the client side.
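For reference, a minimal sketch of option 2, reusing the hypothetical dataframe_pb2 module and helpers from the earlier sketch; the DataFrameService, its GetFrame RPC, and load_frame_from_mongo are likewise made-up names for illustration, not our actual service:

```python
# Hypothetical additions to dataframe.proto:
#   message FrameRequest { string name = 1; }
#   service DataFrameService {
#     rpc GetFrame (FrameRequest) returns (DataFrame);
#   }

from concurrent import futures

import grpc

import dataframe_pb2
import dataframe_pb2_grpc  # also generated by grpc_tools.protoc


class DataFrameService(dataframe_pb2_grpc.DataFrameServiceServicer):
    def GetFrame(self, request, context):
        df = load_frame_from_mongo(request.name)  # same DB read as option 1
        return df_to_proto(df)                    # helper from the earlier sketch


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    dataframe_pb2_grpc.add_DataFrameServiceServicer_to_server(DataFrameService(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()


# The client on the same host queries the server and rebuilds the DataFrame:
def fetch_frame(name: str):
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = dataframe_pb2_grpc.DataFrameServiceStub(channel)
        return proto_to_df(stub.GetFrame(dataframe_pb2.FrameRequest(name=name)))
```

Option 3 used the same load path, but behind a FastAPI route returning df.to_json(), with the client rebuilding the frame via pd.read_json.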

This experiment closely simulates our current architecture and the two potential implementation options if we split it into separate microservices. I find the result compelling, and it truly supported our decision to incorporate gRPC into our infrastructure.

Final thoughts

I have read some notes online arguing that under certain circumstances gRPC works no faster than REST with JSON at all. It was surprising to find that in our scenario, gRPC actually outperforms REST by far.

Considering that implementing gRPC is not that complicated, and that Google provides good support with documentation and code generation packages, I would suggest giving it a try if you are facing a similar challenge to ours.

However, do keep in mind that by its fundamental design, protobuf is easier to implement if your data has a relatively static schema. The protobuf syntax does allow some flexibility in data shape through lists (repeated fields) and maps, but all the fields you might possibly use still need to be well defined with data types.
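To make that concrete with the hypothetical schema from the earlier sketch: a map field accepts arbitrary keys, but every value still has to match its declared type:

```python
import dataframe_pb2  # hypothetical generated module from the earlier sketch

msg = dataframe_pb2.DataFrame()
msg.data["a_new_column"].values.extend([1.0, 2.0])    # unseen keys are fine
msg.data["bad_column"].values.append("not a double")  # raises TypeError
```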

Also, gRPC is not a universally adopted protocol yet, so it is a better fit for inter-server communication only. It might not be convenient to expose a gRPC endpoint directly to, say, your front end or your users, as that would require all your clients to run gRPC packages. Data served through gRPC is also not simply accessible with ubiquitous tools like curl on the command line.

Useful references:

  - gRPC Python documentation: https://grpc.io/docs/languages/python/
  - Protocol Buffers (proto3) language guide: https://protobuf.dev/programming-guides/proto3/


Written by Jeff Yu Chu

Software engineer based in Tokyo. Currently working in a finance start-up, building platforms to support portfolio management and algorithmic trading.
