Reading Notes for Orca
These are my reading notes for ORCA: A Distributed Serving System for Transformer-Based Generative Models, an OSDI 2022 paper. Almost all the authors come from South Korea, and this is actually the first time I have read a paper written by Korean authors.
Summary
Abstract & Introduction & Background
The paper focuses on inference serving. The authors point out that existing serving systems are not a good fit for transformer-based generative models, so they propose a new method called iteration-level scheduling and build a distributed serving system called ORCA.
Besides the scheduling, they also introduce a new batching method, called selective batching, designed to work together with iteration-level scheduling.
The background they give is fairly detailed: the first part is about transformer-based models, and the second part is about existing serving systems.
Figure 1(a) actually confused me a bit; I talk about it in Thoughts > Confusion > About Figure 1.
Otherwise, the background section does not leave too much to talk about.
There are also two figures about existing inference systems.
Figure 2 shows how existing serving systems work, and Figure 3 shows why they are not good enough. Both figures are actually quite clear.
Basically, existing systems combine requests into a batch, send the batch to the model, and only after all the requests have reached their end do they send the results back. Some requests need more iterations than others, so the requests that finish early can only wait. This is what the authors want to improve.
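A toy sketch of this request-level batching, in my own words (the iteration counts are made up, and none of this is code from the paper); it shows how a short request is held hostage by the longest one in its batch:

```python
# Toy illustration of request-level (static) batching.
# `needed_iters` is how many decoding iterations each request actually needs.
def serve_batch_request_level(needed_iters):
    finished_at = [None] * len(needed_iters)
    iteration = 0
    # The whole batch keeps running until the *longest* request is done.
    while iteration < max(needed_iters):
        iteration += 1
        for i, n in enumerate(needed_iters):
            if finished_at[i] is None and iteration >= n:
                finished_at[i] = iteration
    # Results are only returned once the whole batch finishes.
    returned_at = [max(needed_iters)] * len(needed_iters)
    return returned_at, finished_at

returned_at, done_at = serve_batch_request_level([2, 7, 3])
print(returned_at)  # [7, 7, 7] -> even the 2-iteration request waits for 7
print(done_at)      # [2, 7, 3]
```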
Challenges and Proposed Solutions
Just two challenges and two corresponding solutions.
The first is about scheduling and the second is about batching.
They want to be able to choose which requests to run at every iteration, so they create a request pool. Each iteration, the scheduler selects enough requests from the pool to make up a batch, runs the model for just one iteration, and sends the outputs back to the pool. Requests that are done are returned to the clients, and newly arrived requests are simply put into the pool.
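A minimal sketch of this iteration-level loop as I understand it (the `Request` class, `run_one_iteration`, and `send_back_to_client` are my own made-up names, not ORCA's API):

```python
from collections import deque

class Request:
    """Hypothetical request record: tracks generated tokens and completion."""
    def __init__(self, prompt, max_tokens):
        self.prompt = prompt
        self.max_tokens = max_tokens
        self.generated = []
        self.finished = False

def run_one_iteration(batch):
    # Placeholder for a single forward pass of the model over the batch.
    for req in batch:
        req.generated.append("<tok>")
        if len(req.generated) >= req.max_tokens:
            req.finished = True

def send_back_to_client(req):
    print(f"finished '{req.prompt}' after {len(req.generated)} tokens")

def serve(pool: deque, batch_size: int):
    while pool:
        # Pick up to `batch_size` requests from the pool for this iteration only.
        batch = [pool.popleft() for _ in range(min(batch_size, len(pool)))]
        run_one_iteration(batch)
        for req in batch:
            if req.finished:
                send_back_to_client(req)  # done: return to the client
            else:
                pool.append(req)          # not done: back into the pool
        # New requests could be appended to the pool here at any time.

serve(deque([Request("short", 2), Request("long", 7), Request("mid", 3)]), batch_size=2)
```

This toy loop ignores ORCA's FCFS ordering and batch-size/memory constraints; it is just to show the shape of the iteration-level idea.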
The second thing they do is selective batching, which is what Figure 5 shows. The basic idea is simple, but there are some details I am confused about; I write about them in Thoughts > Confusion > About batching.
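Here is my rough understanding of the idea as a sketch (the shapes and names are my own, not the paper's): token-wise ops such as the Linear layers can run on all tokens of all requests flattened into one big matrix, while the attention op is split and computed per request, because each request has its own length and K/V history.

```python
import numpy as np

H = 8  # hidden size (made-up number)

def linear(x, W):
    # Token-wise op: can be batched across all requests at once.
    return x @ W

def attention(q, k, v):
    # Simplified single-head attention for one request.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Three requests with different numbers of tokens in this iteration.
tokens_per_request = [3, 1, 5]
xs = [np.random.randn(n, H) for n in tokens_per_request]

W_qkv = np.random.randn(H, 3 * H)
W_out = np.random.randn(H, H)

# 1) Flatten all tokens into one (total_tokens, H) matrix for the Linear ops.
flat = np.concatenate(xs, axis=0)
qkv = linear(flat, W_qkv)                      # shape: (9, 3*H)

# 2) Split back per request and run attention separately for each one.
outputs, start = [], 0
for n in tokens_per_request:
    q, k, v = np.split(qkv[start:start + n], 3, axis=-1)
    outputs.append(attention(q, k, v))
    start += n

# 3) Merge again for the next token-wise Linear.
merged = np.concatenate(outputs, axis=0)
print(linear(merged, W_out).shape)             # (9, H)
```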
ORCA Design & Implementation
The ORCA design basically covers two things: how the model is placed on the GPUs, and the details of the scheduling algorithm.
Figures 6 and 7 introduce how ORCA places the model on GPUs and which parallelization strategies it uses. About parallelism: there are many different terms floating around, but they refer to the same ideas (intra-layer parallelism is what people often call tensor parallelism, and inter-layer parallelism is pipeline parallelism).
Here, for ORCA, each intra-layer partition is set as one node, just like Figure 7 shows. It's very clear, unlike some other papers that are very blurry about what a "worker" or a "node" actually refers to.
The scheduling algorithm is described in great detail, so there is no point in me repeating it here. I have some questions about it, and I write them in Thoughts > Confusion > About the scheduling algorithms.
The implementation section is very short and does not have too much to talk about.
Evaluation & Related Work and Discussion & Conclusion
The evaluation follows the usual pattern: set up baselines, compare against them under different setups, and show that the results are great.
Thoughts
The paper is kind of old; nowadays the support for LLM inference is very sophisticated. I know two mainstream open-source projects in this area: vLLM and SGLang, and I actually know some of the SGLang developers. However, neither of them is mentioned in this paper 😂. I think that is just because the paper is old, and I actually have no idea about the frameworks it does mention.
Confusion
About Figure 1
Figure 1(a) was initially a little confusing for me, because it suggests that in the increment phase we only need the output token from the last iteration as input. However, what I remembered about how a transformer works is that you append the last iteration's output to the last iteration's input. For example, in the initiation phase the input is “I think it” and the output is “is”, so the next iteration's input becomes “I think it is”, and so on. If you read other papers or blog posts about transformers, you will see figures drawn in that way.
I think these are just two different ways of expressing the same model. Each iteration definitely uses information from all the previous tokens, but in this paper the authors may consider that information to already live in the K/V cache (the Attention K/V manager), so it is not purely an input and they simply leave it out of the figure. Honestly, the most plausible explanation to me is that omitting the previous input just makes it easier to draw a clean figure 😂.
This is actually related to another question: why does an inference system only need to store K and V? There is a blog post that talks about this issue.
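A small sketch of what this means, with made-up shapes: at each decode step only the new token's query is computed and used, while the keys and values of all earlier tokens come out of the cache, which is why only K and V ever need to be stored.

```python
import numpy as np

H = 8  # hidden size (made-up)
W_q, W_k, W_v = (np.random.randn(H, H) for _ in range(3))

def decode_step(new_token_x, k_cache, v_cache):
    # Only the newest token is fed in as "input"; its query is computed fresh.
    q = new_token_x @ W_q                                        # (1, H)
    # Its key/value are appended to the cache.
    k_cache = np.concatenate([k_cache, new_token_x @ W_k], axis=0)
    v_cache = np.concatenate([v_cache, new_token_x @ W_v], axis=0)
    # Attention over all cached keys/values; queries of old tokens are never
    # reused, which is why only K and V need to be kept around.
    scores = q @ k_cache.T / np.sqrt(H)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache                   # out: (1, H)

k_cache, v_cache = np.zeros((0, H)), np.zeros((0, H))
x = np.random.randn(1, H)  # embedding of the newest token
out, k_cache, v_cache = decode_step(x, k_cache, v_cache)
print(out.shape, k_cache.shape)  # (1, 8) (1, 8)
```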
About batching
How to batch requests together
The paper says that if some strict conditions are satisfied, different requests can be batched together, and I am a little confused about that, since to me it seems there is no way to do the attention computation of different requests together.
About Figure 5
I just don’t know why the hidden dimension becomes 3H.
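My guess (and it is only my guess, not something I found spelled out in the paper) is that the QKV Linear in Figure 5 is fused: it produces the query, key, and value projections in one matrix multiply, each of size H, so the output dimension is 3H. A tiny sketch of that:

```python
import numpy as np

H = 4                               # hidden size (made-up)
x = np.random.randn(5, H)           # 5 tokens

# One fused projection that produces Q, K, and V at once:
W_qkv = np.random.randn(H, 3 * H)   # output dimension is 3H
qkv = x @ W_qkv                     # (5, 3H)
q, k, v = np.split(qkv, 3, axis=-1) # three (5, H) matrices
print(q.shape, k.shape, v.shape)    # (5, 4) (5, 4) (5, 4)
```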
About the scheduling algorithms
first-come-first-served (FCFS)
They mention that they make sure earlier-arrived requests always have a greater or equal number of processed iterations than later ones. However, the details are omitted, and I just wonder how they do this. Do they sort the request pool every iteration? Just wondering.
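One simple way I can imagine (purely my speculation, not what the paper describes): keep the pool ordered by arrival time and always select a prefix of it, so an earlier request is never skipped in favor of a later one.

```python
# Purely speculative sketch of an FCFS-style selection step; `Request` here is
# a made-up record with an arrival timestamp and an iteration counter.
from dataclasses import dataclass

@dataclass
class Request:
    arrival: float
    iterations_done: int = 0

def select_batch(pool, batch_size):
    # Keeping the pool sorted by arrival and always taking a prefix guarantees
    # that an earlier request never has fewer iterations than a later one.
    pool.sort(key=lambda r: r.arrival)
    batch = pool[:batch_size]
    for req in batch:
        req.iterations_done += 1
    return batch

pool = [Request(arrival=t) for t in (0.1, 0.2, 0.3, 0.4)]
for _ in range(3):
    select_batch(pool, batch_size=2)
print([r.iterations_done for r in pool])  # [3, 3, 0, 0]
```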
max_tokens attribute
This is a very important attribute, but the paper does not state where it comes from. It seems the system just knows the maximum output length of each request out of nowhere.
Ideas
About the "iteration level scheduling "
They basically run one iteration of work for a request, then check its status, managing all these requests in a central pool. This “iterate once, check once” model is efficient in some ways, but it sparked a thought: could we make it even smarter?
What if we could get a rough idea, right at the start, of how many iterations a particular request might need to finish? Maybe a small, lightweight predictive model – something like a mini neural network – could look at a request’s characteristics and make an educated guess. It could flag some as likely “quick wins” and others as “longer hauls.”
If we had that kind of foresight, we could then route requests into different processing pools based on their predicted effort. Imagine a pool for tasks expected to finish in just a few iterations, another for medium-length ones, and a third for those predicted to take a significant amount of time. The real tweak would then be how often we “check in” on the progress of requests in each pool. For the fast pool, frequent checks, similar to Orca’s current method, would make sense to maintain responsiveness. But for requests in the medium or long-haul pools, we could reduce the check-in frequency considerably. Instead of checking after every single iteration, perhaps we’d check after every K1 iterations for medium tasks, and an even larger K2 iterations for the long ones.
The potential upside here is a reduction in overhead. Constantly checking the status of a long-running task can be wasteful. By being more selective about when we check, especially for tasks we anticipate will take a while, the system could spend more of its resources actually doing the work rather than managing it. This could lead to better throughput for longer jobs without significantly impacting the latency of shorter ones. Naturally, the accuracy of the predictive model and the overhead of managing multiple pools with different check strategies would be key things to figure out, but it’s an intriguing possibility for optimizing such iterative processing systems. It’s just a thought bubble for now, but an interesting one!
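A very rough sketch of what that multi-pool scheduler could look like (everything here is hypothetical: the predictor, the pool thresholds, and the check intervals K1/K2 are all made up):

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    iterations_done: int = 0
    needed: int = 0          # ground truth, only used to simulate completion

def predict_iterations(req):
    # Hypothetical lightweight predictor; here just a stand-in heuristic
    # based on prompt length.
    return len(req.prompt.split()) * 4

# Pools with different check-in intervals: the fast pool is checked every
# iteration, medium every K1, long every K2 (all numbers are made up).
POOLS = {"fast": 1, "medium": 4, "long": 16}   # K1 = 4, K2 = 16

def route(req):
    est = predict_iterations(req)
    if est <= 8:
        return "fast"
    elif est <= 64:
        return "medium"
    return "long"

def run_chunk(req, num_iters):
    # Placeholder for running `num_iters` decoding iterations back to back
    # before checking the request's status again.
    req.iterations_done += num_iters
    return req.iterations_done >= req.needed   # finished?

def serve(requests):
    pools = {name: [] for name in POOLS}
    for req in requests:
        pools[route(req)].append(req)
    finished = []
    while any(pools.values()):
        for name, interval in POOLS.items():
            still_running = []
            for req in pools[name]:
                if run_chunk(req, interval):
                    finished.append(req)
                else:
                    still_running.append(req)
            pools[name] = still_running
    return finished

done = serve([Request("hi", needed=3), Request("a much longer prompt here", needed=40)])
print([(r.prompt, r.iterations_done) for r in done])
```

The interesting trade-off, as mentioned above, is that a long-pool request may overshoot its actual finishing point by up to K2 iterations before anyone notices, so the prediction quality and the choice of K1/K2 would really matter.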