I am happy to see Facebook release Tornado, their Python-based web-server that was developed to run the FriendFeed service. They have done their best to optimize the stack within the Python process. However, I'm inclined to believe that it shares the same problem from which many HTTP toolkits suffer.
Since Tornado is written in Python, it is typically restricted[1] by the Global Interpreter Lock. The GIL, for short, is a technique often used by dynamic languages such as Python and Ruby 1.9 to simplify development of the language's runtime as well as to allow certain features to be implemented safely in the language. For example, it is the GIL that makes the multi-variable swap behave atomically in Python (a, b = b, a). The question is, "How do we scale Tornado?" The answer is the same as for most GIL-based systems: you scale using processes, typically as many as the number of concurrent threads your system can execute.
Update: The previous paragraph reads differently than I intended. Tornado doesn't really suffer from the GIL as it's single-threaded. What I meant to get across is that if you need to scale to all the cores of your system, you need either threads or processes. The following paragraphs explain this in better detail.
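To make the process-per-core model concrete, here is a minimal prefork sketch in C; the port number and handle_connection() are placeholders, but the shape is the same one you get by running one single-threaded server process per core.

    /* Minimal prefork sketch: one single-threaded worker process per core.
     * handle_connection() is a placeholder for whatever the worker does. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void handle_connection(int fd)
    {
        const char *reply = "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";
        write(fd, reply, strlen(reply));
        close(fd);
    }

    static void worker_loop(int listen_fd)
    {
        for (;;) {
            int client = accept(listen_fd, NULL, NULL);
            if (client >= 0)
                handle_connection(client);
        }
    }

    int main(void)
    {
        long n_cores = sysconf(_SC_NPROCESSORS_ONLN);
        int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);

        if (listen_fd < 0 ||
            bind(listen_fd, (struct sockaddr *) &addr, sizeof addr) < 0 ||
            listen(listen_fd, 128) < 0) {
            perror("listen socket");
            return 1;
        }

        /* One worker per core; each worker stays single threaded, so an
         * embedded runtime's GIL would never actually be contended. */
        for (long i = 0; i < n_cores; i++) {
            if (fork() == 0) {
                worker_loop(listen_fd);
                _exit(0);
            }
        }

        while (wait(NULL) > 0)
            ;
        return 0;
    }

Each worker owns its own accept loop and, if it embedded a runtime, its own GIL, which is why processes are the standard answer for scaling this kind of server.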
Scaling in this model, however, presents a few challenges. Each process ends up requiring connections to your database and cache tiers. So now, instead of N connections, you will ultimately grow to N*N_PROCESSES connections; for example, eight worker processes that each hold ten database connections present eighty connections to the database instead of ten. Scaling the number of cores will adversely affect your data tier performance, which is typically your bottleneck to begin with.
I think the need to design the system in this direction ultimately comes from the real web-servers letting us down. Tornado handling data in an event-based fashion should not be news in general; most web-servers have been doing this for a while. Cherokee[2] does it quite well, Apache's mpm-event does it, and Twisted's reactor has done it for years. However, all of these systems let your infrastructure developers down in one critical component: the module API.
Once you have accepted a socket you need to do something with it. Web-servers tend to differentiate themselves in this area, but I don't think they have gotten it right. Apache provides different MPMs (Multi-Processing Modules), such as worker, which combines worker processes and threads. This was done primarily to work around the inefficiency of running massive numbers of threads per process. "Why would you need a massive number of threads?" you might ask. It's due to the design flaw that the module APIs are built around executing synchronously. After the bottom-half (the socket handler) accepts a new request, it is passed to a handler. The handler calls the configured modules to process the request and blocks while this happens. Ultimately, this means you are wasting a thread regardless of whether your application has the resources to process the request immediately. When your system cannot keep up with requests you create a massive number of threads/processes, up to your configured bounds, which slows down the threads that actually are able to process a request. Obviously this exacerbates the problem.
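To illustrate the cost, here is a rough C sketch of that synchronous contract; query_database() and the request struct are invented stand-ins, not any real server's API. The serving thread is captive for the entire backend call, so a slow data tier translates directly into parked threads.

    /* Sketch of the synchronous module contract: the serving thread is
     * captive for the whole request, even while it only waits on I/O.
     * query_database() is an invented stand-in for a real backend call. */
    #include <stdio.h>
    #include <unistd.h>

    struct request {
        int client_fd;
        const char *path;
    };

    static const char *query_database(const char *path)
    {
        (void) path;
        sleep(1);                     /* pretend the data tier is busy */
        return "row data";
    }

    /* The module API hands us the request and expects the response before
     * we return -- there is no way to give the thread back in the interim. */
    static int handle_request(struct request *req)
    {
        const char *rows = query_database(req->path);   /* thread blocks here */
        char buf[256];
        int len = snprintf(buf, sizeof buf, "HTTP/1.0 200 OK\r\n\r\n%s\n", rows);

        write(req->client_fd, buf, (size_t) len);
        return 0;
    }

    int main(void)
    {
        struct request req = { 1 /* stdout */, "/users/42" };
        return handle_request(&req);
    }

Every request that arrives while the data tier is slow ties up another thread in exactly this way.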
What needs to happen is a module API that supports fully asynchronous, event-based processing of requests from the socket-handling bottom-half through the application-based upper-half. A primary example of where this is useful is "partition-based" data, such as most user-content systems. If you accept too many requests on a single server for a data partition you will increase the average response time per request, due to the contention you create on your database and/or cache connections. So while you may want to accept the request, you may not want to start processing it until a number of other requests in that partition range have completed.
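As a sketch of what such a contract could look like, the hypothetical handler below either starts work or defers the request until its partition has capacity; every name here (handle_request, request_completed, MAX_IN_FLIGHT) is invented for illustration rather than taken from any existing server.

    /* Hypothetical asynchronous module contract: a handler may either start
     * work or defer, and the server resumes deferred requests later. */
    #include <stdio.h>

    enum handler_status { HANDLED, DEFERRED };

    struct request {
        int partition;               /* e.g. a hash of the user id */
        struct request *next;        /* intrusive wait-queue link */
    };

    #define MAX_PARTITIONS  64
    #define MAX_IN_FLIGHT    4       /* per-partition concurrency budget */

    static int in_flight[MAX_PARTITIONS];
    static struct request *waiting_head[MAX_PARTITIONS];
    static struct request *waiting_tail[MAX_PARTITIONS];

    /* Stand-in for issuing asynchronous work against the data tier. */
    static void start_backend_work(struct request *req)
    {
        printf("processing a request for partition %d\n", req->partition);
    }

    /* Called by the server's event loop once a request has been parsed.
     * Instead of blocking, the handler may defer the request entirely. */
    enum handler_status handle_request(struct request *req)
    {
        int p = req->partition;

        if (in_flight[p] >= MAX_IN_FLIGHT) {
            req->next = NULL;
            if (waiting_tail[p])
                waiting_tail[p]->next = req;
            else
                waiting_head[p] = req;
            waiting_tail[p] = req;
            return DEFERRED;         /* accepted, but not yet processed */
        }

        in_flight[p]++;
        start_backend_work(req);
        return HANDLED;
    }

    /* Called when a request for a partition completes: resume one waiter. */
    void request_completed(int partition)
    {
        struct request *next = waiting_head[partition];

        in_flight[partition]--;
        if (next) {
            waiting_head[partition] = next->next;
            if (!waiting_head[partition])
                waiting_tail[partition] = NULL;
            in_flight[partition]++;
            start_backend_work(next);
        }
    }

    int main(void)
    {
        struct request reqs[6] = { {1}, {1}, {1}, {1}, {1}, {1} };

        for (int i = 0; i < 6; i++)
            if (handle_request(&reqs[i]) == DEFERRED)
                printf("deferred a request for partition 1\n");

        request_completed(1);        /* one finishes, a deferred one resumes */
        return 0;
    }

Running it, the first four requests for partition 1 start immediately, the last two are merely accepted, and one of them resumes as soon as request_completed() fires.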
Once we have switched to a server that can work in this model we will run into the problem of how to handle all those cores efficiently. Thankfully, Apache-style MPM models won't be necessary anymore. If applications can be processed in an asynchronous model then they will be able to yield the thread when I/O and other shared resources are not yet available. Doing so means the optimum number of threads will be equal to the number of cores you have available.
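A minimal sketch of that sizing rule, assuming the per-thread scheduler lives in a placeholder event_loop():

    /* Sketch: with non-blocking handlers, one worker thread per core is
     * enough.  event_loop() stands in for the per-thread scheduler that
     * runs whichever requests can currently make progress. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_THREADS 64

    static void *event_loop(void *arg)
    {
        (void) arg;
        /* poll for ready events, run the handlers that can progress, repeat */
        return NULL;
    }

    int main(void)
    {
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t threads[MAX_THREADS];

        if (n < 1)
            n = 1;
        if (n > MAX_THREADS)
            n = MAX_THREADS;

        for (long i = 0; i < n; i++)
            pthread_create(&threads[i], NULL, event_loop, NULL);
        printf("started %ld workers for %ld cores\n", n, n);
        for (long i = 0; i < n; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }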
For this to work optimally with language runtimes dependent on a GIL, we need to use a little-known technique to host multiple runtimes within a single process. If you have multiple copies of the runtime's shared library, each subsequent call to dlopen() will result in new memory zones for the loaded library. In such cases, you can even have different versions of the same runtime loaded within one process (python2.3, 2.4, 2.5, etc.). We can load a runtime per core so that each worker has its own runtime in which the GIL is never contended (since it is single-threaded). As I'm sure you noticed, this is an implementation detail.
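The mechanics of that trick look roughly like the C sketch below; the library file names are hypothetical per-worker copies of libpython, and whether two embedded runtimes coexist cleanly in practice depends on the build, as the paragraph above assumes.

    /* Sketch of the dlopen() trick: load distinct *copies* of the
     * interpreter's shared library so each handle gets its own private
     * data (and therefore its own GIL).  The file names are hypothetical. */
    #include <dlfcn.h>
    #include <stdio.h>

    typedef void (*init_fn)(void);
    typedef int  (*run_fn)(const char *);

    static void *load_runtime(const char *path)
    {
        /* RTLD_LOCAL keeps each copy's symbols private to its handle. */
        void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
        if (!handle)
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return handle;
    }

    int main(void)
    {
        const char *copies[] = {
            "./libpython2.5-worker0.so",   /* hypothetical per-worker copies */
            "./libpython2.5-worker1.so",
        };

        for (int i = 0; i < 2; i++) {
            void *rt = load_runtime(copies[i]);
            init_fn init;
            run_fn run;

            if (!rt)
                continue;
            init = (init_fn) dlsym(rt, "Py_Initialize");
            run = (run_fn) dlsym(rt, "PyRun_SimpleString");
            if (init && run) {
                init();
                run("print 'hello from a private runtime'");
            }
        }
        return 0;
    }

On most systems this links with -ldl.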
A few things are preventing this from becoming an immediate reality. There needs to be a standard for languages to integrate into the web-server so that shared resources can be stored outside of each runtime. For example, one of our Python workers might yield on getting a database connection from the server's connection pool. The worker can then pause that request until a connection is available and process other requests in the meantime. This should feel natural to those who have worked in coroutine-based systems.
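A hypothetical version of that contract might look like the following C sketch, where the pool lives in the server and a worker hands it a continuation instead of blocking; pool_acquire() and pool_release() are invented names, not part of any existing server.

    /* Hypothetical shared connection pool living in the server, outside any
     * language runtime.  A worker passes a continuation instead of blocking;
     * if no connection is free the *request* is parked, not the thread. */
    #include <stdio.h>

    #define POOL_SIZE    2
    #define MAX_WAITERS 32

    struct db_conn { int id; };

    typedef void (*conn_ready_fn)(struct db_conn *conn, void *request);

    static struct db_conn connections[POOL_SIZE] = { {0}, {1} };
    static struct db_conn *free_conns[POOL_SIZE] = {
        &connections[0], &connections[1]
    };
    static int free_count = POOL_SIZE;

    static struct { conn_ready_fn fn; void *request; } waiters[MAX_WAITERS];
    static int n_waiters;

    /* Run the continuation now if a connection is free, otherwise park it. */
    void pool_acquire(conn_ready_fn fn, void *request)
    {
        if (free_count > 0) {
            fn(free_conns[--free_count], request);
        } else if (n_waiters < MAX_WAITERS) {
            waiters[n_waiters].fn = fn;
            waiters[n_waiters].request = request;
            n_waiters++;
        }
    }

    void pool_release(struct db_conn *conn)
    {
        if (n_waiters > 0) {
            /* Hand the connection straight to a parked request. */
            n_waiters--;
            waiters[n_waiters].fn(conn, waiters[n_waiters].request);
        } else {
            free_conns[free_count++] = conn;
        }
    }

    /* A worker's continuation: runs only once a connection is available. */
    static void on_connection(struct db_conn *conn, void *request)
    {
        printf("request %s running on connection %d\n",
               (const char *) request, conn->id);
    }

    int main(void)
    {
        pool_acquire(on_connection, (void *) "A");   /* runs immediately */
        pool_acquire(on_connection, (void *) "B");   /* runs immediately */
        pool_acquire(on_connection, (void *) "C");   /* parked: pool exhausted */
        pool_release(&connections[0]);               /* a query finishes; C resumes */
        return 0;
    }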
One of the problems that interests me is getting requests from the bottom-half to workers in the upper-half as efficiently as possible. Typically this is done with a thread pool. Most thread pools are implemented with a single queue and a number of worker threads that pop their next work item off that queue. Research has shown that this creates a considerable amount of lock contention as you scale the number of cores: each additional worker is a new contender for the lock protecting the queue. Nir Shavit et al. pioneered a concept called work-stealing which greatly reduces this problem. It consists of one queue per worker and distributing work items in a round-robin fashion among the workers. To prevent worker starvation, when a thread runs out of work items it will look to its neighbors and try to steal an item off their queue. Some fancy footwork is performed to reduce contention, such as stealing from the opposite side of the queue and implementing the fast path lock-free. Additionally, you can choose neighbors that share your die on NUMA systems. I was so interested in this, and in the challenges of implementing lock-free algorithms in C (they are easier to implement with garbage collection), that I implemented it in Iris[3], my concurrency toolkit for glib.
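Below is a deliberately simplified, lock-based C sketch of that shape (one deque per worker, owners pop from the bottom, thieves steal from the top); the lock-free fast path that Iris and other real implementations rely on is omitted for clarity.

    /* Simplified work-stealing sketch: one deque per worker, owners pop
     * from the bottom, thieves steal from the top.  Everything is guarded
     * by a per-deque mutex; the lock-free fast path is omitted. */
    #include <pthread.h>
    #include <stdio.h>

    #define N_WORKERS  4
    #define DEQUE_CAP  64

    struct deque {
        pthread_mutex_t lock;
        int items[DEQUE_CAP];
        int top;                 /* steal end */
        int bottom;              /* owner end */
    };

    static struct deque deques[N_WORKERS];

    static void deque_push(struct deque *d, int item)
    {
        pthread_mutex_lock(&d->lock);
        d->items[d->bottom++] = item;
        pthread_mutex_unlock(&d->lock);
    }

    /* Owners take from the bottom, thieves take from the top, so they only
     * collide when a deque is nearly empty. */
    static int deque_take(struct deque *d, int from_top, int *item)
    {
        int ok = 0;

        pthread_mutex_lock(&d->lock);
        if (d->bottom > d->top) {
            *item = from_top ? d->items[d->top++] : d->items[--d->bottom];
            ok = 1;
        }
        pthread_mutex_unlock(&d->lock);
        return ok;
    }

    static void *worker(void *arg)
    {
        int id = (int) (long) arg;
        int item;

        for (;;) {
            if (deque_take(&deques[id], 0, &item)) {
                printf("worker %d ran item %d\n", id, item);
                continue;
            }
            /* Out of local work: try to steal from a neighbor's top end. */
            int stolen = 0;
            for (int i = 1; i < N_WORKERS && !stolen; i++)
                stolen = deque_take(&deques[(id + i) % N_WORKERS], 1, &item);
            if (!stolen)
                return NULL;
            printf("worker %d stole item %d\n", id, item);
        }
    }

    int main(void)
    {
        pthread_t threads[N_WORKERS];

        for (int i = 0; i < N_WORKERS; i++)
            pthread_mutex_init(&deques[i].lock, NULL);

        /* Distribute work round-robin across the per-worker deques. */
        for (int item = 0; item < 32; item++)
            deque_push(&deques[item % N_WORKERS], item);

        for (long i = 0; i < N_WORKERS; i++)
            pthread_create(&threads[i], NULL, worker, (void *) i);
        for (int i = 0; i < N_WORKERS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }

Because the owner and the thieves work opposite ends of each deque, they only contend when a deque is nearly empty, which is what keeps the fast path cheap in the real lock-free versions.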
The primary thing we've learned from web-servers in recent years is that to handle high-load scenarios you need to do everything you can to reduce resource usage when forward progress cannot be made. By continuing to build web and application frameworks within the confines of the synchronous module API, we cannot hope to get the massive improvements in performance that we all desire. We need to consider the high-performance web-server as part of our application, which means writing HTTP container standards.
-- Christian Hergert 2009-09-14