In the latest CppCast from Tony Van Eerd , Tony was talking about lock-free programming and amazingly I had almost same experience. It was already more than 10 years ago. In 2006 time around we released Guild Wars Faction, I developed my own memory allocator based on Maged Michael’s algorithms as presented in 2004 PLDI paper, “Scalable Lock-Free Dynamic Memory Allocation”. My test showed it sometimes even better than Windows LFH that was known to be very efficient and mostly lock-free but we decided not to ship it. I was a little bit dissapointed at that time but performance improvement wasn’t greater than the risk.

I joined ArenaNet in 2003 October and I started working on some backend servers in 2004 something like Chat server and Auth server. Before then I had no experience on servers so that was quite a learning curve. Of course, there were many naive bugs I wrote that caused server crash and deadlock. Fortunately, I’ve had quickly ramped up. Patrick Wyatt gave me opportunity to write Guild Server which manages all things related to guild system such as guild membership, guild chat, guild hall, guild ranking and leader board. It was first major server task given to me. Looking back, even if Pat reviewed most of my code, it was pretty challenging.

Deadlock and race condition are the most annoying bugs in the Guild Wars servers. We developed couple of rules to avoid deadlocks.

  1. We strictly avoided calling external functions (in other module) inside lock.
  2. Inside critical section or lock, do little work as possible and leave the lock as quickly as possible.
    • For example, there was a common trick we used. When we have to remove and destroy some entries from the global list, we just remove them and add to the local list inside a lock. Then destroy the entries outside of the lock.
  3. In server code, there were several locks in same cpp file for various reasons. So there was a big comment at the beginning of module, what is the order of entering locks to avoid deadlocks.
  4. Pat created deadlock monitoring thread which is checking livability of each thread periodically and detect deadlock. When it detected deadlock, it gets stack dumps of all threads then crash the service. Service will restart and probably gets the same deadlock later. Meanwhile, we analyzed the stacks and created a quick fix. Each Guild Wars servers designed to be restartable which also means they recover after crash with minimum service disruption as possible.

Server crash time to emergency fix roll out was usually less than one hour. In retrospect, that’s still amazing.