About a year and a half ago, I started to become curious as to why there were so few computer games with VoIP support. It should be relatively easy, shouldn't it? You just grab an open source sound codec like Speex, throw in some sound code, add some network packet buffering code, mix, and voila--VoIP support. Unfortunately, I became distracted by something else and never got around to looking at the issue more deeply.
Just recently, I became curious again, so I decided to code up a small VoIP application. And, as expected, it did end up being fairly easy. I have to admit that I did waste a week going down the wrong path--I configured my sound card wrongly, resulting in a lot of feedback in the signal, so I spent a week examining algorithms for echo and feedback cancellation. But once I figured out what I was doing wrong, I was able to bang out a working little VoIP application in about a week.
The only part that wasn't really obvious (to someone who's been thinking about the problem for the past year) was the packet buffering code. Conceptually, it should be easy, you just store incoming packets somewhere and wait a few milliseconds before using them to compensate for packets coming in out of order and with jittered delay. But that leaves a lot of unresolved issues such as what should you do if you use up all the packets waiting for you? You can skip a packet (i.e. assume that they are lost) or skip a "heartbeat" (i.e. assume that you're consuming packets too quickly and wait for the packet to arrive). And what happens if you receive too many packets?
The scheme that I ended up using was to trigger everything based on the difference between the highest packet sequence number in the packet buffer and the next packet sequence number that the program is expecting. So whenever my program tries to grab the next packet, if the difference in sequence numbers is below a certain threshhold, then I skip a heartbeat; otherwise, I grab the next packet, skipping it if it hasn't arrived yet. If the difference in sequence numbers ever gets above a certain threshhold, then I advance the expected sequence number to get below the threshhold. I also had different threshholds for deciding when to adjust for clock-skew. Oddly enough, my sound card seemed to playback audio at a different sample rate than what I specified for some reason (it could quite well be my fault), so I put in some code for stretching incoming sound samples to take up more or less time. All of these threshholds should probably be adaptive somehow, but I couldn't be bothered, so I just set them to some default values.
Other than that, everything was pretty straight-forward. In Windows Java, the System.currentTimeMillis() method is very low-resolution, which caused my server to alternate between thinking it was very fast or very slow. Fortunately, newer versions of Java have a nanoTime() method, which provides a better high-resolution timer. I also had a bit of a problem with silence detection because my sound card didn't use 0 as a baseline. Sound samples seemed to center around -100 or so. I was able to get around this by taking the max and min sound samples of every chunk of sound, averaging them, and then exponentially averaging them with my previous estimate of what the baseline was. This seemed to work as an adaptive baseline. I'm not sure how to erect an adaptive silence detection algorithm on top of that though since I put a static algorithm there instead.
So the conclusion of all this is that VoIP support isn't included in more computer games due to laziness on the part of developers. It really is very easy to do. There may be some latency issues where computer game code is stealing so many CPU cycles that the VoIP code can't get CPU resources at regular enough intervals that it causes latency, but I'm sure it's something that game developers could work out with thread priorities and stuff like that.