VoIP Building Blocks
DiamondWare has spent the last six years developing fundamental building blocks for VoIP applications. Our media stack emerged from hours of work optimizing low-level code. This study and depth of understanding is integral to building robust systems for real-world conditions.
Any VoIP application developer faces the choice of developing or buying technology components. Many offerings today are cobbled together, with corresponding results. In contrast, our solutions are effectively vertically integrated. This provides benefits during development, including quality control, tight integration, efficient platform utilization, unique features, and perhaps most importantly the capability to guide and extend our vision of the future (e.g., 3D requires a complete stereo pipeline).
Good development takes time and trial and error, and it carries the lingering risk of being superseded by new ideas and approaches. In our case, we find ourselves today with the only VoIP media stack to address stereo opportunities. Consumer broadband uptake has accelerated, and Enterprise VoIP market growth is projected in the billions of dollars.
In the consolidation of automobile manufacturers nearly a century ago, those companies who could build engines fared better than the rest. We at DiamondWare have our "engine" and are now putting it into next generation VoIP applications. Our "engine" has a number of key components.
Media Stack
The media stack denotes the software components that handle the audio from "source" to "sink". Keeping in mind that audio flows in both directions, in the client one source and sink pair is the microphone to the network card, and the other is the network card to the loudspeakers or headphones. In a server, the first pair is network card to mixer, and the second is mixer to network card. Each of these four pipelines is quite different.
APipe is DiamondWare's software component to manage any audio pipeline. It abstracts objects for source, sink, and everything in between such as encryption, compression, logging to disk, echo cancellation, jitter buffering, voice disguise, automatic gain controller, etc.
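The idea of a pipeline that abstracts a source, a sink, and the stages in between can be illustrated with a minimal sketch. This is not APipe's actual interface; the names `Pipeline`, `Stage`, and `make_gain` are hypothetical, and a simple gain stage stands in for real stages such as compression or echo cancellation.

```python
from typing import Callable, Iterable, List

# A frame is a list of 16-bit PCM samples; a stage maps one frame to another.
Frame = List[int]
Stage = Callable[[Frame], Frame]


def make_gain(factor: float) -> Stage:
    """Return a stage that scales samples and clamps them to the 16-bit range."""
    def gain(frame: Frame) -> Frame:
        return [max(-32768, min(32767, int(s * factor))) for s in frame]
    return gain


class Pipeline:
    """Pull frames from a source, run them through each stage, push to a sink."""

    def __init__(self, source: Iterable[Frame], stages: List[Stage],
                 sink: Callable[[Frame], None]) -> None:
        self.source = source
        self.stages = stages
        self.sink = sink

    def run(self) -> None:
        for frame in self.source:
            for stage in self.stages:
                frame = stage(frame)
            self.sink(frame)


# Hypothetical usage: a list stands in for the microphone, another for the sink.
captured: List[Frame] = []
pipe = Pipeline(
    source=[[1000, -2000], [30000, -30000]],
    stages=[make_gain(2.0)],
    sink=captured.append,
)
pipe.run()
```

Because every stage shares one signature, stages can be added, removed, or reordered without touching the source or sink, which is the main appeal of this kind of abstraction.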
Telephony Sound ToolKit (Tele-STK) is the answer to one of the toughest problems in the client, in both streams. It handles the audio from mic to application, and from application to loudspeakers. Tele-STK was developed to break out of the conventional latency-robustness tradeoff. If the audio latency is too high, then the system does not provide business-class communications. But trading off sound quality for low delay is not acceptable.
JitPP is a dynamic jitter buffer. It performs three functions: (1) take the packets received from the network out of order and unevenly distributed in time and turn them into an ordered, periodic stream; (2) correct for clock "drift", i.e. the fact that even if the remote host is programmed to the same sampling rate, its clock rate will differ from the local clock and therefore it will send either too many or too few samples; and (3) conceal lost packets.
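Functions (1) and (3) can be sketched in a few lines. The following is an illustrative toy, not JitPP's algorithm: the class name `JitterBuffer` and its `depth` parameter are assumptions, the buffer depth is fixed rather than dynamic, loss is concealed by simply repeating the previous frame, and clock-drift correction (2) and sequence-number wraparound are left out.

```python
from typing import Dict, Optional


class JitterBuffer:
    """Reorder packets by sequence number and conceal gaps on playout."""

    def __init__(self, depth: int = 2) -> None:
        self.depth = depth                    # packets to collect before playout
        self.pending: Dict[int, bytes] = {}   # seq -> payload awaiting playout
        self.next_seq: Optional[int] = None   # next sequence number to play
        self.last_played = b""
        self.started = False

    def push(self, seq: int, payload: bytes) -> None:
        """Called whenever a packet arrives, in any order."""
        # Until playout starts, the lowest sequence seen becomes the start point.
        if self.next_seq is None or (not self.started and seq < self.next_seq):
            self.next_seq = seq
        if seq < self.next_seq:
            return                            # arrived too late to be played
        self.pending[seq] = payload
        if len(self.pending) >= self.depth:
            self.started = True

    def pop(self) -> Optional[bytes]:
        """Called once per frame period by the playout clock."""
        if not self.started:
            return None
        payload = self.pending.pop(self.next_seq, None)
        if payload is None:
            payload = self.last_played        # conceal the loss: repeat last frame
        self.last_played = payload
        self.next_seq += 1
        return payload


# Packets arrive out of order, and packet 3 never arrives.
jb = JitterBuffer(depth=2)
for seq, payload in [(1, b"B"), (0, b"A"), (2, b"C"), (4, b"E")]:
    jb.push(seq, payload)
played = [jb.pop() for _ in range(5)]
```

Here `played` comes out in sequence order, with the frame for the missing packet 3 filled by a repeat of packet 2. A production buffer would adapt `depth` to measured jitter and use a better concealment method than repetition.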
DirectMixer® solves a problem that is not apparent until one develops a softphone application: it doesn't work unless the recording source is the microphone, the mic is unmuted for recording but muted for playback, loudspeakers are the digital audio destination, the loudspeakers are unmuted, and the volume levels for record and playback are appropriate. It turns out that this is an exceedingly difficult thing to achieve using the Windows API calls, but it is required nonetheless.
Mixlib is the heart of Palantir. It provides control mechanisms for joining/parting conferences and other actions which can affect who hears whom, as well as for setting 3D positions, voice colorization, volume levels, and other real-time parameters. It can also report indications such as VU meter levels and the member list of a conference. The heart of Mixlib is the mixing itself. This part of the code base is critical; it must be extremely efficient because it executes so many times per second.
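To make the mixing step concrete, here is a minimal sketch of a conference mix in which each listener hears the sum of every other member's audio. It is an assumption-laden illustration rather than Mixlib's code: the function name `mix_for_listener`, the per-talker `gains` table, and the use of plain Python lists are all hypothetical, and a real mixer would be written for speed.

```python
from typing import Dict, List


def mix_for_listener(frames: Dict[str, List[int]], listener: str,
                     gains: Dict[str, float]) -> List[int]:
    """Sum every talker's frame except the listener's own, then clamp to 16 bits."""
    if not frames:
        return []
    length = len(next(iter(frames.values())))
    out = [0.0] * length
    for talker, frame in frames.items():
        if talker == listener:
            continue                      # members do not hear themselves
        g = gains.get(talker, 1.0)        # per-talker volume, default unity
        for i, sample in enumerate(frame):
            out[i] += g * sample
    # Clamp so that loud sums saturate instead of wrapping around.
    return [max(-32768, min(32767, int(round(v)))) for v in out]


# One frame period for a three-member conference (two samples per frame).
frames = {
    "alice": [1000, 2000],
    "bob": [3000, -4000],
    "carol": [30000, -30000],
}
gains = {"carol": 0.5}
alice_hears = mix_for_listener(frames, "alice", gains)
carol_hears = mix_for_listener(frames, "carol", gains)
```

The inner loop runs once per sample, per talker, per listener, which shows why this part of a conferencing server dominates the processing cost.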
DDD is the algorithm and code to convert a monaural voice stream into a stereo sound stream that is perceived to be outside the listener's head, with a particular location vector. Like Mixlib, this function has to be extraordinarily fast to run.
Red, Blue, Green, etc. are the algorithms that colorize voices and "tag" up to eight teams, each with a unique and distinctive sound. Like Mixlib and DDD, they must be very fast to run.
DiamondWare licenses acoustic echo cancellation, automatic gain control, and voice activity detection functions. Similarly, it licenses (or uses freely when available) compression algorithms including G.711, GSM, G.723, G.726, G.729, Speex, and Vorbis.
Signaling
Telephony Network ToolKit is a signaling and transport component that provides guaranteed delivery of datagram packets with minimal latency.
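One common way to get guaranteed delivery over datagrams is to number each packet, wait for an acknowledgement, and resend anything not acknowledged within a short timeout. The sketch below shows only the sender's bookkeeping for that generic technique; it is not a description of how Telephony Network ToolKit works, and the class name `ReliableSender` and the 50 ms timeout are assumptions. A matching receiver would acknowledge each sequence number and discard duplicates.

```python
from typing import Dict, List, Tuple


class ReliableSender:
    """Track unacknowledged datagrams and report which ones need resending."""

    def __init__(self, retransmit_after: float = 0.05) -> None:
        self.retransmit_after = retransmit_after   # seconds before a resend
        self.next_seq = 0
        # seq -> (payload, time of the most recent transmission)
        self.unacked: Dict[int, Tuple[bytes, float]] = {}

    def send(self, payload: bytes, now: float) -> Tuple[int, bytes]:
        """Assign a sequence number and remember the packet until it is acked."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (payload, now)
        return seq, payload

    def ack(self, seq: int) -> None:
        """The peer confirmed receipt; stop tracking this packet."""
        self.unacked.pop(seq, None)

    def due_for_resend(self, now: float) -> List[Tuple[int, bytes]]:
        """Return packets whose timeout has expired and restart their timers."""
        due = []
        for seq, (payload, sent_at) in sorted(self.unacked.items()):
            if now - sent_at >= self.retransmit_after:
                self.unacked[seq] = (payload, now)
                due.append((seq, payload))
        return due


# Two packets go out at t=0; only the first is acknowledged.
sender = ReliableSender()
sender.send(b"INVITE", now=0.0)
sender.send(b"BYE", now=0.0)
sender.ack(0)
resend = sender.due_for_resend(now=0.1)
```

A short timeout keeps recovery latency low at the cost of occasional unnecessary resends, which is the tradeoff any low-latency reliable transport has to tune.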
DWSIP is DiamondWare's Session Initiation Protocol stack. It was developed after examining several commercial and open-source SIP stacks. Although low-performance SIP stacks are a commodity item, there are none that require less than 100K of memory, and none under a megabyte that support the latest Internet Engineering Task Force Request for Comments (IETF RFC) standards, namely RFC 3261 and SIMPLE. Reluctantly, DiamondWare built its own SIP stack, which saved on license fees and enables a truly small PDA voice application. Other developers who target small embedded platforms may want to consider DWSIP.
Applications
It is on top of all the above components (and many lesser modules which don't merit individual discussions) that an application is built.