Adventures in WebRTC: Making Phone Calls from XMPP
0@psycoti.ca
Normally when I'm writing an article, I prefer to not front-load the article with a bunch of technical stuff and instead bring it up organically. This article is different, though, and if anyone is going to get anything out of this I've got to set up a bit of background, so bear with me. I'll go into more detail on these things when it makes sense, but I have to at least introduce the players in our little play.
First, we have calls. You probably know about these, phones aren't exactly new, but I want to clarify that these are voice calls, as you may expect from a phone. One person calls the other, it rings, they pick up, and then audio conversation can be had.
Second, we have XMPP, which is a chat protocol that's been popular in the FOSS (Free and Open Source Software) community for years. It has also made appearances in various commercial offerings, from Google's GTalk to Facebook's Messenger (at one point), among others. The big feature XMPP has compared to other chat protocols is that it's a standard, which means there are multiple implementations of both clients (the program the user uses) and servers (which host the user's account). There's also a large collection of extensions that build on the standard to provide more features, which clients and servers can choose to implement. Just as importantly, it's "federated", which means users of one server can talk with users of another server seamlessly.
Third, we have Jingle, which is one of the previously mentioned extensions; it allows two XMPP users to set up a peer-to-peer data connection, in this case for exchanging voice data.
Fourth, we have Asterisk. Asterisk is an open source telephone system. You can use it to receive or send phone calls, set up phone trees and extensions, etc. Because it supports many different protocols for sending voice data, it can be used to connect a call from one protocol, like Jingle, to another protocol.
And finally, JMP, which is the company I'm working for. JMP integrates XMPP with the mobile phone network, giving XMPP clients a user they can contact representing any phone number, and we'll turn the chats and Jingle calls into SMS messages and phone calls and vice-versa. This allows JMP's customers to use a client of their choice, across mobile devices and desktops, to communicate with people who are still using SMS and traditional telephones and haven't moved to XMPP yet.
Ok, now we can establish the starting point of our story. We have SMS working in both directions already, and thanks to the work of my co-worker singpolyma and patches from a user named eta, we already had phone calls coming in properly. But allowing our users to call out to phone numbers, that is, from XMPP to cell phones, wasn't working yet. We figured it was just a small tweak to the existing setup, so I set out to find the simple change that was required.
And now, over a month later, here's the path I went through, with gratuitous technical details along the way.
My Initial Testing Setup
It started out pretty simply. I have an app on my phone, Conversations, which is one of the XMPP clients we recommend to our users. I tested that I could receive calls, and everything worked out well. But when I tried calling a special user created just to test outbound calls, my phone would ring, I would answer it, and the app would just sit at "connecting" forever.
I tested a normal Jingle call to another user of the same app, and it worked in both directions, so I knew the app worked fine. I didn't see anything in the logs from my phone to describe what might be the problem, so I had to look somewhere else. There were some logs in Asterisk, which is acting as a bridge between Jingle and the phone network, but nothing that stuck out right away. To get more information about what was actually happening, I wanted to get into the code and add my own debugging logic. Luckily, Conversations is FOSS, and so that's very possible. But it was also inconvenient to install a development build of Conversations onto my phone, because that would replace the app I used every day. I also wanted to use more tools I had on my computer, so I decided to install Movim instead. Movim is another XMPP client that supports the same voice calling features, but it runs on the computer. In a web browser, in fact. Normally it would run as a hosted setup, where someone runs a Movim server for you out on the web and you just connect to it as a web page, but given what I wanted to accomplish I had to run the server myself on my computer so I could get at both sides of the conversation. A quick test confirmed that Movim could send and receive XMPP-to-XMPP calls just fine, but had the same issue with XMPP-to-phone calls, so it wasn't just a Conversations bug.
I was now ready to start actually digging into this problem.
Initial Jingle Debugging
XMPP, as a protocol, works by sending XML stanzas over a connection. This means it's very easy to just look in some logs and see what's actually going on. For example, if I were to send a normal message it may look something like this:
<message type='chat' id='dd75a234-8f44-46ff-bd37-c878d04aef92' to='user@example.com'>
  <body>Hello!</body>
</message>
That makes it pretty easy to debug, which is why that's where I started. My initial theory was that there was something wrong with the messages being forwarded back and forth between my users and Asterisk that was making it impossible for them to talk to each other. To investigate that, I changed my local version of Movim to log every XMPP blob it sent and received, so I could inspect them with my human eyes and brain to make sure they looked legitimate. To my disappointment they appeared relatively sensible on both the Asterisk and Movim sides; the information they exchanged seemed to contain accurate addresses and formats. It would have been nice to have been done here.
Oh well! Undeterred, I traced through the Movim code to figure out where the information eventually ended up. Somewhere in the code Movim must use these exchanged addresses to establish the voice connection. Eventually I got down to a line in the front-end JavaScript that just took the result, reshaped the data, and then called setRemoteDescription on some peer connection object. That's part of the WebRTC APIs. Crap.
Detour 1: WebRTC
WebRTC is a set of APIs supported by modern web browsers that allow users to establish peer-to-peer connections. Normally when a user is visiting a webpage, even a page that gives the appearance of interacting with other users, that interaction is actually done through the web server. To illustrate the difference, let's assume that two people, named Romeo and Juliet, are using a webpage to chat. Romeo visits the webpage: his browser contacts the server and requests the chat page, which is sent back to show him a box to type things in and the messages he's already received. He sees a new message from Juliet, so he types his response into the box and hits send. What actually happens is that Romeo's browser sends the message he typed to the server along with information about who should receive the message. The server will take the new message and store it along with the other messages in the list of messages Juliet has received. At some point in the future Juliet will load this page, and in doing so the server will send her all the messages that have been stored for her, including this new one. If the site is fancy, she won't even have to refresh the page to get them! Maybe it's periodically asking the server if there are any new messages so long as she's on the page, or maybe there's even a WebSocket waiting to be pushed messages, but in either case Romeo's message goes from his computer, through the server to get stored, and then is retrieved from that server when Juliet's computer requests the messages.
That's fine for occasional short bits of text, or even the occasional picture. But if we imagine that Romeo instead wanted to start an audio or video chat, that's a whole other thing. First of all, it's a lot more data, and it's constantly generating more as the call goes on. And that traffic is effectively doubled, because Romeo has to send it to the server, the server has to put it somewhere, and then Juliet has to pull it back down. So it would be nice for the server operator if the two users could send their data directly to each other, rather than involving the server at all.
The second reason is latency, or how long it takes for a sent bit of data to be received. Even if the server could handle all the data Romeo was sending it, it will usually take longer for the data to go from Romeo, up to the server, then from the server down to Juliet, versus just going directly from Romeo to Juliet.
So the way WebRTC works is that the user's browser has support for a bunch of standards for forming peer-to-peer connections (person-to-person, user-to-user, browser-to-browser, whatever you want to call them). If Romeo wanted to start a video call with Juliet, he would send her a special message through the server as normal, but this message would contain information on how Juliet could contact him directly. If she wants to talk with him, she would respond with a special message of her own (also through the server) containing information on how she could be contacted directly. Code on each of their webpages would take the special direct-contact information the other party sent and give it to the browser through the setRemoteDescription method I mentioned earlier, signalling to the browser that it would like a direct connection to be established. The page doesn't have to know or care how that happens; it will just be told when it works or doesn't. And life will be good.
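To make that concrete, here's a minimal sketch in TypeScript of the page's side of that dance. The sendToPeer and onPeerMessage helpers are hypothetical stand-ins for whatever server-relayed channel the page uses; the RTCPeerConnection calls are the real browser API, though a real page would also handle errors and media tracks.
// Hypothetical signaling helpers: messages relayed through the server,
// e.g. over a WebSocket. Only these two names are made up.
declare function sendToPeer(msg: { description?: RTCSessionDescriptionInit; candidate?: RTCIceCandidateInit }): void;
declare function onPeerMessage(handler: (msg: { description?: RTCSessionDescriptionInit; candidate?: RTCIceCandidateInit }) => void): void;

const pc = new RTCPeerConnection({ iceServers: [{ urls: "stun:stun.example.org" }] });

// Caller: describe how we can be reached, and relay that through the server.
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
sendToPeer({ description: offer });

// Direct-contact info arriving from the other side goes straight into
// setRemoteDescription / addIceCandidate; the browser does the rest.
onPeerMessage(async (msg) => {
  if (msg.description) await pc.setRemoteDescription(msg.description);
  if (msg.candidate) await pc.addIceCandidate(msg.candidate);
});

// As the browser discovers new ways to reach us, relay those too.
pc.onicecandidate = (ev) => {
  if (ev.candidate) sendToPeer({ candidate: ev.candidate.toJSON() });
};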
Back to Jingle
Ok, so if life should be good, why was I unhappy that the information I was tracing was going directly into setRemoteDescription? Well, because life wasn't good and things weren't working. And more importantly, it wasn't as easy to figure out what was going on inside WebRTC. I had the code for Movim, I'd already changed it to get extra diagnostic information, and I could easily add more code to tell me more things, but the browser's implementation of WebRTC is a bunch of C++ that's built directly into the browser. Even though the browser is FOSS, so the code is available, it's still much harder to make a change to it, build a new version of just that part, integrate that into the entire browser, and then build all of that just to test one small change.
So I wanted to avoid that if I could. For now, things were still ok. WebRTC is something the browser knows about, so it should have some kind of tools for WebRTC app developers to help debug things. I was using Firefox, so I searched for "Firefox WebRTC debug" and found a tool. I installed it and was instantly disappointed. It told me essentially nothing that I couldn't have found out already. It doesn't give any insight into the inner workings of WebRTC; it just gathers up the events and properties any app could have subscribed to, subscribes on my behalf, and gives me a list of updates. I guess that's better than having to write that code myself, but in this case I just saw a list of updates that seemed reasonable, and still nothing worked. Not very helpful.
The situation looked a little better on Chromium, so I switched over to that. There's a special built-in URI, chrome://webrtc-internals/, that provides on the left of the page the same log of events that Firefox's plugin did, but on the right it let me peek inside all the various data structures in their current state and see what was going on. Finally I was getting somewhere!
I started by comparing working calls (inbound calls) to broken calls (outbound calls), to get a sense of what was different about them. There were a few false starts here that ended up coming down to randomness instead. It seemed like inbound calls always had some events in one order and outbound had them in another, so maybe it was a race condition there! But then 1 in 5 inbound calls would actually have them in reverse order and it would still work. It looked like maybe it would work only if srflx candidates were chosen (don't worry about what that means yet), but then it wouldn't again. I was grasping at straws trying to find a pattern. While I was looking through properties, though, I did notice one strange thing: one of the pieces of information exposed is the number of bits flowing through each candidate pair and each interface. I noticed as I was looking around that the peer connection's inbound data rate was 0, but one of the interfaces had data streaming into it. It just wasn't the interface that had been chosen for the connection... That was weird. It meant Asterisk was trying to send data to the wrong place. To understand what that meant we have to go on another detour!
Detour 2: NATs and ICE
So what are these "candidates", and "addresses", and why are we exchanging them, pairing them up, and choosing them? This is part of the standard called ICE (which has a newer version of the standard also called ICE). This standard also relies on another standard called STUN and optionally an extension to STUN called TURN. That all sounds pretty complicated, but what is it all for? The problem is NATs. First I'll give the basic, oversimplified, version of the problem. Then the basic, oversimplified, version of the solution.
The Problem: NATs
The way practically the entire internet works is that each device is given an address. Then, when one computer wants to talk to another, it wraps the data up into a packet, addresses the packet to the other computer, and then pushes it out the wire (or through the air) onto "the network". If the computer on the other end of that wire (AKA the router) is the one the packet was addressed to, then the packet is received! If not, but the router knows where to find that computer, the packet is sent along that wire and bounces along from computer to computer until it reaches its destination. There's a problem with this, though. In the original design for the internet the addresses were given enough digits to allow for roughly 4 billion different addresses in the best-case scenario. In reality it was much fewer than that. In the 80s that may have seemed like an enormous number of computers, but these days a single human may have a laptop, tablet, desktop, and a phone. They also may have a thermostat, a fridge, a speaker, a television, and various light bulbs and sensors that are also on the internet. Even in the 90s a company may have had a building full of hundreds or thousands of computers, all for people who also had their own computers at home. There was just more demand than expected. So, for this reason (among others) NATs became common.
The idea of a NAT is relatively simple. We have one side, the Local Area Network (LAN), and another side, the Wide Area Network (WAN). Think "inside my network" and "the internet". Inside my network I can give computers whatever addresses I want, so long as they're unique among the computers inside my network. There are some strong recommendations on what kinds of addresses to give out, but technically I can do whatever. So if I have a LAN, and you have a LAN, we can both have computers on our LANs addressed "10.0.0.10", and that's fine because the address is "local" to our own networks. So if one computer on my local network wants to talk to another, it works just like I described before: "10.0.0.10" produces a packet for "10.0.0.20" and sends it along its cable. The router sees the packet for "10.0.0.20" and thinks "Ah, that computer is down this wire" and sends it along, and "10.0.0.20" receives it. All is as it was.
But what if "10.0.0.10" wants to talk to a computer on the internet? It addresses the packet the same as before, and sends it to the router the same as before. The router will look at the destination address and see that it's on the WAN side, which means the data has to go out onto the net. But, the internet is really only useful if packets can be responded to. I want to ask the internet for something, and get an answer back! But the router can't just tell the other side that this traffic is from "10.0.0.10", because that's a private "local-only" address. Multiple networks could have a computer with that address and there'd be no way to figure out which was which! So what the NAT does is rewrite the packet to put the router's own WAN address as the source of the message before forwarding it on. That way, if the other side does respond, then the response will come back to the router, and the router can figure out which request that response was for, rewrite the destination to be the original sender of the packet, and then forward the response into the LAN, rewritten to look like it's for the proper computer. The final piece is that packets aren't just addressed to a computer, but a combination of computer and "port", so the computer can handle multiple independent connections and know which data is associated with which connection. So if a packet goes out from computer "10.0.0.10" and port "537", then the NAT might give it port "537", or "700", or "12432" on the WAN side. It can pick whatever it wants; what's important is that it remembers its choice. Then when a response packet comes back to port 12432 on the public side it can look up in its table to see that really means 10.0.0.10 at port 537 on the LAN side, so it knows how to rewrite it properly.
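To make the bookkeeping concrete, here's a toy sketch of that translation table in TypeScript. Every name here is made up for illustration; a real router does this with far more edge cases.
// Toy model of a NAT translation table: WAN port -> original LAN sender.
interface Endpoint { addr: string; port: number; }

const translations = new Map<number, Endpoint>();
let nextWanPort = 12000;

// Outbound packet: remember who sent it, rewrite the source to router:wanPort.
function translateOutbound(src: Endpoint): number {
  const wanPort = nextWanPort++; // any free port works; remembering the choice is what matters
  translations.set(wanPort, src);
  return wanPort;
}

// Inbound response to, say, WAN port 12432: rewrite back to 10.0.0.10:537.
function translateInbound(wanPort: number): Endpoint | undefined {
  return translations.get(wanPort); // no entry means nobody asked, so drop the packet
}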
So, that's lovely. Now network operators and consumer network equipment can assign whatever addresses they want on the LAN, and none of it reduces the number of addresses out on the net. People can still talk to the internet, and it all just works the way the computers expect. You can even put one NAT on the LAN side of another, and then put that on the LAN side of another NAT, and have a tree of NATs! Each packet that comes through gets translated by each layer, and then forwarded on to be translated by the next, and when a response packet comes in each layer performs the transformation and forwards "down" the tree until eventually it gets back to the original computer. So what's the problem? Why bring any of this up? It's because this system only works if everyone inside a NAT only ever wants to reach out to things "on the internet", like servers. But if you remember why you're reading this in the first place, WebRTC is trying to set up direct peer-to-peer connections between two different internet users without having traffic go out to a server at all! NATs can't work in that way, because the whole point of their translation is that the computer's address is not something routable on the general internet; it's a local address that only exists within its own LAN. There's nothing you can do to tell a NAT, or the top NAT in a tree of NATs, how to send a packet it's received that it didn't already have a translation saved for.
The Solution: STUN, TURN, and ICE
That last sentence really is the key to how we're going to overcome this limitation. If I want people to be able to talk to my browser from anywhere on the internet, what I can do is first send some packet out somewhere so my NAT and any other NATs between me and the internet will all make a translation that will allow responses to that packet to find their way back to me. Then, if somehow the other browser I wanted to communicate with could know the last layer of that translation, the "public" address and port that eventually made it onto the real internet, then they could send traffic there and all the NATs would perform their translation and I would eventually get the packet they sent! Magic! The problem, though, is that I don't know what that final "public" address and port are. Most routers don't expose that kind of information to the LAN, and even if they did there's no guarantee that their WAN address isn't just some other NAT's LAN address. Really we don't care about my router, or how many NATs there are, we just want to know what the internet sees.
So what STUN does is define a format for packets that one can use to find out this information. Then people can run STUN servers, either for public use, or maybe specific to the particular app that's looking to communicate. The STUN server runs on the public internet, and when I send it a packet asking it who I am, it will respond with a packet back to me. But inside that packet it will include the address and port it saw my request come from, which allows me to know my own "public" identity. It would be like calling someone to find my own phone number by asking them what number they see. There's more to STUN than that, but that's all we need for now.
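You can actually watch this from a browser: point an RTCPeerConnection at a STUN server (the address below is a placeholder) and the "srflx" candidates that come back carry the public address and port the server saw. A rough sketch:
const pc = new RTCPeerConnection({ iceServers: [{ urls: "stun:stun.example.org:3478" }] });
pc.createDataChannel("probe"); // gathering only starts once there's something to negotiate

pc.onicecandidate = (ev) => {
  // "srflx" (server reflexive) candidates are STUN's answer to "who am I?"
  if (ev.candidate && ev.candidate.candidate.includes(" typ srflx")) {
    console.log("The internet sees me as:", ev.candidate.address, ev.candidate.port);
  }
};

await pc.setLocalDescription(await pc.createOffer());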
This will work on many NATs, but not all, because of course it has to be difficult. Some NATs go further than what I've said so far. Rather than just remembering an association between a source address and port and whichever public port they pick, they key the mapping on the full combination of source address and port and destination address and port when picking an output port. That means if I send a packet to 8.8.8.8 the NAT might pick port "3425" to use, but if I send from the same port on my end to 4.4.4.4 instead, rather than re-using 3425 it will pick a totally different port. Because of this, if I send a STUN packet out to address 8.8.8.8 and it tells me my public address and port, and I give that to 4.4.4.4, then when it uses that address to try and talk to me it won't match my NAT's table, because there are no entries for 4.4.4.4 at all. No packets were ever sent to that address, so no entry was made. These are called "symmetric NATs" by STUN, and they completely ruin our plan.
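In terms of the toy table from before, the difference is just what the mapping is keyed on (again, a sketch, not any real router's code):
type Ep = { addr: string; port: number };

// Non-symmetric NAT: mapping keyed on the local source only, so every
// destination sees the same public port.
const coneKey = (src: Ep) => `${src.addr}:${src.port}`;

// Symmetric NAT: mapping keyed on source AND destination, so the port the
// STUN server saw tells you nothing about the port anyone else will see.
const symmetricKey = (src: Ep, dst: Ep) => `${coneKey(src)}->${coneKey(dst)}`;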
To get around this, TURN was created as an extension to STUN. TURN allows another kind of request, which asks the STUN server to allocate a port on the STUN server itself and forward any packets it receives there back to me. Then the STUN server tells me what its own address is and what port it picked for me (very much like an opt-in NAT, actually). So now, even if I'm behind a symmetric NAT, I can still give the other person the STUN server's address and my port there, and I know those packets will make their way back to me down the same connection I used to make this request. This isn't ideal. It's not really peer-to-peer anymore. Packets are still going through another server, which means the TURN server (really the STUN server that supports TURN) needs to be able to handle the volume of traffic, and there's still an extra hop adding latency.
There are a few silver linings, though. The first is that if Romeo needs to use TURN, that doesn't mean Juliet does. So any packets Juliet sends to Romeo may go through another server, but packets from Romeo to Juliet can go direct, which is still half as much traffic to the TURN server. Even if they both need to use a TURN server, there's no need for them to use the same one. That can mean that Romeo is talking to 8.8.8.8 and Juliet is talking to 4.4.4.4, so each of those servers only sees half the traffic. Also, TURN servers can be much simpler than app servers. They don't have to put anything into a database or anything like that; they just turn each packet they receive into one they send, without even knowing what's in it. This allows them to process more packets per second. And finally, it means the app developer doesn't have to implement both a peer-to-peer mode and a "that didn't work" mode. If they run TURN servers, then at the very least the call will go over those and run the same as if it were truly peer-to-peer.
So we have our actual addresses, which might work if we aren't behind a NAT. We have STUN addresses, which will work for many people who are behind one or more NATs. And we have TURN addresses, which should work for everyone. But TURN is more effort, so we'd rather use STUN if that works. And even STUN is more work than it needs to be if direct messages work, like if both users are behind the same NAT. So what we need is a way to figure out which of these work and then pick the best one. This is what the standard ICE adds to STUN and TURN, and it's where we get to the "candidates" I mentioned earlier.
With ICE, Romeo would find all the network addresses his device has, and maybe use STUN to find his public addresses and ports, and maybe even TURN to get a new public address and port. The idea is that any of these might work; they are "candidates" Juliet may be able to use to send packets to him. Juliet will do the same thing on her end and get her own list of candidates. They send these lists to each other, and then each side pairs each of its own candidates with each of the other side's candidates to get a list of candidate pairs. So if Romeo had "A, B, C" and Juliet had "X, Y, Z", then both sides would make the list of candidate pairs "(A, X), (A, Y), (A, Z), (B, X), (B, Y), (B, Z), (C, X), (C, Y), (C, Z)". Now they go through each pair, sending packets from their candidate to the other candidate. If one of them receives a packet, then that means this direction works, at least. They respond with a success message, and then they will immediately send their own check for that pair back, if they hadn't already, to see if it works in the other direction too. At the end of this process we will have tested all candidate pairs and we'll have a list of the ones that worked in both directions.
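The pairing step itself is just a cross product. A sketch, with a made-up candidate shape:
interface Candidate { addr: string; port: number; kind: "host" | "srflx" | "relay" | "prflx"; }

// Romeo: [A, B, C], Juliet: [X, Y, Z] -> nine pairs to check.
function pairUp(mine: Candidate[], theirs: Candidate[]): Array<[Candidate, Candidate]> {
  const pairs: Array<[Candidate, Candidate]> = [];
  for (const local of mine) {
    for (const remote of theirs) {
      pairs.push([local, remote]);
    }
  }
  return pairs;
}
// Each pair then gets a connectivity check from local to remote, and only
// counts as working once checks have succeeded in both directions.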
Along with the candidates, ICE also has us keep track of what kind of candidate each one is: "host" for network addresses the device has directly, "server reflexive" ("srflx") for the ones we got from STUN, "relayed" ("relay") for the ones from TURN, and "peer reflexive" ("prflx"), which only comes up in cases where one side successfully receives a packet from an address that isn't otherwise a candidate, meaning there's some other network quirk between the two users. Each of these candidate types is given a priority based on our preferences; for example we'd prefer to use host over srflx, and we'd prefer srflx over relay. Then we can combine the priorities of each side of a pair to get a priority for each working pair. At this point we can simply pick the best working pair, given our priorities.
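For the curious, the combining step is a concrete little formula in the ICE RFC (RFC 8445 §6.1.2.3); here's a sketch of it, where g and d are the candidate priorities from the controlling and controlled sides:
// Pair priority: the lower of the two priorities dominates, the higher one
// breaks ties, and the controlling side wins exact ties. BigInt because the
// 32-bit shift pushes past JavaScript's safe-integer range.
function pairPriority(g: bigint, d: bigint): bigint {
  const lo = g < d ? g : d;
  const hi = g < d ? d : g;
  return (1n << 32n) * lo + 2n * hi + (g > d ? 1n : 0n);
}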
There is one last wrinkle. Networks are complicated, and sometimes things go missing. As a last protection against this, one side of ICE (conventionally the person making the call) is designated the "controlling" peer. The controlling peer gets final say in which pair is actually nominated to be used. So once we have a pair we like, the controlling peer sends a request again with a "use-candidate" value to tell the other side that this is it; the other side responds back "ok" so everyone knows we're all clear and that communication on this pair still works. At this point we're finally done with ICE and we have a pair of ports that can be used to talk from peer to peer.
Back to WebRTC
Ok, that was pretty in-depth. Let me remind you where we are here: I'm looking in my WebRTC debugging tools and I'm seeing that the connection object isn't getting any data, and the candidate pair chosen to represent that connection isn't getting any either. But there is a candidate pair that is getting data, it's just not the right one!
So my first thought on how this could happen was that maybe there was a disagreement between Movim/Chrome and Asterisk on which of them was the controlling peer! That would also explain why inbound calls work, because the caller is always the controlling peer. So if Asterisk thought it was the controlling peer in both cases, then Chrome would agree on inbound calls, but disagree on outbound calls. It felt pretty right. There is a section in the ICE standard on how to resolve a situation like this, but maybe it wasn't being followed properly. Here's the problem... this is a pretty internal detail of the implementation of ICE in these two pieces of software. I didn't want to take our production Asterisk server down and add a bunch of logging, maybe even breaking it in the process. And like I mentioned before, I wasn't excited to rebuild Chrome just to test this. I spent a little bit of time looking to see if there was some implementation of ICE simpler than Chrome that I could use as a stand-in and be more free to make quick changes to, but nothing jumped out that wasn't going to be hard to adapt to my actual use-case. I lamented: I didn't even need logs, what I really wanted was some way to see the data being sent back and forth without modifying anything. Oh... wait a second...
Wireshark
At this point it became clear that I had been working in web and other special areas for too long. I had been searching for so long for a way to inject logging statements into this flow somehow so I could see what was on the network, when I should have immediately reached for Wireshark. It had previously been a tool in my toolbox, but I hadn't touched it since everything became all-web-all-the-time. Wireshark is a program that just records all of the packets your computer sends and receives, and shows them in an interface that makes it easy to filter, search, and inspect. I didn't need the programs to log what they thought they were doing, I could inspect what they actually did and follow along that way! What's even better is that Wireshark already knows about STUN and TURN, so it can show me the different fields without me having to know how to unpack the bits from the packet myself!
In Wireshark I could search for "stun" and it would only show me the packets for STUN. It could also expand the STUN attributes, because it knows about them, and tell me in plain terms "this is a binding response" along with my IP, and a tiny diamond in the margin even pointed each response to the corresponding request it was answering. Very handy stuff. Much better than logging.
If you remember what I mentioned before, ICE has bits in the standard that try to correct for the situation where both sides think they're the controller. In order to do that, each side declares on each request whether it's making it as the "controlling" or "controlled" party, which meant it was easy to figure out whether each side thought it was the controlling peer. Sure that this was it, I looked; everything was to spec. Fuck.
Ok, if Asterisk knew it wasn't the controlling peer, why wasn't it using the candidate pair the controlling peer was nominating? Now that I had the real data being exchanged, I could start going from packets, to standard, to code, and back, to try and trace out how everything was actually working. After tracing for a while along the path I expected the flow to take, I found a real problem!
{
char address[PJ_INET6_ADDRSTRLEN];
- if (component < 1 || !ice->comp[component - 1].valid_check) {
+ if (component < 1 || !ice->comp[component - 1].nominated_check) {
return;
}
ast_sockaddr_parse(cand_address,
- pj_sockaddr_print(&ice->comp[component - 1].valid_check->rcand->addr, address,
+ pj_sockaddr_print(&ice->comp[component - 1].nominated_check->rcand->addr, address,
sizeof(address), 0), 0);
ast_sockaddr_set_port(cand_address,
- pj_sockaddr_get_port(&ice->comp[component - 1].valid_check->rcand->addr));
+ pj_sockaddr_get_port(&ice->comp[component - 1].nominated_check->rcand->addr));
}
/*! \brief Destructor for locally created ICE candidates */
This code was toward the end of the ICE session, where we take the result of ICE and prepare to return it to the main code. Every time we get a response to one of our checks it's marked as "valid", and the valid_check field is updated whenever a newly valid check has a better priority than what's stored there. That way, by the end, the best-priority valid check is easily in reach. Similarly, any time the controller nominates a candidate we store the result in the nominated_check field. But here, at the end, we were using the best valid check, not the best nominated check. That only works so long as the best valid check is also nominated, which is expected by the standard but not technically required, and also not what I was seeing. This is great! Finally, something that explains what I was seeing: Asterisk sending to a candidate that wasn't nominated.
So I deployed that and confidently ran it. Still didn't work. That was disappointing.
Ok, back to Wireshark looking for other weird things. Paying closer attention to the actual flow, I noticed that a request would go out for one pair and I'd get the response properly. But that's it. If we remember the standard, there's supposed to be an immediate request in the opposite direction to test that direction, but now that I was looking, I only saw responses to my requests, and never the requests originating from Asterisk. This is important because the ICE negotiation isn't fulfilled until the other side makes these requests. From our perspective only one direction works, and it looks like we can't actually talk to Asterisk using the same channel Asterisk can talk to us on. So we keep trying to nominate a pair, but never receive the expected opposite request to confirm to us that this pair is good, so we try again, and again, etc.
This gave me an area in the code to look at, at least. What's worse, this code actually lives outside of Asterisk's codebase, in an external library called pjproject that Asterisk pulls in for its ICE support. After some investigation and comparing to the standard I noticed this section:
/*
* 7.2.1.4. Triggered Checks
*
* Now that we have local and remote candidate, check if we already
* have this pair in our checklist.
*/
for (i=0; i<ice->clist.count; ++i) {
pj_ice_sess_check *c = &ice->clist.checks[i];
if (c->lcand == lcand && c->rcand == rcand)
break;
}
/* If the pair is already on the check list:
* - If the state of that pair is Waiting or Frozen, its state is
* changed to In-Progress and a check for that pair is performed
* immediately. This is called a triggered check.
*
* - If the state of that pair is In-Progress, the agent SHOULD
* generate an immediate retransmit of the Binding Request for the
* check in progress. This is to facilitate rapid completion of
* ICE when both agents are behind NAT.
*
* - If the state of that pair is Failed or Succeeded, no triggered
* check is sent.
*/
So this comment specifically references part of the ICE standard, section 7.2.1.4, in deciding when to send these triggered checks back with the same candidates after receiving a check from the other side. The problem is that it's actually wrong; the standard says:
If the state of the pair is Failed, it is changed to Waiting and the agent MUST create a new connectivity check for that pair (representing a new STUN Binding request transaction), by enqueueing the pair in the triggered check queue.
So in their implementation, if I've already tried and failed to contact you with a candidate pair, and then later I get a request from you with that pair, I should ignore it. The standard, though, says I should instead try again on that pair, since it may have just started working. That's kinda weird, though. If there was an occasional failure and it mysteriously didn't work one in every hundred calls, maybe this would be to blame, but this failed consistently. Every time. What's up? Well, a good place to start is how things end up in the failed state. The code had a hard-coded timeout where it'll retry each request 7 times, with one second between attempts, before deciding the pair doesn't work. That makes some sense, but looking at my Wireshark session I noticed something important.
The way Jingle works, when Romeo clicks the "call" button it sends a request to Juliet's device saying "I'd like a call please", and then it starts gathering candidates and sending them over. Juliet's device shows an incoming call screen or something to ask if she'd like to pick up, and when she answers yes she sends her candidates back to Romeo so they can start negotiating. But in this case we're not calling Juliet's device, we're calling Asterisk. The way Asterisk handles this is that it starts ringing on the phone network, and to speed things up in the meantime it starts gathering candidates and builds an ICE session with Romeo's candidates. This means the ICE session has already started before the other person's phone has even started ringing! Then, if the other person accepts the call, Asterisk will send its candidates down to Romeo with the session acceptance, so Romeo can start his ICE session with Asterisk's candidates.
That means that while the phone has been ringing, Asterisk has already been trying Romeo and finding no one's responding (because Romeo hasn't started his ICE session since he hasn't seen any candidates yet). So after 7 seconds Asterisk decides the candidates don't work. Later, when the call has been answered, Romeo starts ICE and starts sending out ICE messages and gets responses, but doesn't get any triggered checks, so he assumes there's something wrong with his responses and that the channel isn't actually working. So he keeps trying to get through and nominate things, but it never works. Ok!
So, assuming that's our problem, there are a few ways to fix it. The first thing we could do is change the ICE triggered checks to be in line with the standard, so it would retry the failed checks when things actually start on the other side. Another way we could fix it would be to change the way ICE works for Jingle in Asterisk and only start the ICE session once we've sent the call acceptance back to the caller. That way both sides will be starting their ICE around the same time, and so they'll likely line up and actually agree on something. The problems were that the first solution involved changing code in pjproject, which was annoying, and the second one was a somewhat involved change to how Asterisk's Jingle integration worked. Instead I opted for a worse, but far easier, solution which was to simply increase the timeout to 45 seconds. This dodged around the problem by assuming that Asterisk wouldn't have considered that candidate failed by the time the person actually answers the call. That way it'll still send the triggered check, and all will be good with the world.
So I rolled that out, and it actually worked! We plan to return to this issue and build a real solution, but for now it allowed us to keep testing.
Mission Accomplished
So, I now had Movim successfully and consistently making calls out to Asterisk, and thus to real humans' phones. I told my coworkers I had done it, I had found the problem and fixed it, and all was now well. So we tested it with Conversations, the Android client that we expect many of our users to use. Nope. Just as broken as ever. Ok, maybe I was a little hasty... What about Gajim, another desktop client? Busted. Ok, what about Movim on Firefox, where I started before switching to Chromium just for its dev tools? Totally broken.
Ok, so... maybe there was still a ways to go... Don't worry, though, dear reader. The next fixes won't take as long to explain.
Gajim
I started with Gajim. It's not more important than the others, but when I was testing I noticed Gajim actually printing a useful-looking error message in its error log. That's a very nice place to start! It also wasn't mad about the ICE stuff at all, but about something before that. At the beginning of the setup, the clients negotiate what kind of data is going to go over this call. Is it video, or audio, or file transfer? If it's audio, what format of audio is it, is it stereo, what levels of quality does each side support? These are the kinds of things where the two clients are trying to come to a consensus on how they're actually going to go about transmitting audio data, once ICE figures out the connection itself. In Jingle the way this works is that during session initiation you can specify the kind of content you want in this session, or you can add new content to an existing session. From then on we can talk about that content by its name and the person who created it, "initiator" for the person making the call, and "responder" for the person who has been called. That's mostly there to prevent a case where Romeo starts a session with Juliet and then they both propose a new audio stream at the same time, and each side thinks they're negotiating about their own proposal, when in reality there are two proposals. With the creator it becomes clear that each is talking about their own audio content, and further negotiation is required.
That being said, the code in Asterisk seemed to feel it was always the creator of the audio content. The code was written to send a creator of "initiator" for an inbound call, and "responder" for an outbound call. For inbound calls, Gajim agreed; Asterisk would propose a new session with audio content, and so it was the creator. For an outbound call, though, Gajim would do the same and propose a new session along with audio content to go with it, but Asterisk would respond back about an audio stream that Asterisk itself had created. But it hadn't created one, Gajim did. So I made it so Asterisk just always used "initiator", which seems to match how clients actually establish sessions. Incidentally, the reason I didn't notice this before is that Movim doesn't care about the creator field and also just assumes "initiator".
So after that change it now worked in Gajim!
Firefox and Logging
Movim on Firefox was harder to debug. When I was looking at Wireshark, I just saw... nothing. There were occasional single packets that would go out, but basically it looked like ICE wasn't doing anything. That made it hard to debug...
By now, though, I'd found where ICE actually lives in the code, specifically over in pjproject and not the main Asterisk code. I'd gained some experience reading that code and working over there, and while reading it I noticed some parts already had logging code. If I could just figure out how to turn it on, it might tell me more about what the code was thinking. I'm embarrassed to say it took me a good while to figure out how to get those logs turned on.
I eventually found some forum post somewhere outlining the simple steps I needed to use to enable logging of the data I wanted:
# First we get to the asterisk command shell
$ sudo asterisk -r
# Then we add a new "logging channel" that will include debug logs
> logger add channel some_filename notice,warning,error,debug
# Then we set core (that is, Asterisk) to log up to debug logs
> core set debug 5
# Then we set pjproject to also log debug logs
> pjproject set log level 5
# And then tell pjsip (not sure why it's not pjproject) to actually log
> pjsip set logger on
Not sure why I didn't just guess all of that...
But anyway, now I could run my tests and it would log out to /var/log/asterisk/some_filename! I will admit, it is nice that I wasn't filling the normal log files with junk and could see only my test, rather than wandering through days of logs looking for my portion.
When I was done, I could do the reverse (also from the asterisk command shell):
> pjsip set logger off
> pjproject set log level default
> core set debug off
> logger remove channel some_filename
This would stop putting new logs in my some_filename file, but wouldn't delete it. That's also convenient, because I could then search through the file without it getting infinitely longer or filling up with logs I wasn't interested in.
That being said, even the short file for a test that takes a minute can have thousands of log lines, so it still takes some sifting to find the actual information I'm looking for.
I noticed a few important things looking through the logs. The first, and most obvious, thing is that it builds an ICE session many, many times. It'll build one, tear it down, build another one, then tear that down, all within a second. This made it hard to follow the history of a single session, but was also very obviously something that might be a problem. The second issue I noticed is that all of the sessions got to a point where they said "Resetting ICE for RTP instance", but some of them said "Nevermind. ICE isn't ready for a reset" afterwards, and then things didn't seem to work after that. All the broken ones had "comp_id=2", which meant they were for the second component. Comparing the same logs with Movim on Chrome, there was no second component. Huh.
So what is a component? ICE has a section in the spec for negotiating multiple independent ports in a way where either they all work or the whole negotiation fails, which can be used by applications that need multiple ports operating in coordination. The protocol that WebRTC uses for audio is called RTP (Realtime Transport Protocol), which has two modes of operation. Originally RTP had two connections: one where it would send the audio data, and another called RTCP (RTP Control Protocol) where it would send information about how well the audio was sending, so the participants could adjust their quality or something. A later version of RTP added an optional feature called rtcp-mux, which allows sending the RTCP information along the same connection as the audio, so we only need one connection, and so only one ICE component. Well, when WebRTC was standardized it was decided that any RTP implementation used for WebRTC had to support rtcp-mux. So Chromium takes advantage of that, assumes the other side supports rtcp-mux, and only starts ICE for one component. Firefox, though, felt it was important to be more backwards-compatible and tries to support both rtcp-mux and traditional RTP+RTCP modes. There is a way to tell whether the other server supports rtcp-mux, but that information is sent when the other side answers the call, and by then Firefox has already sent all of the candidates for both components.
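This split is even visible in the WebRTC API surface: the peer connection configuration has an rtcpMuxPolicy knob. As far as I know, modern Chromium effectively pins it to "require" (a "negotiate" value existed historically for exactly this kind of fallback):
// "require": assume rtcp-mux, so ICE only ever negotiates one component.
const pc = new RTCPeerConnection({ rtcpMuxPolicy: "require" });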
Ok, so that's why Firefox acts differently than Chrome, but why is it a problem? Surely it should be fine to negotiate two components and just ignore one, and ICE should work fine in either case. Well, that comes down to the constant building and rebuilding. The way original ICE works, the full set of candidates is gathered by both sides and then exchanged, they're all processed, and then a winner is picked. Jingle, though, made a change where it sends each candidate as it's discovered. That way the ICE session can start sooner and can be looking for candidate pairs while the STUN and TURN stuff is going on; also, if a candidate is found right away, it might be decided that it's not even worth getting a TURN candidate, since it would be lower priority than the valid pair we've already found. That method of operating eventually got its own draft standard under the name Trickle ICE, which is similar but slightly different, and there's a new version of Jingle that's meant to bring Jingle up to date on new versions of ICE, including Trickle ICE. It's a draft, though. All that is to say, things are kind of a mess, and the version of ICE that Asterisk supports right now is not the Trickle kind. That's a feature that's being worked on for the future, but it's not released or supported by the XMPP integration at the time of this writing.
So, if Jingle uses Trickle ICE, but Asterisk doesn't support that, how does the XMPP integration in Asterisk work? Well, every time it sees a new candidate trickle in, it just restarts the ICE session as though that candidate is the only one. It's not ideal, but it appears to work a surprising amount of the time! That's likely owing to ICE's flexibility when it gets a request from a pair it doesn't know about: it just adds it to the list. As long as one side knows all the candidates, the other side should respond properly and come to a consensus. Weird, but fine. Maybe in a newer version of Asterisk that will get better. But in this case Firefox has two different components, and it sends the "component 2" candidates after the "component 1" candidates. So Asterisk will get one candidate and set up an ICE session, then immediately get the component 2 candidate, so it tears the first one down and sets up a new one. But this new one only has candidates for component 2. That's something ICE doesn't allow, so Asterisk's ICE session isn't in a "nominating" mood, and it just doesn't do anything. That's why Firefox doesn't appear to be able to negotiate a connection.
So now that we know the problem, how to fix it? Well, we could implement a sketchy version of Trickle inside Asterisk, or even just inside the XMPP integration code for Asterisk. That looked like it was likely to introduce new bugs, and I knew that someone was already working on a real version of Trickle that may just work later. And even though it's nearly broken, it also works in Chromium every time I tested it. I really just wanted Firefox to work the same way Chromium does, so that's what I built. I just put some code in the XMPP integration part of Asterisk that ignores candidates for component 2. I know they're not going to end up using it either way, because all of our clients support rtcp-mux, and this is basically one line of code that is highly unlikely to introduce any new bugs!
Now Movim on Firefox works too! Three down, and one to go!
Conversations
This was it, our most supported client, and if it didn't work here we couldn't really call it a feature. And it still wasn't working after all those other fixes. It's hard to use Wireshark because the app is running on my phone, but I now had the power of Asterisk logs at my disposal! Using those I took a look and... everything looked pretty good. Candidates were being exchanged, and negotiated, and chosen as valid and then... just sitting there. None of the valid options were ever nominated, so both sides were just waiting until someone got bored of waiting and ended the call. I decided that this was really interesting, but I might need Wireshark to see more. Great.
There's probably some way I could have networked my phone through my computer or something to use Wireshark, but if I found that I needed more logging or something, it would probably be useful if I could build the code for the Android app. And if I can build the code, then I can just run the Android Emulator, which lets me run the app on my computer in a fake phone environment. So I pulled down the code for Conversations and got it set up and working in Android Studio. Now it was as easy as clicking the "play" button and running Wireshark to watch the packets the virtual phone was sending. Since I already knew what to look for from the Asterisk logs, it was pretty easy to look at the stream of packets and see that neither side was including the "use-candidate" attribute, which, if you remember, is how the controlling side tells the controlled side which candidates we're going to go with. That would explain why inbound calls work with Conversations: for inbound calls Asterisk is the controlling side and includes that "use-candidate" value so we're all on the same page, but when the roles are reversed and Conversations is the controlling side, it never nominated anything. That's weird. Unlike with the Firefox issue, it's not like it was doing nothing; it was definitely making requests to find a list of valid candidates. And unlike the original Chromium problem, triggered checks were coming back just fine, and both sides appeared to know what the valid list was. After combing through packets, I did manage to find one weird thing. Wireshark knew about the attributes in STUN and could pull those values out of the packet and show them to me, but some of the packets around the time I might expect nomination to start had some extra attributes. Values Wireshark didn't recognize and listed only by their type, "0xC001". It's possible it was nothing, but it was something the working implementations didn't do, so it was a thread I had to pull on.
I pulled down the code for libwebrtc, the WebRTC implementation Conversations uses, searched through the code for that value, and saw it was associated with STUN_ATTR_NOMINATION. Continuing to trace through the code for where this value was used, and which sets of conditions led to that code, and so on, I eventually found that there is an option libwebrtc supports called "renomination". I managed to find what could be called a standard only by the most generous definitions, ICE Renomination: Dynamically selecting ICE candidate pairs. This document doesn't actually mention which STUN attribute to use to do the nomination, saying only "we define a new STUN attribute", but the code I was looking at seemed to line up with the intent of the document, at least.
The intention of this standard is to make it easier to control which pairs get nominated by ICE. In base ICE we test to find valid pairs, then some are nominated, and then the nominated pair with the best priority is chosen. Renomination is trying to make it easier for the controlling side to change its mind, for example if a WiFi device goes out of range during negotiation, making the cell network candidates better choices, or something. I'll be honest, I'm not really sure how often this would really apply, because ICE doesn't run for very long. Either way, though, the way this proposed extension handles this "mind changing" is by replacing the "use-candidate" attribute, which is either present or absent, with a different "nomination" attribute that contains a number. This number is more important than the priority, so the controlling side can nominate one pair with value 2, and then later nominate another pair with value 3, and that will now be the best choice no matter what the priorities say. So that's what the "unknown" attribute I saw in Wireshark was, and also why no candidates were ever chosen. Conversations was sending one attribute to nominate, and Asterisk was waiting for a different one.
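In effect, the controlled side's "which pair did you pick?" rule changes from "highest-priority nominated pair" to "pair with the biggest nomination number". A sketch of that difference, with types and names of my own invention (not libwebrtc's):
interface NominatedPair { priority: bigint; nomination?: number; } // nomination: the renomination counter

function pickWinner(a: NominatedPair, b: NominatedPair): NominatedPair {
  // Renomination: the most recent (largest) counter wins, priority be damned.
  if (a.nomination !== undefined && b.nomination !== undefined) {
    return a.nomination > b.nomination ? a : b;
  }
  // Standard ICE: among nominated pairs, the highest priority wins.
  return a.priority >= b.priority ? a : b;
}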
But why? There's a part of the spec that says that renomination is only turned on if both sides include the renomination option in their initial candidate exchange. Well, Asterisk definitely doesn't support it, so what's up? Looking at the Conversations code, it looks like the issue is that there's no room in the Jingle standard for exchanging whether or not someone supports an option like this. There just isn't a value for that. So in order to allow it, Conversations just assumed all call partners supported it and included that value on both sides of every session it set up. It also seems like libwebrtc isn't very picky about this. This means if a user running Movim on Chromium is being called by Conversations, Conversations will think both sides support renomination and Movim will think neither side does. But despite this, when Conversations sends the special attribute instead of the "use-candidate" attribute, Chromium still understands it and doesn't even check whether it should expect that kind of value from this session. It just knows what was meant. So this is why Conversations works with Chromium, but any standard implementation of WebRTC that doesn't support the draft Renomination extension would not work. Like Asterisk.
To fix this I did two things. The first thing I did was talk to the Conversations developer about how we could be more careful about when we include this option, and I was told we could just never include it. They just didn't have strong opinions on it. So I made a patch to remove that option. That was the easiest fix I'd had to make yet. The problem is that unlike the changes I'd been making to our Asterisk server, this code ran on our users' phones. We couldn't control when it would be released, and we couldn't control when our users would install the new version even after release. Since then, that change went out in version 2.9.8, but we didn't know that at the time, and we wanted this to work as soon as possible. So, like the Firefox fix, I considered building an implementation of renomination in Asterisk, but after looking at my Wireshark capture it didn't seem like that was necessary.
Since it was just a stopgap measure until our users migrated to newer versions of Conversations, I did the simplest thing that would work, which was to just treat the "renomination" attribute (0xC001) exactly the same as the "use-candidate" attribute, ignoring the number, and the purpose of the extension, entirely. This would break if Conversations ever actually tried to renominate a candidate to a new pair with lower priority, but how likely is that? Probably not very. And if it only ever nominated once, or tried to renominate to a higher-priority candidate, then it would just work! The only quirk here is the one I mentioned earlier, which is that ICE is actually implemented in a library Asterisk uses called pjproject. I really didn't want to fork that library and cut a new release just for myself with my garbage change in it, solely so I could put it in our Asterisk installation. Luckily, this apparently isn't the first time Asterisk has needed to make some tweaks to pjproject, so there was already a system for that! If I made a git commit in the pjproject repository and then put the diff from that commit into a folder in a magic location in the Asterisk repo, then the Asterisk build would apply that change to the library for me before building! That meant I could still use an official release of pjproject, and could keep all of my ugly changes together in the Asterisk codebase.
I tested both changes, and they both worked. If the client didn't specify "renominate" as an option, then it spoke standard ICE and would work without my change. And if I left the client as-is it would work with my altered server that pretended to know what renomination was. The phone calls went through, and Conversations worked! All the clients worked!
Somehow Wrapping Up
Well, if you're at the end of this, thanks for following me on my odyssey. We began at the banks of "this is 90% working, there's probably just one weird typo to fix to get it the rest of the way". From there it was down the river of standards and the rapids of interoperability between browsers and old C projects, aided all along by my compass Wireshark. I was confident when the rapids calmed down and the one problem was fixed, but there was a world of trouble up ahead, as I didn't see the waterfall coming. My raft came out the other side, a little battered but holding together. Calls were being made, and metaphors were being thoroughly stretched. Seriously, why would I need a compass to raft down a river? IT ONLY GOES ONE WAY.
All in all, none of these fixes were "the proper fix", but with them I got the system working. I plan to engage the Asterisk community and open discussions about the actual issues, rather than just producing a patch no one wanted, but in the meantime calls are being made and received. And I got to stretch my legs and learn a bunch of inner workings, which is exactly the kind of thing I enjoy. I even got a blog post out of it...