UCMA patches are important, even if the descriptions are vague

This is sort of a follow-up to a post from earlier this month on impersonation, and how you can get UCMA to insert the Referred-By header into an INVITE.  I was working on an application that used this technique, and everything was working great on the dev system.  When we moved to final testing on the production server though, things started failing again.  Long story short, running OCSLogger on both the dev system and the prod system revealed two different INVITE contents: one had the Referred-By header, and one didn’t.  (Note: running the Lync logging tool on the client is a great way to get a SIP trace that isn’t as noisy if you just care about one endpoint.)

Now, the way this manifested itself was strange: the Lync server trace showed the INVITE being routed to the edge server (?), and then failing with the error 1017;reason="Cannot route From and To domains in this combination";source="sipfed.computer-talk.com".  This made very little sense, and examining the outbound routing traces confirmed that the Referred-By information was missing.  The question is, why wasn’t this being added?  The code specifically set a bunch of headers before creating the INVITE, and they were all there (not that it matters much, but those were Replaces and MS-Sensitivity).  So what happened?

Well, I don’t actually have as good an answer as I’d like to this question, but I did notice that there were pending updates on the system.  It turns out that it was installed with the UCMA bits from the Lync ISO, and hadn’t had any patches installed on Lync (just on Windows).  The Lync environment was up to date, so I decided to let it install all the pending patches, including the June 2012 UCMA3 update.  System came back up, tried the call again, and everything was fine.  Checked the client trace, and there’s the Referred-by header. 

Of course, now the question is, what did that update do?  Nothing in the patch notes for any of the UCMA updates mentioned the Referred-By header or Transferor property specifically, but there was one update that referenced joining conferences using a PSTN endpoint.  That behaviour could be caused by UCMA neglecting to insert the header, so perhaps this fixed my problem as well?  I suppose the only way to know for sure would be to ask someone on the product team, or to try removing/applying updates one at a time until I can isolate the exact change that made this work, but for now I’m just glad that things are doing what they’re supposed to.  If anyone has any insights into what’s going on here though, I’d love to hear them, so leave a comment or send me an email.  For now though, the lesson of the day is to make sure that your servers are all at the same patch level, because sometimes it can make a big difference.

Posted in Uncategorized | Leave a comment

Routing calls from Lync to a UCMA app via the PSTN: avoiding the 485 Ambiguous error

This case may be a little esoteric, but I ran into it this morning, and it caused me some grief.  I have a Lync deployment that uses extensions instead of DIDs, which means that every user has a line URI that looks like “tel:+19058825000;ext=3900”.  This includes UCMA applications, which other users (or the auto attendant) can dial internally by extension.  Now, I wanted to give this application its own DID without breaking the extension dialling, which seemed like a simple matter: the gateway can forward the DID to the mediation server, and the mediation server can normalize the DID to the line URI.  I set this up, placed a call from a cell phone, and everything was great.  Then I tried calling the same number from Lync…

For those keeping score, I now have a Lync client placing a call out through the mediation server, through a gateway, and out to the PSTN.  This call then turns right around, comes back into the same gateway, through the same mediation server, and in theory, gets to the application.  Instead, I get a carrier “number is not in service” message, and a message in the Lync trace that seemed very strange:

TL_INFO(TF_PROTOCOL) [0]0AE8.139C::07/18/2012-16:56:42.578.00f4d3e7 (SIPStack,SIPAdminLog::TraceProtocolRecord:SIPAdminLog.cpp(125))$$begin_record
Trace-Correlation-Id: 669833036
Instance-Id: 00075679
Direction: outgoing;source=”local”
Peer: cttrhlync01.corp.computer-talk.com:63943
Message-Type: response
Start-Line: SIP/2.0 485 Ambiguous
From: “Chris Bardon”<sip:19058825000;phone-context=PstnGateway_192.168.213.21@computer-talk.com;user=phone>;epid=7CF3A158A2;tag=d8526b2edf
To: <sip:2590;phone-context=PstnGateway_192.168.213.21@computer-talk.com;user=phone>;tag=A7386EC1880CB52617AC440F34BCA9DD
CSeq: 3186 INVITE
Call-ID: fa9db663-2f5d-4cc2-a53f-a895c2411367
ms-application-via: cttrhlync02.corp.computer-talk.com_LYNC2010RC;ms-server=cttrhlync01.corp.computer-talk.com;ms-pool=cttrhlync01.corp.computer-talk.com;ms-application=51FB453D-5B9F-45df-83B4-ADD1F7E604A8
Via: SIP/2.0/TLS 192.168.200.230:63943;branch=z9hG4bKcdc9dfa6;ms-received-port=63943;ms-received-cid=1A2700
ms-diagnostics: 4002;reason=”Multiple users associated with the source phone number”;HRESULT=”0x8004C3CC”;source=”cttrhlync01.corp.computer-talk.com”
Server: RTC/4.0
Content-Length: 0
Message-Body: –
$$end_record

The key in there was the ms-diagnostics header saying that there were multiple users associated with the source phone number, which I suppose is true if you ignore the extensions on the line URIs.  The real question here though, is why should the mediation server care about the source?  The destination is valid, so the call should go through, right?  I looked for a definitive answer on this, but I haven’t managed to find one yet, so if anyone knows please leave a comment.  My theory though is that the mediation server sees an incoming call that it can match to an outbound call, and thinks it can connect them internally without the PSTN leg.  Because the extension isn’t being sent with the PSTN caller ID though, the mediation server can’t determine which user made this call in the first place, so it just gives up on the call rather than send it through anyway.  Whatever the reason behind this is though, there is a workaround: modify the caller ID.
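When you’re comparing a lot of traces, it can help to pull the ms-diagnostics header apart programmatically.  What follows is a minimal Python sketch (a standalone helper, not part of any UCMA app) that splits the header value into its numeric code and its quoted parameters.  The function name parse_ms_diagnostics is made up for this example, and the parsing only assumes the code;name="value" layout seen in the trace above.

```python
import re

def parse_ms_diagnostics(value):
    """Split an ms-diagnostics header value into (code, parameters)."""
    # Traces pasted into blog posts often pick up curly quotes; normalise them.
    value = value.replace("\u201c", '"').replace("\u201d", '"')
    code, _, rest = value.partition(";")
    params = {}
    # Each parameter is assumed to look like name="value".
    for match in re.finditer(r'([\w-]+)="([^"]*)"', rest):
        params[match.group(1)] = match.group(2)
    return int(code.strip()), params

header = ('4002;reason="Multiple users associated with the source phone number";'
          'HRESULT="0x8004C3CC";source="cttrhlync01.corp.computer-talk.com"')
code, params = parse_ms_diagnostics(header)
print(code, params["reason"])
```

Having the code and reason as separate values makes it easier to search a large trace for a single failure type.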

There are a few places where you could modify the caller ID for this case, but I chose to make it as specific as possible.  I added a route in the Lync control panel that matched ^\+190588225000$ (the normalized version of my DID).  Then I checked the “Suppress Caller ID” checkbox, and put a random phone number in there (actually, I used the company fax number).  Commit the change, wait a few minutes (since routing changes aren’t necessarily instantaneous), and try the call again.  This time it’ll succeed, although the caller ID that your app gets will be the value in the route, and not the actual Lync user’s caller ID.  Also worth noting: this has no effect on actual calls from the PSTN.  It only applies to calls placed to the PSTN DID from Lync.
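Since a route like this only fires when the pattern matches the normalized number exactly, it’s worth testing the pattern before committing it.  The Python sketch below is one way to do that offline.  It assumes a simple anchored pattern like the one above, which behaves the same in Python’s re module as in the .NET regex engine Lync uses.  The DID shown is a made-up placeholder, and both function names are hypothetical.

```python
import re

def route_matches(pattern, dialed_number):
    """Return True if a route's match pattern accepts the dialed number."""
    return re.search(pattern, dialed_number) is not None

def is_nanp_e164(number):
    """True for '+1' followed by exactly ten digits (a normalized North American number)."""
    return re.fullmatch(r"\+1\d{10}", number) is not None

# Placeholder DID for illustration only; substitute your own.
did = "+14255550100"

print(is_nanp_e164(did))                          # sanity-check the digit count
print(route_matches(r"^\+14255550100$", did))     # the exact-match route pattern
print(route_matches(r"^\+142555501000$", did))    # one stray digit and the route never fires
```

A one-digit slip in an anchored pattern fails silently: the route simply never matches and the call falls through to whatever route comes next.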

Of course, this only really matters in a few deployments, but it means that internal users can reach your application through the DID that you publish to the outside world the same way as external users can, which is good for consistency.  It also would have worked had the line URI of the application been the same as the DID, but that would have broken dialling the app by extension elsewhere, like from the front end application I use to transfer to other apps, which allows them to share a single DID.


Identity Manipulation in UCMA: getting Impersonation to work through a mediation server

There’s a UCMA method on the Microsoft.Rtc.Collaboration.Conversation class called Impersonate that lets you place an outbound call as if you were a different user.  You provide the method with a SIP URI, phone URI, and display name, and usually it works as expected, but there are some cases where things can fall apart pretty spectacularly if you’re not careful.

First, let’s look at the case where things work as expected.  Say I want to create a UCMA app that accepts a call from user A (sip:cb@rnddev.computer-talk.com), does something, and then makes an outbound call to user B (sip:cbardon@rnddev.computer-talk.com).  Normally, I’d get something like this when I placed the outbound call:

image

Which is the identity of my application.  When I create my outbound call though, I could do something like this with the conversation:

conversation.Impersonate("sip:cb@rnddev.computer-talk.com", null, "cb");

And then when I establish my call, the inbound call to User B would appear to be from User A, and not from my app:

image

If all the calls are between Lync endpoints, then everything is fine here.  What if you involve a mediation server or the PSTN though?  First, consider the case where you get an inbound call from the PSTN to your app, and then try to impersonate that caller like this:

conversation.Impersonate("sip:+19058825000;ext=3154@rnddev.computer-talk.com;user=phone", null, "Chris Bardon");

You’d get a result like this:

image

Note that the uri parameter MUST be specified, but I passed null for the phoneUri parameter.  This is because the uri is already a phone URI, and trying to specify the phoneUri again in this case would throw an exception.  In any case, an inbound call from the PSTN impersonates correctly.  Now, what about the case where I want to initiate a call from a Lync user out to the PSTN like this:

conversation.Impersonate("sip:cb@rnddev.computer-talk.com;user=phone", "tel:+19058140048", "Chris Bardon");

This case works just fine too (although it’s more difficult to get a screenshot of, so you’ll have to trust me on this one).

So for the last case: what happens if both the inbound and outbound calls are through a mediation server?  As it turns out, this case fails rather spectacularly.  If you try something like this:

conversation.Impersonate("sip:+19058825000;ext=3154@rnddev.computer-talk.com;user=phone", null, "Chris Bardon");

And place a call out to a PSTN endpoint, your app will get a 404 that looks like this:

TL_INFO(TF_PROTOCOL) [0]1F00.1B54::07/03/2012-18:58:29.425.019eda5f (SIPStack,SIPAdminLog::TraceProtocolRecord:SIPAdminLog.cpp(125))$$begin_record
Trace-Correlation-Id: 834215393
Instance-Id: 00024EB3
Direction: incoming
Peer: chrislaptop.corp.computer-talk.com:50061
Message-Type: response
Start-Line: SIP/2.0 404 Not Found
From: <sip:chrisice70_1@rnddev.computer-talk.com;gruu;opaque=app:conf:audio-video:id:1CBIIRTU>;tag=1e2d289a90;epid=72D4E27D44
To: “chrisice70_1″<sip:chrislaptop.corp.computer-talk.com@rnddev.computer-talk.com;gruu;opaque=srvr:chrisice70:_m411PC3cFaS4dn7x0olnAAA>;tag=f4a3e8bc9a;epid=E73F01CD8F
CSeq: 2034 INVITE
Call-ID: 722b999c-9af3-4fca-8e66-c9968d31dcd2
VIA: SIP/2.0/TLS 192.168.201.74:57523;branch=z9hG4bKC9BC6AB4.6E7020D901A5DDAF;branched=FALSE,SIP/2.0/TLS 192.168.201.74:58100;branch=z9hG4bK4f5e3f5;ms-received-port=58100;ms-received-cid=437600
CONTENT-LENGTH: 0
PRIORITY: Normal
SUPPORTED: Replaces
P-ASSERTED-IDENTITY: <sip:15de4e29-d190-491b-920a-f46c79f087ec@rnddev.computer-talk.com>
SERVER: RTCC/4.0.0.0 chrisice70
ms-diagnostics: 1003;reason=”User does not exist”;TargetUri=”+4165752695@rnddev.computer-talk.com”;source=”LYNC2010.rnddev.computer-talk.com”
Ms-Conversation-ID: eb54476dad1e4eada19648d1ba329373
Message-Body: –
$$end_record

The key here is the “User does not exist” reason in the ms-diagnostics header, although it should be fairly obvious that the user doesn’t exist on Lync, because it’s not a user, it’s a phone number.  Digging into the logs a little more, you could look at the outbound routing log and see something like this (cleaned up for readability):

Creating a OutboundRoutingTransaction object (57)
Enter
From uri: sip:+19058825000;ext=3154@rnddev.computer-talk.com
From User Uc Enabled: False
Referrer URI: <null>
IsAvMCUDialOut: False
Alternate Tel URI: <null>
IsEmergencyCall = False
Checking for Vacant Number range. Request URI = +4165752695
Checking for Vacant number entries for +4165752695
No matching range found.
No matching Vacant Number Range found.
+4165752695 does not match any Vacant Number range
Checking for CPS range. Request URI = +4165752695
Checking for CPS entry for +4165752695
Input prefix [+] does not match that of range [ ]
No matching range found.
No matching range found
+4165752695 does not match any CPS range
Applying From URI’s outbound policy
Routing request based on caller: sip:+19058825000;ext=3154@rnddev.computer-talk.com
Caller not UC enabled.
Stamping request from non UC enabled user and sending request on its way…
Exit

Basically, this is the outbound routing engine deciding how to route the call to +4165752695.  It determines that the from URI isn’t a UC enabled user, decides that the call doesn’t fall under the unassigned number range, and that it’s not a call park orbit.  It then decides to route the call based on the from user’s policy, which, since the from user isn’t a UC enabled user, is nothing. 
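If you want to spot this condition quickly across several traces, a small script can pull the interesting fields out of the routing log.  The Python sketch below is an illustration only: the field names come from the cleaned-up log above, while the function names and the rule in caller_has_no_routing_identity are assumptions based on how this one trace reads, not on any documented Lync behaviour.

```python
# Fields of interest, as they appear in the cleaned-up outbound routing log.
FIELDS = ("From uri", "From User Uc Enabled", "Referrer URI",
          "IsAvMCUDialOut", "Alternate Tel URI")

# How the log prints an empty value.
NULL_MARKER = "<null>"

def routing_flags(log_text):
    """Collect the 'Name: value' lines we care about from a routing trace."""
    flags = {}
    for line in log_text.splitlines():
        name, found, value = line.partition(":")
        if found and name.strip() in FIELDS:
            flags[name.strip()] = value.strip()
    return flags

def caller_has_no_routing_identity(flags):
    """True when the caller is not UC enabled and no referrer was supplied.

    Assumption: this is the combination that left the call with no
    outbound policy in the trace discussed here.
    """
    return (flags.get("From User Uc Enabled") == "False"
            and flags.get("Referrer URI") == NULL_MARKER)
```

Running routing_flags over the log above and passing the result to caller_has_no_routing_identity returns True, which matches the failure described.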

This actually falls under a case that’s documented on NextHop with respect to meeting joins, and the solution that they mention, creating static routes, might very well work.  I wasn’t able to get it working that way though, and I managed to find a better alternative.  I did like Mike Stacy’s post on creating static routes, but in this case, circumventing all the normalization and routing rules seemed wrong, and like something that’d be difficult to get customers for this app to do.

The key to the solution was the “IsAvMCUDialOut” line: if an MCU dial out worked, then there must be a way to route like this, right?  As it turns out, an AVMCU call not only sets this flag, but also the “Referrer URI” field, which translates into the Referred-By header.  Simple enough then, I just added this to my outbound call establish:

outboundCallLegSettings.CallEstablishOptions.Headers.Add(new SignalingHeader("REFERRED-BY", "<sip:cb@rnddev.computer-talk.com>"));

Which also proceeded to fail.  As it turns out, Lync expects the referred-by header to be signed, so that it looks something like this:

REFERRED-BY: <sip:cbardon@rnddev.computer-talk.com>;ms-identity=”MIIBxQYJKoZIhvcNAQcCoIIBtjCCAbICAQExCzAJBgUrDgMCGgUAMAs GCSqGSIb3DQEHATGCAZEwggGNAgEBMGowXDETMBEGCgmSJomT8ixkARkW A2NvbTEdMBsGCgmSJomT8ixkARkWDWNvbXB1dGVyLXRhbGsxFjAUBgoJkiaJk/I sZAEZFgZybmRkZXYxDjAMBgNVBAMTBXJuZENBAgodMhniAAAAAAC0MAkGBSs OAwIaBQAwDQYJKoZIhvcNAQEBBQAEggEAKu4exuDavlYfjaJVJqu43mAZ+IK5My XKMkXAZoVm9dT8SS5col2meENQtdVhWRvcb7jGhbLAjpggPpvfQ2CD6CbkScwj5H5 jgcxX9tA4iIT4a3QiMSZA/A7tfVPJ9ipexBro/18eHEur2gpxY82QhNFXgcAzeofFTP+QB REIopLqWqgFlVccZcoUP6sC02L4qbxb1gzouyO+2lEUf+IVCdATMlBleWxQht7l6Dgc9Y 0xEutsWPg8a9ym5Q19rWb14OHUVxmCgFHJ6y56R3/aSUceQ844j/fhCRDctTa4zs6L/ GN5+H7z60vLUuME1utyiJOpXaZ1PB38mkig6KyCIA==:Tue, 03 Jul 2012 18:58:23 GMT”;ms-identity-info=”sip:LYNC2010.rnddev.computer-talk.com:5063;transport=Tls”;ms-identity-alg=rsa-sha1

Instead of this (which is what my message looked like):

REFERRED-BY: <sip:cbardon@rnddev.computer-talk.com>;
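To tell the two forms apart when scanning captured headers, you only need to check whether an ms-identity parameter is present.  A rough Python sketch of that check follows.  The function name is hypothetical, and it looks only at parameter names, without attempting to validate the signature itself.

```python
def referred_by_is_signed(header_value):
    """Return True if a Referred-By header value carries an ms-identity parameter."""
    # Parameters follow the <uri> part, separated by semicolons.
    _, _, params = header_value.partition(">")
    # Quoted values may themselves contain semicolons, so the split can produce
    # stray fragments; that is harmless here because only names are compared.
    names = [p.split("=", 1)[0].strip().lower()
             for p in params.split(";") if p.strip()]
    return "ms-identity" in names

unsigned = "<sip:cbardon@rnddev.computer-talk.com>;"
print(referred_by_is_signed(unsigned))
```

The unsigned header from my failing INVITE has no parameters after the URI, so the check returns False for it.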

Now, there’s no documentation on how this hash is generated that I could find, so I tried some other solutions:

  • Setting P-Asserted-Identity directly?  Nope, Lync won’t let you modify a header after it’s been set.
  • Modifying P-Asserted-Identity in an MSPL script (actually an MSPL managed code app)?  This would work, but took a lot of work to set up.
  • Setting P-Preferred-Identity?  Had no effect…
  • Setting P-Session-On-Behalf-Of (which I stole from a delegate invite)?  Nope, no good.
  • Setting the Remote-Party-ID header?  Still nothing…
  • Mashing in Referred-By manually with a captured signature?  Strangely enough, this worked, but was in no way a manageable solution

Finally, I was looking through the CallEstablishOptions class again, and came across a property called Transferor.  I set it like this:

outboundCallLegSettings.CallEstablishOptions.Transferor = "sip:chrisice70_1@rnddev.computer-talk.com";

And the PSTN call went through with the correct caller ID.  I tried a Lync endpoint again and got a toast window that looked like this:

image

Which had the correct information on it as well.  Now, if only this had been a little more obvious in the documentation (say, when I searched the UCMA docs for Referred-By)…

So, to summarize, if you want to impersonate a mediation server endpoint back to a mediation server, you not only have to impersonate the caller, but you have to set the Transferor property on the CallEstablishOptions to make sure that the call is routed.  I’ve chosen to set this property in all cases, since it simplifies the code, but if you wanted to only set it in certain cases you would certainly have the option.  Keep in mind though, telling how a call is going to be routed from within your UCMA application can be tricky, and is subject to change based on the server configuration.  Also, using the Lync Server Logging Tool and Snooper is a great way to trace what’s going on with the server.  I never would have figured this out had I not compared my failed trace to a successful conference dial out trace and noticed the referrer information. 

Has anyone else run into the same problem, or encountered a case where this doesn’t solve their problem?  Leave a comment or drop me an email if you have.


Lync, UCMA, and DNS load balancing part 1

One of the features that Lync 2010 introduced was DNS load balancing.  In this scheme, requests for an FQDN can return multiple entries, and Lync will send requests to one of the pool endpoints.  There’s some decent information on TechNet about the basic idea with Lync servers, but there’s actually very little out there on how to use this scheme with UCMA applications.  In theory, it’s supposed to work the same as for servers, but it’s not necessarily evident how to get there.

Provisioning your app

A lot of the time, we’ll run New-CsTrustedApplicationPool and pass the machine FQDN in as the pool FQDN, usually because the app is only going to run on one server at a time.  In this case though, we’ll actually have a pool FQDN and a separate ComputerFqdn like this:

PS C:\Users\cbardon> New-CsTrustedApplicationPool -Identity ucmapool.rnddev.computer-talk.com -Registrar lync2010.rnddev.computer-talk.com -Site 1 -ComputerFqdn ucma1.rnddev.computer-talk.com -RequiresReplication $false

This creates a new pool FQDN, as well as adds the first machine to it.  Next, you’ll need to add the rest of the machines to the pool like this:

PS C:\Users\cbardon> New-CsTrustedApplicationComputer -Pool ucmapool.rnddev.computer-talk.com -Identity ucma2.rnddev.computer-talk.com

Which you repeat for each of the servers that you want in the pool.  Then, configure applications and endpoints the same as with a single computer pool.  Those still work exactly the same way as before.

A catch for manual provisioning

Now, if you’re using automatic provisioning, then you should be able to skip to the next section.  If you noticed the -RequiresReplication $false flag in the pool configuration though, you’ll have realized that this example uses manual provisioning, which is useful for cases where your app server can’t be joined to the Lync domain.  This means specifying some extra parameters when creating your platform though, including the GRUU.  When you created your application, you may have noticed that the output looked something like this:

image

Note that I have a service GRUU, as well as ComputerGRUUs for each machine in the pool.  For a single computer pool these are the same, but now each individual machine in the pool has its own GRUU as well as the one for the service as a whole.  When creating the platform, use the computer GRUU on each app server.  You’ll also want to use the pool FQDN as the application FQDN.

Certificates

The next deviation in the procedure comes when you request a certificate for your application servers.  Normally, the subject name of the cert needs to be the FQDN of the application server, but in this case, the subject needs to be the pool FQDN.  The certificate you request should be the same on all application servers (so mark the keys as exportable), and should contain the pool FQDN and individual machine FQDNs as Subject Alt Name entries.  This creates a bit of a maintenance headache for adding new capacity, but it’s reasonably easy to request new certs if you control the CA.  What you’ll end up with is something that looks like this in the local machine store:

image

and the SANs:

image

If you’re using the web enrolment tools to request certificates from a Windows CA, you can specify the SANs by putting something like this:

SAN:dns=ice7testerpool.rnddev.computer-talk.com&dns=ice7tester.rnddev.computer-talk.com&dns=ice7tester2.rnddev.computer-talk.com

in the Attributes field.  I always forget the syntax of this one…
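Since the syntax is easy to forget (and easy to mistype), a tiny Python helper can generate the string from the pool FQDN and the list of machine FQDNs.  This is only a sketch: the function name san_attribute is invented here, and it assumes the SAN:dns=...&dns=... form shown above with the pool name listed first.

```python
def san_attribute(pool_fqdn, machine_fqdns):
    """Build the 'SAN:dns=...&dns=...' string for the Attributes field."""
    # Pool FQDN first, then each machine, skipping an accidental duplicate of the pool name.
    names = [pool_fqdn] + [m for m in machine_fqdns if m != pool_fqdn]
    return "SAN:" + "&".join("dns=" + name for name in names)

print(san_attribute(
    "ice7testerpool.rnddev.computer-talk.com",
    ["ice7tester.rnddev.computer-talk.com",
     "ice7tester2.rnddev.computer-talk.com"],
))
```

When you add capacity to the pool, regenerating the string from the full machine list avoids leaving a server out of the new certificate request.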

Fun with DNS

At this point, the configuration for Lync and your app is done, so all that remains is the DNS configuration.  Normally, this involves an A record for the app server FQDN, but with load balancing there are a few other things that need to change.  Note: I’m writing this using a Windows Server 2008 R2 DNS server, so the settings may be different if you have a different DNS server.  Basically, we need two things in DNS: an entry for each server, and an entry for the pool that resolves to each server’s IP address.  In my example, I have this in DNS for my pool machines:

image

Now, by the end of this, you want to be able to go to any machine in your network and do this:

image

Or this:

image

Note that the ping command went to different servers each time, and that the nslookup command returned both entries in different orders.  This is important, because it means that DNS is working the way it’s supposed to.  Unfortunately, the defaults in DNS might cause it to not work this way, so here’s what you may need to change:

DNS Server properties

Right click on the DNS server and bring up the advanced property page:

image

You’ll want to make sure that Round Robin is enabled, and that netmask ordering is disabled.  Actually, disabling netmask ordering isn’t essential, but it’s a good idea if you want a “real” load balancing scenario.  Basically, netmask ordering is an optimization that says that if you’re in a subnet (say 201.X) and get two entries for a DNS query in different subnets (e.g. 201.1 and 202.1), that you should bias towards the closer result.  The result of this is that an nslookup query will return the entries in the same order every time. 

Time to Live

In most cases, DNS caching isn’t a problem.  The address for a service rarely changes, so as an efficiency, Windows remembers the DNS results for particular lookups.  Most DNS servers also remember results for servers that they forward requests to, which can make it very difficult to actually get a change to propagate out when you want to make one.  For load balancing, it’s even more troubling, since it depends on returning different results for each query.  The test with subsequent ping requests would have each request going to the same server if caching is enabled, which you can verify by running ipconfig /flushdns to clear the local resolver cache, or by disabling the cache service completely.  The other way to ensure that your DNS records get re-queried each time though, is to set the Time To Live on the records themselves.  For some reason, this setting is hidden in the Windows DNS server.  Under the View menu check “Advanced”:

image 

And then open your pool entries.  You should see a new field for TTL at the bottom of the page:

image

The TTL is set to an hour by default.  Change this value to 0.  You may need to clear the DNS cache on the DNS server and the Lync server, but at this point you should be able to ping your pool FQDN and get different results back each time.
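If you’d rather script the check than eyeball repeated pings, something like the following Python sketch can do it.  It resolves a name several times and reports whether more than one address ever came back in first position.  The function names are my own, and there are caveats: the local resolver cache and the operating system’s own address sorting can both hide rotation that the DNS server is actually doing, so treat a negative result as a hint rather than proof.

```python
import socket

def resolve_ipv4(name):
    """Return the IPv4 addresses for name in the order the resolver hands them back."""
    infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    return [info[4][0] for info in infos]

def first_answers(name, attempts, resolve=resolve_ipv4):
    """Resolve name repeatedly and record which address came first each time."""
    return [resolve(name)[0] for _ in range(attempts)]

def looks_round_robin(firsts):
    """True if more than one address showed up in first position."""
    return len(set(firsts)) > 1

if __name__ == "__main__":
    # Substitute your own pool FQDN; this one is taken from the example above.
    firsts = first_answers("ucmapool.rnddev.computer-talk.com", 10)
    print(firsts, looks_round_robin(firsts))
```

The resolve parameter is there so the rotation logic can be exercised with a fake resolver, without touching a real DNS server.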

Finally load balancing?

Now you’re able to start your application instances, which should both register with Lync.  Place a call with both instances running, and one of them will answer.  Shut an instance down, and your call is answered by the other instance.  This makes your app fault tolerant for sure, since anytime an instance is down, calls will go to the other instances, but is this actually balancing any load?  If you modify your app so you can identify which instance you’re actually talking to, you’ll notice something odd: calls are almost always going to the same instance of your app.  There are some answers to why this happens, but that’s a blog post in and of itself.  Part 2 will go into how load balancing actually works in Lync, and some techniques to get it to work the way you want it to.


WmaFileSource may have a few more tricks up its sleeve

The Microsoft.Rtc.Collaboration.AudioVideo.WmaFileSource class gets used all over most UCMA apps to play audio files, and from the name it’s pretty clear what it’s intended to do.  Of course, most of the time if you’re playing something like recorded IVR prompts, the odds that those will be recorded directly into WMA are reasonably low, so you’ll need to search for a converter, modify the files, and use them in your project.  The intent appears to be that if you want to play a different format (e.g. wav or mp3), you could create a new MediaSource derived class and go from there.  As it turns out though, that might not be entirely necessary.

I had a tester who wanted to try an mp3 file in a UCMA app, and instead of converting it, just changed the extension to .wma, and it played.  We modified the code to allow you to pass files with the .mp3 and .wav extensions to the application, and that worked too.  Maybe the class isn’t quite so WMA specific after all? 

My suspicion here (and I’m still waiting for confirmation from the product team) is that the WmaFileSource class simply passes the filename to the Windows Media runtime and streams the audio out, so in theory anything that you can open in Media Player will work like this.  I only tested mp3 and wav in addition to wma though.  Both of those seemed to work normally, and they likely cover most of what someone would want to drop into your app.  Has anyone else tried passing other formats to this class to see what happens?

Anyway, something to try if you run into a situation where you just don’t feel like converting a media file from mp3 to wma. 


UCMA apps, load testing, and timeouts

One of the things that often ends up coming up too late in a development cycle is testing your application under load.  Sometimes we think to do this early on, but more often than not, one of the last things developers tend to do is throw traffic at an application until it breaks.  The problem is, finding an issue with load at the end of a dev cycle can be very difficult to fix, and it can call into question some of the fundamental aspects of your architecture.

Recently, I found an issue when testing a UCMA app that looked remarkably like calls to the app were being throttled.  I was even able to reproduce the error using a trivial UCMA app that simply started an ApplicationEndpoint, registered for AudioVideoCallReceived, accepted a call, and waited (if you want to follow along, the source code for the app and the Snooper traces are on SkyDrive here).  I used SIPp, which is a great SIP traffic generation tool, to throw 10 calls per second at a mediation server, up to 100 calls, and after about the first 30 calls, I started noticing failures.  Here’s a sample trace from the front end server:

image

And here’s the same side of the call from the app server:

image

As you can see, the original INVITE went to the app, and was received, and the app promptly responded with the 100, 180, and 200.  The front end sent the INVITE at 14:10:21, and if we ignore the time difference between the two servers, you see that the app server’s responses all went out within milliseconds.  If you look at the trace though, it appears that the 100 took 31 seconds to be processed by the front end server?  There’s a lot of signalling going through that system at that point, but did I mention that the front end had 24 CPU cores and 64 GB of memory?  And that it barely registered the traffic?  This was a real puzzle, and I posted in a few forums (both public and private) trying to find an answer. 
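When you need to measure gaps like this across many calls, it’s easier to compute them than to read them off a trace.  The Python sketch below parses the timestamp format that the Lync trace records use (for example 07/18/2012-16:56:42.578) and returns the difference in seconds.  The function names are invented for this example, and the sample times are hypothetical values chosen to reproduce a 31 second gap.  Only compare timestamps taken on the same server, since the clocks on two machines won’t necessarily agree.

```python
import re
from datetime import datetime

STAMP = re.compile(r"\d{2}/\d{2}/\d{4}-\d{2}:\d{2}:\d{2}\.\d{3}")

def extract_stamp(trace_line):
    """Pull the first MM/DD/YYYY-HH:MM:SS.mmm timestamp out of a trace line."""
    match = STAMP.search(trace_line)
    return match.group(0) if match else None

def parse_trace_time(stamp):
    """Convert a trace timestamp string into a datetime."""
    return datetime.strptime(stamp, "%m/%d/%Y-%H:%M:%S.%f")

def seconds_between(sent, received):
    """Seconds elapsed between two trace timestamps from the same machine."""
    return (parse_trace_time(received) - parse_trace_time(sent)).total_seconds()

# Hypothetical timestamps for illustration.
print(seconds_between("07/18/2012-14:10:21.000", "07/18/2012-14:10:52.000"))
```

Anything beyond a fraction of a second between an INVITE and its 100 on an idle-looking server is a good reason to look below the application layer.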

As it turns out, the solution involved no code changes, and it was something completely out of my control.  Evidently, a power failure had reset the configuration on the switch that these systems were connected to, which, for some reason, put the ports in half duplex mode, while Windows assumed full duplex.  Apparently this can cause some really serious performance problems.  Once I found our network admin, convinced him that this was probably a network issue, and got him to fix the duplex mode, all was well again, and traffic started flowing smoothly.

I suppose there are two morals here: one, never assume that the network just works (just like assuming that the power is always going to be on), and two, sometimes you can blame the network guy and be right about it.


UCMA startup errors: when nothing else works, check the hosts file…

This was a fun round of troubleshooting.  One of our developers needed to debug a UCMA application that we’ve run on dozens of other servers.  He went through the steps to provision the app, just as we had everywhere else, but we got the following exception from starting the platform:

Portal failed establishing the endpoint: Microsoft.Rtc.Signaling.ConnectionFailureException:Operation failed because the network connection was not available. ---> Microsoft.Rtc.Internal.Sip.SipException: Invalid From header: Semantic error:  fTopLabel == true
   at Microsoft.Rtc.Internal.Sip.FromHeader.Parse(SipHeaderLink& headerLink)
   at Microsoft.Rtc.Internal.Sip.FromHeader..ctor(String headerValue)
   at Microsoft.Rtc.Internal.Sip.NegotiateLogic.CreateABlankNegotiate(FunctionType funcType, String negotiateData, SipResponse prevResponse)
   at Microsoft.Rtc.Internal.Sip.NegotiateLogic.StartCompression()
   at Microsoft.Rtc.Internal.Sip.NegotiateLogic.AdvanceOutboundNegotiation()
   at Microsoft.Rtc.Internal.Sip.TlsTransport.DelegateNegotiation(TransportsDataBuffer receivedData)
   at Microsoft.Rtc.Internal.Sip.TlsTransport.OnReceived(Object data)

At first glance, this looks like a network issue, so we made sure that the dev machine could reach the Lync server on all the ports it needed (it could).  Then we rechecked the certificate, and verified that the MTLS connection was forming, but then immediately terminating.  We even tried changing the client machine name to something without hyphens, and re-provisioning the application a couple of times just to make sure that there wasn’t something wrong along the way.  Finally, running OCSLogger.exe on the developer machine and digging through the traces, we saw this:

(000000000270F9F2)local cert SN robinlaptop.corp.computer-talk.com is not same as localfqdn localhost127.0.0.1. Send feature info

Then we checked the hosts file, and noticed a line that looked like this:

127.0.0.1       localhost127.0.0.1       localhost127.0.0.1       localhost127.0.0.1       localhost127.0.0.1       localhost127.0.0.1       localhost

Now, I have no idea how this developer got this in his hosts file, but removing the entry fixed everything.  After discovering this, I tried adding one line for localhost in the hosts file, and everything was fine.  I took the same entry and split it into four lines, and everything was fine.  I tried changing ONE CHARACTER in the bad entry, and everything was fine.  For some reason, only that exact sequence of characters managed to short-circuit the platform startup.

In any case, I’m putting this out there as a troubleshooting suggestion for anyone else who runs into the same thing.  Make sure that nothing has messed with your hosts file.  It’s not necessarily the first thing you’d think to check, but I know I’m adding it to my list of troubleshooting steps from now on.
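For anyone who wants to automate that check, a rough Python sketch follows.  It flags hosts file lines whose first field isn’t a valid IP address, or whose host names contain something that looks like an IPv4 address (which is what you get when several entries are fused together without line breaks, as in the bad entry above).  The function name and the heuristics are mine, so treat it as a starting point rather than a complete validator.

```python
import ipaddress
import re

EMBEDDED_IPV4 = re.compile(r"\d+\.\d+\.\d+\.\d+")

def suspicious_hosts_lines(text):
    """Return (line_number, line) pairs for hosts entries that look malformed."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        fields = line.split()
        try:
            ipaddress.ip_address(fields[0])
        except ValueError:
            problems.append((number, raw))
            continue
        names = fields[1:]
        # A host name with an IPv4 address inside it suggests fused entries.
        if not names or any(EMBEDDED_IPV4.search(name) for name in names):
            problems.append((number, raw))
    return problems

if __name__ == "__main__":
    # Default Windows location; adjust the path for other systems.
    path = r"C:\Windows\System32\drivers\etc\hosts"
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, raw in suspicious_hosts_lines(handle.read()):
            print(number, raw)
```

An empty result doesn’t prove the file is fine, but a hit on a line like the one above is worth fixing before you spend time on certificates and ports.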

Also, if anyone knows exactly why this particular entry in the hosts file does what I described here, I’d really like to know why.
