Non-Domain joined machines with Lync clients unable to connect using ADFS

Right,

So this is an ADFS issue. A customer is using an ADFS service to validate Lync users on their O365 tenant, which is dirsync'd.

All domain joined machines with a Lync client were able to connect to the O365 service. The same user accounts, however, would not work from a Lync client on a non-domain joined machine, which returned an error that the service was unavailable. The customer has an internal certificate authority.

So what is going on?

First let's have a look at ADFS; those in the know can skip this step. ADFS stands for Active Directory Federation Services. It is a way of allowing your domain users access to applications across organisational boundaries in a seamless and transparent way. Happy days so far... Now O365 is ADFS aware, so setting it up to use your on-premises Active Directory for validation is quite easy. Have a look here for a document outlining the steps on how to set up ADFS; if you already have ADFS then just go here for how to set up your tenant for ADFS. Have a look at the picture below for an overview of how ADFS and O365 interact:

Great, you have your ADFS set up and your O365 tenant is working.

NB: you can check that your ADFS is working by going to https://www.testexchangeconnectivity.com/, selecting O365 and then selecting Microsoft Single Sign On. Be warned though: if you use ADFS for internal use only, it will not work!

So back to our issue.  The customer was unable to connect to the Lync service with non-domain joined machines.  So first, we took an ETL trace and UCCAPILOG from the failing Lync client. 

In the UCCAPILOG was this entry:

SIP/2.0 401 Unauthorized
ms-user-logon-data: RemoteUser
Date: Thu, 31 May 2012 13:07:20 GMT
WWW-Authenticate: TLS-DSK realm="SIP Communications Service", targetname="SN20A00DIR01.infra.lync.com", version=4, sts-uri="https://webdir0a-ext.online.lync.com:443/CertProv/CertProvisioningService.svc"
From: <sip:xxx@xxx.com>;tag=xxxxx;epid=xxxxx
To: <sip:xxx@xxx.com>;tag=xxxxx
Call-ID: ee02f80818264571b49ff2676485e792
CSeq: 1 REGISTER
Via: SIP/2.0/TLS 192.168.10.24:49202;received=10.27.46.15;ms-received-port=50738;ms-received-cid=EA42B00
Server: RTC/4.0
Content-Length: 0
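
The interesting part of that 401 is the WWW-Authenticate header: the server is offering TLS-DSK and telling the client where the certificate provisioning service (the sts-uri) lives. If you have a lot of UCCAPILOG entries to wade through, a small script can pull those values out. This is only a rough sketch of my own, not anything the Lync client does internally; the function name parse_www_authenticate is made up for illustration.

```python
import re

def parse_www_authenticate(header_value: str) -> dict:
    """Split a SIP WWW-Authenticate value into its scheme and key/value parameters."""
    # The first token is the auth scheme (e.g. TLS-DSK); the rest are parameters.
    scheme, _, params = header_value.strip().partition(" ")
    # Each parameter is key="quoted value" or key=bare-token.
    pairs = re.findall(r'([\w-]+)=(?:"([^"]*)"|([^,\s]+))', params)
    result = {"scheme": scheme}
    for key, quoted, bare in pairs:
        result[key] = quoted if quoted else bare
    return result

# Header value copied from the UCCAPILOG entry above.
header = (
    'TLS-DSK realm="SIP Communications Service", '
    'targetname="SN20A00DIR01.infra.lync.com", version=4, '
    'sts-uri="https://webdir0a-ext.online.lync.com:443/CertProv/CertProvisioningService.svc"'
)

challenge = parse_www_authenticate(header)
print(challenge["scheme"])
print(challenge["sts-uri"])
```

The sts-uri is the endpoint the client has to reach in order to be issued its certificate, so it is a useful value to have to hand for the rest of the troubleshooting.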

Then in the ETL:

OUTGOING_TRANSACTION::ProcessAuthRequired - getting best auth challenge failed
OUTGOING_TRANSACTION::ProcessAuthRequiredResponse - ProcessAuthRequired failed 80ee0010

So something was going on here. Checking the certificate store for the local machine showed that the OCS_CERT wasn't being renewed. We confirmed this by deleting the certificate and attempting to log back into the Lync client: the error message did not change and the certificate wasn't reissued.

So the certificate wasn't being renewed on the client. Looking at the MOSDAL output, we saw that the ADFS tests passed OK and that there were valid URLs registered on the O365 authentication system. The best way to test this is to take the AD FS Metadata Exchange (MEX) URL and attempt to open it in the client's browser. The URL is in this form: https://xxx.xxxx.com/adfs/services/trust/mex .

The browser on the client navigated to the URL without problems, indicating that the URL was correct and the service was running.

We looked at the ADFS server event logs in the Admin and Security sections. These log the attempts to validate against that AD FS server. The logs didn't contain much, which was a warning in itself. To further validate ADFS, there is a page which can be used to initiate a sign-on: https://xx.xxx.com/adfs/ls/idpinitiatedsignon.aspx. This also worked fine.
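
If you would rather script these two checks than paste URLs into a browser, something like the following would do as a starting point. It is a minimal sketch with assumptions: sts.example.com is a placeholder for your own federation service name, and the helper names adfs_check_urls and probe are mine. Also bear in mind that Python's urllib does not perform revocation checking by default and will only succeed if Python trusts your issuing CA, so a clean result here (just like a clean result in the browser) does not rule out a CRL problem.

```python
import urllib.request

def adfs_check_urls(adfs_host: str) -> dict:
    """Build the two AD FS test URLs for a given federation service host name."""
    base = f"https://{adfs_host}/adfs"
    return {
        "mex": f"{base}/services/trust/mex",
        "idp_signon": f"{base}/ls/idpinitiatedsignon.aspx",
    }

def probe(url: str, timeout: float = 10.0) -> int:
    """GET the URL and return the HTTP status code.

    Raises an OSError subclass (URLError, SSLError, ...) on network or TLS failure.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status

# "sts.example.com" is a hypothetical host name; substitute your own.
urls = adfs_check_urls("sts.example.com")
for name, url in urls.items():
    print(name, url)
```

From the failing client you would then call probe(urls["mex"]) and probe(urls["idp_signon"]) and expect a 200 back from each.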

So far we hadn't had a break. Most of the initial troubleshooting was inconclusive, indicating that the ADFS system was working as expected and that the clients were receiving the correct URL connection data.

Moving on, though, the ADFS server wasn't reporting as many connection attempts as we had expected. The self-initiated HTTPS logon was logged, as were the logons from the domain joined machines.

So we reproduced the failure and took a CAPI2 event log (CAPI = Cryptographic Application Programming Interface - a layer which isolates programmers from the code used to encrypt the data).   From that we could see that the revocation check was failing because the server that held the CRL could not be found:

This was from a Build Chain Task for the CAPI2 process.  The subsequent Verify Chain policies also failed with the same error:

The server, however, was up and running, and since domain joined machines had no problems, it was clearly accessible. We turned to the certificates and opened the Certificates snap-in within MMC.

Looking at the CA for the domain (which was an internal one), we checked that the CA root certificate was installed on the non-domain joined machine, and it was. Then we checked two details of that certificate:

  1. Authority Info Access (AIA)
  2. CRL Distribution Point

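Each of these contains information that the client machine needs in order to validate the certificate and its issuer: the AIA tells the client where to fetch the issuing CA's certificate, and the CRL Distribution Point tells it where to fetch the revocation list.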

The customer's certificates were only showing the LDAP path for the CRL and AIA. There were no HTTP paths.

This is important, since the domain joined machines would be fine using any SSL cert issued by the CA, because the LDAP path was valid for them. For the non-domain joined machines the LDAP path was not valid, since the machine making the request was not part of the domain. As part of a CRL check the system will attempt an LDAP connection and, if that fails, an HTTP connection in this instance. There was nothing on the certificate to give the machine a chance of finding an HTTP location.
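
To make the reasoning concrete, here is a small sketch that takes the URLs listed in a certificate's CRL Distribution Point (or AIA) field and reports whether an off-domain client has anything it can use. It assumes you have already copied the URLs out of the certificate details by hand; the function names and the example.com paths are hypothetical, and the scheme check is a simplification of what Windows actually does when it retrieves a CRL.

```python
from urllib.parse import urlsplit

def reachable_without_domain(urls):
    """Return the URLs a non-domain joined client could plausibly fetch (HTTP/HTTPS only)."""
    return [u for u in urls if urlsplit(u).scheme.lower() in ("http", "https")]

def diagnose(cdp_urls):
    """Summarise whether the listed distribution points work for off-domain clients."""
    usable = reachable_without_domain(cdp_urls)
    if usable:
        return "ok: HTTP location available: " + ", ".join(usable)
    return "problem: only LDAP locations listed; off-domain clients cannot fetch the CRL"

# Hypothetical example values, shaped like the default AD CS LDAP path.
ldap_only = [
    "ldap:///CN=Example-CA,CN=CA01,CN=CDP,CN=Public Key Services,"
    "CN=Services,CN=Configuration,DC=example,DC=com"
    "?certificateRevocationList?base?objectClass=cRLDistributionPoint",
]
fixed = ldap_only + ["http://pki.example.com/CertEnroll/Example-CA.crl"]

print(diagnose(ldap_only))
print(diagnose(fixed))
```

The first call describes the customer's situation as we found it; the second describes a certificate that also carries an HTTP location.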

This TechNet article, https://technet.microsoft.com/en-us/library/cc776904(v=WS.10).aspx, explains the best practices and recommendations for setting up your CA much better than I can. Most relevant to our conversation here:

Provide an additional HTTP CDP location or an alternative LDAP path to CRLs for clients that cannot use Active Directory or LDAP.

After we added an HTTP location via the CA management tool and reissued the root CA and SSL certificates, everything worked for all Lync users on non-domain joined machines. You can see how that was done here: https://technet.microsoft.com/en-us/library/cc753296.aspx

Also, here is a great article on certificate revocation and how it all works: https://technet.microsoft.com/en-us/library/ee619754(v=WS.10).aspx

I hope this blog post has helped you out or just given you a glimpse into the technologies which underpin O365 and Lync online.