Troubleshooting Cross Platform Discovery and Agent Installation (part 2)

This is part of a multi-part article (if you didn't gather that from the title). For easy reference, here are all the parts.

(part 1) Scenario #1 - "There were no computers that met the specified discovery criteria"

(this article) Scenario #2 - "SSH discovery failed." with "unspecified problem"

(part 3) Scenario #3 - "Did not find a matching supported agent"

(part 4) Scenario #4 – "New computer shows as Platform: Unknown and Version: Unknown"

Scenario #2 - "SSH discovery failed." with "unspecified problem"

After fixing the issue with not being able to read the agents directory, the discovery details in the UI show this:

[Screenshot: discovery details in the UI]

Obviously that's not very helpful, so I go back to the DebugView output. Starting from the end and working up, I look for anything that might give me a clue as to why the task failed. Unfortunately, I don't see anything definitive, but it does narrow down where the problem might be. Here are the last few lines of the DebugView output:

[6556] Beginning ExecuteOSInformationScript on thread id: 7
[6556] 7| Executing DiscoveryScript.Discovery.Task
[6556] 7| Returned from DiscoveryScript.Discovery.Task
[6556] 7| DiscoveryScript.Discovery.Task returned as succeeded
[6556] 7| Return from DiscoveryTaskHelper.ExecuteOSInformationScript()
[6556] 9218a54c-5961-4fc0-bd46-99fbe8d543b6 | 7 | Return from DiscoveryTaskHelper.ExecuteSSHDiscovery()
[6556] Microsoft.MOM.UI.Console.exe Error: 0 :
[6556] 9218a54c-5961-4fc0-bd46-99fbe8d543b6 | 7 | 10.10.10.17: ExecuteSSHDiscovery failed with exception: <stdout></stdout><stderr></stderr><exception>Unspecified problem</exception>
[6556] Beginning OnDiscoveryTaskCompleted on thread id: 1
[6556] Beginning startNextDiscoveryTask on thread id: 1

Comparing this to the flowchart above, it looks like the process is at the "Copy GetOSVersion.sh script" step: it copies the script to the computer and runs it, and the error occurs at that point. Using WinSCP, another tool in my kit, I browse the file system of the remote Linux computer and see that /tmp/scx-root/GetOSVersion.sh is indeed there, so the file copy worked. I check the permissions on the file and see they are set to "-rw-r--r--", but that's OK: the MP uses the "sh" command to run the script, so it doesn't need execute permissions.
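To convince yourself that the missing execute bit really is harmless, here is a minimal sketch you can run on any Linux box. The scratch directory and the one-line script body are stand-ins I made up for illustration; the real GetOSVersion.sh ships with the management pack.

```sh
# Hypothetical scratch directory standing in for /tmp/scx-root on the remote host.
workdir=$(mktemp -d)
script="$workdir/GetOSVersion.sh"

# Stand-in script body (assumption: any readable shell script behaves the same way).
printf '%s\n' 'echo "script ran"' > "$script"

# Same mode WinSCP reported: -rw-r--r-- (no execute bit for anyone).
chmod 644 "$script"
ls -l "$script"

# Passing the file to sh as an argument only requires read permission.
sh_output=$(sh "$script")
echo "sh output: $sh_output"

# Executing the file directly requires the execute bit, so this attempt fails.
if "$script" 2>/dev/null; then
    direct_status=0
else
    direct_status=$?
fi
echo "direct execution exit status: $direct_status"

# Clean up the scratch directory.
rm -rf "$workdir"
```

The "sh" invocation prints the script's output, while the direct invocation returns a non-zero status. So the file mode is not what is breaking discovery here.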

Going to the module debug logs (enabled with EnableOpsMgrModuleLogging), I see that the DeployFile.vbs.log file was updated recently. Here's what it said:

 Transferring file: C:\Program Files\System Center Operations Manager 2007\AgentManagement\UnixAgents\GetOSVersion.sh 
to location: /tmp/scx-root/

Verifying that file: GetOSVersion.sh was transferred properly
/tmp/scx-root/GetOSVersion.sh

So that looks ok. Let's check the SSHCommandProbe.log file:

 Leave SSHCommandProbe::DoProcess
Enter SSHCommandProbe::EnterDoInit
XML_INIT_CALL
GetEventId
Exit SSHCommandProbe::DoInit
Enter SSHCommandProbe::DoProcess (YES!!!)
SSHCommandProbe::DoProcess passed initial arguments checking
SSHCommandProbe::DoProcess preparing SSH call
centos55-x86
22
root
sh /tmp/scx-$USER/GetOSVersion.sh; EC=$?; rm -rf /tmp/scx-$USER; exit $EC
Enter SSHFacade::RunCommand
ExpectedSSHFacadeException
Unspecified problem
Enter initDataHolder
Enter initDataType
initDataType initializing output datatype
Leave initDataType
Leave initDataHolder
Leave SSHCommandProbe::DoProcess

That "Unspecified problem" looks suspicious, but it's not getting me any closer to figuring out the root cause here…
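As an aside, the command line in the SSHCommandProbe.log excerpt follows a common shell pattern: run the script, save its exit code, clean up the scratch directory, then exit with the saved code so the caller still sees the script's result. Here is a small sketch of that pattern in isolation. The temporary directory and the failing stand-in script are hypothetical; only the shape of the command comes from the log.

```sh
# Hypothetical stand-in for /tmp/scx-$USER on the remote host.
scratch=$(mktemp -d)

# Stand-in script that fails with a recognizable exit code.
printf '%s\n' 'exit 3' > "$scratch/GetOSVersion.sh"

# Same shape as the logged command: run, save the status, clean up, re-raise the status.
# The subshell keeps "exit" from ending the calling shell.
status=0
( sh "$scratch/GetOSVersion.sh"; EC=$?; rm -rf "$scratch"; exit $EC ) || status=$?
echo "exit status seen by the caller: $status"

# The cleanup runs whether or not the script succeeded.
if [ -d "$scratch" ]; then
    dir_left=yes
else
    dir_left=no
fi
echo "scratch directory still present: $dir_left"
```

One thing this makes clear: once the command runs to completion, the scratch directory is gone, regardless of whether the script itself succeeded.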

I'll try deleting the file and re-running discovery. I watch DebugView and the directory in WinSCP at the same time, and the file seems to take a long time to get there (it should be really quick since I'm on a private LAN). I think what is happening is a timeout (like in my previous article: Can't get your computer discovered?). The Microsoft.Unix.DiscoveryScript.Discovery.Task has a timeout value of 20 seconds, so it's possible the network configuration is causing it to time out.
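If you want something more concrete than eyeballing it, a rough way to test a timeout theory is to time the suspect step against the 20-second budget. The helper below is a sketch of my own (run_timed and ELAPSED are hypothetical names, not part of OpsMgr), and it assumes you have a POSIX shell on a machine that can reach the target.

```sh
# Timeout value quoted above for Microsoft.Unix.DiscoveryScript.Discovery.Task.
BUDGET_SECONDS=20

# Hypothetical helper: run a command, store the elapsed whole seconds in ELAPSED,
# and return non-zero if it took longer than the budget.
# Note: the command's own exit status is deliberately ignored; only time is measured.
run_timed() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    ELAPSED=$((end - start))
    [ "$ELAPSED" -le "$BUDGET_SECONDS" ]
}

# Stand-in workload; substitute whatever step you suspect is slow.
if run_timed sleep 1; then
    echo "finished in ${ELAPSED}s, within the ${BUDGET_SECONDS}s budget"
else
    echo "took ${ELAPSED}s, over the ${BUDGET_SECONDS}s budget"
fi
```

Swap "sleep 1" for the step you want to measure, for example a name lookup or an SSH login to the target. Anything that lands near or above the budget is a good candidate for the cause.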

I go disable my secondary network adapter on my OpsMgr server so only my private LAN is active and I re-run discovery. Still fails. Hmm. I go check my network configuration on the CentOS machine and it looks like my DNS got configured to my corp network via DHCP on the other network adapter. I reconfigure the adapter on the private network to use the proper DNS, and…it looks like my problem is solved! But now I have a new problem…

More on that in part 3.
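A footnote on the DNS fix: if you want a quick look at which DNS servers a Linux machine has ended up with, something like the sketch below will do. It assumes the resolver reads /etc/resolv.conf, which is typical on Linux but not guaranteed, and list_nameservers is just a helper name I made up.

```sh
# Hypothetical helper: print the nameserver addresses from a resolv.conf-style file.
# Defaults to /etc/resolv.conf (an assumption about how the resolver is configured).
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Show what the local machine is using, if the file is readable.
if [ -r /etc/resolv.conf ]; then
    echo "Nameservers currently configured:"
    list_nameservers
fi
```

If the addresses listed belong to a network the machine should not be using for name resolution, that is worth fixing before re-running discovery.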