This week we hit a really strange error in one of our production databases. We have a few jobs that run internally (SQL Agent) and externally (Windows Scheduler) for the usual data fetching and feeding. Yesterday, our watchdog application (which basically monitors our system and job status) started sending us alerts that jobs were failing. When we looked into the Agent log, we found the jobs were failing with this message:
error,  Unable to start JobManager thread for job JOB_NAME
Neither the SQL Server log nor the Windows logs showed any specific error. We resolved the issue by simply restarting SQL Agent, but we wanted to find the real cause. So I decided to dig a little deeper and found that it is a fairly common error.
I found a few suggestions, though:
1) Check the maximum number of worker threads for SQL Agent. Where this setting lives depends on the version of SQL Server: for SQL Server 2000 and earlier it is in the registry, and for SQL Server 2005 and later it is stored in the MSDB database. You can get more info from this MSFT KB document. Unfortunately, in our case this value was already set to a high number.
2) Some people suggested checking for conflicts between jobs, where they try to take over each other. We inherited this system from a DBA who left a few months ago, so we are more or less shooting in the dark without proper documentation (I am currently filling in the gaps in the manual). We checked for job conflicts but have not found any (yet).
3) The most common suggestion was to just restart SQL Agent and not worry about the error if it happens rarely. That is what we did as a temporary fix, since it is not happening frequently.
My take is that it may be because our job steps are not set up to retry, so each step gets only ONE attempt. If we set a retry count, then even if a step fails on the first try it will try again. So if a resource is busy for a moment, the step will be attempted again, up to the number of times we have defined in the job.
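To make the retry idea concrete, here is a minimal Python sketch of the behaviour I have in mind. It is only a simulation: `ResourceBusy`, `run_step` and `make_flaky_step` are names I made up for illustration, and this is not how SQL Agent is implemented internally. I also have not yet confirmed that a step-level retry helps when the job thread itself fails to start, so treat it as a picture of the theory and not a fix.

```python
class ResourceBusy(Exception):
    """Stands in for a transient failure, e.g. a briefly unavailable resource."""


def run_step(step, retry_attempts=0):
    """Run `step`, allowing up to `retry_attempts` extra tries on ResourceBusy.

    Returns the number of attempts used. If every attempt fails,
    the last ResourceBusy error is re-raised.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            step()
            return attempts
        except ResourceBusy:
            # Give up once the initial try plus all retries are used.
            if attempts > retry_attempts:
                raise


def make_flaky_step(failures_before_success):
    """Build a hypothetical step that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def step():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ResourceBusy("resource busy")

    return step
```

With `retry_attempts=0`, a step that fails once fails the whole job, while the same step with `retry_attempts=3` succeeds on its second attempt. The sketch retries immediately and leaves out any wait between attempts; SQL Agent job steps also have a retry interval setting, which I have not modelled here.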
I have posted my question on the MSDN forums, but have not had any reply yet (even after more than 90 views!).
Until then, I just hope it doesn't happen again. Meanwhile, I will try to replicate the issue on a dev server and try out my idea of setting retry attempts on the job. I hope to find a real answer, and once I do, I will surely post it here as well.
That’s it for now.
It’s Just A Thought …