Friday, October 15, 2010

Time Solution

Just wrapped up my Windows Time side project yesterday. Here's a rough overview:

Found and resolved the initial issues listed in the last blog post:
  1. Used the 0x8 flag to disable Kerberos authentication on time requests.
  2. Used round-robin DNS to make our top domain's time service redundant for the sub domains.
  3. Opened the firewall properly to allow both DCs in the top domain to reach the outside Infoblox.
  4. Resolved the stratum issue (it appears to have been a result of the Kerberos failure).
Determined a new standard for all the sub domains. Simple stuff: 0x8 flag on the PDC, point the BDC at the PDC.
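
The BDC half of that standard isn't shown in this post. A minimal sketch of what it could look like, assuming "point the BDC at the PDC" means letting the BDC follow the domain hierarchy (which normally has it sync from the PDC emulator in its own domain). The choice of domhier over a manual peer list is an assumption here, not necessarily what was deployed:

```batch
REM Sketch only: BDC counterpart to the PDC configuration (assumed, not the deployed script)
REM domhier tells the BDC to take its time from the domain hierarchy
w32tm /config /syncfromflags:domhier /reliable:no /update
IF %ERRORLEVEL% NEQ 0 (echo Failed to configure time service)

REM Restart so the new settings are picked up, then force a resync
net stop w32time && net start w32time
w32tm /resync /rediscover
```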

Created a script that automated the new configuration and testing. This one is for the PDC.

REM Stopping Time Services
CALL :OUTPUT "Stopping Time Service"
net stop w32time
IF %ERRORLEVEL% NEQ 0 (Call :ERR "Failed to stop Time Service")

REM Unregistering Time Service
CALL :OUTPUT "Unregistering NTP"
w32tm /unregister
IF %ERRORLEVEL% NEQ 0 (Call :ERR "Failed to unregister Time Service")

REM Registering Time Service
CALL :OUTPUT "Registering NTP"
w32tm /register
IF %ERRORLEVEL% NEQ 0 (Call :ERR "Failed to register Time Service")

REM Starting Time Service
CALL :OUTPUT "Starting Time Service"
net start w32time
IF %ERRORLEVEL% NEQ 0 (Call :ERR "Failed to start Time Service")

REM Pointing the PDC at the NTP source (0x8 flag) and marking it as a reliable time source
CALL :OUTPUT "Setting up properly"
w32tm /config /manualpeerlist:ntp.xxxxx.xxx,0x8 /syncfromflags:manual /reliable:yes /update
IF %ERRORLEVEL% NEQ 0 (Call :ERR "Failed to configure time service")

REM Restarting Time Service
CALL :OUTPUT "Restarting Time Service"
net stop w32time && net start w32time
if %errorlevel% NEQ 0 (Call :ERR "Failed to restart Time Service")

REM Checking to make sure that it worked successfully
echo.
CALL :OUTPUT "Testing resync"
call w32tm /resync >> "c:\admin\ntp refresh.log"
echo.
CALL :OUTPUT "Testing monitor access"
call w32tm /monitor >> "c:\admin\ntp refresh.log"
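
The script calls :OUTPUT and :ERR, which aren't shown above. A hypothetical sketch of what such helpers could look like, assuming they only print the message and append it to the same log file. The real ones may well differ, for example by aborting on error:

```batch
REM Sketch only: hypothetical helpers, the originals are not shown in this post
REM End the main script here so execution does not fall through into the labels
GOTO :EOF

:OUTPUT
REM %~1 is the first argument with its surrounding quotes removed
echo %DATE% %TIME% %~1
>> "c:\admin\ntp refresh.log" echo %DATE% %TIME% %~1
GOTO :EOF

:ERR
REM Report the error but let the caller carry on with the next step
echo ERROR: %~1
>> "c:\admin\ntp refresh.log" echo %DATE% %TIME% ERROR: %~1
GOTO :EOF
```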

Found a way to use LANDesk to roll out this script. Not tough to do. Note, though: set the task to be an emergency distribution and push. I wasn't able to figure out what return code LANDesk was looking for, so the task always returned "Failure" when complete.
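
One untested guess at the "Failure" status: if LANDesk judges the task by the script's exit code, then the script as written ends with whatever the last command happened to return. A sketch that finishes with an explicit exit code instead. That LANDesk treats 0 as success is an assumption, and FAILED is a hypothetical variable that isn't in the original script:

```batch
REM Sketch only: track failures in a variable and exit with an explicit code
SET FAILED=0

w32tm /resync >> "c:\admin\ntp refresh.log"
IF %ERRORLEVEL% NEQ 0 SET FAILED=1

w32tm /monitor >> "c:\admin\ntp refresh.log"
IF %ERRORLEVEL% NEQ 0 SET FAILED=1

REM 0 = both checks passed, 1 = at least one check failed
EXIT /B %FAILED%
```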

Found a way to use WinSCP to upload the logs into a central FTP location.

The script for FTP:
c:\admin\winscp425.exe /console /command "option batch on" "open ftp://username:Password@192.168.xxx.xxx:21 -passive" "put ""c:\admin\ntp refresh.log"" ""/Transfers/upload/%EFT_FILE_NAME%.log""" "exit" /log=c:\admin\winscp_log.txt

The script for SFTP:
c:\admin\winscp425.exe /console /command "option batch on" "open sftp://username:Password@192.168.xxx.xxx:22" "put ""c:\admin\ntp refresh.log"" ""/Transfers/upload/%EFT_FILE_NAME%.log""" "exit" /log=c:\admin\winscp_log.txt
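
The doubled quotes in those one-liners are easy to get wrong. A sketch of an alternative that writes the same FTP commands to a WinSCP script file and runs it with /script, so each command sits on its own line. The file name c:\admin\upload.txt is made up for illustration, and the exit code check assumes WinSCP returns non-zero when a batch-mode script fails:

```batch
REM Sketch only: build a WinSCP script file, then run it
REM The batch interpreter expands %EFT_FILE_NAME% while writing the file
>  "c:\admin\upload.txt" echo option batch on
>> "c:\admin\upload.txt" echo open ftp://username:Password@192.168.xxx.xxx:21 -passive
>> "c:\admin\upload.txt" echo put "c:\admin\ntp refresh.log" "/Transfers/upload/%EFT_FILE_NAME%.log"
>> "c:\admin\upload.txt" echo exit

c:\admin\winscp425.exe /console /script="c:\admin\upload.txt" /log=c:\admin\winscp_log.txt
IF %ERRORLEVEL% NEQ 0 (echo Upload failed, see c:\admin\winscp_log.txt)
```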


Spent 2 months rolling this out to 130 PDCs and 40 BDCs.
Unusually slow rollout, since I'm new here. Lots of inefficiency to make sure I didn't mess up, but along the way I cleaned up the documentation around here. Lots of things were mislabeled as primary or secondary, or not labeled at all! Made for slow, meticulous going.

Used a QRadar query to track the number of events that NTP/Windows Time generated.
The query: "Log Source Type is Microsoft Windows Security Event Log, Payload contains is any of [NtpClient or W32Time or Windows Time]"

Here's a breakdown of the events from every Monday. Incidentally, the data only goes back to my first day. Analysis shows that while a particular domain or server can be a heavy hitter on a given day (event 1), the errors were spread fairly evenly across our domains. Even at the peak of the issues caused by a single sub domain (event 3), that domain accounted for only 11% of the errors. That indicates that solving the top domain's problems (event 2) is what had the profound impact across all of our environments. See the second graph for another illustration of just how stark the drop is.

Raw Snare query focused on the week of the drop:

Notes:
"Sub domains" here means domains that are independent, but in terms of DNS and Windows Time, they look to the top domain.
PDC means Primary Domain Controller.
BDC means Backup Domain Controller.
LANDesk is a centralized desktop management product that we use for servers as well. You can force installs, as well as many other things. It's versatile, but not well coded.
WinSCP is an FTP/SFTP client for Windows that can be driven from the command line. The online documentation leaves much to be desired.