Friday, December 3, 2010

Visual Studio 2010 Silent Install

Needed to script out a Visual Studio 2010 install and wanted to throw a couple of the gotchas out onto the interwebs for others. Basically, I wrote three batch files that use .reg files to queue each other up across reboots, plus the setup's native unattend capability to install on the D:\ drive with only selected components (we didn't install the SQL or SharePoint components).

Notes on my implementation:
- The setup runs silently and reboots automatically.

- It uses the registry to queue up more installs after each reboot, so you can just sit back.

- The setup uses an unattend setup.ini file to install only the components our development teams recommended.

- Install.bat warns that several reboots are required and pauses to let the user quit; it also puts a 60-second timer on the first reboot. After that, the script just runs.

- The scripts hard-code the c:\admin\ path, so be sure to extract the files there.

- It's about a 35-minute install, and it puts 2GB on the D:\ drive and 4GB on the C:\ drive.

- The initial zip file is 2.4GB, and unzipped it's still 2.4GB. Make sure you have 10GB or so free on C:\ before you start this process; you can get away with 8GB free on C:\ if you delete the .zip file after you extract it.

- Security scanning only picked up 1-2 vulnerabilities after install, but that may change over time so I recommend scanning it post-install.

- This DOES install .NET 4.0.


Scripts below:

Install.bat
echo "This install requires a reboot. Please press enter if you'd like to continue, else quit."
pause
regedit /s install.reg
VS2010setup.exe /q /norestart /unattendfile setup.ini
echo "Complete. If you don't want to reboot the server, use shutdown -a"
pause
shutdown -r -t 60 -c "The server is restarting per Visual Studio 2010 Install requirements."


Install.reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce]
"VS2010"="C:\\admin\\vs_2010\\Setup\\install_stage_2.bat"

Install_Stage_2.bat
REM Stage 2 of 3: queue stage 3, run the second setup pass, then reboot.
REM Full path to install2.reg, since RunOnce doesn't start us in this script's folder.
regedit /s c:\admin\vs_2010\setup\install2.reg
c:\admin\vs_2010\setup\VS2010setup.exe /q /norestart /unattendfile c:\admin\vs_2010\setup\setup.ini
echo Complete. Rebooting.
shutdown -r -t 10 -c "The server is restarting a second time per Visual Studio 2010 Install requirements."



Install2.reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce]
"VS2010"="C:\\admin\\vs_2010\\Setup\\install_stage_3.bat"


Install_Stage_3.bat
REM Stage 3 of 3: final setup pass; nothing else is queued after this.
c:\admin\vs_2010\setup\VS2010setup.exe /q /norestart /unattendfile c:\admin\vs_2010\setup\setup.ini
echo Complete. Rebooting.
shutdown -r -t 10 -c "The server is restarting the last time per Visual Studio 2010 Install requirements."



Gotchas:
- The backslash is an escape character inside quoted .reg values, so double each backslash.
- Watch out using setup.exe as the runonce target. I renamed the setup file to avoid a possible issue with a native Windows file named setup.exe.
- The setup.exe in the \setup\ folder is the only one with unattend functionality. Don't confuse it with the setup.exe in the root folder. (See the note after these gotchas on generating the unattend file.)
- Whatever happens during a manual install is not necessarily what you'll see once you automate it. I was able to install manually with only one reboot, but was unable to automate it without two reboots. Three stages is what worked.
- SP2 is required for Windows Server 2003 installs.
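
A note on generating setup.ini, since you don't write it by hand: the VS2010 setup.exe has a /createunattend switch that records your component selections into an unattend file, which you then replay with /unattendfile as in the scripts above. Something like this (the path is just where we kept ours):

c:\admin\vs_2010\setup\VS2010setup.exe /createunattend c:\admin\vs_2010\setup\setup.ini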



Sources:
http://aka-community.symantec.com/connect/forums/installingscripting-out-visual-studio-pro-2008-visual-studio-pro-2010 - Other people's scripting
http://techsupt.winbatch.com/TS/T000001029F22.html - RunOnce discussion
http://msdn.microsoft.com/en-us/library/aa376977(VS.85).aspx - Official MS RunOnce documentation

Friday, October 15, 2010

Time Solution

Just wrapped up my Windows Time side project yesterday. Here's a rough overview:

Found and resolved the initial issues listed in the last blog post:
  1. Used the 0x8 flag to disable Kerberos authentication on time requests.
  2. Used round robin DNS to make the top domain's time service redundant for the subdomains.
  3. Opened the firewall properly to allow both DCs in the top domain to reach the outside Infoblox.
  4. Resolved the stratum issue (it appears to have been the result of the Kerberos failure).
Determined a new standard for all the subdomains. Simple stuff: the 0x8 flag on the PDC, and point the BDC at the PDC.

Created a script that automated the new configuration and testing. This one is for the PDC (the :OUTPUT and :ERR labels it calls are logging and error-handling subroutines not shown here).

REM Stopping Time Services
CALL :OUTPUT "Stopping Time Service"
net stop w32time
IF %ERRORLEVEL% NEQ 0 (Call :ERR "Failed to stop Time Service")

REM Unregistering the Time Service
CALL :OUTPUT "Unregistering NTP"
w32tm /unregister
IF %ERRORLEVEL% NEQ 0 (Call :ERR "Failed to unregister Time Service")

REM Re-registering the Time Service
CALL :OUTPUT "Registering NTP"
w32tm /register
IF %ERRORLEVEL% NEQ 0 (Call :ERR "Failed to register Time Service")

REM Starting Time Service
CALL :OUTPUT "Starting Time Service"
net start w32time
IF %ERRORLEVEL% NEQ 0 (Call :ERR "Failed to stop Time Service")

REM Configuring the Time Service
CALL :OUTPUT "Setting up properly"
w32tm /config /manualpeerlist:ntp.xxxxx.xxx,0x8 /syncfromflags:manual /reliable:yes /update
IF %ERRORLEVEL% NEQ 0 (Call :ERR "Failed to configure time service")

REM Restarting Time Service
CALL :OUTPUT "Restarting Time Service"
net stop w32time && net start w32time
if %errorlevel% NEQ 0 (Call :ERR "Failed to restart Time Service")

REM Checking to make sure that it worked successfully
echo.
CALL :OUTPUT "Testing resync"
call w32tm /resync >> "c:\admin\ntp refresh.log"
echo.
CALL :OUTPUT "Testing monitor access"
call w32tm /monitor >> "c:\admin\ntp refresh.log"
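
The BDCs got the mirror-image treatment: per the new standard, they just follow the domain hierarchy up to the PDC. A minimal sketch of what that configuration step would look like, reusing the same stop/register/start scaffolding as the PDC script above:

REM BDC variant: sync from the domain hierarchy (the PDC) instead of a manual peer list
w32tm /config /syncfromflags:domhier /reliable:no /update
net stop w32time && net start w32time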

Found a way to use LANDesk to roll out this script. Not tough to do. Note, though: set the task to be an emergency distribution and push. I wasn't able to figure out what return code LANDesk was looking for, so the task always returned "Failure" when complete.

Found a way to use WinSCP to upload the logs into a central FTP location.

The script for FTP:
c:\admin\winscp425.exe /console /command "option batch on" "open ftp://username:Password@192.168.xxx.xxx:21 -passive" "put ""c:\admin\ntp refresh.log"" ""/Transfers/upload/%EFT_FILE_NAME%.log""" "exit" /log=c:\admin\winscp_log.txt

The script for SFTP:
c:\admin\winscp425.exe /console /command "option batch on" "open sftp://username:Password@192.168.xxx.xxx:22 -passive" "put ""c:\admin\ntp refresh.log"" ""/Transfers/upload/%EFT_FILE_NAME%.log""" "exit" /log=c:\admin\winscp_log.txt


Spent 2 months rolling this out to 130 PDCs and 40 BDCs.
That's an unusually slow rollout, but I'm new here, and there was a lot of deliberate inefficiency to make sure I didn't mess anything up. Along the way I cleaned up the documentation around here: lots of things were mislabeled as primary or secondary, or not labeled at all! Made for slow, meticulous going.

Used a QRadar query to track the number of events NTP/Windows Time generated.
The query: "Log Source Type is Microsoft Windows Security Event Log, Payload contains is any of [NtpClient or W32Time or Windows Time]"

Here's a breakdown of the events from every Monday (the data only goes back, incidentally, to my first day). Analysis of the data indicates that while a certain domain/server can be a heavy hitter on a particular day (event 1), the breakdown of errors was fairly evenly spread across our domains. Even during the peak of the issues caused by one subdomain (event 3), that domain only accounted for 11% of the errors, indicating that solving the top domain problems (event 2) is what had a profound impact across all of our environments. See the second graph for another illustration of just how stark the drop is.


Raw Snare query focused on the week of the drop:

Notes:
"Sub domains" here means domains that are independent, but in terms of DNS and Windows Time, they look to the top domain.
PDC means Primary Domain Controller
BDC means Backup Domain Controller
LANDesk is a centralized desktop management product that we use for servers as well. You can force installs, as well as many other things. It's versatile, but not well coded.
WinSCP is an FTP client executable (it also speaks SFTP, as used above). The online documentation leaves much to be desired.

Friday, July 30, 2010

W32Time

Started a new job in Minneapolis about a month ago, and things are finally starting to get exciting. Found that there are a ton of issues with Windows Time here and spent a good deal of effort investigating them. This company does a lot of work that depends on accurate timestamps, so I had little trouble getting management to let me dig into the issues. The company is also PCI certified, so there are tons of domains, and the time hierarchy goes a bit like this:

Internet time server
Infoblox DNS servers
Domain controllers in master domain (let's call this D1)
PDC in subdomain (SD2, SD3, etc)
DC in SD2
Member server in SD2

One issue I found right away was that the PDC in SD2 was getting time from ntp.D1.com. This DNS entry pointed only at the PDC in D1, resulting in a non-redundant time stream, which caused issues when we took the D1 PDC down. I enabled round robin DNS for this by adding a second DNS record with the same name (ntp.D1.com) pointing at the D1 DC's IP.
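
For reference, if the zone lives on Microsoft DNS, adding that second record is a one-liner with dnscmd (the server name, zone, and IP below are made up for illustration):

dnscmd dc1.D1.com /RecordAdd D1.com ntp A 192.168.0.11

With two A records under the same name, the DNS server rotates the order of the answers it hands out, which is all round robin is.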

The next issue was that the D1 DC was not pointed at the D1 PDC (the domain-based hierarchy); it was instead pointing at the Infoblox server. I decided to ignore this for now.

Next, I put together a list of all the PDCs in every domain and verified NTP access for each to the DCs in D1. I also audited the registry on each and found another issue: the ",0x1" flag for special polling intervals, which should be appended to "ntp.D1.com" in the NtpServer registry entry, was missing in many domains. I'm working through getting this added, but it's not a top priority.
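
For anyone hunting for that value, here is roughly what the corrected entry looks like in .reg form (the peer name is the generic one from this post, not our real one):

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Parameters]
"NtpServer"="ntp.D1.com,0x1"

The 0x1 flag tells w32time to poll on the configured SpecialPollInterval instead of its adaptive schedule.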

Next, I audited the event logs of several domains (~5 servers: a PDC and a DC from each) and determined that there was some correlation between reboots and the time failing. I ruled out patch issues or change issues, since there was nothing in common. There also seemed to be a strange inverse correlation between the PDC receiving time properly and the member servers receiving time properly: one worked, the other didn't, and vice versa.

Last, I built an entire test domain mimicking a normal domain, turned on logging, and started testing. Things got screwy quickly, and by parsing the packet reception in the logs I eventually determined that my SD2 PDC somehow had stratum 15. I then got logging turned on in D1 and found that the Infoblox servers had stratum 13! If you count the list above, the highest stratum we should be at is 6, and MS caps stratum at exactly 15. Stratum 15 is also an indicator flag that the stratum count may not be correct, but in this case it was being incremented correctly (after the Infoblox, which is the root of our issues).
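
Worth noting, since it's off by default: the packet-level detail came from the w32time debug log. Enabling it looks like this (the log path and size are just the values I'd pick, not requirements):

w32tm /debug /enable /file:c:\admin\w32time.log /size:10000000 /entries:0-300

Parsing that log is what exposed the stratum on each received packet; w32tm /debug /disable turns it back off.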

We're opening a case with infoblox to figure out what's going on with them. After we fix that, I'll know if I can consider this case closed.

Thursday, March 4, 2010

EVAPerf Statistical Limitations

Here's a big issue I wrestled with over the last few weeks:

EVAPerf occasionally hiccups, kicking out a single data point that claims a 2Gb host port has 30Gb/s of throughput, or that a single disk group has 200GB/s being written to it. Sometimes these are clear overflow numbers (214748.3647, which is the signed 32-bit maximum 2147483647 with the decimal shifted four places, showed up repeatedly), and sometimes they are just absurdly high, though unique and precise. HP recommends using the 95th percentile to statistically analyze the performance of your EVAs, and these super-high numbers skew our statistics to the point of being worthless.

My solution: create duplicate, empty SQL tables and screen the data daily, moving any data points over thresholds I set into those duplicate tables, where they'd be out of the scope of my automated reporting. The trouble is: where's the threshold?
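
The daily screen itself is simple in concept. A minimal sketch of the idea as a scheduled batch job (the server, database, table, and column names are all invented for illustration; 51200 MB/s is the 50GB/s disk group threshold I eventually settled on below):

REM Atomically move any over-threshold disk group samples into the duplicate table.
sqlcmd -S sqlserver01 -d evaperf -Q "DELETE FROM DiskGroupPerf OUTPUT DELETED.* INTO DiskGroupPerf_Outliers WHERE WriteMBps > 51200;"

The OUTPUT DELETED.* INTO form makes the move a single atomic statement, so a data point can never be deleted without landing in the duplicate table.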

After a few weeks of emailing back and forth with HP, it became clear that they weren't interested in giving any definite answer or OK'ing my calculations. Part of the reason for that is that they have no real insight into how much actual IO CAN occur in a disk group because of all the behind-the-scenes leveling, parity calculation, and striping that occurs on top of the server-generated IO.

On top of that, block size varies so widely that the IO capacity calculations they ARE able to do give you no real concept of the throughput capacity of your hardware. For example, let's say their PerfMonkey tool said your disk configuration allowed for 6000 read and 4000 write IO/sec in your disk group. Theoretically, with block sizes of up to 64MB apiece, your read throughput alone could be over 380GB/s (6000 IO/sec x 64MB = 384GB/s). So we're without a solid mathematical recourse.

I settled on 50GB/s for disk groups and LUNs, and 10Gb/s for host ports (even though they're only 4Gb ports), after careful analysis of how those thresholds affected the data: it ends up looking like about 40 data points per month would be moved.

Working pretty nicely so far.

HP EVA fnames.conf

Our SAN environment has 120TB spread over 4 HP EVA's (3 8100's and a 4400). We've worked through numerous difficulties with these, not the least of which was the dreaded "Saturday morning slowness." Part of our efforts to combat this was to attempt to gain greater insight into where the IO was actually coming from - at the time, we simply didn't have the system in place to do this.

Our friends at HP provide EVAPerf, which kicks out CSV files with a deluge of data (200MB+ per day per array). To make sense of this, a good friend on the software engineering side was added as a resource: he did a great job of writing a loading program that takes those CSV files and kicks them into SQL. Our company is looking into IP rights, it's that awesome.

Meanwhile, we discovered a few months ago that our CSVs were filled with WWNs, which are pretty cumbersome to work with. HP's solution is the friendly names file, fnames.conf, which allows the EVAPerf task on your server to replace the WWNs with readable English names. So an fnames.conf file was set up to automatically recreate itself once a day.

Well, as time went on and we changed more and more, the data I was working with was increasingly filled with WWNs. Investigation Monday morning turned up this little gem in the documentation:
"The fnames.conf file must reside in the directory in which HP Command View EVAPerf was installed. "

Well, whoever set this up had it creating the updated file in a c:\utilities subdirectory, while EVAPerf was actually installed under c:\programs\hp... so we had been working with a months-old fnames file. After fixing this, we went from 15 of 150 LUNs correctly represented in one EVA's CSV file to 135 of 150 showing up with real names, greatly simplifying our vdisk I/O statistics. Woot!

Extremely useful information on decoding EVAPerf data:
http://www.fcoe.ru/index.php?option=com_content&task=view&id=257&Itemid=46#addcomments

See here for more official info on the fnames/evaperf integration
http://h10032.www1.hp.com/ctg/Manual/c00605846.pdf

Resurrection!

I had hoped I'd be able to keep posting solutions to tough problems and general summaries of things I'd learned, but I only got 5 posts in before other priorities took hold. A lighter school load this quarter will hopefully give me a better chance to keep this updated!