Saturday, February 26, 2011

IBM Tivoli Storage Manager

Pinged a mentor of mine on a technology I haven't had a chance to work with to get his thoughts on it. I know that the internet is full of marketing-speak and tech mumbo jumbo, and he has a singular talent for cutting through it and explaining something in terms that really matter. He gave me a summary so solid and concise that I can't help but pass it on. Enjoy!

"It’s an IBM backup product. Their big efficiency is they do what’s called “incremental forever” backups. Rather than focusing backups on how many weeks you keep a tape for, they look at how many copies of a data set you want to keep. So for example, say you have 10 files you back up nightly. You write a policy that says you want to keep 3 versions of each file on tape. If one file changes every day, it’ll get backed up every day, and Tivoli will release any copy of the file that is more than 3 revisions old. But the other files will only get backed up if/when they change. The problem with Tivoli is that it runs a process of reclamation and consolidation to copy data you need to keep to new media so tapes housing data no-longer-needed can be flushed. If you don’t run this reclamation and consolidation process, you could get into a situation where you need every tape you’ve ever written to restore a single volume.

It’s a neat product, and a different way to look at protecting data. It’s a solid product, well adopted, but hell on your off-site transport costs because you’ll be shuffling tapes around daily to do the reclamation and consolidation.

One other thing you can do is leave all your tapes in the library (for primary copy) and create a synthetic full that you then take off-site."
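The "keep 3 versions of each file" policy he describes can be sketched in a few lines of Python. This is a toy model of the retention idea, not actual TSM behavior; the file names, the in-memory store, and the `run_backup` function are all my own invention:

```python
# Toy model of TSM-style "incremental forever" retention: back up only what
# changed, and expire any version older than the newest KEEP_VERSIONS copies.
from collections import defaultdict

KEEP_VERSIONS = 3  # policy: retain 3 versions of each file

def run_backup(store, changed_files, day):
    """Back up only the files that changed, then expire old versions."""
    for name in changed_files:
        store[name].append(day)               # record a new version
        if len(store[name]) > KEEP_VERSIONS:  # release versions beyond policy
            store[name] = store[name][-KEEP_VERSIONS:]

store = defaultdict(list)
# One file changes every day; the others are backed up once and never again.
for day in range(1, 8):
    changed = ["volatile.db"] if day > 1 else ["volatile.db", "static1.txt", "static2.txt"]
    run_backup(store, changed, day)

print(store["volatile.db"])  # only the 3 most recent versions survive
print(store["static1.txt"])  # single copy, kept until the file changes
```

The churning file gets backed up nightly and its oldest copies expire, while the static files sit on whatever tape they were first written to, which is exactly why the reclamation/consolidation process he mentions becomes mandatory.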

Wiki linkage:

Thursday, February 17, 2011

LANDesk Return Codes

Quick thing I picked up recently: had an issue where a LANDesk task would not return successfully, even though the batch file was running without a problem.  “EXIT 0” was not working, but the attached documentation suggested using “EXIT /B 0”, which worked like a charm.
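The underlying issue is how the exit code gets back to the caller: plain `EXIT` terminates the whole cmd.exe session, while `EXIT /B` exits just the batch script and hands its code back to whatever launched it. This portable Python sketch shows the caller's side of that contract, the same way a deployment tool judges task success (the child command here is a stand-in, not anything LANDesk-specific):

```python
# A deployment tool judges task success by the child process exit code.
# This sketch launches a "task" and checks its return code, the way
# LANDesk does with a batch file that ends in EXIT /B 0.
import subprocess
import sys

# Stand-in for the batch file: a child process that exits with code 0.
child = subprocess.run([sys.executable, "-c", "import sys; sys.exit(0)"])

status = "success" if child.returncode == 0 else "failure"
print(status)
```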

Wednesday, February 9, 2011

Thoughts on IT culture

A few thoughts on the inefficiency of IT and how to solve it:

1.  Pointing out other people's mistakes is not the same thing as leadership.

2.  I have never seen someone criticized by their boss for not taking initiative; I consistently see people criticized for imperfect results when they voluntarily overburden themselves.  Do you want a siloed, turf-war, every-man-for-himself culture where people do their best to limit their workload in order to achieve actual perfection?  Or one where people are applauded for efforts to carry a greater share of the workload in order to allow their teammates to accomplish more as well?  Leadership should consider what they are incentivizing.

3.  Don't waste talent.  There's plenty of dirty work in IT and everyone understands that, but as much as you possibly can, try not to waste your engineers' time on the small stuff.  I've seen guys making six figures who change backup tapes every day (!!!) and guys making $80k who have to spend 2 hours a week creating users in AD.  If you're paying someone to design bridges, he shouldn't be filling potholes; otherwise you're wasting capital and drastically slowing your company's technological advancement.

4.  Hire interns.  The energy, work ethic, and fresh perspective force all of us full-timers to stay at the top of our game.  Further: 1)  Give your interns real responsibility.  2)  Let your interns shake up the status quo, even if your full-timers don't like it.

5.  You pay your talent a lot, so get them the tools to be as effective as possible.  I really can't make this point any better than Jeff Atwood and Yishan Wong, I highly recommend these two reads.

Coding Horror's Programmer's Bill of Rights:

Engineering Management by Yishan Wong:

Tuesday, February 8, 2011

Server Management Software

A very underappreciated tool: server asset management.  Just knowing basic info about your servers, documented in one place, can save your people tons of time and hassle.  The ROI for keeping this type of system up to date is off the charts.  Which is why I was so surprised, at my current company, to find their tool full of incomplete and inaccurate data.  I did a complete 8-hour audit of the datacenter and true'd it up with their mgmt software.  Here were my results:

- 15 servers for which we had no record of their location, or our records were wrong.
- 15 servers whose front labels were incorrect.
- 5 live servers whose front labels were missing.
- 33 servers not labeled, or labeled incorrectly, in the back of the rack.
- 10 servers where the recorded rack info was correct, but the position in the rack was not.
- 80 servers for which we didn’t have the submodel (e.g. 7979-XXX), or the recorded submodel was incorrect (this hasn’t been a requirement in the past, so this was mostly information gathering on my part).
- All of the serial numbers were correct (wow).

The Rack Visual portion of our software is now in harmony with no overlaps - before, servers were mapped as being inside other servers, or on the roof :-).  I created tickets for things I wasn't able to immediately resolve, like locating hardware and working with security to get their racks properly documented.
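The core of an audit like this is a reconciliation pass: walk the datacenter, note what's physically there, then diff it against what the management tool claims. A minimal sketch of that diff, with made-up server names and a simplified rack/position record:

```python
# Hypothetical reconciliation of asset-management records against a physical
# walk-through. Server names and fields are illustrative, not from any tool.

recorded = {
    "srv01": {"rack": "A1", "position": 12},
    "srv02": {"rack": "A1", "position": 14},
    "srv03": {"rack": "B2", "position": 3},
}
physical = {
    "srv01": {"rack": "A1", "position": 12},  # matches the record
    "srv02": {"rack": "A1", "position": 16},  # rack right, position wrong
    "srv04": {"rack": "C1", "position": 8},   # no record at all
}

def audit(recorded, physical):
    """Return a sorted list of (server, problem) pairs."""
    issues = []
    for name, actual in physical.items():
        rec = recorded.get(name)
        if rec is None:
            issues.append((name, "no record"))
        elif rec != actual:
            issues.append((name, "record incorrect"))
    for name in recorded:          # recorded servers we never found
        if name not in physical:
            issues.append((name, "not found in datacenter"))
    return sorted(issues)

for name, problem in audit(recorded, physical):
    print(name, problem)
```

Each line of output becomes either a correction in the tool or a ticket for the stuff you can't fix on the spot.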

Nehalem Performance Optimization (BIOS Edition)

Long-awaited recommendations on the BIOS options for IBM's Nehalem offerings.  I can't speak to their Westmere technology, having been unable to get my hands on it yet.  I did a considerable amount of research on this; at the end of the day, it has been an effort to aggregate IBM's standards rather than an attempt to prove or disprove their statements.  I'm not sure I can take on IBM like that :-)

You can find the details in the link below, not a virus I promise:
(Update: MediaFire's link expired somehow.  Working with their techies to resolve)
Nehalem Recommendations (excel)

Official IBM notes on Performance:

VMware notes on Turbo:

Optimizing uEFI boot speed (recommendations forthcoming):

Daily VMware Slowness

I was in a meeting Friday when someone mentioned that our VM's (of which we have hundreds spread across 4 datacenters and 12 ESX servers) were experiencing degraded performance every day around lunchtime.  I set up PerfMon tasks to watch the CPU utilization and I/O reads and writes of the various processes on two different VM's.  What I found was nothing: there was no data to support the slowness.  We then started experiencing slowness at 5pm.  I analyzed that data, and found this:

Virtual Machine 1, 4:50pm to 6pm: 

Virtual Machine 2, 4:50pm to 6pm:

If I were a detective, I would probably call that a clue.  Turns out our engineer who owns McAfee for us had recently made some changes, which I haven't been able to dig into, that made all of our VM's update their local virus definitions simultaneously.  When there was conversation about that possibly causing performance issues (a theory that was disparaged by some), he moved it to 5pm.  Mystery solved.  We're going to work to stagger the DAT updates, so not all our VMware CPU and LUN I/O resources are pegged simultaneously.

Couple notes on perfmon:
- Kick logs out to CSV or TSV.  Either can be imported into Excel, which is much easier and more versatile to work with than PerfMon's own data analyzer.
- Make sure to select "All instances" so that each process gets its own set of data points.
- I took readings every 10 seconds and saw no real performance impact over the course of the day. 
- I highly recommend deleting "Idle" and "Total" processes first thing when analyzing the data you've collected, or better yet not recording data from them at all: they're just noise.  
- I recommend setting a 1-4 hour time frame for the data collection so the data is broken out into multiple files: nothing is more of a hassle than trying to work with a 200MB text file.  Trust me, I learned that lesson a year ago (thanks HP EVAStats!).
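The "delete Idle and Total first" step is easy to script before the data ever hits Excel. A sketch that strips those columns from a PerfMon CSV export; the counter names follow PerfMon's usual `\\HOST\Process(name)\counter` convention, but the host, process, and sample values are made up:

```python
# Drop the "(Idle)" and "(_Total)" process columns from a PerfMon CSV export
# so only real per-process counters reach the spreadsheet.
import csv
import io

NOISE = ("(Idle)", "(_Total)")

def clean_perfmon(csv_text):
    """Return the rows with noise columns removed."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]
    keep = [i for i, col in enumerate(header)
            if not any(n in col for n in NOISE)]
    return [[row[i] for i in keep] for row in rows]

# Tiny illustrative export: timestamp plus three process counters.
sample = (
    '"Time","\\\\VM1\\Process(Idle)\\% Processor Time",'
    '"\\\\VM1\\Process(_Total)\\% Processor Time",'
    '"\\\\VM1\\Process(mcshield)\\% Processor Time"\n'
    '"16:50:00","98","100","2"\n'
)
cleaned = clean_perfmon(sample)
print(cleaned[0])  # only Time and the real process column remain
```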

Wednesday, February 2, 2011

Duplicate SID's

Interesting challenge to conventional wisdom... "I became convinced that machine SID duplication – having multiple computers with the same machine SID – doesn't pose any problem, security or otherwise." - Mark, of SysInternals fame.  In case you haven't heard of Mark, he's pretty much a legend.  Against what I've been taught, MS has concluded that duplicate local SID's within a domain are perfectly OK.  Domain SID's, on the other hand, need to be unique.

Gotta constantly re-evaluate commonly held truths I guess!

However, I have seen an issue: if the server you're joining to the domain has the same local SID as the DC, you will see some funky results.  The domain trust will not function correctly and you won't be able to log onto the member using domain accounts.