Monitor Isilon NFS thread counts

Here at [my workplace] we recently noticed that some of the nodes in our Isilon storage cluster were reaching their NFS thread limit. I won't go into why that's a bad thing or why it was happening, but we quickly realized it was something we should be monitoring closely. To see the current NFS thread counts on all nodes in your Isilon cluster, run the following command:

isi_for_array -s sysctl vfs.nfsrv.rpc.threads_alloc_current

This returns something like the following:

dm11-1: vfs.nfsrv.rpc.threads_alloc_current: 16
dm11-2: vfs.nfsrv.rpc.threads_alloc_current: 16
dm11-3: vfs.nfsrv.rpc.threads_alloc_current: 16
dm11-4: vfs.nfsrv.rpc.threads_alloc_current: 16
dm11-5: vfs.nfsrv.rpc.threads_alloc_current: 16
dm11-6: vfs.nfsrv.rpc.threads_alloc_current: 16
dm11-7: vfs.nfsrv.rpc.threads_alloc_current: 16
dm11-8: vfs.nfsrv.rpc.threads_alloc_current: 16
dm11-9: vfs.nfsrv.rpc.threads_alloc_current: 16
...etc...

The first column gives you the node name and the last column gives you the current thread count. With few client connections, the numbers on the right will be low. Our nodes are set with a 16-thread minimum. As more clients connect to a given node, more threads are spawned as needed to service them.

Running this command manually every once in a while is obviously less than ideal. Since Isilon nodes run an OS based on FreeBSD and python is available on them, I wrote a python script called nfs_watcher.py to monitor the thread counts for me. The script lives in /root on one of the nodes in the cluster and runs every 5 minutes via a cron entry in /etc/local/crontab.local on the same node.

When the script runs, it checks whether any node is at or above our warning threshold (70% of the maximum thread count of 256). The script sends an alert email (via smtp/sendmail) if at least one node has hit the warning threshold. Nodes beyond the threshold are identified at the top of the message in a line that starts with "WARN" or "CRIT" followed by the node's name and thread count. The email alert also includes a complete copy of the thread count data at the bottom so you can check whether it is an isolated spike or the entire cluster is undergoing a heavy load.
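The check itself is easy to sketch in python. The warning threshold below mirrors the 70%-of-256 level described above; the 90% critical level, the function names, and the parsing details are my own illustration, not the actual nfs_watcher.py code:

```python
# Illustrative sketch of the threshold check described above.
# Thresholds: WARN at 70% of max (per the post); CRIT at 90% is an
# assumption for this example, not the script's documented value.

MAX_THREADS = 256
WARN = int(MAX_THREADS * 0.70)
CRIT = int(MAX_THREADS * 0.90)

def parse_counts(output):
    """Parse 'isi_for_array -s sysctl ...' output into {node: count}."""
    counts = {}
    for line in output.strip().splitlines():
        node, _sysctl, count = (field.strip() for field in line.split(":"))
        counts[node] = int(count)
    return counts

def alerts(counts):
    """Return WARN/CRIT summary lines for nodes at or above the thresholds."""
    lines = []
    for node, count in sorted(counts.items()):
        if count >= CRIT:
            lines.append("CRIT %s %d" % (node, count))
        elif count >= WARN:
            lines.append("WARN %s %d" % (node, count))
    return lines
```

From there it is just a matter of mailing the alert lines plus the raw output when the alert list is non-empty.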

You can find nfs_watcher.py on my github page.


Revealing symlinks in arbitrary paths

Here at [my day job], the scientists I support can generate many tens, hundreds, and often thousands of gigabytes of data. We provide them with a few different storage options, each with different performance, redundancy, and (therefore) cost characteristics. Quite a few of the labs here keep data spread across multiple storage tiers. To make it simpler for them to maintain and access their data, they often put symlinks to, for example, archived data in their primary data directories.

A number of issues can arise from this. One is that the scientists will often forget where their data actually resides. This is a major issue if they are planning on using our compute cluster to analyze this data. One of the trade-offs of storing data on our archive tier, besides it being slower than our primary tier, is that only a limited set of compute cluster nodes can access it. That tier is not robust enough to handle a lot of concurrent traffic, so we only allow a small subset of cluster nodes to access it. Unless those nodes are specifically requested when scheduling a cluster job that involves archived files, the job will fail.

Of course, when a job fails to run, we're usually asked to diagnose the issue. The most common culprit is that some or all of the files are on the archive storage. My co-worker was getting frustrated constantly diagnosing this issue and opined,

Wouldn’t it be great if we had a tool that would convert paths into something that made any symlinks in the path obvious?

I took that as a challenge. Thanks to python, I came up with a simple command line tool in fairly short order that does just that, plus a little more. I call it realpath. Here’s the usage screen and some examples of how it works.

Usage: realpath [options] path

Options:
  -h, --help    show this help message and exit
  -f, --full    show full symlink paths
  -a, --actual  show the actual, non-interleaved path

Examples:

# realpath /tmp/pathtest/stuff
/tmp[private/tmp]/pathtest/stuff[../../../Users/admindude/Documents/stuff]

# realpath --full /tmp/pathtest/stuff
/tmp[/private/tmp]/pathtest/stuff[/Users/admindude/Documents/stuff]

# realpath --actual /tmp/pathtest/stuff
/Users/admindude/Documents/stuff

Pretty cool, huh? You can find realpath on my github page.
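Under the hood, the trick is just to walk the path one component at a time and annotate any component that is a symlink. A minimal python sketch of that idea (my own illustration, not the actual realpath source):

```python
import os

def annotate_path(path):
    """Walk an absolute path component by component, appending each
    symlink's target in brackets, roughly like realpath's default
    (interleaved) output. Illustrative sketch only."""
    current = ""
    parts = []
    for part in path.strip("/").split("/"):
        current = current + "/" + part
        if os.path.islink(current):
            # os.readlink returns the target as stored, which may be
            # relative (hence "private/tmp" rather than "/private/tmp")
            parts.append("%s[%s]" % (part, os.readlink(current)))
        else:
            parts.append(part)
    return "/" + "/".join(parts)
```

The --full and --actual modes would then be variations on the same walk, resolving each link target before printing it.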

Convert a RHEL box to Scientific Linux with no downtime

I saw the following tweet today.

I was faced with the same task a couple years ago and discovered that it’s actually a lot simpler than you’d guess. (It helps that 99(.44)% of the code between the same versions of Red Hat Enterprise Linux, Scientific Linux, and CentOS is identical.) After determining the minimum number of packages I needed to remove, replace, or install, and in what order to do everything, I came up with the procedure detailed below. Once I tested it on a few boxes and was satisfied with the results, I handed the job off to puppet and it took care of converting the rest of my RHEL boxen to SL. (If you’re interested in seeing the puppet manifest I used to do this, let me know in the comments.)

Continue reading

WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
00:11:22:33:44:55:66:77:88:99:aa:bb:cc:dd:ee:ff.
Please contact your system administrator.
Add correct host key in /home/username/.ssh/known_hosts to get rid of this message.
Offending RSA key in /home/username/.ssh/known_hosts:42
RSA host key for 10.1.2.3 has changed and you have requested strict checking.
Host key verification failed.

If you use ssh — if you’re reading this blog, you likely count yourself as a member of that group — you’ve seen this warning. If you bring up and take down VMs regularly, or replace servers regularly, or use dhcp on your network with short lease times, you might see this warning more than others. Regardless of how or when you’ve seen the warning, you probably find it annoying. “No, computer, I’m not being hacked. A different host has this IP now. Just let me in already, ok?” At this point, you open the known_hosts file in your editor of choice, find the offending line, delete the line, save the file, and try your ssh command again. Tedious, right?

We can all agree that simply turning off this warning is not the best idea, so how can we deal with it efficiently? (HINT: check this post's categories.) That's right, we script it. We don't need any more than a simple one-liner, and the best place to keep your one-liner scripts is in your .bashrc file (or the rc file of the shell of your choice).

Do you see how the error kindly tells you which line of the known_hosts file is the offender? This makes our job extremely easy. Both sed and perl can easily delete a given line from a file. (NOTE: Mac users will need to use perl, since the BSD version of sed expects a backup-suffix argument after -i and won't accept the form shown here.) Both of the following commands will delete line 42 from the file.

sed -i '42 d' ~/.ssh/known_hosts
perl -i -ne 'print unless 42 .. 42' ~/.ssh/known_hosts

These are nice, but we need to wrap the command in a function that takes the line number as an argument. Let’s call the function rmhost.

rmhost () { sed -i "$1 d" ~/.ssh/known_hosts; }
rmhost () { perl -i -ne "print unless $1 .. $1" ~/.ssh/known_hosts; } # for Macs

And there we have it. Put your command in your .bashrc, source it (i.e. load the changes by running 'source ~/.bashrc'), and the next time you see the above warning (and you know the reason you're getting it), just type rmhost [line #] and you're good to go.
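If you want to save one more manual step, the offending line number can be scraped right out of the warning text itself. A small python sketch of that idea (illustrative only; how you capture the warning text is up to you):

```python
import re

def offending_line(warning_text):
    """Pull the known_hosts line number out of an ssh host-key warning,
    e.g. 'Offending RSA key in /home/username/.ssh/known_hosts:42'.
    Returns None if no such line appears. Illustrative helper only."""
    match = re.search(r"known_hosts:(\d+)", warning_text)
    return int(match.group(1)) if match else None
```

Feed the result to the rmhost-style line deletion and the whole fix becomes hands-off.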

Recursively Find/Replace Inside Files Within a Directory

We recently had to change a handful of usernames in LDAP due to a merging of resources. This was a relatively painless process, but since some services use static authorization files to grant access, some manual post-processing was necessary. The script at the end of this post is something I came up with to deal with updating the subversion auth_files. It's a bash script that uses a couple of useful tricks:

tree -ifF --noreport /path/to/dir/ | grep -v '/$'
  • The tree command normally prints out an ASCII-graphical representation of the file structure rooted in the given path, recursively. The '-i' option tells it not to display the graphics. The '-f' option prints out the full path to each item. The '-F' adds file-type indicators to the end of file names, which I'm using here so I can filter out directories from the list using an inverse grep.
  • The output of the tree command is piped to grep. The '-v' option activates inverse grep, and the '/$' regex will match trailing slashes. This grep will match all lines not ending in '/'.
  • The tree command is not standard on all flavors/versions of *nix. It’s missing on OS X, for example.
perl -p -i -e 's|before|after|[ig]' file
  • This perl command will edit a file in-place, replacing occurrences of “before” with “after”.
  • Adding an i to the end of the substitution string makes it a case insensitive substitution.
  • Adding a g makes the command replace all instances, a.k.a. global, instead of just the first instance.

Why perl instead of sed for in-place edits?

Not all versions of sed allow in-place edits, especially older ones, so perl is the more universal option. If you know your sed can do in-place edits (check the man page for the '-i' option), then you can replace the perl line in the script below with this:

sed -i'' -e "s|$1|$2|g" "$afile"

Whether you choose to use perl or sed, you must remember to double-quote the substitution string so bash expands the variables and hands the values off to sed/perl. Using single quotes here would result in sed/perl looking for a literal ‘$1’ to replace with a literal ‘$2’.
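For comparison, the same recursive find-and-replace can be sketched in python in a few lines (my own illustration, not the auth_find script itself):

```python
import os

def replace_in_tree(root, before, after):
    """Recursively rewrite files under root, replacing every occurrence
    of `before` with `after` (plain text, not a regex). Illustrative
    sketch of the recursive find/replace described above."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path) as f:
                text = f.read()
            if before in text:
                # Only rewrite files that actually contain the string
                with open(path, "w") as f:
                    f.write(text.replace(before, after))
```

Because the replacement is plain text rather than a regex, there is no quoting subtlety here; in the shell version, the double quotes do the equivalent job.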

The Script

This code can easily be repurposed for other tasks, but I present it here as I wrote it for the subversion auth_files purpose. (I named it “auth_find”.)

#!/bin/bash
outfile=/root/auth_find-$1

UPDATE (8/31/09): Added “why perl instead of sed” section in response to comment.

Print PDFs as Postscript to an lpr Queue

I wrote a simple script recently for a user who was having trouble getting certain PDFs to print properly from his Linux box (Fedora 10). I first suggested that he try converting the PDFs to PostScript and printing the resulting files. That worked, but he found the process a bit tedious. Here's the script I wrote to take care of the tediousness. It relies on the standard (in Fedora, at least) pdf2ps package. It should be pretty self-explanatory.
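The approach is a thin wrapper around pdf2ps and lpr. A rough python sketch of the same idea, which only builds the two commands so the steps are easy to inspect (the helper name and the queue argument are mine, not from the original script):

```python
import os

def print_commands(pdf_path, queue):
    """Build the pdf2ps and lpr invocations a wrapper script would run:
    convert the PDF to PostScript, then submit the .ps file to the
    named lpr queue. Illustrative sketch only; it does not run them."""
    ps_path = os.path.splitext(pdf_path)[0] + ".ps"
    convert = ["pdf2ps", pdf_path, ps_path]
    submit = ["lpr", "-P", queue, ps_path]
    return convert, submit
```

Running each list through subprocess (and cleaning up the intermediate .ps file afterwards) gives you the one-shot "print this PDF" command the user wanted.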

Save the script to a location in your path (/usr/local/bin works) and you’re off.

Quickly Add a Userset to Many Sun Grid Engine Queues

This will be the first (of many??) posts to spill outside the topics one would think you’d find on a site with the name “Your Mac Guy”. You’ve been warned.

Back in January (of 2009) my primary work responsibilities shifted from Mac servers and desktops (and all that entailed) to Linux servers and desktops and the multitude of new things that entails (at least here where I work). One of the new tasks I’ve picked up is user administration of our Sun Grid Engine (SGE) 500-node cluster. New or existing users who want to submit jobs to the cluster need to be added to custom groups or, in SGE-speak, usersets. We create usersets for each lab, so if the user is part of a lab that doesn’t currently have access to submit jobs, I need to create a new userset and add that userset to each of 16 separate queues.

That last part, adding usersets to queues, is the most tedious part. So tedious, in fact, that it forced my hand into developing a scripted solution. I likely could have found an existing script to accomplish the task for me, but then I wouldn’t have had an excuse to brush up on my 3-years dormant perl skills.
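Whatever language you script it in, the gist of the fix is one qconf call per queue, appending the userset to that queue's user_lists attribute. A python sketch that just builds the commands (the queue names in the example and the helper itself are illustrative; verify the -aattr syntax against qconf(1) on your SGE version before running anything like this):

```python
def userset_commands(userset, queues):
    """Build one 'qconf -aattr' invocation per queue, each appending the
    userset to that queue's user_lists. Illustrative sketch; check the
    exact -aattr semantics for your Grid Engine version."""
    return [["qconf", "-aattr", "queue", "user_lists", userset, q]
            for q in queues]
```

Sixteen queues then become one short loop instead of sixteen trips through qmon.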

Continue reading