Wednesday, November 4, 2009

How I Learned to Stop Worrying and Love istat

I was recently tasked with organizing 137k+ small .jpg files into a folder structure based on year and quarter. Why? Users opening this directory with an FTP client complained that it took a "long time" to get a directory listing: apparently 15 to 20 minutes each time they opened the directory. Honestly, if a program didn't return anything in 15 minutes I would probably kill it and blame the server!

I really didn't think too much of the problem. In my head I thought "I'll just use 'find' and 'stat'", which would have worked perfectly EXCEPT that I had to do this on an AIX 4.3 server, and mounting the filesystem remotely was not an option.

There are a few problems with AIX 4.3. First, there is no 'stat' command. In AIX 5.x you can install the coreutils rpm from the AIX Toolbox to overcome this, but you are up the creek without a paddle in 4.3! Second, 'find' doesn't have all of the options you would usually have available on a newer Linux system. This was an issue in my case because I had to put the files into subdirectories (example: /basedirectory/2008/Q3), which meant that when searching the base directory for files to process I did not want to descend into the yearly and quarterly subdirectories. That is easy with the -maxdepth option, which is not available in 4.3.

I ended up getting around the lack of -maxdepth by using the -prune option of 'find' to exclude subdirectories from processing. Because the base directory did not contain any directories except 200{8,9}/Q{1..4}, the task was simplified even further: a single name pattern, 'Q*', matches every quarterly directory that needs pruning.
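A minimal sketch of the -prune idea, for illustration only. The helper name list_unsorted is hypothetical, and it relies on the same assumption as above: the year directories contain nothing but Q* directories, so pruning 'Q*' is enough.

```shell
# list_unsorted BASEDIR EXT   (hypothetical helper, not part of the script below)
# Print regular files under BASEDIR whose names end in EXT, without
# descending into any directory entry whose name matches 'Q*'.
# Note: 'find' still walks into the year directories themselves; this only
# works because they hold nothing except the Q* directories.
list_unsorted() {
    find "$1" -name 'Q*' -prune -o -name "*$2" -type f -print
}
```

Calling it as `list_unsorted /basedirectory .jpg` should list only the files that have not been sorted yet.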

The lack of the 'stat' command had me banging my head against 'ls' for a day or so... The problem I have with 'ls' is in trying to get the year from 'ls -l': it works great for files older than 180 days, but files newer than 180 days are listed with the time of day of the last modification in place of the year. I toyed with awk'ing the year/modification time column and checking whether the value was an integer. That does work, but you run into issues if your script is running in the first 180 days of a year and examining files from the previous year: those files are listed with times rather than years, so you have to compare the current month against the month of the file being examined to determine the correct year. That was logic I was uninterested in writing out.
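To show why this gets fiddly, here is a rough sketch of the two checks the 'ls -l' approach would need. Both function names are hypothetical, the sample listing lines are made up, and the year rule assumes no file has a timestamp in the future.

```shell
# Made-up 'ls -l' lines; field 8 is either a year or a time of day:
#   -rw-r--r-- 1 user staff 1024 Mar 14  2008 old.jpg
#   -rw-r--r-- 1 user staff 1024 Oct  2 09:15 new.jpg

# ls_year_or_time FIELD8
# A value containing a colon is a time of day, anything else is taken as a year.
ls_year_or_time() {
    case $1 in
        *:*) echo time ;;
        *)   echo year ;;
    esac
}

# infer_year FILE_MONTH CURRENT_MONTH CURRENT_YEAR
# For a file listed with a time: months are numbers 1-12 without leading
# zeros. A file month later than the current month must be from last year.
infer_year() {
    if [ "$1" -gt "$2" ]; then
        expr "$3" - 1
    else
        echo "$3"
    fi
}
```

On top of this you would still need to turn month names into numbers, which is where the script starts to grow.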

Enter my new most loved command in AIX: 'istat'
I was lucky enough to find a post mentioning 'istat', which "displays the i-node information for a particular file". 'istat' is similar to the Linux 'stat' command, although it does not allow you to modify the output using command line switches. Nothing a little grep and awk won't fix! What 'istat' does do is handily format the times of a file's last update, modification and access in an unambiguous manner: dates are always shown in the same format, unlike 'ls -l'. Without this tool I was writing a longer and longer script to deal with corner cases around files modified 180 days ago and files modified in the last 3 months of the year. With 'istat' I was able to make my script much simpler and rely on the computer to hand me information in a consistent format.
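As a small illustration of the "grep and awk" part, here is a sketch that pulls the month and year out of 'istat' output. The function name modified_month_year is hypothetical, and the line layout in the comment is an assumption inferred from the field numbers used in the script further down.

```shell
# modified_month_year   (hypothetical helper)
# Read 'istat' output on stdin and print "<month> <year>" taken from the
# "Last modified:" line. Assumed layout of that line:
#   Last modified: Mon Aug 10 14:20:56 2009
# The year is read as the last field, so an extra field before it (for
# example a time zone, if your 'istat' prints one) does not break it.
modified_month_year() {
    awk '/Last modified:/ { print $4, $NF }'
}
```

On AIX this would be used as `istat somefile.jpg | modified_month_year`.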

I would be surprised if anyone has to solve this same problem, but I will post the script anyway. As a warning, this script is slow: 'istat' is not a tool for performance! Also, working 'xargs' into the mix in place of the temporary file and the loop over it would make a more elegant solution.

In the following script I have disabled the actual move command, so it will only print what would happen! Uncomment the line beginning with 'mv' and it will move files.


#!/usr/bin/ksh
#
# organize files ending in $fileextension in $basedir
# by moving them into subdirectories $basedir/$year/$quarter
#


# backdate variable controls how many days old a file must be before
# it is considered for processing, 92 days is approx 3 months
# if you don't believe me ask google "3 months in days"
backdate=92

fileext=YOUR_FILE_EXTENTION
outfile=/tmp/jpg_organizer.out
basedir=YOUR_BASE_DIRECTORY

errors=0

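# function to calculate which quarter a month lives in
# reads $month (three-letter abbreviation) and sets $quarter
calculate_quarter() {
    case $month in
        Jan|Feb|Mar)
            quarter="Q1"
            ;;
        Apr|May|Jun)
            quarter="Q2"
            ;;
        Jul|Aug|Sep)
            quarter="Q3"
            ;;
        Oct|Nov|Dec)
            quarter="Q4"
            ;;
        *)
            # unexpected month (e.g. istat failed): stop rather than
            # reuse the quarter of the previous file
            echo "unrecognized month: '$month'"
            exit 1
            ;;
    esac
}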

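# rudimentary error checking
# call this immediately after the command whose exit status matters
error_check() {
    # save the status first: the [[ ]] test below would overwrite $?
    rc=$?
    let errors="$errors + $rc"
    if [[ $errors -gt 0 ]]; then
        echo "encountered an error, exiting"
        exit $rc
    fi
}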

# find files older than $backdate and move them into $basedir/$year/$quarter directories
find "$basedir" -name 'Q*' -prune -o -name "*$fileext" -mtime +$backdate -type f -print > "$outfile"
error_check
# read one file name per line so names containing spaces stay intact
while read -r filename ; do
    fileattrib=`istat "$filename" | grep "Last modified:"`
    month=`echo $fileattrib | awk '{print $4}'`
    year=`echo $fileattrib | awk '{print $7}'`
    calculate_quarter
    if [[ ! -d $basedir/$year/$quarter ]]; then
        mkdir -p "$basedir/$year/$quarter"
        error_check
    fi
    echo "moving:$filename to $basedir/$year/$quarter/"
    #mv "$filename" "$basedir/$year/$quarter/"
    error_check
done < "$outfile"

rm $outfile

exit 0
