So recently I was consulting on a project which required large scale data ingestion into a MySQL database, spread over multiple server instances running a data ingestion app in Python.
There were approximately 2 billion rows of smart meter data which had to be imported into MySQL, over the course of several days. (It was also simultaneously being ingested into Hbase, but let's ignore that for now).
The trouble was sometimes the data ingestion would halt for a variety of reasons. I needed to write a small Bash script which pinged the server instances regularly to ensure they were responding as expected, and the number of rows in the MySQL database was increasing.
Note: We also use Monit / Nagios and Ansible for other scenarios, but just assume these were not viable in this particular situation - thereby requiring the Bash script.
So the requirements for the script were:
Ping a number of server instances and verify a HTTP status code
Run a select query on a MySQL DB to ensure the number of rows is increasing
Send an email alert if either of these checks fail, up to a maximum of 3 alerts (I did not want to be flooded with emails).
Send a recovered email once our auto-recovery systems kick in and everything passes again.
Here is what I ended up using:
#!/bin/sh
# Step 1: Check server instances are ok.
# Step 2: Check that Mysql rows are increasing.
# Step 3: Send up to a maximum of 3 email alerts, then stop. If checks pass again, send recovered email.
ALERT_EMAIL=myemail@gmail.com
MAX_ALERTS=3
FAILED_SUBJECT="System FAILED!"
RECOVERED_SUBJECT="System Recovered"
IP_ADDRESSES=("10.1.1.1" "10.1.1.2" "10.1.1.3" "10.1.1.4" "10.1.1.5")
IP_PORT=8888
IP_ENDPOINT=my/api/endpoint
DB_HOST=mydb.us-west-2.rds.amazonaws.com
DB_USER=myawsuser
DB_PWD=myawspassword
DB_DATABASE=import_db
DB_TABLE=import_data
# Just some ANSI colors for prettifying the output
ANSI_END='\033[0m'
ANSI_RED='\033[91m'
ANSI_GREEN='\033[92m'
ANSI_YELLOW='\033[93m'
# First, initialize some variables.
ERRORS=()
SUCCESS=()
if [ -z "$MYSQL_COUNT" ]
then
export MYSQL_COUNT=0
fi
if [ -z "$NUM_ALERTS_SENT" ]
then
export NUM_ALERTS_SENT=0
fi
# Step 1: Loop through IP addresses and check http status code. Note this is a GET request, so a 404 is considered
# successful. A 500 response is a failure.
for ip in "${IP_ADDRESSES[@]}"
do
STATUS=$(curl -o /dev/null --silent --head --write-out '%{http_code}\n' http://$ip:${IP_PORT}/${IP_ENDPOINT})
if ((STATUS >= 500 ))
then
ERRORS+=("$ip has status $STATUS")
else
SUCCESS+=("$ip passes")
fi
done
# Step 2: Check MySQL rows are increasing.
COUNT=$(mysql -h $DB_HOST -u $DB_USER -p$DB_PWD $DB_DATABASE -e "SELECT COUNT(1) FROM $DB_TABLE" | head -2 | tail -1)
# Update the count as an env variable
OLD_COUNT=$MYSQL_COUNT
export MYSQL_COUNT=$COUNT
DIFFERENCE=`expr $COUNT - $OLD_COUNT`
COUNT_MSG="Now $COUNT records, previously $OLD_COUNT. Grew by $DIFFERENCE."
if (( $DIFFERENCE <= 10 ))
then
ERRORS+=("$COUNT_MSG")
else
SUCCESS+=("$COUNT_MSG")
fi
# Step 3: Send up to a maximum of 3 alerts if any errors since the last check.
if (( "${#ERRORS[@]}" > 0 ))
then
msg="Errors:\n"
for error in "${ERRORS[@]}"
do
msg+="${error}\n"
done
# Also add passing checks
if (( "${#SUCCESS[@]}" > 0 ))
then
msg+="\nPasses:\n"
for pass in "${SUCCESS[@]}"
do
msg+="${pass}\n"
done
fi
if (( $NUM_ALERTS_SENT < $MAX_ALERTS ))
then
echo -e $msg | mail -s "$FAILED_SUBJECT" "$ALERT_EMAIL"
output="${ANSI_RED}Sent Failed alert email $ANSI_END"
else
output="${ANSI_YELLOW}Already $NUM_ALERTS_SENT sent, email not sent. $ANSI_END"
fi
export NUM_ALERTS_SENT=`expr $NUM_ALERTS_SENT + 1`
else
# If check passes, reset the NUM_ALERTS_SENT counter and send email.
output="${ANSI_GREEN}Everything passes. MySQL Records grew $DIFFERENCE, currently $COUNT. $ANSI_END"
if (( $NUM_ALERTS_SENT > 0 ))
then
mail -s "$RECOVERED_SUBJECT" "$ALERT_EMAIL"
export NUM_ALERTS_SENT=0
fi
fi
echo -e $output
When using this script:
(obviously) you have to edit the variables at the start of the file
since we store environment variables (eg NUM_ALERTS_SENT) globally, you need to call this script within the context of your calling shell. You can use either source mycheck.sh or . mycheck.sh. Read more about how to export variables in bash.
I used this in a cronjob running every 5 minutes, namely:
*/5 * * * * . /home/myuser/system_check.sh
That's it, just a simple monitoring script...