Tuesday, June 2, 2009

Google Spreadsheet Server Monitoring

Monitor your websites using a Google Spreadsheet and some PHP

What do you mean the website is down?

So, your client calls you and tells you that the contact form on their website isn't working. One of their customers called them to tell them, and it looks like it's been down for awhile. Your client wonders why you're the last one to know - why do they pay for maintenance anyway?

This is the situation we faced too many times, years ago, and why we started monitoring our servers. We quickly went from being the last one to know when a website stopped working properly, to being the first. We also began collecting a lot of valuable data about the quality of our web hosting services. Further more, we did a kind of testing that really meant something real to us. Instead of just checking to see if a server was up we created "sensors", that we placed on client websites, and would do things like make a simple call to the website's actual database, emulating what the website did as closely as possible. This told us more about what the actual user experience was like, and about whether our servers were doing what they were supposed to, than just pinging a server to see if it was up.

A few weeks ago I was thinking about server monitor software and I realized that most of the mechanics behind the software is actually pretty simple; the more difficult part is the reporting side of things. Fortunately Google Spreadsheet has the ability to read data from external sources and wonderful graphs and gadgets (like speedometers) for translating the server monitoring data; and to display our information meaningfully and handily. I admit I'm a big fan of the Google docs webapps and I decided, mostly for fun, to try my hand at writing a server monitor with a little PHP and one Google spreadsheet.

Google Docs to the Rescue?

I decided to make a project out of building a server monitor that used a Google Spreadsheet for its front end. I was right in that the core was quite simple, but I admit I added a few unforeseen yet indispensable "enhancements" along the way (like data "archiving"). It worked (and was fun to do) so I decided to write a blog about how it works and also showing people how they may do something like it for themselves (and hopefully inspire people to create things like it - I may also do a series on other things that might be done like search engine ranking reports, etc). I provide all my code here, and instructions for creating the spreadsheet.

I started by defining what I wanted it to do, partly inspired by the kinds of things I know I can do with Google Docs. This was my list:
  1. It would send out an email/sms notification if something went down
  2. It would send out an email/sms if it went back up again
  3. I could view the status of all the sensors
  4. I could view detailed history for any single sensor
  5. It would email me a daily report
  6. I could share the 1live data with other people, and publish it back out to the Internet (again, live data)
Well, It worked out rather well. The only real drawback is it's not as immediate as I'd have liked (I want data updated by the second if I can get it). However, Google docs isn't going to poll your datasource every second (and for good reason!) so immediate updates aren't going to happen. However you can force an update if you need to and it refreshes often enough, I think, to stay quite useful.

Below I describe how you can make one for yourself. I'm not sure if I need to say this, but: I did this project and wrote this blog to amuse myself; and I provide this information here for your own benefit/amusement. It's up to you whether or not it's as dependable as is you want a server monitor to be and I'm not responsible if it doesn't work as well as you think it should (nor is it Google's).


What you'll need:

You'll need a Google account (of course) and a server (I used a shared Linux server at LunarPages) that isn't on a webserver you are monitoring. You'll have to write some server side script (I used PHP) and you'll need a database (I used MySQL). The sensors you create will be in whatever you use on your website currently (I show a couple of examples further on). You will also have to add two jobs to the cron so you'll need to make sure you have permission to create cron jobs (most of our hosting providers provide an interface for creating cron jobs in their control panel).

How it Works

The basic design has a group of small "things" (scripts and worksheets) working together to make it all work.

A cron job calls a script that tests all the sensors, and sends out notifications if necessary. The results are put in a database. The spreadsheet populates itself with the data from the database by calling some scripts, which pass back the data using CSV, and the spreadsheet uses that data to create all of our fancy charts and graphs. The settings for the application (e.g. the list of sensors) are also stored in the spreadsheet and the PHP scripts use that information to determine which sensors to call, etc. Finally a daily script sends out a summary email report and "compresses" old data to save space.

Step 1: Creating a Sensor

A sensor, in our terms, is a fairly simple thing. In fact it could be an existing web page if all you want to do is see if the site's web server is serving up pages. The server monitor simply tests to see if a sensor (which is a web page) returns an error code in it's header. With something like a database sensor we will "artificially" return an error code in the response header if there is a failure to send a query to the database.

Really you can create a sensor to test anything your want, even things on an application level (e.g. test to see if a variable has an expected value). Ideally a sensor would be able to tell that a website is acting entirely the way that it is supposed to, and short of regularly [24/7] parsing each page on the website for error codes, missing images, broken links and basically a rigorous testing régime, I think the sensor approach is about the best one can do (I would love to hear that I'm wrong - please comment below if you think I am).

I suggest, however, even if you are just going to test to see if a webserver is serving up pages that you make a special page, something simple, that has no linked resources, and is only called by your server monitor. Something like this:

and call it something like /sensors/websensor.html

For a database sensor I suggest making a very simple call to one of the database tables actually used by your website. Further more, if you use an include file for connecting to your database on your website, I suggest you use the same include file your site uses. More than once a sensor has told us of a problem when someone accidentally overwrote a connection string file with a file from a test/staging server (Human error is the biggest problem actually).

A typical database sensor, written in PHP, might look something like this:

Now we have a sensor that will tell you if the webserver is serving up pages AND is able to access your database. Please note that it doesn't matter what the server side code is here, I just used PHP in my example. We have sensors in multiple languages testing multiple aspects of our web sites; all that matters is if the sensor returns an error code in the response header or not.

Step 2: Creating our Spreadsheet


We're going to step away from our text editors/IDEs long enough to start our spreadsheet now. We'll start by creating the first 2 worksheet in the spreadsheet, which I call the "sensor list". I strongly suggest that you build your spreadsheet just the same way I did, in terms of labels and what rows and columns data is put in, and then play with it afterward when it's all working. It will be easiest to follow me if you start out more or less exactly as I describe.

This first worksheet is going to do two things: It's the place where we are going to list the sensors that the application will test (our Sensors Tester script is going to read this list to determine which sensors to call). Also, beside each sensor on the list (columns A & B), we're going to display the sensor's current status in terms of green, yellow and red "lights" (but we'll save that for Step 4). Set up your spreadsheet so that it looks like this:

(note that I have left the A and B columns blank for now - that's where we're going to display the sensor's status in Step 4)




The first column in our data (column C, the "Sensor ID" column), is what the server logs are keyed to. I used this method for two reasons: Using an integer means less storage space in the database (each log record can be associated with the appropriate sensor with as little as one byte of data) and if I keyed it to an existing field (e.g. sensor name) I wouldn't be able to edit that field without orphaning the sensor's previous data.

The second column is the name that will be used by the application when referring to the sensor (for instance, the email notifications will this name in their alerts). This will also be the name used on Graphs so try not to make and of these labels long if you can avoid it.

The third column is the email address that is used when sending out notifications for this server (use commas to list more than one address). Personally I like to have the monitor text message my cell phone when a server goes down. This can usually be done quite easily as many cell phone providers provide an email address that you can use to text your cell phone.

This is all we have to do to create new sensors. Create the sensor itself and install it on the corresponding server, and add the sensor to this list. Automatically the sensor will begin being scanned, we will be notified when there are issues, and we can see the sensor's status are read reports on it (as soon as there is enough data to do so).

IMPORTANT: Please note that Google Docs doesn't update the published document immediately after making changes; rather there is a lag time between when you change the document and when it republishes it. You can usually make the spreadsheet update the published version immediately if you turn off publishing and then turn it back on again.

We also need to create another worksheet that will contain some settings. We'll call this worksheet "Sensor and Report Settings", and in it you should create the following fields:




The yellow cell, B2, should contain the following formula:

=VLookup(B1,'Sensor List'!C3:D7, 2, FALSE)


Rather than create a separate set of worksheets for each sensor report, we're going to make one set of work sheets that will display the data from any sensor. We will control which sensor is being reported on by changing the number (sensorID) in the green box on this worksheet. The above formula gives you a little positive feedback by displaying the name of the sensor that you've just selected


Before our application can read these settings we must publish the spreadsheet. At the top select Share > Publish as web page... and you will get a dialog box where you can publish the document. Click Publish Now and then click More publishing options on the bottom of the dialog box. This is where you can create a URL for specific ranges of data. We're going to generate two URLs, one for the sensor list on the first worksheet, and the second to for the settings on the second worksheet. In the pop-up dialog that appears when you click More publishing options set the File Format to CSV, under What sheets select Sheet "Sensor List" only, and under What cells enter C3:F50 (I picked 50 at random, the number only has to be higher than the last sensor on your list, but equal to or less than the number of rows currently on the spreadsheet). Make a copy of this URL for yourself and generate one for the settings on the Sensor and Report Settings worksheet (cell range B3:B6).





Step 3: Testing our Sensors


Okay, now we have:
  1. a list of sensors to test,
  2. at least one sensor script installed on a web server that we will test.

Now, on the server that will be doing the testing (again, the server that you are using for your testing should not be on a server that you plan to test), we will add some PHP script and a MySQL database that will do the actual testing and send out notifications (if needed) and store the test results in our database. Clearly this is the nexus of our application.

Let's start by creating our database (I named my database sensors). Here is SQL for creating the tables that I am using:



This is a very straight forward set of tables. The sensor_log table is where we store the results of our tests, and the sensor_log_archive table is where we store our "compressed" data (our script archives data by taking the aggregate results for an entire day for each sensor and inserts it into the archive table, thus we reduce the amount of space by a factor of nearly 100 to 1).

Now we'll start creating our testing script. I call this script testallsensors.php. and I keep the script in a folder called sensors. The first thing you're going to need is a connection string to your database. Again I keep this in a separate file, like always, and call the file database_connection.php. My database connection looks something like this:

database_connection.php


(As you can see, I connect to the server and select my database in my connection script. I rarely have an application that accesses two databases on the same server so I find this most useful)

We're also going to read some data from our spreadsheet. I use a readCSV function that I found in the comments area of one of the PHP Manual pages, that I modified very slightly for this purpose (see http://www.php.net/fgetcsv). I keep the function in an include file as I use on multiple pages. It looks like this:

csv.php



Okay, now we'll start testallsensors.php. The first thing we'll do is import the settings from our spreadsheet. We'll start by ECHOing the settings to the script output so we can verify that the settings are being imported correctly. (make sure you replace the URL so that it's the URL for the settings that you determined in Step 2).

testallsensors.php (stage one)



If everything went well, you should see the settings you entered into your spreadsheet when you run this script. If it didn't work, check the URL and make sure that your spreadsheet is Published.

Lets go ahead an check out the rest of this script. I'll discuss it in detail below:

testallsensors.php (stage two)



Okay, there are a few things that need explaining here. I'll describe in plain language what's going on:

First the script reads the settings, as we reviewed in testallsensors.php (stage one).

The very outside loop is the retry loop. If any sensor fails (i.e. returns a status code other than 200) that wasn't failing previously, this loop continues until the sensor is either good, or the script has exhausted it's number of retries (the number of retries, and the length of time this script sleeps between each retry is determined in our settings). Each time a test is made, the lag time and the response code is INSERTed into the sensor_log table.

The loop nested inside the retry loop is a loop that goes through each sensor list (if there is more than one). I built the back end so it could serve more than one list of sensors (i.e. lists from multiple spreadsheets); I do this by looping through an array of sensor list URLs (the example has just one URL but you can add others). As it reads each sensor from the list, it calls it, gets the result and times the response time (what I refer to as lag).

I built it so one can have multiple sensor lists for organizational purposes. I thought there was pretty good chance that people would want to create separate spreadsheets for different customers etc. There's a lot to be said for having only one list though (one place to go to view the status of all of your sensors) and even if you have one sensor list you can, of course, create as many reports as you want for any sensorID on any number of spreadsheets. Just remember to never reuse a sensorID (as least not with the same backend and database). If you use the same sensorID on any sensor list twice, with the same back end/database, their data will get mixed together.

Within the inner loop we do our sensor test and if it comes back up after being previously down then an "Up" notification goes out. If a sensor returns $settings_failures failures after previously being up, then a "Down" notification also goes out. And whatever happens the result and the lag are inserting into the database with the current timestamp and the sensorID.

At the end of the loop we see if a sensor is failing (by which I mean, if it came back down but has not returned $settings_failures failures yet. If there is such a circumstance, the process sleeps for $settings_retry_minutes before it tries again.

Run a few tests with the script and make sure that it's properly reading the information from your spreadsheet, that it's conducting it's tests and INSERTing the test results in the database. Finally, simulate one of your sensors going down (you can do this by simply temporarily renaming a sensor so the script can't find it / gets a status code of 404) and make sure you get the notifications as the script detects the error. It's possible that your tests will time out on your webserver because of the sleep commands. This won't be an issue later as the cron will be running the script and it won't be running it using the webserver, but the timeouts can make testing difficult. If you get webserver timeouts, try shortening the $settings_retry_minutes and/or $settings_failures values temporarily for your tests or extend your server's timeout.

Important Note: If you edit this code, make sure you don't cause the script to go into an endless loop. Eventually, when you begin using the cron will thread a new instance of this script every 15 minutes and if the script isn't terminating correctly you could make your testing server very very unhappy.

I'm using LunarPages for hosting and they have a very easy to use option in their control panel for creating cron jobs, but every shared hosting service we use has a similar facility. Usually you have to call the script by passing it to the PHP interpreter (e.g. php /path/to/script/testallsensors.php). I find running the script every 15 minutes is perfect. More often than that isn't much more useful, and it may get your hosting services upset with you. If you have issues creating a cron job call your hosting service and they'll surely help you out.

When you are satisfied the script is working correctly, we'll move on displaying the current status for all the sensors in our spreadsheet.

Step 5: Current Server Status


To display the current server status, first we need to pull results from the database and populate a new worksheet with them. We'll do this by having the worksheet execute an importDATA function that calls a script that returns CSV values that the function will use to populate the worksheet. We'll start by creating the script that returns the CSV data:

sensorsummary.php




Simply put, this script looks at all of the sensors on a sensor list and returns their current status (again, as CSV data, which I used as it was easiest).

If you have more than one spreadsheet/sensor list using your back end you will either have to create one of these scripts for each of your spreadsheets or pass something to this script to that identifies which sensor list URL it should use.

Once you have tested the script and made sure it is indeed outputting the sensor status data correctly, you can go ahead and import the data into your spreadsheet. Go to your spreadsheet and create a worksheet called Sensor Status Data. The worksheet should look like this:




In the cell A2, insert the following function:

=importData("http://www.yourserver.com/sensors/sensorsummary.php?temp=" & INT(NOW()/TIME(0;10;0)))


The temp value appended onto the end of the function causes the filename to change every 10 minutes; this helps to keep the data fairly current. Please understand that Google can't poll your script every few seconds or anything like that; nor would it be a good idea anyway (that's a LOT of traffic). Normally the spreadsheet updates the pulled data at variable freqencies, presumably depending on how busy their servers are. There is a certain amount of lag time here, especially during heavy traffic periods, but you can force an update if you really need to make sure the spreadsheet is as current as possible, and you can call testallsensors.php if you need to re-pole the servers being tested (you can force the data to reload manually by editing the cell with the function and changing the cell contents - usually I just add a space to the end of the cell contents).

I also made a second version of this script that pulls in some more details and orders the data into columns rather than rows. It really is very close to the same data used here sensorsummary.php, but I'll include it here for convenience sake; some of the graphs and gadgets that are available in Google Spreadsheet require the data to be organized like this:

sensorsummary_detailed.php


Colouring Our Data

This is useful stuff; there's nothing like having problems stand out in red. Google Docs Spreadsheet has a very easy mechanism for colouring your cells based on rules. In my spreadsheet I took all of the values, including the values in the worksheets that contain the imported data, and made them so that they changed colour based on their value. This makes it really easy to spot trouble.

Colouring Lag Values

Ideally I think I'd like to base the colours on tolerances within what would be considered normal for a specific sensor, but in my example I used a gross scale that I apply to all the values. The values I used are:
  1. < 500 = good (green),
  2. 500 - 1000 = medium (yellow),
  3. > 1000 = bad (red)
Please note that some kinds of sensors are going to naturally take longer to return a result than others. For instance, a "web sensor" doesn't have to have any server side code at all, where as a "database sensor" needs to open a connection to the database server, run a query and inspect the results.

The diagram on the right shows the exact settings I used. Make sure you select the entire column (or at least from row 2 to the bottom of your worksheet) that you intend to create your rules for before you create your rules.

Colouring Errors

Errors are easier to colour because there are only two states: error (red) and no error (green). On the front worksheet (the sensor list), beside each sensor in the first column, I made a "light" by inserting the error value from the worksheet that I pull the sensor status into (I conveniently return the values in the same order that they appear on the list). I then colour the text so you can't see the value at all, just bright green or red by making the rule change the text colour so that it's the same as the background colour.

Okay! We've come a long way now. We have the sensors being tested, notifications being sent out, data being stored, and the results coloured with current status lights beside each of the sensors on our sensor list. Now we just need to create useful reports on individual sensors, daily sensor reports, and just to be thorough, we're going to archive/compress our old data.

Creating our Detailed History Report

History reports allow us to get a bigger picture of a sensor's status and allow us to see in finer detail what went wrong and when. When I first started this project I wasn't quite sure how I was going to create a history report (which requires multiple worksheets) for each sensor. It soon occurred to me that if I create an adaptable report, where I could change one setting and have the report populate itself with the data from any sensor, I would save the end user (in this case, me!) a lot of work (and if we ever need to send someone a copy of a sensor history report, we can always "hard wire" a copy of the report for that specific use).

The way that I chose to do this was by creating a cell (that I colour Green) on the Sensors and Report Settings worksheet where the user enters the sensor ID that they want to create a report for (if someone can think of a way to do this with some kind of select box or something I'd like to hear from you). For convenvience sake, I chose this page to display a list of the sensors and their hourly averages (the list also shows the current hour compared to the same hour's recent historical average) so that the user has the sensor list handy.

When the sensor ID is changed, two other worksheets are populated by calling two scripts that return 24 hour and 10 data historical data for the sensor indicated in the green box. Usually the data is populated within a few seconds of changing the [sensor ID] number in this cell, because changing the cell value alters the URLs that the data is read from and that typically triggers a [nearly] immediate update.

Then, finally, I have a fourth worksheet (titled the "Sensor Report") that shows two graphs based on the historical data for the sensor ID entered. The 24 hour graph shows actual values, where as the 10 day graph shows hourly averages for that period.

I pulled in the data by entering the following formulas in the A2 cells on the 24 Hour Error Trend Data and the 10 Day Error Trend Data worksheets:


for the 24 Hour Error Trend Data worksheet:


for the 10 Day Error Trend Data worksheet:



Graphing the Data

I used the Interactive Time Series graphs for this report. I found they were a good way of allowing the end user to examine any part of the data easily. Please note that these graphs will not be able to display any data until enough data collected first. You can [patiently] wait until there's enough data or you can generate some test data in the database if you are feeling particularly impatient.

Use the following for your Range in the graph's settings:


'24 Hour Error Trend Data'!A2:C130

and
'10 Day Error Trend Data'!A2:C250

(note that the end of these two ranges can't go beyond the end of the last row that you actually have in these two work sheets. I padded the worksheets with extra rows because, at least with the 24 hour data, you can't know exactly how many actual readings there will be (because bad sensor readings generate extra follow-up readings to verify the trouble wasn't just some temporary network fluctuation - plus you may have triggers several test readings).

And the following are the two PHP scripts that are called to pull in the data:

24hours.php


10days.php

Creating the Daily Report and Maintaining Our Database

We're going to take care of both of these tasks with one script that we'll have the cron call every 24 hours:

The Daily Report

The daily report is a report that we'll have emailed to us first thing in our day so that we can see at a glance how our servers have been doing in the last 24 hours. The report I made is quite simple in the it shows the sensor list, as well as each sensor's up time and lag time, for the most recent 24 hours. For each value I use a small function that calculates an appropriate RGB value (colour) for each of the values shown on the sensor list.

"Archiving" old data

Also, as it would be unnecessary (or even excessive) to keep every ping value in perpetuity, we take data that is old (I use >2 weeks) and then average the values for each day we are archiving and place the averaged lag/up time values in another table (sensor_log_archive); deleting the old sensor values as we go. This should make your database nearly 100 times smaller than it would be otherwise. Here is the PHP script that I created for the report. Remember that you will have to create a cronjob that will call the script once a day.

daily.php

Remember now, to add new sensors all you have to do is upload a sensor script to the server you are monitoring and add 1 line to the sensor list. The application will see the published list and start calling the sensor. I have several little extra features I've added to my spreadsheet (e.g. a call that tells the user the next unused sensorID) that I'd be happy to share with people (I just don't want to turn this article into a book - grin).

Happy Days are Here Again!

Oh, Happy Customers! Now you are the first to know when a website goes down. What's more, you've got charts and graphs to show your customers the great service they are getting and demonstrate the diligence you show on their behalf. And, you have historical data that you can compare and will give you are better idea of how well your servers are performing, as well as providing you with data that you can use when working with your providers to help diagnose issues, identify bottlenecks and improve service [where needed]. If you like this post, and you'd like to see some more projects along these lines, drop me a line (especially if you have ideas you'd like to contribute). If there is enough interest I will consider doing a series on using Google Spreadsheets as front ends to other types of reports and monitoring webapps.

  1. By "live data" I refer to data that is connected to an external data source, and contains the most current information available.
  2. A worksheet is like a page within your spreadsheet. You can select between, and create, worksheets using the tabs at the bottom of your spreadsheet.

9 comments :

  1. This is incredible, please let me know if you'd be willing to help implement this for an idea I have to improve a process. Possible compensation from grant money to implement this in the near future, I love the concept and think it would be an amazing technology to partner with this dataset.

    ReplyDelete
  2. Great! This is very useful info. Love Google Docs, especially Spreadsheets. Thank you

    ReplyDelete
  3. if I use i=ImportData(http:xxx.com/sensorsummary.php") it doesn't work.
    But if I cut and paste the data outputted by that file as a .csv. file and point to that - it works.
    Any suggestions?
    thanks

    ReplyDelete
  4. Anonymous: Sorry I didn't get back to you before this.

    First off, you're missing a quote in your importData command.

    But maybe that's just a comment typo. Possibly the problem has something to do with newline characters or something (maybe your text editor is converting them). Try comparing the two outputs character by character.

    ReplyDelete
  5. This comment has been removed by a blog administrator.

    ReplyDelete
  6. Think I'll give this a try with external data-sources in gDocs.
    One can extract an ID from an request with some less code:

    $req = array_merge($_GET, $_POST);
    if(isset($req['sensorID']) && is_numeric($req['sensorID'])) {
    $sensorID = (int)$req['sensorID']
    }
    else {
    die('No sensorID given.');
    }

    ReplyDelete
  7. This comment has been removed by a blog administrator.

    ReplyDelete
  8. Hi again,

    I’m worried my previous message didn’t arrive with the person responsible for this, could you connect me to the right person or send me the email of the person in charge of this?

    I’m the Assistant affiliate manager of monitive.com and I saw that you had already tagged us in some related content here in this post.

    https://monitive.com is an uptime monitoring tool in an easy-to-understand language with an intuitive dashboard. Most of our customers have small and medium sized online businesses (think ecommerce businesses, or solopreneurs selling courses/ ebooks/ services online)

    The problem is, a lot of people assume their hosting company is good and their site is up. Which almost never happens. And I know there aren’t many interesting affiliate Uptime Monitor Affiliate programs out there besides us.

    I want to make sure you join the program while we have higher commissions (20%). Once you join of course we’ll grandfather you into it and you’ll also earn every time one of your referrals renews (another 20%)

    If you ARE the right person, all you need to do is reply to this message and I’ll set everything up for you. Of course I can also share more info or answer any questions you may have
    Best,
    Sophia

    ReplyDelete