Tuesday, November 24, 2009

Things learned from Website moves

I recently moved several client sites off of servers that they had been on for the better part of a decade to new VPS boxes. They had MySql backends and used Java/Velocity CMS system (InfoGlue) hosted on Tomcat with an AJP connector to apache web servers on the front end. One site uses htdig for it's search implementation.

Issues (with resolutions) that I ran into...

1. GZIP'ing a running Tomcat instance
Before one of the sites was moved, a client tried to make a backup of the Tomcat instance that they were running under. They executed a:
gzip -r filename.gz ./tomcat/
which ended up recursively zipping all files under the tomcat directory. This crashed the running instance of Tomcat. (Perhaps trying to tar or gzip a running tomcat instance isn't a great idea. Copy it first and then tar/gzip that) Everything was quickly unzipped and they tried to restart Tomcat but it wouldn't start. After an hour of search, I was asked to help. I poked and prodded for about 90 minutes and discovered that all the webapps were missing the
file. I don't know how/why gzip and upzipping would make these files disappear, but they were there anymore. So I replaced them and tried restarting Tomcat and everything came up.

2. Problems with MysqlDump
I had a couple of problems backing up the MySql database. First MysqlDump would execute, but not put all of the tables into the backup file. I discovered that I had run out of disk space. Then it turned out that mysqldump doesn't care about relational integrity when it creates the backup script so when I tried to import it into the new database I ran into errors complaining about foreign keys. To resolve this, I had to add set foreign_key_checks=0; at the top of my backup script. This allowed me to import it successfully.

3. Mysql connection pool Exhausted

After getting the db imported and the web server moved over, the application seemed to be running fairly nicely on the new VM's. Then I left it alone for a couple of hours. When I came back it was throwing connection pool exhausted exceptions all over the place. I could resolve this by logging back into InfoGlue's cms instance, but that wasn't the right resolution for me. So I did some checking. Many people suggest to set the wait_timeout parameter in the my.cfg file for mysql higher or lower. I tried that and it didn't seem to work. What I ended up doing was adding a line in my database.xml file:
<param name="validation-query" value="select * from someTable" />
This helps keep the connection alive by pinging the db with the query every so often.

4. MySql Data Directory Change
While trying to resolve the previous issue I was changing/adding parameters into the mysql my.cfg file. I did a copy/paste which had a number of properties, one of which was the pointer to the data directory for mysql. This pointer was different than what I was using. As a result, when I restarted mysql, all of my databases, tables, users, and data was gone. I freaked. Then after thinking about things for a little bit and checking through files one more time I realized the mistake I had made and commented that line out. Restarting mysqld brought all my databases, users, and data back again. Whew.

5. Getting HtDig going again
The original installation of HtDig on the old server hadn't been indexed in over two years because cron had been broken. I didn't have access to the root password to fix this problem so the client was very interested in getting it running again on the new server. I had no previous experience with HtDig.
I copied over all the files I could find related to HtDig off of the old server and installed them on the new box. After a few tries, re-indexing worked, but I still had a problem with displaying the search results page in the web site. It turned out that I was missing a virtual directory configuration in my httpd.conf file for the directory where htsearch was running from (as a cgi script). The only reason I figured that out was by using lynx (linux CLI web browser). After fixing that, I got my newly indexed results displaying on the web site.

6. Using /etc/hosts helps

I've found that the /etc/hosts file is a big help in moving sites like this - whether you want to test the site while the live site is still running, or quickly configure pointers to a server dependency.

7. MySql sql queries are case sensitive
One of the website moves I did involved porting a MySql database from a Windows box to a CentOs box. The dump on the Windows box made all characters for tables lower case. I didn't pay too much attention to this at the time. They imported the same way.
When I started up my Tomcat server, it tried to start up, but threw errors related to jdbc connection pool and a ValidateObject. After a bit of googling I discovered that this is related to the validate query (the query I wrote about earlier that checks the connection every so often to make sure it hasn't gone stale). I tried running that query right on the mysql box and it would run because the table name was all lower case. So I changed all my table names to camel case and things worked.

8. Issues with unknownDefaultHost and serverNodeName in Tomcat
My inability to set the serverNodeName was resolved by adding a line in my hosts file to point the IP of my box to the Host variable found when I run the 'set' comment in CentOS. My issue with unknowDefaultHost was resolved by going into the server.xml file and editing the Engine element's defaultHost attribute - changing it from my old domain name to 'localhost'

9. MySql error 1153 - Got packet bigger than 'max_allowed_packet' bytes
Got this error a couple of times with different DB imports. To resolve I needed to go to the my.cnf file (found sometimes in /etc/) and set that property like this:
for the mysqld (mysql daemon). Then I had to restart the daemon (/etc/init.d/mysqld restart) and I could run my import (mysql -p -h localhost dbname < dbImportFile.sql)

Wednesday, November 11, 2009

Musing on Automated Deployments

I have been a key player in big automated deployment strategies in two significantly sized organizations now. One used ant with a java code base, the other used Visual Build with a VB code base. Both of these implementations deployed multiple dependent projects onto a variety of server types into development, testing, staging, and production environments. With the exception of prod, each environment had more than one instance of the environment running.

Some of my earlier musings on automated builds and deploys can be found by clicking here.

One would think that an automated deployment would be deterministic. In other words, given the logic in the deployment file(s), it should deploy the same every time. Surprisingly, we have found this not always true. Since many of these deploys are pushed to remote boxes, hiccups in the network end up throwing a proverbial wrench into things. And (again) surprisingly, these can occur more often than I would've though. We actually blamed these hiccups on increased solar activity for a while. I have no solutions to getting around these network hiccups, except to say that if you see your deployments failing consistenly at a certain time during the day, schedule them for another time. Our Sunday evening deploys lately have always been failing. Yet when we kick them off Monday morning (with no changes to deployment logic) everything this fine. We're thinking that there's possibly a weekly batch job or two that are running during our Sunday deploy that is bogging the network down....
I've also seen automated deploys act inconsistently (only with Windows) with registering dll's in the assembly. We can deploy and Gac things fine onto our bare metal, VM servers with no problem. Yet, when we deploy the same software onto a legacy hardware server where the dll's are already gac'ed (our deployment logic un-gac'ing and re-gac'ing the dll's) they fail to register it seems. I've wondered if perhaps the deployment moves through all the logic too fast? The command is definitely correct. Sometimes we'll even see the dll's in the assembly folder in the GUI, but the application can't. Manually registering them from the command line fixes the problem, but we shouldn't have to do that.

Something else to consider when implementing automated deploys - do you want to deploy everything from scratch (bare metal deploy) or do you want to deploy onto an already working image or server (overlay deploy)? I have tossed this question around a number of times. I think the correct answer for you depends on how you answer the following questions:
Are you thinking about deploying to a system that's already running in production? Are all the configurations that make that production system work documented? Are you confident that you could rebuild the production server and getting it running without any major problems? If you answer 'yes' to all of these questions, then you could probably save some time and implement overlaying automated builds. If you are starting work on a greenfield (new) application or you aren't confident that you could rebuild you production server, then you should probably consider bare metal deploys. Bare metal deploys done properly essentially become self documenting DRP's.

Cookies and Perl

I have a client who is using and old (outdated and unsupported) php/perl CMS as an intranet site. In implementing a DRP (Disaster Recovery Plan) they switched all the server references in the code and on the server from the server IP address to a DNS name. This was done using a search and replace :-(
This effectively broke the intranet site. They were able to get most of it back up and running with the exception of logins and how sessions were managed. For some reason, once this CMS saw a domain name instead of an IP address, it changed the path in the cookie to something like this:
from what worked before which was:
We searched through config files and didn't find anything conclusive to begin with. Then we found some code where cookies were being created and modified it so that after the cookie string was created we did some string manipulation on it like this:
$setcookie =~ s/http:\/\/mywebserver//m
This resets the setcookie var by searching through the setcookie var (it's a multiple line string so we needed to use /m) for the string 'http://mywebserver', replacing it with nothing.
It turned out that the CMS was creating more than one cookie, so we found this bit of code in two more files, added our hack there, and sessions worked again!
I have to confess, I was surprised that this little hack worked.