I recently moved several client sites off servers that they had been on for the better part of a decade to new VPS boxes. They had MySQL backends and used a Java/Velocity CMS (InfoGlue) hosted on Tomcat, with an AJP connector to Apache web servers on the front end. One site uses htdig for its search implementation.
Issues (with resolutions) that I ran into...
1. GZIP running Tomcat instance
Before one of the sites was moved, a client tried to make a backup of the Tomcat instance that they were running under. They executed a:
gzip -r filename.gz ./tomcat/
which ended up recursively gzipping every file under the tomcat directory in place. This crashed the running instance of Tomcat. (Trying to tar or gzip a running Tomcat instance isn't a great idea; copy it first and then tar/gzip the copy.) Everything was quickly unzipped and they tried to restart Tomcat, but it wouldn't start. After an hour of searching, I was asked to help. I poked and prodded for about 90 minutes and discovered that all the webapps were missing the
/WEB-INF/web.xml
file. I don't know how or why gzipping and unzipping would make these files disappear, but they weren't there anymore. So I replaced them, restarted Tomcat, and everything came up.
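A minimal sketch of the copy-first approach, assuming a POSIX shell with cp, tar, and mktemp available. The function name backup_dir and the paths in the usage note are hypothetical. Stopping Tomcat first would be safer still, but at least this way the archive step never touches the live files.

```shell
# Hypothetical helper: snapshot a directory, then archive the snapshot.
# The live directory is only read, never modified in place.
backup_dir() {
  src=$1        # directory to back up, e.g. ./tomcat
  archive=$2    # output file, e.g. /backups/tomcat.tar.gz
  staging=$(mktemp -d) || return 1
  # Copy first so tar works on a stable snapshot, not the running instance.
  cp -a "$src" "$staging/" || { rm -rf "$staging"; return 1; }
  tar -czf "$archive" -C "$staging" "$(basename "$src")"
  status=$?
  rm -rf "$staging"
  return $status
}
```

Usage would be something like backup_dir ./tomcat /backups/tomcat.tar.gz, which leaves ./tomcat exactly as it was.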
2. Problems with MysqlDump
I had a couple of problems backing up the MySQL database. First, mysqldump would execute but not put all of the tables into the backup file; I discovered that I had run out of disk space. Then it turned out that mysqldump doesn't take relational integrity into account when it writes the backup script, so when I tried to import it into the new database I ran into errors complaining about foreign keys. To resolve this, I had to add set foreign_key_checks=0; at the top of my backup script. This allowed me to import it successfully.
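A minimal sketch of both guards, assuming a POSIX shell with df and awk. The helper names has_free_kb and wrap_dump are hypothetical, and so are any file names you pass to them. The second helper also switches the checks back on at the end of the import, which is an addition to what is described above.

```shell
# Hypothetical helper: succeed only if the filesystem holding directory $1
# has at least $2 kilobytes free (df -P prints one POSIX-format line per fs).
has_free_kb() {
  avail=$(df -Pk "$1" | awk 'NR==2 {print $4}')
  [ "$avail" -ge "$2" ]
}

# Hypothetical helper: wrap an existing dump so foreign key checks are
# disabled during the import and switched back on afterwards.
wrap_dump() {
  dump=$1       # file produced by mysqldump
  wrapped=$2    # new file to import instead
  {
    printf 'SET FOREIGN_KEY_CHECKS=0;\n'
    cat "$dump"
    printf 'SET FOREIGN_KEY_CHECKS=1;\n'
  } > "$wrapped"
}
```

Running has_free_kb against the backup directory before the dump would have caught the truncated backup up front instead of after the fact.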
3. Mysql connection pool Exhausted
After getting the db imported and the web server moved over, the application seemed to be running fairly nicely on the new VMs. Then I left it alone for a couple of hours. When I came back it was throwing connection pool exhausted exceptions all over the place. I could clear this by logging back into InfoGlue's cms instance, but that wasn't a real fix. So I did some checking. Many people suggest raising or lowering the wait_timeout parameter in MySQL's my.cnf file. I tried that and it didn't seem to work. What I ended up doing was adding a line to my database.xml file:
<param name="validation-query" value="select * from someTable" />
This helps keep the connection alive by pinging the db with the query every so often.
4. MySql Data Directory Change
While trying to resolve the previous issue I was changing and adding parameters in the MySQL my.cnf file. I copied and pasted a block containing a number of properties, one of which was the pointer to the MySQL data directory. This pointer was different from the one I was using. As a result, when I restarted mysql, all of my databases, tables, users, and data were gone. I freaked. Then, after thinking about things for a little bit and checking through the files one more time, I realized the mistake I had made and commented that line out. Restarting mysqld brought all my databases, users, and data back again. Whew.
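A quick check before restarting would have saved the scare. This is a minimal sketch, assuming a POSIX shell with grep. The helper name show_datadir is hypothetical, and the config path in the usage note is an assumption that varies by distribution.

```shell
# Hypothetical helper: print every active (uncommented) datadir setting in
# a MySQL config file, with line numbers, so a pasted override stands out.
show_datadir() {
  grep -nE '^[[:space:]]*datadir[[:space:]]*=' "$1"
}
```

For example, show_datadir /etc/my.cnf prints nothing if no datadir is set, and more than one line if a pasted block has added a second one.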
5. Getting HtDig going again
The original installation of HtDig on the old server hadn't re-indexed in over two years because cron had been broken. I didn't have the root password to fix this on the old server, so the client was very interested in getting it running again on the new one. I had no previous experience with HtDig.
I copied over all the files I could find related to HtDig from the old server and installed them on the new box. After a few tries, re-indexing worked, but I still had a problem displaying the search results page in the web site. It turned out that I was missing a virtual directory configuration in my httpd.conf file for the directory that htsearch was running from (as a cgi script). The only reason I figured that out was by using lynx (a Linux CLI web browser). After fixing that, I got my newly indexed results displaying on the web site.
6. Using /etc/hosts helps
I've found that the /etc/hosts file is a big help in moving sites like this - whether you want to test the site while the live site is still running, or quickly configure pointers to a server dependency.
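As a small illustration, assuming a POSIX shell: the helper name hosts_entry is hypothetical, and the IP address and hostname are documentation placeholders rather than real servers.

```shell
# Hypothetical helper: format one /etc/hosts entry (IP, a tab, hostname).
hosts_entry() {
  printf '%s\t%s\n' "$1" "$2"
}

# Example (needs root): make this machine alone resolve the live hostname
# to the new server, so the new site can be tested before DNS changes.
# hosts_entry 203.0.113.10 www.example.com | sudo tee -a /etc/hosts
```

Remember to remove the entry once DNS points at the new box, or the machine will keep ignoring later DNS changes.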
7. MySQL table names are case sensitive
One of the website moves I did involved porting a MySQL database from a Windows box to a CentOS box. The dump on the Windows box made all the table names lower case. I didn't pay too much attention to this at the time. They imported the same way.
When I started up my Tomcat server, it tried to start, but threw errors related to the jdbc connection pool and a ValidateObject. After a bit of googling I discovered that this is related to the validation query (the query I wrote about earlier that checks the connection every so often to make sure it hasn't gone stale). I tried running that query right on the mysql box and it wouldn't run, because the table names there were all lower case and, by default, MySQL table names are case sensitive on Linux. So I changed all my table names to camel case and things worked.
8. Issues with unknownDefaultHost and serverNodeName in Tomcat
My inability to set the serverNodeName was resolved by adding a line to my hosts file that points the IP of my box to the Host variable shown when I run the 'set' command in CentOS. My issue with unknownDefaultHost was resolved by going into the server.xml file and editing the Engine element's defaultHost attribute, changing it from my old domain name to 'localhost'.
Tuesday, November 24, 2009
Wednesday, November 11, 2009
Musing on Automated Deployments
I have been a key player in big automated deployment strategies in two significantly sized organizations now. One used Ant with a Java code base, the other used Visual Build with a VB code base. Both of these implementations deployed multiple dependent projects onto a variety of server types in development, testing, staging, and production environments. With the exception of prod, each environment had more than one instance running.
Some of my earlier musings on automated builds and deploys can be found by clicking here.
One would think that an automated deployment would be deterministic. In other words, given the logic in the deployment file(s), it should deploy the same way every time. Surprisingly, we have found this is not always true. Since many of these deploys are pushed to remote boxes, hiccups in the network end up throwing a proverbial wrench into things. And (again) surprisingly, these can occur more often than I would've thought. We actually blamed these hiccups on increased solar activity for a while. I have no solutions for getting around these network hiccups, except to say that if you see your deployments failing consistently at a certain time of day, schedule them for another time. Our Sunday evening deploys have always been failing lately. Yet when we kick them off Monday morning (with no changes to deployment logic) everything is fine. We're thinking that there's possibly a weekly batch job or two running during our Sunday deploy that is bogging the network down.
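Rescheduling is what worked for us. Another common mitigation for transient network failures, which we did not use and which does nothing about the underlying cause, is to retry the failing step a few times. A minimal sketch, assuming a POSIX shell; the helper name retry and the deploy script in the usage note are hypothetical.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $n of $attempts failed: $*" >&2
    n=$((n + 1))
    # Only pause if another attempt is coming.
    [ "$n" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

Something like retry 3 60 ./push_to_staging.sh would try the push three times, a minute apart, before giving up. Retrying only makes sense for steps that are safe to run twice.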
I've also seen automated deploys act inconsistently (only on Windows) when registering DLLs in the global assembly cache (GAC). We can deploy and GAC things onto our bare metal VM servers with no problem. Yet when we deploy the same software onto a legacy hardware server where the DLLs are already GAC'ed (our deployment logic un-GAC'ing and re-GAC'ing them), they seem to fail to register. I've wondered if perhaps the deployment moves through all the logic too fast. The command is definitely correct. Sometimes we can even see the DLLs in the assembly folder in the GUI, but the application can't find them. Manually registering them from the command line fixes the problem, but we shouldn't have to do that.
Something else to consider when implementing automated deploys - do you want to deploy everything from scratch (bare metal deploy) or do you want to deploy onto an already working image or server (overlay deploy)? I have tossed this question around a number of times. I think the correct answer for you depends on how you answer the following questions:
Are you thinking about deploying to a system that's already running in production? Are all the configurations that make that production system work documented? Are you confident that you could rebuild the production server and get it running without any major problems? If you answer 'yes' to all of these questions, then you could probably save some time and implement overlaying automated builds. If you are starting work on a greenfield (new) application, or you aren't confident that you could rebuild your production server, then you should probably consider bare metal deploys. Bare metal deploys done properly essentially become self-documenting DRPs (disaster recovery plans).
Cookies and Perl
I have a client who is using an old (outdated and unsupported) PHP/Perl CMS as an intranet site. In implementing a DRP (Disaster Recovery Plan) they switched all the server references in the code and on the server from the server IP address to a DNS name. This was done using a search and replace :-(
This effectively broke the intranet site. They were able to get most of it back up and running with the exception of logins and how sessions were managed. For some reason, once this CMS saw a domain name instead of an IP address, it changed the path in the cookie to something like this:
mywebserver/http://mywebserver/somesite
from what worked before which was:
mywebserver/somesite
We searched through config files and didn't find anything conclusive to begin with. Then we found some code where cookies were being created and modified it so that after the cookie string was created we did some string manipulation on it like this:
$setcookie =~ s/http:\/\/mywebserver//m;
This rewrites the setcookie var in place, searching it for the first occurrence of the string 'http://mywebserver' and replacing it with nothing. (We added /m because it's a multiple line string; strictly speaking, /m only changes how ^ and $ match, so it is harmless here but not required.)
It turned out that the CMS was creating more than one cookie, so we found this bit of code in two more files, added our hack there, and sessions worked again!
I have to confess, I was surprised that this little hack worked.
Tuesday, October 27, 2009
IIS default app pool proc terminated
Here's a great link to error codes for w3wp (event id 1009) and what they likely mean:
http://blogs.iis.net/brian-murphy-booth/archive/2007/03/22/how-to-troubleshoot-an-iis-event-id-1009-error.aspx
My current issue is error 0x0, which is covered in this link, but only sparsely. I checked the debugger flags with gflags and everything was unchecked. I'm still looking for a resolution to my issue.
Wednesday, October 14, 2009
Tools for working with SVN & Visual Studio
We're moving to Subversion as our code repo. Tools that we are using are:
- VisualSVN Server on the server box for maintaining the svn server (managing users, creating/importing new repos, etc)
- TortoiseSVN on client boxes. This works in conjunction with Windows Explorer to tell you the status of files in your local copy of the repo. We've found that the icons don't change status immediately, so you need to be patient with them.
- Collabnet AnkhSVN, a Subversion plug-in for Visual Studio. It allows you to see the status of files and check them in and out inside of VS.
- Collabnet SVN command line client. We found we needed this (particularly svn.exe in the PATH environment variable) if we wanted to run the SVN steps in a Visual Build file.
Wednesday, October 7, 2009
Monitor the right things
I started re-reading Release It! by Michael Nygard this morning on the commute into work. In his first chapter he talks about a (very small) issue that turns into a colossus and takes down an airline's check-in system. The system had a monitor configured and performing checks on it, but it turned out that it wasn't checking the right things (it was looking at the http port on transactional servers when it should have been looking at the RMI port).
It totally reminded me of something that happened about a month ago. We have a bunch of web applications that run on our production server. After fine-tuning our monitoring to look at pages that the application has to apply logic to in order to serve up (rather than a static home page), we found that our monitoring corresponded much more closely to complaints from users.
Think twice about what you want to monitor and where to point it.
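A minimal sketch of that kind of check, assuming a POSIX shell with grep, plus curl for the fetch. The helper name page_has_marker, the URL, and the marker text are all hypothetical placeholders.

```shell
# Hypothetical check: read a fetched page on stdin and succeed only if it
# contains a marker string that the application has to render dynamically.
page_has_marker() {
  grep -qF -- "$1"
}

# Example wiring (curl -f turns HTTP error responses into a failure, so an
# error page produces no body and the marker check fails):
# curl -fsS --max-time 10 'https://intranet.example.com/search?q=test' \
#   | page_has_marker 'results found' || echo 'ALERT: dynamic page check failed'
```

The point is that the marker should only appear when the application logic (and whatever it depends on) actually ran, which a static home page can't tell you.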
Monday, October 5, 2009
MS Sql Server bug
We ran into a SQL Server bug today that was rather interesting. We were implementing synonyms across a number of views, tables, and stored procs in a couple of DBs. Everything was working fine until another team did a deploy and changed the index on a view that was referenced by one of our synonyms. It turns out that there is a documented bug whereby any time the DDL changes on a view that is referenced by a synonym, that synonym loses its connection to the view. This includes just updating the index on the view.