Tuesday, November 24, 2009

Things learned from Website moves

I recently moved several client sites off of servers they had been on for the better part of a decade to new VPS boxes. They had MySQL backends and used a Java/Velocity CMS (InfoGlue) hosted on Tomcat, with an AJP connector to Apache web servers on the front end. One site uses htdig for its search implementation.

Issues (with resolutions) that I ran into...

1. GZIP'ing a running Tomcat instance
Before one of the sites was moved, a client tried to make a backup of the Tomcat instance their site was running under. They executed:
gzip -r filename.gz ./tomcat/
which ended up recursively zipping all the files under the tomcat directory. This crashed the running instance of Tomcat. (Perhaps trying to tar or gzip a running Tomcat instance isn't a great idea. Copy it first and then tar/gzip the copy.) Everything was quickly unzipped and they tried to restart Tomcat, but it wouldn't start. After an hour of searching, I was asked to help. I poked and prodded for about 90 minutes and discovered that all the webapps were missing the
/WEB-INF/web.xml
file. I don't know how or why gzipping and unzipping would make these files disappear, but they weren't there anymore. So I replaced them, restarted Tomcat, and everything came up.
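For reference, a safer backup sequence looks like the sketch below: copy the live tree first, then archive the copy. The paths here are hypothetical, and a stand-in directory is created so the commands are self-contained.

```shell
# Stand-in for a live Tomcat tree (hypothetical path)
mkdir -p /tmp/tomcat-demo/webapps/site/WEB-INF
echo '<web-app/>' > /tmp/tomcat-demo/webapps/site/WEB-INF/web.xml

# Copy first, then tar/gzip the copy -- never gzip the live files in place
cp -a /tmp/tomcat-demo /tmp/tomcat-backup
tar -czf /tmp/tomcat-backup.tar.gz -C /tmp tomcat-backup
rm -rf /tmp/tomcat-backup
```

The running instance never has its files touched, and the archive can be verified with tar -tzf before anything is deleted.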

2. Problems with MysqlDump
I had a couple of problems backing up the MySQL database. First, mysqldump would execute but not put all of the tables into the backup file; it turned out I had run out of disk space. Then I discovered that mysqldump doesn't care about relational integrity when it creates the backup script, so when I tried to import it into the new database I ran into errors complaining about foreign keys. To resolve this, I had to add SET FOREIGN_KEY_CHECKS=0; at the top of my backup script. This allowed me to import it successfully.
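The fix can be scripted rather than done by hand in an editor; here's a sketch using a tiny stand-in dump file (the file name and contents are made up for the demo):

```shell
# Stand-in for a real mysqldump output file
printf 'INSERT INTO child VALUES (1);\n' > /tmp/dump.sql

# Prepend SET FOREIGN_KEY_CHECKS=0; so the import ignores table ordering
printf 'SET FOREIGN_KEY_CHECKS=0;\n' | cat - /tmp/dump.sql > /tmp/dump_fixed.sql

# Import with: mysql -p -h localhost dbname < /tmp/dump_fixed.sql
```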

3. Mysql connection pool Exhausted

After getting the db imported and the web server moved over, the application seemed to be running fairly nicely on the new VMs. Then I left it alone for a couple of hours. When I came back it was throwing connection pool exhausted exceptions all over the place. I could resolve this by logging back into InfoGlue's cms instance, but that wasn't the right resolution for me. So I did some checking. Many people suggest setting the wait_timeout parameter in MySQL's my.cnf file higher or lower. I tried that and it didn't seem to work. What I ended up doing was adding a line in my database.xml file:
<param name="validation-query" value="select * from someTable" />
This keeps connections from going stale by running the query against the db to validate them, rather than handing out a dead connection.

4. MySql Data Directory Change
While trying to resolve the previous issue I was changing/adding parameters in the MySQL my.cnf file. I did a copy/paste of a number of properties, one of which was the pointer to MySQL's data directory. This pointer was different from the one I was using. As a result, when I restarted MySQL, all of my databases, tables, users, and data were gone. I freaked. Then, after thinking about things for a little bit and checking through the files one more time, I realized the mistake I had made and commented that line out. Restarting mysqld brought all my databases, users, and data back again. Whew.
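For the record, the line in question looks like this in my.cnf (the path shown is the common default, not necessarily what was on my box):

```
[mysqld]
# Where mysqld looks for databases; a pasted-in value that differs from
# the real location makes everything appear to vanish on restart
datadir=/var/lib/mysql
```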

5. Getting HtDig going again
The original installation of HtDig on the old server hadn't reindexed in over two years because cron had been broken. I didn't have the root password to fix this, so the client was very interested in getting it running again on the new server. I had no previous experience with HtDig.
I copied over all the files I could find related to HtDig on the old server and installed them on the new box. After a few tries, re-indexing worked, but I still had a problem displaying the search results page on the web site. It turned out I was missing a virtual directory configuration in my httpd.conf file for the directory htsearch was running from (as a cgi script). The only reason I figured that out was by using lynx (a command-line web browser). After fixing that, I got my newly indexed results displaying on the web site.

6. Using /etc/hosts helps

I've found that the /etc/hosts file is a big help in moving sites like this - whether you want to test the site while the live site is still running, or quickly configure pointers to a server dependency.
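A typical entry looks like this (the IP and names are placeholders): point your workstation at the new box while DNS still resolves to the old server, test the site, then remove the line once DNS is switched over.

```
# /etc/hosts on the testing workstation
192.0.2.10    www.myclientsite.com myclientsite.com
```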

7. MySQL table names are case sensitive (on Linux)
One of the website moves I did involved porting a MySQL database from a Windows box to a CentOS box. The dump on the Windows box lower-cased all the table names. I didn't pay much attention to this at the time. They imported the same way.
When I started up my Tomcat server, it threw errors related to the jdbc connection pool and a ValidateObject. After a bit of googling I discovered that this is related to the validation query (the query I wrote about earlier that checks the connection every so often to make sure it hasn't gone stale). I tried running that query right on the mysql box and it wouldn't run, because the table name was now all lower case while the query used the original camel-case name. So I changed all my table names back to camel case and things worked.
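An alternative I've since read about (I haven't tried it on this setup): MySQL's lower_case_table_names server setting makes table names stored and compared case-insensitively, the way Windows behaves. It needs to be in place before the tables are created:

```
[mysqld]
# 1 = store table names lower case and compare case-insensitively
lower_case_table_names=1
```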

8. Issues with unknownDefaultHost and serverNodeName in Tomcat
My inability to set the serverNodeName was resolved by adding a line in my hosts file pointing the IP of my box to the HOSTNAME value shown when I run the 'set' command in CentOS. My issue with unknownDefaultHost was resolved by going into the server.xml file and editing the Engine element's defaultHost attribute, changing it from my old domain name to 'localhost'.

9. MySql error 1153 - Got packet bigger than 'max_allowed_packet' bytes
Got this error a couple of times with different DB imports. To resolve it, I needed to go to the my.cnf file (often found in /etc/) and set that property like this:
max_allowed_packet=16M
for the mysqld (mysql daemon). Then I had to restart the daemon (/etc/init.d/mysqld restart) and I could run my import (mysql -p -h localhost dbname < dbImportFile.sql)
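If you're scripting the change, appending the setting works when [mysqld] is the last section of the file; the sketch below uses a demo file standing in for the real /etc/my.cnf.

```shell
# Stand-in for the real /etc/my.cnf (contents are a minimal example)
printf '[mysqld]\ndatadir=/var/lib/mysql\n' > /tmp/my.cnf

# Append the larger packet limit; lands under [mysqld] because it is
# the last section in this demo file
printf 'max_allowed_packet=16M\n' >> /tmp/my.cnf

# Then: /etc/init.d/mysqld restart
```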

Wednesday, November 11, 2009

Musing on Automated Deployments

I have been a key player in big automated deployment strategies in two significantly sized organizations now. One used ant with a java code base, the other used Visual Build with a VB code base. Both of these implementations deployed multiple dependent projects onto a variety of server types into development, testing, staging, and production environments. With the exception of prod, each environment had more than one instance of the environment running.

Some of my earlier musings on automated builds and deploys can be found by clicking here.

One would think that an automated deployment would be deterministic. In other words, given the logic in the deployment file(s), it should deploy the same way every time. Surprisingly, we have found this is not always true. Since many of these deploys are pushed to remote boxes, hiccups in the network end up throwing a proverbial wrench into things. And (again) surprisingly, these occur more often than I would've thought. We actually blamed these hiccups on increased solar activity for a while. I have no solution for getting around these network hiccups, except to say that if you see your deployments failing consistently at a certain time of day, schedule them for another time. Our Sunday evening deploys have lately been failing. Yet when we kick them off Monday morning (with no changes to the deployment logic), everything is fine. We're thinking there's possibly a weekly batch job or two running during our Sunday deploy that bogs the network down....
I've also seen automated deploys act inconsistently (only on Windows) when registering dll's in the assembly. We can deploy and gac things fine onto our bare metal, VM servers with no problem. Yet when we deploy the same software onto a legacy hardware server where the dll's are already gac'ed (our deployment logic un-gac'ing and re-gac'ing them), they seem to fail to register. I've wondered if perhaps the deployment moves through all the logic too fast? The command is definitely correct. Sometimes we'll even see the dll's in the assembly folder in the GUI, but the application can't see them. Manually registering them from the command line fixes the problem, but we shouldn't have to do that.

Something else to consider when implementing automated deploys - do you want to deploy everything from scratch (bare metal deploy) or do you want to deploy onto an already working image or server (overlay deploy)? I have tossed this question around a number of times. I think the correct answer for you depends on how you answer the following questions:
Are you thinking about deploying to a system that's already running in production? Are all the configurations that make that production system work documented? Are you confident that you could rebuild the production server and get it running without any major problems? If you answer 'yes' to all of these questions, then you could probably save some time and implement overlay deploys. If you are starting work on a greenfield (new) application, or you aren't confident that you could rebuild your production server, then you should probably consider bare metal deploys. Bare metal deploys done properly essentially become self-documenting DRPs.

Cookies and Perl

I have a client who is using an old (outdated and unsupported) php/perl CMS as an intranet site. In implementing a DRP (Disaster Recovery Plan) they switched all the server references in the code and on the server from the server IP address to a DNS name. This was done using a search and replace :-(
This effectively broke the intranet site. They were able to get most of it back up and running, with the exception of logins and session management. For some reason, once this CMS saw a domain name instead of an IP address, it changed the path in the cookie to something like this:
mywebserver/http://mywebserver/somesite
from what worked before which was:
mywebserver/somesite
We searched through config files and didn't find anything conclusive to begin with. Then we found some code where cookies were being created and modified it so that after the cookie string was created we did some string manipulation on it like this:
$setcookie =~ s/http:\/\/mywebserver//m
This resets the setcookie var by searching through it (it's a multi-line string, so we needed the /m modifier) for the string 'http://mywebserver' and replacing it with nothing.
It turned out that the CMS was creating more than one cookie, so we found this bit of code in two more files, added our hack there, and sessions worked again!
I have to confess, I was surprised that this little hack worked.
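The substitution is easy to sanity-check outside the CMS. Here's a close cousin of the hack using sed on a sample cookie path (the hostname is our placeholder); this variant also swallows the preceding slash so the path comes out clean:

```shell
# Strip the embedded '/http://mywebserver' from the cookie path
echo 'mywebserver/http://mywebserver/somesite' | sed 's|/http://mywebserver||'
# prints: mywebserver/somesite
```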

Tuesday, October 27, 2009

IIS default app pool proc terminated

Here's a great link to error codes for w3wp (event id 1009) and what they likely mean:
http://blogs.iis.net/brian-murphy-booth/archive/2007/03/22/how-to-troubleshoot-an-iis-event-id-1009-error.aspx
My current issue is error 0x0 which is documented in this link but fairly sparse. I checked the debugger flags with gflags and everything was unchecked. I'm still looking for a resolution to my issue.

***

I did find a resolution to this issue. I went through all the configs and our DRP (Disaster Recovery Procedure) document to see if I could figure it out. No dice there. I downloaded and hooked up procmon to see if it would tell me which registry key or file was causing the issue. No dice there either. It turned out that since our app had a dependency on SoftArtisans' FileUpEe, the FileUpEe dll's were registered in the wrong order! I had registered them 'com' folder first, 'dotnet' folder second (per the DRP). They needed to be registered 'dotnet' first and 'com' second. That stopped my IIS appPool from cratering (it was definitely the issue).

Wednesday, October 14, 2009

Tools for working with SVN & Visual Studio

We're moving to Subversion as our code repo. Tools that we are using are:
- VisualSVN Server on the server box for maintaining the svn server (managing users, creating/importing new repos, etc)
- TortoiseSVN on client boxes. This works in conjunction with windows explorer to tell you the status of files in your local copy of the repo. We've found that the icons don't change status immediately - you need to be patient with them.
- Collabnet AnkhSVN - Subversion plug-in for visual studio. Allows you to see the status and check files in and out inside of VS.
- Collabnet SVN command line client. We found we needed this (particularly svn.exe in the PATH environment variable) if we wanted to run the SVN steps in a Visual Build file.

Wednesday, October 7, 2009

Monitor the right things

I started re-reading Release It! by Michael Nygard this morning on the commute into work. In his first chapter he talks about a (very small) issue that turns into a colossus and takes down an airline's check-in system. The system had a monitor configured and performing checks on it, but it turned out that it wasn't checking the right things (it was looking at the http port on transactional servers when it should have been looking at the RMI port).
It totally reminded me of something that happened about a month ago. We have a bunch of web applications running on our production server. After fine-tuning our monitoring to look at pages that the application has to apply logic to in order to serve up (rather than a static home page), we found that our monitoring corresponded much more closely to complaints from users.
Think twice about what you want to monitor and where to point it.

Monday, October 5, 2009

MS Sql Server bug

We ran into a SQL Server bug today that was rather interesting. We were implementing synonyms across a number of views, tables, and stored procs in a couple of DBs. Everything was working fine until another team did a deploy and changed the index on a view that was referenced by one of our synonyms. It turns out there is a documented bug whereby any time the DDL changes on a view referenced by a synonym, that synonym loses its connection to the view. This includes just updating an index on the view.

Thursday, September 3, 2009

Handy Utilities for Windows

I don't want to forget about these handy software utilities for doing troubleshooting and debugging on the windows platform.

SpaceMonger is a tool that graphically displays how data is utilized on your hard drive.

SysInternals software...

TcpView essentially provides a graphical view of a netstat on a windows box with a few extra handy features (counting the number of sockets in time_wait, for example).

Procmon lets you watch file system, registry, and process activity in real time, filtered down to a specific process id.

Wednesday, September 2, 2009

windows account security

The domain (active directory) policy will override any local security setup for a particular account on a server. If the domain security policy says that an account will lock after 5 failed attempts and the local security policy says the account will lock after 3 attempts, the account will lock out after 5 attempts.
Here's the interesting thing:
The account lockout counter is reset every 24 hours or with every successful login attempt. So if you have two services using the same account, one with a correct password and one with an incorrect password, you can likely run indefinitely before the account locks out. One service will never work, and you should see a lot of errors in the security event log.

Wednesday, August 26, 2009

More IIS automated deploys

Been working more with IIS automated deploys and found a couple of good posts on David Wang's blog, along with some other good blog posts.

This one is related to how to manipulate IIS list data. He provides a handy vbs script to do this that worked nicely for me in adding a new application mapping to IIS.

This one talks about IIS App pool crashes - fatal communication errors between the Application Pool and IIS.

This one talks about app pool recycling and IIS availability. There are some great conversations past David's article, further down the page.

This one (not David Wang) shows how to change COM+ MSDTC settings programmatically. This script worked great for me as well.

This one (Ian Morrish) shows how to manipulate COM+ Security launch and activation permissions using DComPerm from the windows SDK.

With all of these scripts, I found I had to carefully read the directions and make sure I had the parameters that were being passed in correct. I was often passing in erroneous parameters on my first couple of tries.

Friday, August 21, 2009

Windows bug with adding users to a group

I was using a script to try to add users to a local group today in an automated build script. I was sent a script that worked, but when I plugged in my values, it didn't. I always got a script syntax error, as if one of my parameters was not in the right place, or had spaces or something. Well, the group name did have spaces, but that wasn't my issue.

It turns out that there is a documented Windows bug (http://support.microsoft.com/kb/324639) that limits the username to be added to only 20 characters. Any more and it won't run.

Friday, August 14, 2009

Log parsing/network security tools

I discovered some new tools today that are useful for network security. QRadar from Q1 labs (http://www.q1labs.com/) is a really slick log parsing tool for organizations that are looking to implement a distributed log management offering to collect, archive, and analyze network and security event logs. It then parses this information into graphs and data that you can tune to alert you when things go awry. You can configure it to look at firewall logs, web server access logs, event logs, etc.
Splunk (http://www.splunk.com/) seems like it might be a competitor. At a glance, I'd say that QRadar has a lot more features and might be a lot more expensive.

XML manipulation in Visual Build (and vbscript)

We've been working on basically turning our Visual Build deployment files into DRP (Disaster Recovery Plan) scripts. Essentially, everything required to get our applications running on a 'blank' server box is documented as a config, permission, or push in our Visual Build files. This has at times required updating the machine.config and web.config files that reside under windows/microsoft.net/framework/v1.1.4322 or v2.0.50727/config. Sometimes we're adding attributes to existing elements, sometimes we're adding entirely new elements.

It seems to me that adding entirely new elements in Visual Build (using the Write XML Action) is buggy. It will add new attributes to existing elements, no problem. But it won't add new elements. So we hacked around that and used the 'Run Script' Action with vbscript to add elements to our files. In vbscript it looks something like this:

Dim objXMLDoc, objNewNode, objText, strXPath, objParentNode, objChildNode, objCaseExist

Set objXMLDoc = CreateObject("Microsoft.XMLDOM")
objXMLDoc.async = False

' load the XML file - make sure to include the fully qualified path
fSuccess = objXMLDoc.load("\\%TARGET_SERVER%\c$\WINDOWS\Microsoft.NET\framework\v2.0.50727\config\web.config")
If Not fSuccess Then
wscript.echo ("error loading XML file")
Else
wscript.echo ("XML file loaded")
End If

' set to proper node
Set nodeList = objXMLDoc.selectNodes("configuration/system.web/compilation/buildProviders")
If nodeList.length > 0 Then

Set objParentNode = nodeList(0)

' add the new buildProvider element/node
Set objNewNode = objXMLDoc.createElement("add")
objNewNode.setAttribute "extension", ".uplx"
objNewNode.setAttribute "type", "System.Web.Compilation.PageBuildProvider"
objParentNode.appendChild(objNewNode)
objParentNode.appendChild objXMLDoc.createTextNode (vbCrLf)

set objNewNode = Nothing

else

wscript.echo ("node list empty")

end if

' save the XML file
objXMLDoc.save("\\%TARGET_SERVER%\c$\WINDOWS\Microsoft.NET\framework\v2.0.50727\config\web.config")

Thursday, August 6, 2009

Error 8510 and MSDTC...

We had the whole development environment down for a day. After spending a good bit of time debugging, we discovered that we were getting a significant number of errors in our event logs saying:
Inner: The transaction has aborted.
Inner: Failure while attempting to promote transaction.
Inner: Fatal error 8510 occurred at Aug 5 2009 1:09PM. Note the error and time, and contact your system administrator.
A severe error occurred on the current command. The results, if any, should be discarded.

They were thrown against the running of several different stored procs and because of the architecture of our system, all of our COM+ components were rendered useless.

I found a post that talks about 'fatal error 8510' that was interesting:
http://blogs.msdn.com/asiatech/default.aspx - go to the bottom of the page (it's a ways down). However, that turned out NOT to be the resolution to our problem.

The resolution to our problem resided in the fact that SQL Server could not initiate a distributed transaction. We had thought there was a problem with the MSDTC cluster, but that turned out not to be the issue, as other sql servers in the cluster could initiate distributed transactions. Restarting services on the sql server that could not initiate distributed transactions resolved the problem.

Friday, July 31, 2009

linux reboot and chkconfig

Setting up a new linux server pair for clients, I did some things I haven't done in a while, so I thought I'd document them here so I'd remember them in the future.

To set up a service in linux, use chkconfig. Running chkconfig --list shows all the potential daemons sitting in /etc/init.d/ and whether they are set to on or off for each linux run level. The main run levels to set are 2, 3, 4, and 5. You can set them to on for httpd like this:
chkconfig --level 2345 httpd on

It's also a good idea to go to the script for that service in /etc/init.d (httpd in this case) and modify the chkconfig line at the top of the file to reflect those changed params.
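That header line is just a comment that chkconfig parses; in /etc/init.d/httpd it looks something like this (the last two numbers are the start and stop priorities):

```
# chkconfig: 2345 85 15
# description: Apache is a World Wide Web server.
```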

To reboot the server you can run reboot, which is typically equivalent to shutdown -r now.

Monday, July 20, 2009

Automated Windows Deploy scripts and tricks

Permissions (domain is optional):
cacls drive:\folder /T /E /G "domain\username:C" - :C modify permissions
cacls drive:\folder /T /E /G "domain\username:R" - :R read permissions
cacls drive:\folder /T /E /G "username:F" - :F full permissions

net localgroup "IIS_WPG" "domain\username" /add - adds user to IIS_WPG localgroup

IIS configs:
Set the friendly name for an IIS website:

cscript drive:\path-to\adsutil.vbs SET W3SVC/WEB_SITE_Number/ROOT/AppFriendlyName IIS_WEB_SITE_Name
Set the port of a website:
cscript drive:\path-to\adsutil.vbs SET W3SVC/WEB_SITE_Number/ServerBindings ":8080:"
Install an IIS Application pool:
cscript drive:\path-to\adsutil.vbs CREATE w3svc/AppPools/AppPoolName MyApplicationPool
Set identity type on AppPool
cscript drive:\path-to\adsutil.vbs SET W3SVC/AppPools/WebSiteName/AppPoolIdentityType 3 (2 is the predefined network service account)
Set appPool username
cscript drive:\path-to\adsutil.vbs SET W3SVC/AppPools/WebSiteName/WAMUserName username
Set appPool password
cscript drive:\path-to\adsutil.vbs SET W3SVC/AppPools/WebSiteName/WAMUserPass password
Delete a web site:
cscript c:\windows\system32\iisweb.vbs /delete "WebSite name"
Create a web site:
cscript c:\windows\system32\iisweb.vbs /create "IIS File Path" "WebSite name" /dontstart
Set server bindings
cscript drive:\path-to\adsutil.vbs SET W3SVC/WEB_SITE_Number/ServerBindings :80:myWebSite :80:myWebSite.com
Set virtual directory permissions
cscript drive:\path-to\adsutil.vbs SET W3SVC/WEB_SITE_Number/ROOT/vDir_Name/AccessFlags 513
Set enable anonymous user for vDir
cscript drive:\path-to\adsutil.vbs SET W3SVC/WEB_SITE_Number/ROOT/vDir_Name/AuthAnonymous TRUE
Set vDir anonymous username
cscript drive:\path-to\adsutil.vbs SET W3SVC/WEB_SITE_Number/ROOT/vDir_Name/AnonymousUserName username
Set vDir anonymous password
cscript drive:\path-to\adsutil.vbs SET W3SVC/WEB_SITE_Number/ROOT/vDir_Name/AnonymousUserPass password
Set vDir to use integrated windows authentication
cscript drive:\path-to\adsutil.vbs SET W3SVC/WEB_SITE_Number/ROOT/vDir_Name/authNTLM TRUE
Set vDir ASP.NET version
C:\WINDOWS\Microsoft.NET\Framework\version_number\aspnet_regiis -s W3SVC/WEB_SITE_Number/root/vDir_Name

Tuesday, June 9, 2009

Troubleshooting

Things to always watch out for when troubleshooting:
- locked out users - users who have been locked out of the system because the wrong password has been entered 3 consecutive times
- out of disk space (proper monitoring would fix this problem)
- GUI display that is contrary to what is actually happening on the system. Gac'ed dll's that really aren't gac'ed. In other words, assume nothing.
- inconsistent application of permissions across environments.
- duplicate libraries/dll's on machines.
- closed ports or ports/services not listening/running. Also port conflicts. More than one service wanting to use the same port.
- services, IO connections, db connections not getting closed/terminated properly.
- logic between functions or classes that creates infinite loops

Friday, June 5, 2009

IIS binary formatting error

Oh man, did I get a slap in the face yesterday. I was helping coordinate a patch into production for our team and was asked if we needed to 'drainstop' (read: turn off load balancing) to deploy to our app servers. I said no, I had never done that before here. Later I realized that I probably hadn't done a prod patch to the app servers here, period.
Anyway, we pushed our change onto running servers and everything appeared to deploy fine until we actually tried to use them. It seemed that even though we pushed new components and gac'ed them, they were ignored because the server still had a running request against the old ones. Once that request was gone, our app servers were down. When our app servers go down and our web servers try to access them in that state, we get this wonderfully intuitive error called a 'binary formatter error' with a bunch of junk that doesn't make any sense.
So, in the end, two lessons learned:
1. Always drainstop servers when you're deploying onto a live system
2. Binary Formatter errors can mean that your communication with a remote server is down.

Sunday, May 24, 2009

Tomcat Alias

As my previous post says, I went live with a new website on Friday. Well, within minutes of going live I realized we didn't have an alias set up for www.myNewSite.com, so I got my client's administrator to add it. Once it was live, the alias still didn't work. I had it configured in my tomcat server.xml file, but no dice. What was going on?

When I installed tomcat, I got rid of all the default sites it comes with (and I thought I had gotten rid of all their configs as well; it turned out I hadn't). In the server.xml file the host was configured correctly, but the 'engine' configuration was still pointing to 'localhost', which wasn't in my host configuration anymore. So I pointed it to my new 'root' configuration for the server, and now it looks something like this:

<Engine name="Catalina" defaultHost="www.myNewSite.com">

and the host looks like this:

<Host name="www.myNewSite.com" appBase="webapps"
      unpackWARs="true" autoDeploy="true"
      xmlValidation="false" xmlNamespaceAware="false">
  <Alias>myNewSite.com</Alias>
</Host>

MySQL service restart, InfoGlue

I have MySQL running behind InfoGlue for a new web app one of my clients has. We went into production on Friday with that web site. Saturday evening the hosting provider had a scheduled update, and so all the boxes got rebooted. Unfortunately, the MySQL service, which was configured to restart automatically, didn't come back up. In the application event logs (I've got it running on a Windows server) it said:

...MySQL Server 5.1\bin\mysqld: unknown option '--enable-pstack'

I searched the MySQL forums and Google for this error and found nothing. I reinstalled the service a couple of times... and nothing - same error. I was sweating by this time.

Then I started looking at the my.ini file for MySQL in the installation folder (a sibling of the bin folder). I saw these lines at the very bottom:

#Print a symbolic stack trace on failure.
enable-pstack


Ah ha. So here's where that 'unknown option' was coming from. I had no idea what would make it print this 'stack trace' on failure anyway, so I just commented out that line. ...And lo and behold, MySQL started up happily and things worked.

Whew.

Friday, May 15, 2009

Custom header in Cruise Control

We have multiple CruiseControl (CCNET) instances running on our build boxes and we've been wondering how to differentiate which one is which so people can know at a glance which one they are pointing at.
This turns out to be relatively easy. In a typical CCNET installation, there's a \cruiseinstalldir\webdashboard\templates directory. Inside it you'll find a file called TopMenu.vm. Open the file up and after the last '#end' put some literal spaces in and type the title of your CCNET site with some plain html formatting to make it obvious (make the font bigger and bold, for example)

...save it and refresh your dashboard page.

Windows and Automated builds

We've run into a couple of interesting situations with Windows and our automated build scripts that I wanted to document. Both have to do with inconsistencies between what the Windows GUI says about a configuration and what the operating system is actually using.
Both of these situations were discovered while we were trying to do an expansive upgrade of our code to use the new .NET framework. In the first instance we were manually trying to install/uninstall dll's from the gac on a particular box. We manually dragged and dropped a particular dll into the c:\windows\assembly folder, and the GUI said the dll was successfully registered in the gac. However, check it from the command line, or try to run something in the application against the newly gac'ed dll, and the OS didn't seem to see it. The only way we could get the OS to see the change was to gac the dll from the command line.
In the other instance (which is similar in nature), our automated deploy scripts create a Virtual Directory underneath an IIS web site on a remote server. After the Vdir is created, we had a 'step' that automatically updated the ASP.NET version on the Virtual Directory to 2.0.50727 (C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\aspnet_regiis -s W3SVC/%WEB_SITE_NUMBER%/Root/%WebSiteName%). This script ran successfully, and the ASP.NET tab on the VDir displayed the change. However, IIS didn't actually see the change and continued to run the web site on the old 1.1.4322 version - even though the GUI in IIS said otherwise. We knew this because the C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files folder didn't have the new corresponding virtual directory folder (this is where IIS keeps the compiled pages - it's like the 'work' folder in Tomcat). It seems the IIS metabase was updated, but the registry was not. I have a hunch this might be because we left IIS running during that particular deploy. Perhaps shutting it down would let the update propagate through to the registry.

Tuesday, May 12, 2009

Don't assume anything

We are doing a big upgrade of our components in our upcoming release. The security structure of our organization is fairly complex with users, roles, user groups assigned to roles, boxes in the domain, boxes outside of the domain, and a whole other layer of users and roles at the database level. Managing security for this release has been a hassle. It seems in every environment there are differences with who is assigned to what group, which users and which groups have 'these' permissions on 'this' folder, etc.
Thinking we had most of the issues documented, automated, and swept under the rug, we deployed into our staging environment. Surprise, surprise: there were permissions issues there too, and interestingly, they turned out to be new ones. We spent the better part of the day trying to see if we had missed a permission on a folder somewhere. In the end, it turned out that our architecture group (who is responsible for the shared components code and the overall enterprise architecture) had some code for applying special permissions that was hard-coded to be applied against ONLY our staging and production environments.
Shock.
I guess it goes to show: don't assume anything - no matter who is responsible for the code, where the error seems to be, or how stable you think your configurations are. We also discovered that that particular component had a different COM+ security setting in the Staging environment compared to the Testing environment.

Thursday, May 7, 2009

InfoGlue installation

I've been doing a lot of work with InfoGlue lately. I downloaded the new 2.9.6.X release, went to install it, and got a 49.0 class file version error. It seems the new version of InfoGlue doesn't run as well on Java 1.5 as it's documented to; it worked once I finally upgraded to 1.6. I also had issues getting it to work with MySql because my install of MySql was done with a production box in mind. It seems the MySql configuration engine clamps down on security when you specify this, and as a result, I had to allow temporary access for root to all schemas, and then specify permissions for each user schema by schema. While I'm very impressed and happy about this, it made for a little more grief than I was expecting in the installation process.
Also, during the InfoGlue installation you do want to change the first username/password it asks you for to the specific account that InfoGlue will use to connect to the DB. Not realizing that, I had to go through all my web contexts afterwards and change the references to those properties.
Also, deleting a repo in InfoGlue using its 'force' feature is not a great idea. Better to rename it instead. Sigh. I had to go through and insert data back into a couple of tables to get the repo working properly again.
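As a sketch of what the per-schema permissions look like (the user, password, and schema names here are made up, not InfoGlue's actual defaults):

```sql
-- Grant the application's DB user rights on just its own schema
-- instead of leaving root open to everything.
CREATE USER 'infoglue'@'localhost' IDENTIFIED BY 'changeme';
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, ALTER, INDEX, DROP
    ON infoglue.* TO 'infoglue'@'localhost';
FLUSH PRIVILEGES;
```

Repeat per schema/user pair; it's more typing up front, but it's the locked-down setup a production box should have anyway.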

Tuesday, April 7, 2009

IIS debugging

Been working lately on getting a new IIS web application into production. We've been continuing to run into some more issues with anonymous/integrated windows authentication as the build is progressing through different environments. However, I found this cool tool yesterday that has some neat utilities for working/debugging IIS. It's called the Internet Information Services (IIS) Resource Kit Tools and you can download it here. Some of the cool utilities in that package are:
  • WFetch - allows you to fully customize an HTTP request and send it to a Web server so that you can see the raw HTTP request and response data. Sort of like Ethereal for IIS.
  • SelfSSL - allows you to create and sign your own SSL certificates for testing with SSL in dev environments
  • A Metabase Explorer - allows you to view all the configuration properties of IIS and its web sites, app pools, virtual directories, and more
  • And an Apache to IIS migration tool, which would apparently allow me to migrate an Apache web server (and its configs) to an IIS 6.0 server.

Friday, March 27, 2009

IIS and Nagios

Working with IIS on a particular app, we had a situation where the behavior of the web server seemed to be inconsistent with how we had configured it. We had configured it to allow Authenticated access based on Integrated Windows authentication only, no anonymous access. When we browsed to the page we wanted to view on the web server itself, the page came up fine. But when we tried to browse to the page from another box, we were required to authenticate. This was not expected since it should have been using the Integrated Windows authentication to log in transparently.
After comparing it with another server that worked and seeing that most of the GUI configuration looked identical, I started googling and found this article that answered the problem for the most part. The only difference we found was that even the order of the NTAuthentication providers matters - the working server had "NTLM,Negotiate"; the server that didn't work was set up as "Negotiate,NTLM". Changing the broken server to match the working one fixed our problem.
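If you'd rather script that change than poke at the metabase by hand, adsutil can set the provider order; a sketch assuming site number 1 (check your actual site number in IIS Manager):

```bat
rem Inspect the current provider order first:
cscript C:\inetpub\AdminScripts\adsutil.vbs GET w3svc/1/root/NTAuthenticationProviders

rem Set it to the order that worked for us:
cscript C:\inetpub\AdminScripts\adsutil.vbs SET w3svc/1/root/NTAuthenticationProviders "NTLM,Negotiate"
```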

Our infrastructure team has been making some changes and moved the configurations for the email servers a bit. Unfortunately, this wasn't transparent to my Nagios/GroundworkOpenSource installation (since I was lazy and hadn't reconfigured the email 'from' address for the host). What ended up happening was that all the Nagios emails got sent outside the internal network (because the domain name was still 'localhost.localdomain') and were caught by the public anti-spam appliance we have. I ended up switching notifications to use 'service-notify-by-sendemail' instead of 'service-notify-by-email' and then overriding the host in the command to point to our local email server. And that worked. Similar to this post on the groundwork forum.

Wednesday, March 18, 2009

Automatically deploying IIS web apps

We've been working on automated builds and deploys at work. All of our web apps run on IIS and configuring 'good' automated deploys for these applications has been challenging. But I think we're seeing the light at the end of the tunnel now.

Microsoft provides a good number of support scripts (in vbs) to configure IIS, App Pools, and Virtual Directories from the command line. Some of them come with the IIS installation, some of them we had to download - if I remember correctly.

There are two 'sets' of these CLI vbs scripts that I'm aware of that work with IIS. One is found in the C:\inetpub\AdminScripts folder - there's a bunch of vbs scripts in there. The other set is found in C:\windows\system32\iis*.vbs.

Here's some examples of what you can do with these scripts:
  • create a virtual directory: cscript c:\windows\system32\iisvdir.vbs /create webSiteName virDirName virDirPath /s serverName
  • set perms on a virtual directory: c:\inetpub\adminscripts\adsutil.vbs SET w3svc/webSiteNumber/Root/virDirName/AccessFlags permNum (like 513)
  • set ASP.NET version for the virtual Directory c:\windows\Microsoft.net\framework\v2.0.50727\aspnet_regiis -s w3svc/webSiteNumber/root/virDirName
  • and tons more...
You can view all the website numbers in the IIS manager. They show up in a column.
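Chained together, a minimal deploy step might look like the sketch below (the site number, server name, vdir name, and paths are all placeholders):

```bat
rem Hypothetical values - substitute your own site number, names, and paths.
set SITE=1
set VDIR=MyApp

rem Create the virtual directory on the remote server:
cscript C:\windows\system32\iisvdir.vbs /create "Default Web Site" %VDIR% D:\apps\%VDIR% /s myServer

rem 513 = Read (1) + Script (512) access:
cscript C:\inetpub\AdminScripts\adsutil.vbs SET w3svc/%SITE%/Root/%VDIR%/AccessFlags 513

rem Point the vdir at ASP.NET 2.0:
C:\windows\Microsoft.NET\Framework\v2.0.50727\aspnet_regiis -s w3svc/%SITE%/root/%VDIR%
```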

Monday, March 16, 2009

IT Security

Over the past few years I've seen sensitive information exposed in some very interesting places on enterprise networks and servers. Sometimes this leftover information can be super helpful if you're trying to debug problems or get an idea of what happened on the box in the past. In other cases, it's just plain bad. Here's some of what I've seen:
  • Shared drives mapped all over the enterprise. Shared drives mapped on production boxes with access to files that contain sensitive info like passwords for production users.
  • Kickstart configuration files with usernames and passwords for domain users in clear text, forgotten on servers
  • Passwords and sensitive information exposed in .bash_history files. .bash_history files are a treasure trove of information. They'll show you all kinds of things - where the db server is located, what the connection string is, where http servers are installed, how to shut them down and start them up, etc.
  • *.udl files - Microsoft specific. They store connection information in clear text for db servers. Don't leave them lying around and exposed.
  • Installations for UPS (Uninterruptible Power Supply) systems left with their default administrator username and password. I happened to find a login page for a UPS console one day and logged in on the first try using the first password I could think of. The dashboard I subsequently found myself on gave me the power to shut down the entire enterprise.
Here's some simple ways to make your network/enterprise more secure:
  • Don't allow a plethora of undocumented mapped drives.
  • Do searches for text like 'password' on any boxes, drives, etc that you might be concerned about. If you get results, take steps to either encrypt or delete those files or references.
  • Change default installation passwords
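The 'search for password' tip is easy to script; here's a minimal sketch (the patterns are just a starting point, not an exhaustive list):

```shell
#!/bin/sh
# List files under a directory tree that mention passwords in clear text.
find_cleartext_passwords() {
    # -r recurse, -i case-insensitive, -l print matching file names only
    grep -r -i -l -E 'password|passwd|pwd *=' "$1" 2>/dev/null
}

# Example: scan a shared drive mount
# find_cleartext_passwords /mnt/shared
```

Anything it turns up should be encrypted, deleted, or at least have its permissions tightened.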

Sunday, March 15, 2009

CMS experiences

In the past year or so I have gotten a bit of experience with different CMS systems. I was given two clients (www.mross.com, www.auma.ca) that run on InfoGlue - a Java/Velocity based CMS system. I am also messing around with Joomla, and I have one client (www.brooks.ca) that runs on it. I have yet to do anything with Drupal, but from what I've heard, it sounds more like InfoGlue than Joomla - that is, it's geared more towards an enterprise/portal-centric CMS than Joomla appears to be.

InfoGlue is very configurable and supports internationalization. This makes it somewhat cumbersome to configure to start out with. I've found that the important files to know about when doing a configuration on the fly are:
- WEB-INF/classes/cms.properties
- WEB-INF/classes/hibernate.properties
- conf/localhost/live.properties, etc

Some issues that I've noticed with the InfoGlue instances that I work on are:
- on logging into the CMS, I have to refresh the page 3 times before I actually get the GUI. Only one of my instances does this, so I think it's a configuration thing.
- sometimes Tomcat seems to get its knickers in a knot and needs to be restarted in order for content changes to be saved. This happens periodically.
- deleting content (cleaning things up) can be a huge pain - especially if there are a lot of references to the content in other parts of the site. You end up having to delete the references first before you can delete the content.

So far I've been pretty impressed with Joomla. I was able to figure out how things were put together fairly quickly. I bought this book which helped some - it has some good chapters on SEO and working with Joomla templates. I was able to manage a complex upgrade to the site I maintain within a week of getting bootstrapped on the Joomla CMS. This impressed me.

Tuesday, March 10, 2009

Visual Build

I've been working with Visual Build in a .NET environment off and on for almost the past year. When we decided to go forward with Visual Build, I thought it would be a more painful process than it turned out to be. Our technology stack includes ClearCase, Subversion, Ms SQL, PSExec, VMWare, VB.NET, IIS, CruiseControl, Nant, Nunit, and a bunch of 3rd party tools and servers related to document management.
Visual Build provides a bunch of example build files that are quite helpful in getting all kinds of different functionality going. We automatically compile, unit test, deploy (to remote boxes), and run verification tests using Visual Build. There are numerous other interesting smaller tasks that we've got Visual Build to manage for us, like setting perms on remote boxes; stopping and starting servers, services, and COM+ objects; and performing baselines, checkins/outs, and updates of ClearCase streams and views.
Some of the gotchas we've discovered in our work with Visual Build:
- managing windows (child build files that get spawned in a complex build process) and logs is not so trivial. If they aren't managed correctly, your build will stop without much notification as to why. We found that piping the output from the build to a separate file worked for us with logging. With windows, we found that matching the build context, waiting for completion, running the GUI app in silent mode, and not closing the GUI app on failure were what made things tick.
- psexec needs to be on the path on the box you're running Visual Build on.
- CruiseControl integration was relatively easy - we just used the command line with options:
<tasks>
  <executable>C:\PathToVisualBuildInstallation\VisBuildCmd.exe</executable>
  <baseDirectory>D:\pathToBuildFile</baseDirectory>
  <buildArgs>/b "XXX.bld" -other args to pass to buildfile go here</buildArgs>
  <buildTimeoutSeconds>1000</buildTimeoutSeconds>
</tasks>

Friday, March 6, 2009

Replatforming an app in 21 hours....

I got a cold call a couple of weeks ago. A prominent institution in Calgary had a Java application, running on SunOS with an iPlanet web server, that was key to their business and didn't work with a key upgrade from a 3rd party vendor. It needed to be upgraded and running in production in 2 weeks. Could I help?
I agreed to help and showed up that afternoon to see what could be done. The current code in production ran fine. However, there was no guarantee it would run fine with the upgrade. Unfortunately, the code base was a mess. Multiple versions of the same file were all over the production server with .bak, .ver2, etc. extensions. Everything had been done on the fly in the past. The logs couldn't provide us with any useful information, and they had no test environment.
After seeing what could be done on the Sun box (which had been running this application non-stop for more than 10 years!), I decided to try and port the application to Tomcat - in this container I knew what I was doing and I could debug the problems more effectively. Within 14 hours we had the application running on Tomcat with some minor bugs.
Over the weekend, new production and test virtual machines were requisitioned and when I came back on my next visit, we got the last of the bugs out of the way and had everything working in the new test environment.
Some lessons learned....
- Don't be afraid to replatform in Java. I would've had a much harder time trying to do this with Microsoft technology. The clients were VERY happy to get off that old server - no one was around to support it anymore. That was 10-year-old Java code that I didn't have to change at all when I moved it onto j2sdk1.4.2_XX.
- It's almost scary how many different ways you can configure things in Tomcat. We had an outstanding issue that was 'bugging' me after 14 hours of work. The client required the application to be available from two separate URL paths without using an Apache web server to do URL rewriting. I tried various configurations in the app's web.xml, the global web.xml, and server.xml, and found that I could get the URLs to work, but then I had problems with sessions getting lost/corrupted. In the end, we wrote another servlet (sort of like this, but different) that we configured to 'catch' the URL we needed and redirect the request to the controller servlet that was already mapped to the other URL. This worked beautifully.
- We discovered that if you're dealing with pages that have a lot of scriptlet code doing response.sendRedirect(someUrl), these method calls will sometimes throw IllegalStateExceptions. The (quick) way to avoid the exception is to add a return; statement right after the redirect. Or you could move that kind of logic into a servlet if you have time. I didn't (have time).
- Having good logging is critical to debugging an app in an emergency. Seeing my System.out.println() outputs in the tomcat console was like drinking hot chocolate after a day of skiing. So RIGHT! Of course I commented out those lines before we put the code into production.
- Using technologies that don't require registry keys makes configuration and multiple-environment installations very easy. I found that I could write two short paragraphs of directions (documentation?) and that was enough for my client to install Java and Tomcat into their new test environment by himself. (Pretty much copy-paste, add the JAVA_HOME system variable, and add the j2sdk...\bin folder onto the Path system variable.) They were very happy about that too.
- Having hard coded path references to properties files in java code is so nasty! Every time we changed the path, we had to recompile all our classes. I know that's not the right way to do things, but the client just wanted to get things working and worry about refactoring later.

Thursday, March 5, 2009

Groundwork/Nagios and Rockstars

I've implemented Nagios at a few different places. In my current position, I've implemented Nagios with Groundwork, and I'm really happy with it. Groundwork provides a 'community' (free) version of their tool as a VM image that you can just plunk into a VM client. We use the nrpe_nt agents to monitor specific services and metrics on a number of Windows servers. I reconfigured some of the vbscripts in the remote agents to call different methods in the WMI API, which I've detailed in this forum. I've also customized check_mssql to query specific tables looking for issues in production data (querying for held-up business processes and corrupted documents). In order to do this, I had to download FreeTDS and install it on the VM, overriding the default installation target with /usr/local/groundwork/. The freetds.conf file then goes into /usr/local/groundwork/etc, and you configure the servers you want to call in there. I put two configs for each server in that .conf file because I wanted to be able to test from the command line; Groundwork ends up calling using the IP of the box, so each server gets a proper domain name config and an IP config.
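For reference, the per-server pair of entries in freetds.conf looked something like this (the host name, IP, and port here are made up):

```ini
; Entry used when testing from the command line by name
[dbserver.example.internal]
    host = dbserver.example.internal
    port = 1433
    tds version = 8.0

; Duplicate entry keyed by IP, since Groundwork calls by IP
[192.168.1.50]
    host = 192.168.1.50
    port = 1433
    tds version = 8.0
```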
One other thing about configuring the Groundwork/Nagios installation - I had to configure the CentOS to point to our exchange server so the email notifications would work.

I've read a couple of good tech books lately (I now own them both): Release It! by Michael Nygard, and Secrets of Rockstar Programmers by Ed Burns.

Wednesday, March 4, 2009

Schema Crawler

I got back to using Schema Crawler this week, trying to make sure we can explain the inconsistencies we see in the metadata between our different environments. I also found this cool little script to do port scanning, so I could discover which ports our DBs were using.

HOST=127.0.0.1
for ((port=1; port<=65535; ++port)); do
    echo -en "$port \n"
    if echo -en "open $HOST $port\nlogout\nquit" | telnet 2>/dev/null | grep 'Connected to' > /dev/null; then
        echo -en "\n\nport $port/tcp is open\n\n"
    fi
done
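If bash is available, the same scan can be done without spawning telnet for every port, using bash's /dev/tcp pseudo-device; a sketch (host and range are examples):

```shell
#!/bin/bash
# Succeeds (exit 0) if a TCP connection to host $1, port $2 opens.
is_open() {
    # The subshell opens fd 3 on the connection and closes it on exit.
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Report every open TCP port on $1 in the range $2..$3.
scan() {
    local host=$1 first=$2 last=$3 port
    for ((port = first; port <= last; port++)); do
        is_open "$host" "$port" && echo "port $port/tcp is open"
    done
    return 0
}

# Example: scan 127.0.0.1 1 1024
```

Note that /dev/tcp is a bash-ism; it won't work under plain /bin/sh.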