Skip to content. | Skip to navigation

Personal tools

Navigation

You are here: Home / weblog

Dominic Cronin's weblog

Showing blog entries tagged as: Infrastructure

Websphere and Xalan fun for SDL Web 8

Of the small number of people who follow this blog, an unreasonably large proportion will be familiar with SDL Web 8, and the promise it holds for freedom from classpath hell. The new service-based architecture is a huge step forward, but we aren't out of the woods yet. I'm currently busy with an upgrade project where we're taking an interesting mix of web applications from SDL Tridion 2011 to SDL Web 8.

Web 8's much-vaunted REST-ful microservice approach was initially communicated as pretty much a drop-in replacement for the existing Content Delivery APIs. In practice, it turned out that the focus on backwards compatibility wasn't as clear as it might have been, and if you use JSPPage when invoking dynamic component presentations from a JSP page, you are out of luck, because this class doesn't have an implementation in the REST-ful facade. This is annoying, as I can't see any reason why it couldn't or shouldn't be made to work. The missing support is a "known issue", however I'm told there's not much appetite for fixing it. After all, goes the argument, we can use the in-process API, which does have JSPPage, so that's a workaround isn't it? Except that then we don't get the benefit of the dependency-free service architecture, and that, as I shall explain, is no small thing. 

With the in-process API, of course, the idea is that all the necessary jars to do Tridion content delivery things have to be on your classpath. The general idea is simple enough, but in practice, we have to deal with the fact that there are several class-loaders arranged in a hierarchy, and each of these has their own classpath, although it's not always called that. At the top you've got the class loaders that belong to java itself. This means the boot classloader that loads the nuts and bolts of java itself, and also the one that works from the java CLASSPATH variable plus one for java extensions. And then lower down you have Websphere's own extensions classloader, some magic called the OSGi class loader gateway, and then the application's class loader and one for the module. Yes I know - it sounds pretty insane, but I didn't make it up. Have a look over here if you don't believe me! 

So what kind of trouble did we get into, and how did we get out of it? Well we had all the Web 8 jars in a directory, and we'd deployed our application and set things up so the jars would be on the classpath. Keeping the jars outside the application has been the customer's preferred way of doing things for some years, and it's worked well, so our initial expectation was that things should "just work", but once we started testing, we started to see exceptions like: 

[java.lang.ClassCastException: org.apache.xml.dtm.ref.DTMManagerDefault incompatible 
with org.apache.xml.dtm.DTMManager

This is a bit of a weird one,  because if you look up the sources, org.apache.xml.dtm.ref.DTMManagerDefault and org.apache.xml.dtm.DTMManager are actually defined in the same jar. How could they possibly be incompatible? Well as it turns out, it's possible for java to load two incompatible versions of the same jar simultaneously, from different jars. 

If you look it up, this problem is about the Xerxes library, and it's associated serialiser jar. I think this comes about because Websphere uses Xerxes itself, (as do several other application servers) and because Tridion's own Content Delivery installation has these as third-party jars, any difference in the required versions will be problematic. Of course, it could happen with other libraries, but in practice, it's Xerces. (OK - so we also had similar issues with another application that uses JSTL.)

But let's start with "parent first" and "parent last". When working with the hierarchy of classloaders, the default method of loading a class is parent first. What this means is that when the module classloader needs to load a class, it first checks to see if its parent (the application classloader) can load the class. The application classloader then asks its parent, and so on all the way up. I've visualised this in the left hand diagram with the arrows going down, because in practice, what this means is that classes are made available from the top down. If the java classloaders have the class, that's what will be used throughout. 

Parent last is the opposite arrangement. If the module classloader can find the class in its classpath, it loads it itself and doesn't trouble the parents with it. This effectively means that the lower the classloader, the higher priority its jars have, and hence the direction of the arrows in my right-hand diagram.  

Classloader Parent last                  Classloader Parent first

So to get rid of the ClassCastException, we flipped the configuration from Parent First to Parent Last. This works. It's what SDL recommend that you should do if you encounter these exceptions in your environment. But...

Well it turned out that our problems weren't over. Instead of a ClassCastException, now we had a ClassNotFoundException. I can't post it here, because all this happened a while ago and I'm writing this up later, but as I said earlier, it's all about Xerces. The problem in this case is that if a class is loaded by a classloader, it can't call a class that's loaded by a classloader that's lower in the hierarchy. Parent last leaves you rather more sensitive to this kind of problem, because you're deliberately loading classes lower down that might also be available further up the tree, and also might be expected by classes further up. In any case, even though the class is available, it can't be loaded, and you get a ClassNotFoundException. 

In our case, we were able to solve the problem by moving the xalan and serialiser jars to Websphere's ws.ext directory, where they would be loaded by the Websphere Extensions classloader. 

All this is a bit like the "dll hell" that we used to waste days on on Windows systems before the .NET framework came along. Sooner or later, the answer always ended up being that you needed to know far more than you wanted to about the nuts and bolts of how it all worked, and the various possible locations that Windows would look for a dll. "Classloader hell" is not much different. I've been able to avoid it for a long time just by breezily saying - "Oh yes - that's a Java problem". These days, I seem to be more engaged with Java than I used to be, so having to figure out classloader hell is probably fair game. 

It's been a while since I linked to Joel Spolsky's classic: "The law of leaky abstractions", but this seems like a reasonable moment to do so. Joel's piece probably makes for far more entertaining reading than either this, or this, (both of which are pretty good) or any of the other detailed coverage that Google will turn up for you on the complexities of classloaders. My own description here has been deliberately only a sketch to give the big picture. I've skated over many details, missed others out entirely, and probably got a few things wrong (in which case, comments are welcome). My point is that in any given environment, there's a good chance you'll have to solve this kind of thing. It's frustrating, and it costs time that you probably feel like you don't have, but once you engage with the detail, you will find a solution. I'm not saying the solution outlined here is the best one. There may be other ways to get it working, and some of them may well be better. 

To finish on a rather more upbeat note, we should all be happy to be moving slowly but surely towards the new architecture. Having to deal with these issues is actually a very welcome reminder of why we're investing in new architectures in the first place. The difficulty lies in the fact that you can't necessarily have a rebuild of all your legacy systems in the scope of an upgrade project, so we live with some things that aren't perfect, but we are moving in the right direction. Next time will be better! 

Revisiting validateXml

Some time back in 2009 I blogged about validating Tridion's content delivery configuration files. It was a good idea then, and it's remained a good idea ever since. These days, we're dealing with SDL Web 8 and with the new micro-services architecture, you've got a lot of configuration files to get right. (On my fairly unambitious test system, running staging and live together, I just counted almost 80 configuration files.) Fortunately these seem to be reliably supported with schema files that are simply in each of the microservice folders that you copy during an installation. 

Back when I first wrote the ValidateXmlFile powershell function, I'd left it rather unfinished. It was good enough to let me do some validations and detect problems, but it had a significant flaw, in that if a schema file was not present at the location indicated by the noNamespaceSchemaLocation attribute, it would simply not bother with validation. Considering that we're using an XmlReader to do the validation, this is a pretty reasonable design decision - after all the main purpose is to read in the XML, and validation is perhaps a bit of a side-effect. Fair enough, but it's a nasty hole in our defences, so now that I'm revisiting the technique, I've beefed up the script a bit so that it checks that the location is present and that there's a file in the location. 

I've also made sure that the script does some pushd/popd to make sure that everything is nicely lined up when the location is relative to the file (which it generally is).

Here's the updated script

function ValidateXmlFile {
    param ([string]$xmlFile       = $(read-host "Please specify the path to the Xml file"))
	$xmlFile = resolve-path $xmlFile
    "==============================================================="
    "Validating $xmlFile using the schemas locations specified in it"
    "==============================================================="
    # The validating reader silently fails to catch any problems if the schema locations aren't set up properly
    # So attempt to get to the right place....
    pushd (Split-Path $xmlFile)

    try {
        $ns = @{xsi='http://www.w3.org/2001/XMLSchema-instance'}
	# of course, if it's not well formed, it will barf here. Then we've also found a problem
        # use * in the XPath because not all files begin with Configuration any more. We'll still 
        # assume the location is on the root element 
        $locationAttr = Select-Xml -Path $xmlFile -Namespace $ns -XPath */@xsi:noNamespaceSchemaLocation
        if ($locationAttr -eq $null) {throw "Can't find schema location attribute. This ain't gonna work"}

        $schemaLocation = resolve-path $locationAttr.Path
        if ($schemaLocation -eq $null) 
        {
            throw "Can't find schema at location specified in Xml file. Bailing" 
        }

        $settings = new-object System.Xml.XmlReaderSettings
        $settings.ValidationType = [System.Xml.ValidationType]::Schema
        $settings.ValidationFlags = $settings.ValidationFlags `
                -bor [System.Xml.Schema.XmlSchemaValidationFlags]::ProcessSchemaLocation
        $handler = [System.Xml.Schema.ValidationEventHandler] {
            $args = $_ # entering new block so copy $_
            switch ($args.Severity) {
                Error {
                    # Exception is an XmlSchemaException
                    Write-Host "ERROR: line $($args.Exception.LineNumber)" -nonewline
                    Write-Host " position $($args.Exception.LinePosition)"
                    Write-Host $args.Message
                    break
                }
                Warning {
                    # So far, everything that has caused the handler to fire, has caused an Error...
                    # So this /might/ be unreachable
                    Write-Host "Warning:: " + $args.Message
                    break
                }
            }
        }
        $settings.add_ValidationEventHandler($handler)
        $reader = [System.Xml.XmlReader]::Create($xmlfile, $settings)
        while($reader.Read()){}
        $reader.Close()

    }
    catch {
        throw
    }
    finally {
        popd         
    }
}

Of course, what you really want is to be able to verify all your configurations in one go. Once the script is in your powershell $profile, you can put together some fairly simple command-line-fu to take care of that. I have all my microservices in one directory, which I guess is a pretty common pattern, so all I had to do was CD over there and execute the following: 

gci -r -file -include *conf.xml | % {ValidateXmlFile $_}

By running this, I've also picked a couple of things that might be false positives. That aside, this is a real time saver if you're trying to solve issues. There's nothing like being able to eliminate a lot of the stupid typos from consideration all in one go. 

System refresh: new architecture for www.dominic.cronin.nl

It's taken a while, and the odd skinned knuckle and a bit of cursing, but I can finally announce that this site is running on...erm.. the other server. Tada! Ta-ta-ta-diddly.... daaahhhh!!!!

Um yeah - I get it. it's not so exciting is it really? The blog's still here, and it's got more or less the same content. It doesn't look any different. Maybe it's a tiny smidgin faster, but even that's more likely to do with the fact that we switched over to an ISP that actually makes use of the glass that runs in to our meter cupboard. 

But I'm excited. Just a bit, anyway. Partly because it's taken me months. It needn't have, but it's the usual question of squeezing it into the cracks between all the other things that need to get done in life. That and the fact that I'm an utter cheapskate and I don't want to pay for anything. There's also plenty not to be excited about. As I said, the functionality is exactly as it was. The benefits I get from it are mostly about the ability to do things better going forward. 

So what have I done? Well it all started an incredibly long time ago when I started tinkering with docker. I figured that the whole containerisation technology thing had such a lot of potential that I ought at least to run docker on my own server. After all, over the years, I'd always struggled with Plone needing to have a different version of Python than the one available in the current Gentoo ebuilds. I'd attempted a couple of things, including I think an early version of what became LXC, but then along came virtualenv, which made the whole thing moot. 

Yeah, well - until I wanted to play with docker for itself. At this point, I just thought I'd install it on my server, and get going, but I immediately discovered, that the old box I was running was 32-bit, and docker is just far too hip to run on anything so old-fashioned. So I needed a new server, and once I'd realised that, that's when the whole thing started. If I was going to have a new server, why didn't I just containerise everything? It's at this point that someone inevitably chips in with a suggestion that if I weren't such a dinosaur, I'd run it on the cloud, wouldn't I? Well yes - sure! But I told you - I'm a cheapskate, and apart from that, I don't want anyone's soul-less reliability messing with my carefully constructed one-nine availability commitment. 

Actually I like cloud tech, but frankly, when you look at the micro-budget that supports this site, I'd have spent all my time searching out a super-cheap host, and even then I'd have begrudged it. So my compromise with myself was that I'd build it all very cloudy, and then the world's various public clouds would be my disaster recovery plan. And so it is. If this server dies, I can get it all up in the cloud with a fairly meagre effort. Still not going to two-nines though.

So I went down to my local high street where there's a shop run by these Indian guys. They always have a good choice of "hardly used" ex-business computers. I think I shelled out a couple of hundred Euros, and then I had something with an i5 and enough memory, and a couple of stupidly big disks to make a raid. Anyway - more than enough for a web server - which is just as well, because pretty soon it ends up just being "the server", and it'll get used for all sorts of other things. All the more reason to containerise everything. 

I got the thing home, and instead of doing what I've done many times before, and installing Gentoo linux, I poked around a bit on the Internet and found CoreOS. Gentoo is a masochist's delight. I mean - it runs like a sports car, but you have to own a set of spanners. CoreOS, on the other hand, is more or less maintenance free. It's built on Gentoo's build system, so it inherits the sports car mentality of only installing things you are going to use, but then the guys at CoreOS do that, and their idea of "things you are going to use" is basically everything that it takes to get containers up and keep them running, plus exactly nothing else. For the rest, it's designed for cloud use, so you can install it from bare metal to fully working just by writing a configuration file, and it knows how to update itself while running. (It has a separate partition for the new version, and it just switches over.) 

So with CoreOS up and running, the next thing was to convert all the moving parts over to Docker containers. As it stands now, I didn't want to change too much of the basics, so I'm running Plone on a Gentoo container. That's way too much masochism though. I'd already been thinking I'd do a fresh one with a more generic out-of-the-box OS, and I've just realised I can pull a pre-built Plone image based on Debian (or Alpine). This gets better and better. And I can run it all up side-by-side in separate containers until I'm ready to flip the switch. Just great! Hmm... maybe my grand master plan was just to get to Plone 5! 

The Gentoo container I'm using is based on one created by the Gentoo community, which you can pull from the Docker hub. Once I found this, I thought I was home and dry, but it's not really well-suited to just pulling automatically from a docker file. What they've done is to separate out the portage tree into a separate container. This is smart, because you are unlikely to want the whole of portage in your container for any given purpose that makes you want to run Gentoo. What you do instead is mount the portage data using docker's --volumes-from argument. With it mounted, you can run emerge and install whatever packages you need, and then at runtime you get to run a much slimmer system. Which is great, but it means you have to create and store your own image manually rather than using a dockerfile. (At least, that's how it ended up for a noob like me, once I realised that dockerfile doesn't have an equivalent of --volumes-from.) 

My goal was to set up CoreOs to automatically pull the docker images it needed, and run some setup commands. This meant that I'd need to have my personalised Gentoo image available somewhere. Some of the data in there was sensitive, so I went looking for a private Docker registry that I could upload it to. There are plenty of private registries, but most of them aren't free. (If you don't mind the whole world pulling your containers, then free registries abound.) I eventually found https://canister.io/, which suited my needs. That said, my needs aren't much. If I ever need an alternative to canister, I'll probably look at Google Cloud Platform, which isn't free but has a private container registry where you only pay for storage and data egress, at pretty reasonable rates. Or I could just host it myself, but that's maybe too many eggs in the same basket. 

Meanwhile, my very next step ought most probably be to get backups sorted out. The "Dockerish" way to do this is to run up yet another dedicated container to deal with just this concern. Then if I want to host it separately, and my backup approach changes, nothing else needs to. Once I have the backups sorted out, it will definitely be worth the while to tidy things up so that I really can just push to the cloud if needs be. The way it's set up now, I could be up and running again very quickly but we're probably talking hours rather than seconds. 

I'm really enjoying the flexibility that containerisation gives me, although it's definitely important to get into the right mindset. Being able to build containers that will run on a really generic platform is quite liberating.

Hotfix rollups are the new Service Pack

I was recently surprised to learn that a Hotfix Rollup shipped from SDL Tridion is something quite different to what you'd expect from the title. For at least the last 10 years, and probably longer, the distinction between a hotfix and a service pack was very simple:

Service Pack

A collection of product improvements shipped between full version releases. The improvements would include bug fixes, and possibly new features, but never "breaking" changes. The intention was that customers should install the latest service pack for their current version. The service pack would have been thoroughly tested by R&D and would be the basis for on-going support until the next release.

Hotfix

If an issue was found in software in the field, a hotfix could be created to address this issue. There wouldn't be an installer - just some files and some instructions. Often a hotfix would be seen as suitable for any customer to install, but other hotfixes were riskier, and if you didn't have the problem, installing the hotfix would be a bad idea. Hotfixes were tested by customer support. The next service pack or full release would supersede any hotfix. In a reasonably thorough risk-management strategy, the standard play was to avoid taking hotfixes until you needed them. The official advice from Tridion as of 2011 was this:

IMPORTANT NOTE: Hotfixes are released at the discretion of SDL Tridion based on technical complexity, customer business requirements and schedules. Hotfixes are made and tested only for the described problem on a particular environment/configuration and therefore should only be installed if approved by SDL Tridion Customer Support. Hotfixes should be replaced as soon as possible by the subsequent service pack where the problem is fixed.

And then along came Hotfix Rollups...

Hotfix rollups

You might be forgiven for thinking that a hotfix rollup was, well a sort of erm... roll-up of hotfixes. A collection of hotfixes. A gathering together of a handy bunch of hotfixes to make life easier for the less risk-averse who like to install everything. (Like me, when I'm installing my own dev image. Love the handiness of it.) That's what the name means in any normal interpretation of the English language. The point here is that this is not what SDL Tridion mean when they say Hotfix Rollup. From discussions with various SDL people, it seems that they see a hotfix rollup as having the following characteristics:

  • It is not expected to cause any problems on your system and can safely be installed.
  • To this end, it has been tested by the relevant specialists in R&D
  • In the same way that you are expected to install a service pack, you are expected to install a hotfix rollup. Should further hotfixes become necessary, they will have the hotfix rollup as a dependency, not specific hotfixes. (This means that if you need that hotfix, you'll end up installing the hotfix rollup too, probably at a moment that you'd prefer to have chosen yourself.)

 

This is my best understanding at the current moment, but I am not aware of any formal communication from SDL that makes this clear, or otherwise updates the advice from 2011. Obviously, feel free to get formal confirmation via the usual channels

And as for you, SDL: your customers' risks are not your risks. You owe it to your customers to communicate correctly and in a timely way about this kind of thing. If anyone thought this would engender trust and confidence, that person was not thinking clearly. I wouldn't be saying this, but people out in the field often spend significant effort trying to balance risks like this, and it's in all our interests to make sure it goes well.

One-nine availability

Posted by Dominic Cronin at Aug 16, 2014 09:43 AM |

A couple of weeks ago, this site went down. That happens from time to time. It went down just as I left the country to go on holiday, and it could only be fixed via physical access, so it was down for a week. At least one person has commented that maybe I should stop with this silliness of running my own server on an old Gentoo box in the meter cupboard, and get some proper hosting.

The thing is, that when I started this blog, some years ago now, I went through a detailed requirements analysis, and a full MoSCoWMeh matrix. If you aren't familiar with MoSCoWMeh, this is an enhanced variant of the well-known MoSCoW technique, which also accommodates the needs of private and hobby-run systems.

The requirement for reliable hosting and 5-nines up-time was classified as Meh, and has remained so since. So now you know.

Keeping your feet dry

Posted by Dominic Cronin at Feb 14, 2014 11:10 PM |

When designing and implementing web-content-managed web sites with Tridion, the usual arrangement is to have at least four distinct environments, designated for specific purposes. Development, Test, Acceptance, Production. Often we refer to this as the DTAP street. Each environment has its own peculiarities. The production environment serves web pages to the visiting public, so will have at least some servers in the "demilitarized zone". There will probably be multiple web servers behind a load balancer, and particular attention will be paid to defences against the ne'er-do-wells of the Internet. The acceptance environment will be used for the final testing of software releases before they are allowed on to the production system. If the production system is load-balanced, so will the acceptance system be, and lots of attention will be paid to ensuring that the A-environment is a truly representative copy of P. The hardware will be close to identical, and all software will be patched to exactly the same levels as on the production system (unless, of course, a patch is being rolled out, in which case this will take place first in A). The other two environments belong to the development team. The Test environment is used for testing during a development project, to ensure that the necessary quality levels are achieved before moving on to acceptance testing. New versions of the software may be frequently deployed - perhaps daily or several times a day. In general, the environment will be maintained to be a good representation of the Production environments, but not to quite the same levels of obsession as for the Acceptance environment. The Development environment will be quite similar to the Test environment, but is likely to have extra software installed for use in development. Programming software, automated build and test software - that kind of thing. Typically the programmers will have more access privileges in the Development environment than in the other environments. Depending on the organisation, they may well be system administrators in D, and have significant privileges in T. This makes sense, because often they will need to try new approaches, and set up new configurations, or perhaps they might need to attach a debugger to the running software to analyse its processing.

All of this adds up to a significant investment. There will be an entire team of people busy for quite a while to get this set up and to maintain it. Hardware (although often virtualised these days - still complex enough), software - operating systems, databases, security, etc, etc, Then you have to add in all the work of simply managing the whole thing. It's not cheap, and then on top of that, the licenses usually aren't free. So there's a temptation to cut corners. This can mean missing out an environment entirely, or even two - although even the most miserly will usually draw the line at doing development work on the same system that serves the public. It can also mean taking shortcuts in configuration expenses. Maybe you can't afford to have your system administrator spend his time making special configurations for the development environment. The thing is, making sure everything is done right can be unpalatably expensive. So, of course, the first thing to do is ensure that such an expensive set-up is a good fit for your needs. For the vast majority of web sites, you definitely don't need a high-end enterprise web content management system like Tridion. If you do, however, then it's probably a pretty good sign that the expense of running a proper DTAP street is also worth it.

But what if you want all that goodness without having to pay for it? Well in that case, the responsible technicians need to make it clear what the trade-offs are. You can save money, but it's a gamble. The problem is that getting these things working doesn't just cost money. In an emergency, you can usually get more of that. The trouble is, that it also costs time, and the definition of an emergency is that you don't have any of that spare. So - imagine a situation where you would like to be able to debug a problematic piece of software, but your security requirements are pretty heavy, and cast in procedural concrete. So you attempt to set up the necessary tools in your development system, but it doesn't work. To get it working, you estimate you'll need to spend a couple of days of research (say - a day each for a developer and a sysadmin). Maybe it's twice that, maybe it's half, and maybe you need to write a report on all the possible approaches, and have it approved by a committee of architects. Whatever - it's more expensive than you'd like... so you choose not to do it. This is the point at which clear communication is essential.

Living, as I do, in Amsterdam, I can't help feeling just a bit smug as I listen to the news on the radio. In England, the Somerset Levels have been flooded for a long time, and it's still raining. The amount of rain landing on the South of England, and on Wales just now is more than normal - that is to say, it only comes down like that a few times in a century. The people of the Somerset Levels are complaining vociferously that maybe they'd have stood for being flooded for a week, but the water won't go away so they've had it for weeks and weeks. Now some people near London are getting their feet wet too, so suddenly it's important. :-) On my way home, I felt some of that weather. The rain was lashing down, with a pretty solid wind behind it. Was I worried? Not in the slightest, even though I live at least as far below sea level as the people of the levels. You see, here, if the weather gets like that, the only real effect you'll see, is perhaps some smoke coming out of the chimney at your local friendly pumping station. The whole landscape is littered with places for water to go. Every little canal or pool has big sloping sides that will accommodate several times the normal amount of water.

So - when it rains in Somerset (nothing personal, folks), they get their feet very wet indeed, and have some very uncomplimentary things to say about the government's Environment Agency, whose job it is to build and maintain the DTAP street. Various government agencies turn up to provide sandbags. When it rains here, the pumps kick in, and we're good. Sometimes I find it hard to articulate to budget holders exactly why I'd like them to spend money on the odd pumping station that's never really going to get used, is it? I mean come on, what are the chances? Did I say pumping station?

Seriously - if you're going to cut corners on your infrastructure, make sure all your stakeholders know the difference between Somerset and Amsterdam.