Recovering when Google Apps fails the Enterprise

This is a story about IT, the cloud and all the great and terrible things that happen when the two come together. In many ways cloud computing and Software as a Service (SaaS) are a blessing for understaffed, overworked Information Technology departments that are expected to deliver the best technology has to offer with 100% uptime and no budget. With the emergence of web-enabled services from major vendors like Amazon, Microsoft, and Google, the age-old question posed to IT professionals, “Is it free?”, can finally be answered with a resounding yes! (At least for the basic version.)

But as many of our grandparents have said, the old adage still holds: you get what you pay for. Sure, the service might be free, but with no service level agreement (SLA) there is no recourse when there is an email outage, such as the six or more outages that affected Google Mail in 2009. But worry not, end users! There are paid upgrades available that include an SLA guaranteeing 99.9% uptime as well as phone and email support for critical issues. At $50 per user per year this sounds like a steal!
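
To put that 99.9% figure in perspective, here is a rough back-of-the-envelope calculation of how much downtime such a guarantee still permits. It is only a sketch: the function name is made up for illustration, and it assumes downtime is measured against plain calendar hours, which may not match how a given provider's SLA actually defines or measures it.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime that still satisfy the given uptime percentage."""
    return period_hours * (1 - uptime_percent / 100)


if __name__ == "__main__":
    yearly = allowed_downtime_hours(99.9)
    monthly = allowed_downtime_hours(99.9, period_hours=30 * 24)
    print(f"Per year:  {yearly:.2f} hours")
    print(f"Per month: {monthly * 60:.1f} minutes")
```

Under these assumptions, 99.9% uptime leaves room for a little under nine hours of outage per year, or roughly three quarters of an hour in a 30-day month.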

What exactly do they mean by uptime and critical?

It all looks great on paper; there is no need to maintain physical hardware assets, no IT employees to pay, and the cost is less than licensing alone for most standalone solutions! Unfortunately, these marketing guarantees look like vaporware when the data hits the router.

To illustrate the shortcomings in the cloud, let’s use a real-life example. A not-for-profit customer we’ll call “Charity X” wanted to reduce overhead and improve services for their staff. After over a decade of handling email and services internally, they decided it was time to move to the cloud. Fortunately for them, as a not-for-profit institution they qualified for the non-profit edition of Google Apps. Great!

Here are the benefits they sought:

  1. We can reduce our bandwidth use by not handling email ourselves.
  2. Our server resources will be freed up and can be used for other purposes.
  3. We can provide better webmail for our employees.
  4. We get access to shared resources such as contact management and documents that are ‘in the cloud.’

The reality, though, was that this change wasn’t as simple as they had expected. It wasn’t just a matter of clicking “migrate users.”

That’s when Charity X called me.

I began by configuring their Google Apps account and going through the process of upgrading it to the non-profit edition. I migrated their users’ mail using the built-in migration tools, and I synchronized the user accounts with the Google Apps Directory Sync tool (GADS). And that is where the utilities fell short. It appears that Google thought about the idea of users synchronizing their Active Directory passwords with Google Apps. I emphasize “thought,” because apparently that is as far as Google got. Want Single Sign On? Want synchronized passwords using LDAP or AD? Time to turn to third-party tools. Okay, fair enough. Third-party tools are available to do this, and they work. No harm, no foul. But a little more documentation on the subject wouldn’t hurt.

Then all hell broke loose. A handful of users could not log in to their Google Accounts. So the hunt for the cause began.

Were they suspended? No!

Were they using the wrong password? No!

Could I update their password? No?

Strangely, I could not update their user accounts using the web interface. Any change (password, nicknames, email routing, names) resulted in an unknown error #1000.

Okay, I thought. It says to try again later, but I should be able to update their password using GADS.

GADS FAIL — 1301 Entity does not exist!

The GADS logs indicated that the user did not exist. That was unexpected, since I had just looked at the user’s information in the web interface and it most certainly did exist.

Ticket Time

Time for that support to kick in! Since this was a non-profit edition, support was included. So I clicked support and followed the routing questions. Google determined that this was not a ‘critical’ issue since the service was not down completely, and I can’t say I disagreed. Google said the only support option available was esupport.

I filled out the form, and was quickly greeted with a friendly email from one of their customer service representatives. The representative worked with me to quickly determine that this issue was beyond his level of expertise and escalated the ticket to a specialist.

So far so good.

I then waited for 3 days. I received inquiries from staff about when they could check their email. When would it be fixed? They had deadlines and important matters that needed to be dealt with in their inboxes.

I asked support, which responded that they “don’t know any workarounds at the moment.” The only possibility that we were able to come up with was to delete the account and recreate it, which was not a very good solution since any mail arriving between the deletion and the time the account could be recreated would be lost.

So we waited some more. The email could not even be forwarded for affected users — any changes to email routing resulted in an unknown error.

A week passed, so I posted to Google’s support forum hoping someone, anyone, would have a suggestion on how to fix or work around the issue. User after user reported similar problems since Google’s transition of Apps users to Google Accounts, but there were no resolutions other than to wait for manual intervention by a Google employee.

Time keeps on ticking

After eight days of waiting I came to a solution on my own.

The accounts appeared to be only partially created: they showed up in the web interface and could receive email, but they could not be modified or access services. That suggested there had been an error in how the accounts were processed during creation. It seemed clear to me that the problem was an inconsistency in Google’s infrastructure, with two systems disagreeing about whether the accounts existed. Removing the accounts would allow the systems to agree on the accounts’ status, but could they both be made to agree that the accounts existed? Perhaps I could force Google to recreate the already existing accounts without losing the current data.

I attempted to rerun the transition to Google Accounts on the affected users manually, but despite the auto-fill recognizing their usernames, when I clicked next I was given an error that the accounts did not exist.

My next approach was to create new users with the same names using the dashboard. This time I was told there was an error because the accounts did exist. Clearly there were two separate systems being queried.

It occurred to me that I might be able to get more useful error messages by going to the command line. So I installed Google Apps Manager, which is also known as GAM. GAM uses the Google Apps Provisioning API to interact with the Google Apps account.

I ran a number of queries on accounts I knew were affected using GAM to see what happened.  While queries would return data about the username, nicknames, and whether they had agreed to the terms of service, the users could not be modified, returning the error “1301 entity does not exist”. So I tried to create the user again, which had an unexpected result.

ServerBusy(1001)

Strangely, I was told that the system was busy. So I waited and tried again.

EntityExists(1300)

As I suspected, it told me the account already existed. But out of curiosity, I tried to log in.

It worked

For whatever reason, I was now able to log in to the affected account. I tried another account.

ServerBusy(1001)

Thinking maybe this error meant more than it was letting on, I attempted to log in to this second account before trying again. This account also now worked as expected.

I checked the dashboard and the web interface now worked as expected on these accounts as well.

While the response was strange, whatever happened on the server end appeared to correct the problem, so I ran a script that repeated the create attempt for every affected user, which resolved the problems on the remaining accounts.
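
For anyone who wants to try the same trick, the retry logic can be sketched roughly as below. This is not the script I actually ran and it does not call GAM or any real Google API: `create_user`, `ProvisioningError`, `nudge_account` and `nudge_all` are hypothetical names invented for illustration, and you would have to wire `create_user` up to whatever provisioning tool you use. Only the two error codes, 1001 and 1300, come from the story above.

```python
import time
from typing import Callable, Dict, Iterable

# Error codes as they appeared in the responses described above.
SERVER_BUSY = 1001
ENTITY_EXISTS = 1300


class ProvisioningError(Exception):
    """Hypothetical exception carrying a provisioning error code."""

    def __init__(self, code: int, message: str = "") -> None:
        super().__init__(f"{message}({code})")
        self.code = code


def nudge_account(username: str,
                  create_user: Callable[[str], None],
                  max_attempts: int = 3,
                  delay_seconds: float = 30.0,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Try to re-create an existing account and report how the server answered.

    Returns "exists" once the server says the account exists, "busy" if every
    attempt came back busy, or "created" if the call unexpectedly succeeded.
    Any other error code is re-raised untouched.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            create_user(username)
        except ProvisioningError as err:
            if err.code == ENTITY_EXISTS:
                return "exists"
            if err.code != SERVER_BUSY:
                raise
            if attempt < max_attempts:
                sleep(delay_seconds)  # wait before trying again
        else:
            return "created"
    return "busy"


def nudge_all(usernames: Iterable[str],
              create_user: Callable[[str], None],
              **kwargs) -> Dict[str, str]:
    """Run nudge_account over every affected user and collect the outcomes."""
    return {name: nudge_account(name, create_user, **kwargs)
            for name in usernames}
```

Whatever the outcome string, the real test is whether the user can log in afterwards; in my case one account started working after nothing more than a “busy” response.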

The cloud needs shock absorbers

Charity X is now working merrily away using Google Apps, Docs, and other tools in the cloud, but the transition has made it clear to management that there isn’t a magic oven that you can just “set and forget” when it comes to IT. While there is still great value and savings to be had in the cloud, this kind of downtime is disconcerting to me as an IT administrator. When I run my own systems, I know what is happening behind the scenes, and if things go wrong I know where to look for the answers. With hosted services I am at the mercy of third parties to provide adequate documentation and timely support. Even if my work is not the cause of the problem, I am still expected to deliver solutions to my customers quickly. When I can’t look under the hood, I need partners that can resolve problems in hours, not weeks. Customer support means providing actual solutions. Simply acknowledging a problem is not enough.

When cloud providers are vying for access to lucrative contracts with state and federal agencies and large corporations, it is important that they remember that each customer counts. When a user can’t access their email it might not be considered critical to Microsoft or Google, but it is a critical issue to that user. A week of downtime is simply unacceptable to users, especially if the affected user happens to be a decision maker.

I look forward to cloud services that work like a magic oven, but for the foreseeable future I see only job security in the cloud.