Disaster Recovery

What Happens After the Unexpected Occurs

We have all participated in activities designed to make sure we are ‘ready’ when the unexpected occurs. The thing about the unexpected is, there is no way to really be ready. What’s more, there is no drill for what happens AFTER a disaster. What can you do to make sure your institution can not only survive a disaster, but recover quickly?

 
 

Life’s a Beach (Or Some Other Bumper Sticker)

Each of us likes to imagine ourselves sitting beside the Gulf, letting the water lap our feet, a fruity drink in hand. It’s an easy lifestyle. It's the dream setting, the freedom to step away (or retire) without fear that things will fall apart in our absence. The days, weeks, or even years of planning and preparation to retreat and enjoy our destination of choice can be strenuous, chaotic, and sometimes filled with uncertainty. 

However, planning and preparation will only get you so far. Even if you reach this destination, the larger question looms: are you able to be present? Are you able to soak it in, sit and do nothing, and enjoy your time? With all of the interruptions and ways to reach each other via technology -- 24 hours a day -- can you really escape? 

If you serve as an administrator at an institution, you may be responsible for the health, maintenance, and stability of the IT infrastructure, or the upkeep of the grounds and facilities, or the general well-being of 3,000 students living on your campus. At what point is it possible to step away and distance yourself, if even for a week?

These are the questions that lead us to our upcoming series on Disaster Recovery and Planning.

We believe there is a clear path towards finding that stability and reliability that every institution deserves. After all, you deserve it! Your team deserves it! Everyone deserves that break and a flight to the Keys with loved ones for a few days (or a few weeks). Let’s face it: we are connected. Everywhere, at any time. The reality is, we can enjoy our dream setting now with just a little bit of preparation and planning and support. And the best news is, we don’t have to wait until our retirement plan is in repayment. 

This is our primer on Disaster Recovery. We spoke with many in the Higher Ed community, and we've heard stories of how some folks are managing resources more efficiently than ever (for little to no additional overhead). Best practices are ripe and ready to be employed by institutions willing to make the time. If, as you follow along, you have questions or comments, please let us know. We offer this guide not as a map to success, but rather as an approach to helping your institution become less dependent on the resources that are required to keep things running on a day-to-day basis, and more focused on institutional effectiveness as an operation. 

By considering the possibilities of what could happen in a worst-case scenario, it becomes easier to lay out a framework and a road map for how to react if and when the time comes. More importantly, you will find that you are free to address the immediate needs of the institution, the students that each of you service, and the teams that you lead.

Effective planning will allow you the opportunity to be present, no matter what your dream setting. It's time to put that plan in motion!

 

Not All Data is Created Equal

The concept of a disaster recovery plan is simple: ensure business continuity in the wake of an emergency. If something goes wrong, how fast can order be restored?

The reality of a sound DR plan is slightly different. If one is created, the likelihood that it’s maintained, tested, and updated regularly is less certain. New systems and processes are defined throughout the course of an academic year, and so it can become challenging to pause and ensure that all aspects of business function are secure.

We don’t want to paint a picture of doom and gloom, but everyone knows that a multitude of risks exists. The good news is that getting started isn’t as time-consuming or difficult as the nature of the task may suggest. In fact, if you are interested in updating your disaster recovery plan, this series will guide you. Spend about 30 minutes over two business days and concentrate on identifying your institution's most important systems. We believe that in less than an hour, your team can initiate a clear path to an updated disaster recovery plan.

Prioritize Recovery

To formulate a solid disaster recovery plan, one must begin by identifying a list of critical business functions and supporting systems. From that list, determine the level of importance for each system in production. For example, if a system like Colleague or Banner were to go offline during a power outage and its data were corrupted, business functions would cease. If the website were to go offline, however, the impact would be more marketing-related: it would affect only those trying to gain access to certain materials, and it would not render the institution inoperable.

Once each system has been assigned a level of importance, consider whether each is mission critical and then assess the complexity of rebuilding or restoring services were each to go offline unexpectedly. Force yourself to list systems from most to least important in a single-file list, even when it feels like two systems are equally important. (Your budget works linearly, and so should your planning.)

To take the plan one step further, consider starting a spreadsheet that contains specific information about each system identified in your exercise. With just a few key metrics, the institution can prioritize each system. This list will form the basis of the disaster recovery plan. (Don’t worry about adding every system. If you can’t think of a system right away, chances are that it will go on the bottom of the list, and it can be safely added later.) Assign each system to a matrix structure:

  • System Name

  • Function

  • Audience/Stakeholders

  • Priority

  • Complexity to rebuild

  • Mission Criticality

  • Life and Death: Will your college exist tomorrow if the system is lost?
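
As a minimal sketch of the matrix above, the spreadsheet could start life as a short Python script that writes a CSV file. The field names mirror the bullets; the sample rows, the 1-to-10 complexity scale, and the file name dr_matrix.csv are illustrative assumptions rather than recommendations.

```python
import csv
from dataclasses import dataclass, asdict, fields


@dataclass
class SystemEntry:
    """One row of the disaster recovery matrix."""
    name: str
    function: str
    stakeholders: str
    priority: int            # 1 = most important; ties are not allowed
    rebuild_complexity: int  # assumed scale: 1 (easy) to 10 (hard)
    mission_critical: bool
    life_and_death: bool     # would the college exist tomorrow without it?


# Hypothetical sample rows; replace with your own systems and judgments.
matrix = [
    SystemEntry("Website", "Public information", "Prospective students", 3, 4, False, False),
    SystemEntry("ERP", "Registration, finance, HR", "Entire college", 1, 9, True, True),
    SystemEntry("Email", "Campus communication", "Faculty, staff, students", 2, 5, True, False),
]


def ranked(entries):
    """Return entries as a single-file list, most important first."""
    priorities = [entry.priority for entry in entries]
    if len(set(priorities)) != len(priorities):
        raise ValueError("Two systems share a priority; break the tie.")
    return sorted(entries, key=lambda entry: entry.priority)


def write_matrix(entries, path="dr_matrix.csv"):
    """Write the ranked matrix to a CSV file that opens in any spreadsheet."""
    column_names = [field.name for field in fields(SystemEntry)]
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=column_names)
        writer.writeheader()
        for entry in ranked(entries):
            writer.writerow(asdict(entry))


if __name__ == "__main__":
    write_matrix(matrix)
```

Rejecting tied priorities in ranked() is one way to enforce the single-file ordering; a plain spreadsheet with a sorted Priority column serves the same purpose.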

Every institution should place a core set of services on the list: its ERP (Colleague, Banner, etc.); email server; network infrastructure; document imaging system (e.g., SoftDocs); content management and file shares; learning management system and other online course materials; website server(s); and telephone systems.

In some instances, communications might fail, but consider each system’s importance in relation to what it might take to rebuild.

Assess Cost

Not all disaster recovery plans come with a hefty price tag, although there are some costs to consider. As one can imagine, this is a discussion worth having before any systems are in the crosshairs and teams are placed under pressure to restore business operations. This exercise can also help provide a business case for working with stakeholders to understand the importance of what is being backed up and the costs associated with prioritizing certain systems over others. With the advent of cloud-based services (e.g., Amazon Glacier, Amazon Web Services, Google Apps) and other infrastructure options, institutions can consider allocating funds in their operational budgets and prepare to make resources available before an emergency.

As an example, for a mid-sized institution running Colleague, maintaining a secondary Colleague server capable of assuming basic college functions will cost between $20,000 and $60,000 every four years (if funded with capital improvement cash). This type of system should take approximately one month to set up and require about four hours per week of staff attention to maintain.
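
To make those figures easier to compare against an annual operating budget, the arithmetic can be sketched as follows. The dollar amounts and the four-hour weekly estimate come from the example above; spreading the cost evenly across the four years and counting 52 maintenance weeks per year are simplifying assumptions.

```python
def annualized(total_cost, years):
    """Spread a periodic capital cost evenly across its refresh cycle."""
    return total_cost / years


REFRESH_YEARS = 4
LOW_COST, HIGH_COST = 20_000, 60_000  # dollars per four-year cycle (from the example)
STAFF_HOURS_PER_WEEK = 4
WEEKS_PER_YEAR = 52                   # assumption: maintenance continues year-round

low_per_year = annualized(LOW_COST, REFRESH_YEARS)
high_per_year = annualized(HIGH_COST, REFRESH_YEARS)
staff_hours_per_year = STAFF_HOURS_PER_WEEK * WEEKS_PER_YEAR

print(f"Secondary server: ${low_per_year:,.0f} to ${high_per_year:,.0f} per year")
print(f"Staff attention: {staff_hours_per_year} hours per year")
```

Under these assumptions the secondary server works out to $5,000 to $15,000 per year, plus 208 hours of staff time annually.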

 

The Matrix

In a previous post on disaster recovery planning, we introduced the concept of identifying and prioritizing each system on campus that would have a major impact on operations if failure were to occur.

Following the exercise that was presented, institutions should now have a clear picture of what challenges might lie ahead in the wake of a major campus systems outage.

We encourage readers to build upon that initial list in this week’s post and craft a matrix to help guide the decisions that will need to be made.

This week, we are going to ask you to commit to a plan to get to a Plan. As a follow-up exercise, update last week's spreadsheet to include an additional column: Preparedness.

In the new column on the spreadsheet, ask your team, “In terms of preparedness (on a scale of 0-100), how close is the institution to being able to restore each system quickly and effectively, based on the priorities that have been set?” Are you near your goal, or is there a lot of work to be done before feeling more confident in your team’s response? 

These are all judgment calls. There is no right or wrong answer in how to complete the spreadsheet. In fact, this exercise is designed to give you the freedom to make open, honest assessments of your environment. That said, this should not be taken lightly: this is your institution’s sustainability and future infrastructure health. The answers that your team provides will help guide decisions to ensure that the business is secure and its path aligned with its mission.

We understand that planning for disaster recovery takes effort. We also realize that the true initial costs are in the time it takes to plan. When we spoke with Isothermal Community College CIO Robby Walters earlier this spring, he told us that part of what drove his institution from idea to action was simply trying to "prove that we can do it without any cost involved." Isothermal's team has been working in conjunction with a sister institution to co-locate and implement a disaster recovery plan for its entire IT infrastructure. 

This series on disaster recovery is being written with that exact premise in mind: When emergencies occur, how can one ensure business continuity? The path ahead is clearer than you might imagine.

 

So, What’s the Punchline?

With a list of critical systems, a matrix of priorities, and a measure of preparedness, it’s now time to create goals around the plan. Focus on what can be done in the short term to develop and/or improve a response to how restoring each system will help ensure business continuity. Once that is completed, make notes on how the plan might be improved from year to year.

We’ve worked with scores of CIOs over the years, and without overgeneralizing, each can be placed into one of two camps: those who say, “I have a plan,” and those who will tell you, “I don’t have one (or it’s not good enough).” The difference between those who know they have a solid plan and those who think their plan isn’t good enough is at the core of understanding the purpose of a disaster recovery plan.

A good plan is one that is living and continues to evolve over time. The reality is that a disaster recovery plan is never done. Instead, it should serve as a tracking of improvements year over year. Its purpose is to identify a misalignment of resources and resource starvation.

At a very high level, concentrate on the plan in terms of the matrix that was generated. From there, you should be able to place priorities on critical systems that lack the resources to restore service quickly and effectively, and then set budgeting goals for the upcoming year to address those systems first.
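
One possible way to turn the matrix into a budgeting order is sketched below: keep only the systems whose Preparedness score falls short, then list them by priority. The cutoff of 70 and the sample scores are hypothetical; the 0-100 scale is the one introduced with the Preparedness column.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    priority: int      # 1 = most important
    preparedness: int  # 0-100, as scored by your team


def budget_targets(rows, ready_threshold=70):
    """Return under-prepared systems, most important first.

    ready_threshold is an assumed cutoff; pick whatever score your
    team considers "prepared enough".
    """
    lacking = [row for row in rows if row.preparedness < ready_threshold]
    return sorted(lacking, key=lambda row: row.priority)


# Hypothetical scores for illustration only.
rows = [
    Row("ERP", 1, 40),
    Row("Email", 2, 85),
    Row("Learning management system", 3, 55),
    Row("Website", 4, 90),
]

for row in budget_targets(rows):
    gap = 100 - row.preparedness
    print(f"{row.priority}. {row.name}: {gap} points short of fully prepared")
```

With these sample scores, the ERP and the learning management system would be the first candidates for next year's budget goals. Rerunning the same check each year gives a simple record of improvement over time.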

Unlike our previous posts on disaster recovery planning, there’s no real “assignment,” but we are available to meet and help review this idea. Please feel free to schedule a phone call with us if you’d like to discuss these ideas in more depth.

 

Low Hanging Fruit

Our series on disaster recovery planning has challenged you to create an initial list of systems, ranked by criticality to the institution. The overall grade you assigned to those systems, subjective as it may be, is the correct number. Do not second-guess your methodology, and do not re-rank systems. Because of your institutional knowledge, the background information that you possess about these systems is more relevant than any standard method you can find.

Now, we will move to addressing low-hanging fruit. By low-hanging fruit, we mean the least important system(s) in your list. Typically, an institution's least important systems are either homegrown applications or those procured from external vendors to serve a specific business function on campus (not necessarily those that have institution-wide impact). Placing emphasis on these types of systems at this phase can help your team become comfortable with discussing the topic with stakeholders and serve as an easy introduction to justifying cost and importance externally. Document the cost of redundancy, security, and performance in terms of the following metrics:

  • Likelihood of failure

  • Time to return of service given a light interruption

  • Time to return of service in the event of a severe interruption

  • A brief explanation of bare metal restoration (total loss)

Example: IT Ticketing System

  • Likelihood of failure (3 out of 10)

  • Time to return of service given a light interruption (2 hrs)

  • Time to return of service in the event of a severe interruption (6 hrs)

  • Time to return of service in case of total loss (1 day)
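
The IT ticketing example above could be captured as a small record so that every system is documented in the same shape. This is only a sketch: the class and field names are hypothetical, and the values are the ones from the example.

```python
from dataclasses import dataclass
from datetime import timedelta


def hours(duration):
    """Convert a timedelta to a number of hours."""
    return duration.total_seconds() / 3600


@dataclass
class RecoveryProfile:
    system: str
    failure_likelihood: int         # out of 10, as in the example
    light_interruption: timedelta   # time to return of service
    severe_interruption: timedelta  # time to return of service
    total_loss: timedelta           # bare metal restoration

    def summary(self):
        """One line suitable for a stakeholder handout."""
        return (
            f"{self.system}: likelihood {self.failure_likelihood}/10; "
            f"light interruption {hours(self.light_interruption):g} h; "
            f"severe interruption {hours(self.severe_interruption):g} h; "
            f"total loss {hours(self.total_loss):g} h"
        )


ticketing = RecoveryProfile(
    system="IT Ticketing System",
    failure_likelihood=3,
    light_interruption=timedelta(hours=2),
    severe_interruption=timedelta(hours=6),
    total_loss=timedelta(days=1),
)

print(ticketing.summary())
```

Storing the durations as timedelta values keeps hours and days comparable across systems, which helps when the same exercise is repeated further up the list.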

Explain the circumstances under which each could happen. With this list in hand, you can start a real conversation with your peers. Important: Scaring others is never a good strategy. Make sure you ease everyone into the conversation by explaining the purpose of this exercise. 

After you have completed the exercise with the first system in your list, move down the line to the next system, increasing priority.  

You know your institution best, and you should make a judgement of how many of these systems you will address at the same time with your peers.

 

The Magic of Prudence

Now that the less important systems have been fully documented, we focus on qualifying the systems in your list that may require challenging conversations. These are the systems that can stop your business in its tracks if failure occurs.

The systems in the middle order of your list are of no less importance than those at the top. Their lifecycle and the actions related to them will come naturally as you work to address the systems at the opposite ends of the spectrum. Furthermore, the middle systems play a very important role in this process. Because of their functions, their resources are best positioned to be demoted to help the low-end systems or promoted to serve higher level purposes. (We will discuss this in detail in a future article.)

Go through the same exercise that you used with the low-end systems to evaluate the most important systems. The biggest thing to remember is that each of these systems is scary to the executives in your business. Administration rarely understands the complexities of these systems, but they do understand that the systems are complex and important to everyday business functions. 

It is never a good strategy to list issues related to these systems without providing at least three plans for resolving them. You may even want to begin your conversation about these systems by mentioning that there should be no apprehension leading into planning. The report that you've created should not be a recipe for impending doom, but rather presented as the foresight of a prepared professional in keeping the business sustainable.

Document a Deliverable

Write a paragraph about the purpose of each system. As you target an executive audience, consider engaging others on campus who may be fluent in marketing their ideas, whether an academic or a staff member responsible for another department on campus. Ask them to read the document, and solicit their input. They will tell you what, if anything, they don’t understand. The end product should be clear and concise. (Technical personnel may not be the best at presenting a business case and overall importance, as their understanding of these systems is very technical in nature and often biased toward the technology.)

Below is an example of a possible document structure.

Disaster Recovery Plan

Most Important Systems

I – Colleague

Purpose: Central repository of information for all functions of the college. It controls registration, student accounts, accounts payable, accounts receivable, financial aid, scheduling, HR, admissions, and more.

Description: The Colleague system is the single most important system at the College. It is essentially a database of information (people, registrations, classes, employees, bills, etc.) and software (Colleague) designed to enforce security (who can do what), business rules (how can something be done), and accuracy (check this before doing that) for just about every function of the college. It enjoys the highest attention from staff and resources from the College of any of the systems we have.

Goals: We want to create a hot DR site for this system. This means we will have a twin Colleague system with a copy of the software, data, and everything else Colleague needs on a secondary site away from the data center at the College. If something unexpected happens to Colleague in our data center, this hot site will allow us to return business functions to the College in a short period of time while the unexpected failure is addressed.

II – SoftDocs

Purpose: (similar write up, etc.)

III – Card System

Purpose: (similar write up, etc.)

In order to remain consistent, be sure to use this same document and format often and always. Information is usually best absorbed when presented consistently and repeatedly. So, every time you speak about systems and disaster recovery, re-distribute the same document, including any relevant updates. 

As professionals, your knowledge and experience will help guide these exercises. We do not believe everything that we're saying in this series is new to you or your team. Our only purpose is to provide you with one or two things that may be useful on how to approach your challenge.

 

Present Preparedness

With a completed disaster recovery planning document in hand, the time has arrived to present the information gathered to the entire executive cabinet of your institution. Now, when an emergency is not at hand, is the time to make everyone aware that it takes the proper preparations to avoid an emergency. Just as you have no choice but to provide them with service, they have no choice but to care. They need to understand the state of the technology in the business. This is a prudent exercise that places attention on the future state of operations, and one that should actually instill some sense of pride for putting the effort into ensuring that business continuity is being addressed ahead of some unforeseen emergency.

Just remember that a lack of reaction to the plan is not necessarily bad, nor should it catch you off guard. Consider, however, asking every executive to initial the plan to indicate that it properly addresses their needs. The purpose is not to say, “I told you so” in the future, but rather to ensure that each member of the cabinet reads the document today. You now possess the flexibility to clearly communicate the impact of a change to the plan. With a firm list of priorities, costs, dependencies, and importance of every key system on campus, you should be able to speak to how funding impacts performance and reliability. 

Using the matrix example, let’s focus on a typical legacy email system: The license for the legacy system costs the institution $20K/year, and a backup and maintenance plan costs an additional $15K annually. As part of your discussions with administration, help illustrate how to prioritize the system in relation to other systems. Consider how to prioritize the need for this system by evaluating cost in relation to the costs of other systems in the matrix. 

This type of approach will lend itself to two possible paths: either the institution decides to place a higher priority on maintaining the legacy email system and allocates funds above other system needs, or it reduces costs and outsources its mail to Google.
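
The comparison behind those two paths is simple arithmetic, sketched here. The license and backup figures come from the example above; the outsourced quote is a made-up placeholder, since no price for hosted mail is given, and should be replaced with a real vendor quote.

```python
LEGACY_LICENSE = 20_000                 # dollars per year (from the example)
LEGACY_BACKUP_AND_MAINTENANCE = 15_000  # dollars per year (from the example)


def annual_total(*line_items):
    """Add up the yearly line items for one system."""
    return sum(line_items)


def annual_savings(current_total, alternative_total):
    """Positive result means the alternative is cheaper per year."""
    return current_total - alternative_total


legacy_total = annual_total(LEGACY_LICENSE, LEGACY_BACKUP_AND_MAINTENANCE)

# HYPOTHETICAL placeholder, not a real price: substitute an actual quote.
outsourced_quote = 12_000

print(f"Legacy email: ${legacy_total:,} per year")
print(f"Difference versus quote: ${annual_savings(legacy_total, outsourced_quote):,} per year")
```

The legacy total of $35,000 per year is the number to weigh against the other systems in the matrix, whichever path the institution chooses.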

This is the approach and these are the conversations that help technology teams and institutions across higher education create compelling business cases for senior leadership. Without those that put in the work and develop the plan, there would be a lot more instability in the daily operation of an institution and even more uncertainty in the future experience of its students.

A Side Note

Ironically enough, this week, even as we publish our series on disaster recovery, our company was faced with a crisis that required us to conduct a live exercise in response to a mistake that took one of our own key business systems offline. While preparing for the future rollout of a new product, a member of our team accidentally loaded a new data structure and its contents into the wrong destination database. The scripts that ran dropped existing data tables and replaced them with empty shells (or in some cases, records from the new system). The dependent application was rendered inoperable, and a scramble ensued. 

Luckily, cool heads prevailed, and within a few minutes, our team sprang into action to identify affected data, locate the most recent snapshot of the system, and systematically restore each data source using a series of two-factor checks and balances before certifying a full restore. The entire activity took less than 50 minutes. 

Today’s technology options for disaster recovery are far superior to what existed less than a decade ago. Without the services offered in a cloud infrastructure, for instance, this week’s gaffe could have resulted in not only the loss of significant revenue but also the dissatisfaction of clients and subsequent attrition of a customer base. Instead, snapshots of servers in our infrastructure are available with a few clicks. Our clients never even noticed that we ran into an issue.

Orchestrating a recovery effort takes practice, but the technology is available now.