Thought: Does your team have a Production Architect?

MSF Agile breaks the roles on a software project into:

  • BAs
  • Architects
  • Project Managers
  • Developers
  • Designers
  • Testers (if you are lucky)
  • Release Managers

Your company’s methodology may have different roles, and different names for them, but I’m guessing there is one thing we can count on:

They all focus on delivering software, not keeping it running.

Of course, we are all meant to make things run. Everyone knows that there’s no point writing code if it doesn’t work in production. The roles above exist to mitigate risks of applications not running in production:

  • BAs envision production
  • Architects put in place designs, frameworks and processes to ensure the application should run in production
  • Project Managers make sure the code gets to production on time
  • Developers write the code that runs in production
  • Testers make sure that it functions correctly when it gets to production
  • Release managers enable it to get to production cleanly

Again, look at what those roles are. They all concentrate on getting the code into production, not on what happens after the code is in production.

So whose job is it to keep the application running?

In most companies, it’s some kind of infrastructure team. They may go under various names: IT, Operations, Infrastructure, Networking. We all like to believe that once we deliver the software, the infrastructure team will take care of the rest.

There may have been a point in time where infrastructure teams were capable of keeping every part of a system running. Perhaps it made sense to keep delivery and operations separate.

I have met very few infrastructure guys who:

  • Knew what an Application Pool was
  • Knew the difference between a Thread, a Process and an AppDomain
  • Knew the difference between CGI and ISAPI extensions
  • Knew what connection pooling was

Likewise, I’ve met very few developers who:

  • Knew what a load balancer was
  • Knew the difference between a SAN and a NAS
  • Knew how to debug a running application without Visual Studio
  • Knew why Application Pools matter

With infrastructure teams who don’t know the basics about the technologies we build our applications on, how can they hope to keep our applications running?

Architects love to design systems that use BizTalk, .NET, Java, SQL, CRM and SharePoint all in one solution. Microsoft love to sell those solutions to you. But even when you round up enough developers to build it, how on earth do you plan to keep it online?

To mitigate this, we write deployment guides, we automate deployments as much as we can, we build knowledge bases, and we document every feature. We push code into production, safe in the knowledge that the operations guys should be able to figure it out.

What happens when things break?

I think the dirty, secret truth of our industry is that:

  1. Every company has a policy that developers develop, and operations operate. Yet, even so:
  2. Developers end up doing most of the operations work, for the applications they build

Here’s how it works:

  1. Larry (user) phones Bob (operations): Hi Bob. I just tried to fill in my timesheet. I get 500 errors. Why is it not working?
  2. Bob to Larry: Not sure, but thanks for reporting the problem. I’ll have a look.
  3. Bob reads all the documentation he can. He reboots IIS. He reboots the machine. That’s about as far as his knowledge goes.
  4. Bob phones Sarah (developer): Hi Sarah, I understand you built the timesheet system? I’m getting error reports from users. Can you help me troubleshoot it?

Then the painful debugging process begins. Sarah has some idea about how IIS works, but no idea about the servers it is installed on. Bob knows there are boxes called servers, but wouldn’t know an application pool from a swimming pool. Sarah doesn’t have access to any of the servers to check things herself, and Bob doesn’t have authority to give her access. Eventually, with enough wild guesses and geese chasing, they stumble upon the problem.

Introducing: The Production Architect

For lack of a better title, the Production Architect is a new role on your team. While the rest of the team concentrate on getting the application into production, the Production Architect is worried about what happens in production.

  • He is KPI’d on the uptime of the applications he runs
  • He wears the responsibility for setting, and sticking to, Service Level Agreements
  • He is involved in the architectural design of the system from the start
  • He is involved in setting up the infrastructure of the system
  • He owns load testing and performance testing
  • He owns the test environment
  • He knows infrastructure as well as code
  • He has the political clout to get operations and development to help him stick to his responsibilities
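To make the uptime KPI and the SLA commitment concrete, here is a minimal sketch of the arithmetic involved. It is only an illustration: the function names, the 99.9% target and the 30-day period are assumptions chosen for the example, not figures from any particular agreement.

```python
from datetime import timedelta


def downtime_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Downtime allowed within `period` while still meeting `sla_percent` uptime.

    Hypothetical helper for illustration only.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period * (1 - sla_percent / 100)


def measured_uptime(outages: list[timedelta], period: timedelta) -> float:
    """Uptime as a percentage of `period`, given the duration of each outage.

    Hypothetical helper for illustration only.
    """
    downtime = sum(outages, timedelta())
    return 100 * (1 - downtime / period)


if __name__ == "__main__":
    month = timedelta(days=30)  # assumed reporting period
    target = 99.9               # assumed SLA target, in percent

    # Two made-up outages during the month.
    outages = [timedelta(minutes=30), timedelta(minutes=45)]

    print("Allowed downtime:", downtime_budget(target, month))
    print("Measured uptime: %.3f%%" % measured_uptime(outages, month))
    print("SLA met:", measured_uptime(outages, month) >= target)
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime in a 30-day month, so the two example outages (75 minutes in total) would miss it. A budget that small is one reason to have the role involved in the design from the start.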

Above all, the Production Architect’s responsibility is to make sure the application works in production, and to keep it working. Like any architect, he relies on the team around him to help troubleshoot issues, but he has a great deal of knowledge about the application’s internals and how to keep it running.

Does your team have someone who acts as a Production Architect?

Are you paying them what they are worth?

Or is your operations team skilled enough in .NET and the supporting technologies to own these responsibilities themselves?

13 Responses to “Thought: Does your team have a Production Architect?”

  1. Interesting
    I have worked in a company that had a team fulfilling this role. It did work well most of the time. The main problem was that in this company, bugs were never assigned to the person who wrote the code in the first place, and it was a big team. So you had a lot of people checking in shitty code, and the code would go into a freeze for QA and release. Then the release team (i.e. production architects) would have to spend a lot of time beating the code into a workable state that they were able to support. It would have been a good system if the developer team had been more competent; what happened in reality is that the release guys burnt out and moved on.

  2. As an aside… a recent DotNetRocks episode spent a lot of time on ways you could develop with operations in mind. This includes things like better logging and instrumentation, and getting buy-in from the ops team during the design / requirements phase. Well worth a listen.

  3. One idea I have tried (and failed) to implement is that the operations group should be treated as clients of the system, almost as much as the “real” clients. If we treat them that way, the theory is that they will prioritize what is important for support, such as instrumentation, etc.

  4. That’s the upshot of the discussion in the DNR episode I listened to.
    I imagine it’s very hard to make it work well; there is usually too much animosity between the two groups, and it’s very hard to sit down and call a truce.

  5. To add to Paul’s original post - there is MOF to complement MSF. MOF tells us (supposedly) about the various groups that keep things running (Ops) and troubleshoot issues (Support).

    MOF also tells us that our applications should have a defined health model, and then the support group can properly troubleshoot the app. A health model would define all the different states of the application, and associated errors, events etc. Operations Manager management packs are built on health models for MS products.

    But this does rely on the app developers properly architecting/instrumenting the application to integrate with whatever monitoring tools the ops guys are using.

    The problems I see are twofold:
    a) most organisations don’t have the operational maturity/skills to properly monitor anything that isn’t supplied “out of the box”
    b) most organisations simply don’t have the staff that are smart enough (and interested enough) to master multiple disciplines.

    I agree with Paul that it is very rare to find an IT Pro that is interested in developer topics (or even understands processes/threads, heap and stack, user/kernel mode etc). And there are few developers that understand operational needs. If you can find someone that can bridge those gaps (and even better if they can handle databases and security as well), they can be worth a lot to the org.

  6. […] Does your team have a Production Architect? - Paul Stovell discusses a role responsible for the continued running of your application […]

  7. not a fan of this. kpi’d on the uptime when he didn’t write it? eh?

    imho the programmers should take care of almost all the things you list under this person’s purview. it’s the programmers’ job.

  8. silky,

    Perhaps if there’s only one developer on the team. What happens when there’s a handful? Who is responsible for the uptime then?

  9. The lead.

  10. Silky,

    I disagree. When you write an application that depends on numerous other components within the environment, you simply can’t have the lead responsible for ensuring uptime. Is the lead supposed to examine every security patch for every product that comes through? What about when the organisation puts in a new network? Or puts in a new firewall? Or upgrades the SAN? Or any number of other things that could affect the overall performance of the application?

    Lead developers, as Paul has noted, typically don’t know much about SANs, or networking, or security, or even infrastructure in general. Asking them to ensure uptime in a complex environment is asking too much. They should be managing developers instead.

  11. […] I work for a consulting company, a Production Architect isn’t really a direct role I can play for clients. The role of a production architect […]

  12. Ken,

    You disagreeing with me isn’t exactly new. It’s crazy to think I’m saying that lead dev should play the role of network admin. There is room for both. I just question the requirement for a specific role to sit on top of both.

  13. […] Filed under: Resources — Grant Holliday @ 1:18 am Paul Stovell introduced the concept of The Production Architect - which is a role on the project team that is responsible for bridging the gap between IT Pros […]
