One of the problems that came up this week is actually a problem that came up in December.
On December 15th we got a warning about disk health in a server; there is a drive that is at risk of failure.
A ticket was created for me to create a quote to replace the drive.
There was no part number associated with the ticket, and because of the type of server, there was no easy way to access configuration information online. Our hardware documentation is also a disaster (I have thought it was a disaster since the acquisition; I set up hardware documentation at the old job specifically to avoid issues like this, and now all that documentation is gone because we didn’t keep any licenses for the old job’s CRM). This was not a situation where I could find a part number.
I contacted Tech Alice and asked her to check the part number on the server. Alice reported back that because the drive was part of a RAID array, she couldn’t get the part number. She recommended asking Bob, and put her time entry on the ticket.
I contacted Tech Bob and asked him if he could find the part number for the drive on the server. Bob also reported back that he could not find a way to get the part number from the server, and he recommended that Charlie collect the part number when he went onsite. Bob added his time to the ticket (still my ticket) and added the status “onsite needed.”
Now it is December 23rd. I have messaged Charlie and asked him to check the part number when he is onsite and have added him to the ticket. I’m out of the office today, Charlie is out of the office next week. Charlie does not remember to look at the part number when he is onsite. It is the end of the year.
Now it is January 15th. We lost the first week of the year to assessments, and the second week of the year to the state and our clients being on fire - people were unable to go onsite because of all of that. Charlie is going onsite. I remind him to get the part number when he is at the client site. When he is at the client site he alerts me that actually he is at their other location, not the location with the server.
Now it is January 27th. Charlie is going back onsite, he is on my ticket, the ticket is set to onsite needed. I remind Charlie that we need the part number. Charlie does not remember.
Now it is February 6th. We have created a whole new ticket for Charlie with the *EXPRESS STATED PURPOSE* of going onsite to collect a part number for the failing drive in the server. Charlie marks the ticket as “waiting materials” and makes a note that he can’t replace the drive until we order the part.
Now it is February 7th. We have explained, in writing, in Charlie’s ticket that we can’t order the part until he goes onsite and collects the part number. We can’t get it ourselves because the server won’t report the part number if the drive is in a RAID array, for reasons that, I’ll be honest, I do not understand.
Now it is February 14th. Charlie closes his ticket and he and Bob pull me into a meeting. The server at the client site is so old they’re not sure it’s a good idea to replace the drive. Charlie has recommended that the project team quote a migration to SharePoint, which the client has expressed interest in before. Bob makes a note of this in my ticket. But I do not close my ticket. I do not close my ticket because I know there must be some fuckery coming. So I put my ticket to “on hold” and set it to reactivate on March 10th so that I can follow up with the project team and see if the migration project is making any progress or if we still need to replace this drive because the server drive is still failing.
It is March 13th. I have a bad week. A very bad week. My manager looks at my open tickets and asks why on earth I still have a server drive failure ticket open from December. I explain that I only have it open to follow up on the migration because the technician suggested server replacement but if there wasn’t progress we should still quote a drive, but I still didn’t have the part number.
My manager starts a chat with me, Charlie, the project team lead, and the service team lead and asks what the fuck is going on. I paste Charlie’s last update on my ticket and say that I’ll be happy to quote a hard drive but I still don’t have the part number.
Charlie says “Oh, I put the part number in the ticket” and pastes a photo of a drive (low light, low contrast, and blurry but with a visible part number) in the chat.
“Great!” I say, and immediately assemble a quote and find stock. Then I look back at my ticket. “But I’m actually not seeing the part number on this [my] ticket. Where was that again?”
Charlie has put the part number on his ticket, which I was never on, which he closed.
“Ah, okay. I see.”
And here’s where the different standards that all of us are used to using work against us.
My old job built RAID servers all the fucking time. It was totally standard, totally easy, totally sensible, and I always knew to double the number of drives we needed for the storage we got because we’d be mirroring. Because we’d be using RAID 10. Because it’s robust and can take a lot of failure. A drive failing in a server configured with RAID 10 is not ideal, but it’s also not a drop-everything and panic emergency. I *still* wouldn’t want to leave it two months in an ideal world but I can’t drive up to San Francisco and get a part number, and sometimes the world literally catches on fire.
However, these new folks use RAID 5.
A drive failing in a server configured with RAID 5 *IS* a drop everything emergency, because RAID 5 can only survive one failed drive. If one drive goes down, the array keeps limping along in a degraded state with no redundancy left until you can replace the drive and rebuild the array, and because a RAID 5 rebuild is slower than a RAID 10 rebuild, this can take a very, very long time depending on how much data there is. And if a *second* drive fails before the rebuild finishes, the data is *gone*.
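For anyone who wants the arithmetic behind the RAID 10 versus RAID 5 difference, here is a rough sketch in Python. It is a toy model, not anything from our actual tooling: the names `RaidSummary` and `raid_summary` are made up for illustration, and it assumes identical drives and plain two-way mirrors for RAID 10.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaidSummary:
    usable_tb: float          # storage you actually get to use
    guaranteed_failures: int  # drive failures survived no matter which drives die
    best_case_failures: int   # failures survived if they land on the "right" drives


def raid_summary(level: str, drives: int, drive_tb: float) -> RaidSummary:
    """Hypothetical helper: capacity and failure tolerance for an array of identical drives."""
    if level == "raid10":
        # Assumption: striped two-way mirrors, so drives come in pairs.
        if drives < 4 or drives % 2 != 0:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        # Half the drives hold mirror copies, so you buy double the storage you need.
        # Any single failure is survivable; more are survivable only if each
        # failure hits a different mirror pair.
        return RaidSummary(drives / 2 * drive_tb, 1, drives // 2)
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        # One drive's worth of space goes to parity; exactly one failure is survivable.
        return RaidSummary((drives - 1) * drive_tb, 1, 1)
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    print(raid_summary("raid10", 8, 4.0))
    print(raid_summary("raid5", 5, 4.0))
```

Both example arrays give the same usable space, but once one drive has already failed, the RAID 5 array has nothing left to spare, while the RAID 10 array may still ride out another failure as long as it lands in a different mirror pair.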
So.
Whose job is it to get the part number, and whose job is it to know that the server is at imminent risk of failure?
Well, now I have properly reconfigured my internal alarms about any failing server drive, but I don’t understand why none of the three technicians who worked on this ticket with me at any point said “hey this is an emergency” (Alice is from my old team and used to RAID 10 also, I’m willing to give her a pass) and I’m *really* confused why Bob and Charlie would recommend *not* replacing a drive in a server that is that close to failure.
(And again, I just didn’t know. Believe me, I am never, ever going to shut up about drive warning tickets in the future.)
And, the thing that scares the shit out of me and my manager and part of the reason why this has been a bad week and I’m having stressful conversations: What if I had just closed that ticket instead of letting it reactivate to follow up on? What if I had just marked it as done when Charlie gave me the update? It wouldn’t have been an old-ass ticket in my queue that my manager flagged, it would have been a note in an after-action report when the client’s server crashed.
(The client has the quote now with the statement “this failing drive puts your server at risk of failure and we strongly recommend replacing” but they haven’t approved it yet because they’re really cheap so I’m going to have to send it again and say “this is a mission critical part that you need to replace; your server is at risk as long as the drive is not replaced.”)
So. The boss is asking “why is procurement taking so long” and really, now that I’m thinking about it - because he brought it up - how much of this really IS supposed to be my job?