I'm exploring avenues of networking besides the straightforward "Work on a network, stay late every day, no peace of mind, fight with higher-ups to get resources."
You may have noticed that there's a smattering of UX and HTML/JavaScript coding on here. I like coding, but I'm not too knowledgeable yet. UX is what I really enjoy (and I'm even working with a major company to improve theirs).
Site Reliability Engineering seems to be a nice merging of the two. I imagine it's who you call when your eCommerce checkout service craps the bed.
Not naming any names.
Site Reliability Engineering (SRE) is "Let's make sure our systems stay up and functional for operations". To the end user, that's not much different from networking: "Does the service stay up long enough for me to use it? Can I depend on it to stay up?"
A good point to remember is that most systems don't need to be up all of the time. You don't even want that. It may not be fine to sleep through five calls about an outage at 2 PM on a Tuesday, but it's okay to accept that some failure is normal.
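There are even 'error budgets': the amount of downtime you've agreed a system is allowed to have. If being up 85% of the time works for some machine, the other 15% is your budget, and you can tinker a bit with new features. If something goes down, you're still within 'budget' as long as you haven't used up that 15%.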
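To put rough numbers on that, here's a small sketch of the arithmetic in Python. The function names and the 30-day window are my own assumptions for illustration, not something from the reading; the only idea it relies on is that the budget is whatever is left over after the uptime target.

```python
def error_budget_minutes(target_uptime: float, period_days: int = 30) -> float:
    """Downtime allowed in the period, in minutes.

    target_uptime is a fraction between 0 and 1 (0.85 means 85%).
    The 30-day default window is an assumption for this example.
    """
    total_minutes = period_days * 24 * 60
    return (1 - target_uptime) * total_minutes


def budget_remaining(target_uptime: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Minutes of budget left after the downtime already used.

    A negative result means the budget is blown.
    """
    return error_budget_minutes(target_uptime, period_days) - downtime_minutes


if __name__ == "__main__":
    # An 85% target over 30 days leaves roughly 6480 minutes (108 hours).
    print(round(error_budget_minutes(0.85), 1))
    # A 99.9% target over the same window leaves only about 43 minutes.
    print(round(error_budget_minutes(0.999), 1))
    # After a 10-hour outage, the 85% system still has plenty of budget left.
    print(round(budget_remaining(0.85, downtime_minutes=600), 1))
```

The gap between those two targets is the whole point: a loose target leaves days of room to experiment, while a strict one leaves well under an hour.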
Everyone needs rest, even machines. If you're afraid to turn something off because "We might lose everything!", you haven't put decent backup or maintenance practices in place.
SRE was started with a software engineering mindset, which surprised me...at first. In retrospect, it makes sense. Software developers have to make sure their programs don't have vulnerabilities that impact gathered data, underlying hardware, and operations.
A big part of SRE is asking, "Is this system reliable enough that people have time to improve it, or are they always putting out fires?"
It's a short segment, and I encourage you to read it!