Most of the time, SAN troubleshooting is associated with the storage arrays rather than with all the devices in the SAN. I would like to share my experience troubleshooting SANs. Since it is a huge topic, I will only touch on a few generic points that I have come across at work.
Where to Start?
I believe we need to start from the host or the end device connected to / zoned with the storage array. Storage arrays and SAN switches are robust and built with an N+1 design, so they are less prone to failures. If a storage array does have an issue, it usually has a call-home setup with the vendor, who will come to its rescue.
How to diagnose?
1. Understand the issue / incident / problem thoroughly
2. Assess the severity, priority and business criticality
3. Identify the devices related to the incident / issue
4. Start with the end devices, such as servers
5. Gather logs from all affected devices and analyze them to find the source of the problem
6. Install the necessary tools and monitoring agents that help you analyze the performance and stability of the hardware / devices
7. Get the support you need from peers, leads, SMEs and, importantly, from vendors and support technicians
8. Log a case with vendors and get all the help you need from them
9. Follow the process already in place to create the necessary tickets and change requests, and adhere to process and standards before making any changes to the production environment.
10. When making any change to production, make sure it will not disrupt running services. If necessary, request downtime and perform the change during a period of low production impact.
11. Once you have a solution, don't just leave it there. Check whether it is a long-term or a short-term fix. Always prefer to provide a long-term solution.
12. Document every activity performed and share it with peers so that they are equipped to face similar issues in the future
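As a rough illustration of step 5, here is a minimal Python sketch that counts suspicious keywords in logs already collected from the affected devices and ranks the devices by number of hits. Everything specific in it is an assumption made for the example: the keyword list, the one-file-per-device `*.log` layout, and the function names are hypothetical and not taken from any vendor tool. Real host, HBA, and switch logs use their own wording, so adjust the keywords to what your devices actually write.

```python
from collections import Counter
from pathlib import Path

# Hypothetical keywords; replace with the messages your own devices log.
KEYWORDS = ("error", "timeout", "link down", "path failed")


def count_keyword_hits(lines, keywords=KEYWORDS):
    """Count how many lines contain each keyword (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for keyword in keywords:
            if keyword in lowered:
                counts[keyword] += 1
    return counts


def summarize_logs(log_dir):
    """Return {device_name: Counter} for every *.log file in log_dir.

    Assumes one log file per device, named <device>.log.
    """
    summary = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(errors="replace") as handle:
            summary[path.stem] = count_keyword_hits(handle)
    return summary


def rank_devices(summary):
    """Sort devices so the one with the most keyword hits comes first."""
    return sorted(
        summary.items(),
        key=lambda item: sum(item[1].values()),
        reverse=True,
    )
```

Calling `rank_devices(summarize_logs("collected_logs"))` on a folder of gathered logs gives a first hint about which device to look at more closely. It is only a starting point and does not replace reading the logs or the vendor's own analysis.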
We need to understand an important point here: we are making changes to devices that are highly critical to the business and that demand close attention and the necessary skill sets. If you are not aware of the situation, do not jump in and start trial and error. Always be watchful and try to get as much visibility into the problem as possible.