Statement about storage issues on 25.3.2018
During a routine upgrade of our main storage system, which hosts the Hybrid SSD arrays, we ran into an issue.
What happened overnight
We were upgrading our main storage system, which hosts the Hybrid SSD storage arrays.
We do this around every 3 months.
Because of the high-availability (HA) infrastructure, such an upgrade normally causes no disruption to the system, as there are two separate processors that work independently of one another.
There was an unexpected issue with yesterday's upgrade.
The system failed during the storage handover from one node to the other; one storage node was not upgraded and crashed. We have carried out this procedure successfully many times without any issues, so we did not expect any this time.
How we proceeded
The on-call technical staff called in other team members and Netapp support, with whom we have a contract for 24/7 assistance for these kinds of events.
The team then began the recovery work at 6 AM with the assistance of several Netapp staff members, including a high-level team from the USA.
We posted regular updates on the status.zgroup.si Twitter and Facebook accounts.
We currently own three separate Netapp systems, and this is the first time in many years that we have had such an issue.
Even Netapp had never seen this issue before, which is why it took so long to diagnose.
What was affected
About half of the servers on the Hybrid SSD storage lost access to their disks.
After the restart, all servers returned to normal operation, as before the downtime.
Procedures after the storage fix
When the technical team restored the storage at about 12 noon, we called in all of our support staff, who helped our clients return to operation.
The technical team resolved all alarms raised in the NOC and restored operation of all managed servers.
We also had the full team available throughout the day to assist customers with their own servers. As soon as we had access and reports from our clients of the issues on their VPS servers, the team fixed those issues so that the servers could return to operation.
We are extremely happy and grateful for your understanding and patience, which enabled us to do our job without extra pressure.
In return, we have extended hosting for all our users by 2 days.
We are currently waiting for an analysis from Netapp EU, which is scheduled to investigate what happened.
We presume that there is a bug in the upgrade procedure and we will work with them to find and fix the issue.
Personally, I would like to thank our team, the Netapp team, and everybody involved for their support and cooperation in resolving this issue.