How to Diagnose Virtual Machine Issues: Analyzing Disk Subsystem Performance

Virtual machines (VMs) have become an integral part of modern IT infrastructure. However, like any complex system, VMs are prone to problems. One of the most common causes of VM performance degradation is an underperforming disk subsystem. In this article, we take a detailed look at methods for diagnosing disk-related issues in virtual machines and provide practical examples for identifying and resolving them.

We will focus on analyzing disk subsystem performance, as this area is often a bottleneck and requires careful consideration. We will examine tools and methods for monitoring, analyzing metrics, and configuring the disk system to achieve optimal performance.

Input/Output (I/O) Monitoring

[Figure: I/O monitoring window on a virtual machine with IOPS, throughput, and latency metrics highlighted.]

Effective input/output (I/O) monitoring is a crucial step in diagnosing disk subsystem problems in VMs. Monitoring allows you to track key performance indicators such as IOPS (input/output operations per second), throughput, and latency. Analyzing these metrics allows you to identify problem areas and determine whether the disk subsystem is a bottleneck.

Several tools can be used to monitor I/O on virtual machines, including built-in operating system tools and specialized solutions for monitoring virtual infrastructure.

Using iostat for I/O Monitoring in Linux

The iostat utility is a powerful command-line tool available in most Linux distributions. It provides detailed information about disk performance, including IOPS, throughput, CPU utilization, and wait times. iostat allows you to track disk activity in real-time and collect statistics for further analysis.

Example 1: Basic I/O monitoring using iostat

iostat -x 1

This command runs iostat with the -x option, which outputs extended statistics; the trailing 1 sets the data refresh interval in seconds. The output includes many columns, including r/s (read operations per second), w/s (write operations per second), rkB/s (kilobytes read per second), wkB/s (kilobytes written per second), await (average I/O wait time in milliseconds; newer sysstat versions split this into r_await and w_await), and %util (percentage of time the disk was busy). Pay particular attention to the await and %util values, as high numbers there may indicate disk performance issues.
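To show how these columns are read in practice, here is a minimal sketch that parses iostat -x output and flags devices whose await or %util look problematic. The sample output and the thresholds are illustrative assumptions, not values captured from a real host:

```python
# Hypothetical sketch: flag devices in `iostat -x` output whose await or
# %util exceed assumed thresholds. The sample text below is invented;
# note that newer sysstat versions report r_await/w_await instead of await.
SAMPLE = """\
Device            r/s     w/s     rkB/s     wkB/s   await  %util
sda             120.0   340.0    4800.0   27200.0   18.70  96.40
sdb               2.0     1.0      64.0      32.0    0.40   1.10
"""

AWAIT_MS_LIMIT = 10.0   # assumed: flag disks slower than 10 ms per I/O
UTIL_PCT_LIMIT = 90.0   # assumed: flag disks busy more than 90% of the time

def flag_busy_devices(text):
    lines = [l for l in text.splitlines() if l.strip()]
    header = lines[0].split()
    flagged = []
    for line in lines[1:]:
        # map column names to values so the code survives column reordering
        fields = dict(zip(header, line.split()))
        if (float(fields["await"]) > AWAIT_MS_LIMIT
                or float(fields["%util"]) > UTIL_PCT_LIMIT):
            flagged.append(fields["Device"])
    return flagged

print(flag_busy_devices(SAMPLE))  # ['sda']
```

In practice you would feed this function output collected from iostat over time, rather than a hard-coded sample.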

Example 2: Monitoring a specific disk using iostat

iostat -x sda 1

This command allows you to track statistics only for the sda disk. Replace sda with the name of the disk you are interested in. This is useful when you have multiple disks in the system and you want to focus on a specific device.

Example 3: Writing iostat data to a file for later analysis

iostat -x 1 > iostat.log

This command redirects the output of iostat to the iostat.log file. This allows you to collect statistics over a long period of time and analyze the data later using a text editor or specialized log analysis tools.

Using Performance Monitor (perfmon) in Windows

Performance Monitor (perfmon) is a built-in monitoring tool in Windows that provides a wide range of performance metrics, including disk subsystem metrics. perfmon allows you to track I/O in real-time and create reports for analyzing historical data.

Example 1: Monitoring IOPS using Performance Monitor

  • Open Performance Monitor by typing "perfmon" in the Windows search bar.
  • In the left panel, select "Performance Monitor".
  • Click the "+" (Add Counters) icon on the toolbar.
  • In the "Add Counters" dialog box, select "PhysicalDisk" or "LogicalDisk" (depending on which disks you want to track).
  • Select the "% Disk Time" (percentage of time the disk was busy) and "Disk Transfers/sec" (number of I/O operations per second) counters.
  • Click "Add", then "OK".

You will now see graphs displaying the values of the selected counters in real-time. A high "% Disk Time" value (close to 100%) may indicate disk overload.

Example 2: Monitoring disk latency using Performance Monitor

  • Repeat the steps described above.
  • In the "Add Counters" dialog box, select "PhysicalDisk" or "LogicalDisk".
  • Select the "Avg. Disk sec/Transfer" counter (average time spent on one I/O operation).
  • Click "Add", then "OK".

This counter shows the average time spent on one I/O operation in seconds. A high value may indicate disk latency issues.

Example 3: Creating a disk performance report using Performance Monitor

  • Open Performance Monitor.
  • In the left panel, expand "Data Collector Sets" and select "User Defined".
  • Right-click in the right panel and select "New" -> "Data Collector Set".
  • Enter a name for the data set (e.g., "DiskPerformance").
  • Select "Create manually (Advanced)" and click "Next".
  • Select "Create data logs" and check the box next to "Performance counter". Click "Next".
  • Click "Add" and select the counters you want to track (e.g., "% Disk Time", "Disk Transfers/sec", "Avg. Disk sec/Transfer"). Click "OK".
  • Specify the data collection interval (e.g., 1 second). Click "Next".
  • Specify the location to save the logs and click "Finish".

Perfmon will now collect data about disk performance in the specified file. You can analyze this data later by opening the file in Performance Monitor.

Latency Analysis

[Figure: Graph of disk operation latency showing peaks, average values, and problem periods.]

Latency is the time it takes to complete an input/output operation. High latency is one of the main signs of disk subsystem problems and can significantly affect VM performance. Latency analysis helps identify the causes of slow disk performance and take steps to eliminate them.

Latency can be caused by various factors, including disk overload, slow disks, problems with the disk controller, network issues (if network storage is used), and virtualization issues.

Identifying Sources of Latency

The first step in latency analysis is to identify the source of the problem. It is necessary to determine whether the problem is local to the VM or related to the storage infrastructure.

Example 1: Checking latency using ping

ping -c 4 <storage_IP_address>

If the VM uses network storage (e.g., NFS or iSCSI), check the network latency using the ping command. High network latency may indicate network problems that affect disk subsystem performance. Replace <storage_IP_address> with the IP address of your network storage.

Example 2: Using traceroute to determine the route to the storage

traceroute <storage_IP_address>

The traceroute command allows you to determine the route that traffic travels from the VM to the storage. By analyzing the route, you can identify problematic network nodes that may be causing delays. Replace <storage_IP_address> with the IP address of your network storage.

Example 3: Checking disk latency using hdparm (Linux)

hdparm -tT /dev/sda

The hdparm utility can be used to test disk performance. The -t option measures sequential read speed from the device itself, while -T measures cached reads (effectively memory/buffer-cache throughput). Replace /dev/sda with the name of the disk being tested. Although hdparm does not measure latency directly, it can give a general idea of disk performance.

Analyzing Latency Metrics

After identifying the source of the latency, it is necessary to analyze the latency metrics to determine the cause of the problem. Pay attention to the following metrics:

  • Average read/write latency: This metric shows the average time spent performing read and write operations. A high value may indicate disk overload or slow disks.
  • Maximum read/write latency: This metric shows the maximum time spent performing read and write operations. Large latency spikes may indicate short-term performance issues.
  • Percentage of operations exceeding a certain latency threshold: This metric shows what percentage of operations takes longer than a certain time (e.g., 10 ms). A high percentage may indicate serious performance problems.
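The three metrics above can be sketched directly from a list of per-operation latencies collected during monitoring. The sample values and the 10 ms threshold here are invented for illustration:

```python
# Compute average latency, maximum latency, and the share of operations
# exceeding a threshold, from per-operation latencies in milliseconds.
latencies_ms = [2.1, 3.4, 1.8, 45.0, 2.9, 60.2, 3.1, 2.4]
THRESHOLD_MS = 10.0  # assumed threshold for "slow" operations

avg_latency = sum(latencies_ms) / len(latencies_ms)
max_latency = max(latencies_ms)
pct_over = 100.0 * sum(1 for l in latencies_ms if l > THRESHOLD_MS) / len(latencies_ms)

print(f"average: {avg_latency:.1f} ms")                # average: 15.1 ms
print(f"maximum: {max_latency:.1f} ms")                # maximum: 60.2 ms
print(f"over {THRESHOLD_MS:.0f} ms: {pct_over:.0f}%")  # over 10 ms: 25%
```

Note how the two latency spikes dominate the average here; this is why it is worth looking at the maximum and the over-threshold percentage alongside the mean.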

Example: Analyzing I/O monitoring logs

Analyze I/O monitoring logs (e.g., logs collected using iostat or Performance Monitor) to identify periods of high latency. Associate these periods with other events in the system (e.g., launching resource-intensive applications, performing backup tasks) to determine the causes of the problems.

Optimizing Disk Configuration

Proper disk configuration is crucial to ensuring optimal VM performance. An incorrectly configured disk subsystem can become a bottleneck, even if fast disks and powerful hardware are used.

Optimizing disk configuration includes choosing the correct type of disks (SSD vs HDD), configuring RAID, using LVM (Logical Volume Manager), and properly allocating disk space.

Choosing Disk Type: SSD vs HDD

SSD (Solid State Drive) and HDD (Hard Disk Drive) have different performance characteristics. SSDs provide significantly higher read and write speeds, as well as lower latency compared to HDDs. However, SSDs are usually more expensive than HDDs, especially for large volumes.

For VMs that require high disk subsystem performance (e.g., databases, I/O-intensive applications), it is recommended to use SSDs. For VMs used to store data that does not require fast access (e.g., archives, backups), you can use HDDs.

Example: Migrating a VM from HDD to SSD

If you find that a VM is running slowly due to low disk performance, consider migrating the VM from HDD to SSD. Most virtualization platforms (e.g., VMware vSphere, Microsoft Hyper-V) provide tools for migrating VMs between different types of storage.

Configuring RAID

RAID (Redundant Array of Independent Disks) is a technology that allows you to combine multiple physical disks into a logical volume. RAID provides fault tolerance and can improve disk subsystem performance.

Different RAID levels (e.g., RAID 0, RAID 1, RAID 5, RAID 10) have different performance and fault tolerance characteristics. The choice of RAID level depends on the specific performance and fault tolerance requirements of the VM.
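The capacity trade-off between levels can be sketched with the standard formulas. Treat these as approximations: real arrays reserve some extra space for metadata, and the disk count and sizes below are just an example:

```python
# Usable capacity for common RAID levels, given n identical disks.
def usable_capacity(level, n_disks, disk_tb):
    if level == 0:       # striping: all capacity, no redundancy
        return n_disks * disk_tb
    if level == 1:       # mirroring: one copy's worth of capacity
        return disk_tb
    if level == 5:       # striping with parity: lose one disk's capacity
        return (n_disks - 1) * disk_tb
    if level == 10:      # striped mirrors: half the raw capacity
        return n_disks * disk_tb / 2
    raise ValueError("unsupported RAID level")

for level in (0, 1, 5, 10):
    tb = usable_capacity(level, 4, 2)
    print(f"RAID {level}: {tb:g} TB usable from 4 x 2 TB disks")
```

The same four disks yield anywhere from 2 TB (RAID 1) to 8 TB (RAID 0) of usable space, which is the price paid for redundancy.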

Example: Choosing a RAID level for a database

For a database that requires high performance and fault tolerance, it is recommended to use RAID 10. RAID 10 provides high read and write speeds, as well as protection against data loss in the event of a failure of one or more disks.

Example: Configuring RAID using mdadm (Linux)

The mdadm (Multiple Devices Administration) utility is used to manage RAID arrays in Linux.

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

This command creates a RAID 1 array (mirroring) from two disks (/dev/sda1 and /dev/sdb1) and creates the /dev/md0 device. Warning: This command will cause data loss on the /dev/sda1 and /dev/sdb1 disks. Make sure you back up your data before running this command.

Using LVM (Logical Volume Manager)

LVM allows you to flexibly manage disk space. With LVM, you can create logical volumes that can be dynamically changed (increased or decreased) without having to reboot the system. LVM also provides the ability to create snapshots of logical volumes, which is useful for backups and testing.

Example: Increasing a logical volume using LVM

Suppose you have a volume group vg0 and a logical volume lv_data that needs to be increased.

lvextend -L +10G /dev/vg0/lv_data

This command increases the logical volume /dev/vg0/lv_data by 10 GB. After that, you need to resize the file system to use the new space.

resize2fs /dev/vg0/lv_data

This command grows the ext2/3/4 file system on the logical volume /dev/vg0/lv_data so that it occupies all available space. (For XFS, use xfs_growfs instead; alternatively, lvextend with the -r/--resizefs option extends the volume and resizes the file system in one step.)

Identifying Storage Bottlenecks

Even with optimal disk configuration, bottlenecks in other storage infrastructure components can negatively affect VM performance. It is important to identify and eliminate these bottlenecks to ensure maximum disk subsystem performance.

Bottlenecks can occur at different levels, including CPU, memory, network, and storage controller.

Analyzing CPU Load

High CPU load can lead to delays in processing I/O operations, which in turn reduces disk subsystem performance. It is necessary to check the CPU load on the VM and on the host server to determine whether the CPU is a bottleneck.

Example: Monitoring CPU load using top (Linux)

top

The top command displays a list of processes sorted by CPU usage. Pay attention to the %CPU column, which shows the percentage of CPU used by each process. If you see processes constantly using most of the CPU, this may indicate a problem.

Example: Monitoring CPU load using Task Manager (Windows)

  • Open Task Manager by pressing Ctrl+Shift+Esc.
  • Go to the "Performance" tab.
  • Look at the "CPU Usage" graph.

If the graph shows that the CPU is constantly loaded at 90% or more, this may indicate a problem.

Analyzing Memory Usage

Lack of memory can lead to active use of the swap file, which significantly reduces disk subsystem performance. It is necessary to check memory usage on the VM to determine whether memory is a bottleneck.

Example: Monitoring memory usage using free (Linux)

free -m

The free -m command displays information about memory usage in megabytes. Pay attention to the Swap line, which shows swap usage. If the used value on the Swap line keeps growing, the system is likely short on memory.
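A small sketch of this check, parsing free -m style output. The figures below are invented; on a real host you would capture the command's actual output instead:

```python
# Parse `free -m`-style output and warn when swap usage crosses a
# threshold. SAMPLE mimics procps output; the numbers are made up.
SAMPLE = """\
               total        used        free      shared  buff/cache   available
Mem:            7821        6210         312         102        1298         980
Swap:           2047        1536         511
"""

SWAP_PCT_LIMIT = 50.0  # assumed warning threshold

def swap_used_pct(text):
    for line in text.splitlines():
        if line.startswith("Swap:"):
            _, total, used, _free = line.split()
            return 100.0 * int(used) / int(total)
    raise ValueError("no Swap line found")

pct = swap_used_pct(SAMPLE)
print(f"swap used: {pct:.0f}%")  # swap used: 75%
if pct > SWAP_PCT_LIMIT:
    print("warning: heavy swap use; the VM may be short on memory")
```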

Example: Monitoring memory usage using Resource Monitor (Windows)

  • Open Resource Monitor by typing "resmon" in the Windows search bar.
  • Go to the "Memory" tab.
  • Look at the "Hard Faults/sec" graph.

A high "Hard Faults/sec" value indicates active use of the swap file, which may be caused by a lack of memory.

Analyzing Network Bandwidth

When using network storage, insufficient network bandwidth can become a bottleneck. It is necessary to check the network bandwidth between the VM and the storage to determine whether the network is a bottleneck.

Example: Checking network bandwidth using iperf3

iperf3 is a tool for measuring network bandwidth.

On the server (e.g., on the storage server):

iperf3 -s

On the client (VM):

iperf3 -c <server_IP_address>

Replace <server_IP_address> with the IP address of the server on which iperf3 is running in server mode. iperf3 will show the current network bandwidth.
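To judge whether the measured bandwidth is enough, convert it into the same units as your storage throughput. A quick sketch, where the workload requirement is an assumed example figure:

```python
# Compare a measured link speed (Gbit/s, as reported by iperf3) with the
# storage throughput a workload needs (MB/s). Figures are illustrative.
measured_gbit_s = 1.0      # example iperf3 result for a 1 GbE link
needed_mb_s = 400.0        # assumed workload requirement

measured_mb_s = measured_gbit_s * 1000 / 8   # 1 Gbit/s = 125 MB/s
print(f"link carries about {measured_mb_s:.0f} MB/s")
print("sufficient" if measured_mb_s >= needed_mb_s else "network is the bottleneck")
```

Here a 1 GbE link tops out at roughly 125 MB/s, well below the assumed 400 MB/s requirement, so in this scenario the network, not the disks, would be the bottleneck.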

Example: Monitoring network traffic using tcpdump (Linux)

tcpdump -i eth0 -n host <storage_IP_address>

This command captures network traffic on the eth0 interface between the VM and the storage with the IP address <storage_IP_address>. Analyzing this traffic can help identify network problems. Replace eth0 with the name of your network interface.

Expert Tip: Regularly load-test your VMs to identify potential bottlenecks before they start affecting performance. Tools such as fio (Linux) or DiskSpd (Windows) can generate a realistic load on the disk subsystem while you track key performance metrics.