Solaris Volume Management

Basic RAID concepts:

RAID (Redundant Array of Independent Disks) is a method of storing data across multiple disk drives to provide redundancy, improved performance, or both. There are six basic levels of RAID (0 through 5).

The Solaris Volume Manager (SVM) software uses metadevices, which are product-specific definitions of logical storage volumes, to implement RAID 0, RAID 1, RAID 1+0, and RAID 5.
RAID 0: Non-redundant disk array (concatenation and striping)
RAID 1: Mirrored disk array
RAID 5: Block-interleaved striping with distributed parity

Logical Volume: 
Solaris uses virtual disks called logical volumes to manage physical disks and their associated data. A logical volume is functionally identical to a physical disk from the point of view of an application and can span multiple disk members. Logical volumes are located under the /dev/md directory.

Note: In earlier versions of Solaris, the SVM software was known as Solstice DiskSuite software and logical volumes were known as metadevices.

Software Partition:
A software partition provides a mechanism for dividing large storage spaces into smaller, more manageable sizes. It can be accessed directly by applications, including file systems, as long as it is not included in another volume.


RAID-0 Volumes:

A RAID-0 volume consists of slices or soft partitions. These volumes let us expand disk storage capacity. There are three kinds of RAID-0 volumes:
1. Stripe volumes
2. Concatenation volumes
3. Concatenated stripe volumes

Note: A component refers to any device, from slices to soft partitions, used in another logical volume.

Advantage: Allows us to quickly and simply expand disk storage capacity.
Disadvantages: They do not provide any data redundancy (unlike RAID-1 or RAID-5 volumes). If a single component fails on a RAID-0 volume, data is lost.

We can use a RAID-0 volume that contains:
1. a single slice for any file system.
2. multiple components for any file system except root (/), /usr, swap, /var, /opt, or any file system that is accessed during an operating system upgrade or installation.

Note: While mirroring root (/), /usr, swap, /var, or /opt, we put the file system into a one-way concatenation or stripe (a concatenation of a single slice) that acts as a submirror. This one-way concatenation is mirrored by another submirror, which must also be a concatenation.
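As a hedged illustration of this note, the sketch below mirrors an existing root (/) slice by placing it in a one-way concatenation and then attaching a second submirror; the device names (c0t0d0s0, c1t0d0s0) and the volume numbers (d10, d11, d12) are assumptions chosen for the example, not taken from this article:

# metainit -f d11 1 1 c0t0d0s0
# metainit d12 1 1 c1t0d0s0
# metainit d10 -m d11
# metaroot d10
# lockfs -fa
# init 6
# metattach d10 d12

The -f flag is needed because the root slice is mounted. The metaroot command updates /etc/vfstab and /etc/system so the system boots from the mirror d10, and the second submirror d12 is attached only after the reboot, which starts a full resynchronization.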

RAID-0 (Stripe) Volume:

It is a volume that arranges data across one or more components. Striping alternates equally-sized segments of data across two or more components, forming one logical storage unit. These segments are interleaved round-robin so that the combined space is made alternately from each component, in effect, shuffled like a deck of cards.

Striping enables multiple controllers to access data at the same time, which is also called parallel access. Parallel access can increase I/O throughput because all disks in the volume are busy most of the time servicing I/O requests.

An existing file system cannot be converted directly to a stripe. To place an existing file system on a stripe volume, you must back up the file system, create the volume, and then restore the file system to the stripe volume.
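A hedged sketch of that procedure, assuming the existing file system is mounted on /docs on slice c0t0d0s7 and the new stripe d25 is built from two other slices (all names are illustrative):

# ufsdump 0f /var/tmp/docs.dump /docs
# umount /docs
# metainit d25 1 2 c1t0d0s7 c2t0d0s7
# newfs /dev/md/rdsk/d25
# mount /dev/md/dsk/d25 /docs
# cd /docs; ufsrestore rf /var/tmp/docs.dump

The level-0 ufsdump is written to a file here for simplicity; a tape device or another file system could be used instead, provided it has room for the backup.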

Interlace Values for a RAID-0 (Stripe) Volume:

An interlace is the size, in Kbytes, Mbytes, or blocks, of the logical data segments on a stripe volume. Depending on the application, different interlace values can increase performance for your configuration. The performance increase comes from having several disk arms managing I/O requests. When an I/O request is larger than the interlace size, it is spread across more than one disk, which can improve performance.

When you create a stripe volume, you can set the interlace value or use the Solaris Volume Manager default interlace value of 16 Kbytes. Once you have created the stripe volume, you cannot change the interlace value. However, you could back up the data on it, delete the stripe volume, create a new stripe volume with a new interlace value, and then restore the data.
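Because the interlace is fixed when the stripe is created, changing it means recreating the volume. A minimal sketch, assuming the volume is d30 and the new interlace is 64 KB (both values are illustrative), with the backup and restore steps handled as outlined earlier:

# metaclear d30
# metainit d30 1 3 c1t0d0s7 c2t0d0s7 c1t1d0s7 -i 64k

metaclear deletes the old volume definition; the data must already have been backed up and is restored after the new volume is created and a file system is built on it.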


RAID-0 (Concatenation) Volume:

It is a volume whose data is organized serially and adjacently across components, forming one logical storage unit. The total capacity of a concatenation volume is equal to the total size of all the components in the volume. If a concatenation volume contains a slice with a state database replica, the total capacity of the volume is the sum of the components minus the space that is reserved for the replica.

Advantages:
1. It provides more storage capacity by combining the capacities of several components. You can add more components to the concatenation volume as the demand for storage grows.
2. It allows you to dynamically expand storage capacity and file system sizes online. A concatenation volume allows you to add components even if the other components are currently active.
3. A concatenation volume can also expand any active and mounted UFS file system without having to bring down the system, as shown in the sketch below.
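A hedged sketch of such an online expansion, assuming the concatenation is d25 and the UFS file system on it is mounted on /docs (the volume, slice, and mount-point names are assumptions):

# metattach d25 c3t0d0s7
# growfs -M /docs /dev/md/rdsk/d25

metattach concatenates the new slice onto the volume, and growfs -M expands the mounted UFS file system to use the added space.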



Note: Use a concatenation volume to encapsulate root (/), swap, /usr, /opt, or /var when mirroring these file systems.

The data blocks are written sequentially across the components, beginning with Slice A. Let us consider Slice A containing logical data blocks 1 through 4. Slice B would then contain logical data blocks 5 through 8, and Slice C would contain logical data blocks 9 through 12. The total capacity of the volume is the combined capacity of the three slices. If each slice were 10 Gbytes, the volume would have an overall capacity of 30 Gbytes.

RAID-1 (Mirror) Volumes:

It is a volume that maintains identical copies of the data in RAID-0 (stripe or concatenation) volumes.

Mirroring requires at least twice as much disk space as the amount of data to be mirrored. Because Solaris Volume Manager must write to all submirrors, mirroring can also increase the time it takes for write requests to be written to disk.

We can mirror any file system, including existing file systems such as root (/), swap, and /usr. We can also use a mirror for any application, such as a database.

A mirror is composed of one or more RAID-0 volumes (stripes or concatenations) called submirrors.

A mirror can consist of up to four submirrors. However, two-way mirrors usually provide sufficient data redundancy for most applications and are less expensive in terms of disk drive costs. A third submirror enables you to make online backups without losing data redundancy while one submirror is offline for the backup.

If you take a submirror "offline", the mirror stops reading and writing to the submirror. At this point, you could access the submirror itself, for example, to perform a backup. However, the submirror is in a read-only state. While a submirror is offline, Solaris Volume Manager keeps track of all writes to the mirror. When the submirror is brought back online, only the portions of the mirror that were written while the submirror was offline (the resynchronization regions) are resynchronized. Submirrors can also be taken offline to troubleshoot or repair physical devices that have errors.
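For example, a submirror can be taken offline for a backup and brought back afterwards; a hedged sketch assuming a mirror d50 with submirror d52 and a tape drive at /dev/rmt/0 (all names are illustrative):

# metaoffline d50 d52
# ufsdump 0f /dev/rmt/0 /dev/md/rdsk/d52
# metaonline d50 d52

While d52 is offline it is read-only and can be dumped safely; when it is brought back online, only the regions written to the mirror in the meantime are resynchronized.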

Submirrors can be attached or detached from a mirror at any time, though at least one submirror must remain attached at all times.

Normally, you create a mirror with only a single submirror. Then, you attach a second submirror after you create the mirror.
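A hedged sketch of that sequence, assuming two one-slice submirrors d51 and d52 built from c0t0d0s2 and c1t0d0s2, combined into a mirror d50 (all names are illustrative):

# metainit d51 1 1 c0t0d0s2
# metainit d52 1 1 c1t0d0s2
# metainit d50 -m d51
# metattach d50 d52

The mirror d50 is created with only d51 attached; attaching d52 with metattach starts a full resynchronization from the existing submirror.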
 

The figure shows RAID-1 (Mirror):

Diagram shows how two RAID-0 volumes are used together as a RAID-1 (mirror) volume to provide redundant storage. It shows a mirror, d20. The mirror is made of two volumes (submirrors) d21 and d22.

Solaris Volume Manager makes duplicate copies of the data on multiple physical disks, and presents one virtual disk to the application, d20 in the example. All disk writes are duplicated. Disk reads come from one of the underlying submirrors. The total capacity of mirror d20 is the size of the smallest of the submirrors (if they are not of equal size).

Providing RAID-1+0 and RAID-0+1:

Solaris Volume Manager supports both RAID-1+0 and RAID-0+1 redundancy.
RAID-1+0 redundancy constitutes a configuration of mirrors that are then striped.
RAID-0+1 redundancy constitutes a configuration of stripes that are then mirrored.

Note: Solaris Volume Manager cannot always provide RAID-1+0 functionality. However, where both submirrors are identical to each other and are composed of disk slices (and not soft partitions), RAID-1+0 is possible.

Let us consider a RAID-0+1 implementation with a two-way mirror that consists of three striped slices:

Without Solaris Volume Manager, a single slice failure could fail one side of the mirror. Assuming that no hot spares are in use, a second slice failure would fail the mirror. Using Solaris Volume Manager, up to three slices could potentially fail without failing the mirror. The mirror does not fail because each of the three striped slices is individually mirrored to its counterpart on the other half of the mirror.

The diagram shows how three of six total slices in a RAID-1 volume can potentially fail without data loss because of the RAID-1+0 implementation.


The RAID-1 volume consists of two submirrors. Each of the submirrors consists of three identical physical disks that have the same interlace value. A failure of three disks, A, B, and F, is tolerated because the entire logical block range of the mirror is still contained on at least one good disk. All of the volume's data is available.

However, if disks A and D fail, a portion of the mirror's data is no longer available on any disk, and access to those logical blocks fails. Access to the portions of the mirror where data is available still succeeds. In this situation, the mirror acts like a single disk that has developed bad blocks: the damaged portions are unavailable, but the remaining portions are available.


Mirror resynchronization:
It ensures proper mirror operation by maintaining all submirrors with identical data, with the exception of writes in progress.

Note: A mirror resynchronization should not be bypassed. You do not need to manually initiate a mirror resynchronization. This process occurs automatically.

Full Resynchronization:
When a new submirror is attached (added) to a mirror, all the data from another submirror in the mirror is automatically written to the newly attached submirror. Once the mirror resynchronization is done, the new submirror is readable. A submirror remains attached to a mirror until it is detached.

If the system crashes while a resynchronization is in progress, the resynchronization is restarted when the system finishes rebooting.

Optimized Resynchronization:
During a reboot following a system failure, or when a submirror that was offline is brought back online, Solaris Volume Manager performs an optimized mirror resynchronization. The metadisk driver tracks submirror regions, so it knows which regions might be out of sync after a failure, and an optimized mirror resynchronization is performed only on those out-of-sync regions. You can specify the order in which mirrors are resynchronized during reboot, and you can omit a mirror resynchronization by setting the submirror pass number to zero.

Caution: A pass number of zero should only be used on mirrors that are mounted as read-only.
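The pass number can be displayed and changed with the metaparam command; a brief sketch, assuming a read-only mirror d50 (the volume name is illustrative):

# metaparam -p 0 d50
# metaparam d50

The first command sets the resynchronization pass number to zero; the second displays the mirror's current parameters so the change can be verified.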

 
Partial Resynchronization:

After the replacement of a slice within a submirror, SVM performs a partial mirror resynchronization of data. SVM copies the data from the remaining good slices of another submirror to the replaced slice.
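For example, after a failed disk has been physically replaced, the affected component can be re-enabled; a hedged sketch assuming mirror d50 and the replaced slice c1t4d0s2 (both names are illustrative):

# metareplace -e d50 c1t4d0s2

The -e option re-enables the replaced slice, and SVM resynchronizes only that slice from the other submirror.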
 

RAID-5 Volumes:

RAID level 5 is similar to striping, but with parity data distributed across all components (disk or logical volume). If a component fails, the data on the failed component can be rebuilt from the distributed data and parity information on the other components.

A RAID-5 volume uses storage capacity equivalent to one component in the volume to store redundant information (parity). This parity information contains information about user data stored on the remainder of the RAID-5 volume's components. The parity information is distributed across all components in the volume.

Similar to a mirror, a RAID-5 volume increases data availability, but with a minimum of cost in terms of hardware and only a moderate penalty for write operations.

Note: We cannot use a RAID-5 volume for the root (/), /usr, and swap file systems, or for other existing file systems.

SVM automatically resynchronizes a RAID-5 volume when you replace an existing component. SVM also resynchronizes RAID-5 volumes during rebooting if a system failure or panic took place.
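A hedged sketch of creating a RAID-5 volume, assuming four components and a volume name of d45 (the slice names and the 32 KB interlace are illustrative):

# metainit d45 -r c2t3d0s2 c3t0d0s2 c4t0d0s2 c5t0d0s2 -i 32k
# metastat d45

The -r option tells metainit to build a RAID-5 volume from the listed components; metastat shows the volume state while it is being initialized.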

Example:
The following figure shows a RAID-5 volume that consists of four disks (components):


The first three data segments are written to Component A (interlace 1), Component B (interlace 2), and Component C (interlace 3). The next data segment that is written is a parity segment. This parity segment is written to Component D (P 1–3). This segment consists of an exclusive OR of the first three segments of data. The next three data segments are written to Component A (interlace 4), Component B (interlace 5), and Component D (interlace 6). Then, another parity segment is written to Component C (P 4–6).

This pattern of writing data and parity segments results in both data and parity being spread across all disks in the RAID-5 volume. Each drive can be read independently. The parity protects against a single disk failure. If each disk in this example were 10 Gbytes, the total capacity of the RAID-5 volume would be 30 Gbytes, because one drive's worth of space (10 Gbytes) is allocated to parity.
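To see why parity allows reconstruction, recall that each parity segment is the bitwise exclusive OR (XOR) of the data segments it protects, so any one missing segment can be recomputed from the survivors. A toy illustration in a shell such as bash or ksh93 that accepts C-style hexadecimal constants (the byte values A = 0xA5, B = 0x3C, and C = 0xF0 are arbitrary):

# P=$(( 0xA5 ^ 0x3C ^ 0xF0 ))
# printf 'parity = 0x%X   recovered B = 0x%X\n' $P $(( 0xA5 ^ 0xF0 ^ P ))
parity = 0x69   recovered B = 0x3C

If the component holding B fails, XOR-ing the remaining data segments with the parity segment reproduces B exactly, which is how a RAID-5 volume rebuilds data onto a replacement component.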


State Database:
  • It stores information on disk about the state of the Solaris Volume Manager configuration.
  • Multiple copies of the database, called replicas, provide redundancy and should be distributed across multiple disks.
  • SVM uses a majority consensus algorithm to determine which state database replicas contain valid data. The algorithm requires that a majority (half + 1) of the state database replicas be available before any of them are considered valid.
Creating a state database:
#metadb -a -c n -l nnnn -f ctds-of-slice
-a specifies to add a state database replica.
-f specifies to force the operation, even if no replicas exist.
-c n specifies the number of replicas to add to the specified slice.
-l nnnn specifies the size of the new replicas, in blocks.
ctds-of-slice specifies the name of the component that will hold the replica.
Use the -f flag to force the addition of the initial replicas.

Example: Creating the First State Database Replica
# metadb -a -f c0t0d0s0 c0t0d0s1 c0t0d0s4 c0t0d0s5

# metadb
        flags         first blk      block count
...
     a      u         16             8192            /dev/dsk/c0t0d0s0

     a      u         16             8192            /dev/dsk/c0t0d0s1
     a      u         16             8192            /dev/dsk/c0t0d0s4
     a      u         16             8192            /dev/dsk/c0t0d0s5

The -a option adds the additional state database replica to the system, and the -f option forces the creation of the first replica (and may be omitted when you add supplemental replicas to the system).


#metadb -a -f -c 2 c1t1d0s1 c1t1d0s2
The above command adds two replicas on each of the slices c1t1d0s1 and c1t1d0s2.


Deleting a State Database Replica:
# metadb -d c2t4d0s7
The -d option deletes all replicas that are located on the specified slice. The /etc/system file and the /etc/lvm/mddb.cf file are automatically updated with the new information.


Metainit command:
This command is used to create metadevices. The syntax is as follows:

#metainit -f concat/stripe numstripes width component....

-f: Forces the metainit command to continue, even if one of the slices contains a mounted file system or is in use.
concat/stripe: Volume name of the concatenation/stripe being defined.
numstripes: Number of individual stripes in the metadevice. For a simple stripe, numstripes is always 1. For a concatenation, numstripes is equal to the number of slices.
width: Number of slices that make up a stripe. When width is greater than 1, the slices are striped.
component: the logical name of the physical slice (partition) on a disk drive, for example c0t0d0s7.

Example:

# metainit d30 3 1 c0t0d0s7 1 c0t2d0s7 1 c0t3d0s7
d30: Concat/Stripe is setup

The above example creates a concatenation volume consisting of three slices.

Creating RAID-0 striped volume:


1. Create a striped volume named /dev/md/rdsk/d30 from three slices using the metainit command. We will use slices c1t0d0s7, c2t0d0s7, and c1t1d0s7 as follows:

# metainit d30 1 3 c1t0d0s7 c2t0d0s7 c1t1d0s7 -i 32k
d30: Concat/Stripe is setup

2. Use the metastat command to query your new volume:

# metastat d30
d30: Concat/Stripe
    Size: 52999569 blocks (25 GB)
    Stripe 0: (interlace: 64 blocks)
        Device     Start Block  Dbase   Reloc
        c1t0d0s7      10773     Yes     Yes
        c2t0d0s7      10773     Yes     Yes
        c1t1d0s7      10773     Yes     Yes

The new striped volume, d30, consists of a single stripe (Stripe 0) made of three slices (c1t0d0s7, c2t0d0s7, c1t1d0s7). 
The -i option sets the interlace to 32 KB. (The interlace cannot be less than 8 KB, nor greater than 100 MB.) If the interlace were not specified on the command line, the striped volume would use the default of 16 KB.
When using the metastat command to verify the volume, we can see that all three slices belong to Stripe 0, which tells us this is a striped volume, and that the interlace is 32 KB (512 bytes * 64 blocks), as we defined it. The total size of the stripe is 27,135,779,328 bytes (512 * 52999569 blocks).

3. Create a UFS file system on the volume using the newfs command, specifying one inode per 8 KB of data space (-i 8192):

# newfs -i 8192 /dev/md/rdsk/d30
newfs: /dev/md/rdsk/d30 last mounted as /oracle
newfs: construct a new file system /dev/md/rdsk/d30: (y/n)? y
Warning: 1 sector(s) in last cylinder unallocated
/dev/md/rdsk/d30:        52999568 sectors in 14759 cylinders of 27 tracks, 133 sectors
        25878.7MB in 923 cyl groups (16 c/g, 28.05MB/g, 3392 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 57632, 115232, 172832, 230432, 288032, 345632, 403232, 460832, 518432,
Initializing cylinder groups:
..................
super-block backups for last 10 cylinder groups at:
 52459808, 52517408, 52575008, 52632608, 52690208, 52747808, 52805408,
 52863008, 52920608, 52978208,

4. Mount the file system on /oracle as follows:

# mkdir /oracle
# mount -F ufs /dev/md/dsk/d30 /oracle

5. To ensure that this new file system is mounted each time the machine boots, add the following line to your /etc/vfstab file:
/dev/md/dsk/d30       /dev/md/rdsk/d30      /oracle  ufs     2       yes     -



Creating RAID-0 Concatenated volume:


1. Create a concatenated volume named /dev/md/rdsk/d30 from three slices using the metainit command. We will use slices c2t1d0s7, c1t2d0s7, and c2t2d0s7 as follows:

# metainit d30 3 1 c2t1d0s7 1 c1t2d0s7 1 c2t2d0s7
d30: Concat/Stripe is setup

2. Use the metastat command to query the new volume:

# metastat
d30: Concat/Stripe
    Size: 53003160 blocks (25 GB)
    Stripe 0:
        Device     Start Block  Dbase   Reloc
        c2t1d0s7      10773     Yes     Yes
    Stripe 1:
        Device     Start Block  Dbase   Reloc
        c1t2d0s7      10773     Yes     Yes
    Stripe 2:
        Device     Start Block  Dbase   Reloc
        c2t2d0s7      10773     Yes     Yes

The new concatenated volume, d30, consists of three stripes (Stripe 0, Stripe 1, and Stripe 2), each made from a single slice (c2t1d0s7, c1t2d0s7, and c2t2d0s7 respectively). When using the metastat command to verify the volume, we can see that it is a concatenation from the fact that it has multiple stripes. The total size of the concatenation is 27,137,617,920 bytes (512 * 53003160 blocks).

3. Create a UFS file system on the volume using the newfs command, specifying one inode per 8 KB of data space (-i 8192):
# newfs -i 8192 /dev/md/rdsk/d30
newfs: construct a new file system /dev/md/rdsk/d30: (y/n)? y
/dev/md/rdsk/d30:        53003160 sectors in 14760 cylinders of 27 tracks, 133 sectors
        25880.4MB in 923 cyl groups (16 c/g, 28.05MB/g, 3392 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 57632, 115232, 172832, 230432, 288032, 345632, 403232, 460832, 518432,
Initializing cylinder groups:
..................
super-block backups for last 10 cylinder groups at:
 52459808, 52517408, 52575008, 52632608, 52690208, 52747808, 52805408,
 52863008, 52920608, 52978208,

4. Mount the file system on /oracle as follows:

# mkdir /oracle
# mount -F ufs /dev/md/dsk/d30 /oracle

5. To ensure that this new file system is mounted each time the machine boots, add the following line to your /etc/vfstab file:
/dev/md/dsk/d30       /dev/md/rdsk/d30      /oracle  ufs     2       yes     -







