3Par Course: Day 1

A 3Par training course started at work today. I have no experience with 3Pars, so this is really my first encounter with them. I know a bit about storage thanks to having worked with HP LeftHands (now called HP StoreVirtual) and Windows Server 2012 Storage Spaces – but 3Par is a whole other beast!

Will post more notes on today’s material tomorrow (hopefully!), but here are some bits and pieces before I go to sleep:

  • You have disks.
  • These disks are in magazines (in the 10000 series) – up to 4 per magazine if I remember correctly. 
    • The magazines are then put into cages. 10 magazines per cage? 
  • Disks can also be in cages directly (in the non-10000 series, such as the 7200 and 7400 series). 
    • Note: I could be getting the model numbers and figures wrong so take these with a pinch of salt!
  • Important thing to remember is you have disks and these disks are in cages (either directly or as part of magazines). 
  • Let’s focus on the 7200/ 7400 series for now.
    • A cage is a 2U enclosure. 
    • In the front you have the disks (directly, no magazines here). 
    • In the rear you have two nodes. 
      • Yes, two nodes! 3Pars always work in pairs. So you need two nodes. 
      • The 7200 series can only have two nodes – that’s both the minimum and the maximum. 
      • 7400 series can have two nodes min, four nodes max. 
      • So it’s better to get a 7400 series even if you only want two nodes now as you can upgrade later on. With a 7200 series you are stuck with what you get. 
      • 7200 series is still in the market coz it’s a bit cheaper. That’s coz it also has lower specs (coz it’s never going to do as much as a 7400 series). 
    • What else? Oh yeah, drive shelves. (Not sure if I am getting the term correct here). 
    • Drive shelves are simply cages with only drives in them. No nodes!
    • There are limits on how many shelves a node can control.
      • 7200 series has the lowest limit. 
      • Followed by a two node 7400 series.
      • Followed by a four node 7400 series. This dude has the most!
        • A four node 7400 is 4Us of nodes (on the rear side).
        • The rest of the rack – 42U (rack size) minus 4U (nodes + disks) = 38U – is all drive shelves!
      • The number of drives in a shelf varies if I remember correctly. As in, you can have larger drives (in physical size and storage), in which case there are fewer per shelf. 
      • Or you could have more drives of a smaller size/ lower storage. 
        • Sorry I don’t have clear numbers here! Most of this is from memory. Must get the slides and plug in more details later. 
    • Speaking of nodes, these are the brains behind the operation.
      • A node contains an ASIC (Application Specific Integrated Circuit). Basically a chip that’s designed for a specific task. Cisco routers have ASICs. Brocade switches have ASICs. Many things have ASICs in them. 
      • A node contains a regular CPU – for management tasks – and also an ASIC. The ASIC does all the 3Par stuff: deduplication, handling metadata, optimizing traffic and writes (it skips zeroes when writing/ sending data – which is a big deal). 
      • The ASIC and one more thing (TPxx – 3Par xx) are the two 3Par innovations. Plus the fact that everything’s built for storage, unlike a LeftHand, which is just a ProLiant server. 
      • Caching is a big deal with 3Pars. 
        • You have write caching. Which means whenever the 3Par is given a blob of data, the node that receives it (1) stores it in its cache, (2) sends it to its partner, and (3) tells whoever gave it the blob that the data is now written. Note that in reality the data is only in the cache; but since both nodes now have it in their cache it can be considered somewhat safe, and so rather than wait for the disks to write the data and reply with a success, the node assures whoever gave it the data that the data is written to disk. (There’s a rough sketch of this acknowledgement flow after this list.)
          • Write caching obviously improves performance tremendously! So it’s a big deal. The 7400 series has larger caches than the 7200.
          • This also means that if one node in a two-node pair is down, caching won’t happen – coz now the remaining node can’t simply lie that the data is written. What happens if it too fails? There is no other node with a cached copy. So before the node replies with a confirmation it must actually write the data to disk. Hence if one node fails, performance is affected.
        • And then you have read caching. When data is read, additional data around it is read ahead. This improves performance if that additional data is required next. (If it is required then it’s a cache hit, else a cache miss.) 
        • Caching is also involved in de-duplication. 
          • Oh, de-duplication only happens for SSDs. 
            • And it doesn’t happen if you want Adaptive Optimization (whatever that is). 
      • There is remote copying – which is like SRM for VMware. You can have data being replicated from one 3Par system to another. And this can happen between the various models.
      • Speaking of which, all 3Par models have the same OS etc. So that’s why you can do copying and such easily. And manage via a common UI. 
        • There’s a GUI. And a command line interface (CLI). The CLI commands are also available in a regular command prompt. And there are PowerShell cmdlets, now in beta testing?
  • Data is stored in chunklets. These are 1 GB units, spread across disks.
  • There’s something called a CPG (Common Provisioning Group). This is something like a template or a definition. 
    • You define a CPG that says (for instance) you want RAID1 (mirroring) with 4 sets (i.e. 4 copies of the data) making use of a particular set of physical disks.
  • Then you create Logical Disks (LDs) based on a CPG. You can have multiple logical disks – based on the same or different CPGs. 
  • Finally you create volumes on these Logical Disks. These are what you export to hosts via iSCSI & friends. 
  • We are not done yet! :) There are also regions. These are 128 MB blocks of data. 8 x 128 MB = 1024 MB (1 GB) = the size of a chunklet. So a chunklet has 8 regions.
  • A region is per disk. So when I said above that a chunklet is spread across disks, what I really meant is that a chunklet is made up of 8 regions and each region is on a particular disk. That way a chunklet is spread across multiple disks. (Tada!)
    • A chunklet is to a physical disk what a region is to a logical disk. 
    • So while a chunklet is 1 GB and doesn’t mean much else, a region has properties. If a logical disk is RAID1, the region has a twin with a copy of the same data. (Or does it? I am not sure really!) 
  • And lastly … we have steps! A step is 128 KB (by default, for a certain class of disks; it varies for different classes, and the exact number doesn’t matter much). 
    • Here’s what happens: when the 3Par gets some data to be written, it writes 128 KB (one step) to one region, then 128 KB to the next region, and so on. This way the data is spread across many regions. (There’s a rough sketch of this striping after this list.)
    • And somehow that ties in to chunklets and how 3Par is so cool and everything is redundant and super performant etc. Now that I think back I am not really sure what the step does. It was clear to me after I asked the instructor many questions and he explained it well – but now I forget it already! Sigh. 
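
To make the write-caching bullets above a bit more concrete, here’s a rough Python sketch of the acknowledgement flow as I understood it: cache locally, mirror into the partner’s cache, then acknowledge; if the partner is down, write through to disk before acknowledging. All the names here are mine and this is only a toy model, not actual 3Par code.

```python
# Toy model of the write-back caching behaviour described above.
# Class/method names are made up for illustration; not real 3Par code.

class Node:
    def __init__(self, name):
        self.name = name
        self.is_up = True
        self.cache = []      # writes held in cache, not yet flushed to disk
        self.disk = []       # writes that have reached disk
        self.partner = None  # the other node in the pair

    def handle_write(self, blob):
        """Accept a write from a host and return once it is considered safe."""
        if self.partner is not None and self.partner.is_up:
            # Normal case: cache locally, mirror into the partner's cache,
            # then acknowledge straight away (write-back caching).
            self.cache.append(blob)
            self.partner.cache.append(blob)
            return "ack (cached on both nodes)"
        # Degraded case: no partner to hold a second copy of the cache, so
        # write through to disk before acknowledging -- which is why
        # performance drops when one node of the pair is down.
        self.disk.append(blob)
        return "ack (written through to disk)"


node_a, node_b = Node("node-a"), Node("node-b")
node_a.partner, node_b.partner = node_b, node_a

print(node_a.handle_write("blob 1"))   # ack (cached on both nodes)
node_b.is_up = False                   # simulate the partner node failing
print(node_a.handle_write("blob 2"))   # ack (written through to disk)
```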
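
And here’s the chunklet/region/step arithmetic from the last few bullets, again as a rough sketch with made-up names – just the numbers quoted above (1 GB chunklets, 128 MB regions, 128 KB steps), not how 3Par actually lays data out.

```python
# Sizes as quoted in the bullets above (the step size varies by disk class).
CHUNKLET_MB = 1024   # a chunklet is 1 GB
REGION_MB = 128      # a region is 128 MB
STEP_KB = 128        # default step size

regions_per_chunklet = CHUNKLET_MB // REGION_MB
print(regions_per_chunklet)   # 8 -> a chunklet is made up of 8 regions

def region_for_step(step_index, num_regions=regions_per_chunklet):
    """Which region (and hence which disk) the Nth 128 KB step lands on,
    assuming simple round-robin striping across the regions."""
    return step_index % num_regions

for step in range(10):
    print(f"step {step} -> region {region_for_step(step)}")
# step 0 -> region 0, step 1 -> region 1, ..., step 8 wraps back to region 0
```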

Some day soon I will update this post or write a new one that explains all these better. But for now this will have to suffice. Baby steps!

Clusters & Quorum

I spent yesterday refreshing my knowledge of clusters and quorum. Hadn’t worked on these since Windows Server 2003! So here’s a brief intro to this topic:

There are two types of clusters – Network Load Balancing (NLB) and Server Clusters.

Network Load Balancing is “share all” in that every server has a full copy of the app (and the data too, if that can be done). Each server is active, and requests are sent to each of them randomly (or using some load-distributing algorithm). You can easily add/ remove servers. Examples of where you’d use NLB are SMTP servers, web servers, etc. Each server in this case is independent of the others as long as they are configured identically.

Server Clusters are “share nothing” in that only one server has a full copy of the app and is active. The other servers are in standby mode, waiting to take over if the active one fails. Shared storage is used, which is how the standby servers can take over if the active server fails.

The way clusters work is that clients see one “virtual” server (not to be confused with the virtual servers of virtualization). Behind this virtual server are the physical servers (actually called “cluster nodes”) that make up the cluster. As far as clients are concerned there’s one end point – an IP address or MAC address – and what happens behind that is unknown to them. This virtual server is “created” when the cluster forms; it doesn’t exist before that. (It is important to remember this because even if the servers are in a cluster the virtual server may not be created – as we shall see later.)

In the case of server clusters something called “quorum” comes into play.

Imagine you have 5 servers in a cluster. Say server 4 is the active server and it goes offline. Immediately, the other servers detect this and one of them becomes the new active server. But what if server 4 isn’t offline and is merely disconnected from the rest of the group? Now we’ll have server 4 continuing to be active, but the other servers can’t see this, and so one of them too becomes the active server – resulting in two active servers! To prevent such scenarios you have the concept of quorum – a term you might have heard in other contexts, such as decision-making groups. A quorum is the minimum number of people required to make a decision. Say a group of 10 people is deciding something; one could stipulate that at least 6 members must be present during the decision-making process, else the group is not allowed to decide. A similar concept applies in the case of clusters.

In its simplest form you designate one resource (a server or a disk) as the quorum and whichever cluster contains that resource sets itself as the active cluster while the other clusters deactivate themselves. This resource also holds the cluster database, which is a database containing the state of the cluster and its nodes, and is accessed by all the nodes. In the example above, initially all 5 servers are connected and can see the quorum, so the cluster is active and one of the servers in it is active. When the split happens and server 4 (the currently active server) is separated, it can no longer see the quorum and so disables the cluster (which is just itself really) and stops being the active server. The other 4 servers can still see the quorum, so they continue being active and set a new server as the active one.

In this simple form the quorum is really like a token you hold. If your cluster holds the token it’s active; if it does not you disband (deactivate the cluster).

This simple form of quorum is usually fine, but it does not scale to setups where you have clusters across sites. Moreover, the quorum resource is a single point of failure. Say that resource is the one that’s disconnected or offline – now no one has the quorum, and worse, the cluster database is lost. For this reason the simple form of quorum is not used nowadays. (This mode of quorum is called “Disk Only”, by the way.)
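
As a toy illustration of this “token” style of quorum (names entirely mine, not from any real cluster software): a partition stays active only if it can still see the designated quorum resource, so that resource is both the deciding factor and the single point of failure.

```python
# Toy "Disk Only" quorum: whichever partition can still see the quorum
# resource stays active; every other partition deactivates itself.
def stays_active(partition, can_see_quorum_resource):
    """A partition keeps its cluster up only if it can see the quorum resource."""
    return can_see_quorum_resource

# The 5-server example above: server 4 is cut off from the quorum resource,
# the other four servers can still see it.
print(stays_active({"srv4"}, can_see_quorum_resource=False))                         # False -> deactivates
print(stays_active({"srv1", "srv2", "srv3", "srv5"}, can_see_quorum_resource=True))  # True  -> stays active
```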

There are three alternatives to the Disk Only mode of quorum.

One mode is called “Node Majority” and, as the name suggests, it is based on majority. Here each node has a copy of the cluster database (so there’s no single point of failure) and whichever cluster has more than half the nodes wins. So in the previous example, say the cluster of 5 servers splits into clusters of 3 and 2 servers each – since the first cluster has more than half of the nodes, that one wins. (In practice a vote takes place to decide this. Each node has a vote. So cluster one has 1+1+1 = 3 votes; cluster two has 1+1 = 2 votes. Cluster one wins.)

Quorums based on majority have a disadvantage in that if the number of nodes is even you can have a tie. That is, if the above example were a 6-node cluster and it split into two clusters of 3 nodes each, both clusters would deactivate as neither has more than half the nodes (i.e. neither has quorum). You need a tie breaker!

It is worth noting that the effect of quorum extends to the number of servers that can fail in a cluster. Say we have a 3-node cluster. For the cluster to have quorum, it must have at least 2 servers. If one server fails, the cluster still has 2 servers so it will function as usual. But if one more server fails – or these two servers are disconnected from each other – the remaining cluster does not have quorum (there’s only 1 node, which is not more than half the nodes) and so the cluster stops and the servers deactivate. This means that even though we have one server still running, and intuitively one would expect that server to continue servicing requests (as it would have in the case of an NLB cluster), it does not do so in the case of a server cluster, due to quorum! This is important to remember.
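
Here’s the Node Majority arithmetic from the last few paragraphs as a quick sketch (a toy vote count with a made-up function name, not how Windows actually implements it):

```python
# Node Majority as a toy vote count: each node gets one vote, and a
# partition has quorum only if it holds more than half of all the votes.
def has_node_majority(nodes_in_partition, total_nodes):
    return nodes_in_partition > total_nodes / 2

# 5-node cluster splits 3 / 2: only the 3-node side keeps quorum.
print(has_node_majority(3, 5), has_node_majority(2, 5))   # True False

# 6-node cluster splits 3 / 3: a tie, so *neither* side has quorum.
print(has_node_majority(3, 6))                            # False

# 3-node cluster with two nodes failed: the lone survivor loses quorum
# too, even though it is perfectly healthy.
print(has_node_majority(1, 3))                            # False
```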

Another mode is called “Node & Disk Majority” and, as the name suggests, it is a mix of the “Node Majority” and “Disk Only” modes. This mode is for clusters with an even number of nodes (where, as we know, “Node Majority” can end in a tie) and the way it works is that each node has a vote as before, but the resource designated as quorum (a disk, usually called a “disk witness”) also gets a vote, and whichever cluster holds more than half of the total votes is the active one. Essentially the disk witness acts as the tie breaker. (So a cluster of 6 nodes plus a disk witness has 7 votes in total; if it splits into 3 nodes each, the cluster that holds the disk witness has 3+1 = 4 votes and hence wins quorum.)

In “Node & Disk Majority” mode, unlike in the “Disk Only” mode, the cluster database is present on all the nodes, so the disk witness is not a single point of failure either.

The last mode is called “Node & File Share Majority” and is a variant of the “Node Majority” mode. This mode too is for clusters with an even number of nodes, and it works similarly to “Node & Disk Majority” except that instead of a disk resource a file share is used. A file share (called a “file share witness” in this case) is selected on any server – not necessarily a part of the cluster – and one node in the cluster locks a file on this share, effectively telling the others that it “owns” the file share. So instead of a disk acting as the tie breaker, ownership of the file share is the tie breaker. (As in the “Node & Disk Majority” mode, the node that owns the file share gets an additional vote.) Using the previous examples, if a cluster with 6 nodes splits into 3 nodes each, whichever cluster has the node owning the file share will win quorum while the other cluster will deactivate. If the cluster with 6 nodes splits into clusters of 4 and 2 nodes, and say the 2-node cluster owns the file share, it will still lose, as more than half the nodes are in the 4-node cluster (in terms of votes, the winning cluster has 4 and the losing cluster has 3). When a cluster deactivates, the current owner releases the lock and a node in the other cluster takes ownership.
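
Extending the same toy vote count to the two witness-based modes (again just a sketch with made-up names): the disk witness or file share witness simply adds one vote to whichever side owns it.

```python
# Node & Disk Majority / Node & File Share Majority as a toy vote count:
# the witness contributes one extra vote to the side that owns it.
def has_quorum(nodes_in_partition, total_nodes, owns_witness):
    total_votes = total_nodes + 1                      # every node + the witness
    my_votes = nodes_in_partition + (1 if owns_witness else 0)
    return my_votes > total_votes / 2

# 6-node cluster splits 3 / 3: the side owning the witness has 4 of 7 votes.
print(has_quorum(3, 6, owns_witness=True))    # True  -> wins quorum
print(has_quorum(3, 6, owns_witness=False))   # False -> deactivates

# 6-node cluster splits 4 / 2 with the 2-node side owning the witness:
# 4 votes still beats 2 + 1 = 3, so the bigger side wins anyway.
print(has_quorum(4, 6, owns_witness=False))   # True
print(has_quorum(2, 6, owns_witness=True))    # False
```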

An advantage of the “Node & File Share Majority” mode is that the file share can be anywhere – even on another cluster (preferably). The file share can also be on a node in the cluster itself, but that’s not preferred, because if that node fails you lose two votes (the file share owner’s vote as well as the node’s own vote).

At work we have two HP LeftHand boxes in a SAN cluster. The only quorum mode this uses is the “Node Majority” one, which as we know fails for an even number of nodes, and so HP supplies a virtual appliance called “Failover Manager” that is installed on the ESX hosts and is used as the tie breaker. If the Failover Manager is powered off and both LeftHands are powered on together – suppose both devices happen to come on at the same time – a cluster is not formed, as neither has quorum. To avoid such situations the Failover Manager has to be present when they power on, or we have to time the powering on such that one of the LeftHands is online before the other.
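
To tie that back to the vote counting above, here’s the same sort of toy sketch for the two-LeftHand setup (a made-up function, not how the LeftHand software actually counts its managers): the Failover Manager is effectively a third voter, so a LeftHand that can see the FOM holds 2 of the 3 votes, while one that comes up seeing neither the FOM nor (yet) its partner has only 1 vote and can’t form the cluster – roughly the power-on situation described above.

```python
# Two LeftHand nodes plus the Failover Manager (FOM) as a tie-breaking
# third voter.  Toy vote count only; names are made up for illustration.
def can_form_cluster(partner_visible, fom_visible):
    total_votes = 3                                   # 2 LeftHands + 1 FOM
    my_votes = 1 + (1 if partner_visible else 0) + (1 if fom_visible else 0)
    return my_votes > total_votes / 2

print(can_form_cluster(partner_visible=False, fom_visible=True))   # True:  2 of 3 votes
print(can_form_cluster(partner_visible=True,  fom_visible=True))   # True:  3 of 3 votes
print(can_form_cluster(partner_visible=False, fom_visible=False))  # False: 1 of 3 votes -- no quorum
```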