Technical Introduction to DB2 PureScale CF
Speaker Adam Koile
Click to add text
Lennox
September 8, 2014 © 2009 IBM Corporation
Information Management
Agenda
Overview
What makes up a CF
CF connection pool
Configuration parameters
Multiple Host Channel adapter support
2 © 2009 IBM Corporation
Information Management
Primary CF and Secondary CF
Cluster Caching Facility (CF)
Primary CF / Primary role
– responsible for the administration and granting of lock requests to
DB2 members
– responsible for caching data pages, and is a repository for internal
communication used by DB2 members
– must exist for the DB2 SD cluster to function operationally.
Secondary CF
– maintain a copy of pertinent information for immediate recovery of
the PRIMARY role
3 © 2009 IBM Corporation
Information Management
DB2 pureScale : Architecture
Clients
Single Database View
Member Member Member Member
CS CS CS CS
Cluster Interconnect
CS CS
Secondary Log Log Log Log Primary
CF CF
Shared Storage Access
Shared Database
4 © 2009 IBM Corporation
Information Management
Understanding Start and Stop process of the CF’s
Can be started by global db2start or individual db2start
– Cluster manager starts primary role (also known as the PRIMARY state)
on preferred primary CF host
– Other CF is started as in the secondary role (also known as PEER state)
If one CF has been up and running in PRIMARY state, and the
secondary CF is started up after-wards
– Then the second CF will undergo a phase known as CATCHUP
• ensures the secondary CF has a copy of all pertinent information from the
PRIMARY CF in its (the secondary CF's) memory before it will transition into
PEER state.
– Members resume duplexing
– Members asynchronously send lock information and other states to
secondary
– Members asynchronously cast out pages from primary to disk
DB2 members needs to be stopped before stopping primary role
– If all members are stopped and it is the last CF to be stopped, the primary role is
stopped and the CF is stopped
5 © 2009 IBM Corporation
Information Management
Different CF States
■ CF states
– STOPPED - CF has been manually stopped using the
db2stop command
– RESTARTING - CF is in the process of starting, either
from the db2start command, or after a CF failure.
– BECOMING PRIMARY – CF attempts to take on the
role of the primary CF
– PRIMARY – CF is operating as the primary CF
– CATCHUP (% n) – Secondary CF is retrieving
information from primary
– PEER – CF is the secondary
– ERROR – An error occurred on the CF
6 © 2009 IBM Corporation
Information Management
CF Fail over and Duplexing
The fail-over of the PRIMARY role from one CF to a
second CF will only occur if the second CF is in PEER
state
–All other cases will result in a GROUP RESTART
Duplexing
–While the CFs are in PRIMARY and PEER state, the process for
ensuring both CFs are still in sync with each other is called
"DUPLEXING".
–Duplexing ensures that a copy of the important details of the
structures, and data contained in the structure is available in the
second CF.
–Once the second CF reaches PEER state, all structures requiring
duplexing will continually be duplexed.
7 © 2009 IBM Corporation
Information Management
Agenda
Overview
What makes up a CF
CF connection pool
Configuration parameters
Multiple Host Channel adapter support
8 © 2009 IBM Corporation
Information Management
Parts of a CF
The Group Buffer Pool (GBP) Primary copy
Completely duplexed
The Global Lock Manager (GLM) Partially duplexed
A shared communication area (SCA)
Metadata for the shared data instance
Cache Lock List Cache Lock List
Smart Smart
GBP Locks Arrays GBP Locks Arrays
SCA SCA
Primary CF Secondary CF
9 © 2009 IBM Corporation
Information Management
Locking Concepts in pureScale
Local Lock Manager (LLM)
– A component that runs on each member and manages the lock requests
made by the applications running on that member
– Maintains a list of locks held by each transaction/application
Global Lock Manager (GLM)
– A component that runs on the CF and coordinates the lock requests made by
the LLM running on each member
– Maintains a list of locks held by each member
– Only retained locks are duplexed on the secondary CF
When a lock is needed on a member that it doesn’t already hold,
the LLM coordinates with the GLM to get it
LLM and GLM communicate via requests and lock negotiation
– Set Lock State (SLS) – lock request from LLM to GLM
– Set Lock State Multiple (SLSN) – multiple lock requests batched in one
message from LLM to GLM
10 © 2009 IBM Corporation
Information Management
Cache Structure – Group Buffer Pool or GBP
Consists of pages, page registrations and meta-data
Acts as a store-in cache
Pages in the GBP may be dirty or clean.
Pages are written using WAR – Write and Register, this registers the
member’s interest in the page.
Pages are read using RAR – Read and Register. This returns the
page and registers the member’s interest in the page.
When a page is updated in the GBP all copies in “interested” local
buffer pools of members (other than the member that issued the WAR)
are invalidated by the CF using RDMA (Cross Invalidation).
Pages are flushed out of the Primary GBP by DB2 member nodes
using a castout operation.
11 © 2009 IBM Corporation
Information Management
Cross Invalidation (XI)
Every buffer pool page is associated with a page
descriptor (bpd)
The bpd we will add a new byte to represent the
validity of the buffer pool page
Interconnect between the CA and DB2 members
supports RDMA (e.g. Infiniband)
Writing a new version of page to GBP will cause
the CA to send Cross Invalidation (XI) to all db2
members who have registered interest of the
page, informing db2 members that the version of
page in their local buffer pools is out of date.
12 © 2009 IBM Corporation
Information Management
Write and Register (WAR)
Write protocol SD states the EDU that writes to the GBP-dependent must write
through the WAR interface then to disk.
Write And Register Multiple ( WARM ) api writes multiple pages to the CA
Any Pages that are WAR’ed and are not registered to the CF will cause the
member’s interest to register it.
– the transaction logs for any update must be forced to disk before any page is written to the GBP
(WAL protocol)
– This other members from seeing stale versions of the page on the same GBP
Any pages that are modified on the GBP through a WAR(M), the CA will send XIs
(cross-invalidations) to the members that had an interest in the page
WAR(M) is supplied in registered memory
WAR request may require memory resources in the GBP
EDU doing the WAR is required to have the page latch in at least U mode
Latch State Compatibility Table
S U X
S YES YES NO
U YES NO NO
X NO NO NO
13 © 2009 IBM Corporation
Information Management
Read and Register (RAR)
Page coherency protocol in SD requires that when a DB2 member
wants to bring in a GBP dependent page to local buffer pool
– Including refreshing a page marked as invalid (XIed),
– The DB2 member must register interest in the page with the GBP
– The member does so by sending a Read and Register (RAR) request.
The Read And Register Multiple (RARM) API allows the caller to
batch multiple RAR requests together into one CA call.
The EDU performing the RAR must hold the page latch in X mode
For example (assuming uDAPL is used):
– An agent A in member M1 tries to fix a page in concurrent mode and the page is not in member M1's local
buffer pool.
– Agent A gets the page latch in X mode, and then performs RAR. The page is not in the GBP either.
– Agent A is swapped out
– An agent B in member M2 tries to fix the page in X mode, the page is not in member M2's local buffer pool
– Agent B issues a RAR request and gets "data not found" back from the CA
– Agent B reads the page from disk, updates the page, then writes the new version to the GBP.
– The page is written from the GBP to disk (See LI200077 - Castout)
14 © 2009 IBM Corporation
Information Management
External APIs for RAR and WAR
RAR/WAR/WARM
– SAL_ReadAndRegisterPage
– SAL_DeregisterPage
– SAL_WriteAndRegister
– SAL_WriteAndRegisterMultiple
– SAL_ReadAndRegister
– SAL_ReadAndRegisterMultiple
15 © 2009 IBM Corporation
Information Management
Castout Operation
Pages are flushed out of the GBP by DB2 member nodes
using a castout operation
Castout entails getting a castout lock by a DB2 member
node
– writing the page to disk
– removing the page from the castout queue and then releasing the
castout lock
Once the castout lock is released, the minBuffGBP is
updated by the CA if necessary
– maintain the oldest dirty page currently in the GBP.
Castout
– SAL_ReadForCastoutMult
– SAL_ReleaseCastOutLocks
16 © 2009 IBM Corporation
Information Management
Content of GBP
GBP
– a simple global page store
– keeping the most recently updated version
of as many pages as it can
– including clean hot pages
Directory Entry
– A page metadata in the GBP
– 1 and only 1 directory entry per page
Data Element
– Page data in the GBP is stored in “data
elements”
– Each data element is 4k bytes
– 0 or more Data Elements per page
– a 32k page will use 8 data elements
– a clean page’s data may not be in GBP
17 © 2009 IBM Corporation
Information Management
A primer on two-level page buffering in pureScale
• The local bufferpool (LBP) at each member caches both read-only and
updated pages for that member
• The shared group bufferpool (GBP) at the CF contains references to every
page in all LBPs across the cluster
– References ensure consistency across members – who’s interested in which
pages, in case the pages are updated
• The GBP also contains copies of all updated pages from the LBPs
– Sent from the member at transaction commit time, etc.
– Stored in the GBP & available to other members on demand
– Saves going to disk!
– 30 µs page read request over Infiniband from the GBP can be more than 100x
faster than reading from disk
• Statistics are kept for tuning
– Pages found in LBP vs. found in GBP vs. read from disk
– Useful in tuning GBP / LBP sizes
18 © 2009 IBM Corporation
Information Management
pureScale bufferpool monitoring
• GBP hit ratio
(pool_data_gbp_l_reads – pool_data_gbp_p_reads) /
pool_data_gbp_l_reads
– A hit here is a read of a previously modified page, so hit ratios are
typically quite low
• An overall (LBP+GBP) H/R in the high 90's can correspond to a GBP
H/R in the low 80's
Decreases with greater portion of read activity
• Why? Less dependency on the GBP
19 © 2009 IBM Corporation
Information Management
Lock Structure
There are two lock structures in the CF
– Regular Lock Structure – all locks are here with exception of the GCL.
– Global Consistency Lock or GCL – stores just 1 lock, used to serialize
database activation/deactivation on different members (protects GCR, and on
deactivation final castout and marking of db clean).
Changes to lock states are done using the Set Lock State API (SLS).
– If the lock is not available, the member will be asynchronously notified when
the lock is available through a lock notification
• The member uses the Get Notification API to wait for lock notifications in a
dedicated thread
Registration of lock states in the secondary CF are done through the
Record Lock State API (RLS).
This structure is the only one where duplexing is not fully managed by
SAL: the local lock manager in some cases is responsible for invoking
the SAL API to register a lock state in the secondary CF (e.g. after a
lock notification is received from the primary CF).
20 © 2009 IBM Corporation
Information Management
Client A :
The Role of the GLM Client B :
Select from T1
Client C :
where C2=Y
Select from T1
Update T1 set C1=X
where C2=Y where C2=Y
Grants locks to members Commit
upon request Member 1 Member 2
– If not already held by another
member or held in a compatible mode
Maintains global lock state Page
Page
LSNLSNis is
old,
– What member has what lock recent,
row lock
rownot
lock
and in what mode needed
– Also maintains interest list of
pending requests for each lock
X
Lo
c
k
Grants pending lock requests
Re
q
when available q
Wr
Re
te
Loc
– Via asynchronous notification
da
i te
kR ge k
elea Pa oc
ali
Pa
se L
In v
ad S
ge
Re
nt ”
Notes
ile
“S
– When a member owns a lock,
it may grant further, locally
– “Lock Avoidance”: DB2 avoids lock requests GBP GLM SCA
when log sequence number in page header
indicates no update on the page could R32
be uncommitted R33
Page R33
M1-X
M2-S
Registry
M1 M2 R34
21 © 2009 IBM Corporation
Information Management
List Structure - Smart Arrays
The list structure is used to implement (multiple) smart arrays.
A smart array is a CF provided array of values such that each
member node indexes its own element in the array.
Member nodes update their element in the smart array and the
smart array implicitly provides synchronization services for this.
The smart array is smart in that it provides for advanced
operations such as computing the global array minimum or
maximum.
Examples of smart arrays are:
– readLSN: the oldest access to a particular table by that member
– minBuffLSN: minimum LSN value over all dirty pages in the local
buffer pool of the member.
22 © 2009 IBM Corporation
Information Management
Cache Structure – Shared Communications Area or SCA
The SCA is a read mostly area with few updates.
Contains SCA pages and meta-data associated with those
pages.
Used to store control blocks (e.g., table control blocks).
Uses WAR to update those pages and RAR to read those
pages.
Page registration is not used.
The higher level components get access to it through the
GSS (Global Synchronization Services)
GSS provides both concurrency protection (through validity
lotch) and storage (SCA).
23 © 2009 IBM Corporation
Information Management
Shared Communication Area / Global Shared Structure
Problem:
– In pureScale, control blocks that have database-wide scope
need to be maintained in an area that is consistently
viewed/accessed by all members
– Example: the list of globally-active event monitors in the cluster
Solution:
– Shared Communication Area (SCA): cluster-wide global
shared memory in CF server
– Persists until database is dropped or the CF server is stopped,
unless explicitly freed
– Size is controlled by new CF_SCA_SZ database cfg param
24 24 © 2009 IBM Corporation
Information Management
Agenda
Overview
What makes up a CF
CF connection pool
Configuration parameters
Multiple Host Channel adapter support
25 © 2009 IBM Corporation
Information Management
What is the CF Connection Pool
A group of memory areas in the DB2 member
– each containing a handle to the CF and other data.
Each connection utilizes a command connection to a single CF
from a single DB2 member.
Used by the SAL layer to provide CF services to DB2 members.
– Group Buffer Pool services
– Global Locking services
– Shared Communication Area ( SCA ) services
Connections are shared between EDUs.
An initial number is established at DB2 Start time and others are
created during runtime based on demand.
Has instance scope (shared by multiple databases).
The SAL Layer abstracts the multiple CFs aspect and the arrival
or departure of CFs to the members.
Does not include notify or management CF connections
26 © 2009 IBM Corporation
Information Management
What is the CF Connection Pool?
Simplified Diagram:
27 © 2009 IBM Corporation
Information Management
Agenda
Overview
What makes up a CF
CF connection pool
Configuration parameters
Multiple Host Channel adapter support
28 © 2009 IBM Corporation
Information Management
From CF config to DBM & DB config
29 © 2009 IBM Corporation
Information Management
The new Configuration parameters related to CF
30 © 2009 IBM Corporation
Information Management
Agenda
Overview
What makes up a CF
CF connection pool
Configuration parameters
Multiple Host Channel adapter support
31 © 2009 IBM Corporation
Information Management
Multiple HCAs for CFs (fp4)
Example
– Updating a CF to use an additional cluster interconnect network adapter ports
on an InfiniBand network.
– Before updating the CF, [Link] contains:
• 0 memberhost0 0 memberhost0-ib0
• 128 cfhost0 0 cfhost0-ib0
Note: Do not modify [Link] directly.
Running the following command:
– db2iupdt -update -cf cfhost0:cfhost0-ib0,cfhost0-ib1,cfhost0-ib2,cfhost0-ib3
The [Link] now contains:
– 0 memberhost0 0 memberhost0-ib0
– 128 cfhost0 0 cfhost0-ib0,cfhost0-ib1,cfhost0-ib2,cfhost0-ib3
What to do next
– Repeat the same procedure on the secondary CF.
– Consider adding a second switch and redistributing members between the
switches or connect additional members to them.
32 © 2009 IBM Corporation
Information Management
Multiple HCAs for CFs
Improves interconnect bandwidth and availability of the CFs
– Members still only support a single HCA port
Applicable to both AIX and SLES Linux with IB
– Not supported with RHEL or with 10Gb Ethernet on SLES
CFs can be configured to use up to four HCA ports
– Two ports on two cards or one port on four cards
– Utilizes adapter port load-balancing, distributing connections to the CF
between the multiple ports
A CF with more than one HCA port must be on its own physical host(or
LPAR)
– Isolating each CF is strongly suggested regardless
All adapter ports are monitored and the host will be deemed healthy as
long as one healthy port is available
Use db2pd –cfpools to monitor connections and display HCA port-
mapping information
33 © 2009 IBM Corporation
Information Management
Multiple InfiniBand Switches
Cluster can be configured with two InfiniBand switches for redundancy and improved network resiliency
Half of the HCA ports on a CF are cabled to each of the switches
Half of the members are cabled to one switch, half to the other
Switches must be connected to each other by two or more inter-switch links
If one switch fails
– Both CFs will remain online and available
– Those members cabled to the switch will go offline
• Left with 50% member capacity remaining
34 © 2009 IBM Corporation
Information Management
Example Multi-HCA Configurations
35 © 2009 IBM Corporation
Information Management
Multi-HCA Failure Example #1 (Single HCA Adapter Offline)
One HCA adapter offline with second adapter available and healthy on CF
Host itself is considered healthy because one of the adapters is still available
– Host remains active and CF remains running
– An alert has been raised for the host stating that there is a problem
$ db2instance –list
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
-- ---- ----- --------- ------------ ----- ---------------- ------------ -------
0 MEMBER STARTED coralpib77 coralpib77 NO 0 0 coralpib77-ib0
1 MEMBER STARTED coralpib78 coralpib78 NO 0 0 coralpib78-ib0
128 CF PRIMARY coralpib62 coralpib62 NO - 0 coralpib62-ib1, coralpib62-ib0
129 CF CATCHUP coralpib61 coralpib61 NO - 0 coralpib61-ib1, coralpib61-ib0
HOSTNAME STATE INSTANCE_STOPPED ALERT
-------- ----- ---------------- -----
coralpib61 ACTIVE NO NO
coralpib62 ACTIVE NO YES
coralpib77 ACTIVE NO NO
coralpib78 ACTIVE NO NO
There is currently an alert for a member, CF, or host in the data-sharing instance. For more information on the
alert, its impact, and how to clear it, run the following command: 'db2cluster -cm -list -alert'.
$ db2cluster -cm -list -alert
Alert: The host is not responding on the specified network adapter. This impacts members and cluster caching
facilities communicating on this network adapter. Host: ‘coralpib62'. Network adapter: ‘coralpib62-ib0'.
...
36 © 2009 IBM Corporation
Information Management
Multi-HCA Failure Example #2 (All HCAs for CF Offline)
No adapters are available so host is not considered healthy
– CF goes into “ERROR” state (other CF becomes primary if not already so)
– Alerts gets raised for the both the CF and the host
$ db2instance –list
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
-- ---- ----- --------- ------------ ----- ---------------- ------------ -------
0 MEMBER STARTED coralpib77 coralpib77 NO 0 0 coralpib77-ib0
1 MEMBER STARTED coralpib78 coralpib78 NO 0 0 coralpib78-ib0
128 CF ERROR coralpib62 coralpib62 YES - 0 coralpib62-ib1, coralpib62-ib0
129 CF PRIMARY coralpib61 coralpib61 NO - 0 coralpib61-ib1, coralpib61-ib0
HOSTNAME STATE INSTANCE_STOPPED ALERT
-------- ----- ---------------- -----
coralpib61 ACTIVE NO NO
coralpib62 ACTIVE NO YES
coralpib77 ACTIVE NO NO
coralpib78 ACTIVE NO NO
There is currently an alert for a member, CF, or host in the data-sharing instance. For more information on the
alert, its impact, and how to clear it, run the following command: 'db2cluster -cm -list -alert'.
$ db2cluster -cm -list –alert
Alert: The host is not responding on the specified network adapter. This impacts members and cluster caching
facilities communicating on this network adapter. Host: ‘coralpib62'. Network adapter: ‘coralpib62-ib0'.
...
Alert: The host is not responding on the specified network adapter. This impacts members and cluster caching
facilities communicating on this network adapter. Host: ‘coralpib62'. Network adapter: ‘coralpib62-ib1'.
...
Alert: The cluster caching facility (CF) failed to start on the host. Host: ‘coralpib62'. CF: ‘128'.
...
37 © 2009 IBM Corporation
Information Management
Sources
[Link]
[Link]
_no_backup_material.ppt
DB2 New Development : Alan Y Lee
STSM, DB2 pureScale and Optim Data Studio Performance : Steve Rees
DB2 Level 2 Service : John Gera
[Link]
page/Purescale
[Link]
%2DBancoDoBrasil%2DpureScalePresentations/Keliys%2DBancoDoBrasil%[Link]
Kelly Schlamb
Executive IT Specialist, Worldwide Information Management Technical Sales
IBM Canada Ltd.
The IBM DB2 pureScale Clustered Database Solution: Part 1
-See more at: [Link]
#[Link]
-Dev wiki: [Link]
38 © 2009 IBM Corporation