Postgresql Wal

unde & redo

undo – 回滚;

redo – 前滚恢复: 发生崩溃时, 进行数据恢复, 通过已存在的预写式日志(WAL)来重新恢复数据库.

Wal

预写式日志(WAL)是保证数据完整性的一种标准方法。

​ WAL的中心概念是数据文件(存储表 和 索引) 的修改[select操作并不会被记录]必须在这些动作被日志记录之后才会被写入, 即在描述这些改变的日志记录被刷到持久存储以后。

​ 因此我们不需要在每个事务提交时刷写数据页面到磁盘,因为我们知道发生崩溃时可以通过日志来恢复数据库。任何还没有被应用到数据页面的改变可以根据其日志记录进行重做。即前滚恢复。

如果数据库在异步提交和事务WAL记录写入之间的风险窗口期间崩溃,在该事务期间所作的修改将丢失。

后台进程(“WAL写进程”)每wal_writer_delay毫秒就会把未写入的wal记录刷写到磁盘。

风险窗口实际的最大持续时间是 wal_writer_delay的3倍,因为wal写进程被设计成倾向于在忙时一次写入所有页面。

commit_delay Sets the delay in microseconds between transaction commit and flushing WAL to disk.;

wal段: 文件命令组成:

1
2
3
4
#define XLogFilePath(path, tli, logSegNo)	\
snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X", tli, \
(uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
(uint32) ((logSegNo) % XLogSegmentsPerXLogId))

组成共 3 * 8 => 24位;

分别由 ThisTimeLineID, logSegNo / XLogSegmentsPerXLogId, logSegNo % XLogSegmentsPerXLogId.

构成;

时间线: 从1开始,接收到ArchiveRecoveryRequested [.conf 存在归档恢复] 时, findNewestTimeline + 1;

logSegNo:

1
static XLogSegNo openLogSegNo = 0;

时间线

​ 1

日志文件标号:

​ 5EA

日志文件段标号:

​ 9

1
printf %d 0x95FA20   ===>  9828896

为什么用6位16进制.

16^^6 => 16777216 => 16 * 1024 * 1024 => 16M;

Wal相关函数

  • pg_current_wal_lsn()
  • pg_current_wal_insert_lsn() 显示当前预写日志插入位置.
  • pg_current_wal_flush_lsn() 显示当前预写日志刷新位置;
  • pg_wal_lsn_diff() 以字节数计算两个预写日志位置之间的差别..
    • select pg_wal_lsn_diff ( pg_current_wal_insert_lsn(), pg_current_wal_lsn());
  • pg_wal_replay_pause() 立即暂停恢复. (仅限于超级用户..)
  • pg_wal_replay_resume() 如果恢复被暂停, 重启之..

pg_rman redo; -> blockrecover;

对照相应的lsn 查找对应的wal file; 按照其偏移 -

1
2
3
4
5
6
7
targetPgePtr = LSN - ( LSN % XLOG_BLCKSZ);
targetRecoff = LSN % XLOG_BLCKSZ;


readOff = ReadPageINternal_rman(state,
targetPagePtr,
Min(targetRecOff + SizeOfXlogRecord, XLOG_BLCKSZ))

Wal file 结构

img

内存布局:

  1. XLogLongPageHeaderData – 40byte

  2. prev xlog record – 16byte [First Page]

  3. Xlog Record – 24byte

  4. Xlog Record Block Header – 4byte 【0-N】

    0..N个XLogRecordBlockHeader,每个XLogRecordBlockHeader对应一个block data;
    注意:如设置了BKPBLOCK_HAS_IMAGE标记,则在XLogRecordBlockHeader结构体后跟XLogRecordBlockImageHeader结构体;

    如设置了BKPIMAGE_HAS_HOLE和 BKPIMAGE_IS_COMPRESSED, 则在XLogRecordBlockImageHeader后跟XLogRecordBlockCompressHeader结构体;

  5. XLogRecordDataHeader[Short | Long ]

    1
    2
    3
    4
    5
    typedef struct XLogRecordDataHeaderShort
    {
    uint8 id; /* XLR_BLOCK_ID_DATA_SHORT */
    uint8 data_length; /* number of payload bytes */
    } XLogRecordDataHeaderShort;

    uint8 data_length 是指main data的大小. 3字节的结构.

  6. block Data

    1. xl_heap_header

    2. Tuple data

      length = [ XLOG Record’s len ] - [ sizeof(xlog record block header) ] - [ sizeof(xlog record data header[short | long]) ] - [ xl_heap_header ] - [ main_data ]

  7. main Data

WAL 按照PageSize [8192] 进行排列.

第一个Page才会使用此struct

1
2
3
4
5
6
7
typedef struct XLogLongPageHeaderData   ==> 40byte
{
XLogPageHeaderData std; /* standard header fields */ 24byte
uint64 xlp_sysid; /* system identifier from pg_control */ 8byte
uint32 xlp_seg_size; /* just as a cross-check */ 4byte
uint32 xlp_xlog_blcksz; /* just as a cross-check */ 4byte
} XLogLongPageHeaderData;

其余Page使用…

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
typedef struct XLogPageHeaderData  ==> 24byte
{
uint16 xlp_magic; /* magic value for correctness checks */ 2byte
uint16 xlp_info; /* flag bits, see below */ 2byte
TimeLineID xlp_tli; /* TimeLineID of first record on page */ 4byte
XLogRecPtr xlp_pageaddr; /* XLOG address of this page */ 8byte

/*
* When there is not enough space on current page for whole record, we
* continue on the next page. xlp_rem_len is the number of bytes
* remaining from a previous page.
*
* Note that xl_rem_len includes backup-block data; that is, it tracks
* xl_tot_len not xl_len in the initial header. Also note that the
* continuation data isn't necessarily aligned.
*/
uint32 xlp_rem_len; /* total len of remaining data for record */ 4byte
} XLogPageHeaderData;

记录每个Record:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
/*
* The overall layout of an XLOG record is:
* Fixed-size header (XLogRecord struct)
* XLogRecordBlockHeader struct
* XLogRecordBlockHeader struct
* ...
* XLogRecordDataHeader[Short|Long] struct
* block data
* block data
* ...
* main data
*
* There can be zero or more XLogRecordBlockHeaders, and 0 or more bytes of
* rmgr-specific data not associated with a block. XLogRecord structs
* always start on MAXALIGN boundaries in the WAL files, but the rest of
* the fields are not aligned.
*
* The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
* XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
* used to distinguish between block references, and the main data structs.
*/

XLOG Record按存储的数据内容来划分,大体可以分为三类:

Record for backup block:存储full-write-page的block,这种类型Record是为了解决page部分写的问题。在checkpoint完成后第一次修改数据page,在记录此变更写入事务日志文件时整页写入(需设置相应的初始化参数,默认为打开);
Record for tuple data block:存储page中的tuple变更,使用这种类型的Record记录;
Record for Checkpoint:在checkpoint发生时,在事务日志文件中记录checkpoint信息(其中包括Redo point)。

其中XLOG Record data是存储实际数据的地方,由以下几部分组成:

0..N个XLogRecordBlockHeader,每一个XLogRecordBlockHeader对应一个block data;
XLogRecordDataHeader[Short|Long],如数据大小<256 Bytes,则使用Short格式,否则使用Long格式;
block data:full-write-page data和tuple data。对于full-write-page data,如启用了压缩,则数据压缩存储,压缩后该page相关的元数据存储在XLogRecordBlockCompressHeader中;
main data: /checkpoint等日志数据.

固定Record Header结构: – 记录 prev wal

1
2
3
4
5
6
7
8
9
10
11
12
13
typedef struct XLogRecord   ==> 24byte
{
uint32 xl_tot_len; /* total len of entire record */ 4byte
TransactionId xl_xid; /* xact id */ 4byte
XLogRecPtr xl_prev; /* ptr to previous record in log */ 8byte
uint8 xl_info; /* flag bits, see below */ 1byte
RmgrId xl_rmid; /* resource manager for this record */ 1byte
/* 2 bytes of padding here, initialize to zero */ -- +2 byte
pg_crc32c xl_crc; /* CRC for this record */ 4byte

/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */

} XLogRecord;

实际Record Block Data Header存放: 数据信息头.

1
2
3
4
5
6
7
8
9
10
11
typedef struct XLogRecordBlockHeader
{
uint8 id; /* block reference ID */ 1byte
uint8 fork_flags; /* fork within the relation, and flags */ 1byte
uint16 data_length; /* number of payload bytes (not including page 2byte
* image) */

/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
/* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
/* BlockNumber follows */
} XLogRecordBlockHeader;

Record Data Header ==> XLogRecordDataHeader[Short|Long],如数据大小<256 Bytes,则使用Short格式,否则使用Long格式;

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
/*
* XLogRecordDataHeaderShort/Long are used for the "main data" portion of
* the record. If the length of the data is less than 256 bytes, the short
* form is used, with a single byte to hold the length. Otherwise the long
* form is used.
*
* (These structs are currently not used in the code, they are here just for
* documentation purposes).
*/
typedef struct XLogRecordDataHeaderShort
{
uint8 id; /* XLR_BLOCK_ID_DATA_SHORT */
uint8 data_length; /* number of payload bytes */
} XLogRecordDataHeaderShort;

#define SizeOfXLogRecordDataHeaderShort (sizeof(uint8) * 2)

typedef struct XLogRecordDataHeaderLong
{
uint8 id; /* XLR_BLOCK_ID_DATA_LONG */
/* followed by uint32 data_length, unaligned */
} XLogRecordDataHeaderLong;

#define SizeOfXLogRecordDataHeaderLong (sizeof(uint8) + sizeof(uint32))

access/heapam_xlog.h

1
2
3
4
5
6
typedef struct xl_heap_header
{
uint16 t_infomask2; //2byte
uint16 t_infomask; //2byte
uint8 t_hoff; //1byte
} xl_heap_header;

SQL 标记操作:

img

1
2
3
4
5
6
7
8
/* This is what we need to know about insert */
typedef struct xl_heap_insert
{
OffsetNumber offnum; /* inserted tuple's offset */
uint8 flags;

/* xl_heap_header & TUPLE DATA in backup block 0 */
} xl_heap_insert;

执行INSERT数据为例, 在插入数据时的XLOG Record Data内部结构如下图.

img

[pg_waldump]PostgreSQL自带的wal解析;

连续归档

在任何时候,Postgresql维护在数据库数据目录的子目录pg_wal/【下一个子预写日志】。

这个日志文件记录了任何对数据库数据文件的更改。文件的存在主要目的是用于崩溃安全(crash-safety). 如果数据库崩溃,数据库通过”重播”从最后检查点依赖产生的日志条目可以恢复一致。

日志的存在性使人们可能以第三策略来备份数据库, 结合文件系统级别备份和wal文件备份。恢复文件系统备份,然后从备份WAL文件重新运行, 把系统恢复到当前状态。

  • 不需要一个从出发点完全一致的文件系统备份。备份中任何内部不一致的地方将会通过在日志重做被纠正。
  • 结合无限长的持续的重做WAL文件, 连续备份可以简单的归档,只要继续归档WAL文件即可。
  • 重做最终WAL条目一直到结束是没有必要的。 我们能够阻止任何点的重做, 并有一个一致的数据库快照

设置WAL归档.

正在运行的PostgreSQL系统产生无限长序列的WAL记录. 该系统物理上划分这个序列到多个WAL段文件,通常每个段只有16MB(但是在创建PostgreSQL时, 段的大小可以改变)。

该段文件给出序列数字名称来映射它们在抽象WAL序列上的位置.

当不使用WAL归档,系统就通常只创建几个段文件,然后回收它们,通过重命名不在需要的段文件到更高的段编号。 其内容在最后检查点之前则是没有使用价值和可以循环使用。

pg_start_backup

pg_start_backup 开始为制作基础备份进行准备工作. 恢复过程从重做点开始, 因此pg_start_backup必须执行检查点, 以便在制作基础备份的开始时刻显示创建一个重做点。此次检查点的位置必须保存在非pg_control的其它文件中, 因为在备份期间可能会执行多次常规检查点.

  • 强制进入整页写入模式.
  • 切换到当前的WAL段文件
  • 执行检查点.
  • 创建backup_label
backup_label 文件如下:
1
2
3
4
5
6
START WAL LOCATION: 0/2B000060 (file 00000002000000000000002B)
CHECKPOINT LOCATION: 0/2B000098
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2019-07-30 15:19:55 CST
LABEL: test_pitr_1
  1. 检查点位置 —— 该命令所创建检查点的LSN位置。
  2. WAL开始位置——这不是给PITR用的,而是为流复制准备的。它被命名为START WAL LOCATION,因为复制模式下的备用服务器在初始启动时只读取一次该值。
  3. 备份方法——这是用于进行此基本备份的方法,如pg_start_backup或pg_basebackup。
  4. 备份来源 —— 说明此备份是从主库还是备库拉取。
  5. 开始时间 —— 这是执行pg_start_backup时的时间戳。
  6. 备份标签 —— 这是pg_start_backup中指定的标签。

pg_stop_backup

1
2
3
4
5
6
ligang=# select pg_stop_backup ();
NOTICE: pg_stop_backup complete, all required WAL segments have been archived
pg_stop_backup
----------------
0/2F0001C8
(1 row)
  • pg_start_backup打开整页写入, 关闭整页写入.

  • 写入一条备份结束的xlog记录。

  • 切换WAL段文件

  • 创建一个备份历史记录文件 – 00000002000000000000002B.00000060.backup

    1
    2
    3
    4
    5
    6
    7
    8
    START WAL LOCATION: 0/2B000060 (file 00000002000000000000002B)
    STOP WAL LOCATION: 0/2F0001C8 (file 00000002000000000000002F)
    CHECKPOINT LOCATION: 0/2B000098
    BACKUP METHOD: pg_start_backup
    BACKUP FROM: master
    START TIME: 2019-07-30 15:19:55 CST
    LABEL: test_pitr_1
    STOP TIME: 2019-07-30 15:37:29 CST
  • 删除backup_label文件.

1
2
3
rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 0/2E000060, prev 0/2E000028, desc: CHECKPOINT_ONLINE redo 0/2E000028; tli 2; prev tli 2; fpw true; xid 0:604; oid 24601; multi 1; offset 0; oldest xid 551 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 604; online
rmgr: XLOG len (rec/tot): 34/ 34, tx: 0, lsn: 0/2E0000D0, prev 0/2E000060, desc: BACKUP_END 0/2E000028
rmgr: XLOG len (rec/tot): 24/ 24, tx: 0, lsn: 0/2E0000F8, prev 0/2E0000D0, desc: SWITCH

PITR 时间点恢复(Point In Time Recovery)

PITR 时间点恢复(Point In Time Recovery)与 常规恢复过程区别:

  1. 从哪里读取WAL段文件
    1. 正常恢复模式 – 来自基础目录下pg_wal子目录
    2. PITR模式 – 来自配置参数archive_command的归档目录
  2. 从哪里读取检查点设置.
    1. 正常恢复模式 – 来自pg_control文件
    2. PITR模式 – 来自backup_label
PITR恢复操作:

1.进入解压得到的data目录,删除pg_xlog文件夹,创建pg_xlog/archive_status文件夹:
rm -rf pg_xlog
mkdir -p pg_xlog/archive_status

  1. 从/usr/local/postgres-9.3.5/share目录下,拷贝一份recovery.conf:
    cp /usr/local/postgres-9.3.5/share/recovery.conf.sample recovery.conf
  2. 修改recovery.conf
    vi recovery.conf: 添加: restore_command = ‘cp /home/postgres/archive/%f %p’
PITR流程:
  1. Postgresql 使用内部函数read_backup_label从backup_label文件中读取CHECKPOINT LOCATION值。
  2. PostgreSQL 从recover.conf中读取一些参数值。 restore_command 和 recovery_target_time.
  3. Postgresql 开始从重放点重放WAL数据, 重做点的位置可以简单的从CHECKPOINT LOCATION的值中获得,PostgreSQL执行restore_command配置的命令, 将归档命令从归档区域复制到临时区域, 并从中读取WAL数据, 复制到临时区域的日志文件会在使用后被删除.
  4. 当恢复完成后, 会在pg_wal子目录创建时间线历史文件, 如xxxx.history. 如果启用了日志归档功能, 则还会在归档目录中创建相同的命名文件.

Redo Pointer 验证

1566549837206

1566782984259

1
2
3
4
#define BackupHistoryFilePath(path, tli, logSegNo, offset)	\
snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X.%08X.backup", tli, \
(uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
(uint32) ((logSegNo) % XLogSegmentsPerXLogId), offset)
阅读代码

src/backend/access/transam/xlog.c

1
2
3
static XLogRecord*
ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
bool fetching_ckpt)

src/backend/access/transam/xlogreader.c

1
2
XLogRecord *
XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)

代码获取chpt_start_recptr

src/backend/access/transam/xlog.c

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
ckpt_start_recptr = GetXLogReplayRecPtr(NULL);

/*
* Get latest redo apply position.
*
* Exported to allow WALReceiver to read the pointer directly.
*/
XLogRecPtr
GetXLogReplayRecPtr(TimeLineID *replayTLI)
{
XLogRecPtr recptr;
TimeLineID tli;

SpinLockAcquire(&XLogCtl->info_lck);
recptr = XLogCtl->lastReplayedEndRecPtr;
tli = XLogCtl->lastReplayedTLI;
SpinLockRelease(&XLogCtl->info_lck);

if (replayTLI)
*replayTLI = tli;
return recptr;
}


/*
* GetInsertRecPtr -- Returns the current insert position.
*
* NOTE: The value *actually* returned is the position of the last full
* xlog page. It lags behind the real insert position by at most 1 page.
* For that, we don't need to scan through WAL insertion locks, and an
* approximation is enough for the current usage of this function.
*/
XLogRecPtr
GetInsertRecPtr(void)
{
XLogRecPtr recptr;

SpinLockAcquire(&XLogCtl->info_lck);
recptr = XLogCtl->LogwrtRqst.Write;
SpinLockRelease(&XLogCtl->info_lck);

return recptr;
}

未完成….

补充

lsn是全局唯一值, 用于wal记录中记录总偏移位置. [pg_waldump]

pg_controldata –> 使用 $(PGDATA)/global/pg_control

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
pg_control version number:            1002
Catalog version number: 201707211
Database system identifier: 6721851427625463280
Database cluster state: in production
pg_control last modified: Tue 06 Aug 2019 08:56:48 AM CST
Latest checkpoint location: 0/170F548
Prior checkpoint location: 0/170F270
Latest checkpoint's REDO location: 0/170F548
Latest checkpoint's REDO WAL file: 000000010000000000000001
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:558
Latest checkpoint's NextOID: 13901
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 551
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Tue 06 Aug 2019 08:56:43 AM CST
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: off
max_connections setting: 100
max_worker_processes setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: 9d1637d4a6729f17d02cc91eac480e274342cd29ca69add6ca7b960e94c9c7b2
欣赏此文? 求鼓励,求支持!