Nicksxs's Blog

What hurts more, the pain of hard work or the pain of regret?

0%

因为传说中的出身问题,我以前写的是PHP,在使用 swoole 之前,基本的应用调试手段就是简单粗暴的 var_dump,exit,对于单进程模型的 PHP 也是简单有效,技术栈换成 Java 之后,就变得没那么容易,一方面是需要编译,另一方面是一般都是基于 spring 的项目,如果问题定位比较模糊,那框架层的是很难靠简单的 System.out.println 或者打 log 解决,(PS:我觉得可能我写的东西比较适合从 PHP 这种弱类型语言转到 Java 的小白同学)这个时候一方面因为是 Java,有了非常好用的 idea IDE 的支持,可以各种花式调试,条件断点尤其牛叉,但是又因为有 Spring+Java 的双重原因,有些情况下单步调试可以把手按废掉,这也是我之前一直比较困惑苦逼的点,后来随着慢慢精(jiang)进(you)之后,比如对于一个 oom 的情况,我们可以通过启动参数加上-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=xx/xx 来配置溢出时的堆dump 日志,获取到这个文件后,我们可以通过像 Memory Analyzer (MAT)[https://www.eclipse.org/mat/] (The Eclipse Memory Analyzer is a fast and feature-rich Java heap analyzer that helps you find memory leaks and reduce memory consumption.)来查看诊断问题所在,之前用到的时候是因为有个死循环一直往链表里塞数据,属于比较简单的,后来一次是由于运维进行应用迁移时按默认的统一配置了堆内存大小,导致内存的确不够用,所以溢出了,
今天想说的其实主要是我们的 thread dump,这也是我最近才真正用的一个方法,可能真的很小白了,用过 ide 的单步调试其实都知道会有一个一层层的玩意,比如函数从 A,调用了 B,再从 B 调用了 C,一直往下(因为是 Java,所以还有很多🤦‍♂️),这个其实也是大部分语言的调用模型,利用了栈这个数据结构,通过这个结构我们可以知道代码的调用链路,由于对于一个 spring 应用,在本身框架代码量非常庞大的情况下,外加如果应用代码也是非常多的时候,有时候通过单步调试真的很难短时间定位到问题,需要非常大的耐心和仔细观察,当然不是说完全不行,举个例子当我的应用经常启动需要非常长的时间,因为本身应用有非常多个 bean,比较难说究竟是 bean 的加载的确很慢还是有什么异常原因,这种时候就可以使用 thread dump 了,具体怎么操作呢

如果在idea 中运行或者调试时,可以直接点击这个照相机一样的按钮,右边就会出现了左边会显示所有的线程,右边会显示线程栈,

1
2
3
4
5
6
7
"main@1" prio=5 tid=0x1 nid=NA runnable
java.lang.Thread.State: RUNNABLE
at TreeDistance.treeDist(TreeDistance.java:64)
at TreeDistance.treeDist(TreeDistance.java:65)
at TreeDistance.treeDist(TreeDistance.java:65)
at TreeDistance.treeDist(TreeDistance.java:65)
at TreeDistance.main(TreeDistance.java:45)

这就是我们主线程的堆栈信息了,main 表示这个线程名,prio表示优先级,默认是 5,tid 表示线程 id,nid 表示对应的系统线程,后面的runnable 表示目前线程状态,因为是被我打了断点,所以是就许状态,然后下面就是对应的线程栈内容了,在TreeDistance类的 treeDist方法中,对应的文件行数是 64 行。
这里使用 thread dump一般也不会是上面我截图代码里的这种代码量很少的,一般是大型项目,有时候跑着跑着没反应,又不知道跑到哪了,特别是一些刚接触的大项目或者需要定位一个大项目的一个疑难问题,一时没思路时,可以使用这个方法,个人觉得非常有帮助。

前面说了mysql数据库的事务相关的,那事务是用来干嘛的,这里得补一下ACID,

ACID,是指数据库管理系统DBMS)在写入或更新资料的过程中,为保证事务(transaction)是正确可靠的,所必须具备的四个特性:原子性(atomicity,或称不可分割性)、一致性(consistency)、隔离性(isolation,又称独立性)、持久性(durability)。

  • Atomicity(原子性):一个事务(transaction)中的所有操作,或者全部完成,或者全部不完成,不会结束在中间某个环节。事务在执行过程中发生错误,会被回滚(Rollback)到事务开始前的状态,就像这个事务从来没有执行过一样。即,事务不可分割、不可约简。[1]

  • Consistency(一致性):在事务开始之前和事务结束以后,数据库的完整性没有被破坏。这表示写入的资料必须完全符合所有的预设约束触发器级联回滚等。[1]

  • Isolation(隔离性):数据库允许多个并发事务同时对其数据进行读写和修改的能力,隔离性可以防止多个事务并发执行时由于交叉执行而导致数据的不一致。事务隔离分为不同级别,包括未提交读(Read uncommitted)、提交读(read committed)、可重复读(repeatable read)和串行化(Serializable)。[1]

  • Durability(持久性):事务处理结束后,对数据的修改就是永久的,即便系统故障也不会丢失。[1]

在mysql中,借助于MVCC,各种级别的锁,日志等特性来实现了事务的ACID,但是这个我们通常是对于一个数据库服务的定义,常见的情况下我们的数据库随着业务发展也会从单实例变成多实例,组成主从Master-Slave架构,这个时候其实会有一些问题随之出现,比如说主从同步延迟,假如在业务代码中做了读写分离,对于一些敏感度较低的数据其实问题不是很大,只要主从延迟不到特别夸张的地步一般都是可以忍受的,但是对于一些核心的业务数据,比如订单之类的,不能忍受数据不一致,下了单了,付了款了,一刷订单列表,发现这个订单还没支付,甚至订单都没在,这对于用户来讲是恨不能容忍的错误,那么这里就需要一些措施,要不就不读写分离,要不就在 redis 这类缓存下订单,或者支付后加个延时等,这些都是一些补偿措施,并且这也是一个不太切当的例子,比较合适的例子也可以用这个下单来说,一般在电商平台下单会有挺多要做的事情,比如像下面这个图

下单的是后要冻结核销优惠券,如果账户里有钱要冻结扣除账户里的钱,如果使用了J 豆也一样,可能还有 E 卡,忽略我借用的平台,因为目前一般后台服务化之后,可能每一项都是对应的一个后台服务,我们期望的执行过程是要不全成功,要不就全保持执行前状态,不能是部分扣减核销成功了,部分还不行,所以我们处理这种情况会引入一些通用的方案,第一种叫二阶段提交,

二阶段提交(英语:Two-phase Commit)是指在计算机网络以及数据库领域内,为了使基于分布式系统架构下的所有节点在进行事务提交时保持一致性而设计的一种算法。通常,二阶段提交也被称为是一种协议(Protocol)。在分布式系统中,每个节点虽然可以知晓自己的操作时成功或者失败,却无法知道其他节点的操作的成功或失败。当一个事务跨越多个节点时,为了保持事务的ACID特性,需要引入一个作为协调者的组件来统一掌控所有节点(称作参与者)的操作结果并最终指示这些节点是否要把操作结果进行真正的提交(比如将更新后的数据写入磁盘等等)。因此,二阶段提交的算法思路可以概括为: 参与者将操作成败通知协调者,再由协调者根据所有参与者的反馈情报决定各参与者是否要提交操作还是中止操作。

对于上面的例子,我们将整个过程分成两个阶段,首先是提交请求阶段,这个阶段大概需要做的是确定资源存在,锁定资源,可能还要做好失败后回滚的准备,如果这些都 ok 了那么就响应成功,这里其实用到了一个叫事务的协调者的角色,类似于裁判员,每个节点都反馈第一阶段成功后,开始执行第二阶段,也就是实际执行操作,这里也是需要所有节点都反馈成功后才是执行成功,要不就是失败回滚。其实常用的分布式事务的解决方案主要也是基于此方案的改进,比如后面介绍的三阶段提交,有三阶段提交就是因为二阶段提交比较尴尬的几个点,

  • 第一是对于两阶段提交,其中默认只有协调者有超时时间,当一个参与者进入卡死状态时只能依赖协调者的超时来结束任务,这中间的时间参与者都是锁定着资源
  • 第二是协调者的单点问题,万一挂了,参与者就会在那傻等着

所以三阶段提交引入了各节点的超时机制和一个准备阶段,首先是一个can commit阶段,询问下各个节点有没有资源,能不能进行操作,这个阶段不阻塞,只是提前做个摸底,这个阶段其实人畜无害,但是能提高成功率,在这个阶段如果就有节点反馈是不接受的,那就不用执行下去了,也没有锁资源,然后第二阶段是 pre commit ,这个阶段做的事情跟原来的 第一阶段比较类似,然后是第三阶段do commit,其实三阶段提交我个人觉得只是加了个超时,和准备阶段,好像木有根本性的解决的两阶段提交的问题,后续可以再看看一些论文来思考讨论下。

看完前面两篇水文之后,感觉不得不来分析下 mysql 的锁了,其实前面说到幻读的时候是有个前提没提到的,比如一个select * from table1 where id = 1这种查询语句其实是不会加传说中的锁的,当然这里是指在 RR 或者 RC 隔离级别下,
看一段 mysql官方文档

SELECT ... FROM is a consistent read, reading a snapshot of the database and setting no locks unless the transaction isolation level is set to SERIALIZABLE. For SERIALIZABLE level, the search sets shared next-key locks on the index records it encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row.

纯粹的这种一致性读,实际读取的是快照,也就是基于 read view 的读取方式,除非当前隔离级别是SERIALIZABLE
但是对于以下几类

  • select * from table where ? lock in share mode;
  • select * from table where ? for update;
  • insert into table values (...);
  • update table set ? where ?;
  • delete from table where ?;

除了第一条是 S 锁之外,其他都是 X 排他锁,这边在顺带下,S 锁表示共享锁, X 表示独占锁,同为 S 锁之间不冲突,S 与 X,X 与 S,X 与 X 之间都冲突,也就是加了前者,后者就加不上了
我们知道对于 RC 级别会出现幻读现象,对于 RR 级别不会出现,主要的区别是 RR 级别下对于以上的加锁读取都根据情况加上了 gap 锁,那么是不是 RR 级别下以上所有的都是要加 gap 锁呢,当然不是
举个例子,RR 事务隔离级别下,table1 有个主键id 字段
select * from table1 where id = 10 for update
这条语句要加 gap 锁吗?
答案是不需要,这里其实算是我看了这么久的一点自己的理解,啥时候要加 gap 锁,判断的条件是根据我查询的数据是否会因为不加 gap 锁而出现数量的不一致,我上面这条查询语句,在什么情况下会出现查询结果数量不一致呢,只要在这条记录被更新或者删除的时候,有没有可能我第一次查出来一条,第二次变成两条了呢,不可能,因为是主键索引。
再变更下这个题的条件,当 id 不是主键,但是是唯一索引,这样需要怎么加锁,注意问题是怎么加锁,不是需不需要加 gap 锁,这里呢就是稍微延伸一下,把聚簇索引(主键索引)和二级索引带一下,当 id 不是主键,说明是个二级索引,但是它是唯一索引,体会下,首先对于 id = 10这个二级索引肯定要加锁,要不要锁 gap 呢,不用,因为是唯一索引,id = 10 只可能有这一条记录,然后呢,这样是不是就好了,还不行,因为啥,因为它是二级索引,对应的主键索引的记录才是真正的数据,万一被更新掉了咋办,所以在 id = 10 对应的主键索引上也需要加上锁(默认都是 record lock行锁),那主键索引上要不要加 gap 呢,也不用,也是精确定位到这一条记录
最后呢,当 id 不是主键,也不是唯一索引,只是个普通的索引,这里就需要大名鼎鼎的 gap 锁了,
是时候画个图了

其实核心的目的还是不让这个 id=10 的记录不会出现幻读,那么就需要在 id 这个索引上加上三个 gap 锁,主键索引上就不用了,在 id 索引上已经控制住了id = 10 不会出现幻读,主键索引上这两条对应的记录已经锁了,所以就这样 OK 了

上一篇聊了mysql 的 innodb 引擎基于 read view 实现的 mvcc 和事务隔离级别,可能有些细心的小伙伴会发现一些问题,第一个是在 RC 级别下的事务提交后的可见性,这里涉及到了三个参数,m_low_limit_id,m_up_limit_id,m_ids,之前看到知乎的一篇写的非常不错的文章,但是就在这一点上似乎有点疑惑,这里基于源码和注释来解释下这个问题

1
2
3
4
5
6
7
8
9
10
11
/**
Opens a read view where exactly the transactions serialized before this
point in time are seen in the view.
@param id Creator transaction id */

void ReadView::prepare(trx_id_t id) {
ut_ad(mutex_own(&trx_sys->mutex));

m_creator_trx_id = id;

m_low_limit_no = m_low_limit_id = m_up_limit_id = trx_sys->max_trx_id;

m_low_limit_id赋的值是trx_sys->max_trx_id,代表的是当前系统最小的未分配的事务 id,所以呢,举个例子,当前有三个活跃事务,事务 id 分别是 100,200,300,而 m_up_limit_id = 100, m_low_limit_id = 301,当事务 id 是 200 的提交之后,它的更新就是可以被 100 和 300 看到,而不是说 m_ids 里没了 200,并且 200 比 100 大就应该不可见了

幻读

还有一个问题是幻读的问题,这貌似也是个高频面试题,啥意思呢,或者说跟它最常拿来比较的脏读,脏读是指读到了别的事务未提交的数据,因为未提交,严格意义上来讲,不一定是会被最后落到库里,可能会回滚,也就是在 read uncommitted 级别下会出现的问题,但是幻读不太一样,幻读是指两次查询的结果数量不一样,比如我查了第一次是 select * from table1 where id < 10 for update,查出来了一条结果 id 是 5,然后再查一下发现出来了一条 id 是 5,一条 id 是 7,那是不是有点尴尬了,其实呢这个点我觉得脏读跟幻读也比较是从原理层面来命名,如果第一次接触的同学发觉有点不理解也比较正常,因为从逻辑上讲总之都是数据不符合预期,但是基于源码层面其实是不同的情况,幻读是在原先的 read view 无法完全解决的,怎么解决呢,简单的来说就是锁咯,我们知道innodb 是基于 record lock 行锁的,但是貌似没有办法解决这种问题,那么 innodb 就引入了 gap lock 间隙锁,比如上面说的情况下,id 小于 10 的情况下,是都应该锁住的,gap lock 其实是基于索引结构来锁的,因为索引树除了树形结构之外,还有一个next record 的指针,gap lock 也是基于这个来锁的
看一下 mysql 的文档

SELECT … FOR UPDATE sets an exclusive next-key lock on every record the search encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row.

对于一个 for update 查询,在 RR 级别下,会设置一个 next-key lock在每一条被查询到的记录上,next-lock 又是啥呢,其实就是 gap 锁和 record 锁的结合体,比如我在数据库里有 id 是 1,3,5,7,10,对于上面那条查询,查出来的结果就是 1,3,5,7,那么按照文档里描述的,对于这几条记录都会加上next-key lock,也就是(-∞, 1], (1, 3], (3, 5], (5, 7], (7, 10) 这些区间和记录会被锁起来,不让插入,再唠叨一下呢,就是其实如果是只读的事务,光 read view 一致性读就够了,如果是有写操作的呢,就需要锁了。

很久以前,有位面试官问到,你知道 mysql 的事务隔离级别吗,“额 O__O …,不太清楚”,完了之后我就去网上找相关的文章,找到了这篇MySQL 四种事务隔离级的说明, 文章写得特别好,看了这个就懂了各个事务隔离级别都是啥,不过看了这个之后多思考一下的话还是会发现问题,这么神奇的事务隔离级别是怎么实现的呢

其中 innodb 的事务隔离用到了标题里说到的 mvcc,Multiversion concurrency control, 直译过来就是多版本并发控制,先不讲这个究竟是个啥,考虑下如果纯猜测,这个事务隔离级别应该会是怎么样实现呢,愚钝的我想了下,可以在事务开始的时候拷贝一个表,这个可以支持 RR 级别,RC 级别就不支持了,而且要是个非常大的表,想想就不可行

腆着脸说虽然这个不可行,但是思路是对的,具体实行起来需要做一系列(肥肠多)的改动,首先根据我的理解,其实这个拷贝一个表是变成拷贝一条记录,但是如果有多个事务,那就得拷贝多次,这个问题其实可以借助版本管理系统来解释,在用版本管理系统,git 之类的之前,很原始的可能是开发完一个功能后,就打个压缩包用时间等信息命名,然后如果后面要找回这个就直接用这个压缩包的就行了,后来有了 svn,git 中心式和分布式的版本管理系统,它的一个特点是粒度可以控制到文件和代码行级别,对应的我们的 mysql 事务是不是也可以从一开始预想的表级别细化到行的级别,可能之前很多人都了解过,数据库的一行记录除了我们用户自定义的字段,还有一些额外的字段,去源码data0type.h里捞一下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
/* Precise data types for system columns and the length of those columns;
NOTE: the values must run from 0 up in the order given! All codes must
be less than 256 */
#define DATA_ROW_ID 0 /* row id: a 48-bit integer */
#define DATA_ROW_ID_LEN 6 /* stored length for row id */

/** Transaction id: 6 bytes */
constexpr size_t DATA_TRX_ID = 1;

/** Transaction ID type size in bytes. */
constexpr size_t DATA_TRX_ID_LEN = 6;

/** Rollback data pointer: 7 bytes */
constexpr size_t DATA_ROLL_PTR = 2;

/** Rollback data pointer type size in bytes. */
constexpr size_t DATA_ROLL_PTR_LEN = 7;

一个是 DATA_ROW_ID,这个是在数据没指定主键的时候会生成一个隐藏的,如果用户有指定主键就是主键了

一个是 DATA_TRX_ID,这个表示这条记录的事务 ID

还有一个是 DATA_ROLL_PTR 指向回滚段的指针

指向的回滚段其实就是我们常说的 undo log,这里面的具体结构就是个链表,在 mvcc 里会使用到这个,还有就是这个 DATA_TRX_ID,每条记录都记录了这个事务 ID,表示的是这条记录的当前值是被哪个事务修改的,下面就扯回事务了,我们知道 Read Uncommitted, 其实用不到隔离,直接读取当前值就好了,到了 Read Committed 级别,我们要让事务读取到提交过的值,mysql 使用了一个叫 read view 的玩意,它里面有这些值是我们需要注意的,

m_low_limit_id, 这个是 read view 创建时最大的活跃事务 id

m_up_limit_id, 这个是 read view 创建时最小的活跃事务 id

m_ids, 这个是 read view 创建时所有的活跃事务 id 数组

m_creator_trx_id 这个是当前记录的创建事务 id

判断事务的可见性主要的逻辑是这样,

  1. 当记录的事务 id 小于最小活跃事务 id,说明是可见的,
  2. 如果记录的事务 id 等于当前事务 id,说明是自己的更改,可见
  3. 如果记录的事务 id 大于最大的活跃事务 id, 不可见
  4. 如果记录的事务 id 介于 m_low_limit_idm_up_limit_id 之间,则要判断它是否在 m_ids 中,如果在,不可见,如果不在,表示已提交,可见
    具体的代码捞一下看看
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    /** Check whether the changes by id are visible.
    @param[in] id transaction id to check against the view
    @param[in] name table name
    @return whether the view sees the modifications of id. */
    bool changes_visible(trx_id_t id, const table_name_t &name) const
    MY_ATTRIBUTE((warn_unused_result)) {
    ut_ad(id > 0);

    if (id < m_up_limit_id || id == m_creator_trx_id) {
    return (true);
    }

    check_trx_id_sanity(id, name);

    if (id >= m_low_limit_id) {
    return (false);

    } else if (m_ids.empty()) {
    return (true);
    }

    const ids_t::value_type *p = m_ids.data();

    return (!std::binary_search(p, p + m_ids.size(), id));
    }
    剩下来一点是啥呢,就是 Read CommittedRepeated Read 也不一样,那前面说的 read view 都能支持吗,又是怎么支持呢,假如这个 read view 是在事务一开始就创建,那好像能支持的只是 RR 事务隔离级别,其实呢,这是通过创建 read view的时机,对于 RR 级别,就是在事务的第一个 select 语句是创建,对于 RC 级别,是在每个 select 语句执行前都是创建一次,那样就可以保证能读到所有已提交的数据

LRU

说完了过期策略再说下淘汰策略,redis 使用的策略是近似的 lru 策略,为什么是近似的呢,先来看下什么是 lru,看下 wiki 的介绍
,图中一共有四个槽的存储空间,依次访问顺序是 A B C D E D F,
当第一次访问 D 时刚好占满了坑,并且值是 4,这个值越小代表越先被淘汰,当 E 进来时,看了下已经存在的四个里 A 是最小的,代表是最早存在并且最早被访问的,那就先淘汰它了,E 占领了 A 的位置,并设置值为 4,然后又访问 D 了,D 已经存在了,不过又被访问到了,得更新值为 5,然后是 F 进来了,这时 B 是最老的且最近未被访问,所以就淘汰它了。以上是一个 lru 的简要说明,但是 redis 没有严格按照这个去执行,理由跟前面过期策略一致,最严格的过期策略应该是每个 key 都有对应的定时器,当超时时马上就能清除,但是问题是这样的cpu 消耗太大,所换来的内存效率不太值得,淘汰策略也是这样,类似于上图,要维护所有 key 的一个有序 lru 值,并且遍历将最小的淘汰,redis 采用的是抽样的形式,最初的实现方式是随机从 dict 抽取 5 个 key,淘汰一个 lru 最小的,这样子勉强能达到淘汰的目的,但是效果不是特别好,后面在 redis 3.0开始,将随机抽取改成了维护一个 pool,pool 的大小默认是 16,每次放入的都是按lru 值有序排列好,每一次放入的必须是 lru小于 pool 中最小的 lru 才允许放入,直到放满,后面再有新的就会将大的踢出。
redis 针对这个策略的改进做了一个实验,这里借用下图

首先背景是这图中的所有点都对应一个 redis 的 key,灰色部分加入后被顺序访问过一遍,然后又加入了绿色部分,那么按照理论的 lru 算法,应该是图左上中,浅灰色部分全都被淘汰,那么对比来看看图右上,左下和右下,左下表示 2.8 版本就是随机抽样 5 个 key,淘汰其中 lru 最小的一个,发现是灰色和浅灰色的都有被淘汰的,右下的 3.0 版本抽样数量不变的情况下,稍好一些,当 3.0 版本的抽样数量调整成 10 后,已经较为接近理论上的 lru 策略了,通过代码来简要分析下

1
2
3
4
5
6
7
8
9
typedef struct redisObject {
unsigned type:4;
unsigned encoding:4;
unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
* LFU data (least significant 8 bits frequency
* and most significant 16 bits access time). */
int refcount;
void *ptr;
} robj;

对于 lru 策略来说,lru 字段记录的就是redisObj 的LRU time,
redis 在访问数据时,都会调用lookupKey方法

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
/* Low level key lookup API, not actually called directly from commands
* implementations that should instead rely on lookupKeyRead(),
* lookupKeyWrite() and lookupKeyReadWithFlags(). */
robj *lookupKey(redisDb *db, robj *key, int flags) {
dictEntry *de = dictFind(db->dict,key->ptr);
if (de) {
robj *val = dictGetVal(de);

/* Update the access time for the ageing algorithm.
* Don't do it if we have a saving child, as this will trigger
* a copy on write madness. */
if (!hasActiveChildProcess() && !(flags & LOOKUP_NOTOUCH)){
if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
// 这个是后面一节的内容
updateLFU(val);
} else {
// 对于这个分支,访问时就会去更新 lru 值
val->lru = LRU_CLOCK();
}
}
return val;
} else {
return NULL;
}
}
/* This function is used to obtain the current LRU clock.
* If the current resolution is lower than the frequency we refresh the
* LRU clock (as it should be in production servers) we return the
* precomputed value, otherwise we need to resort to a system call. */
unsigned int LRU_CLOCK(void) {
unsigned int lruclock;
if (1000/server.hz <= LRU_CLOCK_RESOLUTION) {
// 如果服务器的频率server.hz大于 1 时就是用系统预设的 lruclock
lruclock = server.lruclock;
} else {
lruclock = getLRUClock();
}
return lruclock;
}
/* Return the LRU clock, based on the clock resolution. This is a time
* in a reduced-bits format that can be used to set and check the
* object->lru field of redisObject structures. */
unsigned int getLRUClock(void) {
return (mstime()/LRU_CLOCK_RESOLUTION) & LRU_CLOCK_MAX;
}

redis 处理命令是在这里processCommand

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
/* If this function gets called we already read a whole
* command, arguments are in the client argv/argc fields.
* processCommand() execute the command or prepare the
* server for a bulk read from the client.
*
* If C_OK is returned the client is still alive and valid and
* other operations can be performed by the caller. Otherwise
* if C_ERR is returned the client was destroyed (i.e. after QUIT). */
int processCommand(client *c) {
moduleCallCommandFilters(c);



/* Handle the maxmemory directive.
*
* Note that we do not want to reclaim memory if we are here re-entering
* the event loop since there is a busy Lua script running in timeout
* condition, to avoid mixing the propagation of scripts with the
* propagation of DELs due to eviction. */
if (server.maxmemory && !server.lua_timedout) {
int out_of_memory = freeMemoryIfNeededAndSafe() == C_ERR;
/* freeMemoryIfNeeded may flush slave output buffers. This may result
* into a slave, that may be the active client, to be freed. */
if (server.current_client == NULL) return C_ERR;

/* It was impossible to free enough memory, and the command the client
* is trying to execute is denied during OOM conditions or the client
* is in MULTI/EXEC context? Error. */
if (out_of_memory &&
(c->cmd->flags & CMD_DENYOOM ||
(c->flags & CLIENT_MULTI &&
c->cmd->proc != execCommand &&
c->cmd->proc != discardCommand)))
{
flagTransaction(c);
addReply(c, shared.oomerr);
return C_OK;
}
}
}

这里只摘了部分,当需要清理内存时就会调用, 然后调用了freeMemoryIfNeededAndSafe

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
/* This is a wrapper for freeMemoryIfNeeded() that only really calls the
* function if right now there are the conditions to do so safely:
*
* - There must be no script in timeout condition.
* - Nor we are loading data right now.
*
*/
int freeMemoryIfNeededAndSafe(void) {
if (server.lua_timedout || server.loading) return C_OK;
return freeMemoryIfNeeded();
}
/* This function is periodically called to see if there is memory to free
* according to the current "maxmemory" settings. In case we are over the
* memory limit, the function will try to free some memory to return back
* under the limit.
*
* The function returns C_OK if we are under the memory limit or if we
* were over the limit, but the attempt to free memory was successful.
* Otehrwise if we are over the memory limit, but not enough memory
* was freed to return back under the limit, the function returns C_ERR. */
int freeMemoryIfNeeded(void) {
int keys_freed = 0;
/* By default replicas should ignore maxmemory
* and just be masters exact copies. */
if (server.masterhost && server.repl_slave_ignore_maxmemory) return C_OK;

size_t mem_reported, mem_tofree, mem_freed;
mstime_t latency, eviction_latency;
long long delta;
int slaves = listLength(server.slaves);

/* When clients are paused the dataset should be static not just from the
* POV of clients not being able to write, but also from the POV of
* expires and evictions of keys not being performed. */
if (clientsArePaused()) return C_OK;
if (getMaxmemoryState(&mem_reported,NULL,&mem_tofree,NULL) == C_OK)
return C_OK;

mem_freed = 0;

if (server.maxmemory_policy == MAXMEMORY_NO_EVICTION)
goto cant_free; /* We need to free memory, but policy forbids. */

latencyStartMonitor(latency);
while (mem_freed < mem_tofree) {
int j, k, i;
static unsigned int next_db = 0;
sds bestkey = NULL;
int bestdbid;
redisDb *db;
dict *dict;
dictEntry *de;

if (server.maxmemory_policy & (MAXMEMORY_FLAG_LRU|MAXMEMORY_FLAG_LFU) ||
server.maxmemory_policy == MAXMEMORY_VOLATILE_TTL)
{
struct evictionPoolEntry *pool = EvictionPoolLRU;

while(bestkey == NULL) {
unsigned long total_keys = 0, keys;

/* We don't want to make local-db choices when expiring keys,
* so to start populate the eviction pool sampling keys from
* every DB. */
for (i = 0; i < server.dbnum; i++) {
db = server.db+i;
dict = (server.maxmemory_policy & MAXMEMORY_FLAG_ALLKEYS) ?
db->dict : db->expires;
if ((keys = dictSize(dict)) != 0) {
evictionPoolPopulate(i, dict, db->dict, pool);
total_keys += keys;
}
}
if (!total_keys) break; /* No keys to evict. */

/* Go backward from best to worst element to evict. */
for (k = EVPOOL_SIZE-1; k >= 0; k--) {
if (pool[k].key == NULL) continue;
bestdbid = pool[k].dbid;

if (server.maxmemory_policy & MAXMEMORY_FLAG_ALLKEYS) {
de = dictFind(server.db[pool[k].dbid].dict,
pool[k].key);
} else {
de = dictFind(server.db[pool[k].dbid].expires,
pool[k].key);
}

/* Remove the entry from the pool. */
if (pool[k].key != pool[k].cached)
sdsfree(pool[k].key);
pool[k].key = NULL;
pool[k].idle = 0;

/* If the key exists, is our pick. Otherwise it is
* a ghost and we need to try the next element. */
if (de) {
bestkey = dictGetKey(de);
break;
} else {
/* Ghost... Iterate again. */
}
}
}
}

/* volatile-random and allkeys-random policy */
else if (server.maxmemory_policy == MAXMEMORY_ALLKEYS_RANDOM ||
server.maxmemory_policy == MAXMEMORY_VOLATILE_RANDOM)
{
/* When evicting a random key, we try to evict a key for
* each DB, so we use the static 'next_db' variable to
* incrementally visit all DBs. */
for (i = 0; i < server.dbnum; i++) {
j = (++next_db) % server.dbnum;
db = server.db+j;
dict = (server.maxmemory_policy == MAXMEMORY_ALLKEYS_RANDOM) ?
db->dict : db->expires;
if (dictSize(dict) != 0) {
de = dictGetRandomKey(dict);
bestkey = dictGetKey(de);
bestdbid = j;
break;
}
}
}

/* Finally remove the selected key. */
if (bestkey) {
db = server.db+bestdbid;
robj *keyobj = createStringObject(bestkey,sdslen(bestkey));
propagateExpire(db,keyobj,server.lazyfree_lazy_eviction);
/* We compute the amount of memory freed by db*Delete() alone.
* It is possible that actually the memory needed to propagate
* the DEL in AOF and replication link is greater than the one
* we are freeing removing the key, but we can't account for
* that otherwise we would never exit the loop.
*
* AOF and Output buffer memory will be freed eventually so
* we only care about memory used by the key space. */
delta = (long long) zmalloc_used_memory();
latencyStartMonitor(eviction_latency);
if (server.lazyfree_lazy_eviction)
dbAsyncDelete(db,keyobj);
else
dbSyncDelete(db,keyobj);
latencyEndMonitor(eviction_latency);
latencyAddSampleIfNeeded("eviction-del",eviction_latency);
latencyRemoveNestedEvent(latency,eviction_latency);
delta -= (long long) zmalloc_used_memory();
mem_freed += delta;
server.stat_evictedkeys++;
notifyKeyspaceEvent(NOTIFY_EVICTED, "evicted",
keyobj, db->id);
decrRefCount(keyobj);
keys_freed++;

/* When the memory to free starts to be big enough, we may
* start spending so much time here that is impossible to
* deliver data to the slaves fast enough, so we force the
* transmission here inside the loop. */
if (slaves) flushSlavesOutputBuffers();

/* Normally our stop condition is the ability to release
* a fixed, pre-computed amount of memory. However when we
* are deleting objects in another thread, it's better to
* check, from time to time, if we already reached our target
* memory, since the "mem_freed" amount is computed only
* across the dbAsyncDelete() call, while the thread can
* release the memory all the time. */
if (server.lazyfree_lazy_eviction && !(keys_freed % 16)) {
if (getMaxmemoryState(NULL,NULL,NULL,NULL) == C_OK) {
/* Let's satisfy our stop condition. */
mem_freed = mem_tofree;
}
}
} else {
latencyEndMonitor(latency);
latencyAddSampleIfNeeded("eviction-cycle",latency);
goto cant_free; /* nothing to free... */
}
}
latencyEndMonitor(latency);
latencyAddSampleIfNeeded("eviction-cycle",latency);
return C_OK;

cant_free:
/* We are here if we are not able to reclaim memory. There is only one
* last thing we can try: check if the lazyfree thread has jobs in queue
* and wait... */
while(bioPendingJobsOfType(BIO_LAZY_FREE)) {
if (((mem_reported - zmalloc_used_memory()) + mem_freed) >= mem_tofree)
break;
usleep(1000);
}
return C_ERR;
}

这里就是根据具体策略去淘汰 key,首先是要往 pool 更新 key,更新key 的方法是evictionPoolPopulate

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
void evictionPoolPopulate(int dbid, dict *sampledict, dict *keydict, struct evictionPoolEntry *pool) {
int j, k, count;
dictEntry *samples[server.maxmemory_samples];

count = dictGetSomeKeys(sampledict,samples,server.maxmemory_samples);
for (j = 0; j < count; j++) {
unsigned long long idle;
sds key;
robj *o;
dictEntry *de;

de = samples[j];
key = dictGetKey(de);

/* If the dictionary we are sampling from is not the main
* dictionary (but the expires one) we need to lookup the key
* again in the key dictionary to obtain the value object. */
if (server.maxmemory_policy != MAXMEMORY_VOLATILE_TTL) {
if (sampledict != keydict) de = dictFind(keydict, key);
o = dictGetVal(de);
}

/* Calculate the idle time according to the policy. This is called
* idle just because the code initially handled LRU, but is in fact
* just a score where an higher score means better candidate. */
if (server.maxmemory_policy & MAXMEMORY_FLAG_LRU) {
idle = estimateObjectIdleTime(o);
} else if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
/* When we use an LRU policy, we sort the keys by idle time
* so that we expire keys starting from greater idle time.
* However when the policy is an LFU one, we have a frequency
* estimation, and we want to evict keys with lower frequency
* first. So inside the pool we put objects using the inverted
* frequency subtracting the actual frequency to the maximum
* frequency of 255. */
idle = 255-LFUDecrAndReturn(o);
} else if (server.maxmemory_policy == MAXMEMORY_VOLATILE_TTL) {
/* In this case the sooner the expire the better. */
idle = ULLONG_MAX - (long)dictGetVal(de);
} else {
serverPanic("Unknown eviction policy in evictionPoolPopulate()");
}

/* Insert the element inside the pool.
* First, find the first empty bucket or the first populated
* bucket that has an idle time smaller than our idle time. */
k = 0;
while (k < EVPOOL_SIZE &&
pool[k].key &&
pool[k].idle < idle) k++;
if (k == 0 && pool[EVPOOL_SIZE-1].key != NULL) {
/* Can't insert if the element is < the worst element we have
* and there are no empty buckets. */
continue;
} else if (k < EVPOOL_SIZE && pool[k].key == NULL) {
/* Inserting into empty position. No setup needed before insert. */
} else {
/* Inserting in the middle. Now k points to the first element
* greater than the element to insert. */
if (pool[EVPOOL_SIZE-1].key == NULL) {
/* Free space on the right? Insert at k shifting
* all the elements from k to end to the right. */

/* Save SDS before overwriting. */
sds cached = pool[EVPOOL_SIZE-1].cached;
memmove(pool+k+1,pool+k,
sizeof(pool[0])*(EVPOOL_SIZE-k-1));
pool[k].cached = cached;
} else {
/* No free space on right? Insert at k-1 */
k--;
/* Shift all elements on the left of k (included) to the
* left, so we discard the element with smaller idle time. */
sds cached = pool[0].cached; /* Save SDS before overwriting. */
if (pool[0].key != pool[0].cached) sdsfree(pool[0].key);
memmove(pool,pool+1,sizeof(pool[0])*k);
pool[k].cached = cached;
}
}

/* Try to reuse the cached SDS string allocated in the pool entry,
* because allocating and deallocating this object is costly
* (according to the profiler, not my fantasy. Remember:
* premature optimizbla bla bla bla. */
int klen = sdslen(key);
if (klen > EVPOOL_CACHED_SDS_SIZE) {
pool[k].key = sdsdup(key);
} else {
memcpy(pool[k].cached,key,klen+1);
sdssetlen(pool[k].cached,klen);
pool[k].key = pool[k].cached;
}
pool[k].idle = idle;
pool[k].dbid = dbid;
}
}

Redis随机选择maxmemory_samples数量的key,然后计算这些key的空闲时间idle time,当满足条件时(比pool中的某些键的空闲时间还大)就可以进poolpool更新之后,就淘汰pool中空闲时间最大的键。

estimateObjectIdleTime用来计算Redis对象的空闲时间:

1
2
3
4
5
6
7
8
9
10
11
/* Given an object returns the min number of milliseconds the object was never
* requested, using an approximated LRU algorithm. */
unsigned long long estimateObjectIdleTime(robj *o) {
unsigned long long lruclock = LRU_CLOCK();
if (lruclock >= o->lru) {
return (lruclock - o->lru) * LRU_CLOCK_RESOLUTION;
} else {
return (lruclock + (LRU_CLOCK_MAX - o->lru)) *
LRU_CLOCK_RESOLUTION;
}
}

空闲时间第一种是 lurclock 大于对象的 lru,那么就是减一下乘以精度,因为 lruclock 有可能是已经预生成的,所以会可能走下面这个

LFU

上面介绍了LRU 的算法,但是考虑一种场景

1
2
3
4
~~~~~A~~~~~A~~~~~A~~~~A~~~~~A~~~~~A~~|
~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~|
~~~~~~~~~~C~~~~~~~~~C~~~~~~~~~C~~~~~~|
~~~~~D~~~~~~~~~~D~~~~~~~~~D~~~~~~~~~D|

可以发现,当采用 lru 的淘汰策略的时候,D 是最新的,会被认为是最值得保留的,但是事实上还不如 A 跟 B,然后 antirez 大神就想到了LFU (Least Frequently Used) 这个算法, 显然对于上面的四个 key 的访问频率,保留优先级应该是 B > A > C = D
那要怎么来实现这个 LFU 算法呢,其实像LRU,理想的情况就是维护个链表,把最新访问的放到头上去,但是这个会影响访问速度,注意到前面代码的应该可以看到,redisObject 的 lru 字段其实是两用的,当策略是 LFU 时,这个字段就另作他用了,它的 24 位长度被分成两部分

1
2
3
4
      16 bits      8 bits
+----------------+--------+
+ Last decr time | LOG_C |
+----------------+--------+

前16位字段是最后一次递减时间,因此Redis知道 上一次计数器递减,后8位是 计数器 counter。
LFU 的主体策略就是当这个 key 被访问的次数越多频率越高他就越容易被保留下来,并且是最近被访问的频率越高。这其实有两个事情要做,一个是在访问的时候增加计数值,在一定长时间不访问时进行衰减,所以这里用了两个值,前 16 位记录上一次衰减的时间,后 8 位记录具体的计数值。
Redis4.0之后为maxmemory_policy淘汰策略添加了两个LFU模式:

volatile-lfu:对有过期时间的key采用LFU淘汰策略
allkeys-lfu:对全部key采用LFU淘汰策略
还有2个配置可以调整LFU算法:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
lfu-log-factor 10
lfu-decay-time 1
```
`lfu-log-factor` 可以调整计数器counter的增长速度,lfu-log-factor越大,counter增长的越慢。

`lfu-decay-time`是一个以分钟为单位的数值,可以调整counter的减少速度
这里有个问题是 8 位大小够计么,访问一次加 1 的话的确不够,不过大神就是大神,才不会这么简单的加一。往下看代码
```C
/* Low level key lookup API, not actually called directly from commands
* implementations that should instead rely on lookupKeyRead(),
* lookupKeyWrite() and lookupKeyReadWithFlags(). */
robj *lookupKey(redisDb *db, robj *key, int flags) {
dictEntry *de = dictFind(db->dict,key->ptr);
if (de) {
robj *val = dictGetVal(de);

/* Update the access time for the ageing algorithm.
* Don't do it if we have a saving child, as this will trigger
* a copy on write madness. */
if (!hasActiveChildProcess() && !(flags & LOOKUP_NOTOUCH)){
if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
// 当淘汰策略是 LFU 时,就会调用这个updateLFU
updateLFU(val);
} else {
val->lru = LRU_CLOCK();
}
}
return val;
} else {
return NULL;
}
}

updateLFU 这个其实个入口,调用了两个重要的方法

1
2
3
4
5
6
7
8
/* Update LFU when an object is accessed.
* Firstly, decrement the counter if the decrement time is reached.
* Then logarithmically increment the counter, and update the access time. */
void updateLFU(robj *val) {
unsigned long counter = LFUDecrAndReturn(val);
counter = LFULogIncr(counter);
val->lru = (LFUGetTimeInMinutes()<<8) | counter;
}

首先来看看LFUDecrAndReturn,这个方法的作用是根据上一次衰减时间和系统配置的 lfu-decay-time 参数来确定需要将 counter 减去多少

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
/* If the object decrement time is reached decrement the LFU counter but
* do not update LFU fields of the object, we update the access time
* and counter in an explicit way when the object is really accessed.
* And we will times halve the counter according to the times of
* elapsed time than server.lfu_decay_time.
* Return the object frequency counter.
*
* This function is used in order to scan the dataset for the best object
* to fit: as we check for the candidate, we incrementally decrement the
* counter of the scanned objects if needed. */
unsigned long LFUDecrAndReturn(robj *o) {
// 右移 8 位,拿到上次衰减时间
unsigned long ldt = o->lru >> 8;
// 对 255 做与操作,拿到 counter 值
unsigned long counter = o->lru & 255;
// 根据lfu_decay_time来算出过了多少个衰减周期
unsigned long num_periods = server.lfu_decay_time ? LFUTimeElapsed(ldt) / server.lfu_decay_time : 0;
if (num_periods)
counter = (num_periods > counter) ? 0 : counter - num_periods;
return counter;
}

然后是加,调用了LFULogIncr

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
/* Logarithmically increment a counter. The greater is the current counter value
* the less likely is that it gets really implemented. Saturate it at 255. */
uint8_t LFULogIncr(uint8_t counter) {
// 最大值就是 255,到顶了就不加了
if (counter == 255) return 255;
// 生成个随机小数
double r = (double)rand()/RAND_MAX;
// 减去个基础值,LFU_INIT_VAL = 5,防止刚进来就被逐出
double baseval = counter - LFU_INIT_VAL;
// 如果是小于 0,
if (baseval < 0) baseval = 0;
// 如果 baseval 是 0,那么 p 就是 1了,后面 counter 直接加一,如果不是的话,得看系统参数lfu_log_factor,这个越大,除出来的 p 越小,那么 counter++的可能性也越小,这样子就把前面的疑问给解决了,不是直接+1 的
double p = 1.0/(baseval*server.lfu_log_factor+1);
if (r < p) counter++;
return counter;
}

大概的变化速度可以参考

1
2
3
4
5
6
7
8
9
10
11
+--------+------------+------------+------------+------------+------------+
| factor | 100 hits | 1000 hits | 100K hits | 1M hits | 10M hits |
+--------+------------+------------+------------+------------+------------+
| 0 | 104 | 255 | 255 | 255 | 255 |
+--------+------------+------------+------------+------------+------------+
| 1 | 18 | 49 | 255 | 255 | 255 |
+--------+------------+------------+------------+------------+------------+
| 10 | 10 | 18 | 142 | 255 | 255 |
+--------+------------+------------+------------+------------+------------+
| 100 | 8 | 11 | 49 | 143 | 255 |
+--------+------------+------------+------------+------------+------------+

简而言之就是 lfu_log_factor 越大变化的越慢

总结

总结一下,redis 实现了近似的 lru 淘汰策略,通过增加了淘汰 key 的池子(pool),并且增大每次抽样的 key 的数量来将淘汰效果更进一步地接近于 lru,这是 lru 策略,但是对于前面举的一个例子,其实 lru 并不能保证 key 的淘汰就如我们预期,所以在后期又引入了 lfu 的策略,lfu的策略比较巧妙,复用了 redis 对象的 lru 字段,并且使用了factor 参数来控制计数器递增的速度,防止 8 位的计数器太早溢出。

这一篇不再是数据结构介绍了,大致的数据结构基本都介绍了,这一篇主要是查漏补缺,或者说讲一些重要且基本的概念,也可能是经常被忽略的,很多讲 redis 的系列文章可能都会忽略,学习 redis 的时候也会,因为觉得源码学习就是讲主要的数据结构和“算法”学习了就好了。
redis 的主要应用就是拿来作为高性能的缓存,那么缓存一般有些啥需要注意的,首先是访问速度,如果取得跟数据库一样快,那就没什么存在的意义,第二个是缓存的字面意思,我只是为了让数据读取快一些,通常大部分的场景这个是需要更新过期的,这里就把我要讲的第一点引出来了(真累,

redis过期策略

redis 是如何过期缓存的,可以猜测下,最无脑的就是每个设置了过期时间的 key 都设个定时器,过期了就删除,这种显然消耗太大,清理地最及时,还有的就是 redis 正在采用的懒汉清理策略和定期清理
懒汉策略就是在使用的时候去检查缓存是否过期,比如 get 操作时,先判断下这个 key 是否已经过期了,如果过期了就删掉,并且返回空,如果没过期则正常返回
主要代码是

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
/* This function is called when we are going to perform some operation
* in a given key, but such key may be already logically expired even if
* it still exists in the database. The main way this function is called
* is via lookupKey*() family of functions.
*
* The behavior of the function depends on the replication role of the
* instance, because slave instances do not expire keys, they wait
* for DELs from the master for consistency matters. However even
* slaves will try to have a coherent return value for the function,
* so that read commands executed in the slave side will be able to
* behave like if the key is expired even if still present (because the
* master has yet to propagate the DEL).
*
* In masters as a side effect of finding a key which is expired, such
* key will be evicted from the database. Also this may trigger the
* propagation of a DEL/UNLINK command in AOF / replication stream.
*
* The return value of the function is 0 if the key is still valid,
* otherwise the function returns 1 if the key is expired. */
int expireIfNeeded(redisDb *db, robj *key) {
if (!keyIsExpired(db,key)) return 0;

/* If we are running in the context of a slave, instead of
* evicting the expired key from the database, we return ASAP:
* the slave key expiration is controlled by the master that will
* send us synthesized DEL operations for expired keys.
*
* Still we try to return the right information to the caller,
* that is, 0 if we think the key should be still valid, 1 if
* we think the key is expired at this time. */
if (server.masterhost != NULL) return 1;

/* Delete the key */
server.stat_expiredkeys++;
propagateExpire(db,key,server.lazyfree_lazy_expire);
notifyKeyspaceEvent(NOTIFY_EXPIRED,
"expired",key,db->id);
return server.lazyfree_lazy_expire ? dbAsyncDelete(db,key) :
dbSyncDelete(db,key);
}

/* Check if the key is expired. */
int keyIsExpired(redisDb *db, robj *key) {
mstime_t when = getExpire(db,key);
mstime_t now;

if (when < 0) return 0; /* No expire for this key */

/* Don't expire anything while loading. It will be done later. */
if (server.loading) return 0;

/* If we are in the context of a Lua script, we pretend that time is
* blocked to when the Lua script started. This way a key can expire
* only the first time it is accessed and not in the middle of the
* script execution, making propagation to slaves / AOF consistent.
* See issue #1525 on Github for more information. */
if (server.lua_caller) {
now = server.lua_time_start;
}
/* If we are in the middle of a command execution, we still want to use
* a reference time that does not change: in that case we just use the
* cached time, that we update before each call in the call() function.
* This way we avoid that commands such as RPOPLPUSH or similar, that
* may re-open the same key multiple times, can invalidate an already
* open object in a next call, if the next call will see the key expired,
* while the first did not. */
else if (server.fixed_time_expire > 0) {
now = server.mstime;
}
/* For the other cases, we want to use the most fresh time we have. */
else {
now = mstime();
}

/* The key expired if the current (virtual or real) time is greater
* than the expire time of the key. */
return now > when;
}
/* Return the expire time of the specified key, or -1 if no expire
* is associated with this key (i.e. the key is non volatile) */
long long getExpire(redisDb *db, robj *key) {
dictEntry *de;

/* No expire? return ASAP */
if (dictSize(db->expires) == 0 ||
(de = dictFind(db->expires,key->ptr)) == NULL) return -1;

/* The entry was found in the expire dict, this means it should also
* be present in the main dict (safety check). */
serverAssertWithInfo(NULL,key,dictFind(db->dict,key->ptr) != NULL);
return dictGetSignedIntegerVal(de);
}

这里有几点要注意的,第一是当惰性删除时会根据lazyfree_lazy_expire这个参数去判断是执行同步删除还是异步删除,另外一点是对于 slave,是不需要执行的,因为会在 master 过期时向 slave 发送 del 指令。
光采用这个策略会有什么问题呢,假如一些key 一直未被访问,那这些 key 就不会过期了,导致一直被占用着内存,所以 redis 采取了懒汉式过期加定期过期策略,定期策略是怎么执行的呢

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
/* This function handles 'background' operations we are required to do
* incrementally in Redis databases, such as active key expiring, resizing,
* rehashing. */
void databasesCron(void) {
/* Expire keys by random sampling. Not required for slaves
* as master will synthesize DELs for us. */
if (server.active_expire_enabled) {
if (server.masterhost == NULL) {
activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW);
} else {
expireSlaveKeys();
}
}

/* Defrag keys gradually. */
activeDefragCycle();

/* Perform hash tables rehashing if needed, but only if there are no
* other processes saving the DB on disk. Otherwise rehashing is bad
* as will cause a lot of copy-on-write of memory pages. */
if (!hasActiveChildProcess()) {
/* We use global counters so if we stop the computation at a given
* DB we'll be able to start from the successive in the next
* cron loop iteration. */
static unsigned int resize_db = 0;
static unsigned int rehash_db = 0;
int dbs_per_call = CRON_DBS_PER_CALL;
int j;

/* Don't test more DBs than we have. */
if (dbs_per_call > server.dbnum) dbs_per_call = server.dbnum;

/* Resize */
for (j = 0; j < dbs_per_call; j++) {
tryResizeHashTables(resize_db % server.dbnum);
resize_db++;
}

/* Rehash */
if (server.activerehashing) {
for (j = 0; j < dbs_per_call; j++) {
int work_done = incrementallyRehash(rehash_db);
if (work_done) {
/* If the function did some work, stop here, we'll do
* more at the next cron loop. */
break;
} else {
/* If this db didn't need rehash, we'll try the next one. */
rehash_db++;
rehash_db %= server.dbnum;
}
}
}
}
}
/* Try to expire a few timed out keys. The algorithm used is adaptive and
* will use few CPU cycles if there are few expiring keys, otherwise
* it will get more aggressive to avoid that too much memory is used by
* keys that can be removed from the keyspace.
*
* Every expire cycle tests multiple databases: the next call will start
* again from the next db, with the exception of exists for time limit: in that
* case we restart again from the last database we were processing. Anyway
* no more than CRON_DBS_PER_CALL databases are tested at every iteration.
*
* The function can perform more or less work, depending on the "type"
* argument. It can execute a "fast cycle" or a "slow cycle". The slow
* cycle is the main way we collect expired cycles: this happens with
* the "server.hz" frequency (usually 10 hertz).
*
* However the slow cycle can exit for timeout, since it used too much time.
* For this reason the function is also invoked to perform a fast cycle
* at every event loop cycle, in the beforeSleep() function. The fast cycle
* will try to perform less work, but will do it much more often.
*
* The following are the details of the two expire cycles and their stop
* conditions:
*
* If type is ACTIVE_EXPIRE_CYCLE_FAST the function will try to run a
* "fast" expire cycle that takes no longer than EXPIRE_FAST_CYCLE_DURATION
* microseconds, and is not repeated again before the same amount of time.
* The cycle will also refuse to run at all if the latest slow cycle did not
* terminate because of a time limit condition.
*
* If type is ACTIVE_EXPIRE_CYCLE_SLOW, that normal expire cycle is
* executed, where the time limit is a percentage of the REDIS_HZ period
* as specified by the ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC define. In the
* fast cycle, the check of every database is interrupted once the number
* of already expired keys in the database is estimated to be lower than
* a given percentage, in order to avoid doing too much work to gain too
* little memory.
*
* The configured expire "effort" will modify the baseline parameters in
* order to do more work in both the fast and slow expire cycles.
*/

#define ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP 20 /* Keys for each DB loop. */
#define ACTIVE_EXPIRE_CYCLE_FAST_DURATION 1000 /* Microseconds. */
#define ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC 25 /* Max % of CPU to use. */
#define ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE 10 /* % of stale keys after which
we do extra efforts. */
void activeExpireCycle(int type) {
/* Adjust the running parameters according to the configured expire
* effort. The default effort is 1, and the maximum configurable effort
* is 10. */
unsigned long
effort = server.active_expire_effort-1, /* Rescale from 0 to 9. */
config_keys_per_loop = ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP +
ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP/4*effort,
config_cycle_fast_duration = ACTIVE_EXPIRE_CYCLE_FAST_DURATION +
ACTIVE_EXPIRE_CYCLE_FAST_DURATION/4*effort,
config_cycle_slow_time_perc = ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC +
2*effort,
config_cycle_acceptable_stale = ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE-
effort;

/* This function has some global state in order to continue the work
* incrementally across calls. */
static unsigned int current_db = 0; /* Last DB tested. */
static int timelimit_exit = 0; /* Time limit hit in previous call? */
static long long last_fast_cycle = 0; /* When last fast cycle ran. */

int j, iteration = 0;
int dbs_per_call = CRON_DBS_PER_CALL;
long long start = ustime(), timelimit, elapsed;

/* When clients are paused the dataset should be static not just from the
* POV of clients not being able to write, but also from the POV of
* expires and evictions of keys not being performed. */
if (clientsArePaused()) return;

if (type == ACTIVE_EXPIRE_CYCLE_FAST) {
/* Don't start a fast cycle if the previous cycle did not exit
* for time limit, unless the percentage of estimated stale keys is
* too high. Also never repeat a fast cycle for the same period
* as the fast cycle total duration itself. */
if (!timelimit_exit &&
server.stat_expired_stale_perc < config_cycle_acceptable_stale)
return;

if (start < last_fast_cycle + (long long)config_cycle_fast_duration*2)
return;

last_fast_cycle = start;
}

/* We usually should test CRON_DBS_PER_CALL per iteration, with
* two exceptions:
*
* 1) Don't test more DBs than we have.
* 2) If last time we hit the time limit, we want to scan all DBs
* in this iteration, as there is work to do in some DB and we don't want
* expired keys to use memory for too much time. */
if (dbs_per_call > server.dbnum || timelimit_exit)
dbs_per_call = server.dbnum;

/* We can use at max 'config_cycle_slow_time_perc' percentage of CPU
* time per iteration. Since this function gets called with a frequency of
* server.hz times per second, the following is the max amount of
* microseconds we can spend in this function. */
timelimit = config_cycle_slow_time_perc*1000000/server.hz/100;
timelimit_exit = 0;
if (timelimit <= 0) timelimit = 1;

if (type == ACTIVE_EXPIRE_CYCLE_FAST)
timelimit = config_cycle_fast_duration; /* in microseconds. */

/* Accumulate some global stats as we expire keys, to have some idea
* about the number of keys that are already logically expired, but still
* existing inside the database. */
long total_sampled = 0;
long total_expired = 0;

for (j = 0; j < dbs_per_call && timelimit_exit == 0; j++) {
/* Expired and checked in a single loop. */
unsigned long expired, sampled;

redisDb *db = server.db+(current_db % server.dbnum);

/* Increment the DB now so we are sure if we run out of time
* in the current DB we'll restart from the next. This allows to
* distribute the time evenly across DBs. */
current_db++;

/* Continue to expire if at the end of the cycle more than 25%
* of the keys were expired. */
do {
unsigned long num, slots;
long long now, ttl_sum;
int ttl_samples;
iteration++;

/* If there is nothing to expire try next DB ASAP. */
if ((num = dictSize(db->expires)) == 0) {
db->avg_ttl = 0;
break;
}
slots = dictSlots(db->expires);
now = mstime();

/* When there are less than 1% filled slots, sampling the key
* space is expensive, so stop here waiting for better times...
* The dictionary will be resized asap. */
if (num && slots > DICT_HT_INITIAL_SIZE &&
(num*100/slots < 1)) break;

/* The main collection cycle. Sample random keys among keys
* with an expire set, checking for expired ones. */
expired = 0;
sampled = 0;
ttl_sum = 0;
ttl_samples = 0;

if (num > config_keys_per_loop)
num = config_keys_per_loop;

/* Here we access the low level representation of the hash table
* for speed concerns: this makes this code coupled with dict.c,
* but it hardly changed in ten years.
*
* Note that certain places of the hash table may be empty,
* so we want also a stop condition about the number of
* buckets that we scanned. However scanning for free buckets
* is very fast: we are in the cache line scanning a sequential
* array of NULL pointers, so we can scan a lot more buckets
* than keys in the same time. */
long max_buckets = num*20;
long checked_buckets = 0;

while (sampled < num && checked_buckets < max_buckets) {
for (int table = 0; table < 2; table++) {
if (table == 1 && !dictIsRehashing(db->expires)) break;

unsigned long idx = db->expires_cursor;
idx &= db->expires->ht[table].sizemask;
dictEntry *de = db->expires->ht[table].table[idx];
long long ttl;

/* Scan the current bucket of the current table. */
checked_buckets++;
while(de) {
/* Get the next entry now since this entry may get
* deleted. */
dictEntry *e = de;
de = de->next;

ttl = dictGetSignedIntegerVal(e)-now;
if (activeExpireCycleTryExpire(db,e,now)) expired++;
if (ttl > 0) {
/* We want the average TTL of keys yet
* not expired. */
ttl_sum += ttl;
ttl_samples++;
}
sampled++;
}
}
db->expires_cursor++;
}
total_expired += expired;
total_sampled += sampled;

/* Update the average TTL stats for this database. */
if (ttl_samples) {
long long avg_ttl = ttl_sum/ttl_samples;

/* Do a simple running average with a few samples.
* We just use the current estimate with a weight of 2%
* and the previous estimate with a weight of 98%. */
if (db->avg_ttl == 0) db->avg_ttl = avg_ttl;
db->avg_ttl = (db->avg_ttl/50)*49 + (avg_ttl/50);
}

/* We can't block forever here even if there are many keys to
* expire. So after a given amount of milliseconds return to the
* caller waiting for the other active expire cycle. */
if ((iteration & 0xf) == 0) { /* check once every 16 iterations. */
elapsed = ustime()-start;
if (elapsed > timelimit) {
timelimit_exit = 1;
server.stat_expired_time_cap_reached_count++;
break;
}
}
/* We don't repeat the cycle for the current database if there are
* an acceptable amount of stale keys (logically expired but yet
* not reclained). */
} while ((expired*100/sampled) > config_cycle_acceptable_stale);
}

elapsed = ustime()-start;
server.stat_expire_cycle_time_used += elapsed;
latencyAddSampleIfNeeded("expire-cycle",elapsed/1000);

/* Update our estimate of keys existing but yet to be expired.
* Running average with this sample accounting for 5%. */
double current_perc;
if (total_sampled) {
current_perc = (double)total_expired/total_sampled;
} else
current_perc = 0;
server.stat_expired_stale_perc = (current_perc*0.05)+
(server.stat_expired_stale_perc*0.95);
}

执行定期清除分成两种类型,快和慢,分别由beforeSleepdatabasesCron调用,快版有两个限制,一个是执行时长由ACTIVE_EXPIRE_CYCLE_FAST_DURATION限制,另一个是执行间隔是 2 倍的ACTIVE_EXPIRE_CYCLE_FAST_DURATION,另外这还可以由配置的server.active_expire_effort参数来控制,默认是 1,最大是 10

1
2
onfig_cycle_fast_duration = ACTIVE_EXPIRE_CYCLE_FAST_DURATION +
ACTIVE_EXPIRE_CYCLE_FAST_DURATION/4*effort

然后会从一定数量的 db 中找出一定数量的带过期时间的 key(保存在 expires中),这里的数量是由

1
2
3
4
5
6
7
8
config_keys_per_loop = ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP +
ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP/4*effort
```
控制,慢速的执行时长是
```C
config_cycle_slow_time_perc = ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC +
2*effort
timelimit = config_cycle_slow_time_perc*1000000/server.hz/100;

这里还有一个额外的退出条件,如果当前数据库的抽样结果已经达到我们所允许的过期 key 百分比,则下次不再处理当前 db,继续处理下个 db

在Java8的stream之前,将对象进行排序的时候,可能需要对象实现Comparable接口,或者自己实现一个Comparator,

比如这样子

我的对象是Entity

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
public class Entity {

private Long id;

private Long sortValue;

public Long getId() {
return id;
}

public void setId(Long id) {
this.id = id;
}

public Long getSortValue() {
return sortValue;
}

public void setSortValue(Long sortValue) {
this.sortValue = sortValue;
}
}

Comparator

1
2
3
4
5
6
7
8
9
10
11
12
13
14
public class MyComparator implements Comparator {
@Override
public int compare(Object o1, Object o2) {
Entity e1 = (Entity) o1;
Entity e2 = (Entity) o2;
if (e1.getSortValue() < e2.getSortValue()) {
return -1;
} else if (e1.getSortValue().equals(e2.getSortValue())) {
return 0;
} else {
return 1;
}
}
}

比较代码

1
2
3
4
5
6
7
8
9
10
11
12
13
private static MyComparator myComparator = new MyComparator();

public static void main(String[] args) {
List<Entity> list = new ArrayList<Entity>();
Entity e1 = new Entity();
e1.setId(1L);
e1.setSortValue(1L);
list.add(e1);
Entity e2 = new Entity();
e2.setId(2L);
e2.setSortValue(null);
list.add(e2);
Collections.sort(list, myComparator);

看到这里的e2的排序值是null,在Comparator中如果要正常运行的话,就得判空之类的,这里有两点需要,一个是不想写这个MyComparator,然后也没那么好排除掉list里排序值,那么有什么办法能解决这种问题呢,应该说java的这方面真的是很强大

看一下nullsFirst的实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
final static class NullComparator<T> implements Comparator<T>, Serializable {
private static final long serialVersionUID = -7569533591570686392L;
private final boolean nullFirst;
// if null, non-null Ts are considered equal
private final Comparator<T> real;

@SuppressWarnings("unchecked")
NullComparator(boolean nullFirst, Comparator<? super T> real) {
this.nullFirst = nullFirst;
this.real = (Comparator<T>) real;
}

@Override
public int compare(T a, T b) {
if (a == null) {
return (b == null) ? 0 : (nullFirst ? -1 : 1);
} else if (b == null) {
return nullFirst ? 1: -1;
} else {
return (real == null) ? 0 : real.compare(a, b);
}
}

核心代码就是下面这段,其实就是帮我们把前面要做的事情做掉了,是不是挺方便的,小记一下哈

最近做 docker 系列,会经常进到 docker 内部,如上一篇介绍的,这些镜像一般都有用 ubuntu 或者alpine 这样的 Linux 系统作为底包,如果构建镜像的时候没有替换源的话,因为你懂的原因,在内部想编辑下东西要安装 vim 就会很慢很慢,thousand years later~
而且如果在容器内部想改源配置的话也要编辑器,就陷入了一个鸡生蛋,跟蛋生鸡的问题中,对于 linux 大神来说应该有一万种方法解决这个问题,对于我这个渣渣来说可能只想到了这个土方法,先 cp backup 一下 sources.list, 再 echo “xxx” > sources.list, 这里就碰到了一个问题,这个 sources.list 一般不止一行,直接 echo 的话就解析不了了,不过 echo 可以支持”\n”转义,就是加-e
看一下解释和示例

还有一点也是在这个时候要安装 vim 之类的,得知道是什么镜像底包,如果是用 uname 指令,其实看到的是宿主机的系统,得用cat /etc/issue

这里稍稍记一下

运行第一个 Dockerfile

上一篇的 Dockerfile 我们停留在构建阶段,现在来把它跑起来

1
2
docker run -d -p 80 --name static_web nicksxs/static_web \
nginx -g "daemon off;"

这里的-d表示以分离模型运行docker (detached),然后-p 是表示将容器的 80 端口开放给宿主机,然后容器名就叫 static_web,使用了我们上次构建的 static_web 镜像,后面的是让 nginx 在前台运行

可以看到返回了个容器 id,但是具体情况没出现,也没连上去,那我们想看看怎么访问在 Dockerfile 里写的静态页面,我们来看下docker 进程

发现为我们随机分配了一个宿主机的端口,32768,去服务器的防火墙把这个外网端口开一下,看看是不是符合我们的预期呢

好像不太对额,应该是 ubuntu 安装的 nginx 的默认工作目录不对,我们来进容器看看,再熟悉下命令docker exec -it 4792455ca2ed /bin/bash
记得容器 id 换成自己的,进入容器后得找找 nginx 的配置文件,通常在/etc/nginx,/usr/local/etc等目录下,然后找到我们的目录是在这

所以把刚才的内容复制过去再试试

目标达成,give me five✌️

第二个 Dockerfile

然后就想来动态一点的,毕竟写过 PHP,就来试试 PHP
再建一个目录叫 dynamic_web,里面创建 src 目录,放一个 index.php
内容是

1
2
<?php
echo "Hello World!";

然后在 dynamic_web 目录下创建 Dockerfile,

1
2
3
FROM trafex/alpine-nginx-php7:latest
COPY src/ /var/www/html
EXPOSE 80

Dockerfile 虽然只有三行,不过要着重说明下,这个底包其实不是docker 官方的,有两点考虑,一点是官方的基本都是 php apache 的镜像,还有就是 alpine这个,截取一段中文介绍

Alpine 操作系统是一个面向安全的轻型 Linux 发行版。它不同于通常 Linux 发行版,Alpine 采用了 musl libc 和 busybox 以减小系统的体积和运行时资源消耗,但功能上比 busybox 又完善的多,因此得到开源社区越来越多的青睐。在保持瘦身的同时,Alpine 还提供了自己的包管理工具 apk,可以通过 https://pkgs.alpinelinux.org/packages 网站上查询包信息,也可以直接通过 apk 命令直接查询和安装各种软件。
Alpine 由非商业组织维护的,支持广泛场景的 Linux发行版,它特别为资深/重度Linux用户而优化,关注安全,性能和资源效能。Alpine 镜像可以适用于更多常用场景,并且是一个优秀的可以适用于生产的基础系统/环境。

Alpine Docker 镜像也继承了 Alpine Linux 发行版的这些优势。相比于其他 Docker 镜像,它的容量非常小,仅仅只有 5 MB 左右(对比 Ubuntu 系列镜像接近 200 MB),且拥有非常友好的包管理机制。官方镜像来自 docker-alpine 项目。

目前 Docker 官方已开始推荐使用 Alpine 替代之前的 Ubuntu 做为基础镜像环境。这样会带来多个好处。包括镜像下载速度加快,镜像安全性提高,主机之间的切换更方便,占用更少磁盘空间等。

一方面在没有镜像的情况下,拉取 docker 镜像还是比较费力的,第二个就是也能节省硬盘空间,所以目前有大部分的 docker 镜像都将 alpine 作为基础镜像了
然后再来构建下

这里还有个点,就是上面的那个镜像我们也是 EXPOSE 80端口,然后外部宿主机会随机映射一个端口,为了偷个懒,我们就直接指定外部端口了
docker run -d -p 80:80 dynamic_web打开浏览器发现访问不了,咋回事呢
因为我们没看trafex/alpine-nginx-php7:latest这个镜像说明,它内部的服务是 8080 端口的,所以我们映射的暴露端口应该是 8080,再用docker run -d -p 80:8080 dynamic_web这个启动,