记得去年面试广州的Ruby公司的时候,面试官问了我一个问题,在ActiveRecord数据库查询里面joinsincludes有什么区别,当时的我哑口无言。估计面试官会琢磨,这货,怎么编译器都开发出来了,这么简单的问题都不懂。肯定是抄别人的编译器。实则真不是,只是Rails相关的工作做得少,做过的都是页面层,数据库层面的东西更是接触得少,连这种基础问题也回答不出来,实在尴尬。

joins主要用于联表查询

还是用博客系统的表结构来说明,假设博客系统的文章(Post)都分属于某一个分类(Category),那么如果我们想获取某个分类下的所有文章,那么其实可以这样去做

> Category.first
  Category Load (0.4ms)  SELECT "categories".* FROM "categories" ORDER BY "categories"."id" ASC LIMIT $1  [["LIMIT", 1]]
=> #<Category id: 1, key: "blogs", name: "Blog", created_at: "2020-02-26 04:36:42", updated_at: "2020-02-26 04:36:42">
Post.where("category_id = 1")
  Post Load (1.9ms)  SELECT "posts".* FROM "posts" WHERE (category_id = 1) LIMIT $1  [["LIMIT", 11]]
=> #<ActiveRecord::Relation [#<Post id: 42, title: "事务与锁在Rails中的体现", body: "在用Rails写真实业务逻辑 ...

我们也可以通过分类名来进行查询

Post.where("categories.name = 'Blog'")
  Post Load (1.7ms)  SELECT "posts".* FROM "posts" WHERE (categories.name = 'Blog') LIMIT $1  [["LIMIT", 11]]
Traceback (most recent call last): ...

哦,报错了。因为这条语句并没有categories表的信息。posts表跟categories是相关联的,于是要做这种联表的查询需要使用joins语句

Post.joins(:category).where("categories.name = 'Blog'")
  Post Load (1.9ms)  SELECT "posts".* FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blog') LIMIT $1  [["LIMIT", 11]]
=> #<ActiveRecord::Relation [#<Post id: 42, title: "事务与锁在Rails中的体现", body: "在用Rai...

关键在于

... INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" ...

两个表连结的纽带是categories表的id列与posts表的category_id列。

includes经常用于解决N + 1查询问题

只要编写Rails代码一段时间,难免遇到N+1查询问题。一般来说N+1查询问题是这样的

irb(main):036:0> Post.limit(10).each {|p| p.category.name }
  Post Load (1.5ms)  SELECT "posts".* FROM "posts" LIMIT $1  [["LIMIT", 10]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 2], ["LIMIT", 1]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 2], ["LIMIT", 1]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 2], ["LIMIT", 1]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 2], ["LIMIT", 1]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 3], ["LIMIT", 1]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 2], ["LIMIT", 1]]
  Category Load (0.1ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" = $1 LIMIT $2  [["id", 2], ["LIMIT", 1]]
=> [#<Post id: 5, title: "【译】这福利是给谁的?", body: "这福利是给谁的?翻译....

我们首先查询出文章数据,然后通过each语句去访问每篇文章对应的分类数据,然而分类数据是在另一个表里面,因此需要查询分类表。于是乎系统在每次我们使用分类信息的时候都用category_id查询文章对应的分类信息。假设文章数据有10条。第一条查询SELECT "posts" xxxx就是查询文章列表的。后面还有10条查询是单独查询分类信息的。可怕的是分类的查询语句都是类似的....这就是所谓的N+1查询,其实个人感觉称之为1+N查询会更好一些,毕竟我们还是先查询那个1,然后再进行剩余的N次查询。

如何规避这种问题呢?既然后面的N条语句查询结构都是一样的,那么我们没有必要查询N次,查询一次并且复用即可。我们可以使用Rails的includes语句

irb(main):038:0> Post.includes(:category).limit(10).each {|p| p.category.name }
  Post Load (1.1ms)  SELECT "posts".* FROM "posts" LIMIT $1  [["LIMIT", 10]]
  Category Load (0.3ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" IN ($1, $2, $3)  [["id", 2], ["id", 1], ["id", 3]]
=> [#<Post id: 5, title: "【译】这福利是给谁的?", body: "这福利是给谁的?翻译....

可见原来的11条语句现在简化成2条。有时候还真别小看这种优化,对于一个列表而言,当这种关联数据很多的时候,处理得好,往往能给列表性能带来质的飞跃,而这种飞跃所需要投入的努力并不算大。特别是在Rails里面恰当使用includes语句就能解决很多问题。

注意事项

个人感觉includesjoins并不是对立的东西,而是可以相互协助的工具。比方说写出这种查询:

irb(main):041:0> Post.includes(:category).joins(:category).where("categories.name = 'Blog'").each {|p| p.category.name }
  SQL (1.9ms)  SELECT "posts"."id" AS t0_r0, "posts"."title" AS t0_r1, "posts"."body" AS t0_r2, "posts"."slug" AS t0_r3, "posts"."created_at" AS t0_r4, "posts"."updated_at" AS t0_r5, "posts"."excerpt" AS t0_r6, "posts"."category_id" AS t0_r7, "posts"."draft" AS t0_r8, "categories"."id" AS t1_r0, "categories"."key" AS t1_r1, "categories"."name" AS t1_r2, "categories"."created_at" AS t1_r3, "categories"."updated_at" AS t1_r4 FROM "posts" INNER JOIN "categories" ON "categories"."id" = "posts"."category_id" WHERE (categories.name = 'Blog') LIMIT $1  [["LIMIT", 11]]
=> #<ActiveRecord::Relation [#<Post id: 42, title: "事务 ....

虽然说构造出来的SQL语句会复杂些,不过这些都是Rails系统的事情,咱们不用理会太多。我们只需要把重心放在如何把代码写得清晰可维护,容易理解即可。另外还要注意在联表查询的时候最好养成加上表前缀的习惯,比方说:

Post.includes(:category).joins(:category).where("categories.id = 1").each {|p| p.category.name }

而不是

Post.includes(:category).joins(:category).where("id = 1").each {|p| p.category.name }

ActiveRecord::StatementInvalid (PG::AmbiguousColumn: ERROR:  column reference "id" is ambiguous)
LINE 1: ... ON "categories"."id" = "posts"."category_id" WHERE (id = 1)

两个表都有名为id列,这个基于id的查询,数据库系统无法定位是哪一个表。要是联表查询的时候关联的表比较少,那这个问题还好说,比较容易发现。然而随着业务越来越复杂,数据表肯定会越来越多,联接多个表又加上各种复杂的查询封装,很容易就出现PG::AmbiguousColumn这种异常,到时候不得不在多个表之间一一排查问题工作量也不少。还不如一开始就习惯性加上前缀。要是担心自己的表名会变(其实一般也不会),这样写弹性不足的话则可以考虑,这样

Post.includes(:category).joins(:category).where("#{Category.table_name}.id = 1").each {|p| p.category.name }

效果是一样的,只不过这样写代码稍微健壮一丢丢。

尾声

简单总结了一下在Rails里面includes以及joins的用法,也提到一些使用的注意事项。也不知道这么粗浅的回答当年那面试官是否能够接受。