Day 3 - MongoDB 的 Aggregation 教學

本系列文章也刊登在我的部落格中，如有更新將以部落格為主～

前言

本篇文章是系列文的第三篇，要介紹 MongoDB 中較進階的 Aggregation 用法。

文章的架構如下：

Aggregation 基本概念：什麼是 Pipeline 與 Stage ？
常見的資料操作方法（ㄧ）： $match 、 $sort 、 $limit
常見的資料操作方法（二）： $project 、 $group 、 $unwind
Aggregation 綜合練習

你也可以在上一篇文章中複習 MongoDB 的 CRUD 操作。

Aggregation 基本概念

MongoDB 的文件裡寫到：

Aggregation operations process data records and return computed results.

意思是： Aggregation 是幫助我們在 MongoDB server 端進行「資料處理」的工具。

你可能會好奇在 MongoDB server 端進行資料處理的好處是什麼？為什麼不在應用程式端處理就好？
舉個例子，假如我們有一個名為 products 的 collection，裡頭裡存放了所有商品的資訊。當我們想要知道「最貴的商品是什麼」時，我們有兩種做法：

把所有的商品都查詢出來，再在應用程式中找出價錢最高的商品
透過撰寫 Aggregation 指令，直接在 MongoDB server 端找出價錢最高的商品

可以看出資料量龐大時，比起把所有資料都拿回應用程式端做處理，使用 Aggregation 更有效率些～

什麼是 Pipeline 跟 Stage

如果我們把「資料處理」比喻成「罐頭加工」的過程，那麼：
・存在 mongoDB 中的原始資料就是「罐頭的原物料」
・Pipeline 是罐頭加工廠的「生產線」
・Stage 是生產線中的「一道手續」

我們透過描述一連串的 stages (手續)來組成 pipeline（生產線），並對原始資料（原物料）進行 aggregate（加工），最終變成我們想要的成果（罐頭）。如同生產線中手續的「優先順序」很重要一樣，pipeline 中 stages 的順序是很重要的，因為每一個 stage 的 input 都是上一個 stage 處理後的 output。

舉例來說，我們可以透過由兩個 stage 組成的 pipeline 找出資料庫中「最貴的科技商品」是什麼。要注意兩個 stage 的順序不能調換：

// 針對 products collection 使用 aggregate
db.products.aggregate([
   找出所有「科技類」商品, // 第一個 stage
   找到其中「價錢最高」的商品 // 第二個 stage
])

再來我們會介紹幾個常見的 stage 以及應用方法。

常見的資料操作方法（ㄧ）

首先要介紹 $match 、 $sort 、 $limit 這三個方法。

`$match`

用來找出符合需求條件的資料。用法就跟上一篇介紹的 find 很像，是最常用到 Aggregation 方法之一。

假如有一個 articles collection 資料如下：

{ "author" : "dave", "score" : 80, "views" : 100 , "_id" : ObjectId("512bc95fe835e68f199c8686") }
{ "author" : "dave", "score" : 85, "views" : 521 , "_id" : ObjectId("512bc962e835e68f199c8687") }
{ "author" : "ahn", "score" : 60, "views" : 1000 , "_id" : ObjectId("55f5a192d4bede9ac365b257") }
{ "author" : "li", "score" : 55, "views" : 5000 , "_id" : ObjectId("55f5a192d4bede9ac365b258") }
{ "author" : "annT", "score" : 60, "views" : 50 , "_id" : ObjectId("55f5a1d3d4bede9ac365b259") }
{ "author" : "li", "score" : 94, "views" : 999 , "_id" : ObjectId("55f5a1d3d4bede9ac365b25a") }
{ "author" : "ty", "score" : 95, "views" : 1000 , "_id" : ObjectId("55f5a1d3d4bede9ac365b25b") }

我們可以用 $match 方法找出「作者是 dave」的文章：

// 針對 articles collection 使用 aggregate
db.articles.aggregate(
    // 只有一個 $match stage 的 pipeline
    [ { $match : { author : "dave" } } ]
);

會得出 2 筆結果：

{ "author" : "dave", "score" : 80, "views" : 100 , "_id" : ObjectId("512bc95fe835e68f199c8686") }
{ "author" : "dave", "score" : 85, "views" : 521 , "_id" : ObjectId("512bc962e835e68f199c8687") }

搭配邏輯符號

在使用 $match 時也常搭配上邏輯符號，如 $or 、 $gt 、 $lt ，能夠更精準的描述想要的資料。

下面的範例在 articles collection 中找出「分數大於 80、小於 95」或「觀看次數大於（等於） 1000」的文章：

db.articles.aggregate( [
  { $match: { $or: [ { score: { $gt: 80, $lt: 95 } }, { views: { $gte: 1000 } } ] } },
] );

會得出 5 筆結果：

{ "author" : "dave", "score" : 85, "views" : 521 , "_id" : ObjectId("512bc962e835e68f199c8687") }
{ "author" : "ahn", "score" : 60, "views" : 1000 , "_id" : ObjectId("55f5a192d4bede9ac365b257") }
{ "author" : "li", "score" : 55, "views" : 5000 , "_id" : ObjectId("55f5a192d4bede9ac365b258") }
{ "author" : "li", "score" : 94, "views" : 999 , "_id" : ObjectId("55f5a1d3d4bede9ac365b25a") }
{ "author" : "ty", "score" : 95, "views" : 1000 , "_id" : ObjectId("55f5a1d3d4bede9ac365b25b") }

`$sort` 與 `$limit`

$sort 用來將 documents 依據指定欄位排序，$limit 則是限定 documents 的數量。
這兩個 stage 常常搭配在一起使用。

同樣以上面 articles collection 為例子，我們可以用 $sort 與 $limit 找出「分數最高的三篇文章」：

db.articles.aggregate( [
  // 依照分數「由高而低」排序
  {"$sort": {
    "score": -1,
  }},
  // 只要分數「前三高」的 document
  {"$limit": 3}
] );

會得出 3 筆結果：

{ "author" : "ty", "score" : 95, "views" : 1000 , "_id" : ObjectId("55f5a1d3d4bede9ac365b25b") }
{ "author" : "li", "score" : 94, "views" : 999 , "_id" : ObjectId("55f5a1d3d4bede9ac365b25a") }
{ "author" : "dave", "score" : 85, "views" : 521 , "_id" : ObjectId("512bc962e835e68f199c8687") }

常見的資料操作方法（二）

再來要介紹 $project 、 $unwind 、 $group 這三個方法。

`$project`

$project 可以用來篩選或排除 document 已經存在的欄位、也可以用來創造出新的欄位。

假如有一個 books collection 資料如下：

{
  "_id" : 1,
  title: "abc123",
  isbn: "0001122223334",
  author: { last: "zzz", first: "aaa" },
  copies: 5,
  lastModified: "2016-07-28"
}
{
  "_id" : 2,
  title: "Baked Goods",
  isbn: "9999999999999",
  author: { last: "xyz", first: "abc", middle: "" },
  copies: 2,
  lastModified: "2017-07-21"
}
{
  "_id" : 3,
  title: "Ice Cream Cakes",
  isbn: "8888888888888",
  author: { last: "xyz", first: "abc", middle: "mmm" },
  copies: 5,
  lastModified: "2017-07-22"
}

我們可以用 $project 來篩選需要的兩個欄位 title、lastModified，並創造出一個新的欄位 authorName。其中的 authorName 用了字串的 $concat 方法，把 author 的 first name 跟 last name 串連起來：

db.books.aggregate( [
   {
      $project: {
         title: true, // true 表示要留下這個欄位
         lastModified: true,
         authorName: { // 產生一個新的欄位 authorName
           $concat: [
             "$author.first",
              ' ',
              "$author.last"
              ]
         }
      }
   }
] )

產出結果如下：

{ "_id" : 1, "title" : "abc123", lastModified: "2016-07-28", authorName: "aaa zzz"  }
{ "_id" : 2, "title" : "Baked Goods", lastModified: "2017-07-21", authorName: "abc xyz" }
{ "_id" : 3, "title" : "Ice Cream Cakes", lastModified: "2017-07-21", authorName: "abc xyz" }

值得一提的是 _id 這個欄位：由於 _id 是 document 的主鍵，除非「特別排除」這個欄位，否則使用 $project 時預設都會保留下來。如果想要排除 _id ，只需要再加上一行 _id: false 就行～

`$unwind`

$unwind 可以把 document 中的陣列資料「攤平」。聽起來好像很抽象，不如直接來看個例子。

假如有一個紀錄產品庫存的 inventory collection 資料如下：

{ "_id" : 1, "item" : "shirt", "sizes": [ "S", "M", "L"] }
{ "_id" : 2, "item" : "shoes", "sizes": ["M"] }

我們嘗試針對 sizes 這個陣列欄位使用 $unwind ：

db.inventory.aggregate( [ { $unwind: "$sizes" } ] )

因為第一筆 document 中 sizes 陣列有三個值，所以攤平後會得到三筆資料。結果如下：

{ "_id" : 1, "item" : "shirt", "sizes" : "S" }
{ "_id" : 1, "item" : "shirt", "sizes" : "M" }
{ "_id" : 1, "item" : "shirt", "sizes" : "L" }
{ "_id" : 2, "item" : "shoes", "sizes" : "M" }

$unwind 時常放在 pipeline 中間，幫助我們更直觀的處理陣列資料。

`$group`

$group 可以把 document 「分組」，還可以根據分組結果做數學運算。是非常好用的工具之一。

假如有一個記錄銷售的 sales collection 資料如下：

{ "_id" : 1, "item" : "abc", "quantity" : "2", "date" : 2014-03-01}
{ "_id" : 2, "item" : "jkl", "quantity" : "1", "date" : 2014-03-01}
{ "_id" : 3, "item" : "xyz", "quantity" :  "1", "date" : 2014-03-15}
{ "_id" : 4, "item" : "xyz", "quantity" :  "20", "date" : 2014-04-04}
{ "_id" : 5, "item" : "abc", "quantity" : "10", "date" : 2014-04-04}
{ "_id" : 6, "item" : "def", "quantity": "5", "date" : 2015-06-04}
{ "_id" : 7, "item" : "def", "quantity": "10", "date" : 2015-09-10}
{ "_id" : 8, "item" : "abc", "quantity" : "5" , "date" : 2016-02-06}

我們使用 $group 方法，計算出每一個 item 各別賣出了多少數量：
・_id：要做分組的欄位。範例中我們把相同 item 的資料 group 在一組。
・totalSaleQuantity：我們新加上的欄位。透過 $sum 把相同 item 的 quantity 相加

db.sales.aggregate([
  {
    $group : {
       _id : "$item", // 用 item 欄位做分組
       totalSaleQuantity: { $sum: "$quantity" } // 使用 $sum 把同個 item 的 quantity 相加
    }
  }
 ])

得出 4 種不同 item 以及各別賣出的總數量：

{ "_id" : "abc", "totalSaleQuantity" : 17 }
{ "_id" : "jkl", "totalSaleQuantity" : 1 }
{ "_id" : "xyz", "totalSaleQuantity" : 21 }
{ "_id" : "def", "totalSaleQuantity" : 15 }

除了範例裡用的 $sum 之外，$group 還時常搭配 $count、$avg、$max 等運算符使用。
可以在文件中看到完整的用法。

Aggregation 綜合練習

我們來試試把多個 stage 組成 pipeline 的綜合練習。

假如有一個訂單的 order collection 如下。每筆 order 都包含 顧客 id、訂單日期 跟 金額 三個欄位。

我們的目標是找出在 2020 年間，每位顧客的：

第一筆訂單時間
訂單的總數
訂單的總金額

  {
    "customer_id": "elise_smith@myemail.com",
    "orderdate": ISODate("2020-05-30T08:35:52Z"),
    "value": NumberDecimal("231.43"),
  },
  {
    "customer_id": "elise_smith@myemail.com",
    "orderdate": ISODate("2020-01-13T09:32:07Z"),
    "value": NumberDecimal("99.99"),
  },
  {
    "customer_id": "oranieri@warmmail.com",
    "orderdate": ISODate("2020-01-01T08:25:37Z"),
    "value": NumberDecimal("63.13"),
  },
  {
    "customer_id": "tj@wheresmyemail.com",
    "orderdate": ISODate("2019-05-28T19:13:32Z"),
    "value": NumberDecimal("2.01"),
  },
  {
    "customer_id": "tj@wheresmyemail.com",
    "orderdate": ISODate("2020-11-23T22:56:53Z"),
    "value": NumberDecimal("187.99"),
  },
  {
    "customer_id": "tj@wheresmyemail.com",
    "orderdate": ISODate("2020-08-18T23:04:48Z"),
    "value": NumberDecimal("4.59"),
  },
  {
    "customer_id": "elise_smith@myemail.com",
    "orderdate": ISODate("2020-12-26T08:55:46Z"),
    "value": NumberDecimal("48.50"),
  },
  {
    "customer_id": "tj@wheresmyemail.com",
    "orderdate": ISODate("2021-02-29T07:49:32Z"),
    "value": NumberDecimal("1024.89"),
  },
  {
    "customer_id": "elise_smith@myemail.com",
    "orderdate": ISODate("2020-10-03T13:49:44Z"),
    "value": NumberDecimal("102.24"),
  }

首先，我們只需要 2020 年間的訂單，所以可以用 $match 寫出第一個 stage ：

{
  "$match": {
    "orderdate": {
      "$gte": ISODate("2020-01-01T00:00:00Z"), // 時間大於等於 2020/1/1
      "$lt": ISODate("2021-01-01T00:00:00Z"), // 時間小於 2021/1/1
    },
  }
},

再來，由於我們要找出「每個使用者」的「第一筆訂單時間」，可以先使用 $sort 把所有訂單依照日期「由先而後」排序：

{
  "$sort": {
    "orderdate": 1, // 依照 orderdate 將時間由小而大排序
  }
}

最後，我們使用 $group 把訂單依照 顧客 id 做分組，並搭配：
・$first：找出每個顧客的第一筆訂單。因為先前已經 sort 過，所以第一筆訂單就是「時間最早」的訂單
・$sum：計算出訂單總數、訂單總金額

{
  "$group": {
    "_id": "$customer_id", // 依照 customer_id 做分組
    "first_purchase_date": {"$first": "$orderdate"}, // 找出第一筆（也是最早的） orderdate
    "total_value": {"$sum": "$value"}, // 使用 sum 將每筆 order 的金額加總
    "total_orders": {"$sum": 1}, // 使用 sum 計算總共有幾筆 order
  }
}

我們把上述三個 stage 組裝成 pipeline，對 order collection 進行 aggregate 操作：

db.orders.aggregate([
  // Match only orders made in 2020
  {"$match": {
    "orderdate": {
      "$gte": ISODate("2020-01-01T00:00:00Z"),
      "$lt": ISODate("2021-01-01T00:00:00Z"),
    },
  }},
  // Sort by order date ascending
  {"$sort": {
    "orderdate": 1,
  }},
  // Group by customer
  {"$group": {
    "_id": "$customer_id",
    "first_purchase_date": {"$first": "$orderdate"},
    "total_value": {"$sum": "$value"},
    "total_orders": {"$sum": 1},
  }},
]);

得出結果：

{
  customer_id: 'elise_smith@myemail.com',
  first_purchase_date: ISODate('2020-01-13T09:32:07.000Z'),
  total_value: NumberDecimal('482.16'),
  total_orders: 4
},
{
  customer_id: 'oranieri@warmmail.com',
  first_purchase_date: ISODate('2020-01-01T08:25:37.000Z'),
  total_value: NumberDecimal('63.13'),
  total_orders: 1
},
{
  customer_id: 'tj@wheresmyemail.com',
  first_purchase_date: ISODate('2020-08-18T23:04:48.000Z'),
  total_value: NumberDecimal('192.58'),
  total_orders: 2
}

總結

如果把 CRUD 比喻成格鬥遊戲裡的基本攻防招數，那 Aggregation 就是格鬥遊戲裡的連續技，若能好好善用會是很強大的武器～
關於 Aggregation 更詳細的用法可以參考官方文件與 Practical MongoDB Aggregations 電子書。

下篇文章我們要介紹能夠加速 mongoDB 效能的 Indexes。

Day 3 - MongoDB 的 Aggregation 教學

前言

Aggregation 基本概念

什麼是 Pipeline 跟 Stage

常見的資料操作方法（ㄧ）

`$match`

搭配邏輯符號

`$sort` 與 `$limit`

常見的資料操作方法（二）

`$project`

`$unwind`

`$group`

Aggregation 綜合練習

總結

pmc12thsuki

你可能感興趣的文章

留言討論

Day 3 - MongoDB 的 Aggregation 教學

前言

Aggregation 基本概念

什麼是 Pipeline 跟 Stage

常見的資料操作方法（ㄧ）

$match

搭配邏輯符號

$sort 與 $limit

常見的資料操作方法（二）

$project

$unwind

$group

Aggregation 綜合練習

總結

pmc12thsuki

你可能感興趣的文章

DAY25：Detect Pangram

Leetcode JS 2627. Debounce

Day 159

留言討論

`$match`

`$sort` 與 `$limit`

`$project`

`$unwind`

`$group`