A GitHub contributions crawler built on cheerio

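Preface

This started when I wanted to add a small widget to my blog, similar to the GitHub contribution heatmap, to show how often the blog is updated. The community has no component with that level of customization (integrated with the GitHub API). The only open-source project that roughly fits is githubchart-api, but it can only produce an image URL for the heatmap, which then has to be inserted into the blog's DOM as an img element. That leaves no room for interaction and is not very flexible.

githubchart-api relies on githubstats to get contributions. githubstats is a Ruby library that scrapes GitHub contribution data, and at its core it simply requests https://github.com/users/{user_name}/contributions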
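
As a small illustration of that URL pattern, here is a minimal sketch of a helper that builds the address for a given user. The name buildContributionsUrl and the empty-name check are assumptions made for this example and are not taken from githubstats or from this project.

```typescript
// Hypothetical helper: builds the contributions page URL for a user.
const GITHUB_ORIGIN = "https://github.com";

function buildContributionsUrl(username: string): string {
  const trimmed = username.trim();
  if (trimmed === "") {
    throw new Error("username must not be empty");
  }
  // encodeURIComponent guards against characters that would break the path.
  return `${GITHUB_ORIGIN}/users/${encodeURIComponent(trimmed)}/contributions`;
}
```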

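In short, after some research I found no library that meets the need out of the box, so I decided to write an open-source one myself. Its core job is to crawl GitHub contributions. As for the heatmap, there are already plenty of mature implementations that can be used as is.

Approach

I came up with the following options:

1. cheerio.js: request the URL above, take the HTML, and parse out the relevant data with cheerio;
2. nodejs + ruby: run ruby commands through the Node.js child_process module and reuse the githubstats gem directly;
3. octokit.js: call the GitHub REST API directly through octokit.js, GitHub's open-source client;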

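Once the data is available, there are two ways to consume it:

the component fetches live data at runtime;
a static JSON file is generated in the background, and the component reads the static data;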

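Feasibility analysis

In terms of stability, options 1 and 2 share one problem: both obtain data by scraping, so whenever the source document changes its structure the crawler has to change with it. Option 3 reads data straight from GitHub's servers and is clearly the most stable.

In terms of development cost, option 1 requires writing the whole thing from scratch, option 2 carries the cost of learning Ruby, and option 3 costs the least.

In terms of input, options 1 and 2 need only a username, whereas option 3 requires a GitHub personal access token, which is clearly less convenient.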

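Other relevant details that came up during the research:

github.com does not allow cross-origin reads of this page: a request sent to github.com from a page on another origin is blocked by any modern browser under the same-origin policy. This means the crawler can only run in Node.js or on a server, not in the browser.
For use in mainland China, a proxy is also needed, otherwise a connection to github.com cannot be guaranteed every time.
The GitHub REST API has no endpoint that returns contributions. Contributions are derived from commits, opened issues, created repositories and so on, so an octokit-based solution would have to reimplement the same contribution-counting logic.
The GitHub API is rate limited, which makes fetching live data with octokit in the browser problematic, and live fetching also exposes the token.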

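All things considered, octokit is the best option on paper. But because a token must be supplied, it cannot be built into the component to fetch data at runtime: anyone who opens devtools can read my token straight from the Authorization header. The token can be restricted to read-only access, but given the GitHub API rate limit, whoever holds my token can call the API and burn through my quota.

At this point it is clear that all three options can only run in Node.js or on a server, and the data has to be persisted as static data first.

From a usability standpoint, octokit needs a token and the other options do not, so users will obviously find the other options easier to accept. Since this is meant to be open source, user habits come first, so option 3 is ruled out.

The Ruby option looks simple but has plenty of pitfalls in practice. First, the githubstats library exposes no way to set a proxy, which makes local debugging hard. Second, Ruby's syntax is very different from the languages I know, and the learning cost is not as low as advertised. Option 2 is ruled out.

The final choice is option 1: request the URL above, take the HTML, parse out the relevant data with cheerio, and persist it as static data.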

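Tailoring to the current scenario

Updating the data is simple: contributions do not need to be very fresh, so a job that runs once a day is enough. What does need some thought is how the data reaches the component:

High cost: persist the data in a database, run a server, update the data with a scheduled job, and expose an API that returns it;
Low cost: persist the data as a JSON file and use a scheduled GitHub Action to run the script and refresh it; the JSON can be pushed anywhere;

Since I do not want to pay for a server, I went with the low-cost option QAQ. In my setup, a push to the blog already triggers a GitHub Action that builds the site and publishes it to GitHub Pages. All that is left is to run the crawler as part of the build and publish the generated JSON file to GitHub Pages as well. The component itself is also served from GitHub Pages, so there is no cross-origin problem either.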
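
To make the low-cost path concrete, here is a minimal sketch of the persistence step that a build script could run. The ContributionItem shape, the function names, and the choice to store dates as YYYY-MM-DD strings are assumptions for illustration; the real project may serialize differently.

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

// Shape assumed for illustration; the real ContributionItem may differ.
interface ContributionItem {
  date: Date;
  value: string;
}

// Turn crawled items into a compact JSON string with YYYY-MM-DD dates.
function serializeContributions(items: ContributionItem[]): string {
  const rows = items.map((item) => ({
    date: item.date.toISOString().slice(0, 10),
    value: Number(item.value),
  }));
  return JSON.stringify(rows);
}

// Write the JSON into the build output so it is published with the site.
function persistContributions(items: ContributionItem[], filePath: string): void {
  mkdirSync(dirname(filePath), { recursive: true });
  writeFileSync(filePath, serializeContributions(items), "utf8");
}
```

Keeping serialization separate from the file write makes the format easy to check without touching the disk.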

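Implementation

Basic development scaffolding:

webpack
ts

Open-source libraries used:

cheerio (parses HTML)
arg (parses command-line arguments, mainly to build a simple CLI that is easy to integrate into hexo)
node-fetch (replaces the native Node.js fetch and, paired with https-proxy-agent, routes requests through a proxy for local debugging)

Crawler core logic

The HTML for contributions looks like this. I will call these cells data items (the ancestor nodes table tbody tr are omitted):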

<td
  tabindex="0"
  data-ix="0"
  aria-selected="false"
  style="width: 11px"
  class="ContributionCalendar-day"
  data-date="2022-07-24"
  data-level="0"
>
  <span class="sr-only">19 contributions on Sunday, July 24, 2022</span>
</td>

<!-- or -->

<td
  tabindex="0"
  data-ix="0"
  aria-selected="false"
  style="width: 11px"
  class="ContributionCalendar-day"
  data-date="2022-07-24"
  data-level="0"
>
  <span class="sr-only">No contributions on Sunday, July 24, 2022</span>
</td>

There are also noise cells:

<td class="ContributionCalendar-label" style="position: relative">
  <span class="sr-only">Sunday</span>
  <span
    aria-hidden="true"
    style="clip-path: Circle(0); position: absolute; bottom: -4px"
  >
    Sun
  </span>
</td>

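cheerio's load method loads and parses the HTML, and the selector is table tbody tr td. cheerio selectors are essentially CSS selectors, so they can be read as such here.

This selector also matches the noise cells, the ones that label the day of the week, so they have to be filtered out based on what distinguishes the two kinds of cell. I use the data attribute data-ix, read through the data method of a cheerio element. Note that the data- prefix is dropped, which matches the behavior of the native Element dataset property.

Next, the first number in the span element is extracted as the contribution count, which only takes a simple regular expression, /^\d*/. The date comes from the td's data-date attribute.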
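
The extraction step can be tried in isolation. The sketch below is a hypothetical standalone helper, not the project's code: it trims the label first and uses /^\d+/ in place of /^\d*/, so a label without leading digits yields no match at all instead of an empty one.

```typescript
// Parse the leading count from a cell's screen-reader label.
// "No contributions ..." has no leading digits, so it maps to 0.
function parseContributionCount(label: string): number {
  const matched = label.trim().match(/^\d+/);
  return matched ? Number(matched[0]) : 0;
}
```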

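Finally, note that the table cells do not appear in the document in chronological order (each table row holds one weekday across all weeks), so one more sorting pass is needed.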

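import { load } from "cheerio";

// ContributionItem and sortByDate are defined elsewhere in the project.
function extractContributions(html: string): ContributionItem[] {
  // Leading digits of the label; "No contributions ..." yields an empty match.
  const pattern = /^\d*/;
  const $ = load(html);
  const contributions: ContributionItem[] = $("table tbody tr td")
    // Keep only data cells; weekday label cells carry no data-ix attribute.
    .filter((_, el) => $(el).data("ix") !== undefined)
    .map((_, el) => {
      const date = $(el).data("date") as string;
      // Look up the span inside the cell and trim surrounding whitespace,
      // otherwise the anchored pattern would not see the digits.
      const matchedValue = $(el).find("span").text().trim().match(pattern);
      const value =
        matchedValue && matchedValue[0] !== "" ? matchedValue[0] : "0";
      return {
        date: new Date(date),
        value,
      };
    })
    .toArray();

  // default sort as asc
  sortByDate(contributions, (item) => item.date);

  return contributions;
}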
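
The sortByDate helper called above is not shown in this post. A hypothetical stand-in with the same call shape might look like the sketch below; the order parameter is an assumption, and the real helper may differ.

```typescript
// Hypothetical stand-in for sortByDate: sorts the array in place by the
// date returned from getDate, ascending unless "desc" is requested.
function sortByDate<T>(
  items: T[],
  getDate: (item: T) => Date,
  order: "asc" | "desc" = "asc"
): T[] {
  const sign = order === "asc" ? 1 : -1;
  return items.sort(
    (a, b) => sign * (getDate(a).getTime() - getDate(b).getTime())
  );
}
```

Sorting in place matches how the function above calls it without using a return value.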

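Replacing fetch and injecting a proxy

For two reasons:

a published npm module should not bundle unrelated dependencies;
not everyone needs a proxy to reach github.com;

node-fetch cannot be used directly as the fetch implementation. My solution is not the most elegant: it is a monkey patch where, before fetchHtml is called, injectFetch swaps the native fetch for node-fetch.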

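// FetchFunc is the project's fetch-compatible function type (definition not shown).
// Defaults to the native fetch; callers may swap it out via injectFetch.
let fetchFunc: FetchFunc = fetch;
export function injectFetch(_fetch: FetchFunc): void {
  fetchFunc = _fetch || fetch;
}

// Indirection so the currently injected implementation is resolved at call time.
export const __fetch: FetchFunc = (input, init) => {
  return fetchFunc(input, init);
};

async function fetchHtml(url: string) {
  const res = await __fetch(url);
  return res.text();
}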
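
The same injection point is also handy for tests, since a stub can stand in for the network. The sketch below restates the idea in a self-contained form so it can run on its own; FetchLike, stubFetch, and the requested array are names invented for this example, and the simplified types are not the project's real FetchFunc.

```typescript
// Minimal fetch-like contract: only what fetchHtml actually needs.
type FetchLike = (url: string) => Promise<{ text(): Promise<string> }>;

let fetchFunc: FetchLike | undefined;

function injectFetch(custom?: FetchLike): void {
  fetchFunc = custom;
}

async function fetchHtml(url: string): Promise<string> {
  // Fall back to the native fetch when nothing has been injected.
  const impl: FetchLike = fetchFunc ?? ((u) => fetch(u));
  const res = await impl(url);
  return res.text();
}

// A stub that records the requested URLs and returns canned HTML.
const requested: string[] = [];
const stubFetch: FetchLike = async (url) => {
  requested.push(url);
  return { text: async () => "<table></table>" };
};
```

With the stub injected, fetchHtml resolves to the canned HTML and never touches the network.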

Setting a proxy for node-fetch is done with a wrapper, an application of the proxy pattern:

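import { HttpsProxyAgent } from "https-proxy-agent";
import fetch from "node-fetch";

// injectFetch is the function exported by the module shown above.
function applyProxyAgent() {
  // Address of the local proxy used for debugging; adjust to your own setup.
  const agent = new HttpsProxyAgent("http://127.0.0.1:7890");

  // Wrap node-fetch so every request goes through the proxy agent.
  function myFetch(input: any, init: any) {
    return fetch(input, {
      ...init,
      agent,
    });
  }
  injectFetch(myFetch as any);
}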

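This way node-fetch and https-proxy-agent stay out of dependencies and are only used for local debugging (that is, as devDependencies). They can also be left out of the bundle entirely (in webpack terms, they do not even count as externals).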

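CLI implementation

The goal is a simple CLI application that calls the crawler and persists the data as a JSON file.

The first line, #!/usr/bin/env node, is a shebang that declares Node.js as the interpreter.

Argument parsing is tedious, and there is no point in reinventing the wheel. A small script does not need commander either; the lightweight arg is enough. Three options are defined, --username, --years and --path, for the GitHub username, the range of years, and the path of the JSON file.

To simplify the --years format, years are separated by commas, so -y 2021 -y 2022 -y 2023 can be shortened to -y 2021,2022,2023.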

#!/usr/bin/env node
const { run } = require("../dist/scripts.js");
const arg = require("arg");

// parse the arguments passed by user
const args = arg(
  {
    // Types
    "--username": String,
    "--years": String,
    "--path": String,

    // Aliases
    "-u": "--username",
    "-y": "--years",
    "-p": "--path",
  },
  // skip the node executable and script path so they are not parsed as positionals
  { argv: process.argv.slice(2) }
);

if (!args["--username"]) {
  throw new Error("missing required argument: --username or -u");
}

// "2021,2022" -> ["2021", "2022"]; stays undefined when --years is omitted
args["--years"] = args["--years"]?.split(",");

const { ["--username"]: username, ["--years"]: years, ["--path"]: path } = args;

run({ username, years, path });
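
The script above only splits the --years string on commas. A stricter variant could validate each entry before crawling; the sketch below is one hypothetical way to do that. The parseYears name, the four-digit rule, and the conversion to numbers are assumptions for this example, and the real run function may expect strings.

```typescript
// Hypothetical parser for the comma-separated --years value.
// Returns an empty array when the option was not supplied.
function parseYears(raw: string | undefined): number[] {
  if (raw === undefined || raw.trim() === "") {
    return [];
  }
  return raw.split(",").map((part) => {
    const text = part.trim();
    // Reject anything that is not exactly four digits, e.g. "abc" or "21".
    if (!/^\d{4}$/.test(text)) {
      throw new Error(`invalid year: "${text}"`);
    }
    return Number(text);
  });
}
```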

The mapping between the JS script file and the CLI command is defined in package.json.

"bin": {
    "crawl": "bin/cli.js"
},

OK. With a global install, the script can be run as crawl -u "user-name". With a local install, an npm script has to be configured, because of how the shell and npm look up commands.

"scripts": {
    "crawl": "crawl",
},

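Specifically, for a locally installed module, the bin field in package.json tells npm to add a command to the project's node_modules/.bin. Running crawl directly, however, makes the shell search the directories on PATH, which include the global npm bin directory but not the project's node_modules/.bin, so the command is not found. When it is written as an npm script and run with npm run crawl, npm adds the project's node_modules/.bin to PATH for that run, so the command is found.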

Webpack config

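Before publishing the project as an npm module, some extra work is needed:

1. analyze dependencies;
2. exclude external dependencies;
3. decide the module type of the bundled output;
4. separate entries and split chunks;
5. build the production version;

For goal 1, I added the webpack-bundle-analyzer plugin.

For goal 2, the configuration below excludes all dependencies and sets externalsType, the module type of these external dependencies, to commonjs. This affects how the bundled output imports those modules; since it runs in Node.js, commonjs is the straightforward choice. externalsPresets.node marks Node.js built-in modules as external, so they are imported at runtime and not bundled.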

externals: ["arg", "cheerio", "signale"],
externalsType: "commonjs",
externalsPresets: {
    node: true,
},

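For goals 3 and 4, the configuration below is used. Note that index is the entry for all core logic, while scripts is the entry for the logic the CLI needs and is built on top of the core logic. scripts therefore sets dependOn to index, so that the two bundled entries do not each carry a duplicate copy of the core logic. Also, a project that is published as a library must specify the library type. Otherwise it defaults to "var", meaning the exported module is treated as a variable, and Node.js cannot import it with require, because under the CommonJS specification modules are exported through exports.

As an aside, the original CommonJS specification only defined exports. Node.js and many other CommonJS implementations introduced module.exports and made exports merely a reference to it. That is why webpack has an additional commonjs2 library type: the two produce quite different build output. See this example for details.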
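
The relationship between exports and module.exports can be modeled in a few lines. The sketch below is a simplified simulation written for this post, not Node.js internals and not webpack output; runModule and the other names are invented for illustration.

```typescript
type ModuleExports = Record<string, () => string>;

interface ModuleRecord {
  exports: ModuleExports;
}

// Mimics the module wrapper: the body receives an alias of mod.exports plus
// the module record itself, and the loader returns mod.exports afterwards.
function runModule(
  body: (exportsAlias: ModuleExports, mod: ModuleRecord) => void
): ModuleExports {
  const mod: ModuleRecord = { exports: {} };
  body(mod.exports, mod);
  return mod.exports;
}

// Style 1: attach properties to the alias (what the original spec describes).
const viaExports = runModule((exportsAlias) => {
  exportsAlias.run = () => "ran";
});

// Style 2: replace module.exports wholesale.
const viaModuleExports = runModule((_alias, mod) => {
  mod.exports = { run: () => "ran" };
});

// Pitfall: rebinding the alias only changes the local variable, so nothing
// is exported.
const rebound = runModule((exportsAlias) => {
  exportsAlias = { run: () => "ran" };
});
```

The first two styles both expose run to the caller, while the third exports an empty object, which is why the distinction matters to a bundler that has to emit one form or the other.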

entry: {
  index: {
    import: "./src/index.ts",
  },
  // split the run function into its own chunk
  scripts: {
    import: "./src/scripts/index.ts",
    dependOn: "index", // avoid duplicating shared dependencies
    library: {
      type: "commonjs2", // export run function as module.exports.run
    },
  },
},

The rest is basic usage and needs no further comment. Finally, here is the dependency graph of github-contribution, which is relatively small.

github-contribution dependencies report