
Real-Time Scraping and Deduplication for GitHub Trending: Building a Lightweight Monitoring Service

A walkthrough of the key points for scraping the GitHub Trending leaderboard, its deduplication mechanism, and the engineering parameters and code behind a lightweight monitoring service.

The GitHub Trending leaderboard is a valuable window for developers to spot open-source trends, with daily / weekly / monthly hot-repository rankings that are refreshed frequently throughout the day (GitHub does not document the exact cadence). The page has no official public API, so the only option is HTML parsing, which means coping with anti-scraping measures such as IP rate limiting, User-Agent checks, and CAPTCHAs. When building a lightweight monitoring service, the core concerns are stable scraping, precise deduplication, and efficient alerting, so that useless data does not accumulate and waste resources.

Scraping Design Essentials

The GitHub Trending page lives at https://github.com/trending and accepts a ?since=daily|weekly|monthly parameter to switch granularity (default: daily). The HTML structure is stable: each repository renders as an article.Box-row card, and the key fields are extracted with CSS selectors:

  • Repository full name: h2 a[href] → author/repo
  • Description: p[class*="color-fg-muted"]
  • Language: span[itemprop="programmingLanguage"]
  • Total stars / forks: the a links (or data-view-component elements) next to the svg icons
  • Stars today: the span[data-view-component="text"] whose text contains "stars today"

In practice the page depends on only a little JS; the core data is rendered server-side, so httpx + selectolax can parse it directly, no headless browser needed. Anti-scraping risk: high-frequency requests without a proxy get the IP banned quickly (GitHub publishes no crawl-rate allowance), so rotate User-Agents, add a random 3-5 s delay per request, and retry up to 3 times with exponential backoff (1 s → 2 s → 4 s).
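
A minimal synchronous sketch of that fetch pattern; the UA strings are placeholders and the delay bounds are the article's suggested values, not tuned ones:

import random
import time

import httpx

UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",  # fill in real UA strings
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...",
]

def fetch_trending(since: str = "daily", retries: int = 3) -> str:
    """Fetch the trending page with UA rotation and exponential backoff."""
    url = f"https://github.com/trending?since={since}"
    for attempt in range(retries):
        try:
            resp = httpx.get(url, headers={"User-Agent": random.choice(UA_POOL)},
                             timeout=10.0)
            resp.raise_for_status()
            return resp.text
        except httpx.HTTPError:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # 1 s -> 2 s -> 4 s

Between successive fetches, sleep random.uniform(3, 5) seconds to stay within the polite-delay guidance above.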

Deduplication Strategy

Deduplication is the pain point of a monitoring service: the same repository keeps reappearing across consecutive crawls. The traditional SQLite UNIQUE (repo_full_name, crawl_date) constraint is simple but lets storage balloon. A better fit is a Bloom filter (fpp = 0.01%, roughly 240 KB of memory for 100k entries) plus a Redis TTL of 24 h:

  • Key: f"{repo_full_name}:{yyyymmdd}"
  • Before inserting, test membership: on a miss, bloom.add(key) and insert; on a hit, skip
  • Reset at midnight each day, or simply let the Redis TTL expire (sketch below)
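
For the Redis variant, a single SET with NX and EX doubles as both the membership test and the 24 h expiry. A minimal sketch, assuming a local Redis instance and the redis-py client (neither is part of this article's dependency list):

from datetime import date

import redis  # pip install redis (extra dependency, assumed here)

r = redis.Redis(host="localhost", port=6379, db=0)

def is_new_today(repo_full_name: str) -> bool:
    """Return True the first time a repo is seen today; the key expires after 24 h."""
    key = f"{repo_full_name}:{date.today().strftime('%Y%m%d')}"
    # SET key 1 NX EX 86400: succeeds only if the key does not already exist
    return r.set(key, 1, nx=True, ex=86400) is True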

Parameter configuration:

Parameter          Value            Notes
Bloom fpp          0.0001           False-positive rate; balances memory vs. accuracy
Dedup window       24 h             Deduplicate within the calendar day
Alert threshold    star growth >10% New entries on the board trigger an alert
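
As a sanity check on the ~240 KB figure above, the standard Bloom filter sizing formula m = -n·ln(p)/(ln 2)² works out as follows:

import math

n = 100_000  # expected entries (BF_CAP)
p = 0.0001   # target false-positive rate (BF_ERROR, i.e. 0.01%)

bits = -n * math.log(p) / (math.log(2) ** 2)
print(f"{bits / 8 / 1024:.0f} KB")  # ≈ 234 KB, i.e. roughly 240 KB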

Implementation checklist:

  1. pip install pybloom-live httpx selectolax schedule (the packages the code below actually imports)
  2. Proxy pool: a free source such as ipip.net, or a paid provider such as Bright Data, rotating 5-10 residential IPs
  3. Parsing resilience: fall back across multiple CSS selectors, e.g. ".Box-row h2 a" or "[itemprop=name]" (see the helper sketch after this list)
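
A small helper for the fallback idea in item 3; the selector list is illustrative and should be extended as GitHub's markup evolves:

from selectolax.parser import Node

def first_text(node: Node, selectors: list[str], default: str = "") -> str:
    """Try each CSS selector in order; return the first non-empty text match."""
    for sel in selectors:
        hit = node.css_first(sel)
        if hit is not None and hit.text(strip=True):
            return hit.text(strip=True)
    return default

# usage: name = first_text(card, [".Box-row h2 a", "[itemprop=name]"])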

Implementing the Lightweight Monitoring Service

Service architecture: a single-process Python script. schedule crawls the daily leaderboard every 5 minutes, results go to SQLite, and an email or Discord webhook alert fires for any new repository or one whose star growth exceeds 10% of yesterday's count. No external services are required, and it deploys with a single Docker command.

Complete code (monitor.py):

import asyncio
import httpx
import selectolax.parser as sp
import sqlite3
import schedule
import time
import json
import os
import random
from datetime import date
from pybloom_live import BloomFilter
from email.mime.text import MIMEText  # for the optional SMTP alert path
import smtplib  # or requests.post to a webhook

DB_PATH = "trending.db"
BLOOM_PATH = "bloom.bf"
BF_ERROR = 0.0001
BF_CAP = 100000
PROXY = None  # e.g. "http://user:pass@host:port"; None = direct connection

ua_pool = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    # add more real browser UA strings here
]

async def fetch_page(client, since="daily"):
    url = f"https://github.com/trending?since={since}"
    # random.choice actually rotates the UA; hash(url) would pin one UA per URL
    hd = {"User-Agent": random.choice(ua_pool)}
    resp = await client.get(url, headers=hd, timeout=10.0)
    resp.raise_for_status()
    return resp.text

def parse_repos(html):
    tree = sp.HTMLParser(html)
    repos = []
    for node in tree.css("article.Box-row"):
        link = node.css_first("h2 a")
        if link is None:
            continue  # layout changed; skip the card rather than crash
        name = link.attributes.get("href", "").lstrip("/")
        desc_node = node.css_first("p")
        desc = desc_node.text(strip=True) if desc_node else ""
        lang_node = node.css_first('span[itemprop="programmingLanguage"]')
        lang = lang_node.text(strip=True) if lang_node else ""
        stars_node = node.css_first("a[href*='/stargazers']")
        stars = stars_node.text(strip=True) if stars_node else "0"
        # pick the span whose text actually mentions "stars today"
        today_stars = "0"
        for span in node.css("span"):
            if "stars today" in span.text():
                today_stars = span.text(strip=True)
                break
        repos.append({"full_name": name, "desc": desc, "lang": lang,
                      "stars": stars, "today": today_stars})
    return repos

def dedup_and_store(repos):
    today = date.today().isoformat()
    # pybloom_live has no filename= argument; persist via tofile/fromfile instead
    if os.path.exists(BLOOM_PATH):
        with open(BLOOM_PATH, "rb") as f:
            bloom = BloomFilter.fromfile(f)
    else:
        bloom = BloomFilter(capacity=BF_CAP, error_rate=BF_ERROR)
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS trending (date TEXT, full_name TEXT, stars TEXT, today TEXT, UNIQUE(date, full_name))")
    new_repos = []
    for r in repos:
        key = f"{r['full_name']}:{today}"
        if key not in bloom:
            bloom.add(key)
            conn.execute("INSERT OR IGNORE INTO trending VALUES (?, ?, ?, ?)",
                         (today, r['full_name'], r['stars'], r['today']))
            new_repos.append(r)
    conn.commit()
    conn.close()
    with open(BLOOM_PATH, "wb") as f:
        bloom.tofile(f)
    return new_repos

def alert(new_repos):
    if new_repos:
        msg = json.dumps(new_repos[:5], ensure_ascii=False, indent=2)  # top 5 only
        # send via SMTP, or POST to a webhook (see the Discord sketch below)
        print(f"🚨 New trending: {len(new_repos)} repos\n{msg}")

async def crawl():
    async with httpx.AsyncClient(proxy=PROXY) as client:
        html = await fetch_page(client)
        repos = parse_repos(html)
        news = dedup_and_store(repos)
        alert(news)

def run():
    schedule.every(5).minutes.do(lambda: asyncio.run(crawl()))
    while True:
        schedule.run_pending()
        time.sleep(1)

if __name__ == "__main__":
    run()
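
The alert() stub above only prints. A hedged sketch of the Discord webhook path, reusing the existing httpx dependency; the webhook URL is a placeholder you create under your Discord server's integration settings:

import httpx

DISCORD_WEBHOOK = "https://discord.com/api/webhooks/<id>/<token>"  # placeholder

def send_discord_alert(new_repos: list[dict]) -> None:
    """POST a short plain-text summary to a Discord webhook."""
    lines = [f"{r['full_name']} ({r['today']})" for r in new_repos[:5]]
    payload = {"content": "🚨 New trending repos:\n" + "\n".join(lines)}
    httpx.post(DISCORD_WEBHOOK, json=payload, timeout=10.0).raise_for_status()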

Deployment parameters:

  • Docker: FROM python:3.12-slim, COPY monitor.py requirements.txt, CMD python monitor.py
  • Cloud: VPS + pm2/cron; swap SQLite → PostgreSQL and the Bloom filter → Redis as volume grows
  • Monitoring: expose a /metrics endpoint for Prometheus to scrape, with a custom alert threshold of 10% star-growth rate (see the sketch below)
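
A minimal sketch of that /metrics endpoint using prometheus_client (an extra dependency not in the checklist above; the metric names are illustrative):

from prometheus_client import Counter, Gauge, start_http_server

REPOS_SEEN = Counter("trending_repos_seen_total", "Repos parsed across all crawls")
NEW_REPOS = Gauge("trending_new_repos", "New repos found in the last crawl")

start_http_server(8000)  # serves /metrics on port 8000 in a background thread

def record_crawl(parsed: int, new: int) -> None:
    """Call once per crawl; Prometheus scrapes the values on its own schedule."""
    REPOS_SEEN.inc(parsed)
    NEW_REPOS.set(new)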

Risk control: fall back to a direct connection when the proxy dies; fall back to a JSON API if the selectors break (should GitHub ever expose one); shard the table automatically once storage exceeds 1M rows.

Extensions: a Go version with colly + goroutines at 10 concurrent requests; a Rust version with reqwest + scraper, no GC and under 10 MB of memory; or Cloudflare Workers with a JS fetch + KV-backed bloom filter, globally CDN-distributed at zero cost.

Sources: GitHub Trending page (https://github.com/trending); CSDN hands-on article "爬了 GitHub Trending 榜单,用词云扒出最近程序员都在卷什么技术栈" (2025-10-26).

