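GitHub Trending is a valuable window into open-source trends for developers: it ranks popular repositories by day, week, and month, and the list is refreshed roughly every 5 minutes. The page has no official public API, so the data can only be obtained by parsing HTML, and a crawler has to cope with anti-scraping measures such as IP rate limiting, User-Agent checks, and CAPTCHAs. When building a lightweight monitoring service, the core concerns are stable crawling, accurate deduplication, and efficient alerting, so that useless data does not pile up and resources are not wasted.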
Crawl design essentials
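The GitHub Trending page lives at https://github.com/trending and accepts a ?since=daily|weekly|monthly parameter to switch granularity (default: daily). The HTML structure is stable: each repository is rendered as an article.Box-row card, and the key fields can be extracted with CSS selectors:
- Full repository name: h2 a[href] → author/repo
- Description: p[class*="color-fg-muted"]
- Language: span[itemprop="programmingLanguage"]
- Total stars / forks: the a elements next to the svg icons, or data-view-component attributes
- Stars gained today: span[data-view-component="text"] containing "stars today"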
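In practice the page relies on a small amount of JS, but the core data is rendered statically on the server, so httpx + selectolax can parse it directly without a headless browser. Anti-scraping risk: frequent requests without a proxy can easily get an IP blocked (a limit of roughly 29 requests per 10 s is cited; treat it as approximate). Recommended measures: rotate the User-Agent, add a random delay of 3-5 s per request, and retry up to 3 times with exponential backoff (1 s → 2 s → 4 s).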
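The retry policy above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `UA_POOL`, `polite_headers`, `fetch_with_retry`, and `random_pause` are hypothetical names, and `fetch` stands for any callable that takes a headers dict (for example a thin wrapper around httpx).

```python
import random
import time

# Hypothetical pool; replace with complete, real browser UA strings.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]


def polite_headers():
    """Pick a random User-Agent for each request."""
    return {"User-Agent": random.choice(UA_POOL)}


def backoff_delays(retries=3, base=1.0):
    """Delays before each retry: 1 s, 2 s, 4 s with the defaults."""
    return [base * 2 ** i for i in range(retries)]


def random_pause(low=3.0, high=5.0):
    """Random delay (seconds) to insert between consecutive requests."""
    return random.uniform(low, high)


def fetch_with_retry(fetch, retries=3, base=1.0, sleep=time.sleep):
    """Call fetch(headers); on any exception, back off and try again.

    `sleep` is injectable so the function can be tested without waiting.
    """
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return fetch(polite_headers())
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
```

Between successful requests, `time.sleep(random_pause())` adds the 3-5 s spacing. Catching every exception keeps the sketch short; a real crawler would probably retry only on timeouts and 429/5xx responses.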
Deduplication strategy
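Deduplication is the pain point of a monitoring service, because the same repository shows up again and again across consecutive crawls. A plain SQLite UNIQUE (repo_full_name, crawl_date) constraint is simple but lets storage grow; the recommended approach is a Bloom filter (fpp = 0.01%, roughly 240 KB of memory for 100,000 records) combined with Redis keys that carry TTL = 24h:
- Key: f"{repo_full_name}:{yyyymmdd}"
- Before inserting, test the key against the filter; skip on a hit, otherwise call bloom.add(key) and store the row
- Clear the filter at midnight each day, or let the TTL expire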
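The Redis half of this scheme might look like the sketch below. It assumes a client object exposing redis-py's `set(name, value, nx=..., ex=...)` method; `dedup_key` and `first_seen_today` are hypothetical helper names, and no Redis import is needed because the client is passed in.

```python
from datetime import date

DAY_SECONDS = 24 * 60 * 60  # TTL = 24h


def dedup_key(full_name, day=None):
    """Build the per-day key, e.g. 'author/repo:20251026'."""
    day = day or date.today()
    return f"{full_name}:{day:%Y%m%d}"


def first_seen_today(client, full_name, day=None):
    """Return True only the first time a repository is seen on a given day.

    nx=True writes the key only if it does not exist yet, and ex sets the
    expiry in seconds, so stale keys disappear without a cleanup job.
    """
    result = client.set(dedup_key(full_name, day), 1, nx=True, ex=DAY_SECONDS)
    return bool(result)
```

With redis-py this would be called as `first_seen_today(redis.Redis(), "author/repo")`. Because SET with NX is atomic, two crawler processes cannot both treat the same repository as new.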
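Parameter configuration:
| Parameter | Value | Notes |
|---|---|---|
| Bloom fpp | 0.0001 | False-positive rate; trades memory against accuracy |
| Window | 24h | Dedup per calendar day |
| Threshold | star growth > 10% | Alert trigger; newly listed repositories also alert |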
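To check the memory cost of these settings, the standard Bloom filter sizing formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2 can be evaluated directly. The helper name `bloom_size` is hypothetical; the formulas describe an idealized filter, so a concrete library may allocate slightly more.

```python
import math


def bloom_size(n, p):
    """Return (bits, hash_count) for n items at false-positive rate p."""
    bits = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    hashes = round(bits / n * math.log(2))
    return bits, hashes


bits, hashes = bloom_size(100_000, 0.0001)
print(f"{bits} bits, about {bits / 8 / 1024:.0f} KiB, {hashes} hash functions")
```

For 100,000 keys at fpp = 0.0001 this comes to roughly 1.9 million bits (about 234 KiB) and 13 hash functions, which is still small enough to keep in memory on any VPS.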
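Implementation checklist:
- pip install pybloom-live httpx selectolax schedule (pybloomfiltermmap3 and aiosqlite are optional alternatives that monitor.py does not use)
- Proxy pool: free sources such as ipip.net or a paid provider such as Bright Data, rotating 5-10 residential IPs
- Parsing tolerance: multiple CSS selector fallbacks, e.g. ".Box-row h2 a" or "[itemprop=name]"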
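The selector fallback idea can be expressed as a small helper. This sketch assumes a selectolax-style object with a `css_first(selector)` method that returns None when nothing matches; `first_match` and `NAME_SELECTORS` are hypothetical names.

```python
# Ordered from most specific to most generic, as in the checklist above.
NAME_SELECTORS = [".Box-row h2 a", "[itemprop=name]"]


def first_match(node, selectors):
    """Return the first element matched by any selector, or None."""
    for selector in selectors:
        found = node.css_first(selector)
        if found is not None:
            return found
    return None
```

With selectolax this would be called as `first_match(HTMLParser(html), NAME_SELECTORS)`. Logging which selector matched makes it easier to notice when GitHub changes its markup and the primary selector silently stops working.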
Lightweight monitoring service implementation
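Architecture: a single-process Python script. schedule crawls the daily list every 5 minutes and stores the results in SQLite; when a repository's star gain exceeds the previous day's by more than 5%, or a new repository appears, an alert goes out by email or Discord webhook. No external services are required, and deployment is a single Docker command.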
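The alert rule can be sketched as below. It assumes one reading of the rule, namely that today's star gain must exceed yesterday's gain by more than the threshold; `parse_count` and `should_alert` are hypothetical helpers, and the parser is needed because the scraped counts arrive as text such as "1,234 stars today".

```python
import re


def parse_count(text):
    """Extract the first integer from text like '1,234 stars today'; 0 if none."""
    match = re.search(r"\d[\d,]*", text or "")
    return int(match.group().replace(",", "")) if match else 0


def should_alert(today_gain, yesterday_gain, is_new, threshold=0.05):
    """Alert on new repositories or when the daily gain grew past the threshold."""
    if is_new:
        return True
    if yesterday_gain <= 0:
        # No baseline to compare against: alert on any gain at all.
        return today_gain > 0
    return (today_gain - yesterday_gain) / yesterday_gain > threshold
```

For example, `should_alert(parse_count("120 stars today"), 100, is_new=False)` returns True, since the gain rose by 20%.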
Complete code (monitor.py):

    import asyncio
    import json
    import os
    import random
    import sqlite3
    import time
    from datetime import date

    import httpx
    import schedule
    import selectolax.parser as sp
    from pybloom_live import BloomFilter

    DB_PATH = "trending.db"
    BLOOM_PATH = "bloom.bf"
    BF_ERROR = 0.0001
    BF_CAP = 100000
    PROXY = None  # e.g. "http://proxy:port"; None means a direct connection

    ua_pool = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
        # more UAs
    ]

    async def fetch_page(client, since="daily"):
        url = f"https://github.com/trending?since={since}"
        # Pick a random UA per request; hash(url) would always select the same one.
        headers = {"User-Agent": random.choice(ua_pool)}
        resp = await client.get(url, headers=headers, timeout=10.0)
        resp.raise_for_status()
        return resp.text

    def node_text(node, selector, default=""):
        """Text of the first match under node, or default if nothing matches."""
        found = node.css_first(selector)
        return found.text().strip() if found is not None else default

    def parse_repos(html):
        tree = sp.HTMLParser(html)
        repos = []
        for node in tree.css("article.Box-row"):
            link = node.css_first("h2 a")
            if link is None:  # layout changed or malformed card: skip it
                continue
            repos.append({
                "full_name": (link.attributes.get("href") or "").lstrip("/"),
                "desc": node_text(node, "p"),
                "lang": node_text(node, 'span[itemprop="programmingLanguage"]'),
                "stars": node_text(node, "a[href*='/stargazers']", "0"),
                "today": node_text(node, "span[data-view-component='text']", "0"),
            })
        return repos

    def load_bloom():
        # pybloom_live's BloomFilter takes no filename argument;
        # persist it across runs with tofile()/fromfile() instead.
        if os.path.exists(BLOOM_PATH):
            with open(BLOOM_PATH, "rb") as f:
                return BloomFilter.fromfile(f)
        return BloomFilter(capacity=BF_CAP, error_rate=BF_ERROR)

    def dedup_and_store(repos):
        today = date.today().isoformat()
        bloom = load_bloom()
        conn = sqlite3.connect(DB_PATH)
        conn.execute("CREATE TABLE IF NOT EXISTS trending (date TEXT, full_name TEXT, stars TEXT, today TEXT, UNIQUE(date, full_name))")
        new_repos = []
        for r in repos:
            key = f"{r['full_name']}:{today}"
            if key not in bloom:
                bloom.add(key)
                conn.execute("INSERT OR IGNORE INTO trending VALUES (?, ?, ?, ?)", (today, r['full_name'], r['stars'], r['today']))
                new_repos.append(r)
        conn.commit()
        conn.close()
        with open(BLOOM_PATH, "wb") as f:  # save so the next crawl sees today's keys
            bloom.tofile(f)
        return new_repos

    def alert(new_repos):
        if new_repos:
            msg = json.dumps(new_repos[:5], ensure_ascii=False, indent=2)  # top 5
            # send via SMTP or a webhook here
            print(f"🚨 New trending: {len(new_repos)} repos\n{msg}")

    async def crawl():
        # proxy= exists in recent httpx releases; older ones used proxies=.
        async with httpx.AsyncClient(proxy=PROXY) as client:
            html = await fetch_page(client)
        repos = parse_repos(html)
        news = dedup_and_store(repos)
        alert(news)

    def job():
        try:
            asyncio.run(crawl())
        except Exception as exc:  # keep the scheduler alive after a failed crawl
            print(f"crawl failed: {exc}")

    def run():
        schedule.every(5).minutes.do(job)
        while True:
            schedule.run_pending()
            time.sleep(1)

    if __name__ == "__main__":
        run()
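Deployment parameters:
- Docker: FROM python:3.12-slim, COPY monitor.py requirements.txt, CMD python monitor.py
- Cloud: VPS + pm2/cron; when scaling up, move SQLite → PostgreSQL and Bloom → Redis
- Monitoring: Prometheus scrapes a /metrics endpoint; custom alert threshold of 10% on star growth rate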
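A /metrics endpoint does not need an extra dependency. The sketch below renders counters in the Prometheus text exposition format using only the standard library; the metric names, the `METRICS` dict, and port 8000 are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counters that the crawl loop would increment.
METRICS = {"trending_repos_seen_total": 0, "trending_alerts_total": 0}


def render_metrics(metrics):
    """Render a dict of counters in Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve_metrics(port=8000):
    """Blocking call; run it in a background thread next to the scheduler."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Starting it with `threading.Thread(target=serve_metrics, daemon=True).start()` before the scheduler loop keeps the service a single process. The official prometheus_client package is the more complete option if an extra dependency is acceptable.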
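Risk control: fall back to a direct connection when the proxy fails; fall back to a JSON API when the selectors break (if GitHub ever provides one); shard tables automatically once storage exceeds 1M rows.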
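Extensions: a Go version with colly + goroutines, running 10 concurrent requests; a Rust version with reqwest + scraper, with no GC and memory under 10 MB. Cloudflare Workers: JS fetch + a KV-backed Bloom filter, served from a global CDN at essentially no cost.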
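Sources: the GitHub Trending page (https://github.com/trending); the CSDN hands-on article 《爬了 GitHub Trending 榜单,用词云扒出最近程序员都在卷什么技术栈》 ("I scraped the GitHub Trending list and used a word cloud to see which tech stacks programmers are chasing lately"), 2025-10-26.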