膜乎 backup and rescue

Some links are already dead, but the links that used to get a lot of traffic are still available.

See https://github.com/PincongBot/mohu/ for reference.


Please enable a system-wide (global) proxy on your computer while backing up.
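
For reference, a quick check that requests is actually going out through the proxy before starting a long crawl; a minimal sketch, assuming the same local proxy endpoint (127.0.0.1:9910) that the scripts below use, with httpbin.org/ip used only as a convenient echo service:

import requests

proxies = {'https': '127.0.0.1:9910'}   # same local proxy endpoint the backup scripts below use
# if the proxy is reachable, this prints the exit IP seen by the remote side;
# if it is not, requests raises a ProxyError and the crawl should not be started
resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=15)
print(resp.text)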

Backing up articles:

# -*- coding:utf-8 -*-
import re
import os
import json
import requests
import time
import traceback

# Google cache of mohu.rocks; base_url2 covers pages cached under the www. host
base_url = 'https://webcache.googleusercontent.com/search?q=cache:https://mohu.rocks/'
base_url2 = 'https://webcache.googleusercontent.com/search?q=cache:https://www.mohu.rocks/'
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0', # Tor Browser UA
    'Referer' : 'https://google.com'
}
proxies = {
    'https' : '127.0.0.1:9910' # local proxy; adjust host/port to your own setup
}
startpage = 6200
endpage = 6214 # exclusive
os.makedirs('article', exist_ok=True)
for page in range(startpage, endpage):
    time.sleep(5) # throttle requests to reduce the chance of a Google CAPTCHA
    try:
        if not os.path.exists('article/' + str(page) + '.json'): # skip pages already saved
            req = requests.get(base_url + 'article/' + str(page), headers=headers, proxies=proxies)
            req.raise_for_status() # bail out early on 404/429 instead of parsing an error page
            webpage = req.text
            # article title from the <title> tag
            title = ''.join(re.findall(r'<title>(.*?)</title>', webpage))
            # author uid: last data-id attribute on the page
            uid = re.findall(r'data-id="\d{1,4}"', webpage)[-1]
            uid = int(''.join(re.findall(r'\d+', uid)))
            # article body
            contents = ''.join(re.findall(r'<div class="content markitup-box">([\s\S]*?)</div>', webpage)).strip()
            # topic tag ids
            topics = [int(i) for i in re.findall(r'class="topic-tag" data-id="(.*?)">', webpage)]
            # first date on the page is the publication date
            date = re.findall(r'[\d]{4}-[\d]{2}-[\d]{2}', webpage)[0]
            # upvote count
            agreeCount = int(re.findall(r'<b class="count">(.*?)</b>', webpage)[0])
            # reply count: digits between "<h2>" and the following space in the replies heading
            discussionCount = ''.join(re.findall(r'<h2>\d+.[\u4e00-\u9fa5]{3}</h2>', webpage))
            discussionCount = int(discussionCount[4:discussionCount.index(' ')])
            data = {
                'type': 'article',
                'id': page,
                'title': title,
                'uid': uid,
                'topics': topics,
                'contents': contents,
                'date': date,
                'agreeCount': agreeCount,
                'discussionCount': discussionCount,
            }
            jsonString = json.dumps(data, ensure_ascii=False)
            with open('article/' + str(page) + '.json', 'w', encoding='utf-8') as f:
                f.write(jsonString)
    except Exception:
        traceback.print_exc()


Backing up article comments:

# -*- coding:utf-8 -*-
import re
import os
import json
import time
import requests
import traceback

base_url = 'https://webcache.googleusercontent.com/search?q=cache:https://mohu.rocks/'
base_url2 = 'https://webcache.googleusercontent.com/search?q=cache:https://www.mohu.rocks/'
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0', # Tor Browser UA
    'Referer' : 'https://google.com'
}
proxies = {
    'https' : '127.0.0.1:9910' # local proxy; adjust host/port to your own setup
}
startpage = 6200
endpage = 6214 # exclusive
os.makedirs('article-comment', exist_ok=True)
for page in range(startpage, endpage):
    time.sleep(1) # throttle requests
    try:
        req = requests.get(base_url + 'article/' + str(page), headers=headers, proxies=proxies)
        req.raise_for_status()
        webpage = req.text
        # the visible reply list sits between the normal replies block and the folded replies block
        comment = re.findall(r'<div class="mod-body aw-feed-list aw-replies">([\s\S]*?)<div class="mod-body aw-feed-list aw-replies-fold">', webpage)
        comment = ''.join(comment).strip()
        # one markitup-box per reply
        total_discussion = comment.count('markitup-box')
        # per-reply ids, upvote counts, dates, author uids and bodies, all in document order
        data_item_id = [''.join(re.findall(r'\d+', i)) for i in re.findall(r'data-item-id="\d+"', comment)]
        agree_count = [''.join(re.findall(r'\d+', i)) for i in re.findall(r'<b class="count">\d+</b>', comment)]
        date = re.findall(r'[\d]{4}-[\d]{2}-[\d]{2}', comment)
        user_id = [''.join(re.findall(r'\d+', i)) for i in re.findall(r'<a class="aw-user-name" data-id="\d+" data-reputation', comment)]
        contents = [i.strip() for i in re.findall(r'<div class="markitup-box">([\s\S]*?)</div>', comment)]
        for i in range(total_discussion):
            data = {
                'type': 'article_comment',
                'id': int(data_item_id[i]),
                'parentType': 'article',
                'parentId': page,
                'uid': int(user_id[i]),
                'contents': contents[i],
                'date': date[i],
                'agreeCount': int(agree_count[i]),
                'discussionCount': 0,
            }
            jsonString = json.dumps(data, ensure_ascii=False)
            with open('article-comment/' + data_item_id[i] + '.json', 'w', encoding='utf-8') as f:
                f.write(jsonString)
    except Exception:
        traceback.print_exc()




Could you run your code and make a complete backup of the data?

I'm revising the code now.

I took a quick look at your backup code; a few issues are worth pointing out.
1. The code only handles one level of paging, i.e. the first-level pagination of pages. Second-level pagination is not handled, meaning discussion topics: if a topic has second-level pagination, that content may not get collected.
2. The code does not account for rescuing images.
3. There is no delay between requests. Even under normal browsing Google readily throws a CAPTCHA; with bulk crawling, the pages you collect may turn out to be nothing but Google CAPTCHA pages. (See the back-off sketch after this list.)

4. If only the last half-year of data has been lost, and you can determine how much data 膜乎 holds, you can collect only the recently updated content. A full backup is of course also possible: if Google doesn't throw CAPTCHAs, you can crawl everything.
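
A minimal sketch of the back-off that point 3 calls for, assuming a rate-limited request shows up as HTTP 429, a redirect to Google's /sorry/ interstitial, or an "unusual traffic" notice in the body (these markers are assumptions; adjust them to what you actually observe):

import time
import requests

def fetch_with_backoff(url, headers, proxies, max_retries=5):
    # fetch one cache URL, pausing and retrying whenever the response looks like a CAPTCHA page
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, proxies=proxies)
        blocked = (
            resp.status_code == 429                 # explicit rate limiting
            or 'google.com/sorry/' in resp.url      # redirected to the CAPTCHA interstitial
            or 'unusual traffic' in resp.text       # text commonly shown on the CAPTCHA page
        )
        if not blocked:
            return resp.text
        wait = 60 * (attempt + 1)                   # grow the pause on every retry
        print('CAPTCHA / rate limit suspected, sleeping %ds before retrying %s' % (wait, url))
        time.sleep(wait)
    return None                                     # give up; the caller should record the page as unfetched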

1. The main issue is that some things can no longer be opened now, for example:

https://webcache.googleusercontent.com/search?q=cache:https://mohu.rocks/article/6210

3. I'm adding sleep calls now.

Some additional feature suggestions, mainly thoughts on dealing with the Google CAPTCHA.

2. Images can wait until after the bulk crawl: extract the image URLs from the data you have already downloaded and then download them in a batch (a sketch follows below).

This works as long as you make sure the pages you collected are real content and not Google CAPTCHA pages.
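
A minimal sketch of that second image pass, assuming the article/*.json files written by the script above and ordinary absolute <img src="..."> links inside the saved contents field (relative links would need the site prefix added; the img/ directory name is just illustrative):

import re
import os
import json
import glob
import requests

os.makedirs('img', exist_ok=True)
img_urls = set()
# collect every <img src="..."> URL from the article bodies saved earlier
for path in glob.glob('article/*.json'):
    with open(path, encoding='utf-8') as f:
        contents = json.load(f).get('contents', '')
    img_urls.update(re.findall(r'<img[^>]+src="([^"]+)"', contents))

for url in sorted(img_urls):
    filename = os.path.join('img', url.rsplit('/', 1)[-1])
    if os.path.exists(filename):
        continue                                    # already downloaded in an earlier run
    try:
        resp = requests.get(url, timeout=30)
        with open(filename, 'wb') as f:
            f.write(resp.content)
    except Exception:
        print('failed:', url)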

3. If you have a Google account, you can pull your Google account cookies and UA out of the browser and use them when crawling. That lowers the odds of hitting a CAPTCHA.
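
A minimal sketch of reusing the signed-in browser session, assuming you copy the Cookie header value and the User-Agent string out of the browser's developer tools (the cookie value shown here is only a placeholder):

import requests

headers = {
    # both values copied from the browser's Network tab while signed in to Google
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0',
    'Cookie': 'SID=...; HSID=...; SSID=...',        # placeholder: paste your own session cookies
}
resp = requests.get(
    'https://webcache.googleusercontent.com/search?q=cache:https://mohu.rocks/article/6210',
    headers=headers,
)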

5. On error handling: if a Google cache request fails, it is probably a CAPTCHA. The crawler has to keep its cookies in sync with the browser; you can then simply exit (or pause) the program, copy the page you were about to fetch into the browser, pass the CAPTCHA there, and then return to the crawler and keep going. Alternatively, use a small file-based store to record which pages have been fetched and which haven't, and resume from the unfetched part after passing the CAPTCHA (see the checkpoint sketch below).
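
A minimal sketch of that progress record, assuming a plain JSON file is enough as the "file database" (the name done.json and its layout are chosen here for illustration):

import json
import os

CHECKPOINT = 'done.json'            # hypothetical checkpoint file holding finished page ids

def load_done():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, encoding='utf-8') as f:
            return set(json.load(f))
    return set()

def mark_done(done, page):
    done.add(page)
    with open(CHECKPOINT, 'w', encoding='utf-8') as f:
        json.dump(sorted(done), f)

done = load_done()
for page in range(6200, 6214):
    if page in done:
        continue                    # fetched in an earlier run
    # ... fetch and save the page here; on a suspected CAPTCHA, just exit ...
    mark_done(done, page)           # record progress so a restart resumes where it stopped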

A reminder for you and for everyone else scraping Google's cache: once you have fetched a page, save the page itself first. Store it before you parse it.

If the parsing code turns out to be wrong, you can re-parse from the pages you already stored. If you parse while crawling and the parsing is wrong, you have to crawl again, and Google will not give you that many chances.
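
A minimal sketch of that save-first, parse-later split, assuming a raw/ directory (the name is illustrative) for the untouched HTML; the regex extraction from the scripts above can then run over the files on disk as many times as needed:

import os
import requests

base_url = 'https://webcache.googleusercontent.com/search?q=cache:https://mohu.rocks/'
os.makedirs('raw', exist_ok=True)

def save_raw(page, headers, proxies):
    # fetch one cached article page and store the HTML untouched; parse it later from disk
    path = os.path.join('raw', 'article-%d.html' % page)
    if os.path.exists(path):
        return                      # already saved in an earlier run
    resp = requests.get(base_url + 'article/' + str(page), headers=headers, proxies=proxies)
    resp.raise_for_status()
    with open(path, 'w', encoding='utf-8') as f:
        f.write(resp.text)

# a separate pass later re-reads raw/*.html and runs the regex extraction,
# so a parsing bug never forces another round of requests to Google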