Approach
- Search for the papers you need on Web of Science (or another academic search engine) and export the full set of results to Excel (including authors, title, publication year, journal, DOI, and so on).
- Use the DOI as the search key to download each paper from Sci-Hub, and wrap this step in a crawler so it can be batch-processed (a minimal sketch of the DOI lookup follows this list).
- Export the sequence of DOIs and loop over it, fetching the papers one by one.
- For each DOI, open the Sci-Hub result page, locate the element behind the save button, and download the file locally.
- Rename the downloaded PDF files according to your own naming scheme.
- Manually fill in the papers that could not be obtained from Sci-Hub.
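The core of the lookup is simply appending the DOI to a Sci-Hub mirror's root URL, which is what the full code below does with `wd.get(root + doi)`. A minimal sketch (the DOI here is a hypothetical placeholder):

```python
# Build the Sci-Hub request URL for one paper from its DOI.
doi = "10.1000/xyz123"               # hypothetical example DOI
url = "https://sci-hub.se/" + doi    # one of the Sci-Hub mirrors
print(url)                           # -> https://sci-hub.se/10.1000/xyz123
```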
Notes
Coverage of automatic downloads
This method cannot fetch 100% of the papers automatically, because:
- Sci-Hub reportedly covers more than 90% of the papers from the major publishers, but these figures are hearsay. The point is that when you batch-download a hundred or more papers, you will certainly run into items Sci-Hub has not indexed. Actual coverage also depends on factors such as the research field, the publication year, and the language of the paper (in practice I found that coverage of papers in smaller languages such as Spanish is lower than for English-language papers).
- The Web of Science records do not always include a DOI (or a PMID), and those papers simply cannot be looked up on Sci-Hub (a quick check for this is sketched right after this list).
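Before running the crawler, it is worth knowing how many records this affects. A small check, assuming the same Web of Science export ('Full record.xlsx' with a 'DOI' column) that the full code below reads:

```python
import pandas as pd

# Count the exported records that have no DOI at all and therefore
# cannot be looked up on Sci-Hub by this crawler.
table = pd.read_excel('Full record.xlsx', sheet_name='savedrecs')
missing = table['DOI'].isna().sum()
print(f'{missing} of {len(table)} records have no DOI')
```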
Beyond these hard limits, some papers may also fail to download for other reasons:
- Poor network conditions, so the download takes too long.
- Frequent requests getting my machine's IP banned by Sci-Hub.
The workarounds are as follows.
Downloads that take too long
Once the page is open, clicking the `save` button starts the PDF download, which usually finishes within a few seconds, but occasionally it takes much longer (for example when the network stalls). The browser window, however, only stays open for a fixed amount of time, which I tuned to 20 seconds. In other words, 20 seconds after the `save` button is clicked, the download is cut off whether or not the file is complete, and a half-downloaded file is corrupted. When using ChromeDriver, a temporary file with the `.crdownload` extension may be left behind in the download folder.
These temporary files can be identified by their extension: only files ending in `.pdf` are treated as successfully downloaded, complete papers and passed on to the renaming step; otherwise the file is deleted and the item is marked as a failed download.
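A minimal sketch of this extension check, using the download path that appears in the full code below:

```python
import os

path = 'C:/Users/86158/Downloads/'   # download folder used in the full code
for file in os.listdir(path):
    if file.endswith('.pdf'):
        pass                          # complete download: keep it and rename it later
    elif file.endswith('.crdownload'):
        os.remove(path + file)        # half-downloaded temp file: delete and count as failed
```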
To make this less likely, the waiting time needs to be long enough, judged from the typical file size and your network speed. There is no point in waiting too long either, since that slows the crawler down. Before a full run, I suggest downloading two or three papers as a test and choosing a suitable duration from that.
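An alternative to a fixed sleep, which the full code below does not use, is to poll the download folder until Chrome's `.crdownload` temp file disappears or a timeout is reached. A sketch (`wait_for_download` is a hypothetical helper, and it assumes the click has already started the download):

```python
import os
import time

def wait_for_download(path, timeout=60, poll=2):
    """Return True once no .crdownload temp file remains in `path`,
    or False if `timeout` seconds pass first."""
    waited = 0
    while waited < timeout:
        if not any(f.endswith('.crdownload') for f in os.listdir(path)):
            return True
        time.sleep(poll)
        waited += poll
    return False
```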
IP blocked by Sci-Hub
Because I was repeating the same access and download requests at high frequency, Sci-Hub may well decide that I am a robot (which, in fairness, I am) and ban my IP from the site. I came up with three workarounds:
- Use `sleep()` to pause the crawler and lengthen the interval between two consecutive requests. The cost is crawler speed, so there is a trade-off to strike; I set the pause to 5 seconds.
- Switch between domains. Sci-Hub itself has several, currently https://sci-hub.ru, https://sci-hub.st and https://sci-hub.se, so each request can go to a different one (a combined sketch of these first two options follows this list).
- If you are already browsing through a proxy or VPN, you can simply switch to another line whenever Sci-Hub keeps banning you. (As long as I can change my IP, you cannot ban me.)
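A small sketch combining the first two ideas. The full code below uses a fixed 5-second pause and `random.randint` for the mirror choice; the randomized jitter here is an extra assumption, and `next_request_url` is a hypothetical helper:

```python
import random
import time

SCIHUB_MIRRORS = ['https://sci-hub.ru/', 'https://sci-hub.st/', 'https://sci-hub.se/']

def next_request_url(doi):
    # pause with a little random jitter, then pick a mirror at random
    time.sleep(5 + random.uniform(0, 3))   # jitter is an extra assumption, not in the full code
    return random.choice(SCIHUB_MIRRORS) + doi
```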
This problem is also a reminder that heavy crawling really does put extra load on a website's servers. While we save ourselves effort with a crawler, we should also be considerate of the site, all the more so for Sci-Hub, a great project that keeps running as a labor of love while being besieged by academic publishers. (Enough said, I'm off to donate to them.)
Re-running to improve coverage
In my own runs, with a good network connection, the first pass over everything fetched about 60% of the papers; a second pass targeting the missing ones raised coverage to about 80%; after a third pass, coverage was close to 90%, and the remainder could safely be assumed to be papers that Sci-Hub has not indexed or whose DOIs were already missing from the search export. For these, the only option is to output their DOIs or titles with a short piece of code and complete them by hand.
Full code
A crawler for mass downloading based on a reference table (from Web of Science) and the Sci-Hub database.
```python
import pandas as pd
from selenium import webdriver
import time
import os
import random
```
```python
# read the Web of Science export and prepare the columns we need
table = pd.read_excel('Full record.xlsx', sheet_name='savedrecs')

# DOI of every record, as strings (missing DOIs become the string 'nan')
dois = table['DOI'].copy(deep=True)
for i in dois.index:
    dois[i] = str(dois[i])

# download status flag for every record
finished = [0 for i in range(len(dois))]

# surname of the first author, used later when renaming the PDFs
auths = table['Authors'].copy(deep=True)
for i in auths.index:
    name = auths[i].split(';')[0]
    auths[i] = name.split(',')[0]

years = table['Publication Year']
```
```python
def scihub_get(dois, i):
    # open a fresh Chrome session (optionally with a custom download folder)
    chromeOptions = webdriver.ChromeOptions()
    prefs = {"download.default_directory": "D:/Chrome Downloads/Articles"}
    chromeOptions.add_experimental_option("prefs", prefs)
    chrome_driver_path = "D:/Chrome Downloads/chromedriver.exe"
    wd = webdriver.Chrome(executable_path=chrome_driver_path)  # , options=chromeOptions)

    # pick one of the Sci-Hub mirrors at random for this request
    scihub = ['https://sci-hub.ru/', 'https://sci-hub.st/', 'https://sci-hub.se/']
    root = scihub[random.randint(0, 2)]

    # search by doi
    doi = dois[i]
    wd.get(root + doi)
    time.sleep(1)
    try:
        # click the "save" button and give the download up to 20 seconds
        b = wd.find_element_by_xpath('//*[@id="buttons"]/button')
        b.click()
        flag = True
        time.sleep(20)
    except:
        print('access failed. index = ' + str(i) + ' doi = ' + doi)
        flag = False
    # pause between requests so Sci-Hub is less likely to block us
    time.sleep(5)
    wd.quit()
    return flag
```
```python
def rename_file(dois, finished, auths, years, i):
    time.sleep(1)
    path = 'C:/Users/86158/Downloads/'
    dir_list = os.listdir(path)
    if len(dir_list) > 0:
        # look for a file that has not been renamed yet (renamed files start with 'No_')
        found = 0
        for file in dir_list:
            if file[0:3] != 'No_':
                found = 1
                break
        if found == 0:  # no new file: the "save" button never produced a download
            print('download failed. index = ' + str(i) + ' doi = ' + dois[i])
        else:  # we got a new file
            l = file.split('.')
            if l[len(l)-1] != "pdf":  # the file was only half downloaded
                print('download incomplete. index = ' + str(i) + ' doi = ' + dois[i])
            else:
                # rename to No_<index>_<first author>_<year>.pdf
                old = path + file
                auth = str(auths[i])
                try:
                    year = str(int(years[i]))
                except:
                    year = str(years[i])
                index = [str(i//100), str((i%100)//10), str(i%10)]
                new = path + 'No_' + index[0] + index[1] + index[2] + '_' + auth + '_' + year + '.pdf'
                os.rename(old, new)
                finished[i] = 1
```
```python
def article_get(dois, finished, auths, years, i):
    # visit scihub, skipping records that have no DOI
    if dois[i] == 'nan':
        print('doi missing. index = ' + str(i))
    else:
        if scihub_get(dois, i):
            # rename the file that was just downloaded
            rename_file(dois, finished, auths, years, i)
```
```python
# first round (restricted to the first 3 records here as a trial run)
for i in range(0, 3):  # len(dois)):
    article_get(dois, finished, auths, years, i)
```
```python
# try a new round for the missing articles
path = 'C:/Users/86158/Downloads/'
dir_list = os.listdir(path)
for i in range(0, len(dois)):
    # rebuild the expected filename and skip articles that are already there
    auth = str(auths[i])
    try:
        year = str(int(years[i]))
    except:
        year = str(years[i])
    index = [str(i//100), str((i%100)//10), str(i%10)]
    filename = 'No_' + index[0] + index[1] + index[2] + '_' + auth + '_' + year + '.pdf'
    if filename in dir_list:
        continue
    else:
        article_get(dois, finished, auths, years, i)
```
```python
# output information on all missing articles
path = 'C:/Users/86158/Downloads/'
dir_list = os.listdir(path)
count = 0
for i in range(0, len(dois)):
    auth = str(auths[i])
    try:
        year = str(int(years[i]))
    except:
        year = str(years[i])
    index = [str(i//100), str((i%100)//10), str(i%10)]
    filename = 'No_' + index[0] + index[1] + index[2] + '_' + auth + '_' + year + '.pdf'
    if filename in dir_list:
        continue
    else:
        print(filename + " doi: " + dois[i])
        if dois[i] == "nan":
            print("    title: " + table["Article Title"][i])
        count += 1
print(str(count) + " articles missing in total.")
```
About 60% of the papers were downloaded after the first round. Many of the failures were caused by the web server blocking us: our IP address may be identified as a robot or a DDoS attacker because of the regular, frequent visits to the site.

I used the sleep command to lower the visiting frequency and switched randomly between the different mirror sites to deal with the block.

About 80% were downloaded after the second round. Those that remained unfinished were mostly either not in Sci-Hub's database or lacking a DOI number.

However, when an article is only half downloaded and the webdriver reaches its time limit and closes, we can end up with a broken file: it is correctly named and hides among the others, yet it will not open as a PDF. At present I do not know how to check for these errors automatically. The only remedy is a longer waiting period before the webdriver closes, at the cost of a slower overall run.