How to crawl lyrics and music data from netease music with artist's name and song's name

Reference code and the new Selenium API changes

The code at https://blog.51cto.com/u_13403836/5674642 includes a Selenium-based routine for obtaining a song's ID:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from urllib.error import URLError
import time
import math
import random

'''
Headless mode:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(chrome_options=chrome_options)
'''

def getMaxPage(SongComments):
    max_page = SongComments.split('(')[1].split(')')[0]  # parse the comment count out of the string
    offset = 20  # 20 latest comments per page
    max_page = math.ceil(int(max_page) / offset)  # total number of comment pages
    return max_page


def go_nextpage(browser):  # simulate a human click on the "next page" button
    next_button = browser.find_elements_by_xpath("//div[@class='m-cmmt']/div[3]/div[1]/a")[-1]  # locate the next-page button
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")  # scroll to the bottom of the page
    if next_button.text == '下一页':  # the site's "next page" button label
        next_button.click()


def get_comments(is_first, browser):  # scrape the comment data
    items = browser.find_elements_by_xpath("//div[@class='cmmts j-flag']/div[@class='itm']")
    # The first page holds 15 featured comments followed by 20 latest comments; keep only the latest
    if is_first:
        items = items[15:len(items)]
    data_list = []
    data = {}
    for each in items:
        # user id
        userId = each.find_elements_by_xpath("./div[@class='head']/a")[0]
        userId = userId.get_attribute('href').split('=')[1]
        # nickname
        nickname = each.find_elements_by_xpath("./div[@class='cntwrap']/div[1]/div[1]/a")[0]
        nickname = nickname.text
        # comment text
        content = each.find_elements_by_xpath("./div[@class='cntwrap']/div[1]/div[1]")[0]
        content = content.text.split(':')[1]  # full-width colon
        # like count
        like = each.find_elements_by_xpath("./div[@class='cntwrap']/div[@class='rp']/a[1]")[0]
        like = like.text
        if like:
            like = like.strip().split('(')[1].split(')')[0]
        else:
            like = '0'
        # avatar URL
        avatar = each.find_elements_by_xpath("./div[@class='head']/a/img")[0]
        avatar = avatar.get_attribute('src')

        data['userId'] = userId
        data['nickname'] = nickname
        data['content'] = content
        data['like'] = like
        data['avatar'] = avatar
        print(data)
        data_list.append(data)
        data = {}
    return data_list


def gotoSong(SongName):
    try:
        browser = webdriver.Chrome()
        browser.get("https://music.163.com/")
        time.sleep(2)
        input = browser.find_element_by_id("srch")  # locate the search box
        input.send_keys(SongName)  # type the song title
        input.send_keys(Keys.ENTER)  # press ENTER
        time.sleep(1)
        oldiframe = browser.find_element_by_class_name('g-iframe')
        browser.switch_to.frame(oldiframe)
        SongLink = browser.find_element_by_class_name("s-fc7")  # locate the song's hyperlink
        SongLink.click()  # click the song link
        url = str(browser.current_url)
        print(url)  # print the current URL
        SongID = url.split("=")[1]  # extract the song's ID
        print(SongID)
        time.sleep(2)
        SongComments = browser.find_elements_by_xpath("//h3[@class='u-hd4']")[1]  # heading with the latest-comment count
        MaxPage = getMaxPage(SongComments.text)  # total number of comment pages
        current = 1
        is_first = True
        if MaxPage > 10:  # crawl at most ten pages of comments
            MaxPage = 10
        while current <= MaxPage:
            print('crawling page', current)
            if current == 1:
                is_first = True
            else:
                is_first = False
            data_list = get_comments(is_first, browser)
            time.sleep(1)
            go_nextpage(browser)
            time.sleep(random.randint(8, 12))
            current += 1
    except URLError as e:
        print("error: " + e.reason)
    finally:
        browser.close()


if __name__ == "__main__":
    SongName = input("Song title: ")
    gotoSong(SongName)
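As a quick sanity check, the parsing in getMaxPage can be exercised on its own: it pulls the number out of a heading like 最新评论(2335) and divides by 20 comments per page. A standalone re-creation of the function (get_max_page is just an illustrative name):

```python
import math

def get_max_page(comments_heading, per_page=20):
    # Same parsing as getMaxPage above: take the number between the parentheses
    count = int(comments_heading.split('(')[1].split(')')[0])
    # Round up to get the total number of comment pages
    return math.ceil(count / per_page)

print(get_max_page("最新评论(2335)"))  # 117
```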

However, this code is fairly old: find_element_by_x("something") has since been deprecated and replaced by find_element("locator", "something").

Specifically, see the second top-voted answer at https://stackoverflow.com/questions/72773206/selenium-python-attributeerror-webdriver-object-has-no-attribute-find-el : import from selenium.webdriver.common.by import By to specify the locator:

Old API                              New API
find_element_by_id('id')             find_element(By.ID, 'id')
find_element_by_name('name')         find_element(By.NAME, 'name')
find_element_by_xpath('xpath')       find_element(By.XPATH, 'xpath')
find_element_by_class_name('class')  find_element(By.CLASS_NAME, 'class')

After substituting find_element_by_x according to the table above, the code runs successfully. However, without local user data (such as cookies), we cannot log into an account on NetEase Cloud Music, so next we feed the local user data directly into Selenium.

Importing local user data to skip login

First, create an options object with webdriver.ChromeOptions(); the webdriver will later use it to load the local data and some custom arguments:

import os

profile_dir = r"C:\Users\Username\AppData\Local\Google\Chrome\User Data"  # your Chrome user-data path
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--user-data-dir=" + os.path.abspath(profile_dir))
chrome_options.add_argument("--disable-extensions")  # no extensions loaded

Then pass chrome_options to the webdriver:

browser=webdriver.Chrome(options=chrome_options)

At this point the code works and crawls data. (But the script only runs while Chrome is closed; if Chrome already has a window open when the script starts, the script crashes immediately.)

Fixing the crash

It can be fixed with the following two measures (mainly the second):

  • Add headless and related flags, and copy the user data to another location (the key point: the user data folder cannot be shared between your own Chrome and Selenium, or among multiple Selenium instances — each needs exclusive access):

    chrome_options.add_argument("--headless")  # Run Chrome in headless mode
    chrome_options.add_argument("--no-sandbox") # Bypass OS security model
    chrome_options.add_argument("--disable-dev-shm-usage") # Overcome limited resource problems
    chrome_options.add_argument('--disable-gpu') # Disable GPU for headless mode
  • Then point profile_dir = r"C:\Users\Username\AppData\Local\Google\Chrome\User Data" (your Chrome user-data path) at the copied location.
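The copy step from the first bullet can be scripted with shutil; clone_profile is a hypothetical helper name, and both paths are placeholders for your own:

```python
import os
import shutil

def clone_profile(src, dst):
    """Copy the Chrome user-data folder so Selenium gets a private copy
    and does not contend with a running Chrome for the same directory."""
    if os.path.exists(dst):
        shutil.rmtree(dst)  # start from a fresh copy each time
    shutil.copytree(src, dst)
    return dst

# Placeholder paths -- substitute your own:
# clone_profile(r"C:\Users\Username\AppData\Local\Google\Chrome\User Data",
#               r"C:\selenium_profiles\User Data")
```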

Adding fuzzy search (fixing the script's limitation to exact song-title queries)

While using the script, I found it only worked with an exact song title; with a fuzzy query of the form artist name + song title, it broke on the results page. Comparing the two results pages reveals the following difference:

HTML structure of the results page for an exact song-title search:


<div class="srchsongst">
  <div class="item f-cb h-flag ">
    <div class="td"> ... </div>
    <div class="td w0" >
      <div class="sn">
        <div class="text" >
          <a href=" /song?id=22282046">
            <b title="Como La Flor">
              <span class="s-fc7">Como La Flor</span>
            </b>
          </a>
        </div>
      </div>
    </div>
    <div class="td"> ... </div>
    <div class="td wl"> </div>
    <div class="td w2"> ... </div>
    <div class="td">03:04</div>
    ::after
  </div>
  <div class="item f-cb h-flag even "> ... </div>

HTML structure of the results page for a fuzzy search:


<div class="srchsongst">
  <div class="item f-cb h-flag ">
    <div class="td"> ... </div>
    <div class="td w0" >
      <div class="sn">
        <div class="text" >
          <a href=" /song?id=22282046">
            <b title="Como La Flor">Como La Flor</b>
          </a>
        </div>
      </div>
    </div>
    <div class="td"> ... </div>
    <div class="td wl"> </div>
    <div class="td w2"> ... </div>
    <div class="td">03:04</div>
    ::after
  </div>
  <div class="item f-cb h-flag even "> ... </div>

Comparing the two, the only difference is that on the exact-search page, the <b title="Como La Flor">Como La Flor</b> element additionally wraps the title in a <span class="s-fc7"></span>. So my approach here is: inside srchsongst, take the first item f-cb h-flag, find its sn, and select the a inside it — that yields the song's ID.
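Since the a element's href carries the ID in both page variants, the selection logic can be checked offline with the standard library's HTMLParser; this is just an illustration of the extraction, not part of the crawler itself:

```python
from html.parser import HTMLParser

class SongLinkParser(HTMLParser):
    """Grab the first /song?id=... link, mirroring the
    srchsongst -> item -> sn -> a selection described above."""
    def __init__(self):
        super().__init__()
        self.song_id = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.song_id is None:
            href = dict(attrs).get("href", "")
            if "/song?id=" in href:
                self.song_id = href.split("id=")[1].strip()

snippet = '''
<div class="srchsongst">
  <div class="item f-cb h-flag ">
    <div class="sn"><div class="text">
      <a href=" /song?id=22282046"><b title="Como La Flor">Como La Flor</b></a>
    </div></div>
  </div>
</div>
'''
parser = SongLinkParser()
parser.feed(snippet)
print(parser.song_id)  # 22282046
```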

The final modified code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def gotoSearch(SongName):
    browser = webdriver.Chrome(options=chrome_options)
    try:
        browser.get("https://music.163.com/")
        time.sleep(2)
        input = browser.find_element(By.ID, "srch")  # locate the search box
        input.send_keys(SongName)  # type the query
        input.send_keys(Keys.ENTER)  # press ENTER
        time.sleep(1)
        # iFrame or Shadow DOM: if the content is inside an iFrame or a Shadow DOM,
        # you need to switch to that context before accessing the elements.
        oldiframe = browser.find_element(By.CLASS_NAME, 'g-iframe')
        browser.switch_to.frame(oldiframe)
        wait = WebDriverWait(browser, 1)
        srchsongst_div = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'srchsongst')))
        srchsongst_div = browser.find_element(By.CLASS_NAME, "srchsongst")  # container with the song links
        first_child_div = srchsongst_div.find_element(By.CSS_SELECTOR, 'div.item.f-cb.h-flag')
        details = first_child_div.find_element(By.CLASS_NAME, "sn")
        url_element = details.find_element(By.TAG_NAME, 'a')
        url = url_element.get_attribute('href')
        url_element.click()  # behaves differently from browser.get(url), though I don't know why
        url = str(browser.current_url)
        print(url)  # print the current URL
        SongID = url.split("=")[1]  # extract the song's ID
        print("SongID", SongID)
        time.sleep(2)
        SongComments = browser.find_elements(By.XPATH, "//h3[@class='u-hd4']")[1]  # heading with the latest-comment count
        print("SongComments", SongComments)
        MaxPage = getMaxPage(SongComments.text)  # total number of comment pages
        current = 1
        is_first = True
        if MaxPage > 10:  # crawl at most ten pages of comments
            MaxPage = 10
        while current <= MaxPage:
            print('crawling page', current)
            if current == 1:
                is_first = True
            else:
                is_first = False
            data_list = get_comments(is_first, browser)
            time.sleep(1)
            go_nextpage(browser)
            time.sleep(random.randint(8, 12))
            current += 1
    except URLError as e:
        print("error: " + e.reason)
    finally:
        browser.close()

PS: I initially replaced url_element.click() with browser.get(url). The page still navigated to the song page, but then SongComments = browser.find_elements(By.XPATH, "//h3[@class='u-hd4']")[1] raised an error. I don't yet know why (perhaps the browser did not refresh and load the new page, so the element could not be found?).

Fetching lyrics by song ID

According to this post, https://blog.csdn.net/weixin_45576923/article/details/113815385, no account login is needed: the lyric data can be fetched directly via url = f"http://music.163.com/api/song/lyric?id={song_id}+&lv=1&tv=-1".

Concretely:

For example, the URL http://music.163.com/api/song/lyric?id=22282046+&lv=1&tv=-1

returns a JSON string like this:

{"sgc":false,"sfy":false,"qfy":false,"transUser":{"id":12477352,"status":99,"demand":1,"userid":73446966,"nickname":"一片丹心向日葵","uptime":1604988576230},"lyricUser":{"id":12477349,"status":99,"demand":0,"userid":73446966,"nickname":"一片丹心向日葵","uptime":1604988562848},"lrc":{"version":9,"lyric":"[00:00.00] 作词 : A.B. Quintanilla III/Pete Astudillo\n[00:01.00] 作曲 : A.B. Quintanilla III/Pete Astudillo\n[00:17.23]Yo sé que tienes un nuevo amor\n[00:22.35]Sin embargo, te deseo lo mejor\n[00:27.55]Si en mi no encontraste felicidad\n[00:32.88]Tal vez alguien más te la dará\n[00:37.90]Como la flor (Como la flor)\n[00:40.35]Con tanto amor (Con tanto amor)\n[00:43.12]Me diste tú, se marchito\n[00:48.24]Me marcho hoy, yo sé perder\n[00:53.85]Pero, a-a-ay\n[00:56.30]Cómo me duele\n[00:59.35]A-a-ay\n[01:01.57]Cómo me duele\n[01:20.33]Si vieras como duele perder tu amor\n[01:25.60]Con tu adiós te llevas mi corazón\n[01:31.01]No sé si pueda volver a amar\n[01:36.13]Porque te di todo el amor que pude dar\n[01:41.03]Como la flor (Como la flor)\n[01:43.69]Con tanto amor (Con tanto amor)\n[01:46.32]Me diste tú, se marchito\n[01:51.44]Me marcho hoy, yo sé perder\n[01:56.83]Pero, a-a-ay\n[01:59.50]Cómo me duele\n[02:02.77]A-a-ay\n[02:04.77]Cómo me duele\n[02:15.14]Como la flor (Como la flor)\n[02:17.97]Con tanto amor (Con tanto amor)\n[02:20.44]Me diste tú, se marchito\n[02:25.66]Me marcho hoy, yo sé perder\n[02:31.33]Pero, a-a-ay\n[02:33.70]Cómo me duele\n[02:36.93]A-a-ay\n[02:38.89]Cómo me duele\n[02:42.30]A-a-ay\n[02:44.29]Cómo me duele\n"},"tlyric":{"version":10,"lyric":"[by:一片丹心向日葵]\n[00:17.23]我知道你已另寻新欢\n[00:22.35]尽管如此 我仍愿你一切安好\n[00:27.55]如果在我身上你找不到幸福\n[00:32.88]或许别人能够给予你\n[00:37.90]如一朵花(如一朵花)\n[00:40.35]载着万般爱意(载着万般爱意)\n[00:43.12]你赠予给我的花 也已枯萎\n[00:48.24]我必须离开 我知道我输了\n[00:53.85]但是啊\n[00:56.30]我真的很心痛\n[00:59.35]啊\n[01:01.57]我真的很痛\n[01:20.33]你无法想象失去你的爱我有多伤痛\n[01:25.60]随着你的道别 
也带走了我的心\n[01:31.01]我不知道未来是否会再爱一次\n[01:36.13]因为我已把能给的爱都奉献给你\n[01:41.03]如一朵花(如一朵花)\n[01:43.69]载着万般爱意(载着万般爱意)\n[01:46.32]你赠予给我的花 也已枯萎\n[01:51.44]我必须离开 我知道我输了\n[01:56.83]但是啊\n[01:59.50]我真的很心痛\n[02:02.77]啊\n[02:04.77]我真的很痛\n[02:15.14]如一朵花(如一朵花)\n[02:17.97]载着万般爱意(载着万般爱意)\n[02:20.44]你赠予给我的花 也已枯萎\n[02:25.66]我必须离开 我知道我输了\n[02:31.33]但是啊\n[02:33.70]我真的很心痛\n[02:36.93]啊\n[02:38.89]我真的很痛\n[02:42.30]啊\n[02:44.29]我真的很痛"},"code":200}

The blogger's code:

import json
import requests

def get_lyric(song_id):
    headers = {
        "user-agent": "Mozilla/5.0",
        "Referer": "http://music.163.com",
        "Host": "music.163.com"
    }
    if not isinstance(song_id, str):
        song_id = str(song_id)
    url = f"http://music.163.com/api/song/lyric?id={song_id}+&lv=1&tv=-1"
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    json_obj = json.loads(r.text)
    return json_obj["lrc"]["lyric"]
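The lrc.lyric string is in LRC format ([mm:ss.xx] timestamps). If you want the timeline as data rather than text, a small parser can split it into (seconds, line) pairs — an optional post-processing step, not from the original blog:

```python
import re

def parse_lrc(lrc_text):
    """Turn LRC lines like '[00:17.23]Yo sé ...' into (seconds, text) pairs.
    Metadata lines such as '[by:...]' do not match and are skipped."""
    entries = []
    for line in lrc_text.splitlines():
        m = re.match(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)", line)
        if m:
            seconds = int(m.group(1)) * 60 + float(m.group(2))
            entries.append((seconds, m.group(3).strip()))
    return entries

sample = "[00:17.23]Yo sé que tienes un nuevo amor\n[00:22.35]Sin embargo, te deseo lo mejor"
for t, text in parse_lrc(sample):
    print(f"{t:6.2f}s  {text}")
```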

At this point the pipeline — fuzzy search by song title + artist name → song ID → lyrics — is complete. Next, we walk through the previously collected matched_lyrics_with_key.json to fetch timestamped lyrics in bulk.

Batch-crawling lyrics

This stage has two parts.

Batch-fetching song IDs

First, use the pre-compiled JSON file, formatted as follows:

{
    "md5": {"artist" : "song"},
    "md5": {"artist" : "song"},
    ....
}

The output JSON appends the crawled ID to the md5 key with an underscore:

{
    "md5_id": {"artist" : "song"},
    "md5_id": {"artist" : "song"},
    ....
}
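The md5 → md5_id key transformation amounts to two tiny helpers (attach_id and detach_id are hypothetical names; the crawler's code below does the splitting inline with key.split('_'), which works because hex MD5 digests contain no underscores):

```python
def attach_id(md5_key, song_id):
    # 'md5' -> 'md5_id'
    return f"{md5_key}_{song_id}"

def detach_id(key_with_id):
    # 'md5_id' -> ('md5', 'id'); an id of "-1" marks a failed lookup
    md5_key, song_id = key_with_id.rsplit("_", 1)
    return md5_key, song_id

k = attach_id("9e107d9d372bb6826bd81d3542a419d6", 22282046)
print(k)            # 9e107d9d372bb6826bd81d3542a419d6_22282046
print(detach_id(k))
```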

The implementation:

def split_dict_keys(original_dict):
    new_dict = {}
    filted_new_dict = {}
    for key, value in original_dict.items():
        # Split the key on the underscore
        front_part = key.split('_')[0]
        # Use the front part as the new key
        if key.split('_')[1] != "-1":  # a "not found" id may be caused by a network problem rather than a missing song
            new_dict[front_part] = value
            filted_new_dict[key] = value
    return new_dict, filted_new_dict


def findIDs(input_path=r'matched_lyrics_with_key.json',
            output_path=r'matched_lyrics_with_id.json'):
    with open(input_path, 'r', encoding='utf-8') as file0:
        json_data = json.load(file0)

    updated_data = {}
    exist_keys = {}
    if os.path.exists(output_path):
        with open(output_path, 'r', encoding='utf-8') as file2:
            json_data_exist = json.load(file2)
        exist_keys, updated_data = split_dict_keys(json_data_exist)  # load data crawled before

    temp_num = 0
    for key, value in tqdm(json_data.items()):
        temp_num += 1
        if key in exist_keys:  # skip keys that were already crawled
            continue
        else:
            artist, song = next(iter(value.items()))
            search_str = f"{artist}: {song}"
            song_id = None

            try:
                song_id = gotoSearchID(search_str)
            except Exception as e:
                song_id = -1  # mark the entry as not found so it is skipped
                print("something went wrong:", e, "\n")
                if "I/O operation on closed file" in str(e):
                    break

            if song_id is not None:
                new_key = f"{key}_{song_id}"
                updated_data[new_key] = {artist: song}
            if temp_num % 10 == 0:
                with open(output_path, "w", encoding='utf-8') as file1:
                    json.dump(updated_data, file1, indent=4, ensure_ascii=False)
                time.sleep(1)

    with open(output_path, "w", encoding='utf-8') as file1:
        json.dump(updated_data, file1, indent=4, ensure_ascii=False)
    time.sleep(1)


if __name__ == "__main__":
    # SearchName = "Selena: Como La Flor"
    # SongName = "Como La Flor"
    # # gotoSong(SongName)
    # print(gotoSearchLyrics(SearchName))
    findIDs()

Batch-fetching lyrics by ID

Remember the open API http://music.163.com/api/song/lyric?id={song_id}+&lv=1&tv=-1 mentioned earlier? Just substitute each ID from the JSON file into that URL and dump the lyrics field from the response.

The code:

import os
import json
import requests
from tqdm import tqdm


def get_filenames_dict(folder_path):
    # Dictionary of filenames without extension
    filenames_dict = {}
    # Iterate over all files in the folder
    for filename in os.listdir(folder_path):
        # Check if it's a file and not a directory
        if os.path.isfile(os.path.join(folder_path, filename)):
            # Split the filename from its extension and add to the dictionary
            name, _ = os.path.splitext(filename)
            filenames_dict[name] = None  # or any other default value
    return filenames_dict


def get_key_and_id(ori_key):
    key = ori_key.split('_')[0]
    id = ori_key.split('_')[1]
    return key, id


def get_lyric(song_id):
    headers = {
        "user-agent": "Mozilla/5.0",
        "Referer": "http://music.163.com",
        "Host": "music.163.com"
    }
    if not isinstance(song_id, str):
        song_id = str(song_id)
    url = f"http://music.163.com/api/song/lyric?id={song_id}+&lv=1&tv=-1"
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    json_obj = json.loads(r.text)
    return json_obj["lrc"]["lyric"]


input_path = r'C:\Users\iMusic\Desktop\netease_lyrics_crawler\matched_lyrics_with_id - Copy.json'
if os.path.exists(input_path):
    with open(input_path, 'r', encoding='utf-8') as file2:
        json_data_exist = json.load(file2)

output_path = r'\lyrics'
if not os.path.exists(output_path):
    os.makedirs(output_path)

exist_filenames_dict = get_filenames_dict(output_path)


for key, value in tqdm(json_data_exist.items()):
    key, id = get_key_and_id(key)
    if key not in exist_filenames_dict:
        lyric = None
        try:
            lyric = get_lyric(id)
        except Exception:
            pass
        if lyric:
            with open(output_path + '\\' + key + '.json', 'w', encoding='utf-8') as file:
                json.dump(lyric, file, indent=4, ensure_ascii=False)

With that, the whole pipeline is in place. What follows are some issues I ran into while running the script.

Postscript

I/O operation on closed file

This error appears intermittently after the program has been running for a while; once it occurs, that run burns through all remaining entries, returning id = -1 for each. I tried the following fixes, but in the end each only mitigated the problem to some degree.

Looping to run the program multiple times

The basic idea: if one round isn't enough, run several.

The first idea was an infinite loop right inside the Python program (completely ineffective):

while True:
    try:
        findIDs()
    except Exception:
        continue

But the error seems to be bound to the process running the Python script: finishing one round of the function and looping back does not clear it.

The second idea wraps the Python script in a .bat script: after one run of the Python script finishes, launch it again. The .bat below runs the script 100 times (this works):

@echo off
for /L %%i in (1,1,100) do (
    python lyrics_crawler.py
)
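The same relaunch loop can also be driven from Python with subprocess — a sketch (run_with_retries is a hypothetical name, and unlike the .bat above this variant stops early once the script exits cleanly):

```python
import subprocess
import sys

def run_with_retries(script_args, max_rounds=100):
    """Re-launch the crawler in a fresh Python process each round, since the
    I/O error appears to be tied to the process itself. Returns the round
    on which the script exited cleanly."""
    for attempt in range(1, max_rounds + 1):
        result = subprocess.run([sys.executable] + script_args)
        if result.returncode == 0:
            return attempt
    return max_rounds

# e.g. run_with_retries(["lyrics_crawler.py"])
```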

Reducing JSON read/write frequency

Instead of rewriting the whole JSON file after every ID lookup, rewrite it only after every tenth:

temp_num = 0

......

if temp_num % 10 == 0:
    with open(output_path, "w", encoding='utf-8') as file1:
        json.dump(updated_data, file1, indent=4, ensure_ascii=False)
    time.sleep(1)

......

temp_num += 1
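The batching scheme can be packaged into a small writer class — a sketch with hypothetical names (BatchedJsonWriter is not from the original code):

```python
import json
import os
import tempfile

class BatchedJsonWriter:
    """Accumulate entries and dump the whole dict only every `batch` updates,
    plus a final flush, cutting per-item writes down to roughly one in ten."""
    def __init__(self, path, batch=10):
        self.path = path
        self.batch = batch
        self.data = {}
        self.updates = 0
        self.writes = 0  # counts actual file writes, for illustration

    def update(self, key, value):
        self.data[key] = value
        self.updates += 1
        if self.updates % self.batch == 0:
            self.flush()

    def flush(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.data, f, indent=4, ensure_ascii=False)
        self.writes += 1

writer = BatchedJsonWriter(os.path.join(tempfile.gettempdir(), "batched_demo.json"))
for i in range(25):
    writer.update(f"md5_{i}", {"artist": "song"})
writer.flush()  # final write after the loop
print(writer.writes)  # 3 writes instead of 25
```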

Early stopping

In fact, the I/O error is not the only one; several other kinds show up as well, for example:

1:

something went wrong: Message:
Stacktrace:
GetHandleVerifier [0x00007FF6EE2D82B2+55298]
(No symbol) [0x00007FF6EE245E02]
(No symbol) [0x00007FF6EE1005AB]
(No symbol) [0x00007FF6EE14175C]
(No symbol) [0x00007FF6EE1418DC]
(No symbol) [0x00007FF6EE17CBC7]
(No symbol) [0x00007FF6EE1620EF]
(No symbol) [0x00007FF6EE17AAA4]
(No symbol) [0x00007FF6EE161E83]
(No symbol) [0x00007FF6EE13670A]
(No symbol) [0x00007FF6EE137964]
GetHandleVerifier [0x00007FF6EE650AAB+3694587]
GetHandleVerifier [0x00007FF6EE6A728E+4048862]
GetHandleVerifier [0x00007FF6EE69F173+4015811]
GetHandleVerifier [0x00007FF6EE3747D6+695590]
(No symbol) [0x00007FF6EE250CE8]
(No symbol) [0x00007FF6EE24CF34]
(No symbol) [0x00007FF6EE24D062]
(No symbol) [0x00007FF6EE23D3A3]
BaseThreadInitThunk [0x00007FFCA6CF7344+20]
RtlUserThreadStart [0x00007FFCA73626B1+33]

2:

something went wrong: can only concatenate str (not "TimeoutException") to str

3:

An error related to time latency; I can no longer find the original message:

timelatency

4:

The remote server refused access:

remote host reject (exact message forgotten; no screenshot was taken)

These errors generally just mean a particular entry could not be found during the crawl; they affect neither the correctness nor the efficiency of the run. So they can be ignored — the one worth handling is the I/O error, since it is what drags efficiency down.

The handling code:

try:
    song_id = gotoSearchID(search_str)
except Exception as e:
    song_id = -1  # mark the entry as not found so it is skipped
    print("something went wrong:", e, "\n")
    if "I/O operation on closed file" in str(e):
        break

Whenever an error occurs, print the message. Earlier, because nothing was printed, I ended up with a pile of -1 IDs and no idea why; at first I suspected my cookie had been rate-limited, and only later discovered it was the program's own I/O error. Not printing the error message cost me at least two hours.

Finally, match against the error message. I'm not sure what type the raised error actually is, but one thing is certain: any error containing "I/O operation on closed file" is exactly the kind that requires stopping the program and restarting it. So converting the error to a string and matching on it is a convenient and effective approach.
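The string-matching check can be isolated into a predicate, which makes it easy to extend with more fatal markers later (should_restart is a hypothetical name):

```python
FATAL_MARKERS = ("I/O operation on closed file",)

def should_restart(exc):
    """Return True when the exception text signals that the whole process
    must be restarted. Matching on str(exc) sidesteps not knowing the
    concrete exception type."""
    text = str(exc)
    return any(marker in text for marker in FATAL_MARKERS)

try:
    raise ValueError("I/O operation on closed file")
except Exception as e:
    print(should_restart(e))  # True
```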

Pinning Selenium's port

[Screenshot: the webdriver's listening port in the process list]

Each time the webdriver starts, it creates a local process listening on a port. By default the port is randomly assigned at every start, so it is almost never the same twice. Since I run several Selenium instances at once, I worried that random assignment could lead to port conflicts, so I use the following Python code to force the webdriver to listen on a fixed port at startup:

chrome_options.add_argument("--remote-debugging-port=12964")

Keeping Selenium's command-line log output to a minimum

Normally, every webdriver launch makes Selenium print a lot of log output (no screenshot — I had already switched to silent mode by then). The following lines suppress it:

os.environ["webdriver.chrome.silentOutput"] = "true"  # not sure whether this line actually has an effect
chrome_options.add_argument("--log-level=3")  # make it silent

Trimming what Selenium loads at startup

Again achieved by adjusting the arguments passed when initializing the webdriver:

profile_dir = r"User Data"  # your Chrome user-data path
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--user-data-dir=" + os.path.abspath(profile_dir))
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--remote-debugging-port=12964")

chrome_options.add_argument("--headless")  # run Chrome in headless mode
chrome_options.add_argument("--no-sandbox")  # bypass OS security model
chrome_options.add_argument("--disable-dev-shm-usage")  # overcome limited resource problems
chrome_options.add_argument('--disable-gpu')  # disable GPU for headless mode


os.environ["webdriver.chrome.silentOutput"] = "true"
chrome_options.add_argument("--log-level=3")  # make it silent
chrome_service = ChromeService(executable_path=r"chromedriver.exe", log_path=os.devnull)  # intended to make it fully silent, but this did not work

Network instability

After the script had been running for a while, the network became very laggy for reasons I don't know — the two may be unrelated — but network instability definitely can kill the script. So I rebooted the soft router and re-registered on the network to smooth things out.

Resetting the CPU to default settings

[Screenshot: CPU load spikes while the script runs]

As the screenshot above shows, the CPU exhibits many load spikes while the script runs, due to the constant network requests and frequent I/O. I had previously undervolted and overclocked the CPU in the BIOS, locking its voltage at 1.26 V, which may have made the program less stable. So I set the voltage back to auto, removing the artificial cap.


Author: iMusic · Posted on November 16, 2023