Trying out Playwright for web scraping

#software

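Playwright is an open-source tool from Microsoft for automated testing of Chromium, Firefox, and WebKit; this post uses its Python package. Unsurprisingly, tools like this often end up being used for web scraping.

First, install Playwright: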

pip install --upgrade pip
pip install playwright
playwright install

The steps above install the Python playwright library and then download the Chromium, Firefox, and WebKit browser builds that it drives.
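A quick way to confirm the first half of that install is to check whether the package can be found from Python. This is a minimal sketch; the helper name `playwright_installed` is made up for illustration, and it only checks the Python package, not whether `playwright install` has downloaded the browsers.

```python
import importlib.util


def playwright_installed() -> bool:
    """Return True if the playwright package is importable in this environment."""
    # find_spec looks the package up without actually importing it
    return importlib.util.find_spec("playwright") is not None


if __name__ == "__main__":
    print("playwright importable:", playwright_installed())
```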

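The official documentation describes many ways to use it, and there is also an article comparing Playwright, Selenium, and similar tools.

Taking the DECIPHER gene page as an example, run the command below to open a new browser window for recording; Chromium is used by default.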

playwright codegen -o test.py

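In the browser, enter the URL https://www.deciphergenomics.org/genes and wait for the page to finish loading. Then choose 100 in the rows-per-page dropdown; the pager now shows 56 pages in total. Click page 2, stop recording at that point, and look at the generated code.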

from playwright.sync_api import Playwright, sync_playwright


def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()

    # Open new page
    page = context.new_page()

    # Go to https://www.deciphergenomics.org/genes
    page.goto("https://www.deciphergenomics.org/genes")

    # Select 100
    page.select_option("[aria-label=\"Number of rows per page\"]", "100")

    # Click [aria-label="Page 2 of 56"]
    page.click("[aria-label=\"Page 2 of 56\"]")

    # Close page
    page.close()

    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)

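The code above was recorded automatically by Playwright. To scrape all 56 pages it needs a few changes, which are marked by the added comments in the code below.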

from playwright.sync_api import Playwright, sync_playwright
import os
import time

def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()

    # Open new page
    page = context.new_page()

    # Go to https://www.deciphergenomics.org/genes
    page.goto("https://www.deciphergenomics.org/genes")
    # Added: wait for the first page load to finish
    time.sleep(40)

    # Select 100
    page.select_option("[aria-label=\"Number of rows per page\"]", "100")
    # Added: give the table time to re-render with 100 rows
    time.sleep(10)
    # Added: make sure the output directory exists, then grab the HTML and save it
    os.makedirs("html", exist_ok=True)
    html = page.content()
    with open("html/page1.html", "w", encoding="utf-8") as f:
        f.write(html)

    # Click [aria-label="Page 2 of 56"]
    # Added: turn the single click into a loop over all 56 pages
    n = 2
    while n <= 56:
        page.click("[aria-label=\"Page {} of 56\"]".format(n))
        time.sleep(10)
        # Re-read the page after each click; otherwise page 1 is saved every time
        html = page.content()
        with open("html/page{}.html".format(n), "w", encoding="utf-8") as f:
            f.write(html)
        n += 1

    # Close page
    page.close()

    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
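The fixed sleeps and the hard-coded 56 are the fragile parts of the script above. As a rough sketch of one alternative, the loop could be pulled into a helper that takes the page total as a parameter and waits on the page's load state instead of a timer. The names `page_selector` and `save_remaining_pages` are hypothetical, and whether `"networkidle"` is the right signal for this particular site is an assumption that has not been checked against it.

```python
from pathlib import Path


def page_selector(n: int, total: int) -> str:
    """Build the aria-label selector that codegen recorded for a pager link."""
    return '[aria-label="Page {} of {}"]'.format(n, total)


def save_remaining_pages(page, total: int, out_dir: str = "html") -> int:
    """Click through pages 2..total, saving each one; return the number of files written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for n in range(2, total + 1):
        page.click(page_selector(n, total))
        # Assumption: the table is refreshed over the network, so waiting for
        # the network to go idle is a reasonable stand-in for time.sleep(10).
        page.wait_for_load_state("networkidle")
        # Read the HTML after the click so each file holds its own page
        target = out / "page{}.html".format(n)
        target.write_text(page.content(), encoding="utf-8")
        written += 1
    return written
```

Inside `run`, after page 1 has been saved, the while loop would then become `save_remaining_pages(page, 56)`. If the saved files turn out to contain stale rows, falling back to a short `time.sleep` after each click is the simpler fix.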

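With the recording feature, scraping code can be written very quickly. If you also need to set the User-Agent or other options, see the Playwright documentation, or take a quick look at the built-in help: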

playwright codegen --help
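In the Python script itself, the User-Agent can be set when the browser context is created, through the `user_agent` argument of `browser.new_context`. The sketch below is only an illustration: the helper `new_context_with_ua` and the `EXAMPLE_UA` string are invented for this example, and headless mode is chosen just to keep it short.

```python
# Hypothetical User-Agent string, used here purely for illustration
EXAMPLE_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1"


def new_context_with_ua(browser, user_agent: str = EXAMPLE_UA):
    """Create a browser context whose pages send the given User-Agent."""
    return browser.new_context(user_agent=user_agent)


def main() -> None:
    # Imported here so the helper above can be used without launching anything
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True)
        context = new_context_with_ua(browser)
        page = context.new_page()
        page.goto("https://www.deciphergenomics.org/genes")
        # Print the User-Agent as the page itself sees it
        print(page.evaluate("navigator.userAgent"))
        context.close()
        browser.close()
```

Calling `main()` runs the check end to end; in the scraping script, only the `browser.new_context()` line would need to change.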