I am trying to scrape questions from the chegg.com site and save each one
as an HTML file.
The problem occurs when the page contains images.
The image link is either internal, such as https://media.cheggcdn.com/media/eb7...0307/phpDbKTCI (see the question at https://www.chegg.com/homework-help/...t2-u-q69085812),
or external and starting with //, such as //d2vlcm61l7u1fs.cloudfront.net/media%2Fb2b%2Fb2b8dcb5-ae0d-4ad1-9156-eda0dd651978%2FphpX4CpFQ.png (see the question at https://www.chegg.com/homework-help/...s-ch-q10531553).
When the link is external, the images do not appear in the saved HTML file.
Errors in the browser console:
GET file://d2vlcm61l7u1fs.cloudfront.net/media%2F078%2F078e768f-d236-48fa-aff9-3365467e00d3%2FphpjRcT9F.png net::ERR_INVALID_URL
....
My code:
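from bs4 import BeautifulSoup

# `scraper` is a session object created earlier (not shown here)
url = ''
headers = {
    'authority': 'www.chegg.com',
    # ....
    # ...
}
response = scraper.get(url, headers=headers)
# Parse the page and keep only the question container
soup = BeautifulSoup(response.content, "html.parser")
question = soup.find("div", {"class": "rKMzl"})
with open("d.html", "w", encoding='utf-8') as file:
    file.write(str(question))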
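
A possible explanation, offered as a sketch rather than a confirmed fix: the links starting with // have no scheme, so when d.html is opened from disk the browser resolves them against file://, which matches the ERR_INVALID_URL error above. The sketch below rewrites each image link to an absolute URL before the file is saved. The helper names `resolve_src` and `fix_image_links` are hypothetical, the base URL https://www.chegg.com/ is an assumption, and it assumes the images are ordinary img tags with a src attribute.

```python
from urllib.parse import urljoin


def resolve_src(src, page_url):
    """Resolve an image link against the page it was scraped from.

    A link such as //host/path inherits the scheme of page_url;
    a link that is already absolute is returned unchanged.
    """
    return urljoin(page_url, src)


def fix_image_links(container, page_url):
    """Rewrite every <img src> inside a parsed BeautifulSoup tag in place.

    `container` is assumed to be the tag returned by soup.find(...).
    """
    for img in container.find_all("img", src=True):
        img["src"] = resolve_src(img["src"], page_url)
    return container


# Assumed base URL; in practice pass the URL of the scraped question page.
base = "https://www.chegg.com/"
external = "//d2vlcm61l7u1fs.cloudfront.net/media%2Fb2b%2Fb2b8dcb5-ae0d-4ad1-9156-eda0dd651978%2FphpX4CpFQ.png"
print(resolve_src(external, base))
```

With this approach, calling fix_image_links on the div found by soup.find, just before writing d.html, would leave the internal links untouched and give the external ones an https: prefix.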