Problem:
You need to extract some information from a public website and process it to do "things".
Solution:
One solution is to use a web scraping tool. To see how it works, let's look at a simple example that extracts metadata from a single URL in a single program.
1.- Install Scrapy and resolve any dependency issues; you may need to update the cryptography library [sudo pip3 install cryptography==2.8]
sudo pip3 install scrapy
2.- Check the Scrapy version
scrapy version
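If the installation went well, this prints the installed release; the exact number depends on your setup, e.g.:
Scrapy 1.8.0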
3.- Put the following code in a file named "myscrapbot.py"
# Get the links from a single page without following them
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['https://raspberrypihack.blogspot.com/',]

    def parse(self, response):
        yield {
            'enlaces': response.xpath('//a/@href').extract(),
        }
4.- Run the program using "scrapy"
scrapy runspider myscrapbot.py -o export.json
The data selected for scraping will be saved in the file export.json.
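The output is a JSON list with one object per yielded item; the URLs below are placeholders, the real contents depend on the page:
[
    {"enlaces": ["https://example.com/post-1", "https://example.com/post-2"]}
]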
5.- How to extract data from URLs stored in a database table
First, you should check the discussion in this post.
import scrapy
import mysql.connector

class ProductsSpider(scrapy.Spider):
    name = "Products"
    start_urls = []

    def parse(self, response):
        print(self.start_urls)
        yield {
            'enlaces': response.xpath('//a/@href').extract(),
        }

    def start_requests(self):
        # Connect to the local MySQL database that stores the URLs
        conn = mysql.connector.connect(
            user='user',
            passwd='_a30e4qK',
            db='DDBB',
            host='localhost',
            charset="utf8",
            use_unicode=True
        )
        cursor = conn.cursor()
        cursor.execute('SELECT links FROM CUSTOMERS;')
        rows = cursor.fetchall()
        for row in rows:
            ProductsSpider.start_urls.append(row[0])
            # make_requests_from_url() is deprecated and removed in recent
            # Scrapy releases; build the Request directly instead
            yield scrapy.Request(row[0], callback=self.parse)
        conn.close()
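If you want to test this spider without an existing database, a table like CUSTOMERS can be created and seeded with a short script. This is a minimal sketch: the table name, column name, and credentials simply mirror the spider above and are assumptions, not a fixed schema.
import mysql.connector

# One-off setup script: creates the (assumed) CUSTOMERS table with a
# 'links' column and seeds it with a couple of URLs to crawl
conn = mysql.connector.connect(user='user', passwd='_a30e4qK',
                               db='DDBB', host='localhost')
cursor = conn.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS CUSTOMERS (links VARCHAR(255));')
cursor.execute("INSERT INTO CUSTOMERS (links) VALUES (%s), (%s);",
               ('https://raspberrypihack.blogspot.com/', 'https://example.com/'))
conn.commit()
cursor.close()
conn.close()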
6.- How to extract href data?
response.xpath('//a/@href').extract()   # the href attribute of every link
response.xpath('//a/text()').extract()  # the visible text of every link
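You can try both expressions without writing a spider by feeding some HTML to Scrapy's Selector directly; the snippet below uses a made-up one-link document just for illustration:
from scrapy.selector import Selector

# A tiny hypothetical HTML fragment with a single link
html = '<a href="https://example.com">Example site</a>'
sel = Selector(text=html)
print(sel.xpath('//a/@href').extract())   # ['https://example.com']
print(sel.xpath('//a/text()').extract())  # ['Example site']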
More to come soon...