Using Scrapy on Raspberry Pi with Python

Problem:

You need to extract some information from a public website and process it to do "things"

Solution:

One solution is to use a web scraping tool. To show how it works, let's look at a simple example that extracts metadata (the links) from a single URL in a single program.

1.- Install Scrapy and resolve any dependency issues; you may need to update the cryptography library [sudo pip3 install cryptography==2.8]

sudo pip3 install scrapy

2.- Check the Scrapy release

scrapy version
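
You can also check the release from Python itself; this one-liner just prints Scrapy's standard __version__ attribute:

python3 -c "import scrapy; print(scrapy.__version__)"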

3.- Save the following code in a file named "myscrapbot.py"

# Get all the links from a single page, without following them
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://raspberrypihack.blogspot.com/',
    ]

    def parse(self, response):
        # Yield one item containing every href found on the page
        yield {
            'enlaces': response.xpath('//a/@href').extract(),
        }
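
Before running the spider you can try the XPath expression interactively with Scrapy's built-in shell, which fetches the page and exposes it as response:

scrapy shell 'https://raspberrypihack.blogspot.com/'
>>> response.xpath('//a/@href').extract()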

4.- Run the spider using the "scrapy" command

scrapy runspider myscrapbot.py -o export.json
5.- The scraped data will be saved in the file export.json
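
Once export.json exists, you can process it with plain Python to "do things" with it. A minimal sketch, assuming only the export.json produced by the command above and the standard json module:

import json

# Load the list of items exported by the spider
with open('export.json') as f:
    items = json.load(f)

# Each item carries an 'enlaces' key with the list of links found
for item in items:
    for link in item['enlaces']:
        print(link)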

6.- How to extract URLs from a database table

First, check the discussion in this post.

Then adapt the code following the recommendation from the first answers of that post.

import scrapy
import mysql.connector

class ProductsSpider(scrapy.Spider):
    name = "Products"
    start_urls = []

    def parse(self, response):
        print(self.start_urls)
        # Yield one item with every href found on the page
        yield {
            'enlaces': response.xpath('//a/@href').extract(),
        }

    def start_requests(self):
        # Read the URLs to crawl from a MySQL table
        conn = mysql.connector.connect(
            user='user',
            passwd='_a30e4qK',
            db='DDBB',
            host='localhost',
            charset="utf8",
            use_unicode=True
        )
        cursor = conn.cursor()
        cursor.execute('SELECT links FROM CUSTOMERS;')
        rows = cursor.fetchall()
        for row in rows:
            self.start_urls.append(row[0])
            # make_requests_from_url was deprecated and later removed
            # from Scrapy, so build the Request explicitly
            yield scrapy.Request(row[0], callback=self.parse)
        conn.close()
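
The spider runs the same way as the first one; for example, if it is saved as "myscrapbot2.py" (the file name is just an illustration):

scrapy runspider myscrapbot2.py -o export.json

Note that the code assumes a MySQL table CUSTOMERS with a column links holding one URL per row.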

7.- How to extract href data?

response.xpath('//a/@href').extract()   # the link targets
response.xpath('//a/text()').extract()  # the link texts
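
Note that these two expressions return separate lists that may not line up (an <a> element can have no text, or more than one text node). A small sketch of a parse() method that keeps each href paired with its own text; the item keys 'enlace' and 'texto' are just illustrative:

def parse(self, response):
    # Iterate over each <a> element so the href and its text stay together
    for a in response.xpath('//a'):
        yield {
            'enlace': a.xpath('@href').extract_first(),
            'texto': a.xpath('string(.)').extract_first(),
        }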
More examples will follow soon...


Source: