ViniciusReisch/Robin

Description ⚙

Robin is a site that helps people choose the parts to assemble their computer. Robin collects data from computer-hardware sales sites and returns the most affordable prices.

Used Tools 🛠

  • Selenium
  • MySQL
  • Python

Web-Scraping

The way we found to obtain all the data about computer parts was web scraping, a form of data mining that lets us extract data from websites and convert it into structured information for later analysis. The framework we used for this was Selenium, in Python.

Each site has a specific structure for its data:

Pichau

Pichau was certainly the site we had the most difficulty with: the structure of the site changes depending on which computer opens it. The way we found to get the data working across the different computers we ran the code on was to use the socket library to save a specific structure for each IP.

import socket

hostName = socket.gethostname()        # local host name
IP = socket.gethostbyname(hostName)    # IP address of this machine
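One way the "specific structure for each IP" idea could be wired up is a lookup table keyed by IP address. Everything below (the addresses, the selector names, and the selectors_for helper) is an illustrative assumption, not the project's actual code:

```python
import socket

# Hypothetical sketch of saving "a specific structure for each IP":
# the addresses and selector names below are illustrative only.
SELECTORS_BY_IP = {
    "192.168.0.10": {"title": "h2", "image": "img"},
    "192.168.0.11": {"title": "h2.product-title", "image": "img.product-img"},
}
DEFAULT_SELECTORS = {"title": "h2", "image": "img"}

def selectors_for(ip):
    """Return the selector set saved for this IP, or a default."""
    return SELECTORS_BY_IP.get(ip, DEFAULT_SELECTORS)

# Usage: look up the structure for the machine running the scraper.
# ip = socket.gethostbyname(socket.gethostname())
# selectors = selectors_for(ip)
```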

Another issue we had was the small fade-in effect applied from the third product row onward: the item images only appear in the site's HTML source after you scroll down.

To solve this problem, we use a Selenium command to get the full height of the page and scroll down automatically until the bottom is reached.

from selenium import webdriver

driver = webdriver.Chrome()
height = driver.execute_script("return document.body.scrollHeight")
scroll = 0
while scroll < height:
    driver.execute_script(f"window.scrollTo(0, {scroll});")
    scroll += 200

Problems solved, now it's time to get each part's specifications: price, name, and so on.

With that in mind, we chose this list of specifications for Pichau:

Specification     | Data
----------------- | ----
Installment price | R$ 771,29
Price             | R$ 678,74
Name              | MEMORIA TEAM GROUP T-FORCE DELTA RGB, 8GB(1X8GB), DDR4, 3200MHZ, C16, BRANCO, TF4D48G3200HC16C01
Link              | https://www.pichau.com.br/memoria-team-group-t-force-delta-rgb-8gb-1x8gb-ddr4-3200mhz-c16-branco-tf4d48g3200hc16c01
Image link        | https://media.pichau.com.br/media/catalog/product/cache/2f958555330323e505eba7ce930bdf27/t/f/tf4d48g3200hc16c011.jpg
Scraping time     | 09/07/2022 23:22:34
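Prices come back from the page as Brazilian-formatted strings such as "R$ 678,74". A hypothetical helper for converting them to floats (the function name and format handling are our assumption, not the project's code) might look like:

```python
def parse_price(text: str) -> float:
    """Convert a Brazilian price string like 'R$ 1.771,29' to a float.

    Hypothetical helper: assumes '.' separates thousands and ',' separates
    decimals, as on Pichau's listings.
    """
    digits = text.replace("R$", "").strip()
    return float(digits.replace(".", "").replace(",", "."))
```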

How to perform the scraping on Pichau:

Each information block below pairs the web-scraping code with an explanation:
# Crawling Products == Name
namesProducts = []
product = driver.find_elements('tag name', 'h2')
for i in product:
    if i.text == "":       # skip empty headings
        continue
    namesProducts.append(i.text)

On Pichau's website, the product titles are kept in h2 tags, so we pull every h2 tag from the page using find_elements('tag name', 'h2').

# Crawling Products == Image
imgProducts = []
scroll = 0
while scroll < height:
    driver.execute_script(f"window.scrollTo(0, {scroll});")
    product = driver.find_elements('tag name', 'img')
    for e in product:
        src = e.get_attribute('src')
        if src and 'product' in src:     # keep only product images
            imgProducts.append(src)
    scroll += 200
imgProducts = list(dict.fromkeys(imgProducts))   # drop duplicates, keep order

The product images are kept in img tags. Because of the fade-in issue explained above, we run driver.execute_script(f"window.scrollTo(0, {scroll});") inside a while loop so the code scrapes as it scrolls down the page, and we separate the product images from the rest using:

for e in product:
    if 'product' in e.get_attribute('src'):
        imgProducts.append(e.get_attribute('src'))

UNDER DEVELOPMENT


📜 Project developed in the Entra21 morning Python class
