
ADAC_Parser

ADAC (Allgemeiner Deutscher Automobil-Club) is a German automobile club, which, among other things, is known for its independent crash test program for children's car seats.


Here are the key points about the ADAC program:

  • Strict criteria: ADAC applies some of the strictest standards in the world when testing child car seats. It simulates various accident scenarios, including frontal and side collisions, to assess the strength and protection of the seat.
  • Comprehensive approach: ADAC is not limited to crash tests. It also evaluates usability, ergonomics, ease of installation and other important factors.
  • Publicly available results: All test results are published on the ADAC website and include a detailed report on each tested seat, describing its strengths and weaknesses.
  • Influential opinion: Thanks to its reputation, ADAC has a significant impact on the child car seat market, encouraging manufacturers to improve the safety of their products. The ADAC program is an important resource that helps parents make the right choice and ensure maximum safety for their children on the road.

Additionally:

  • Not only crash tests: ADAC also provides information about traffic rules, insurance, roadside assistance and more.
  • International recognition: ADAC test results are used all over the world.

Installing the required library:

Windows:
    pip install scrapy
Linux (Debian/Ubuntu; pip install scrapy also works here):
    sudo apt install python3-scrapy
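
To check that the installation succeeded, print the installed Scrapy version:

    scrapy version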

Usage:

scrapy crawl adac
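
If you want to confirm that the spider is registered before crawling, list the spiders Scrapy finds in the project:

scrapy list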

Test

scrapy shell
fetch('link')
response.css('main > div > h3 > div > div::text').get()
response.xpath("//title/text()").get()
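
For example, the selectors can be tried against the test overview page that the spider uses as its start URL (a sketch; the page uses generated class names, so the CSS selector may return nothing if the markup has changed):

scrapy shell
fetch('https://www.adac.de/rund-ums-fahrzeug/ausstattung-technik-zubehoer/kindersitze/kindersitztest/')
response.xpath("//title/text()").get()
response.css('main > div > h3 > div > div::text').get()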

Available Shortcuts

  • shelp() - print a help with the list of available objects and shortcuts

  • fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly. You can optionally ask for HTTP 3xx redirections to not be followed by passing redirect=False

  • fetch(request) - fetch a new response from the given request and update all related objects accordingly.

  • view(response) - open the given response in your local web browser, for inspection. This will add a <base> tag to the response body so that external links (such as images and style sheets) display properly. Note, however, that this creates a temporary file on your computer which won't be removed automatically.

Settings

The LOG_LEVEL setting controls how much information is displayed while the spider runs.
There are several logging levels:

logging.CRITICAL - for critical errors (highest severity)
logging.ERROR - for regular errors
logging.WARNING - for warning messages
logging.INFO - for informational messages
logging.DEBUG - for debugging messages (lowest severity)
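
For example, to show only warnings and errors while the spider runs, the level can be set in settings.py (or per spider through custom_settings); WARNING here is just an illustrative choice:

LOG_LEVEL = "WARNING"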

Crawl responsibly by identifying yourself (and your website) on the user-agent:

USER_AGENT = "adac (+http://www.yourdomain.com)"

Configure maximum concurrent requests performed by Scrapy (default: 16):

CONCURRENT_REQUESTS = 16

Enable and configure HTTP caching (disabled by default). More information is available in the Scrapy documentation on HttpCacheMiddleware.

HTTPCACHE_ENABLED = True
REDIRECT_ENABLED = False
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302]

Out of the box, Scrapy can generate an "export file" with the scraped data to be consumed by other systems, producing feeds of the scraped items in multiple serialization formats and storage backends.
When using the feed exports you define where to store the feed using one or multiple URIs (through the FEEDS setting). This project uses the older FEED_URI and FEED_FORMAT settings, which serve the same purpose:

FEED_FORMAT="json"
FEED_URI="adac.json"
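
FEED_URI and FEED_FORMAT are deprecated in recent Scrapy releases; the same output can also be described with the FEEDS setting (a sketch assuming Scrapy 2.1 or later):

FEEDS = {
    "adac.json": {"format": "json"},
}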

Description

The parser extracts the characteristics of child car seats and collects this information into a JSON file:

  • Name
  • ID
  • ADAC Rating
  • ADAC security
  • Reliability
  • Service
  • Convenience
  • Environmental friendliness
  • Permissible weight of the child
  • Permissible height of the child
  • ADAC Age Group

Code

config.py

Builds the full path to the adac.json file.

  • path = "/upload/": the path to the directory (folder) where the file will be located.
  • os.path.join() concatenates (combines) parts of a path according to the rules of the operating system.
    • path - the first argument; it points to the target folder.
    • "adac.json" - the second argument; it is the name of the file.

import os

path = "/upload/"
full_path = os.path.join(path, "adac.json")  # e.g. "/upload/adac.json"

parser.py
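
The snippets below assume the following imports at the top of parser.py (config is the project's own config.py, so it has to be importable from the spider's location):

import os
import traceback

import scrapy

import config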

number_to_word()

This function converts a numerical ADAC score into a verbal rating: the score is passed as the number argument, and the matching word is returned.

def number_to_word(number):
    if 0.6 <= number <= 1.5:
        return "Отлично"      # Excellent
    if 1.6 <= number <= 2.5:
        return "Хорошо"       # Good
    if 2.6 <= number <= 3.5:
        return "Нормально"    # Okay
    if 3.6 <= number <= 4.5:
        return "Плохо"        # Bad
    if 4.6 <= number <= 5.5:
        return "Очень плохо"  # Very bad

translate()

This function translates German words into Russian. scraped_info = { ... } creates a dictionary that stores pairs of German words and their Russian translations; it contains every pair of words the function needs to translate.

scraped_info.get(word, word):

  • get(word, word) is a dictionary method that tries to find the word key in the dictionary.
  • If the key is found, it returns the corresponding value (Russian translation).
  • If the key is not found, word itself is returned (not translated).
def translate(word):
    scraped_info = {
        'Zugelassenes Gewicht des Kindes': 'Допустимый вес ребенка',
        'k.A.': 'Нет информации',
        'Baby': 'Младенец',
        'Kleinkind': 'Малыш',
        'Kind': 'Ребенок',
        'Baby und Kleinkind': 'Младенец и малыш',
        'Kleinkind und Kind': 'Малыш и ребенок',
        'Baby, Kleinkind, Kind': 'Младенец, малыш, ребенок',
        'Zugelassene Größe des Kindes': 'Допустимый рост ребенка',
        'ADAC Alterklasse': 'Возрастная группа ADAC',
        'Testergebnis': 'ADAC безопасность',
        'Sicherheit': 'Надежность',
        'Bedienung': 'Обслуживание',
        'Ergonomie': 'Удобство',
        'Schadstoffe': 'Экологичность',
    }
    return scraped_info.get(word, word)
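
Example calls (the second call uses a hypothetical word that is not in the dictionary, so it comes back unchanged):

translate('Sicherheit')        # 'Надежность'
translate('Isofix Station')    # 'Isofix Station'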

class ADACSpider()

  1. Defines a class named ADACSpider that inherits from the scrapy.Spider class, making it a Scrapy-compatible spider.
  2. Spider attributes:
    • name: sets the name of the spider, which is used to identify it within the scraping project.
    • allowed_domains: specifies the domain(s) the spider is permitted to crawl. In this case it is restricted to adac.de.
    • start_urls: the starting URLs from which the spider begins its crawl. It contains the link to the ADAC car seat test page.
  3. custom_settings: this dictionary overrides default Scrapy settings for this spider only.
    • FEED_URI: defines the output file path where the scraped data will be saved. It uses the config.full_path variable defined in config.py above.
    • FEED_FORMAT: specifies the format of the output data, json in this case.
class ADACSpider(scrapy.Spider):
    name = "adac"
    allowed_domains = ["adac.de"]
    start_urls = [
        "https://www.adac.de/rund-ums-fahrzeug/ausstattung-technik-zubehoer/kindersitze/kindersitztest/",
    ]
    
    custom_settings = {
        'FEED_URI': config.full_path,
        'FEED_FORMAT': 'json',
    }

__init__()
  • Inheritance (super()):
    • super() calls the __init__ method of the parent class (scrapy.Spider in this case).
    • This ensures that any initialization logic from the parent class is executed before the custom initialization logic.
    • *args and **kwargs pass any positional or keyword arguments received by the spider's constructor on to the parent class's constructor.
  • File deletion:
    • This section checks whether a file already exists at the path specified by config.full_path.
    • If the file exists, it is deleted with os.remove(config.full_path), so each run starts with a fresh output file.
    • os and os.path are Python modules that provide functions for interacting with the operating system's file system.
def __init__(self, *args, **kwargs):
        super(ADACSpider, self).__init__(*args, **kwargs)
        if os.path.exists(config.full_path):
            os.remove(config.full_path)

parse_product()
  1. Extracting Data from JavaScript:
    • This code extracts a unique product ID from a JavaScript snippet embedded in the HTML.
    • It searches for a specific string "-id-" in the JavaScript content, then extracts the three characters after it, assuming they represent the ID.
  2. Extracting Basic Product Information:
    • A dictionary named data is created to store the extracted data.
    • It initializes the dictionary with:
      • Кресло: The car seat name, extracted from the <h1> tag.
      • ID: The product ID extracted earlier.
  3. Looping Through Sections:
    • Iterating through all div elements within the main section of the page.
    • The goal is to parse different sections of the page based on their content.
  4. Parsing ADAC Rating Section:
    • This block processes the section containing the ADAC rating.
    • It extracts the rating, translates the values, and stores them in the data dictionary.
  5. Parsing General Data Section:
    • This block handles the section with general car seat data, such as child weight, height, and age group.
    • It extracts these values, translates them, and stores them in the data dictionary.
  6. Yielding Extracted Data:
    • Finally, the yield data statement returns the data dictionary containing all the extracted information to the Scrapy engine.
def parse_product(self, response):
        info = response.css('body > div > div > script::text').get()
        start = info.find("-id-") + 4
        ID = info[start:start+3]
        data = {
            "Кресло": response.css("h1::text").get(),
            "ID": ID,
        }
        
        divs = response.css("main > div")
        for div in divs:
            if "ADAC Urteil" in div.get():
                adac_rating = div.css("h3 div::text").get()   
                if adac_rating is None:
                    adac_rating = response.css("main > div > h3 > div > div::text").get()
                if isinstance(adac_rating, tuple):
                    adac_rating = ''.join(adac_rating)
                adac_rating = float(adac_rating.replace(",", "."))  
                data["ADAC Рейтинг"] = adac_rating
                data[translate("Testergebnis")] = number_to_word(adac_rating)

                for button in div.css("button"):
                    key = translate(button.css("p::text").get())
                    if key != 'Verarbeitung und Reinigung':
                        n = float(button.css("dd p::text").get().replace(',', '.'))
                        data[key] = number_to_word(n)

            if "Allgemeine Daten" in div.get():
                child_weight = response.css('main > div.sc-eCYdKt.cOUdGC.sc-jSMdHm.sc-jCPYrn.eIeQCA.jpfxCx > div > table > tbody > tr:nth-child(2) > td:nth-child(2)::text').get()
                data[translate("Zugelassenes Gewicht des Kindes")] = translate(child_weight.replace(' bis ', '-').replace('bis ', 'до ').replace('kg', 'кг'))
                
                child_height = response.css('main > div.sc-eCYdKt.cOUdGC.sc-jSMdHm.sc-jCPYrn.eIeQCA.jpfxCx > div > table > tbody > tr:nth-child(3) > td:nth-child(2)::text').get()
                data[translate("Zugelassene Größe des Kindes")] = translate(child_height.replace(' cm bis ', '-').replace('cm', 'см'))                  
                
                child_age_group = response.css('main > div.sc-eCYdKt.cOUdGC.sc-jSMdHm.sc-jCPYrn.eIeQCA.jpfxCx > div > table > tbody > tr:nth-child(4) > td:nth-child(2)::text').get()
                data[translate("ADAC Alterklasse")] = translate(child_age_group)


        yield data
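
Each yielded item is a flat dictionary matching the fields listed in the Description section above. An illustrative shape only (the values are placeholders, not real test results):

{
    "Кресло": "<seat name>",
    "ID": "<three characters after -id->",
    "ADAC Рейтинг": 1.7,
    "ADAC безопасность": "Хорошо",
    "Надежность": "Хорошо",
    "Обслуживание": "Хорошо",
    "Удобство": "Хорошо",
    "Экологичность": "Хорошо",
    "Допустимый вес ребенка": "<weight range>",
    "Допустимый рост ребенка": "<height range>",
    "Возрастная группа ADAC": "<age group>"
}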

parse()
  1. Extracting Product URLs:
    • Iterating through each table row (tr) on the page, likely from a table listing car seats.
    • It checks if each row contains a link (a tag) using row.css("a").
    • If a link is found, the first link in the row (an <a> selector, which response.follow() accepts directly) is assigned to the url variable.
  2. Following URLs with response.follow():
    • This block handles following the extracted URLs.
    • yield response.follow(url, callback=self.parse_product) uses the response.follow() method from Scrapy to create a new request to the extracted url.
    • The callback=self.parse_product argument specifies that the parse_product function (which was defined earlier) should be used to parse the content of the product page.
    • It uses a try...except block to handle potential errors while following the URL. If an error occurs, it prints an error message and uses traceback.print_exc() to display a detailed traceback of the exception.
  3. Handling Pagination:
    • It looks for pagination links within a div element with the attribute data-testid="pagination".
    • It then iterates through each pagination link and uses yield response.follow(a, callback=self.parse) to follow the link.
    • The callback=self.parse argument tells Scrapy to call the parse function again on the new page, effectively continuing the crawling process through the pagination links.
def parse(self, response):
        for row in response.css('tr'):
            links = row.css("a")
            if links:
                url = links[0]
                try:
                    yield response.follow(url, callback=self.parse_product)
                except Exception as e:
                    print(f"Can't parse {url}")
                    traceback.print_exc()

        
        for a in response.css('div[data-testid="pagination"] a'):
            yield response.follow(a, callback=self.parse)