Home
ADAC (Allgemeiner Deutscher Automobil-Club) is a German automobile club, which, among other things, is known for its independent crash test program for children's car seats.
Here are the key points about the ADAC program:
- Strict criteria: ADAC applies test criteria that are widely regarded as among the strictest for child car seats. The tests simulate several accident scenarios, including frontal and side collisions, to assess the strength and protection of the seat.
- Comprehensive approach: ADAC is not limited to crash tests. It also evaluates usability, ergonomics, ease of installation and other important factors.
- Publicly available results: All test results are published on the ADAC website and include detailed reports on each tested seat, describing its strengths and weaknesses.
- Influential opinion: Due to its reputation, ADAC has a significant impact on the child car seat market, encouraging manufacturers to improve the safety of their products. The ADAC program is an important resource that helps parents choose a seat and keep their children as safe as possible on the road.
Additionally:
- Not only crash tests: ADAC also provides information about traffic rules, insurance, roadside assistance and more.
- International recognition: ADAC test results are referenced well beyond Germany.
Windows:
pip install scrapy
Linux (Debian/Ubuntu):
sudo apt install python3-scrapy
scrapy crawl adac
scrapy shell
fetch('link')
response.css('main > div > h3 > div > div::text').get()
response.xpath("//title/text()").get()
- shelp() - print help with the list of available objects and shortcuts.
- fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly. You can optionally ask for HTTP 3xx redirections not to be followed by passing redirect=False.
- fetch(request) - fetch a new response from the given request and update all related objects accordingly.
- view(response) - open the given response in your local web browser for inspection. This adds a <base> tag to the response body so that external links (such as images and style sheets) display properly. Note, however, that this creates a temporary file on your computer, which won’t be removed automatically.
The LOG_LEVEL setting controls how much information Scrapy logs while the spider runs.
The available levels, from highest to lowest severity:
logging.CRITICAL - for critical errors (highest severity)
logging.ERROR - for regular errors
logging.WARNING - for warning messages
logging.INFO - for informational messages
logging.DEBUG - for debugging messages (lowest severity)
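As a small illustration of how the threshold works, the sketch below uses only the standard logging module. The value "INFO" is an assumed example for a settings.py fragment, not the project's actual choice (Scrapy's default is DEBUG). It lists which levels would still be shown at that threshold.

```python
import logging

# Assumed example value for settings.py; Scrapy's default is "DEBUG".
LOG_LEVEL = "INFO"

# Level names map to integers in the logging module; higher means more severe.
threshold = getattr(logging, LOG_LEVEL)

levels = ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]
# A message is emitted only if its level is at or above the threshold.
shown = [name for name in levels if getattr(logging, name) >= threshold]
print(shown)
```

With this value, DEBUG messages are suppressed and the other four levels are still logged.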
Crawl responsibly by identifying yourself (and your website) on the user-agent:
USER_AGENT = "adac (+http://www.yourdomain.com)"
Configure maximum concurrent requests performed by Scrapy (default: 16):
CONCURRENT_REQUESTS = 16
Enable and configure HTTP caching (disabled by default)
HTTPCACHE_ENABLED = True
REDIRECT_ENABLED = False
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = "httpcache"
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302]
Scrapy can generate an “export file” with the scraped data out of the box, so that other systems can consume it. This feature, called feed exports, lets you generate feeds with the scraped items using multiple serialization formats and storage backends.
When using the feed exports you define where to store the feed using one or multiple URIs (through the FEEDS setting).
FEED_FORMAT="json"
FEED_URI="adac.json"
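In Scrapy 2.1 and later, FEED_URI and FEED_FORMAT are deprecated in favor of the FEEDS setting mentioned above. A minimal sketch of an equivalent configuration follows. The encoding and overwrite options are assumptions about what this project would want, not part of the original settings.

```python
# settings.py fragment (sketch): the same export expressed through FEEDS.
FEEDS = {
    "adac.json": {
        "format": "json",
        "encoding": "utf8",  # assumption: write Cyrillic text as-is, not as \uXXXX escapes
        "overwrite": True,   # assumption: replace the file on each run (Scrapy 2.4+)
    },
}
```

Each key of FEEDS is an output URI, so several feeds in different formats can be produced by one crawl.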
The spider parses the characteristics of child car seats and collects the following fields into a JSON file:
Name
ID
ADAC Rating
ADAC security
Reliability
Service
Convenience
Environmental friendliness
Permissible weight of the child
Permissible height of the child
ADAC Age Group
Building the full path to the adac.json file.
- path = "/upload/": the directory (folder) where the file will be stored.
- os.path.join(): concatenates path components according to the rules of the operating system.
  - path: the first argument, the starting folder.
  - "adac.json": the second argument, the file name.
import os

path = "/upload/"
full_path = os.path.join(path, "adac.json")
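To show what "the rules of the operating system" means in practice, the sketch below calls the POSIX and Windows flavors of join directly (posixpath and ntpath are the standard-library modules behind os.path). The Windows folder C:\upload is a made-up example.

```python
import ntpath      # the implementation os.path uses on Windows
import posixpath   # the implementation os.path uses on Linux and macOS

# A trailing slash on the folder makes no difference.
with_slash = posixpath.join("/upload/", "adac.json")
without_slash = posixpath.join("/upload", "adac.json")

# On Windows the separator is a backslash (hypothetical folder).
windows_path = ntpath.join("C:\\upload", "adac.json")

# Pitfall: an absolute second argument discards everything before it.
reset_path = posixpath.join("/upload/", "/adac.json")

print(with_slash)     # /upload/adac.json
print(without_slash)  # /upload/adac.json
print(windows_path)   # C:\upload\adac.json
print(reset_path)     # /adac.json
```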
This function converts a numerical score from the ADAC website into a word. The score is passed to number_to_word as its only argument, and the function returns the matching verbal grade (lower scores are better). A value outside the listed ranges returns None.
def number_to_word(number):
    if 0.6 <= number <= 1.5:
        return "Отлично"
    if 1.6 <= number <= 2.5:
        return "Хорошо"
    if 2.6 <= number <= 3.5:
        return "Нормально"
    if 3.6 <= number <= 4.5:
        return "Плохо"
    if 4.6 <= number <= 5.5:
        return "Очень плохо"
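The ranges above leave small gaps (for example 1.55), which is harmless as long as scores carry one decimal place. If that cannot be guaranteed, a table of upper bounds is one way to avoid the gaps. This is a sketch, not the project's code: score_to_word is a hypothetical name, and sending gap values to the worse grade is an assumption.

```python
from bisect import bisect_left

# Upper bound of each grade band, paired with the labels used in the document.
UPPER_BOUNDS = [1.5, 2.5, 3.5, 4.5, 5.5]
LABELS = ["Отлично", "Хорошо", "Нормально", "Плохо", "Очень плохо"]

def score_to_word(number):
    """Map a score in [0.6, 5.5] to its verbal grade, or None if out of range."""
    if not 0.6 <= number <= 5.5:
        return None
    # bisect_left finds the first band whose upper bound is >= number.
    return LABELS[bisect_left(UPPER_BOUNDS, number)]

print(score_to_word(1.5))   # Отлично
print(score_to_word(1.55))  # Хорошо
print(score_to_word(6.0))   # None
```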
This function translates German words into Russian.
- scraped_info = { ... }: a dictionary that stores pairs of German terms and their Russian translations. It contains every term the function needs to translate.
- scraped_info.get(word, word): a dictionary method that looks up the key word.
  - If the key is found, it returns the corresponding value (the Russian translation).
  - If the key is not found, it returns word itself, untranslated.
def translate(word):
    scraped_info = {
        'Zugelassenes Gewicht des Kindes': 'Допустимый вес ребенка',
        'k.A.': 'Нет информации',
        'Baby': 'Младенец',
        'Kleinkind': 'Малыш',
        'Kind': 'Ребенок',
        'Baby und Kleinkind': 'Младенец и малыш',
        'Kleinkind und Kind': 'Малыш и ребенок',
        'Baby, Kleinkind, Kind': 'Младенец, малыш, ребенок',
        'Zugelassene Größe des Kindes': 'Допустимый рост ребенка',
        'ADAC Alterklasse': 'Возрастная группа ADAC',
        'Testergebnis': 'ADAC безопасность',
        'Sicherheit': 'Надежность',
        'Bedienung': 'Обслуживание',
        'Ergonomie': 'Удобство',
        'Schadstoffe': 'Экологичность',
    }
    return scraped_info.get(word, word)
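The fallback behavior of get(word, word) can be seen in a short sketch. It uses a subset of the pairs above, kept in a module-level constant so the dictionary is built once rather than on every call. GERMAN_TO_RUSSIAN is a hypothetical name, and "Reboarder" is an arbitrary term that is not in the table.

```python
# Subset of the pairs from the function above, defined once at import time.
GERMAN_TO_RUSSIAN = {
    "k.A.": "Нет информации",
    "Baby": "Младенец",
    "Kind": "Ребенок",
}

def translate(word):
    """Return the Russian translation, or the word itself if it is unknown."""
    return GERMAN_TO_RUSSIAN.get(word, word)

print(translate("Baby"))       # Младенец
print(translate("Reboarder"))  # Reboarder (not in the table, returned unchanged)
```

The lookup is an exact match, so a value with stray whitespace or different capitalization also falls through unchanged.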
- The class ADACSpider inherits from scrapy.Spider, which makes it a Scrapy-compatible spider.
- Spider attributes:
  - name: the name of the spider, used to identify it within the scraping project (for example in scrapy crawl adac).
  - allowed_domains: the domain(s) that the spider is permitted to crawl. In this case, it is restricted to adac.de.
  - start_urls: the list of starting URLs from which the spider begins its crawl. It contains the link to the ADAC car seat test page.
  - custom_settings: a dictionary that overrides the project-wide Scrapy settings for this spider only.
    - FEED_URI: the output file path where the scraped data is saved. It uses the config.full_path variable, which should be defined elsewhere in the project, likely within a configuration file.
    - FEED_FORMAT: the format of the output data, which is set to json in this case.
import scrapy

import config  # project module expected to define full_path


class ADACSpider(scrapy.Spider):
    name = "adac"
    allowed_domains = ["adac.de"]
    start_urls = [
        "https://www.adac.de/rund-ums-fahrzeug/ausstattung-technik-zubehoer/kindersitze/kindersitztest/",
    ]
    custom_settings = {
        'FEED_URI': config.full_path,
        'FEED_FORMAT': 'json',
    }
- Inheritance (super()):
  - super() calls the __init__ method of the parent class (scrapy.Spider in this case).
  - This ensures that any initialization logic from the parent class is executed before the custom initialization logic.
  - *args and **kwargs pass any positional or keyword arguments received by the spider's constructor on to the parent class's constructor.
- File Deletion:
  - This section checks whether a file exists at the path specified by config.full_path.
  - If the file exists, it is deleted using os.remove(config.full_path), so that each run starts from an empty output file.
  - os.path is a Python standard-library module with functions for working with file-system paths, such as os.path.exists().
    def __init__(self, *args, **kwargs):
        super(ADACSpider, self).__init__(*args, **kwargs)
        if os.path.exists(config.full_path):
            os.remove(config.full_path)
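The check-then-delete step can also be written as a single call with pathlib, which avoids the short window between the existence check and the removal. This is an alternative sketch, not the project's code: remove_stale_output is a hypothetical helper, and a temporary file stands in for config.full_path.

```python
import tempfile
from pathlib import Path

def remove_stale_output(path):
    """Delete the file at `path` if it exists; do nothing otherwise."""
    Path(path).unlink(missing_ok=True)  # missing_ok requires Python 3.8+

# Usage with a throwaway file standing in for config.full_path.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "adac.json"
    target.write_text("[]", encoding="utf-8")
    remove_stale_output(target)
    remove_stale_output(target)  # second call is a no-op and raises nothing
    still_there = target.exists()

print(still_there)  # False
```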
- Extracting Data from JavaScript:
  - This code extracts a unique product ID from a JavaScript snippet embedded in the HTML.
  - It searches for the string "-id-" in the JavaScript content, then takes the three characters after it, assuming they represent the ID.
- Extracting Basic Product Information:
  - A dictionary named data is created to store the extracted data.
  - It is initialized with:
    - Кресло (Russian for "car seat"): the car seat name, extracted from the <h1> tag.
    - ID: the product ID extracted earlier.
- Looping Through Sections:
  - The code iterates through all div elements that are direct children of the main element of the page.
  - The goal is to parse different sections of the page based on their content.
- Parsing ADAC Rating Section:
  - This block processes the section containing the ADAC rating, identified by the text "ADAC Urteil".
  - It extracts the rating, translates the values, and stores them in the data dictionary.
- Parsing General Data Section:
  - This block handles the section with general car seat data ("Allgemeine Daten"), such as child weight, height, and age group.
  - It extracts these values, translates them, and stores them in the data dictionary.
- Yielding Extracted Data:
  - Finally, the yield data statement returns the data dictionary containing all the extracted information to the Scrapy engine.
    def parse_product(self, response):
        info = response.css('body > div > div > script::text').get()
        start = info.find("-id-") + 4
        ID = info[start:start+3]
        data = {
            "Кресло": response.css("h1::text").get(),
            "ID": ID,
        }
        divs = response.css("main > div")
        for div in divs:
            if "ADAC Urteil" in div.get():
                adac_rating = div.css("h3 div::text").get()
                if adac_rating is None:
                    adac_rating = response.css("main > div > h3 > div > div::text").get()
                if isinstance(adac_rating, tuple):
                    adac_rating = ''.join(adac_rating)
                adac_rating = float(adac_rating.replace(",", "."))
                data["ADAC Рейтинг"] = adac_rating
                data[translate("Testergebnis")] = number_to_word(adac_rating)
                for button in div.css("button"):
                    key = translate(button.css("p::text").get())
                    if key != 'Verarbeitung und Reinigung':
                        n = float(button.css("dd p::text").get().replace(',', '.'))
                        data[key] = number_to_word(n)
            if "Allgemeine Daten" in div.get():
                child_weight = response.css('main > div.sc-eCYdKt.cOUdGC.sc-jSMdHm.sc-jCPYrn.eIeQCA.jpfxCx > div > table > tbody > tr:nth-child(2) > td:nth-child(2)::text').get()
                data[translate("Zugelassenes Gewicht des Kindes")] = translate(child_weight.replace(' bis ', '-').replace('bis ', 'до ').replace('kg', 'кг'))
                child_height = response.css('main > div.sc-eCYdKt.cOUdGC.sc-jSMdHm.sc-jCPYrn.eIeQCA.jpfxCx > div > table > tbody > tr:nth-child(3) > td:nth-child(2)::text').get()
                data[translate("Zugelassene Größe des Kindes")] = translate(child_height.replace(' cm bis ', '-').replace('cm', 'см'))
                child_age_group = response.css('main > div.sc-eCYdKt.cOUdGC.sc-jSMdHm.sc-jCPYrn.eIeQCA.jpfxCx > div > table > tbody > tr:nth-child(4) > td:nth-child(2)::text').get()
                data[translate("ADAC Alterklasse")] = translate(child_age_group)
        yield data
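Two steps in parse_product fail in ways that are easy to miss: if the "-id-" marker is absent, str.find returns -1 and the slice silently yields unrelated characters, and the float conversion relies on replacing the German decimal comma. The sketch below isolates both steps. extract_id and parse_german_decimal are hypothetical helper names, and the sample script text is invented for illustration.

```python
def extract_id(script_text, marker="-id-", length=3):
    """Return the `length` characters after `marker`, or None if it is absent."""
    if not script_text:
        return None
    pos = script_text.find(marker)
    if pos == -1:  # marker missing: report it instead of slicing garbage
        return None
    start = pos + len(marker)
    return script_text[start:start + length]

def parse_german_decimal(text):
    """Convert a German-formatted number such as '2,3' to a float."""
    return float(text.strip().replace(",", "."))

sample = '{"url": "/kindersitz/example-seat-id-123/"}'  # invented sample
print(extract_id(sample))            # 123
print(extract_id("no marker here"))  # None
print(parse_german_decimal(" 2,3 ")) # 2.3
```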
- Extracting Product URLs:
  - The code iterates through each table row (tr) on the page, likely from a table listing car seats.
  - It checks whether each row contains a link (a tag) using row.css("a").
  - If a link is found, the first link element in the row is assigned to the url variable. response.follow() reads the href attribute from it.
- Following URLs with response.follow():
  - This block handles following the extracted URLs.
  - yield response.follow(url, callback=self.parse_product) uses Scrapy's response.follow() method to create a new request to the extracted url.
  - The callback=self.parse_product argument specifies that the parse_product function (defined earlier) should parse the content of the product page.
  - A try...except block handles errors raised while the request is being created. If an error occurs, it prints an error message and uses traceback.print_exc() to display a detailed traceback of the exception.
- Handling Pagination:
  - The code looks for pagination links within a div element with the attribute data-testid="pagination".
  - It then iterates through each pagination link and uses yield response.follow(a, callback=self.parse) to follow the link.
  - The callback=self.parse argument tells Scrapy to call the parse function again on the new page, continuing the crawl through the pagination links.
    def parse(self, response):
        for row in response.css('tr'):
            links = row.css("a")
            if links:
                url = links[0]
                try:
                    yield response.follow(url, callback=self.parse_product)
                except Exception as e:
                    print(f"Can't parse {url}")
                    traceback.print_exc()
        for a in response.css('div[data-testid="pagination"] a'):
            yield response.follow(a, callback=self.parse)
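response.follow() accepts relative links and resolves them against the URL of the current page, which is why the spider can pass the a elements straight through. The sketch below shows that kind of resolution with the standard library's urljoin. Both href values are hypothetical examples, not links taken from the ADAC site.

```python
from urllib.parse import urljoin

page_url = (
    "https://www.adac.de/rund-ums-fahrzeug/ausstattung-technik-zubehoer/"
    "kindersitze/kindersitztest/"
)

# Hypothetical href values of the kind a product row or pagination link may carry.
product_href = "/kindersitztest/example-seat-id-123/"
next_page_href = "?pageNumber=2"

# A root-relative link keeps only the scheme and host of the page URL.
product_url = urljoin(page_url, product_href)
# A query-only link keeps the full path and replaces the query string.
next_page_url = urljoin(page_url, next_page_href)

print(product_url)
print(next_page_url)
```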