This repository provides two methods to collect news data from Google News.
- Free Method: Perfect for small projects and learning
- Google News API: Ideal for large-scale, reliable, real-time data extraction
This free tool lets you collect news articles based on any topic you're interested in. You'll get everything from headlines to publication dates, all neatly organized.
- Python 3.9+
- Two key packages:
- aiohttp (for making requests)
- beautifulsoup4 (for parsing HTML)
- Clone the repository:

  ```bash
  git clone https://github.com/luminati-io/Google-News-Scraper.git
  ```

- Navigate to the project directory:

  ```bash
  cd Google-News-Scraper
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Navigate to the `free_scraper` directory and open `main.py`.

- Define your search terms in the file:

  ```python
  search_terms = [
      "artificial intelligence",
      "climate change",
      "space exploration",
      # Add more search terms as needed
  ]
  ```

- Run the scraper:

  ```bash
  python main.py
  ```
The scraper generates JSON files:

- Individual JSON files for each search term
- A `combined_results.json` file containing data from all search terms
Each article in the JSON output contains:

```json
{
  "title": "OpenAI launches full o1 model with image uploads and analysis, debuts ChatGPT Pro - VentureBeat",
  "link": "https://news.google.com/rss/articles/CBMipgFBVV95cUxQTTVmS1I4aW1QanZXTnBfa2tBR3d0Y2JzNjJJNldBZTd1TVVfRmpxaUM3bGJld3RycXhPbU8wM1loT0JGd2JDRzFmU1pLU3FSbkRRZ0FPY29INmdhU1RsWXFqXzdLTjNCbU5ES3pIQXZLbTVmMWVhc0FqVlljeWNPOHZMeFlXV2F5Q21ac0lSZVhIOHlnS05sdkR5ZjhJTU9HazJ6MWJR?oc=5",
  "publication_date": "Thu, 05 Dec 2024 18:00:00 GMT",
  "source": "VentureBeat",
  "source_url": "https://venturebeat.com",
  "guid": "CBMipgFBVV95cUxQTTVmS1I4aW1QanZXTnBfa2tBR3d0Y2JzNjJJNldBZTd1TVVfRmpxaUM3bGJld3RycXhPbU8wM1loT0JGd2JDRzFmU1pLU3FSbkRRZ0FPY29INmdhU1RsWXFqXzdLTjNCbU5ES3pIQXZLbTVmMWVhc0FqVlljeWNPOHZMeFlXV2F5Q21ac0lSZVhIOHlnS05sdkR5ZjhJTU9HazJ6MWJR"
}
```
You can find a complete example output in our `free_scraper/data/` directory.
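If you're curious how the free method works under the hood, here's a minimal sketch of the fetch-and-parse loop it's built on. It assumes the public Google News RSS search endpoint and uses the two packages from the prerequisites; the actual `main.py` handles multiple terms concurrently and may differ in the details:

```python
# Minimal sketch, not the exact main.py implementation.
# Assumes the public Google News RSS search endpoint.
import asyncio

import aiohttp
from bs4 import BeautifulSoup


async def fetch_articles(term: str) -> list[dict]:
    # Fetch the RSS feed for one search term
    async with aiohttp.ClientSession() as session:
        async with session.get(
            "https://news.google.com/rss/search", params={"q": term}
        ) as resp:
            xml = await resp.text()

    # Each <item> in the feed is one article.
    # The "xml" feature requires lxml to be installed.
    soup = BeautifulSoup(xml, "xml")
    return [
        {
            "title": item.title.text,
            "link": item.link.text,
            "publication_date": item.pubDate.text,
            "source": item.source.text if item.source else None,
            "source_url": item.source.get("url") if item.source else None,
        }
        for item in soup.find_all("item")
    ]


articles = asyncio.run(fetch_articles("artificial intelligence"))
print(f"Fetched {len(articles)} articles")
```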
Scraping data from Google News can be quite challenging. Here are some common issues you may encounter:
- CAPTCHA and Anti-Bot Mechanisms: Google often employs CAPTCHAs or rate-limiting mechanisms to prevent bots from accessing its content.
- Scalability: Scraping large volumes of data or performing high-frequency scraping can overwhelm free scrapers.
- Global and Localized News Access: Customizing scrapers for different regions and languages often requires significant effort and manual adjustments.
Want something more robust? Let's talk about Bright Data's Google News API. Here's why it's worth considering:
- Zero Infrastructure Headaches: Forget about proxies and CAPTCHAs
- Built to Scale: Handles heavy traffic with exceptional performance
- Global Reach: Get news from any country, any language
- Privacy First: GDPR & CCPA compliant
- Pay for Success: Only charged for successful requests
- Try Before You Buy: 20 free API calls to test things out
For a detailed guide on setting up the Google News API, check our Step-by-Step Setup Guide.
| Parameter | Required? | Description | Example |
|---|---|---|---|
| `url` | Yes | Base Google News URL | `news.google.com` |
| `keyword` | Yes | Your search topic | `"ChatGPT"` |
| `country` | No | Where to get news from | `"US"` |
| `language` | No | What language you want | `"en"` |
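Put together, a single search input looks like this; it's the same shape the script below passes in its `queries` list:

```json
{
  "url": "https://news.google.com/",
  "keyword": "ChatGPT",
  "country": "US",
  "language": "en"
}
```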
Here's what the API returns:

```json
{
  "url": "https://www.tomsguide.com/news/live/12-days-of-openai-live-blog-chatgpt-sora",
  "title": "12 Days of OpenAI Day 2 LIVE: o1 full is here and every new ChatGPT AI announcement as it happens",
  "publisher": "Tom's Guide",
  "date": "2024-12-06T20:54:01.000Z",
  "category": null,
  "keyword": "chatgpt",
  "country": "US",
  "image": "https://news.google.com/api/attachments/CC8iK0NnNW9SbTFVTWtkNGFGSjJSVGhGVFJDb0FSaXNBaWdCTWdhQmtJcWpOQWM=-w200-h112-p-df-rw",
  "timestamp": "2024-12-08T10:06:05.122Z",
  "input": {
    "url": "https://news.google.com/",
    "keyword": "chatgpt",
    "country": "US",
    "language": "en"
  }
}
```
You can find a complete example output in our `news_scraper_output.json` file.
Here's a script to get you started:
```python
import json
import time

import requests


class BrightDataNews:
    def __init__(self, api_token):
        self.api_token = api_token
        self.headers = {
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        }
        self.dataset_id = "gd_lnsxoxzi1omrwnka5r"

    def collect_news(self, search_queries):
        """
        Collect Google News articles using the Bright Data API
        """
        # 1. Trigger data collection
        print("Starting news collection...")
        trigger_response = self._trigger_collection(search_queries)
        snapshot_id = trigger_response.get("snapshot_id")
        print(f"Snapshot ID: {snapshot_id}")

        # 2. Wait for data to be ready
        print("Waiting for data...")
        while True:
            status = self._check_status(snapshot_id)
            print(f"Status: {status}")
            if status == "ready":
                # Check that data is actually available
                data = self._get_data(snapshot_id)
                if data and len(data) > 0:
                    break
            time.sleep(10)  # Wait 10 seconds before the next check

        # 3. Get and save the data
        print("Saving data...")
        filename = "news_scraper_output.json"
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)

        print(f"✓ Data saved to {filename}")
        print(f"✓ Collected {len(data)} news articles")
        return data

    def _trigger_collection(self, search_queries):
        """Trigger news data collection"""
        response = requests.post(
            "https://api.brightdata.com/datasets/v3/trigger",
            headers=self.headers,
            params={"dataset_id": self.dataset_id, "include_errors": "true"},
            json=search_queries,
        )
        return response.json()

    def _check_status(self, snapshot_id):
        """Check collection status"""
        response = requests.get(
            f"https://api.brightdata.com/datasets/v3/progress/{snapshot_id}",
            headers=self.headers,
        )
        return response.json().get("status")

    def _get_data(self, snapshot_id):
        """Get collected data"""
        response = requests.get(
            f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}",
            headers=self.headers,
            params={"format": "json"},
        )
        return response.json()
```
Here's how to use it:
```python
# Initialize the client
news_client = BrightDataNews("<YOUR_API_TOKEN>")

# Define what you want to collect
queries = [
    {
        "url": "https://news.google.com/",
        "keyword": "artificial intelligence startups",
        "country": "US",
        "language": "en",
    },
    {
        "url": "https://news.google.com/",
        "keyword": "tech industry layoffs",
        "country": "US",
        "language": "en",
    },
]

# Start collection
try:
    news_data = news_client.collect_news(queries)
    print(f"Successfully collected {len(news_data)} articles")
except Exception as e:
    print(f"Collection failed: {str(e)}")
```
Here's what happens when you run it:

- Setting Up Your API Token
  - First things first: you'll need an API token.
  - If you haven't got one yet, check out our setup guide.
- Starting the Collection
  - Pass your search parameters to the API.
  - You'll get back a `snapshot_id`.
- Monitoring Progress
  - The process takes a few minutes.
  - Our code checks the status automatically:
    - "running" = still collecting your data
    - "ready" = time to collect your results!
- Getting Your Data
  - Once the status shows "ready", we fetch and save your results.
  - Data comes in clean JSON format.
  - Each article includes all the fields we discussed earlier.
You can use the following parameters to fine-tune your results:
| Parameter | Type | Description | Example |
|---|---|---|---|
| `limit` | integer | Max results per input | `limit=10` |
| `include_errors` | boolean | Get error reports for troubleshooting | `include_errors=true` |
| `notify` | url | Webhook URL to be notified upon completion | `notify=https://notify-me.com/` |
| `format` | enum | Output format (e.g., JSON, NDJSON, JSONL, CSV) | `format=json` |
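Here's a hedged sketch of how these could be passed: as query-string values on the trigger request, the same way the script above passes `dataset_id` and `include_errors`. Check the official API docs to confirm which parameters your dataset supports:

```python
# Sketch only, assuming the optional tuning parameters from the table
# above ride along as query-string values on the trigger call.
import requests

response = requests.post(
    "https://api.brightdata.com/datasets/v3/trigger",
    headers={"Authorization": "Bearer <YOUR_API_TOKEN>"},
    params={
        "dataset_id": "gd_lnsxoxzi1omrwnka5r",
        "include_errors": "true",           # error reports for troubleshooting
        "limit": 10,                        # max results per input
        "format": "json",                   # output format
        "notify": "https://notify-me.com/", # webhook called on completion
    },
    json=[
        {
            "url": "https://news.google.com/",
            "keyword": "chatgpt",
            "country": "US",
            "language": "en",
        }
    ],
)
print(response.json())
```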
💡 Pro Tip: You can also choose to deliver the data to external storage or to a webhook.
Need more details? Check the official API docs.