Extract news data with the #1 Google News API, including top stories, publisher details, article summaries, and trending topics. Start your free trial now!

luminati-io/Google-News-Scraper

Google News Scraper

This repository provides two methods to collect news data from Google News.

  • Free Method: Perfect for small projects and learning
  • Google News API: Ideal for large-scale, reliable, real-time data extraction

Method 1: Free Google News Scraper

This free tool lets you collect news articles based on any topic you're interested in. You'll get everything from headlines to publication dates, all neatly organized.

Prerequisites

Installation

  1. Clone the repository:

    git clone https://github.com/luminati-io/Google-News-Scraper.git
  2. Navigate to the project directory:

    cd Google-News-Scraper
  3. Install required dependencies:

    pip install -r requirements.txt

Usage

  1. Navigate to the free_scraper directory and open main.py

  2. Define your search terms in the file:

    search_terms = [
        "artificial intelligence",
        "climate change",
        "space exploration",
        # Add more search terms as needed
    ]
  3. Run the scraper:

    python main.py

Output

The scraper generates JSON files:

  • Individual JSON files for each search term
  • A combined_results.json file containing data from all search terms

Each article in the JSON output contains:

{
    "title": "OpenAI launches full o1 model with image uploads and analysis, debuts ChatGPT Pro - VentureBeat",
    "link": "https://news.google.com/rss/articles/CBMipgFBVV95cUxQTTVmS1I4aW1QanZXTnBfa2tBR3d0Y2JzNjJJNldBZTd1TVVfRmpxaUM3bGJld3RycXhPbU8wM1loT0JGd2JDRzFmU1pLU3FSbkRRZ0FPY29INmdhU1RsWXFqXzdLTjNCbU5ES3pIQXZLbTVmMWVhc0FqVlljeWNPOHZMeFlXV2F5Q21ac0lSZVhIOHlnS05sdkR5ZjhJTU9HazJ6MWJR?oc=5",
    "publication_date": "Thu, 05 Dec 2024 18:00:00 GMT",
    "source": "VentureBeat",
    "source_url": "https://venturebeat.com",
    "guid": "CBMipgFBVV95cUxQTTVmS1I4aW1QanZXTnBfa2tBR3d0Y2JzNjJJNldBZTd1TVVfRmpxaUM3bGJld3RycXhPbU8wM1loT0JGd2JDRzFmU1pLU3FSbkRRZ0FPY29INmdhU1RsWXFqXzdLTjNCbU5ES3pIQXZLbTVmMWVhc0FqVlljeWNPOHZMeFlXV2F5Q21ac0lSZVhIOHlnS05sdkR5ZjhJTU9HazJ6MWJR"
}

👉 You can find a complete example output in our free_scraper/data/ directory.
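
To work with these records in Python, a minimal sketch might look like the following. It assumes each record carries the fields shown in the sample above; the helper name `normalize_article` and the added keys `published_at` and `clean_title` are hypothetical and not part of the repository. The date is parsed with the standard library, and the trailing " - <source>" is stripped from the title only when it matches the `source` field.

```python
from email.utils import parsedate_to_datetime


def normalize_article(article):
    """Return a copy of a scraper record with a parsed date and a clean title.

    Hypothetical helper: assumes the keys shown in the sample record.
    """
    cleaned = dict(article)
    # "Thu, 05 Dec 2024 18:00:00 GMT" is an RFC 2822 style date.
    cleaned["published_at"] = parsedate_to_datetime(article["publication_date"])
    # Titles in the sample end with " - <source>"; strip that suffix when present.
    title = article["title"]
    source = article.get("source")
    if source and title.endswith(f" - {source}"):
        title = title[: -len(f" - {source}")]
    cleaned["clean_title"] = title
    return cleaned


sample = {
    "title": "OpenAI launches full o1 model with image uploads and analysis, "
             "debuts ChatGPT Pro - VentureBeat",
    "publication_date": "Thu, 05 Dec 2024 18:00:00 GMT",
    "source": "VentureBeat",
}
result = normalize_article(sample)
print(result["clean_title"])
print(result["published_at"].isoformat())
```

Returning a copy keeps the original record untouched, so the raw JSON can still be written back unchanged.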

Common Scraping Challenges

Scraping data from Google News can be quite challenging. Here are some common issues you may encounter:

  1. CAPTCHA and Anti-Bot Mechanisms: Google often employs CAPTCHAs or rate-limiting mechanisms to prevent bots from accessing its content.
  2. Scalability: Scraping large volumes of data or performing high-frequency scraping can overwhelm free scrapers.
  3. Global and Localized News Access: Customizing scrapers for different regions and languages often requires significant effort and manual adjustments.
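
For the rate-limiting case, one common mitigation is to retry with exponential backoff. The sketch below is illustrative only and is not taken from the repository: `fetch_with_backoff` and its parameters are hypothetical names, and treating HTTP 429 and 503 as the throttling signals is an assumption. It does not solve CAPTCHAs.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, max_attempts=4, base_delay=2.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, retrying with exponential backoff on throttling responses."""
    for attempt in range(max_attempts):
        try:
            with opener(url, timeout=30) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Assumption: 429 and 503 indicate throttling; other errors are re-raised.
            if err.code not in (429, 503) or attempt == max_attempts - 1:
                raise
            # Double the wait after every failed attempt: 2s, 4s, 8s, ...
            sleep(base_delay * (2 ** attempt))
```

The `opener` and `sleep` arguments exist so the retry logic can be exercised without network access or real delays.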

Method 2: Bright Data Google News API

Want something more robust? Let's talk about Bright Data's Google News API. Here's why it's worth considering:

Key Benefits

  • Zero Infrastructure Headaches: Forget about proxies and CAPTCHAs
  • Built to Scale: Handles heavy traffic with exceptional performance
  • Global Reach: Get news from any country, any language
  • Privacy First: GDPR & CCPA compliant
  • Pay for Success: Only charged for successful requests
  • Try Before You Buy: 20 free API calls to test things out

Getting Started with the Google News API

For a detailed guide on setting up the Google News API, check our Step-by-Step Setup Guide.

Key Input Parameters

Parameter | Required? | Description            | Example
url       | Yes       | Base Google News URL   | news.google.com
keyword   | Yes       | Your search topic      | "ChatGPT"
country   | No        | Where to get news from | "US"
language  | No        | What language you want | "en"
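
As a sketch of how these inputs fit together, the hypothetical helper below builds one query object and enforces the two required fields from the table. `build_query` is not part of the API or the repository; the default URL is the one used in this README's examples.

```python
def build_query(keyword, url="https://news.google.com/", country=None, language=None):
    """Build one input object for the API (hypothetical helper)."""
    if not keyword or not keyword.strip():
        raise ValueError("keyword is required")
    if not url:
        raise ValueError("url is required")
    query = {"url": url, "keyword": keyword.strip()}
    # Optional fields are only included when the caller provides them.
    if country:
        query["country"] = country
    if language:
        query["language"] = language
    return query


queries = [
    build_query("ChatGPT", country="US", language="en"),
    build_query("climate change"),
]
print(queries)
```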

Sample Result

Here's what the API returns:

{
    "url": "https://www.tomsguide.com/news/live/12-days-of-openai-live-blog-chatgpt-sora",
    "title": "12 Days of OpenAI Day 2 LIVE: o1 full is here and every new ChatGPT AI announcement as it happens",
    "publisher": "Tom's Guide",
    "date": "2024-12-06T20:54:01.000Z",
    "category": null,
    "keyword": "chatgpt",
    "country": "US",
    "image": "https://news.google.com/api/attachments/CC8iK0NnNW9SbTFVTWtkNGFGSjJSVGhGVFJDb0FSaXNBaWdCTWdhQmtJcWpOQWM=-w200-h112-p-df-rw",
    "timestamp": "2024-12-08T10:06:05.122Z",
    "input": {
        "url": "https://news.google.com/",
        "keyword": "chatgpt",
        "country": "US",
        "language": "en"
    }
}

👉 You can find a complete example output in our news_scraper_output.json file.
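
The `date` and `timestamp` fields are ISO 8601 strings with a trailing "Z". A small sketch for turning them into timezone-aware datetimes follows. It assumes `date` is the publication time and `timestamp` is the collection time, which the sample suggests but the source does not state; `parse_api_time` is a hypothetical helper name.

```python
from datetime import datetime


def parse_api_time(value):
    """Parse strings such as "2024-12-06T20:54:01.000Z" into aware datetimes."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so swap it for an explicit UTC offset to stay compatible with older versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


record = {
    "publisher": "Tom's Guide",
    "date": "2024-12-06T20:54:01.000Z",
    "timestamp": "2024-12-08T10:06:05.122Z",
}
published = parse_api_time(record["date"])
collected = parse_api_time(record["timestamp"])
# Assumption: the gap between the two fields is the article's age at collection time.
print(f"{record['publisher']}: collected {collected - published} after publication")
```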

Ready-to-Use Python Code

Here's a script to get you started:

import requests
import json
import time


class BrightDataNews:
    def __init__(self, api_token):
        self.api_token = api_token
        self.headers = {
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        }
        self.dataset_id = "gd_lnsxoxzi1omrwnka5r"

    def collect_news(self, search_queries):
        """
        Collect Google News articles using BrightData API
        """
        # 1. Trigger data collection
        print("Starting news collection...")
        trigger_response = self._trigger_collection(search_queries)
        snapshot_id = trigger_response.get("snapshot_id")
        if not snapshot_id:
            # Nothing to poll without a snapshot ID, so surface the API response
            raise RuntimeError(f"No snapshot_id in trigger response: {trigger_response}")
        print(f"Snapshot ID: {snapshot_id}")

        # 2. Wait for data to be ready
        print("Waiting for data...")
        while True:
            status = self._check_status(snapshot_id)
            print(f"Status: {status}")

            if status == "ready":
                # Check if data is actually available
                data = self._get_data(snapshot_id)
                if data:
                    break
            time.sleep(10)  # Wait 10 seconds before next check

        # 3. Save the data
        print("Saving data...")
        filename = "news_scraper_output.json"
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        print(f"✓ Data saved to {filename}")
        print(f"✓ Collected {len(data)} news articles")
        return data

    def _trigger_collection(self, search_queries):
        """Trigger news data collection"""
        response = requests.post(
            "https://api.brightdata.com/datasets/v3/trigger",
            headers=self.headers,
            params={"dataset_id": self.dataset_id, "include_errors": "true"},
            json=search_queries,
        )
        response.raise_for_status()  # Fail fast on HTTP errors (e.g. an invalid token)
        return response.json()

    def _check_status(self, snapshot_id):
        """Check collection status"""
        response = requests.get(
            f"https://api.brightdata.com/datasets/v3/progress/{snapshot_id}",
            headers=self.headers,
        )
        response.raise_for_status()
        return response.json().get("status")

    def _get_data(self, snapshot_id):
        """Get collected data"""
        response = requests.get(
            f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}",
            headers=self.headers,
            params={"format": "json"},
        )
        response.raise_for_status()
        return response.json()

Here's how to use it:

# Initialize the client
news_client = BrightDataNews("<YOUR_API_TOKEN>")

# Define what you want to collect
queries = [
    {
        "url": "https://news.google.com/",
        "keyword": "artificial intelligence startups",
        "country": "US",
        "language": "en",
    },
    {
        "url": "https://news.google.com/",
        "keyword": "tech industry layoffs",
        "country": "US",
        "language": "en",
    },
]

# Start collection
try:
    news_data = news_client.collect_news(queries)
    print(f"Successfully collected {len(news_data)} articles")
except Exception as e:
    print(f"Collection failed: {str(e)}")

Understanding the API Implementation

  1. Setting Up Your API Token
    • First things first: you'll need an API token
    • If you haven't got one yet, check out our setup guide
  2. Starting the Collection
    • Pass your search parameters to the API
    • You'll get back a snapshot_id
  3. Monitoring Progress
    • The process takes a few minutes
    • Our code checks the status automatically:
      • "running" = Still collecting your data
      • "ready" = Time to collect your results!
  4. Getting Your Data
    • Once the status shows "ready", we fetch and save your results
    • Data comes in clean JSON format
    • Each article includes all the fields we discussed earlier
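
The monitoring step can be sketched as a polling loop with an upper bound on the total wait, so a stuck job does not block forever. This is an illustration, not the repository's code: `wait_until_ready`, its parameters, and the 600-second default are hypothetical. Only the "running" and "ready" statuses described above are assumed.

```python
import time


def wait_until_ready(check_status, poll_interval=10, max_wait=600,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll check_status() until it returns "ready" or max_wait seconds elapse."""
    deadline = clock() + max_wait
    while True:
        status = check_status()
        if status == "ready":
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"Snapshot not ready after {max_wait}s (last status: {status})"
            )
        sleep(poll_interval)


# Demo with canned statuses and no real sleeping.
fake_statuses = iter(["running", "running", "ready"])
print(wait_until_ready(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

With the class shown earlier, the status callable could be something like `lambda: news_client._check_status(snapshot_id)`.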

Customizing Your Data Collection

You can use the following parameters to fine-tune your results:

Parameter      | Type    | Description                                    | Example
limit          | integer | Max results per input                          | limit=10
include_errors | boolean | Get error reports for troubleshooting          | include_errors=true
notify         | url     | Webhook URL to be notified upon completion     | notify=https://notify-me.com/
format         | enum    | Output format (e.g., JSON, NDJSON, JSONL, CSV) | format=json

💡 Pro Tip: You can also choose to have the data delivered to external storage or to a webhook.
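
A rough sketch of assembling these options as query parameters is shown below. The parameter names come from the table above; `build_options` is a hypothetical helper, and which endpoint accepts each parameter is an assumption that should be checked against the official API docs.

```python
from urllib.parse import urlencode


def build_options(limit=None, include_errors=None, notify=None, output_format=None):
    """Collect the optional parameters from the table into a query-parameter dict."""
    options = {}
    if limit is not None:
        if limit < 1:
            raise ValueError("limit must be a positive integer")
        options["limit"] = limit
    if include_errors is not None:
        # The script in this README sends this flag as the string "true".
        options["include_errors"] = "true" if include_errors else "false"
    if notify:
        options["notify"] = notify
    if output_format:
        options["format"] = output_format
    return options


options = build_options(limit=10, include_errors=True, notify="https://notify-me.com/")
print(urlencode(options))
```

The resulting dict can be merged into the `params` argument of a `requests` call, alongside `dataset_id`.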


Need more details? Check the official API docs.
