Skip to content

Simple example of Hadoop Application count letter, with an intersting Romance Language Analysis

Notifications You must be signed in to change notification settings

nikisetti01/Hadoop-MapReduce-LetterFrequency-Analysis

Repository files navigation

Letter Frequency Count with Hadoop MapReduce

Welcome to the Letter Frequency Count application using Hadoop MapReduce! This project demonstrates how to count letter frequencies in a text using two different MapReduce techniques and perform analysis on text data in multiple languages. The project is organized into several components for clarity.

📁 Project Structure

  1. Java MapReduce Code:

    • LetterFrequencyMapReduce.java: Implements MapReduce functionality to count letter frequencies.
      • InMapperCombiner: Utilizes an in-mapper combiner to optimize performance.
      • Classic Combiner: Uses the classic combiner technique.
  2. Python Analysis:

    • performance_evaluation: Analyzes Hadoop application performance using various HDFS and YARN parameters.
    • linguistic_analysis: Compares Latin text with texts in Romance languages to determine linguistic affinities and differences.

💻 Java MapReduce Code

LetterFrequencyMapReduce.java

This file contains two implementations of the MapReduce job for counting letter frequencies:

  • InMapperCombiner:

    • Purpose: Reduces intermediate data during the map phase, optimizing performance.
    • Usage: The mapper combines intermediate results within the map function.
  • Classic Combiner:

    • Purpose: Aggregates intermediate results before the reduce phase.
    • Usage: The combiner is used as a separate phase between map and reduce.

How It Works

  1. Mapper: Processes input text to produce letter frequency counts.
  2. Combiner: Aggregates results to reduce data volume.
  3. Reducer: Finalizes the count and outputs the total letter frequencies.

📊 Python Analysis

performance_evaluation

This script evaluates the performance of the Hadoop MapReduce job based on various parameters:

  • HDFS Parameters: Analyzes the efficiency of Hadoop Distributed File System settings.
  • YARN Parameters: Assesses the impact of YARN settings on job performance.

Objective: Identify performance bottlenecks and optimize configuration settings.

linguistic_analysis

This script performs a comparative analysis of texts in Latin and Romance languages:

  • Text Analysis: Uses data from texts of the Aeneid in various Romance languages and Latin.
  • Statistical Comparison: Identifies similarities and differences, determining which Romance language is closest to Latin.

Objective: Provide insights into linguistic evolution and affinities over time.

About

Simple example of Hadoop Application count letter, with an intersting Romance Language Analysis

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published