Repository of the data and models generated by Mr. Shyam Ratan as part of his MPhil dissertation titled 'Automatic Detection Of Propaganda In Hindi On Social Media', in collaboration with UnReaL-TecE LLP. All the information regarding the dataset, models, and results is given below:
This data is used for the automatic detection of propaganda in Hindi and for two supporting case studies in the MPhil dissertation. This version has two phases, and each phase has two divisions: Annotated data and Raw data. Phase - 1 holds the data used for the pilot of this work on automatic detection and for its results. The Phase - 2 data is used in two important case studies of this research work. In the final stage of the research, however, the whole data of Phase - 1 and Phase - 2 is used to train and test language models for automatic detection of propaganda in Hindi.
Navigation - Dataset -> v0.1 -> Phase - 1 -> {1. Annotated and 2. Raw} - 500 articles/documents;
Phase - 2 -> {1. Annotated and 2. Raw} - 399 articles/documents.
In this version the data is distributed across the two phases mentioned earlier. Phase - 1 has annotated data from 8 Hindi newspapers, viz. Aap Ki Kranti, Amar Ujala, Dainik Bhaskar, Dainik Jagran, Hindustan, Media Vigil, Saamana, and tfipost, and 2 periodicals, viz. Kamal Sandesh and Panchjanya. For balance, each source has 50 annotated news articles/documents. Each directory has the same number of ann and txt files: the ann files hold the propaganda-labelled spans and sentences, while the txt files hold the text itself. This phase also has the same number of raw news articles/documents in the Raw directory.
Phase - 2 has annotated data from 18 newspapers, viz. Aap Ki Kranti, Agnibaan, Amar Ujala, Dainik Bihar, Dainik Bhaskar, Dainik Jagran, Haribhoomi, Hindustan, Jansandesh Times, Janwarta, Media Vigil, Naye Samikaran, Newslaundry, Saamana, Swarajya, Swatantra Bharat, tfipost, and Virarjun, and 2 periodicals, viz. Kamal Sandesh and Panchjanya. Here each source has 20 annotated news articles/documents, except Panchjanya, which has 19 (20 sources, 399 articles/documents in total).
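The pairing of ann and txt files described above can be checked with a short script. The following is a minimal sketch, not part of the original repository: it assumes that each source has its own sub-directory under an Annotated directory and that an ann file and its txt file share the same base name. The function names `pair_annotations` and `count_per_source` and the example path are hypothetical.

```python
from collections import Counter
from pathlib import Path


def pair_annotations(annotated_dir):
    """Pair every .ann file under annotated_dir with the .txt file of the same base name.

    Returns (pairs, unpaired): pairs is a list of (ann_path, txt_path) tuples,
    unpaired lists files whose counterpart is missing.
    Assumption: an annotation and its text differ only in their file extension.
    """
    root = Path(annotated_dir)
    pairs, unpaired = [], []
    for ann_path in sorted(root.rglob("*.ann")):
        txt_path = ann_path.with_suffix(".txt")
        if txt_path.exists():
            pairs.append((ann_path, txt_path))
        else:
            unpaired.append(ann_path)
    # Also report text files that have no annotation file.
    for txt_path in sorted(root.rglob("*.txt")):
        if not txt_path.with_suffix(".ann").exists():
            unpaired.append(txt_path)
    return pairs, unpaired


def count_per_source(pairs):
    """Count paired documents per source.

    Assumption: the parent directory name of each file is the source name.
    """
    return Counter(ann_path.parent.name for ann_path, _ in pairs)


# Example call (hypothetical path; adjust to the actual layout of the repository):
# pairs, unpaired = pair_annotations("Dataset/v0.1/Phase - 1/Annotated")
# print(count_per_source(pairs), len(unpaired))
```

If the layout matches these assumptions, the per-source counts should come to 50 for every Phase - 1 source, and to 20 for every Phase - 2 source except Panchjanya with 19.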
This data is annotated but was not used in this MPhil work, in order to maintain the balance of the data used for automatic detection and for the case studies.
This data is still in raw form and was collected from social media. It is available to interested people, who can use it for future study in this direction.