Webcrawler to get link.
patil-ashutosh committed Jan 17, 2019
0 parents commit 1f92f43
Showing 11 changed files with 1,025 additions and 0 deletions.
163 changes: 163 additions & 0 deletions .ipynb_checkpoints/scrape-checkpoint.ipynb
@@ -0,0 +1,163 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def retrieve_from_web(url, user_agent, fname):\n",
" request = urllib.request.Request(url, headers = {'User-Agent': user_agent})\n",
" with urllib.request.urlopen(request) as response:  # close the connection when done\n",
"     html = response.read()\n",
" fname = '/home/ashutosh/Desktop/WebCrawler/HTML/' + fname\n",
" with open(fname, 'wb') as fp:  # flush and close the file even if the write fails\n",
"     fp.write(html)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def read_html():\n",
" with open('/home/ashutosh/Desktop/WebCrawler/HTML/medium_html', 'r') as fp:  # close the file after reading\n",
"     buff = fp.read()\n",
" return buff"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"\n",
"#from urllib.request import urlopen\n",
"import urllib.request\n",
"from bs4 import BeautifulSoup\n",
"user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'\n",
"url = 'https://medium.freecodecamp.org/'\n",
"#retrieve_from_web(url, user_agent, 'medium_html')\n",
"buff = read_html()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<ol>\n",
"<li><a href = https://medium.freecodecamp.org/the-mobile-app-launch-checklist-how-to-ship-apps-like-a-boss-84a20f5d8a45?source=collection_home---6------0--------------------->The Mobile App Launch Checklist — How to Ship Apps Like a Boss</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/how-to-master-async-await-with-this-real-world-example-19107e7558ad?source=collection_home---6------1--------------------->How To Master Async/Await With This Real World Example</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/here-are-some-super-secret-vs-code-hacks-to-boost-your-productivity-20d30197ac76?source=collection_home---6------2--------------------->Here are some super secret VS Code hacks to boost your productivity</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/removing-javascripts-this-keyword-makes-it-a-better-language-here-s-why-db28060cc086?source=collection_home---6------3--------------------->Removing JavaScript’s “this” keyword makes it a better language. Here’s why.</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/a-chaotic-mind-leads-to-chaotic-code-e7d6962777c0?source=collection_home---6------4--------------------->A chaotic mind leads to chaotic code</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/i-know-nothing-but-it-is-okay-6c0d9a4fe09f?source=collection_home---6------5--------------------->I know nothing, but it is okay</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/which-programming-language-should-you-learn-next-487d077baa32?source=collection_home---6------6--------------------->Which Programming Language Should You Learn Next?</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/how-to-create-a-discord-bot-under-15-minutes-fb2fd0083844?source=collection_home---6------7--------------------->How to create a Discord bot under 15 minutes</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/how-to-go-from-scratch-to-create-react-app-on-windows-a8a24687d595?source=collection_home---6------8--------------------->How to go from scratch to Create-React-App on Windows</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/how-i-built-an-async-form-validation-library-in-100-lines-of-code-with-react-hooks-81dbff6c4a04?source=collection_home---6------9--------------------->How I built an async form validation library in ~100 lines of code with React Hooks</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/introducing-abs-a-programming-language-for-shell-scripting-dfbd737d621?source=collection_home---6------10--------------------->Introducing ABS, a programming language for shell scripting</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/how-to-write-a-better-cv-the-web-developer-edition-6d27f37d4e67?source=collection_home---6------11--------------------->How to write a better CV— the Web Developer edition</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/the-react-handbook-b71c27b0a795?source=collection_home---6------12--------------------->The React Handbook</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/simple-site-hosting-with-amazon-s3-and-https-5e78017f482a?source=collection_home---6------13--------------------->Simple site hosting with Amazon S3 and HTTPS</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/how-to-host-a-static-website-with-s3-cloudfront-and-route53-7cbb11d4aeea?source=collection_home---6------14--------------------->How to Host a Static Website with S3, CloudFront and Route53</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/how-to-publish-an-application-in-the-play-store-8ddcc6dc3587?source=collection_home---6------15--------------------->How to Publish An Application In The Play Store</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/the-strategy-pattern-explained-using-java-bc30542204e0?source=collection_home---6------16--------------------->The Strategy Pattern explained using Java</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/how-to-calculate-binary-tree-height-with-the-recursive-method-aafc461f2201?source=collection_home---6------17--------------------->How to calculate Binary Tree height with the recursive method</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/i-landed-an-internship-at-facebook-here-are-some-tips-i-learned-b83685cde27?source=collection_home---6------18--------------------->I landed an internship at Facebook. Here are some tips I learned.</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/essential-gems-for-rails-applications-75fed43d2798?source=collection_home---6------19--------------------->Essential Gems for Rails Applications</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/securing-managing-secrets-using-google-cloud-kms-3fe08c69f499?source=collection_home---6------20--------------------->How to secure and manage secrets using Google Cloud KMS</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/how-to-pass-oracles-java-certifications-a-practical-guide-for-developers-e9b607ba6173?source=collection_home---6------21--------------------->How to Pass Oracle’s Java Certifications — a Practical Guide for Developers</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/master-the-art-of-looping-in-javascript-with-these-incredible-tricks-a5da1aa1d6c5?source=collection_home---6------22--------------------->Master the art of looping in JavaScript with these incredible tricks</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/the-art-of-asking-questions-84c01c9987a4?source=collection_home---6------23--------------------->The art of asking questions</a></li>\n",
"<li><a href = https://medium.freecodecamp.org/the-definitive-guide-to-contributing-to-open-source-900d5f9f2282?source=collection_home---6------24--------------------->The Definitive Guide to Contributing to Open Source</a></li>\n"
]
},
{
"data": {
"text/plain": [
"5748"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import time\n",
"soup = BeautifulSoup(buff, \"html.parser\")\n",
"#print(soup.prettify())\n",
"all_news = soup.find_all('a')\n",
"#print(all_news[0])\n",
"#print(type(all_news))\n",
"#html_links = \"<n_links>\"\n",
"html_links =\"<ol>\"\n",
"for news in all_news:\n",
" head = news.find('h3')\n",
" if head:\n",
" #<a href=\"https://www.w3schools.com/html/\">Visit our HTML tutorial</a> \n",
" lnks = \"<li><a href = {0}>{1}</a></li>\".format(news.get('href'), head.text)\n",
" html_links = html_links + \"\\n\" + lnks\n",
" #print((news.get('href')))\n",
" #print(type(head))\n",
" #print(head.attrs)\n",
" #print(head.text)\n",
"print(html_links)\n",
"html_links = html_links + \"</ol>\"\n",
"fname = '/home/ashutosh/Desktop/WebCrawler/result/'+ str(time.strftime(\"%y-%m%-d\")) + \".html\"\n",
"fp = open(fname, 'w')\n",
"fp.write(html_links)\n",
"#print(type(par))\n",
"#print(par)\n",
"#print((all_news[0].parent.name))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
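Review note: the link-extraction cell above keeps every `<a>` element that contains an `<h3>` and pairs its `href` with the heading text. The sketch below shows the same idea as a reusable function. It is not the notebook's code: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages, and the names `HeadlineLinkParser` and `extract_headline_links` are hypothetical. It also differs slightly in behaviour, since it skips anchors with no `href` or an empty heading and joins the text of every `<h3>` inside an anchor rather than taking only the first.

```python
from html.parser import HTMLParser


class HeadlineLinkParser(HTMLParser):
    """Collect (href, headline) pairs for every <a> that contains an <h3>."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, headline) pairs
        self._href = None     # href of the <a> currently open, if any
        self._in_h3 = False   # True while inside an <h3> within that <a>
        self._text = []       # text fragments of the current headline

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
        elif tag == "h3" and self._href is not None:
            self._in_h3 = True

    def handle_data(self, data):
        if self._in_h3:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "a":
            headline = "".join(self._text).strip()
            if self._href is not None and headline:
                self.links.append((self._href, headline))
            self._href = None
            self._text = []


def extract_headline_links(html_text):
    """Return a list of (href, headline) tuples found in html_text."""
    parser = HeadlineLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Small inline sample; the real input would be the saved medium_html page.
sample = (
    '<div><a href="/x"><h3>First</h3></a>'
    '<a href="/y">no heading</a>'
    '<a href="/z"><div><h3>Second &amp; third</h3></div></a></div>'
)
links = extract_headline_links(sample)
```

Returning plain tuples keeps parsing separate from HTML generation, so the list can be logged, filtered, or rendered independently.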
6 changes: 6 additions & 0 deletions .ipynb_checkpoints/test-checkpoint.ipynb
@@ -0,0 +1,6 @@
{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 2
}
62 changes: 62 additions & 0 deletions .ipynb_checkpoints/web_log-checkpoint.ipynb
@@ -0,0 +1,62 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"#for handler in logging.root.handlers[:]:\n",
"# logging.root.removeHandler(handler)\n",
"\n",
"logging.basicConfig(filename = \"wb.log\", format = '%(asctime)s-%(levelname)s - %(message)s', level=logging.INFO, filemode = 'w')\n",
"log = logging.getLogger(__name__)\n",
"#log.setLevel(20)\n",
"log.info(\"logging output\")"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
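Review note: `logging.basicConfig` has no effect when the root logger already has handlers, which is probably why the cell above carries a commented-out loop that removes them. A minimal sketch of one alternative follows, assuming a dedicated named logger with its own `FileHandler` is acceptable. The helper `get_file_logger` and the logger name `webcrawler_demo` are hypothetical; the format string is the one used in the notebook.

```python
import logging
import os
import tempfile


def get_file_logger(name, path, level=logging.INFO):
    """Return a logger that writes to `path`; safe to call repeatedly."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Drop handlers left over from an earlier run of the same cell.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    handler = logging.FileHandler(path, mode="w")
    handler.setFormatter(
        logging.Formatter("%(asctime)s-%(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False  # keep records away from any root handlers
    return logger


# Usage: write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "wb.log")
    log = get_file_logger("webcrawler_demo", log_path)
    log = get_file_logger("webcrawler_demo", log_path)  # simulate re-running the cell
    log.info("logging output")
    handler_count = len(log.handlers)
    for h in list(log.handlers):  # release the file before the directory is removed
        log.removeHandler(h)
        h.close()
    with open(log_path) as fp:
        logged_text = fp.read()
```

Because the helper replaces its own handlers on each call, re-running the cell does not duplicate log lines, and the root logger's configuration is left untouched.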
1 change: 1 addition & 0 deletions HTML/b.txt
@@ -0,0 +1 @@
sjsjj
237 changes: 237 additions & 0 deletions HTML/medium_html

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions app.log
@@ -0,0 +1 @@
root - INFO - s
26 changes: 26 additions & 0 deletions result/19-0116.html
@@ -0,0 +1,26 @@
<ol>
<li><a href = https://medium.freecodecamp.org/the-mobile-app-launch-checklist-how-to-ship-apps-like-a-boss-84a20f5d8a45?source=collection_home---6------0--------------------- style="color: black">The Mobile App Launch Checklist — How to Ship Apps Like a Boss</a></li>
<li><a href = https://medium.freecodecamp.org/how-to-master-async-await-with-this-real-world-example-19107e7558ad?source=collection_home---6------1--------------------->How To Master Async/Await With This Real World Example</a></li>
<li><a href = https://medium.freecodecamp.org/here-are-some-super-secret-vs-code-hacks-to-boost-your-productivity-20d30197ac76?source=collection_home---6------2--------------------->Here are some super secret VS Code hacks to boost your productivity</a></li>
<li><a href = https://medium.freecodecamp.org/removing-javascripts-this-keyword-makes-it-a-better-language-here-s-why-db28060cc086?source=collection_home---6------3--------------------->Removing JavaScript’s “this” keyword makes it a better language. Here’s why.</a></li>
<li><a href = https://medium.freecodecamp.org/a-chaotic-mind-leads-to-chaotic-code-e7d6962777c0?source=collection_home---6------4--------------------->A chaotic mind leads to chaotic code</a></li>
<li><a href = https://medium.freecodecamp.org/i-know-nothing-but-it-is-okay-6c0d9a4fe09f?source=collection_home---6------5--------------------->I know nothing, but it is okay</a></li>
<li><a href = https://medium.freecodecamp.org/which-programming-language-should-you-learn-next-487d077baa32?source=collection_home---6------6--------------------->Which Programming Language Should You Learn Next?</a></li>
<li><a href = https://medium.freecodecamp.org/how-to-create-a-discord-bot-under-15-minutes-fb2fd0083844?source=collection_home---6------7--------------------->How to create a Discord bot under 15 minutes</a></li>
<li><a href = https://medium.freecodecamp.org/how-to-go-from-scratch-to-create-react-app-on-windows-a8a24687d595?source=collection_home---6------8--------------------->How to go from scratch to Create-React-App on Windows</a></li>
<li><a href = https://medium.freecodecamp.org/how-i-built-an-async-form-validation-library-in-100-lines-of-code-with-react-hooks-81dbff6c4a04?source=collection_home---6------9--------------------->How I built an async form validation library in ~100 lines of code with React Hooks</a></li>
<li><a href = https://medium.freecodecamp.org/introducing-abs-a-programming-language-for-shell-scripting-dfbd737d621?source=collection_home---6------10--------------------->Introducing ABS, a programming language for shell scripting</a></li>
<li><a href = https://medium.freecodecamp.org/how-to-write-a-better-cv-the-web-developer-edition-6d27f37d4e67?source=collection_home---6------11--------------------->How to write a better CV— the Web Developer edition</a></li>
<li><a href = https://medium.freecodecamp.org/the-react-handbook-b71c27b0a795?source=collection_home---6------12--------------------->The React Handbook</a></li>
<li><a href = https://medium.freecodecamp.org/simple-site-hosting-with-amazon-s3-and-https-5e78017f482a?source=collection_home---6------13--------------------->Simple site hosting with Amazon S3 and HTTPS</a></li>
<li><a href = https://medium.freecodecamp.org/how-to-host-a-static-website-with-s3-cloudfront-and-route53-7cbb11d4aeea?source=collection_home---6------14--------------------->How to Host a Static Website with S3, CloudFront and Route53</a></li>
<li><a href = https://medium.freecodecamp.org/how-to-publish-an-application-in-the-play-store-8ddcc6dc3587?source=collection_home---6------15--------------------->How to Publish An Application In The Play Store</a></li>
<li><a href = https://medium.freecodecamp.org/the-strategy-pattern-explained-using-java-bc30542204e0?source=collection_home---6------16--------------------->The Strategy Pattern explained using Java</a></li>
<li><a href = https://medium.freecodecamp.org/how-to-calculate-binary-tree-height-with-the-recursive-method-aafc461f2201?source=collection_home---6------17--------------------->How to calculate Binary Tree height with the recursive method</a></li>
<li><a href = https://medium.freecodecamp.org/i-landed-an-internship-at-facebook-here-are-some-tips-i-learned-b83685cde27?source=collection_home---6------18--------------------->I landed an internship at Facebook. Here are some tips I learned.</a></li>
<li><a href = https://medium.freecodecamp.org/essential-gems-for-rails-applications-75fed43d2798?source=collection_home---6------19--------------------->Essential Gems for Rails Applications</a></li>
<li><a href = https://medium.freecodecamp.org/securing-managing-secrets-using-google-cloud-kms-3fe08c69f499?source=collection_home---6------20--------------------->How to secure and manage secrets using Google Cloud KMS</a></li>
<li><a href = https://medium.freecodecamp.org/how-to-pass-oracles-java-certifications-a-practical-guide-for-developers-e9b607ba6173?source=collection_home---6------21--------------------->How to Pass Oracle’s Java Certifications — a Practical Guide for Developers</a></li>
<li><a href = https://medium.freecodecamp.org/master-the-art-of-looping-in-javascript-with-these-incredible-tricks-a5da1aa1d6c5?source=collection_home---6------22--------------------->Master the art of looping in JavaScript with these incredible tricks</a></li>
<li><a href = https://medium.freecodecamp.org/the-art-of-asking-questions-84c01c9987a4?source=collection_home---6------23--------------------->The art of asking questions</a></li>
<li><a href = https://medium.freecodecamp.org/the-definitive-guide-to-contributing-to-open-source-900d5f9f2282?source=collection_home---6------24--------------------->The Definitive Guide to Contributing to Open Source</a></li></ol>
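Review note: the generated list above writes each `href` value without quotes and inserts titles unescaped, so a URL or headline containing a space, `&`, `<`, or `>` could produce malformed markup. The file name `19-0116.html` comes from the pattern `%y-%m%-d`, where `%-d` is a platform-specific extension rather than a portable directive. The sketch below shows one way to render the list with quoted, escaped attributes and a portable date pattern. The helpers `render_link_list` and `dated_filename` are hypothetical, and `%y-%m-%d` is an assumption about the intended naming, not something the commit confirms.

```python
import html
import time


def render_link_list(links):
    """Build an ordered HTML list from (href, title) pairs."""
    items = [
        '<li><a href="{0}">{1}</a></li>'.format(
            html.escape(href, quote=True),  # safe inside a quoted attribute
            html.escape(title),             # safe as element text
        )
        for href, title in links
    ]
    return "<ol>\n" + "\n".join(items) + "\n</ol>\n"


def dated_filename(when=None):
    """Return a name such as '19-01-16.html' for the given time tuple."""
    # Assumed pattern: two-digit year, month, and day separated by hyphens.
    return time.strftime("%y-%m-%d", when or time.localtime()) + ".html"


# Usage with a made-up link; example.com is a placeholder domain.
page = render_link_list([("https://example.com/?a=1&b=2", "Tips & <tricks>")])
name = dated_filename((2019, 1, 16, 0, 0, 0, 2, 16, 0))
```

Writing `page` to disk inside a `with open(...)` block would also ensure that the closing `</ol>` is flushed to the file.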