Data Collection


Tools

  1. requests

    import requests
    # Fetch a static page; the response body holds the full HTML
    response = requests.get(urlpage)
    html = response.text

    It returns the whole HTML of a static page.

  2. Beautiful Soup -- for static pages

    import requests
    from bs4 import BeautifulSoup
    # Parse the fetched HTML and locate a sortable wikitable
    response = requests.get(urlpage)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', {'class': 'wikitable sortable'})

    It is used to parse and scrape static pages; see the row-extraction sketch below.
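
    Once the table is found, its rows can be read cell by cell. A minimal sketch continuing the snippet above; skipping the first row as a header and mixing td/th cells are assumptions about a typical wikitable:

    # Collect the text of every cell, row by row
    rows = []
    for tr in table.find_all('tr')[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all(['td', 'th'])]
        if cells:
            rows.append(cells)
    print(rows[:3])  # peek at the first parsed rows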

  3. Selenium -- for dynamic pages

    It automates web browser interaction from Python. The browser and its driver need to be installed first; a Python sketch follows the install commands below.

    sudo apt-get install firefox
    sudo apt-get install firefox-geckodriver
    pip install selenium
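
    A minimal headless-Firefox sketch, assuming geckodriver is on PATH after the install above; the URL is a placeholder:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument('--headless')  # run Firefox without a window
    driver = webdriver.Firefox(options=options)
    driver.get('https://example.com')   # placeholder URL; JavaScript-rendered pages load fully here
    print(driver.title)
    driver.quit()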
  4. GitHub repo google-images-download

    Run from the terminal:

    python3 bing_scraper.py --search 'living room' --limit 100 --download --chromedriver /home/yui/Downloads/chromedriver_linux64/chromedriver
  5. TWINT -- Twitter Intelligence Tool
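
    A minimal sketch of TWINT's Python interface; the search term and output path are placeholders:

    import twint

    # Scrape recent tweets matching a search and store them as CSV
    c = twint.Config()
    c.Search = 'keyword'      # placeholder search term
    c.Limit = 100             # stop after roughly 100 tweets
    c.Store_csv = True
    c.Output = 'tweets.csv'   # placeholder output file
    twint.run.Search(c)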
  6. Wikimedia pages-articles dumps: Go to Link → Database backup dumps → SQL/XML dumps -- enwiki → Articles, templates, ... → enwiki-20210501-pages-articles-multistream1.xml-p1p41242.bz2 (237.5 MB). Convert the dump to plain text with wp2txt:

    sudo apt install ruby-full
    gem install wp2txt
    wp2txt -i file_name
  7. Kaggle: create an account → Profile Icon → Account → API: Create New API Token → move the credentials JSON into ~/.kaggle → start downloading.

    mv ~/Downloads/kaggle.json ~/.kaggle
    chmod 600 ~/.kaggle/kaggle.json  # keep the API key private; the CLI warns otherwise
    kaggle datasets download -d sorour/95cloud-cloud-segmentation-on-satellite-images
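
    The same download can also be scripted through Kaggle's Python API; a minimal sketch, assuming kaggle.json is already in ~/.kaggle:

    from kaggle.api.kaggle_api_extended import KaggleApi

    # Authenticate from ~/.kaggle/kaggle.json and fetch the dataset
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(
        'sorour/95cloud-cloud-segmentation-on-satellite-images',
        path='data', unzip=True)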

References

  1. YouTube Faces Dataset
  2. CelebA Dataset
  3. iBUG dataset for face landmark models
  4. Trillion Pairs Dataset
  5. SESYD Dataset for floorplans
  6. Common Objects in Context (COCO Dataset)
  7. MNIST
  8. Image Scraping with Python
  9. Modern Web Automation With Python and Selenium
  10. Ultimate Guide to Web Scraping with Python Part 1: Requests and BeautifulSoup
  11. Converting dumps.wikimedia.org pages-articles with wp2txt