Data Collection


Tools

  1. requests

    import requests
    # Fetch a static page; the response body holds the full HTML
    response = requests.get(urlpage)
    html = response.text

    It returns the whole HTML of a static page.

  2. Beautiful Soup -- for static pages

    import requests
    from bs4 import BeautifulSoup
    # Parse the fetched HTML and locate a sortable wikitable
    response = requests.get(urlpage)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', {'class': 'wikitable sortable'})

    It is used to parse and scrape static pages; see the row-extraction sketch below.
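
    Once the table is found, its rows can be read cell by cell. A minimal sketch continuing the snippet above; skipping the first row as a header and mixing td/th cells are assumptions about a typical wikitable:

    # Collect the text of every cell, row by row
    rows = []
    for tr in table.find_all('tr')[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all(['td', 'th'])]
        if cells:
            rows.append(cells)
    print(rows[:3])  # peek at the first parsed rows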

  3. Selenium -- for dynamic pages

    It automates web browser interaction from Python. The browser and its driver need to be installed first; a Python sketch follows the install commands below.

    sudo apt-get install firefox
    sudo apt-get install firefox-geckodriver
    pip install selenium
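
    A minimal headless-Firefox sketch, assuming geckodriver is on PATH after the install above; the URL is a placeholder:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument('--headless')  # run Firefox without a window
    driver = webdriver.Firefox(options=options)
    driver.get('https://example.com')   # placeholder URL; JavaScript-rendered pages load fully here
    print(driver.title)
    driver.quit()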
  4. GitHub repo google-images-download

    Run from the terminal:

    python3 bing_scraper.py --search 'living room' --limit 100 --download --chromedriver /home/yui/Downloads/chromedriver_linux64/chromedriver
  5. TWINT -- Twitter Intelligence Tool
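
    A minimal sketch of TWINT's Python interface; the search term and output path are placeholders:

    import twint

    # Scrape recent tweets matching a search and store them as CSV
    c = twint.Config()
    c.Search = 'keyword'      # placeholder search term
    c.Limit = 100             # stop after roughly 100 tweets
    c.Store_csv = True
    c.Output = 'tweets.csv'   # placeholder output file
    twint.run.Search(c)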
  6. Wikimedia pages-articles dumps: Go to Link → Database backup dumps → SQL/XML dumps -- enwiki → Articles, templates, ... → enwiki-20210501-pages-articles-multistream1.xml-p1p41242.bz2 (237.5 MB). Convert the dump to plain text with wp2txt:

    sudo apt install ruby-full
    gem install wp2txt
    wp2txt -i file_name
  7. Kaggle: create an account → Profile Icon → Account → API: Create New API Token → move the credentials JSON into ~/.kaggle → start downloading.

    mv ~/Downloads/kaggle.json ~/.kaggle
    chmod 600 ~/.kaggle/kaggle.json  # keep the API key private; the CLI warns otherwise
    kaggle datasets download -d sorour/95cloud-cloud-segmentation-on-satellite-images
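
    The same download can also be scripted through Kaggle's Python API; a minimal sketch, assuming kaggle.json is already in ~/.kaggle:

    from kaggle.api.kaggle_api_extended import KaggleApi

    # Authenticate from ~/.kaggle/kaggle.json and fetch the dataset
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(
        'sorour/95cloud-cloud-segmentation-on-satellite-images',
        path='data', unzip=True)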

References

  1. YouTube Faces Dataset
  2. CelebA Dataset
  3. iBUG dataset for face landmark models
  4. Trillion Pairs Dataset
  5. SESYD Dataset for floorplans
  6. Common Objects in Context (COCO Dataset)
  7. MNIST
  8. Image Scraping with Python
  9. Modern Web Automation With Python and Selenium
  10. Ultimate Guide to Web Scraping with Python Part 1: Requests and BeautifulSoup
  11. Converting dumps.wikimedia.org pages-articles with wp2txt