Do you need social media data for your machine learning project?

- Twitter data?
- Reddit data?
- Facebook data?

Where to get it?

Reddit: Pushshift

Pushshift is a big-data storage and analytics project.

Most people know it for its copy of reddit comments and submissions.

https://t.co/ne0XqKIt9A
Reddit: Pushshift API

The https://t.co/HWZNWEvrxY Reddit API was designed and created by the /r/datasets mod team to provide enhanced search capabilities for Reddit comments and submissions.

https://t.co/FkUE7R2jlb
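
A minimal sketch of how a search request against the API could be put together. The base URL and the parameter names (q, subreddit, size) are assumptions on my side and are not taken from the thread, so check them against the API docs before relying on them. The helper only builds the URL, it does not send anything.

```python
from urllib.parse import urlencode

# Assumed base URL of the public Pushshift search endpoint (not given in the thread).
BASE_URL = "https://api.pushshift.io/reddit/search"


def build_search_url(kind: str, **params) -> str:
    """Build a search URL for Reddit comments or submissions.

    kind is "comment" or "submission"; params are passed through
    unchanged as query parameters (for example q, subreddit, size).
    """
    if kind not in ("comment", "submission"):
        raise ValueError("kind must be 'comment' or 'submission'")
    # Sort the parameters so the same input always gives the same URL.
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{kind}/?{query}"
```

Usage: build_search_url("comment", q="machine learning", subreddit="datasets", size=25) gives a URL you can open in a browser or fetch with any HTTP client.
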
Reddit: Pushshift file download

Note: The latest data for manual download is from April 2020

https://t.co/jBM4U71dnm
Reddit: PMAW: Pushshift Multithread API Wrapper

PMAW is an ultra minimalist wrapper for the Pushshift API which uses multithreading to retrieve Reddit comments and submissions.

If you pull data via Pushshift use PMAW, highly recommended!

https://t.co/xSlaX3Di6T
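
A rough sketch of a PMAW pull, assuming the package is installed with pip install pmaw and that the Pushshift API is reachable. The function names to_epoch and fetch_comments are my own, only PushshiftAPI and search_comments come from PMAW.

```python
from datetime import datetime, timezone


def to_epoch(year: int, month: int, day: int) -> int:
    """Convert a UTC calendar date to epoch seconds for the after/before filters."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def fetch_comments(subreddit: str, after: int, before: int, limit: int = 1000) -> list:
    """Pull comments from one subreddit inside a time window via PMAW."""
    # Imported here so to_epoch works even when PMAW is not installed.
    from pmaw import PushshiftAPI  # pip install pmaw

    api = PushshiftAPI()
    comments = api.search_comments(
        subreddit=subreddit, after=after, before=before, limit=limit
    )
    # The response is iterable; materialize it into a plain list of dicts.
    return [comment for comment in comments]
```

Usage: fetch_comments("science", to_epoch(2021, 1, 1), to_epoch(2021, 2, 1)) would request up to 1000 comments from January 2021.
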
Reddit: Redditsearch

A frontend that uses Pushshift for detailed searches by subreddit or domain.

https://t.co/8C37LM7aTx
Twitter: Stream as download

The Internet Archive is a digital library of Internet sites and other cultural artifacts in digital form.

Note: The last archived data is from January 2021

https://t.co/hywWgFtbjq
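
A small sketch for reading the downloaded stream files. It assumes the archive unpacks into bz2-compressed files with one JSON object per line and that deletion notices carry a "delete" key; neither detail is stated in the thread, so verify against the files you actually download. iter_tweets is a name I made up.

```python
import bz2
import io
import json
from typing import IO, Iterator


def iter_tweets(raw: IO[bytes]) -> Iterator[dict]:
    """Yield tweet dicts from a bz2-compressed, newline-delimited JSON stream."""
    with bz2.open(raw, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: deletion notices have a "delete" key and no tweet body.
            if "delete" in record:
                continue
            yield record
```

Usage: open a file in binary mode and loop, for example with open(path, "rb") as fh: for tweet in iter_tweets(fh): ...
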
Twitter: Tweepy Twitter for Python!

An easy-to-use Python library for accessing the Twitter API.

Note: The downside is the API limitations of Twitter, so you need a lot of time.

https://t.co/dhc7x1lZ8U
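
One way to live with the rate limits is to let Tweepy wait for you. A sketch assuming Tweepy 4.x, the Twitter API v2 recent search endpoint, and a bearer token of your own; build_query and fetch_recent_tweets are hypothetical helper names.

```python
def build_query(term: str, lang: str = "en", include_retweets: bool = False) -> str:
    """Compose a search query with a language filter, dropping retweets by default."""
    parts = [term, f"lang:{lang}"]
    if not include_retweets:
        parts.append("-is:retweet")
    return " ".join(parts)


def fetch_recent_tweets(bearer_token: str, term: str, limit: int = 500) -> list:
    """Collect the text of up to `limit` recent tweets matching `term`."""
    # Imported here so build_query works even when Tweepy is not installed.
    import tweepy  # pip install tweepy

    # wait_on_rate_limit=True makes the client sleep until the limit resets
    # instead of raising an error.
    client = tweepy.Client(bearer_token=bearer_token, wait_on_rate_limit=True)
    pages = tweepy.Paginator(
        client.search_recent_tweets, query=build_query(term), max_results=100
    )
    return [tweet.text for tweet in pages.flatten(limit=limit)]
```

The waiting is exactly why you need a lot of time: the script keeps running, it just pauses whenever the limit is hit.
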
Twitter: Script

Most Twitter scrapers are banned by Twitter or no longer work, so here is a simple, unlimited Twitter scraper written in Python that needs no authentication.

Note: Headless mode no longer works, and it uses Selenium to access Twitter.

https://t.co/feZsbOFmJR
Facebook: Scrape Facebook public pages without an API key.

$ pip install facebook-scraper

https://t.co/Rh4s93P1YD
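
A short sketch of what a scrape could look like after the install. get_posts comes from facebook-scraper; slim_post, fetch_page_posts, and the selection of fields are my own assumptions about what is useful for an ML dataset.

```python
def slim_post(post: dict) -> dict:
    """Keep only a few fields of a scraped post; missing fields become None."""
    keys = ("post_id", "time", "text", "likes", "comments", "shares")
    return {key: post.get(key) for key in keys}


def fetch_page_posts(page: str, pages: int = 2) -> list:
    """Scrape the first result pages of a public Facebook page."""
    # Imported here so slim_post works even when the package is not installed.
    from facebook_scraper import get_posts  # pip install facebook-scraper

    return [slim_post(post) for post in get_posts(page, pages=pages)]
```

Usage: fetch_page_posts("nintendo") would return a list of small dicts, one per post.
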
Facebook: Large Page-Page Network data

Nodes represent official Facebook pages while the links are mutual likes between sites.

Node features are extracted from the site descriptions that the page owners created to summarize the purpose of the site.

https://t.co/tLSvGA95Az
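
Since the links are mutual likes, the graph can be treated as undirected. A minimal loader sketch, assuming the edge list ships as a CSV with a header row and two integer node ids per line; that file layout is an assumption and load_edges is a hypothetical name.

```python
import csv
import io
from collections import defaultdict
from typing import IO, Dict, Set


def load_edges(handle: IO[str]) -> Dict[int, Set[int]]:
    """Read a two-column edge list into an undirected adjacency map."""
    graph = defaultdict(set)
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        a, b = int(row[0]), int(row[1])
        # Mutual likes: store the link in both directions.
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)
```

Usage: with open(path, newline="") as fh: graph = load_edges(fh), then len(graph[node]) is the degree of a page.
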
Octoparse: Easy Web Scraping for Anyone

Everything you need to automate your web scraping.

Note: It's a paid service.

https://t.co/f0bBLpkxSZ
Spread the open source love!

If you know an amazing project, drop me a message @philipvollet
we need this edit function. my inner zen isn't balanced every time i spot a typo
