I was wondering if I could get advice on how I should go about scraping tweets via Python. So far I'm able to scrape tweets containing the word "Google" from Nov 27th through Dec 6th while abiding by the QPS limit. However, it takes very long just to download 30 minutes' worth of tweets; a day's worth of tweets is taking me 2-3 days to download. My end goal is to plot a trend of how people feel about Google by taking the average polarity value for each day (where I'm classifying each tweet with a Naive Bayes and a MaxEnt classifier). Is there any way to speed up this process? Or could I have the code scrape a randomized sample of each day, so that there is no bias when I run the Naive Bayes and MaxEnt classifiers on a smaller sample?
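To make the "randomized sample" idea concrete, here is a minimal sketch of Bernoulli sampling: keep each tweet independently with a fixed probability `p`, so every tweet of a day has the same chance of being kept and the day's average polarity is estimated without systematic bias. The `sample_stream` helper and the toy tweet list are hypothetical names, not part of my current code:

```python
import random

def sample_stream(tweets, p=0.1, seed=42):
    """Keep each tweet independently with probability p.

    Because every tweet has the same inclusion probability,
    the subsample's per-day average polarity is an unbiased
    estimate of the full day's average.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [t for t in tweets if rng.random() < p]

# Toy usage with placeholder tweets standing in for one day's download.
day = [{"text": "tweet %d" % i} for i in range(1000)]
subsample = sample_stream(day, p=0.1)
print(len(subsample))  # about p * len(day) tweets survive
```

With `p=0.1` the classifiers would only see a tenth of the volume, which should cut both download time (if sampling is applied to tweet IDs before fetching) and classification time.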
This is what I have so far:
import tweepy
from textblob import TextBlob
import csv
import time

consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''

# Let's authenticate with Twitter, which means login via code.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Connect to the API only if a connection isn't already set,
# to avoid blowing out the API rate limit.
try:
    api
    print("Twitter API connection already set.")
except NameError:
    api = tweepy.API(auth)
    print("Setting Twitter API connection.")

tweet_set = tweepy.Cursor(api.search,
                          q="Google",
                          since="2016-11-27",
                          until="2016-12-06",
                          lang="en").items()

def getElements(tweet):
    # Pull the fields of interest out of the raw tweet JSON.
    user = tweet.get('user').get('screen_name')
    txt = tweet.get('text')
    dt = tweet.get('created_at')
    tb = TextBlob(txt)
    return user, txt, dt, tb

i = 0
csvFile = open('tweets.csv', 'a')
TweetWriter = csv.writer(csvFile, delimiter=',')

for tweet in tweet_set:
    i += 1
    if i % 1500 == 0:
        # Back off for 15 minutes every 1500 tweets to stay under the rate limit.
        time.sleep(60 * 15)
    try:
        user, txt, dt, tb = getElements(tweet._json)
        TweetWriter.writerow([user, txt, dt, tb])
    except (AttributeError, KeyError):
        # Skip tweets missing an expected field.
        pass

csvFile.close()
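For the end goal, here is a minimal sketch of the per-day averaging step, assuming the polarity scores have already been extracted (e.g. from TextBlob's `sentiment.polarity` or the classifiers) into `(date, polarity)` pairs. The `daily_average_polarity` helper and the sample data are hypothetical, for illustration only:

```python
from collections import defaultdict

def daily_average_polarity(scored_tweets):
    """Average polarity per day from (date_string, polarity) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, polarity in scored_tweets:
        totals[day] += polarity
        counts[day] += 1
    # One average per day; this dict can be fed straight into a plot.
    return {day: totals[day] / counts[day] for day in totals}

# Toy usage with made-up scores.
scored = [("2016-11-27", 0.5), ("2016-11-27", -0.1), ("2016-11-28", 0.2)]
print(daily_average_polarity(scored))
```

The resulting day-to-average mapping is what would be plotted as the trend line.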
I'd greatly appreciate any advice.