by Dhanny Indrakusuma
The Tinder Swindler is a viral British true-crime documentary film directed by Felicity Morris and released on Netflix on 2 February 2022. The film tells the story of an Israeli con artist, Simon Leviev, who used the dating application Tinder to find individuals whom he emotionally manipulated into financing his lavish lifestyle. For this project, I analyzed whether tweet activity in a given region affects the viewership of this Netflix Original show.
To create the database for my analysis, I used a total of five data sources, gathered from three different organizations: Netflix, Nasdaq, and Kaggle.
global_all_weeks.csv
- contains the weekly global rank of each title in the weekly top 10 and the hours viewed
countries_all_weeks.csv
- contains the weekly per-country rank of each title in the weekly top 10
netflix_stocks.csv
- retrieved from: https://www.nasdaq.com/market-activity/stocks/nflx/historical
match_group_stocks.csv
- retrieved from: https://www.nasdaq.com/market-activity/stocks/mtch/historical
Tweets.csv
- retrieved from: https://www.kaggle.com/datasets/deepcontractor/100k-tinder-swindler-tweets
The CSV files were then loaded into the cloud server, and the original structure of the data can be seen below:
There were some issues that had to be managed when reading the data sources into the tinder_swindler database; these are outlined in the "Data Wrangling Process" section.
The weeks table acts as a connection between the viewerships and tweets tables through the days table. The weeks table also connects the viewerships and stock_observations tables through the days table. The tweets and viewerships tables are connected through the regions and location_names tables.
ER Diagram
Relational Vocab
Viewership belongs_to Week
Week has_many Viewership
Viewership belongs_to Region
Region has_many Viewership
LocationName belongs_to Region
Region has_many LocationName
Day belongs_to Week
Week has_many Day
StockObservation belongs_to Day
Day has_many StockObservation
Tweet belongs_to Day
Day has_many Tweet
Tweet belongs_to LocationName
LocationName has_many Tweet
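These relationships translate directly into foreign-key joins. As a minimal sketch (assuming the week_id, day_id, and region_id columns defined in the database design below), weekly tweet counts can be lined up with the global viewership like this:
import psycopg2
from psycopg2.extras import RealDictCursor
# sketch only: Tweet -> Day -> Week, and Viewership -> Week / Region
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
    with conn.cursor(cursor_factory=RealDictCursor) as cursor:
        cursor.execute("""
            SELECT weeks.year_week,
                   COUNT(tweets.id) AS tweet_count,
                   MAX(viewerships.hours_viewed) AS hours_viewed
            FROM tweets
            JOIN days ON tweets.day_id = days.id
            JOIN weeks ON days.week_id = weeks.id
            JOIN viewerships ON viewerships.week_id = weeks.id
            WHERE viewerships.region_id = 1  -- 1 is the 'Global' placeholder
            GROUP BY weeks.year_week
        """)
        print(cursor.fetchall())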
Sample Table
For this project, I managed synonyms of region names in the regions table by creating a has_many relationship to the location_names table through informal_location. The location_name_id column on the tweets table identifies the origin of a tweet, and the region_id column identifies the region/country related to the viewership metric.
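As an illustration of this mapping (a sketch only, assuming the schema described below and using "Canada" as an example region), all informal location strings resolved to a single region could be listed like this:
import psycopg2
from psycopg2.extras import RealDictCursor
# sketch only: one regions row, many location_names rows pointing at it
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
    with conn.cursor(cursor_factory=RealDictCursor) as cursor:
        cursor.execute("""
            SELECT regions.name, location_names.informal_location
            FROM location_names
            JOIN regions ON location_names.region_id = regions.id
            WHERE regions.name = %(region)s
        """, {"region": "Canada"})
        for row in cursor.fetchall():
            print(row['name'], '<-', row['informal_location'])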
The above figure represents the flowchart for my project workflow.
The data was inserted into the database using Python's psycopg2 library. Depending on the table, the SQL queries were written in a way that avoids duplicate records. Using Python's pandas library, I then created dataframes from the resulting CSV outputs. Next, I analyzed how the popularity of the Netflix Original show "The Tinder Swindler" influences Netflix and Match Group stock performance, and whether tweet activity in a given country affects the show's viewership in that country. The resulting visualizations were created using Python's Matplotlib or seaborn libraries.
Prior to importing data into the database, there were several pre-processing steps that needed to be executed to clean the data:
- I converted the date or week columns into a YYYY-MM-DD date format prior to insertion into the database. The resulting dates are inserted into the days table. At the same time, each date was converted into a "YYYY_WW" string and inserted into the weeks table.
- In Tweets.csv, the user_location column contains location strings with emojis, which first need to be removed using Python's emoji library; the cleaned string is inserted into the location_names table as a value in the informal_location column. Next, it is processed with Python's geopy library to obtain its country name (if one exists), which is inserted as a value in the geocode_raw column and then used to connect to the regions table of the database. When geopy returns "None" or the value in user_location is null, "Global" is inserted as the value instead.
- In the Close/Last column of both netflix_stocks.csv and match_group_stocks.csv, I removed the "$" sign and converted the values into float format. I then inserted the corresponding company's name along with its closing price and traded volume.
- Country names from countries_all_weeks.csv and global_all_weeks.csv are inserted into the regions table. For the global-level metric, I inserted "Global" as a placeholder.
- From countries_all_weeks.csv and global_all_weeks.csv, I am only inserting the weekly_hours_viewed value, the date from the week column, and the weekly_rank, and identifying the region_id based on whether the record came from the global metric or a country-specific metric.
- Dates from netflix_stocks.csv and match_group_stocks.csv are inserted into the days table of the database.
- From Tweets.csv, I inserted the user_location information into the location_names table and the text into the tweets table. I intentionally left the string format of the text column as is, in case there is interest in conducting sentiment analysis on the tweets themselves and how they relate to the show's viewership. However, the current project scope does not go as far as conducting sentiment analysis.
To start, we will create the tinder_swindler database on the cloud server, as well as the respective tables needed to store the data, following the database design outlined in the previous section. The code used for the database creation is displayed below.
# import libraries
import psycopg2
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
from psycopg2.extras import RealDictCursor
# create database
conn = psycopg2.connect(host='localhost');
conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT);
cursor = conn.cursor();
cursor.execute("DROP database tinder_swindler;")
cursor.execute("CREATE DATABASE tinder_swindler;")
cursor.close()
conn.close()
# create tables within database
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
cursor.execute(
'''
DROP TABLE IF EXISTS weeks;
CREATE TABLE weeks (
id SERIAL PRIMARY KEY,
year_week VARCHAR
);
DROP TABLE IF EXISTS regions;
CREATE TABLE regions (
id SERIAL PRIMARY KEY,
name TEXT
);
INSERT INTO regions(name) VALUES ('Global'); -- always 1 for 'Global'
DROP TABLE IF EXISTS viewerships;
CREATE TABLE viewerships (
id SERIAL PRIMARY KEY,
week_id INTEGER,
region_id INTEGER,
rank INTEGER,
hours_viewed INTEGER
);
DROP TABLE IF EXISTS location_names;
CREATE TABLE location_names (
id SERIAL PRIMARY KEY,
region_id INTEGER, -- region_id only there if successfully geocoded.
informal_location VARCHAR, -- we hope to map this to region_id
geocode_raw TEXT
);
DROP TABLE IF EXISTS days;
CREATE TABLE days (
id SERIAL PRIMARY KEY,
week_id INTEGER,
date DATE
);
DROP TABLE IF EXISTS stock_observations;
CREATE TABLE stock_observations (
id SERIAL PRIMARY KEY,
day_id INTEGER,
company TEXT,
close_price NUMERIC,
volume INTEGER
);
DROP TABLE IF EXISTS tweets;
CREATE TABLE tweets (
id SERIAL PRIMARY KEY,
day_id INTEGER,
location_name_id INTEGER,
text VARCHAR
);
'''
)
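As an optional sanity check, assuming the tables live in the default public schema, we can confirm that all seven tables exist before loading any data:
# list the tables that should now exist in the tinder_swindler database
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
    with conn.cursor() as cursor:
        cursor.execute("""
            SELECT table_name
            FROM information_schema.tables
            WHERE table_schema = 'public'
            ORDER BY table_name
        """)
        print([row[0] for row in cursor.fetchall()])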
Prior to loading data into the tinder_swindler database, parts of the data need to be pre-processed as described in the "Data Wrangling Process" section. The code used for pre-processing and insertion into the database is displayed below.
# install required packages
%pip install emoji
%pip install geopy
# import additional libraries
import csv
import re
import emoji
from datetime import datetime
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
def tweet_date(date):
# parsing date in tweets
only_datetime = date[:date.index("+")]
return datetime.strptime(only_datetime, "%Y-%m-%d %H:%M:%S").date()
def get_week(date):
# get week in format YYYY_WW
if date.isocalendar()[1]/10 >= 1:
return str(date.year) + '_'+ str(date.isocalendar()[1])
else:
return str(date.year) + '_0'+ str(date.isocalendar()[1])
def show_week(date):
# get week in format YYYY_WW for show metrics
date = datetime.strptime(date, "%Y-%m-%d").date()
if date.isocalendar()[1]/10 >= 1:
return str(date.year) + '_'+ str(date.isocalendar()[1])
else:
return str(date.year) + '_0'+ str(date.isocalendar()[1])
def stock_date(date):
# parse date for stocks
return datetime.strptime(date, "%m/%d/%Y").date()
def remove_emoji(string):
# remove emoji from tweets user_location
return emoji.replace_emoji(string, replace='').strip()
def remove_sign(string):
# remove "$" on stock price, converting to float
return float(string.replace("$",""))
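As a quick sanity check, the helpers can be exercised on sample inputs (the tweet timestamp format shown is an assumption based on how tweet_date parses it):
# expected outputs shown in the trailing comments
print(stock_date("02/09/2022"))                  # 2022-02-09
print(get_week(stock_date("02/09/2022")))        # 2022_06 (ISO week, zero-padded)
print(show_week("2022-02-13"))                   # 2022_06
print(tweet_date("2022-02-09 18:34:46+00:00"))   # 2022-02-09
print(remove_emoji("Toronto 🍁"))                 # Toronto
print(remove_sign("$391.82"))                    # 391.82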
We start by inserting data from global_all_weeks.csv. We are only inserting metric values related to the show "The Tinder Swindler" and ignoring the rest.
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
with open('global_all_weeks.csv') as csvfile:
myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
for row in myCSVReader:
if row['show_title'] != 'The Tinder Swindler':
continue
week_num = show_week(row['week'])
# need to get the week_id
sql = """
INSERT INTO weeks(year_week)
VALUES (%(year_week)s)
RETURNING id
"""
param_dict = {"year_week": week_num}
cursor.execute(sql, param_dict)
week_id = cursor.fetchone()['id']
# also need to insert viewership data
viewership_sql = """
INSERT into viewerships(week_id, region_id, rank, hours_viewed)
VALUES (%(week_id)s, %(region_id)s, %(weekly_rank)s, %(weekly_hours_viewed)s)
"""
param_dict = {'week_id': week_id,
'region_id': 1, # hard coded region for Global
'weekly_rank': row['weekly_rank'],
'weekly_hours_viewed': row['weekly_hours_viewed']}
cursor.execute(viewership_sql, param_dict)
Next, we insert data from countries_all_weeks.csv, making sure that we don't create duplicates of year_week in the weeks table or of name in the regions table. Again, we only insert metric values related to the show "The Tinder Swindler" and ignore the rest.
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
with open('countries_all_weeks.csv') as csvfile:
myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
for row in myCSVReader:
if row['show_title'] != 'The Tinder Swindler':
continue
# check if week_id already exist: fetch id, else: insert and get id
week_num = show_week(row['week'])
w_dict = {"year_week": week_num}
cursor.execute("SELECT id FROM weeks WHERE year_week = %(year_week)s", w_dict)
if (cursor.rowcount == 1):
week_id = cursor.fetchone()['id']
else:
cursor.execute("""
INSERT INTO weeks(year_week)
VALUES (%(year_week)s)
RETURNING id
""", w_dict)
week_id = cursor.fetchone()['id']
# check if region_id already exist: fetch id, else: insert and get id
region = row['country_name']
r_dict = {"region": region}
cursor.execute("SELECT id FROM regions WHERE name = %(region)s", r_dict)
if (cursor.rowcount == 1):
region_id = cursor.fetchone()['id']
else:
cursor.execute("""
INSERT into regions(name)
VALUES (%(region)s)
RETURNING id
""", r_dict)
region_id = cursor.fetchone()['id']
# also need to insert viewership data
viewership_sql = """
INSERT into viewerships(week_id, region_id, rank, hours_viewed)
VALUES (%(week_id)s, %(region_id)s, %(weekly_rank)s, %(weekly_hours_viewed)s)
"""
param_dict = {'week_id': week_id,
'region_id': region_id,
'weekly_rank': row['weekly_rank'],
'weekly_hours_viewed': None}
cursor.execute(viewership_sql, param_dict)
Next, we insert the match_group_stocks.csv data, again handling duplicates in weeks. At this point we simply insert the date values into the days table, since we know each value is unique and this is our first insertion into the days table.
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
with open('match_group_stocks.csv') as csvfile:
myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
for row in myCSVReader:
date = stock_date(row['Date'])
week_num = get_week(date)
price = remove_sign(row['Close/Last'])
# check if week_id already exist: fetch id, else: insert and get id
w_dict = {"year_week": week_num}
cursor.execute("SELECT id FROM weeks WHERE year_week = %(year_week)s", w_dict)
if (cursor.rowcount == 1):
week_id = cursor.fetchone()['id']
else:
cursor.execute("""
INSERT INTO weeks(year_week)
VALUES (%(year_week)s)
RETURNING id
""", w_dict)
week_id = cursor.fetchone()['id']
# insert date, get id
sql = """
INSERT INTO days(week_id, date)
VALUES (%(week_id)s, %(date)s)
RETURNING id
"""
d_dict = {"week_id": week_id, "date": date}
cursor.execute(sql, d_dict)
day_id = cursor.fetchone()['id']
# also insert stock_observations data
stock_sql = """
INSERT into stock_observations(day_id, company, close_price, volume)
VALUES (%(day_id)s, %(company)s, %(close_price)s, %(volume)s)
"""
param_dict = {'day_id': day_id,
'company': 'Match Group',
'close_price': price,
'volume': row['Volume']}
cursor.execute(stock_sql, param_dict)
Next, we insert the netflix_stocks.csv data. Since the dates here are exactly the same as in match_group_stocks.csv, we simply fetch the day_id as we insert values into the stock_observations table.
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
with open('netflix_stocks.csv') as csvfile:
myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
for row in myCSVReader:
date = stock_date(row['Date'])
price = remove_sign(row['Close/Last'])
# get date id since dates are same as match group stocks
d_dict = {"week_id": week_id, "date": date}
cursor.execute("SELECT id FROM days WHERE date = %(date)s", d_dict)
if (cursor.rowcount == 1):
day_id = cursor.fetchone()['id']
# also insert stock_observations data
stock_sql = """
INSERT into stock_observations(day_id, company, close_price, volume)
VALUES (%(day_id)s, %(company)s, %(close_price)s, %(volume)s)
"""
param_dict = {'day_id': day_id,
'company': 'Netflix',
'close_price': price,
'volume': row['Volume']}
cursor.execute(stock_sql, param_dict)
Next, we insert data from Tweets.csv. In this step we remove emojis present in user_location and insert the result as informal_location values in the location_names table. We also fetch the day_id as we insert the text into the tweets table.
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
with open('Tweets.csv') as csvfile:
myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
for row in myCSVReader:
# insert informal_location
location = remove_emoji(row['user_location'])
l_dict = {"location": location}
cursor.execute("""
SELECT id FROM location_names
WHERE informal_location = %(location)s
""", l_dict)
if (cursor.rowcount == 1):
location_id = cursor.fetchone()['id']
else:
cursor.execute("""
INSERT INTO location_names(informal_location)
VALUES (%(location)s)
RETURNING id
""", l_dict)
location_id = cursor.fetchone()['id']
# need to compare date, get day_id
date = tweet_date(row['date'])
d_dict = {"date": date}
cursor.execute("SELECT id FROM days WHERE date = %(date)s", d_dict)
if (cursor.rowcount == 1):
day_id = cursor.fetchone()['id']
# insert row['text'] and get location_names_id
tweets_sql = """
INSERT into tweets(day_id, location_name_id, text)
VALUES (%(day_id)s, %(location_id)s, %(text)s)
"""
param_dict = {'day_id': day_id,
'location_id': location_id,
'text': row['text']}
cursor.execute(tweets_sql, param_dict)
Finally, we geocode informal_location into geocode_raw and fetch the region_id when geocoding succeeds. If geopy returns "None", we insert region_id = 1 to mark the row as a "Global" region entry. The code below needs to be run multiple times, as we are parsing 100,000 rows of user locations from the tweet data.
# Fill null values or empty string with 'None' to speed up the process of geocoding
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
update_location = """
UPDATE location_names
SET region_id = %(region_id)s
WHERE id = %(location_name_id)s
"""
cursor.execute("""
SELECT id, informal_location
FROM location_names
WHERE (informal_location IS NULL
AND geocode_raw IS NULL)
OR (informal_location = '')
""")
results = cursor.fetchall()
for row in results:
# insert 'None' when informal_location and geocode_raw is NULL
# store in the geocode_raw
none_geocode_sql = """
UPDATE location_names
SET geocode_raw = %(geocode_raw)s
WHERE id = %(location_name_id)s
"""
# no geocoding needed here: these rows are marked 'None' directly
param_dict = {'geocode_raw': 'None',
'location_name_id': row['id']}
print(param_dict['geocode_raw'])
cursor.execute(none_geocode_sql, param_dict)
# Now parse informal_location to geocode_raw
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
update_location = """
UPDATE location_names
SET region_id = %(region_id)s
WHERE id = %(location_name_id)s
"""
cursor.execute("""
SELECT id, informal_location
FROM location_names
WHERE geocode_raw IS NULL
LIMIT 1000
""") # limit parsing, otherwise, timeout error
# setup the geocoder.
geolocator = Nominatim(user_agent = "geoapiExercises")
# creates a function we will call later.
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
results = cursor.fetchall()
for row in results:
# geocode the location_string
# store in the geocode_raw
cache_geocode_sql = """
UPDATE location_names
SET geocode_raw = %(geocode_raw)s
WHERE id = %(location_name_id)s
"""
# print(row)
location = geocode(row['informal_location'], language='en')
param_dict = {'geocode_raw': str(location), # location.raw will give us a JSON file with latitude and longtitude
'location_name_id': row['id']}
#print(param_dict['geocode_raw'])
cursor.execute(cache_geocode_sql, param_dict)
""" NOTE: original code causes timeout issue as it kept trying to parse NULL rows that had been parsed before
if location is not None:
param_dict = {'geocode_raw': str(location), # location.raw will give us a JSON file with latitude and longtitude
'location_name_id': row['id']}
print(param_dict)
cursor.execute(cache_geocode_sql, param_dict)
"""
# Next, we update region_id in location_names table
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
update_location = """
UPDATE location_names
SET region_id = %(region_id)s
WHERE id = %(location_name_id)s
"""
cursor.execute("""
SELECT id, geocode_raw
FROM location_names
WHERE geocode_raw IS NOT NULL
""")
results = cursor.fetchall()
for row in results:
location = row['geocode_raw'].split(",")[-1].strip()
if location == 'None':
region_id = 1
else:
# find or create a region in the regions table
sql_select_region = """
SELECT id
FROM regions
WHERE name = %(location)s
"""
cursor.execute(sql_select_region, {"location": location})
# check if there is region_id result
if (cursor.rowcount == 1):
region_id = cursor.fetchone()['id']
else:
region_id = 1
cursor.execute(update_location, {"region_id": region_id, "location_name_id": row['id']})
%load_ext sql
%env DATABASE_URL=postgresql://localhost/tinder_swindler
env: DATABASE_URL=postgresql://localhost/tinder_swindler
Let's check the count of unique locations and parsed geocodes
%%sql
SELECT COUNT(DISTINCT informal_location) AS "location count",
COUNT(DISTINCT geocode_raw) AS "geocode count"
FROM location_names
WHERE informal_location IS NOT NULL
OR geocode_raw IS NOT NULL
1 rows affected.
location count | geocode count |
---|---|
17329 | 6744 |
Then check if all rows were geocoded
%%sql
SELECT COUNT(geocode_raw)
FROM location_names
WHERE geocode_raw IS NULL
* postgresql://localhost/tinder_swindler 1 rows affected.
count |
---|
0 |
Looking at a sample of the tweets data in the database
%%sql
select text, date, region_id, location_name_id, informal_location, geocode_raw
from tweets
JOIN days ON tweets.day_id = days.id
JOIN location_names ON tweets.location_name_id = location_names.id
where geocode_raw IS NOT NULL
LIMIT 10
* postgresql://localhost/tinder_swindler 10 rows affected.
text | date | region_id | location_name_id | informal_location | geocode_raw |
---|---|---|---|---|---|
@T1gmee Tinder Swindler, off Netflix | 2022-02-09 | 1 | 1 | Travelling the World on Web3 | None |
#tinderswindler , basically homeboy got away with it | 2022-02-09 | 1 | 2 | None | |
Not the tinder swindler becoming a TIk Tok star 🙄 #FreeHushpuppi | 2022-02-09 | 1 | 2 | None | |
Why isn’t Shimon Hayut in jail ????? #thetinderswindler | 2022-02-09 | 56 | 3 | Noord-Brabant | North Brabant, Netherlands |
the tinder swindler a sick nigga 😂😂 | 2022-02-09 | 81 | 4 | PG/MD | Polígono Industrial de Fuencarral, Fuencarral, Madrid, Área metropolitana de Madrid y Corredor del Henares, Community of Madrid, 28001, Spain |
No those women on the Tinder swindler pissed me the fuck off because how are you a “heir to a billionaire Diamond c… https://t.co/ULnjv2TBJ7 | 2022-02-09 | 12 | 5 | Toronto, Ontario | Toronto, Golden Horseshoe, Ontario, Canada |
This manipulation is how scammers pull you in to fall in love with em then in a few days or month they asking for c… https://t.co/weKMMAnmew | 2022-02-09 | 1 | 2 | None | |
@robertstweets1 @DistantFixed I waited for him to say three out the list then shafted him back! It was fun actually… https://t.co/A0SQM1VkJw | 2022-02-09 | 91 | 6 | East, England | East (Andrews) Park, Cultural Quarter, The Polygon, Southampton, South East England, England, SO14 0DA, United Kingdom |
Hmmm...he's like the #tinderswindler https://t.co/O9T3mCAJ5T | 2022-02-09 | 79 | 7 | Johannesburg, South Africa | Johannesburg, City of Johannesburg Metropolitan Municipality, Gauteng, 2001, South Africa |
Wow just watching the Twitter Swindler and it reminds me so much of Ron & Kate Bensimon Leviev duo - Australian dia… https://t.co/jRS5nCMoEW | 2022-02-09 | 1 | 2 | None |
To prepare for the analysis, I exported data on viewership and tweet counts for the "Global" region (region_id = 1). The SQL query used to pull the desired data and export it into a CSV file can be seen below:
import psycopg2
from psycopg2.extras import RealDictCursor
import csv
with psycopg2.connect(host='localhost', dbname='tinder_swindler') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
cursor.execute("""
SELECT regions.id AS region_id,
weeks.year_week,
viewerships.hours_viewed,
COUNT(TEXT) OVER (PARTITION BY weeks.year_week) AS tweet_count
FROM tweets
JOIN days
ON tweets.day_id = days.id
JOIN weeks
ON days.week_id = weeks.id
JOIN location_names
ON tweets.location_name_id = location_names.id
JOIN regions
ON location_names.region_id = regions.id
JOIN viewerships
ON viewerships.week_id = weeks.id
AND viewerships.region_id = regions.id
WHERE regions.id = 1
GROUP BY regions.id,
weeks.year_week,
viewerships.hours_viewed,
tweets.text
""")
with open('global-tweetcount-viewership.csv', 'w') as csvfile:
# Declare csv fields in the order we want them
column_names = ["region_id", "year_week", "hours_viewed", "tweet_count"]
myCsvWriter = csv.DictWriter(csvfile,
fieldnames=column_names)
myCsvWriter.writeheader()
for row in cursor:
myCsvWriter.writerow(row)
print("Done writing csv")
Done writing csv
To answer the project's question of whether tweet activity in a given region affects the show's viewership, I used the SQL query above to pull out the desired data. Then, the data was manipulated using the Pandas library.
Tool: Having learned Pandas in another course as well as through Datacamp courses, I find it quite user-friendly, as I can inspect the result of each data manipulation step to check that it produces the desired output.
%pip install pandas
import pandas as pd
tweet_view_df = pd.read_csv('global-tweetcount-viewership.csv')
tweet_view_df[['year', 'week']] = tweet_view_df['year_week'].str.split("_", expand=True)
tweet_view_df.head()
region_id | year_week | hours_viewed | tweet_count | year | week | |
---|---|---|---|---|---|---|
0 | 1 | 2022_05 | 45800000 | 7649 | 2022 | 05 |
1 | 1 | 2022_05 | 45800000 | 7649 | 2022 | 05 |
2 | 1 | 2022_05 | 45800000 | 7649 | 2022 | 05 |
3 | 1 | 2022_05 | 45800000 | 7649 | 2022 | 05 |
4 | 1 | 2022_05 | 45800000 | 7649 | 2022 | 05 |
import datetime
from dateutil.relativedelta import relativedelta
week = tweet_view_df['week'].astype(int)  # e.g. "05" -> 5; str.replace("0", "") would mangle two-digit weeks like 10 or 20
date = [datetime.date(2022, 1, 1) + relativedelta(weeks=+int(e)) for e in week]
tweet_view_df['date'] = pd.DataFrame (date, columns = ['date'])
tweet_view_df.drop(['region_id', 'year_week', 'year', 'week'], axis=1).head()
hours_viewed | tweet_count | date | |
---|---|---|---|
0 | 45800000 | 7649 | 2022-02-05 |
1 | 45800000 | 7649 | 2022-02-05 |
2 | 45800000 | 7649 | 2022-02-05 |
3 | 45800000 | 7649 | 2022-02-05 |
4 | 45800000 | 7649 | 2022-02-05 |
grouped_df = tweet_view_df.groupby('date')[['tweet_count', 'hours_viewed']].mean()
print(grouped_df)
            tweet_count  hours_viewed
date
2022-02-05       7649.0    45800000.0
2022-02-12      24981.0    64700000.0
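For illustration, a minimal sketch of how grouped_df could be plotted on twin axes with Matplotlib (axis labels and styling are arbitrary choices) looks like this:
import matplotlib.pyplot as plt
# sketch only: tweet counts and hours viewed for the two available weeks
fig, ax1 = plt.subplots(figsize=(8, 4))
ax2 = ax1.twinx()
ax1.plot(grouped_df.index, grouped_df['tweet_count'], marker='o', color='tab:blue')
ax2.plot(grouped_df.index, grouped_df['hours_viewed'], marker='s', color='tab:orange')
ax1.set_xlabel('week starting')
ax1.set_ylabel('tweet count', color='tab:blue')
ax2.set_ylabel('hours viewed', color='tab:orange')
plt.title('Global tweet activity vs. "The Tinder Swindler" hours viewed')
plt.tight_layout()
plt.show()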
As we see above, we only have two weeks' worth of tweet data for this analysis. Although the number of hours viewed increased in the second week, we cannot be sure that this is directly correlated with viewer interest as expressed by tweet counts.
The biggest challenge I faced initially was the database design, as I struggled to understand how best to map the relationships between the viewership data, which was collected weekly, and the company stock data and tweet data, which were collected daily. Once I resolved the issue by creating a separate weeks table to hold the year_week record and connecting it to the days table using the foreign key week_id, the whole database design made more sense. It helped to think about what my query to export the resulting data for analysis would look like, and to reverse-engineer the database design from there.
Another challenge I faced was parsing the user_location column in Tweets.csv into a region name (identified by either a country's name or "Global") to create the relationship between the tweets and viewerships tables. Although I was able to easily remove emojis from the user_location records, I kept running into issues such as AttributeError: 'NoneType' object has no attribute 'raw' and Read timed out when using the geocoder. At the same time, the results I was getting from geopy's Photon geocoding service were not satisfactory, as it was not able to output the country name in English when the input used non-Latin characters.
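Both failure modes can be guarded against explicitly. A minimal sketch that checks for a None result before touching .raw and catches geopy's timeout exceptions, assuming the same Nominatim/RateLimiter setup, looks like this:
from geopy.exc import GeocoderTimedOut, GeocoderUnavailable
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="geoapiExercises")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
def safe_geocode(loc_string):
    # guard against a timed-out request and a None result (which has no .raw attribute)
    try:
        location = geocode(loc_string, language='en')
    except (GeocoderTimedOut, GeocoderUnavailable):
        return None
    return str(location) if location is not None else None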
After further trial and error, I found that the Nominatim geocoding service from the geopy library gave me the best results, and I used it to parse the locations for my database. I also found a way to automate parsing a country's name from a user-generated location string, since the geocoded address comes back as a comma-separated list; it can also be applied to a pandas dataframe using the .apply() method, as shown after the example below.
# geocoding service need to be declared outside function to limit time out issue
geolocator = Nominatim(user_agent = "geoapiExercises")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
def get_location(loc):
location = str(geocode(loc, language='en', exactly_one=True)) # returns a string of address separated by ","
country = location.split(",")[-1] # split string by "," and obtain last element, which is the country's name
return country
Example of use:
location string: 'الله '
location
output: 'الله, District 8, Isfahan, بخش مرکزی شهرستان اصفهان, Isfahan County, Isfahan Province, +98313, Iran'
country
output: 'Iran'
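Since get_location can also be applied to a pandas dataframe, here is a minimal sketch of that usage (the user_location column name and sample values are assumptions for illustration):
import pandas as pd
# hypothetical dataframe of raw tweet locations
locations_df = pd.DataFrame({"user_location": ["Toronto, Ontario", "Johannesburg, South Africa"]})
# apply the geocoding helper row by row; the RateLimiter keeps requests to roughly
# one per second, so this is slow on 100,000 rows
locations_df["country"] = locations_df["user_location"].apply(get_location)
print(locations_df)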
However, if we are interested in the raw location generated by Nominatim, we need to use location.raw to get a JSON-like dictionary of the geopy location that also includes the latitude and longitude. There is definitely an opportunity in the future to create a geographic map of tweet origins based on the latitude and longitude from the geocode parsing.
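A small sketch of pulling those coordinates (assuming the same geolocator and an example location string):
# geopy Location objects expose coordinates directly as well as via .raw
loc = geolocator.geocode("Toronto, Ontario", language='en')
if loc is not None:
    print(loc.latitude, loc.longitude)     # convenience attributes
    print(loc.raw['lat'], loc.raw['lon'])  # same values from the raw Nominatim payload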
Being able to convert 100,000 rows of user locations (mostly very messy data) found in the tweets and connect them to the regions table was definitely the most interesting and time-consuming challenge of this project.