The Tinder Swindler Analysis

by Dhanny Indrakusuma

Summary

The Tinder Swindler is a viral British true-crime documentary film directed by Felicity Morris and released on Netflix on 2 February 2022. The film tells the story of an Israeli con artist, Simon Leviev, who used the dating application Tinder to find individuals whom he emotionally manipulated into financially supporting his lavish lifestyle. For this project, I analyzed whether the volume of tweets in a given region affects the show's viewership in that region.

Datasets

To create the database for my analysis, I used a total of five data sources gathered from three organizations: Netflix, Nasdaq, and Kaggle.

  1. Weekly Top 10 lists of the most-watched TV and films, starting from June 28, 2021.
    Source: https://top10.netflix.com/films/. Prior to loading the data into the cloud server, I converted the original xlsx files into csv using https://convertio.co/ to keep the format consistent with the other data.
    a) global_all_weeks.csv - contains the weekly global rank of each title in the weekly Top 10 and the hours viewed
    b) countries_all_weeks.csv - contains the weekly country rank of each title in the weekly Top 10
  2. Netflix stock data - contains daily stock trading details for the past year.
    netflix_stocks.csv was retrieved from: https://www.nasdaq.com/market-activity/stocks/nflx/historical
  3. Match Group stock data - the data contains daily stock trading details for the past year.
    Match Group owns and operates the largest global portfolio of popular online dating services including Tinder, Match.com, Meetic, OkCupid, Hinge, PlentyOfFish, Ship, and OurTime totalling over 45 global dating companies.
    match_group_stocks.csv was retrieved from: https://www.nasdaq.com/market-activity/stocks/mtch/historical
  4. One hundred thousand Tinder Swindler tweets scraped from Twitter. The data contains the username, country, and source of each tweet.
    Tweets.csv was retrieved from: https://www.kaggle.com/datasets/deepcontractor/100k-tinder-swindler-tweets

The csv files were then loaded into the cloud server and the original structure of the data can be seen below:

global_all_weeks.png
countries_all_weeks.png
netflix_stocks.png
match_group_stocks.png
Tweets.png

There were some issues that had to be managed when reading the data sources into the tinder_swindler database; these are outlined in the "Data Wrangling Process" section.

Database Design

  1. Since Netflix records each show's viewership and rank on the Friday of the week, I created a weeks table that acts as a connection between the viewerships and tweets tables through the days table.
  2. The weeks table also connects the viewerships and stock_observations tables through the days table.
  3. The tweets and viewerships tables are connected through the regions and location_names tables.

ER Diagram

ER_final.png

Relational Vocab

Viewership belongs_to Week
Week has_many Viewership

Viewership belongs_to Region
Region has_many Viewership

LocationName belongs_to Region
Region has_many LocationName

Day belongs_to Week
Week has_many Day

StockObservation belongs_to Day
Day has_many StockObservation

Tweet belongs_to Day
Day has_many Tweet

Tweet belongs_to LocationName
LocationName has_many Tweet

Sample Table

For this project, I managed synonyms of region names in the regions table and created a has_many relationship to the location_names table through the informal_location column. The location_name_id column on the tweets table identifies the origin of a tweet, and the region_id column identifies the region/country associated with a viewership metric.

Project Flowchart

workflow.png

The above figure represents the flowchart for my project workflow.

  1. Data Files: For this project, I used a total of five data sources gathered from three organizations: Netflix, Nasdaq, and Kaggle. A detailed description of each data file is outlined in the "Datasets" section.
  2. Tinder Swindler database: The data files were then imported into the database using Python's psycopg2 library. Depending on the table, the SQL queries were written to avoid inserting duplicate records.
  3. Files extracted from SQL queries: Based on the project's analysis goals, I constructed the appropriate SQL queries and exported the results into csv files, which were then used for visualizations.
  4. Visualization: Using Python's pandas library, I created dataframes from the resulting csv outputs. Next, I analyzed how the popularity of the Netflix Original show "The Tinder Swindler" influences Netflix and Match Group stock performance, and whether tweet activity in a given country affects the show's viewership in that country. The resulting visualizations were created using Python's Matplotlib and seaborn libraries.

Data Wrangling Process

Prior to importing data into the database, several pre-processing steps needed to be executed to clean the data:

  1. The five data sources did not have a uniform date/time format. Since I am only interested in day-level measures for my analysis, I converted the date or week columns into the YYYY-MM-DD date format prior to insertion into the database; these dates go into the days table. At the same time, the date was converted into a "YYYY_WW" string and inserted into the weeks table (a minimal helper sketch follows this list).
  2. For Tweets.csv, the user_location column contains location strings with emojis, which first need to be removed using Python's emoji library before the string is inserted into the location_names table as a value in the informal_location column. The cleaned string is then processed with Python's geopy library to obtain its country name (if one exists), which is inserted as a value in the geocode_raw column and used to connect to the regions table of the database. When geopy returns "None" or the user_location value is null, "Global" is inserted as the value instead.
  3. Prior to inserting values from the Close/Last column of both netflix_stocks.csv and match_group_stocks.csv, I removed the "$" using regular expression operations and converted the values into floats. I then inserted the corresponding company's name along with its closing price and volume traded.
  4. Region names from countries_all_weeks.csv and global_all_weeks.csv are inserted into the regions table. For the global-level metric, I inserted "Global" as a placeholder.
  5. When reading countries_all_weeks.csv and global_all_weeks.csv, I only insert the weekly_hours_viewed, the date derived from the week column, and the weekly_rank, and I identify the region_id based on whether the row comes from the global metric or a country-specific metric.
  6. The dates from netflix_stocks.csv and match_group_stocks.csv are inserted into the days table of the database.
  7. For Tweets.csv, I inserted the user_location information into the location_names table and the text into the tweets table. I intentionally left the strings in the text column as-is in case there is interest in conducting sentiment analysis on the tweets themselves and how it relates to the show's viewership. However, the current project scope does not go as far as sentiment analysis.
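For illustration, below is a minimal sketch of the kinds of helper functions these steps describe. The function names and the default date format are hypothetical, the "%W" week-number convention is an assumption, and emoji.replace_emoji assumes version 2.x of the emoji library.

from datetime import datetime
import re
import emoji

def to_day_and_week(date_str, fmt="%m/%d/%Y"):
    # convert a raw date string into the YYYY-MM-DD day value and the "YYYY_WW" week string
    dt = datetime.strptime(date_str, fmt)
    return dt.strftime("%Y-%m-%d"), dt.strftime("%Y_%W")

def clean_price(price_str):
    # strip the "$" (and any thousands separators) from a Close/Last value and convert to float
    return float(re.sub(r"[$,]", "", price_str))

def strip_emoji(location_str):
    # remove emojis from a user_location string before it becomes an informal_location value
    return emoji.replace_emoji(location_str, replace="").strip()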

Database Creation

To start, we create the tinder_swindler database on the cloud server along with the tables needed to store the data, following the database design outlined in the previous section. The code used for the database creation is displayed below.
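For illustration, a minimal sketch of the psycopg2-based table creation, following the ER diagram above, could look like the following. The connection details are placeholders, and any column names beyond those mentioned in this report are assumptions.

import psycopg2

conn = psycopg2.connect(dbname="tinder_swindler", user="postgres", host="localhost")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS weeks (
        id SERIAL PRIMARY KEY,
        year_week VARCHAR(10) UNIQUE NOT NULL);
    CREATE TABLE IF NOT EXISTS days (
        id SERIAL PRIMARY KEY,
        date DATE UNIQUE NOT NULL,
        week_id INTEGER REFERENCES weeks(id));
    CREATE TABLE IF NOT EXISTS regions (
        id SERIAL PRIMARY KEY,
        name VARCHAR(100) UNIQUE NOT NULL);
    CREATE TABLE IF NOT EXISTS location_names (
        id SERIAL PRIMARY KEY,
        informal_location TEXT,
        geocode_raw TEXT,
        region_id INTEGER REFERENCES regions(id));
    CREATE TABLE IF NOT EXISTS viewerships (
        id SERIAL PRIMARY KEY,
        week_id INTEGER REFERENCES weeks(id),
        region_id INTEGER REFERENCES regions(id),
        weekly_rank INTEGER,
        weekly_hours_viewed BIGINT);
    CREATE TABLE IF NOT EXISTS stock_observations (
        id SERIAL PRIMARY KEY,
        day_id INTEGER REFERENCES days(id),
        company VARCHAR(50),
        close_price NUMERIC,
        volume BIGINT);
    CREATE TABLE IF NOT EXISTS tweets (
        id SERIAL PRIMARY KEY,
        day_id INTEGER REFERENCES days(id),
        location_name_id INTEGER REFERENCES location_names(id),
        text TEXT);
""")
conn.commit()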

Loading Data into Database

Prior to loading data into the tinder_swindler database, parts of the data need to be pre-processed as described in the "Data Wrangling Process" section. The code used for pre-processing and insertion into the database is displayed below.

  1. Define Python functions to automate parsing.
  2. Next, we begin the process by inserting data from global_all_weeks.csv. We only insert metric values related to the show "The Tinder Swindler" and ignore the rest.
  3. Insert data from countries_all_weeks.csv, making sure that we don't have duplicates of year_week in the weeks table or of name in the regions table (a sketch of this duplicate-safe insert pattern follows this list). Again, we only insert metric values related to "The Tinder Swindler" and ignore the rest.
  4. Insert match_group_stocks.csv data, handling duplicates in weeks. At this point, we simply insert the date values into the days table, since we know each value is unique and these are our first insertions into the days table.
  5. Insert netflix_stocks.csv data. Since the dates here are exactly the same as in match_group_stocks.csv, we fetch the day_id as we insert values into the stock_observations table.
  6. Then, we insert data from Tweets.csv. In this step we remove the emojis present in user_location and insert the result as informal_location values in the location_names table. We also fetch the day_id as we insert text into the tweets table.
  7. Last but not least, we parse informal_location into geocode_raw and fetch the region_id if parsing is successful. If it returns "None", we insert region_id = 1 to mark it as a "Global" region entry. The code needs to be run multiple times, as we are parsing 100,000 rows of user location from the tweet data.
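As an example of the duplicate-safe insert pattern referenced in step 3, the sketch below looks a row up before inserting it. The helper names and exact SQL are illustrative rather than the project's actual code.

def get_or_create_week(cur, year_week):
    # look the week up first so repeated loads do not create duplicate year_week rows
    cur.execute("SELECT id FROM weeks WHERE year_week = %s", (year_week,))
    row = cur.fetchone()
    if row:
        return row[0]
    cur.execute("INSERT INTO weeks (year_week) VALUES (%s) RETURNING id", (year_week,))
    return cur.fetchone()[0]

def get_day_id(cur, date_str):
    # fetch the day_id for a date that was already inserted from match_group_stocks.csv
    cur.execute("SELECT id FROM days WHERE date = %s", (date_str,))
    row = cur.fetchone()
    return row[0] if row else None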

Data exploration using SQL queries

Let's check the count of unique locations and parsed geocodes.

Then check whether all rows were geocoded.

Finally, look at a sample of the tweets data in the database.
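The actual exploration queries are not reproduced here; a sketch of these three checks, using the column names from the schema sketch above (which are assumptions), might look like the following.

# count of unique informal locations and how many of them have a parsed geocode
cur.execute("""
    SELECT COUNT(DISTINCT informal_location) AS unique_locations,
           COUNT(geocode_raw) AS parsed_geocodes
    FROM location_names;
""")
print(cur.fetchone())

# rows that have not been geocoded yet (should be zero once parsing is complete)
cur.execute("SELECT COUNT(*) FROM location_names WHERE geocode_raw IS NULL;")
print(cur.fetchone())

# a small sample of the tweets data joined to its location and region
cur.execute("""
    SELECT t.text, l.informal_location, r.name AS region
    FROM tweets t
    JOIN location_names l ON t.location_name_id = l.id
    JOIN regions r ON l.region_id = r.id
    LIMIT 5;
""")
print(cur.fetchall())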

Exporting Data using SQL

To prepare for the analysis, I exported data on viewership and tweet counts for the "Global" region (region_id 1). The SQL queries used to pull out the desired data and export it into csv files can be seen below:
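A minimal sketch of pulling the Global-region viewership and weekly tweet counts into csv files with pandas, assuming the schema sketched earlier and hypothetical output file names, could look like this.

import pandas as pd

viewership_sql = """
    SELECT w.year_week, v.weekly_rank, v.weekly_hours_viewed
    FROM viewerships v
    JOIN weeks w ON v.week_id = w.id
    WHERE v.region_id = 1
    ORDER BY w.year_week;
"""

tweet_counts_sql = """
    SELECT w.year_week, COUNT(t.id) AS tweet_count
    FROM tweets t
    JOIN days d ON t.day_id = d.id
    JOIN weeks w ON d.week_id = w.id
    GROUP BY w.year_week
    ORDER BY w.year_week;
"""

pd.read_sql(viewership_sql, conn).to_csv("global_viewership.csv", index=False)
pd.read_sql(tweet_counts_sql, conn).to_csv("global_tweet_counts.csv", index=False)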

Data Analysis

To answer the project's question of whether tweet activity in a given region affects the show's viewership, I used the SQL query above to pull out the desired data. The data was then manipulated using the pandas library.
Tool: Having learned pandas in another course as well as through DataCamp courses, I find it quite user-friendly, as I can inspect the result of each data manipulation to check that it produces the desired output.
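A sketch of the pandas manipulation and the resulting plot, assuming the two hypothetical csv exports above (global_viewership.csv and global_tweet_counts.csv), might look like the following.

import pandas as pd
import matplotlib.pyplot as plt

viewership = pd.read_csv("global_viewership.csv")
tweet_counts = pd.read_csv("global_tweet_counts.csv")

# align weekly viewership with weekly tweet counts on the year_week key
merged = viewership.merge(tweet_counts, on="year_week", how="inner")

fig, ax1 = plt.subplots()
ax1.bar(merged["year_week"], merged["weekly_hours_viewed"], color="tab:red", alpha=0.6)
ax1.set_ylabel("Weekly hours viewed")
ax1.set_xlabel("Week")

ax2 = ax1.twinx()
ax2.plot(merged["year_week"], merged["tweet_count"], color="tab:blue", marker="o")
ax2.set_ylabel("Tweet count")

plt.title("The Tinder Swindler: global viewership vs. tweet activity")
plt.show()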

As seen above, we only have two weeks' worth of tweet data for this analysis. Although the number of hours viewed increased in the second week, we can't be sure that this is directly correlated with viewer interest as expressed by tweet counts.

Challenges

The biggest challenge I faced initially was the database design, as I struggled to understand how best to map the relationships between viewership data collected weekly and company stock data and tweet data collected daily. Once I resolved the issue by creating a separate weeks table to hold the year_week record and connecting it to the days table via the week_id foreign key, the whole database design made more sense. It helped to think about what my query to export the resulting data for analysis would look like and to reverse-engineer the database design from there.

Another challenge I faced was parsing the user_location column in Tweets.csv into a region name (identified by either a country's name or "Global") to create the relationship between the tweets and viewerships tables. Although I was able to easily remove emojis from user_location records, I kept running into issues such as AttributeError: 'NoneType' object has no attribute 'raw' and Read timed out when using the geocoder. At the same time, the results I was getting from the Photon geocoding service in the geopy library were not satisfactory, as it could not output a country name in English when the input used non-Latin characters.

After further trial and error, I found that the Nominatim geocoding service from the geopy library gave the best results, and I used it to parse the locations for my database. I also found a way to automate parsing a country's name from a user-generated location string; if the location strings are in a list, the function below can also be applied to a pandas dataframe using the .apply() method.

# imports for the geocoder; geopy needs to be installed first (pip install geopy)
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# the geocoding service needs to be declared outside the function to limit time-out issues
geolocator = Nominatim(user_agent="geoapiExercises")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_location(loc):
    location = str(geocode(loc, language='en', exactly_one=True))  # returns a string of the address separated by ","
    country = location.split(",")[-1].strip()  # split the string by "," and take the last element, the country's name
    return country

Example of use:

location string: 'الله '
location output: 'الله, District 8, Isfahan, بخش مرکزی شهرستان اصفهان, Isfahan County, Isfahan Province, +98313, Iran'
country output: 'Iran'

However, if we are interested in the raw location generated by Nominatim, we need to use location.raw, which returns the raw geocoder response as a dictionary that also includes the latitude and longitude of the location. There is definitely an opportunity in the future to create a geographic map of the tweets' origins based on the latitude and longitude from geocode parsing.
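For example, a small sketch of pulling the raw response and coordinates from the geopy result (using a hypothetical query string, instead of the string form used in get_location above) could look like this.

location = geocode("Isfahan", language="en", exactly_one=True)
if location is not None:
    print(location.raw)                           # raw Nominatim response as a dictionary
    print(location.latitude, location.longitude)  # coordinates useful for a future map of tweet origins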

Converting 100,000 rows of user location (mostly very messy data) found in the tweets and connecting them to the regions table was definitely the most interesting and time-consuming challenge of this project.