Understanding the project workflow
Last modified: 2023-05-24
The following is a high-level overview of the processes that update the website each month.
These processes are triggered by a cron job (see crontab.bak) on the morning of the first day of each month.
The script that runs is: scripts/monthly_master_script.sh
1. Getting the data
The first step of the project is retrieving the data that is eventually processed to calculate the FIB-index values for all users.
Please see the Data page for all details.
2. Preparing the data for processing
After data has been retrieved and moved into the /home/data/apps/topfibers/repo/data/raw directory, the pipeline creates subdirectories of symbolic links (kept in /home/data/apps/topfibers/repo/data/symbolic_links) that point to the raw files for each platform.
This is done with the scripts/data_prep/create_data_file_symlinks.py script (https://github.com/osome-iu/top-FIBers/blob/main/scripts/data_prep/create_data_file_symlinks.py).
The structure of the symbolic_links directory is as follows:
├── facebook
│   ├── 2022_01
│   ├── 2022_02
│   ...
└── twitter
    ├── 2022_01
    ├── 2022_02
    ...
Inside each YYYY_MM subdirectory are symbolic links to the data used to calculate that month's report. For example, inside twitter/2022_01 we have:
2022_01
├── 2021-10-01__tweets_w_links.jsonl.gzip -> /home/data/apps/topfibers/repo/data/raw/twitter/2021-10-01__tweets_w_links.jsonl.gzip
├── 2021-11-01__tweets_w_links.jsonl.gzip -> /home/data/apps/topfibers/repo/data/raw/twitter/2021-11-01__tweets_w_links.jsonl.gzip
└── 2021-12-01__tweets_w_links.jsonl.gzip -> /home/data/apps/topfibers/repo/data/raw/twitter/2021-12-01__tweets_w_links.jsonl.gzip
This approach allows us to use the scripts/data_prep/create_data_file_symlinks.py script to generate unique reports for different time periods (i.e., more or fewer months than the standard three).
These directories of symbolic links are then provided as input to generate the FIB-index output files.
3. Generating FIB-index output file
With the data gathered in the first process, we generate two output files each month.
- {YYYY_mm_dd}__fib_indices_{platform}.parquet: contains all users observed in the three months prior to its calculation (indicated by the prefixed date), ranked by their FIB indices.
  - Contains: ['user_id', 'username', 'fib_index', 'total_reshares']
- {YYYY_mm_dd}__top_spreader_posts_{platform}.parquet: contains data on all posts sent by the top 50 superspreaders (according to their FIB indices).
  - Contains: ['user_id', 'post_id', 'num_reshares', 'timestamp', 'post_url']
Notes:
- Both files are generated by calc_{platform}_fib_indices.py
- YYYY_mm_dd represents the date the file is generated
- platform will be either crowdtangle (for facebook data) or twitter
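The FIB-index calculation itself lives in calc_{platform}_fib_indices.py and is not reproduced here. As an illustration only, assuming the FIB index is an h-index-style metric over a user's per-post reshare counts (an assumption; see the project's paper for the actual definition), it could be computed like this:

```python
def fib_index(reshare_counts: list) -> int:
    """h-index-style metric: the largest n such that the user has at
    least n posts that were each reshared at least n times.
    (Illustrative assumption, not the project's verified definition.)"""
    counts = sorted(reshare_counts, reverse=True)
    n = 0
    while n < len(counts) and counts[n] >= n + 1:
        n += 1
    return n
```

Under this sketch, a user whose posts were reshared [10, 5, 3, 1] times would have a FIB index of 3: three posts were each reshared at least three times.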
4. Twitter profile image link extraction
After the top superspreaders have been found, the scripts/data_processing/get_latest_profile_image_links.py script collects a link to each user's profile picture.
These links are utilized to display top FIBer profile images on the website.
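The extraction step amounts to mapping each user to a profile image URL. A minimal sketch is below; it assumes user records in the Twitter v1.1 API shape (the `id_str` and `profile_image_url_https` fields are from that API, not from this project's code, and the actual logic lives in get_latest_profile_image_links.py):

```python
def extract_profile_image_links(user_objects: list) -> dict:
    """Map user_id -> profile image URL from Twitter v1.1-style user dicts.

    Field names are assumptions based on the v1.1 API format.
    """
    links = {}
    for user in user_objects:
        url = user.get("profile_image_url_https", "")
        # v1.1 returns a small "_normal" thumbnail; stripping the suffix is a
        # common way to request the full-size image (assumption, not verified
        # against this project's code).
        links[user["id_str"]] = url.replace("_normal", "")
    return links
```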
5. Updating the database
After all of the above has been completed, code kept in the data-loader/ directory is used to send data from lenny to the database on lisa.
Specifically, the monthly bash script executes run_data_loader.sh, which runs the data-loader/server.py script.
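Conceptually, the loading step inserts the monthly output rows into database tables. The sketch below uses sqlite3 purely as a stand-in, since the actual database on lisa and its schema are not described here; the table and column names are hypothetical, taken from the fib_indices output columns listed above. The real loader is data-loader/server.py.

```python
import sqlite3


def load_fib_indices(conn: sqlite3.Connection, rows: list) -> None:
    """Insert (user_id, username, fib_index, total_reshares) rows.

    sqlite3 and the table name are stand-ins for illustration; the real
    target database and schema live behind data-loader/server.py.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS fib_indices (
               user_id TEXT,
               username TEXT,
               fib_index INTEGER,
               total_reshares INTEGER)"""
    )
    conn.executemany("INSERT INTO fib_indices VALUES (?, ?, ?, ?)", rows)
    conn.commit()
```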
6. The front end
Once the database is updated, the website (https://osome.iu.edu/tools/topfibers) automatically updates with the latest data. Find the front-end code here.