Understanding PSOSM in Social Media
Content
Course Focus and Motivation
Ubiquity of Social Media: The instructor emphasizes that almost everyone uses
platforms like Instagram, Facebook, LinkedIn, or Twitter, highlighting their role in daily
interactions and business.
Data Volume and Activity: The massive increase in user activity on various platforms like
Tinder, Netflix, and Twitch between 2019 and 2020 demonstrates the phenomenal
growth and data generation.
o Video: YouTube
o Professional: LinkedIn
The lecture establishes the difference between the two main topics of the course:
Privacy: Defined as the state of not being watched or tracked. It centers on expectations
about information: the ability to be selective about sharing personal data with only
chosen people.
Security: Concerns the protection of information during transfer and storage, ensuring
confidentiality, integrity, and availability.
Course Objective: The class will use these concepts to study problems like fake news,
identifying fake/bot accounts, and determining a user's location from posted pictures.
The lecture covers two foundational concepts from network science that are highly relevant to
social platforms:
1. Six Degrees of Separation: This is the finding that any two people in the world are
connected by an average of about six social connections or "hops". The instructor notes
that on modern platforms like Facebook, this distance has become even shorter,
estimated at around 3.5 hops.
2. Strength of Weak Ties: This concept suggests that acquaintances (weak ties)—people
you don't know well—are more likely to introduce you to new information and
opportunities compared to your close friends or family (strong ties).
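To make the idea of "hops" concrete, here is a minimal Python sketch that counts the fewest hops between two people with a breadth-first search. The friendship graph and the names in it are made up for illustration and do not come from the lecture.

```python
from collections import deque


def hops_between(graph, start, target):
    """Return the fewest hops from start to target, or None if unreachable."""
    if start == target:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        for friend in graph.get(person, ()):
            if friend == target:
                return dist + 1
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, dist + 1))
    return None


# Hypothetical friendship graph stored as adjacency lists.
friends = {
    "asha": ["bala", "chen"],
    "bala": ["asha", "dev"],
    "chen": ["asha"],
    "dev": ["bala", "esha"],
    "esha": ["dev"],
}

print(hops_between(friends, "asha", "esha"))  # asha -> bala -> dev -> esha
```

Averaging this distance over many random pairs of users is, roughly, how a figure such as "six degrees" or "3.5 hops" would be estimated on a real network.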
Predictive Power: Social media data can be used to predict various events before they
happen, such as protests, stock market value changes, or health pandemics (like the flu).
The speaker references the movie Minority Report for the concept of predicting crime.
Case Studies:
o Hudson River Plane Landing (2009): One of the first incidents where social
media was used to disseminate information during a crisis.
o Arab Spring/Delhi Gang Rape (2012): Instances where social media was used
effectively to mobilize large groups of people for protest and societal change.
o Capitol Riots (2021): Mentioned as a recent event where platforms like Parler
were used for organization, illustrating the rapid spike in conversation around an
event.
Recommended Resources
The instructor highly recommends watching two documentaries for better course appreciation:
The Social Dilemma (Netflix): Focuses on how social media algorithms manipulate user
behavior and perception.
The Great Hack: Discusses the Cambridge Analytica incident, showing how collected
data can be used to profile users and influence political affiliations and elections.
The lecture also introduces the concept of a Shadow Profile, where a social network creates a
profile and collects information about a person who is not yet on the network.
The lecture categorizes social media into different types of services based on the content they
generate:
Social Games
The speaker notes that platforms are characterized by their primary content type: YouTube for
video, Flickr for images, Foursquare for location, and LinkedIn for professional connections.
Facebook
o Description: Combination of content types (text, image, video).
o Building blocks: Posts, Likes, Comments, Shares, Friends, Pages, Groups.
o Connection model: Bidirectional (friendship requires mutual agreement).
Twitter
o Description: Micro-blogging (short content, max 140 characters at the time of the
lecture).
o Building blocks: Tweets, Retweets, Replies, Likes, Followers/Following, Mentions,
Hashtags, Trends.
o Connection model: Unidirectional (following does not require permission).
LinkedIn
o Description: Professional connections and job activity.
o Building blocks: Connections.
o Connection model: N/A (focused on professional profile and activity).
Pinterest
o Description: Image sharing, visually oriented.
o Building blocks: Images, Boards.
o Connection model: N/A (focused on visual content organization).
Periscope
o Description: Live streaming of videos in real-time.
o Building blocks: Live video stream.
o Connection model: N/A (focused on ephemeral, real-time video).
Tinder
o Description: Location-based matchmaking for connecting people nearby.
o Building blocks: Left/Right Swipe.
o Connection model: N/A (focused on relationship connections).
The large-scale content generation on social media relates to the characteristics of Big Data,
summarized by the 5 Vs:
1. Velocity: The speed at which data is generated (e.g., 400 hours of video uploaded to
YouTube every 60 seconds).
2. Variety: The different types of content (text, image, video, location, etc.).
3. Veracity: The confirmation or legitimacy of the posted content, which is often hard to
verify.
4. Volume: The size or scale of the content being generated and stored.
5. Value: The utility of the data, which must be present to make analysis worthwhile.
The lecture concludes with a segment on the difference between reality and perception on
social media, using a short clip to illustrate how users curate their content to present a desired,
often exaggerated or false, image of their lives, such as pretending a bad presentation was great
or faking a morning run.
The lecture categorizes the impact of social media through various examples:
Hudson River Plane Landing (2009): This was the first major incident where social media
(specifically Twitter) was used for crisis management. A civilian's post alerted people
before first responders arrived.
Finding Missing Persons: Social media has been effectively used to help locate lost
children, often through the quick sharing of photographs and tagging relevant
authorities.
Disaster Relief: Platforms were used to organize aid and connect citizens during events
like the Nepal earthquake.
UK Riots (2011): Social media was used to propagate and organize unrest. Instead of
reporting on an incident, the platforms were used to coordinate it.
o Fake News and Hoaxes: False claims, such as the one about a child dying in the
Boston Marathon bombing or a tweet promising a donation for every retweet,
are spread rapidly.
Privacy Implications: Personal information and pictures posted by users or those around
them can have severe real-world consequences, as seen when a military intelligence
chief lost his job because of his wife's social media posts.
Job Loss/Company Security: Employees have been fired for excessive social media use
or for posting sensitive information about their work projects, which jeopardizes
company security.
The lecture wraps up the first week of the course by reviewing the material covered:
Social Media Fundamentals: The scale, content types, and basic building blocks of
platforms like Facebook, Twitter, and LinkedIn.
Incidents: Case studies of both positive and negative impacts of social media.
Next Steps: Students were instructed to set up their environments (Linux and Python
tutorials were provided) to begin hands-on data collection and analysis in the coming
weeks.
The video, "Week 2.1 OSM APIs and tools for data collection", introduces the tools and
techniques necessary for programmatically collecting and analyzing data from Online Social
Media (OSM) platforms like Facebook and Twitter.
An API allows a program to interact with social media services to collect data, essentially
creating a secure channel between the program and the platform.
Rate Limits are a key constraint, as social media companies restrict the amount of data a
user can collect within a specific period.
Python is highlighted as the programming language used for collecting and analyzing
data due to its popularity and extensive libraries for interacting with APIs, parsing data,
and handling JSON objects.
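A collection script usually has to pace its own requests to stay inside a rate limit. The sketch below is one simple way to do that in Python with a sliding window. The class name, the limit of 15 calls, and the 900-second window are illustrative assumptions, not values taken from the lecture or from any platform's documentation.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most max_calls calls within any window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.calls = deque()    # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)


# Hypothetical limit: 15 requests per 15 minutes.
limiter = RateLimiter(max_calls=15, period=900)
# Before each API request, a script would call: limiter.wait()
```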
When a request is sent to a social media API (like Facebook's Graph API), the data is
typically returned in JSON (JavaScript Object Notation) format.
The lecture shows how a JSON object, containing information like a user's ID and name,
is structured. Tools like a JSON viewer can be used to visually inspect this structured
data.
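As a small illustration of handling such a response in Python, the sketch below parses a JSON string with the standard json module. The response body is a made-up example; real API responses contain many more fields, and the field names here are assumptions.

```python
import json

# Hypothetical response body returned by an API request.
response_text = (
    '{"id": "12345", "name": "Example User", '
    '"posts": [{"id": "p1", "likes": 10}, {"id": "p2", "likes": 3}]}'
)

user = json.loads(response_text)   # JSON text -> Python dict
print(user["id"], user["name"])

# Nested lists and objects become Python lists and dicts.
total_likes = sum(post["likes"] for post in user["posts"])
print(total_likes)

# Pretty-print the object, similar to what a JSON viewer shows.
print(json.dumps(user, indent=2))
```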
o MySQL: A relational database used to store data in rows and columns, allowing
simple queries to be run (e.g., SELECT user_ID, user_name).
Tools for viewing the stored data include phpMyAdmin (for MySQL) and MongoVUE (for
MongoDB).
5. Facebook's Graph Data Model
o Nodes: Represent objects such as users, friends, pictures, videos, and status
updates.
The video "Week-2.2 Trust and Credibility on OSM" explores the challenges of misinformation
on online social media (OSM) platforms like Twitter, particularly in the context of major real-
world events. The lecture focuses on methodologies for identifying and classifying content as
trustworthy or fake.
Analyzing the Boston Marathon bombing, the lecture demonstrates that rumors spread
significantly faster on Twitter than legitimate, true information.
The key challenges are reducing the propagation of false information and ensuring true
information is posted as quickly as possible.
Multiple examples of misinformation are provided, including fake tweets about the
Boston blast and the use of unverified or old images during Hurricane Sandy and other
events.
1. Data Collection: Collecting data from platforms like Twitter (e.g., 1.7 million tweets
related to Hurricane Sandy).
2. Data Characterization: Understanding the volume, type, and source of the collected
data.
4. Feature Extraction: Identifying characteristics that distinguish fake posts from real ones.
5. Model Evaluation: Using machine learning techniques like Naive Bayes or Decision
Trees to automatically classify posts.
The lecture divides the distinguishing characteristics into three broad categories:
User Features (Source-based): Characteristics of the user profile, such as the number of
friends/followers, the follower-to-following ratio, how many lists the user is on, whether
the user is verified, and the age of the user account.
Tweet Features (Message-based): Characteristics of the post itself, such as the length of
the tweet, the number of words, and the presence of question marks, exclamation
marks, or emoticons.
Network Features: The user's connections and how their content diffuses through the
social network.
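The user and tweet features described above can be sketched in Python as follows. This is an illustration only: the function names, the emoticon pattern, and the input fields are assumptions and not the feature code used in the research.

```python
import re
from datetime import datetime

# Rough pattern for a few common emoticons such as :) ;-( =D (an assumption).
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")


def tweet_features(text):
    """Message-based features computed from the tweet text alone."""
    return {
        "num_chars": len(text),
        "num_words": len(text.split()),
        "has_question": "?" in text,
        "has_exclamation": "!" in text,
        "has_emoticon": bool(EMOTICON.search(text)),
    }


def user_features(followers, following, listed_count, verified, created_at, now):
    """Source-based features computed from the user profile."""
    return {
        "followers": followers,
        "follower_following_ratio": followers / following if following else None,
        "listed_count": listed_count,
        "verified": verified,
        "account_age_days": (now - created_at).days,
    }


example_tweet = tweet_features("Is this real?! :(")
example_user = user_features(
    followers=200, following=100, listed_count=3, verified=False,
    created_at=datetime(2020, 1, 1), now=datetime(2020, 1, 31),
)
print(example_tweet)
print(example_user)
```

The two dictionaries can be merged into one feature vector per tweet before being passed to a classifier.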
The analysis of Hurricane Sandy data showed that combining tweet features and user features
performed best in classification. The top 10 influencing features include: number of characters,
tweet word count, user location, number of retweets, and age of the tweet.
Fake Accounts: New fake Twitter accounts were rapidly created around the time of the
event to propagate malicious content. Over 32,000 new accounts were created, with a
high percentage eventually suspended or deleted.
Tweet Source: A higher percentage of fake posts were made through mobile devices
compared to true or general posts.
Community: The users posting fake content were found to be closely connected,
suggesting a small, coordinated community of malicious actors.
The video presents TweetCred, a Chrome browser extension built on this research. It uses a
real-time model to calculate a credibility score (on a scale of 1 to 7) for tweets directly in the
user’s timeline. It also allows users to provide feedback to update and improve the underlying
classification model.
The video "Week 2 Reddit tutorial" provides a practical guide on how to collect data from the
Reddit social networking platform using the PRAW (Python Reddit API Wrapper) library.
Interactions: Reddit relies on upvotes and comments as its primary interaction patterns.
Posts: A post contains a title, a score (upvotes), and a body, and records the user who
submitted it.
Flairs: Reddit uses flairs as a concept similar to hashtags, connecting a post to a specific
topic.
Community Details: Each subreddit has a list of rules and moderators who enforce
them, as well as a list of related communities.
The tutorial demonstrates the step-by-step process of setting up and using the PRAW Python
library to programmatically collect Reddit data:
1. Authentication Setup:
o To collect data, users must first sign up/log in to Reddit.
o They must then go to preferences > apps and create an app by selecting the
"script" option.
o This process generates the necessary Client ID (14-character ID) and Client Secret
(long key) for authentication.
2. PRAW Configuration: The user authenticates the PRAW object by supplying the
following credentials:
o client_id
o client_secret
3. Collecting Posts:
o Posts can be collected from a specific subreddit (e.g., r/india) or from all
subreddits by setting the subreddit variable to "all".
o The PRAW object is used to retrieve posts, with options for limiting the number
(e.g., limit=100) and the sorting type (e.g., hot posts).
o The collected data points for each post include the title, subreddit, score, ID,
URL, number of comments, creation time, and body (selftext).
o It's advisable to save the collected data to a file, such as a CSV file, using the
Pandas to_csv() function.
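The steps above can be sketched in Python roughly as follows. Treat it as a sketch under stated assumptions rather than the tutorial's exact code: the helper names are hypothetical, the credentials are placeholders you must supply, and user_agent is an extra descriptive string that PRAW expects alongside client_id and client_secret. The praw and pandas imports are placed inside the functions that need them.

```python
FIELDS = ["title", "subreddit", "score", "id", "url",
          "num_comments", "created_utc", "selftext"]


def submission_to_row(submission):
    """Copy the data points listed above from one submission into a dict."""
    return {
        "title": submission.title,
        "subreddit": str(submission.subreddit),
        "score": submission.score,
        "id": submission.id,
        "url": submission.url,
        "num_comments": submission.num_comments,
        "created_utc": submission.created_utc,
        "selftext": submission.selftext,
    }


def fetch_hot_posts(client_id, client_secret, user_agent,
                    subreddit_name="india", limit=100):
    """Collect up to `limit` hot posts; use subreddit_name="all" for all subreddits."""
    import praw  # third-party library: pip install praw

    reddit = praw.Reddit(client_id=client_id,
                         client_secret=client_secret,
                         user_agent=user_agent)
    posts = reddit.subreddit(subreddit_name).hot(limit=limit)
    return [submission_to_row(post) for post in posts]


def save_rows(rows, path):
    """Write the collected rows to a CSV file with pandas."""
    import pandas as pd  # third-party library: pip install pandas

    pd.DataFrame(rows, columns=FIELDS).to_csv(path, index=False)
```

A typical run would call fetch_hot_posts(...) with your own credentials and then save_rows(rows, "posts.csv"); the file name is only an example.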
The video "Week 3.1 Misinformation on Social Media" discusses how the methodologies for
identifying misinformation developed for Twitter can be adapted and applied to other social
networks, focusing specifically on Facebook.
The core principles for detecting misinformation—such as data collection, feature extraction,
ground truth generation, and classification modeling—are carried over from Twitter. However,
the models must account for the structural differences between the platforms:
Trust Dynamics: Facebook connections are more personal. Users tend to believe a post
shared by a friend to be more truthful compared to a post from a random person on a
public platform. This difference necessitates adapting the features and the model to
weigh the influence of the source's credibility.
Facebook Inspector and Web of Trust
The video introduces a tool called Facebook Inspector, a browser plugin that functions similarly
to the earlier TweetCred tool.
Facebook Inspector
Architecture: It uses a supervised learning model to take a post from the Facebook
Graph API, perform feature extraction, and compute a credibility score.
Functionality: The plugin annotates posts directly in the user’s news feed with a visual
warning. For example, it may display a red mark to indicate that a post is malicious, or a
message stating that confidence in the decision is low.
A key feature integrated into the Facebook Inspector is the Web of Trust (WOT) score.
WOT is an external service used to assess the credibility and safety of URLs or domains
that are shared in a post.
It returns a rating (e.g., Excellent, Good, Satisfactory, Poor, or Very Poor) and a
confidence score, which is incorporated into the overall model to judge a post's
trustworthiness.
The Facebook Inspector is available as a browser extension for both Chrome and Firefox,
allowing users to get real-time feedback on the posts they view.
The video "Week 3.2 Privacy and Social Media" discusses the complex nature of privacy,
particularly in the context of online social media, and presents findings from a large-scale study
on privacy perceptions in India.
Defining Privacy
The Alan Westin Model: Professor Alan Westin's long-term research classified U.S.
citizens into three categories based on their privacy preferences:
o Fundamentalists: People who are unwilling to provide personal details and have
very strong privacy expectations (approx. 25% of the U.S. population).
o Pragmatists: People who weigh the benefits of sharing personal information against
the risks before deciding (the middle group in Westin's classification).
o Unconcerned: People who do not care about privacy and may give away
personally identifiable information for minimal returns.
The lecture presents data from a large survey on privacy perceptions in India, covering over
10,000 respondents:
Trust in Privacy Settings: When asked about the security of their personal information
on online social networks:
o 42% of respondents said their data is secure from a privacy breach because they
specified their privacy settings (highlighting a potentially misplaced confidence).
o 23% expressed concern about privacy even though they specified their settings.
Accepting Friendship Requests: When asked which people they would add as friends on
their favorite social network (e.g., Facebook):
o 27% accepted a friend request simply because the person was of the opposite
gender.
o 10.12% accepted a request simply because the person had a nice profile picture.
The video "Week 3 Tutorial 3 1 Twitter API" provides a tutorial on how to collect data from
Twitter (now X) using its Application Programming Interface (API), specifically leveraging the
Python library Tweepy.
Twitter API: An interface provided by Twitter that allows third-party programs to
interact with Twitter to perform tasks like searching, posting, or collecting data.
Authentication Keys: To use the API, you need four keys for authentication:
o Consumer Key and Consumer Secret (for identifying your application to the API).
o Access Token and Access Token Secret (for authenticating you as a user to the API).
Rate Limits: Twitter enforces rate limits to prevent misuse and ensure smooth
operation. These limits restrict the number of requests you can send within a given time.
The tutorial uses the Tweepy library to connect to the Twitter API in a Python environment.
The Search API is used to collect historical data that is already present on the platform.
Query Strings: You can search for tweets using keywords, hashtags, or user mentions.
o Language.
Limitation: The Search API generally only works for tweets that are less than 7 days old.
Process: To collect tweets in real time, you must create a custom listener class with an
on_status method that defines what to do with a tweet the moment it is received (e.g.,
print the text or save it to a file).
Real-time Collection: The API starts filtering and delivering tweets to your listener class
as soon as they are posted, making it ideal for collecting live data.
Filtering: You can filter the live stream by keywords, user IDs, or locations.
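The listener pattern described above can be sketched in plain Python as shown below. It is deliberately library-agnostic: in Tweepy the class would subclass the library's stream class, whose name and constructor differ between Tweepy versions, so that wiring is left out. The class name and the status.id and status.text attributes are assumptions for illustration.

```python
import json


class SaveToFileListener:
    """Callback object: on_status is called once for every incoming tweet."""

    def __init__(self, out):
        self.out = out      # any object with a write() method, e.g. an open file
        self.saved = 0

    def on_status(self, status):
        # Store one JSON object per line so the file can be read back tweet by tweet.
        record = {"id": status.id, "text": status.text}
        self.out.write(json.dumps(record) + "\n")
        self.saved += 1
```

In use, the listener would be created with a file opened in append mode, and the streaming connection would call on_status for every tweet that matches the chosen keywords, user IDs, or locations.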
The video "Week-4.1 Privacy and Pictures on Online Social Media" continues the discussion on
privacy by focusing on the serious implications of sharing images on social media and how
publicly available data can be used to re-identify individuals and infer sensitive information.
The lecture reinforces the difficulty of defining privacy, noting that it is contextual and can have
contradictory dimensions. The foundational definition of privacy is based on control over
information:
Balancing Act: Every individual constantly balances the desire for privacy with the
desire for disclosure/communication.
Forms of Privacy: The lecture primarily focuses on information privacy (internet privacy)
but mentions communication privacy (telephones), territorial privacy (living space), and
bodily privacy (physical presence).
The core of the video highlights that four converging trends are making it easier to compromise
user privacy through images:
3. Cheaper Cloud Computing: The ability to store and compute large datasets of images
has become cheaper and more efficient.
The video details two influential research experiments demonstrating how easily individuals can
be de-anonymized using public and offline data:
Method: Researchers used a face recognition tool to compare public Facebook images
(identified) against photos from dating site profiles (unidentified), focusing on a single
U.S. city.
Finding: Approximately 10% of the dating site profiles were successfully re-identified
and linked back to the user's real Facebook profile, confirmed by crowdsourced workers.
This means 1 in 10 anonymous dating profiles could be mapped to a real identity.
Method:
2. While the student filled out a survey, a system compared their photo against
25,000 public profile images scraped from the university's Facebook network.
3. The system then presented the student with a matching image and asked them
to confirm if it was their Facebook picture.
Finding: Using the photo taken on campus, 38% of the participants were successfully
matched with their correct Facebook profile photo.
Goal: To determine if facial recognition could lead to the inference of sensitive private
data, like a U.S. Social Security Number (SSN).
Finding: Researchers were able to correctly predict the first five digits of the SSN for
27% of the subjects by combining face images with public data.
The collective results show that individuals' faces, even when captured offline or on an
anonymous site, can be easily linked to their fully identified social media profiles, leading to the
potential exposure of sensitive information.
The video "Week 4 Tutorial Part 1 numpy" is a tutorial on using the NumPy library in Python,
which is essential for scientific computing, particularly for efficient handling of arrays.
The tutorial begins by showing how to use Jupyter Notebook, an interactive environment well-
suited for Python coding and documentation.
Launching Jupyter: You can launch the notebook by typing jupyter notebook in your
terminal. This opens a new session in your default browser.
Documentation: Jupyter Notebook allows you to mix code, output, and documentation
(using Markdown), making it great for demos and tutorials.
📊 NumPy Arrays
The primary focus is the NumPy (Numerical Python) library, which provides powerful N-
dimensional array objects and tools for working with them.
Array Creation
NumPy arrays are more efficient than Python lists for numerical operations.
From Lists: You can create an array from a standard Python list using np.array().
Data Type: Arrays can hold elements of a single type (e.g., integer or float). You can
explicitly set the data type using the dtype argument, for example, dtype='float64'. You
can check the array's data type using .dtype.
o .shape: Returns a tuple representing the size of each dimension (e.g., (3, 4) for a
3x4 matrix).
o Syntax: Uses the colon operator in the format [start:stop:step]. The element at
the stop index is excluded.
o Multi-Dimensional Slicing: You can slice across multiple axes, for example,
matrix[row_slice, column_slice]. Using a colon : alone selects all elements along
that dimension.
o Splitting: Functions like np.split(), np.hsplit(), and np.vsplit() divide an array into
multiple sub-arrays.
➕ Array Operations (Universal Functions)
NumPy's strength lies in its vectorized operations using Universal Functions (ufuncs), which
apply operations to every element of an array without the need for explicit Python loops,
resulting in significantly faster computation.
o np.sum(): Calculates the sum of all elements. Can be used along a specific axis
(rows or columns) to find marginal sums.
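The array features covered in this tutorial can be tried out with a short sketch like the one below. The matrix values are arbitrary example data, not taken from the video.

```python
import numpy as np

# Create a 3x4 array from nested Python lists with an explicit data type.
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]], dtype="float64")

print(matrix.shape)   # size of each dimension
print(matrix.dtype)   # element type

# Slicing uses [start:stop:step]; a lone colon selects a whole axis.
first_two_rows = matrix[:2, :]
every_other_col = matrix[:, ::2]

# Split the columns into two equal halves.
left, right = np.hsplit(matrix, 2)

# Vectorized operations apply to every element without a Python loop.
doubled = matrix * 2

# Sums over the whole array and along each axis.
total = np.sum(matrix)
col_sums = np.sum(matrix, axis=0)   # one sum per column
row_sums = np.sum(matrix, axis=1)   # one sum per row
print(total, col_sums, row_sums)
```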
The video "Week 4 Tutorial Part 2 pandas and matplotlib" is a practical tutorial covering the
use of the Pandas and Matplotlib Python libraries for data analysis and visualization.
The Pandas library is introduced as a highly effective tool for working with tabular data. Its
primary structures are the Series (a one-dimensional labeled array) and the DataFrame (a two-
dimensional labeled structure, like a spreadsheet or SQL table).
Data I/O: The tutorial shows how to easily read data from common file types using
functions like pd.read_csv() and pd.read_excel().
Inspection:
One of Pandas' most powerful features is grouping data to perform aggregated calculations:
df.groupby(): This function is used to split the data into groups based on unique values
in a specified column (e.g., grouping by 'country' or 'category').
Aggregation: Once grouped, you can apply aggregation functions like .mean(), .sum(),
or .count() to calculate results for each group.
groupby().apply(): This allows you to apply a custom function to each group for more
complex operations.
Combining DataFrames
The video covers techniques for combining data from multiple sources:
pd.concat(): Used to stack DataFrames either row-wise (adding more rows) or column-
wise (adding more columns).
pd.merge(): Used to join DataFrames based on common key columns, similar to SQL
joins (e.g., 'inner', 'outer', 'left', 'right').
df.fillna(): Replaces missing values with a specified value, such as the mean or median of
the column.
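A compact sketch of these Pandas operations on made-up data is shown below. The column names and values are illustrative assumptions and are not the dataset used in the video.

```python
import pandas as pd

# Hypothetical post data with one missing value in the "likes" column.
posts = pd.DataFrame({
    "user": ["a", "b", "a", "c"],
    "country": ["IN", "US", "IN", "US"],
    "likes": [10, None, 30, 5],
})
users = pd.DataFrame({"user": ["a", "b"], "followers": [100, 250]})

# Replace the missing value with the column mean.
posts["likes"] = posts["likes"].fillna(posts["likes"].mean())

# Split by country, then aggregate each group.
mean_likes = posts.groupby("country")["likes"].mean()

# Left join keeps every post, even when the user has no row in `users`.
merged = pd.merge(posts, users, on="user", how="left")

# Stack two DataFrames row-wise.
stacked = pd.concat([posts, posts], ignore_index=True)

print(mean_likes)
print(merged)
```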
Matplotlib is the foundational plotting library in Python, typically used through its pyplot
interface (import matplotlib.pyplot as plt).
Visualization Basics
Customization:
Subplots: The tutorial touches on creating multiple plots within a single figure using
subplots.
Plot Types
The video "Week-5.1 Policing and Online Social Media" is a lecture from a course on Privacy
and Security in online social media. It covers the privacy implications of location-based social
networks and the increasing adoption of social media by police organizations, particularly in the
Indian context.
Inferring Home Location: Researchers have shown that information from location-based
social networks like Foursquare can be used to infer a user's home location with high
confidence, primarily based on their check-ins and mayor status.
Mobility: A key finding in this research is that people's mobility is often limited and
predictable, making location inference easier.
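To see why limited, predictable mobility makes inference easy, consider the simple frequency heuristic sketched below in Python: it rounds check-in coordinates to a grid cell and returns the most visited cell. This is an assumption-laden illustration, not the method used in the research; it ignores mayor status, and the coordinates are made up.

```python
from collections import Counter


def infer_home_cell(checkins, precision=2):
    """Return the most frequently visited (lat, lon) grid cell, or None."""
    cells = Counter(
        (round(lat, precision), round(lon, precision)) for lat, lon in checkins
    )
    if not cells:
        return None
    return cells.most_common(1)[0][0]


# Hypothetical check-ins: three close together, one far away.
checkins = [
    (28.6139, 77.2090),
    (28.6141, 77.2093),
    (28.5355, 77.3910),
    (28.6137, 77.2088),
]
print(infer_home_cell(checkins))
```

Because most check-ins cluster in a few places, even this crude count narrows a user's likely home area considerably.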
The main topic of the lecture focuses on how police organizations around the world, and
specifically in India, are utilizing social media.
The first major instance of social media being used for real-world crisis response was
during the 2009 US Airways plane landing on the Hudson River. A person on the
riverside posted a picture and a tweet about the event before first responders, marking
a shift from social media being purely for personal updates.
NYPD Campaign: The New York Police Department (NYPD) ran a campaign using the
hashtag #myNYPD, inviting the public to share photos with officers. While intended to
build rapport, the campaign was quickly repurposed by users to share photos of alleged
police misconduct, demonstrating the unpredictable nature of public interaction on
social platforms.
Adoption in India: In India, police organizations like the Bangalore City Police, Delhi
Traffic Police, and Hyderabad City Police have widely adopted verified pages on
platforms like Facebook and Twitter.
A significant problem highlighted is the proliferation of fake handles and accounts that mimic
legitimate police organizations.
Masquerading: These fake accounts often use the same profile pictures and names as
the official pages, making it very difficult for citizens to determine which account is
legitimate.
Need for Verification: The speaker emphasizes that the blue verification tick (a verified
account) is crucial for police organizations on platforms like Twitter, as it is the only way
to confirm their authenticity to the public.
Research Efforts: The lecture mentions research efforts to create a comprehensive list of
genuine police social media handles across India and to collect data from them to
analyze their activity (likes, comments, post timings, and trends).
The video "Week-5.2 Policing and Online Social Media" discusses a research study on how
police organizations, specifically the Bangalore City Police (BCP) in India, can use online social
media data to gather actionable information and understand public opinion about their
activities.
🎯 Research Objective
The study aimed to determine if online social media could help the police obtain:
1. Actionable Information about crime and local issues (e.g., traffic problems, car
breakdowns, potholes).
The researchers collected and analyzed posts and comments from the BCP's public Facebook
page.
Data Collected
Analysis Categories
1. Content: What the public was talking about (e.g., misinformation, traffic details,
neighborhood concerns).
3. Police Response Types: How the police responded (e.g., acknowledge, reply, follow up,
ignore).
Key Findings
o Police are publicly responding to citizens' posts, making them accountable for the
issues raised.
o Citizens express their concerns and expectations (wants and needs), making their
role in city safety more public.
Response Time: There was a large variance in police response times, ranging from a
minimum of 4 minutes to a maximum of 211 hours.
💡 Practical Applications
The study emphasizes that analyzing the textual content through techniques like word trees
(e.g., analyzing citizen posts starting with words like "worried," "why," or "need") can help
police:
Understand Public Needs and Wants: Identify common fears, anxieties, and
expectations.
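A word tree can be approximated in Python by counting the word sequences that follow a chosen root word, as in the sketch below. The function name and the example posts are hypothetical and only illustrate the idea; they are not data from the study.

```python
from collections import Counter

PUNCTUATION = ".,!?"


def word_tree(posts, root, depth=2):
    """Count the sequences of up to `depth` words that follow `root`."""
    branches = Counter()
    for post in posts:
        words = [w.strip(PUNCTUATION) for w in post.lower().split()]
        for i, word in enumerate(words):
            if word == root:
                following = tuple(words[i + 1:i + 1 + depth])
                if following:
                    branches[following] += 1
    return branches


# Hypothetical citizen posts.
posts = [
    "Why is the signal not working?",
    "Why is the road closed",
    "Need help near MG Road",
    "why are buses late",
]
tree = word_tree(posts, "why")
print(tree.most_common())
```

The most common branches after words such as "why" or "need" point to the questions and requests that citizens raise most often.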
The video "Week 5.3 Policing and Online Social Media" continues the analysis of police-citizen
communication on social media by examining attributes of the interactions across different
police departments.
❓ Research Questions
The study explored the feasibility of using social media content to quantify communication
attributes and identify behavioral attributes such as emotional expression, engagement, and
cognitive response processes. The specific questions focused on:
Emotional Exchanges: The nature of emotions and affective expression (e.g., positive,
negative sentiment).
Cognitive and Social Orientation: The linguistic attributes characterizing the response
process.
The analysis used a large dataset of wall posts and status updates from 85 police organization
Facebook pages in India.
1. Topical Characteristics
Police-Initiated Posts: Police focus on official content like advisories, case updates,
rules, and safety violations. These discussions are generally more focused and narrow.
Citizen-Initiated Posts: Citizens' posts are more diverse and generally request the police
to take action (e.g., "please take action") or discuss issues like neighborhood problems,
missing people, and appreciation.
2. Engagement Characteristics
Engagement Rate: Posts initiated by the police receive more comments and more likes
on average than posts initiated by citizens.
Discussion Closure: When a citizen posts, and the police quickly respond and interact
(Police and Citizen comment), the discussion tends to close earlier. This is believed to
be because the police's response satisfies the initial query, leading to a lower average
number of comments in police-involved threads compared to citizen-only discussions.
3. Emotional Exchanges
Negative Sentiment: Discussions where the police are involved in the comments on a
citizen-initiated thread (CPC) show a higher negative affect, anger, and arousal. This is
attributed to citizens using the platform to strongly express their views and push the
police to resolve crime or neighborhood problems.
Reduced Anxiety: Conversely, when the police engage in a discussion, the level of
anxiety decreases for the citizens. A quick response from the police acts as a form of
emotional support, reducing the residents' anxiousness about a safety issue.
Self-Focus: Most citizen-initiated posts are self-driven, using words like "I" frequently.
This indicates that citizens are primarily posting about issues directly affecting them or
their own experiences.
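Self-focus of this kind can be approximated by measuring how often first-person words appear in a post. The Python sketch below is only an illustration: the word list and function name are assumptions, and the notes do not say which tool the study actually used.

```python
import re

# Small, assumed list of first-person singular words.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def self_focus_ratio(text):
    """Return the share of words in the text that are first-person words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in FIRST_PERSON for word in words) / len(words)


print(self_focus_ratio("I lost my phone near the station"))
```

Contractions such as "I'm" are not counted by this simple list, so a real analysis would need a fuller dictionary.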
The study concludes that using social media data can significantly help improve policing by:
Early Warning System: The data can be used for predictive analytics to anticipate safety
issues and understand changes in public sentiment over time.
The video "Week 6.1: eCrime on Online Social Media" introduces the topic of e-crime
(cybercrime) specifically within the context of social media and outlines several common types
of scams and malicious activities.
The main part of the lecture focuses on various forms of crime and scams prevalent on online
social platforms.
Phishing: The act of tricking users into revealing their login details or credentials. This
traditional email-based scam has spread to social media, using links on platforms like
Twitter or Facebook to direct users to fake websites that mimic legitimate services.
Fake Customer Service Accounts: Scammers create accounts that closely resemble those
of real organizations (like banks) and reply to real customer posts (e.g., on Twitter) about
a problem. They then direct the user to a fake, "secure" sign-on channel to steal
credentials.
Fake Comments on Popular Posts: Scammers pretend to be legitimate users and post
malicious links in the comment sections of popular, trending content (like posts about
the Olympics or a major event) to maximize visibility and lure users.
Fake Live Streaming Videos: This scam lures users, especially during major events (like
sports matches), by posting a link claiming to offer a live stream. Clicking the link takes
the user to a fake page to potentially steal information or install malware.
o Discounts: Fake accounts that look like real businesses offer bogus discounts
(e.g., 10% off Netflix) to trick users into providing personal or financial details.
o Contests/Surveys: Scams that promise money or prizes for filling out surveys,
often leading to personal information theft.
Social Reputation Manipulation: The lecture discusses how social influence (e.g.,
number of likes, followers, endorsements, and product reviews) is manipulated through
fake accounts, followers, and reviews (e.g., on Amazon or Flipkart) to influence public
perception.
Hashtag Hijacking: Using a popular or trending hashtag (like a major news event) that is
irrelevant to a product or promotion to gain exposure for commercial or malicious
purposes.
The lecture concludes by emphasizing the importance of studying these crimes to develop
technological solutions that can help police and citizens interact better and make the
community safer.
The video "Week 6.2 eCrime on Online Social Media" focuses on link farming within the
context of Twitter and analyzes the presence of spammers based on their influence
(PageRank/indegree).
Link farming, traditionally a web technique for artificially boosting a website's PageRank by
exchanging non-benign reciprocal links, has a direct parallel on Twitter:
Goal: To increase a user's indegree (number of followers) to make their tweets more
likely to appear high in search results, thus increasing their social reputation and
influence (which is sometimes measured by scores like Klout).
Mechanism: Spammers and even some legitimate/popular users follow other accounts,
betting on the high reciprocity rate—the probability that the followed user will follow
them back—to quickly inflate their follower count.
Similarity to Web Link Farming: Both methods are used to spam the index of a search
engine (Google or Twitter Search) by artificially boosting the link structure
(indegree/PageRank).
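The two quantities in this mechanism, indegree and reciprocity, can be sketched on a toy follow graph. This is a minimal illustration, not the measurement used in the research: the account names, the edge set, and the helper functions indegree and reciprocity are all made up for the example.

```python
# Toy follow graph: each pair (a, b) means "a follows b".
# All account names are hypothetical.
follows = {
    ("spammer", "alice"), ("spammer", "bob"),
    ("spammer", "carol"), ("spammer", "dave"),
    ("alice", "spammer"), ("bob", "spammer"), ("carol", "spammer"),
    ("alice", "bob"),
}


def indegree(user, edges):
    """Number of followers: edges that point at `user`."""
    return sum(1 for _, target in edges if target == user)


def reciprocity(user, edges):
    """Fraction of the accounts `user` follows that follow `user` back."""
    followed = {target for source, target in edges if source == user}
    if not followed:
        return 0.0
    followed_back = sum(1 for other in followed if (other, user) in edges)
    return followed_back / len(followed)


print(indegree("spammer", follows))     # 3 followers gained
print(reciprocity("spammer", follows))  # 0.75: three of four follows were returned
```

In this toy graph the account followed four users and three followed back, which is how following many accounts translates into a higher indegree.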
The lecture cites several research results to provide context on the scale and nature of spam on
Twitter:
1. Persistence of Spam: One study found that five spam campaigns, controlling 145,000
accounts, were able to persist for months at a time.
2. Malicious URLs: Another paper found that 8% of 25 million URLs posted on Twitter
pointed to malicious content like phishing, malware, and scams.
3. High Click-Through Rate: Twitter is a highly successful platform for coercing users to visit
spam pages, with a click-through rate of 0.13%, which is significantly higher than rates
previously reported for email spam. This is likely because the malicious content often
comes from a seemingly trustworthy source (a friend or verified account).
The lecture presents an analysis of a 2009 dataset with 54 million users to study the link
farming problem, using node rank (PageRank, based on indegree) as a measure of influence:
Finding: The analysis revealed that spammers themselves have surprisingly high
influence:
o 7 spammers were found within the top 10,000 users ranked by PageRank.
This conclusion highlights that spammers successfully create accounts with high indegree (many
followers) through link farming, meaning they are influential enough to have their malicious
content appear high in search results.
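To see why follow-backs raise an account's rank, PageRank can be sketched with a simple power iteration. This is a simplified illustration, not the computation run on the 54-million-user dataset: the damping factor of 0.85, the iteration count, and the toy graph are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (follower, followed) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node keeps a small baseline share of rank.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source in nodes:
            # A node that follows nobody spreads its rank over all nodes.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical accounts: "farmer" follows a, b, c and all three follow back.
toy_edges = [
    ("farmer", "a"), ("farmer", "b"), ("farmer", "c"),
    ("a", "farmer"), ("b", "farmer"), ("c", "farmer"),
    ("a", "b"),
]
ranks = pagerank(toy_edges)
print(max(ranks, key=ranks.get))  # farmer
```

The account that collects the most follow-backs ends up with the highest rank in this toy graph, mirroring the finding that link farming lifts spammers into the upper ranks.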
The video "Tutorial 4 Social Network Analysis" explains how to represent social media data as a
graph and introduces the fundamental concepts and metrics of Social Network Analysis (SNA).
Nodes (Vertices): Represent entities in a social network, such as users, pages, or groups.
Edges (Links): Represent relationships between nodes.
o A directed edge from user A to user B can mean "A follows B" (e.g., on Twitter).
o An edge between user A and a page can mean "A likes the page."
The video discusses three common ways to represent a node-edge graph for a computer:
1. Adjacency Matrix:
o A two-dimensional square matrix where the size equals the number of nodes.
o A cell at row I and column J is 1 if an edge exists between node I and node J;
otherwise, it's 0.
o Drawback: Can be sparse and space-consuming for graphs with many nodes and
few edges.
2. GraphML Format:
o Each node has a distinct ID, and each edge has source and target attributes
identifying the connection's endpoints.
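Both representations can be built from the same edge list. The sketch below uses only the Python standard library; the node names and the graph id "G" are made up, and the GraphML output is deliberately minimal (node ids plus source and target attributes, no extra data keys).

```python
import xml.etree.ElementTree as ET

# Hypothetical follower graph: (source, target) means "source follows target".
nodes = ["alice", "bob", "carol"]
edges = [("alice", "bob"), ("carol", "bob")]

# Adjacency matrix: row i, column j is 1 if an edge goes from node i to node j.
index = {name: i for i, name in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for source, target in edges:
    matrix[index[source]][index[target]] = 1

# GraphML: one <node> per id, one <edge> per (source, target) pair.
graphml = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
graph = ET.SubElement(graphml, "graph", id="G", edgedefault="directed")
for name in nodes:
    ET.SubElement(graph, "node", id=name)
for source, target in edges:
    ET.SubElement(graph, "edge", source=source, target=target)
graphml_text = ET.tostring(graphml, encoding="unicode")

for row in matrix:
    print(row)
print(graphml_text)
```

Even in this three-node example most matrix cells are 0, which is the sparsity drawback noted above; the GraphML form stores only the edges that exist.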
The tutorial demonstrates using a command-line tool called twecoll to collect a Twitter follower
network graph in the GraphML format:
1. Initialization (twecoll init): Authorize a Twitter application using the consumer key,
consumer secret, and a verification PIN.
2. Fetch (twecoll fetch): Get the list of followers of your followers (friends-of-friends
information).
3. Edge List (twecoll edgelist): Generate the edges (relationships) between your followers
and their followers, resulting in a GML file.
SNA metrics help analyze the structure and importance of nodes within the network:
1. Degree
o In-degree: The number of edges pointing to a node. In a Twitter follower graph, this
signifies the number of followers a user has.
o Out-degree: The number of edges leaving a node. In a Twitter follower graph, this
signifies the number of users a user is following.
2. Centrality
Centrality helps find the most important or central node in the graph:
In-degree Centrality: Finds the node with the highest in-degree (e.g., the most
influential node or the user with the most followers).
Out-degree Centrality: Helps locate the node with the highest out-degree.
Betweenness Centrality: The number of shortest paths between all other pairs of
vertices that pass through the specific node. It measures how often a node acts as a
bridge or intermediary.
Closeness Centrality: Helps find the node with the lowest total distance from all other
nodes. A node with high closeness can reach other nodes faster.
3. Community Detection
Community: A group of similar or strongly connected nodes.
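The degree and closeness definitions above can be sketched in plain Python on a toy graph. This is an illustration only: the edge list is made up, closeness is computed on the undirected version of the graph as a simplifying assumption, and betweenness and community detection are left out to keep the example short (tools such as Gephi compute them directly).

```python
from collections import deque

# Hypothetical follower graph: (follower, followed).
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "e"), ("e", "f")]
nodes = sorted({n for edge in edges for n in edge})


def degrees(edge_list):
    """Return (in_degree, out_degree) dictionaries for every node."""
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for source, target in edge_list:
        out_deg[source] += 1
        in_deg[target] += 1
    return in_deg, out_deg


def total_distance(start, edge_list):
    """Sum of shortest-path lengths from `start`, ignoring edge direction."""
    neighbours = {n: set() for n in nodes}
    for source, target in edge_list:
        neighbours[source].add(target)
        neighbours[target].add(source)
    dist = {start: 0}
    queue = deque([start])
    while queue:  # breadth-first search
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sum(dist.values())


in_deg, out_deg = degrees(edges)
most_followed = max(nodes, key=in_deg.get)
closest = min(nodes, key=lambda n: total_distance(n, edges))
print(most_followed, in_deg[most_followed])      # b 3
print(closest, total_distance(closest, edges))   # b 6
```

Here the same node wins on both in-degree and closeness, but on real graphs the two measures often pick different nodes.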
The free and open-source tool Gephi is used to visualize and analyze the collected graph data:
It allows for the generation of network statistics (e.g., average degree, network
diameter).
It supports customization of node appearance (color and size) based on SNA metrics like
In-degree or Modularity Class to highlight important nodes or communities visually.
It includes a Data Laboratory tab to browse the raw node and edge data.
The video "Week-7.1: Link Farming in Online Social Media" continues the discussion on e-
crime by delving deeper into the characteristics and behavior of link farmers on platforms like
Twitter, often blurring the lines between malicious and legitimate activity.
The research analyzed in the video highlights the mechanics and success factors of link farming:
High Reciprocity: Spammers rely heavily on the reciprocity principle: the high
probability that a user will follow back if a spammer follows them first.
o The top-ranked spam followers (accounts that tend to follow spammers back)
are responsible for approximately 60% of all in-links acquired by spammers.
The analysis found that many accounts engaging in link farming were not traditional, malicious
spammers:
High Proportion of Real Accounts: Out of a sample of top link farmers, 76% of accounts
were still active (not suspended). A small portion included verified accounts (blue tick),
confirming they belonged to legitimate individuals.
Thematic Focus: The most common bio topics among these real link-farming users
included business, internet marketing, entrepreneurship, and social media, suggesting
their goal is to maximize visibility and self-promotion.
Surprising Participants: The study found that link farming behavior was exhibited by
legitimate, popular, and highly active users, including bloggers, experts, and even
celebrities (e.g., Britney Spears, Obama).
By comparing top link farmers to traditional spammers and a random user sample, the
following network characteristics were observed:
Metric | Top Link Farmers | Comparison to Spammers/Random Sample | Insight
The overall conclusion is that link farming is an effective mechanism for increasing an account's
social capital and influence. The fact that seemingly legitimate and popular accounts engage in
this behavior complicates efforts to detect and eliminate the activity, as it benefits real users
seeking to boost their social standing.
This video, "Week-7.2: Nudges," discusses the problem of users not reading privacy policies and
explores the use of nudges—small informational interventions—to help people make more
informed and less regrettable decisions about their online disclosures, particularly on social
media.
Low Reading Rate: Research shows that users rarely read privacy policies.
Economic Cost: One study estimated that if every U.S. citizen read the privacy policy for
each website they visited once a month, the national opportunity cost would be $781
billion annually (equivalent to 244 hours per year per person). This cost justifies the
need for simpler, more immediate decision-support tools.
The core goal of using nudges is to help individuals avoid regrettable online disclosures by
providing contextual information at the point of action.
An early example from MIT was designed to prevent users from sending emails to the wrong
mailing list. When composing an email, the tool would:
Show profile pictures of the people who would receive the email.
Visually highlight the profile picture of the person the user had interacted with most
frequently in the past.
🎯 Conclusion
The research demonstrated that these small interventions are helpful for users in making better
decisions about sharing information, especially when it comes to controlling the audience and
the tone of their posts. However, the video concludes that more work is needed to understand
which types of nudges are most effective in specific contexts without becoming overly intrusive
or annoying to the user.
The video "Week-7.3: Semantic attacks: Spear phishing" discusses semantic security attacks,
focusing on phishing, particularly spear phishing, which leverages personal information from
social media to increase attack success.
The video introduces the concept of Semantic Attacks, a category of security threats targeting
human perception and meaning.
Semantic Attacks target the way humans assign meaning to content. Unlike physical
attacks (accessing a machine) or syntactic attacks (buffer overflows), semantic attacks
exploit the mental model of the user.
o The User's Mental Model: What the user thinks is happening (e.g., who is
sending the message, the meaning of the content).
o The System Model: What the technical system thinks the user is doing (e.g.,
which remote machine is being accessed, the URL).
Phishing is a common semantic attack where attackers attempt to acquire sensitive information
(like usernames, passwords, and credit card details) by disguising themselves as a trustworthy
entity in an electronic communication.
Types of Phishing:
Term | Target/Method
Phishing | Classical attack, often through bulk email (e.g., urgent notification from eBay).
Social phishing refers to conducting a phishing attack by first collecting personal information
from social networks to make the phishing attempt more credible.
The video details a classical study conducted at Indiana University to test the effectiveness of
social phishing:
2. Email Crafting: Spoofed emails were crafted, either from an unknown person at the
university (Control Group) or from an alleged friend (Experimental/Social Condition),
using details gleaned from the social data to appear highly relevant and trustworthy.
3. Authentication Attempt: When users clicked the link, they were directed to a fake login
page that checked their credentials against the university authenticator and returned a
non-committal error ("server overloaded," "try again later").
The study demonstrated that personalization significantly increased the success rate of the
attack:
Success Rate:
o Control Condition (Email from unknown sender at the university): 16% success
rate.
o Social Condition (Email crafted from an alleged friend): 72% success rate.
Urgency and Takedown Time: 70% of successful authentications occurred in the first 12
hours, highlighting the immediate risk and the need for phishing websites to be taken
down rapidly.
Repeated Attempts: Users exhibited high persistence, with some trying to authenticate
up to 80 times after receiving an error message, confirming their high trust in the
message's legitimacy.
👤 Vulnerability Factors
Gender: Females were found to be more vulnerable overall. The success rate was
highest when the email came from the opposite gender (78% from male to female; 68%
from female to male).
The study received significant negative backlash from participants who felt the experiment was
unethical, inappropriate, and fraudulent, highlighting the psychological costs of such research.
Education: Extensive educational campaigns are needed to inform users about the risks
of sharing personal information and recognizing phishing attacks.
Technological Solutions:
The video "Week 7 - Tutorial 6- Analyzing text with NLTK" provides a tutorial on using the
Natural Language Toolkit (NLTK) in Python to perform simple text analysis and gain insights
from a collected dataset of tweets.
The tutorial walks through a process to clean and analyze text, using tweets about the Serum
Institute as an example.
The goal is to move from raw text to meaningful word insights by systematically cleaning the
data:
1. Tokenization
Action: Breaks the raw text (the collected tweets) into individual units, or tokens (words
and punctuation).
Tool: nltk.word_tokenize.
Result: A list of all individual words, but the data is noisy, including punctuation,
capitalization differences, and common, uninformative words.
2. Lowercasing
Result: Standardizes words so that "Fire" and "fire" are counted as the same token,
improving the total count for meaningful terms.
3. Punctuation Removal
Action: Deletes standard punctuation marks (like colons, commas, and full stops) from
the tokens.
Result: Removes uninformative characters, making the list of common tokens cleaner.
4. Stop Word Removal
Action: Removes common English words (like "the," "of," "at," "in") that do not
contribute to the topic and removes any remaining single-character tokens.
Result: This final cleaning step significantly clarifies the data, revealing the core subjects
of the tweets.
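The four cleaning steps can be sketched end to end. To stay self-contained, this sketch does not call NLTK: a regular expression stands in for the tokenizer and a small hand-written STOP_WORDS set stands in for NLTK's English stop-word list. The two sample tweets are invented for illustration and are not from the tutorial's dataset.

```python
import re
import string
from collections import Counter

# Invented sample tweets, not taken from the collected dataset.
tweets = [
    "Fire at the Serum Institute in Pune.",
    "Serum Institute fire: vaccine production is safe, officials say.",
]

# Tiny stand-in for a real stop-word list.
STOP_WORDS = {"the", "of", "at", "in", "is", "a", "an", "and", "to"}
PUNCTUATION = set(string.punctuation)


def top_tokens(texts, n=3):
    """Tokenize, lowercase, drop punctuation and stop words, then count."""
    tokens = []
    for text in texts:
        # Step 1: split into word tokens and single punctuation tokens.
        tokens.extend(re.findall(r"\w+|[^\w\s]", text))
    # Step 2: lowercase so "Fire" and "fire" count as one token.
    tokens = [t.lower() for t in tokens]
    # Step 3: remove punctuation tokens.
    tokens = [t for t in tokens if t not in PUNCTUATION]
    # Step 4: remove stop words and single-character tokens.
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 1]
    return Counter(tokens).most_common(n)


print(top_tokens(tweets))
```

On these two sample tweets the three most frequent tokens are "fire", "serum", and "institute", each appearing twice, which shows how the cleaning steps surface the topic words.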
🔍 Gaining Insight
After completing the cleaning process, the analysis of the top-occurring tokens provides a clear
picture of the tweet content:
Identified Topics: The most frequent terms included "serum institute," "india," "fire,"
"vaccine," "pune," "five lives loss," and "covid 19."
Conclusion: The analysis reveals that the tweets are centered on a tragic incident—a fire
at the Serum Institute in Pune that resulted in five lives lost, which was significant due
to the institute's role as the largest COVID-19 vaccine producer in the country.
This demonstrates the power of NLP and basic string operations to rapidly extract actionable
intelligence from unstructured text data.
The video "Week 8.1: Profile Linking on Online Social Media" explores the problem of profile
linking or identity resolution, which is the process of matching different accounts across various
online social networks (OSNs) to confirm they belong to the same individual.
The primary motivation is to track a user's digital footprint and avoid duplicating effort, such as
preventing advertisers from wasting resources by sending the same ad to a person's Facebook,
Twitter, and LinkedIn accounts.
Users often have multiple accounts on OSNs, and linking these profiles is challenging because:
Different Networks, Different Data: Different platforms (e.g., YouTube, Tinder, LinkedIn,
Facebook) offer different types of information (video opinions, dating data, professional
details, personal connections).
Attribute Change Over Time: Users frequently change their profile attributes, making
static matching difficult:
o Profile Pictures: Up to 40% of users changed their profile picture multiple times
within a two-month period.
Ambiguity: It is difficult to link accounts when they use different names (e.g., "Pam
Kumuru" vs. "pnr2468") or when a user is inactive on one platform.
To overcome these challenges, researchers use various attributes and features, often
incorporating the historical evolution of these attributes:
A user explicitly links their accounts on a public page (e.g., adding a Twitter link to their
Tumblr profile).
A user posts the same content (like a photo album link) simultaneously across two
different networks.
Researchers compare a diverse set of attributes, categorized by their structural and temporal
characteristics:
o Reused Patterns: Checking if the same core username is reused across platforms
(e.g., "swampson" on Twitter and Instagram).
Using the engineered features (current and historical), a classifier (like an SVM) can be trained
to calculate the probability that two handles belong to the same person. Studies have shown
that including past username information can significantly improve the accuracy of profile
linking.
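A username-similarity feature that also looks at past usernames might be sketched as follows. This is not the feature set or classifier from the research: the difflib ratio is a swapped-in string-similarity measure, the helper names similarity and best_match are hypothetical, and the handles other than "swampson" are invented. A real system would feed several such features into a trained classifier.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity between two handles, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(handle, other_current, other_history=()):
    """Highest similarity between `handle` and the other account's
    current username or any of its known past usernames."""
    candidates = [other_current, *other_history]
    return max(similarity(handle, candidate) for candidate in candidates)


# Hypothetical case: the second account now uses an unrelated username,
# but one of its earlier usernames was close to the first account's handle.
current_only = best_match("swampson", "marsh_king")
with_history = best_match("swampson", "marsh_king", ["swampson_"])
print(round(current_only, 2), round(with_history, 2))
```

The score computed from the current username alone is low, while including the earlier username raises it sharply, which illustrates why historical attributes can help linking.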
Law Enforcement: To recognize that different handles on multiple platforms (e.g.,
accounts posting the same malicious video on both YouTube and Twitter) belong to a single
malicious actor, so that investigators do not waste resources chasing several separate leads.
The video "Week 8.2: Anonymous Networks" discusses the nature, use, and characteristics of
anonymous social networks, using Whisper as the primary example.
Anonymous Networks are social platforms where it is difficult or impossible to identify the
person posting the content.
Motivation: They meet the growing demand for privacy and anonymity online, often
driven by high-profile privacy incidents (like the Snowden revelations) or consequences
faced by users for controversial posts on traditional networks. Users turn to them to
express thoughts they wouldn't want attributed to their real identity.
Whisper is a confessional app where users post "whispers" (a post with text layered over a
generated image, similar to a meme). It prioritizes anonymity through several features:
Changeable Identity: Users can change their temporary usernames whenever they
want, making long-term identity tracking extremely difficult.
Non-Persistent Social Links: Unlike Facebook or Twitter, Whisper does not maintain a
persistent graph of user relationships, even for actions like "hearts" (likes) or replies.
Analysis of Whisper data reveals that user interactions and the network structure are
fundamentally different from traditional online social networks (OSNs) like Facebook and
Twitter.
Short Attention Span: 54% of replies to a whisper arrive within the first hour of posting,
and 94% arrive within one day. If a post doesn't get attention quickly, it is unlikely to get
any later.
A network analysis comparing Whisper to Facebook and Twitter found that Whisper behaves
like a random graph, suggesting interactions are not based on existing relationships:
Metric | Whisper (Anonymous) | Facebook (Traditional) | Implication for Whisper
Despite the number of new users increasing, the daily volume of new posts and replies in the
entire network remains relatively stable. This suggests that the lack of persistent identity and
strong social links prevents the formation of stickiness—the factor that keeps users continually
engaged in traditional OSNs. New users either quickly leave or do not contribute significantly,
leading to flat overall content generation.
🌐 What is Gephi?
Gephi is an open-source tool widely used in academia, journalism, and digital humanities for
analyzing and visualizing complex network data, such as social graphs from platforms like
Facebook and Twitter.
The tutorial walks through the three main tabs in Gephi and their functionalities:
1. Data Laboratory
Data Loading: Gephi supports various file formats, including GML, GraphML, and CSV
(requiring separate files for nodes with unique IDs, and edges with Source and Target
IDs).
Data Manipulation: You can view, edit, and modify attributes for both nodes (e.g.,
changing the number of followers) and edges.
2. Overview
Statistics: This panel is used to calculate various network metrics by clicking the "Run"
button for each:
o Average Degree: The average number of connections per node.
o Network Diameter: The longest shortest path between any pair of nodes (the average
of the shortest distances is a separate statistic, the average path length).
Appearance: Customize the visual properties of the nodes and edges, such as changing
size or color based on calculated attributes (e.g., sizing a node by its in-degree or
coloring it by its modularity class).
Layout: Apply various algorithms (e.g., Fruchterman Reingold, Force Atlas 2) to organize
the graph visually and separate overlapping nodes.
Filters: Apply filters to select specific subsets of nodes and edges based on attributes or
network topology (e.g., nodes with a certain range of followers).
3. Preview
Final Visualization: Allows for fine-tuning the visual representation (e.g., smoothing
curved edges, showing labels) for presentation purposes.
Export: You can export the final network visualization in formats like PDF, PNG, or SVG.
Additional Features
Workspaces: You can export a filtered subset of the graph to a new workspace to
perform separate, detailed analysis on a sub-graph without affecting the original data.
This video, "Week 9.1: Privacy in Location Based Social Networks Part 1," discusses privacy
concerns and research related to Location-Based Social Networks (LBSNs), using Foursquare as
a primary case study.
The lecture focuses on the shift in the course toward reviewing research papers to understand
how techniques learned previously are applied to draw meaningful inferences.
Privacy Concern: The core issue is the leakage of location information. Data from
surveys suggests that a significant portion of users (up to 39%) set their location sharing
to "Everyone" on platforms like Facebook. An example from the survey showed that 44%
of Indian participants felt it was privacy-invasive when a mobile service provider used a
regional language to report a switched-off phone, as it revealed their geographical
location.
Foursquare offers a gamified platform with the following key terms and public/private data:
Check-ins: A user sharing their current location (Venue) via GPS. Private (usually visible
only to friends).
The research found that by analyzing only the publicly available information (Mayorships, Tips,
and Dones), they could easily infer the home city of approximately 78% of the analyzed users
within 50 kilometers.
The video also references the defunct website "Please Rob Me," which demonstrated the real-
world risk of location sharing by aggregating tweets that indicated users were not home,
effectively providing a target list for burglars.
This video, "Week 9.2: Privacy in Location Based Social Networks Part 2," continues the
analysis of the research paper "We Know Where You Live: Privacy Characterization of
Foursquare Behavior," detailing the methodology used to infer users' home locations from their
publicly available Foursquare data.
The lecture begins by characterizing the massive data set, which included approximately 13
million users and millions of tips, dones, and mayorships.
Power Law Distribution: The distribution of mayorships, tips, and dones per user and
per city was found to be highly skewed, following a Power Law (or Pareto Principle). This
means a small minority of users or cities account for the vast majority of the content
posted.
The core of the analysis was determining the accuracy of inferring a user's home location using
the public information (mayorships, tips, and dones) and a majority voting scheme.
1. Class 0: Users with only a single activity (one tip, one done, or one mayorship) in the
data set.
2. Class 1: Users with multiple activities but a predominant location that stands out as the
most frequent.
3. Class 2: Users with multiple activities where no single location is clearly predominant.
(Prediction efforts focused on Class 0 and Class 1, as the home location could not be
determined for Class 2).
Key Finding
By comparing the inferred home location with the user-declared home city, the research
established a high degree of accuracy for privacy leakage:
The model can correctly infer the home city of around 78% of the analyzed users within a 50-
kilometer distance, using only their publicly available activity data. This demonstrates a
significant privacy risk on location-based social networks, as home location can be inferred
without access to private check-ins.
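The majority voting scheme and the three user classes can be sketched as below. This is a minimal reading of the description above, not the paper's exact procedure: treating "predominant" as a strict plurality is an assumption, and the function name and city lists are made up.

```python
from collections import Counter


def infer_home_city(activity_cities):
    """Return (user_class, inferred_city) from the cities of a user's
    public activities (mayorships, tips, dones).

    Class 0: a single activity.
    Class 1: several activities with one city strictly in the lead.
    Class 2: several activities with a tie for the lead (no inference).
    """
    if not activity_cities:
        raise ValueError("user has no public activities")
    counts = Counter(activity_cities).most_common()
    if len(activity_cities) == 1:
        return 0, counts[0][0]
    # Assumption: "predominant" means more votes than any other city.
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return 1, counts[0][0]
    return 2, None


print(infer_home_city(["Pune"]))                    # (0, 'Pune')
print(infer_home_city(["Delhi", "Delhi", "Pune"]))  # (1, 'Delhi')
print(infer_home_city(["Delhi", "Pune"]))           # (2, None)
```

The inferred city can then be compared with the user-declared home city to measure accuracy.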
The first part of the tutorial focuses on installing and using the Python Highcharts wrapper to
build interactive graphs.
Setting Up
1. Installation: The wrapper is installed using the command sudo pip install
python-highcharts.
2. Basic Structure: Code involves importing the Highchart class from the highcharts
module, defining a container, setting chart options (like title and chart type), and finally
adding the data series. The script generates an HTML file that can be viewed in a web browser.
The tutorial provides step-by-step examples for creating four fundamental chart types:
1. Bar Chart:
o Purpose: Presents grouped data using rectangular bars, where the length of the
bar is proportional to the value.
o Example: Visualizing the number of followers for five different users across four
consecutive days.
o Interaction: Allows users to filter out specific data series (e.g., a particular day)
by clicking on the legend.
2. Line Chart:
o Purpose: Shows a series of data points connected by straight lines, typically used
to visualize data changes over time.
o Example: Tracking the follower growth of multiple users over the months of a
year.
o Interaction: Demonstrates how to hover over data points to see specific values
and filter users from the graph.
3. Scatter Plot:
o Purpose: Draws a point for each data point to show the relationship or pattern
between two variables without connecting the points.
4. Bubble Chart:
The video briefly introduces Highcharts Cloud, an alternative platform for visualization. Users
can simply copy and paste their data or upload a CSV file, and the platform automatically
generates the chart. This provides a non-coding method to create visualizations and allows for
easy customization through various templates.
"Week 10.1: Beware of What You Share Inferring Home Location in Social Networks,"
discusses a research paper that performs a large-scale inference study to determine a user's
home city and residence using public data from three popular social networks: Foursquare,
Google+, and Twitter.
The study analyzed millions of data points from the three networks:
Home City Inference
Social Network Primary Attributes Analyzed
Accuracy
Data Volume: The data sets were large, with Foursquare data comprising 13 million
users and 16 million venues, Google+ including 27 million profiles, and Twitter using 20
million geotagged tweets.
Geographic Granularity: For all three platforms, the city level was the most common
and precise level of location information available for user attributes (e.g., user home
city, venue city, education location).
Twitter's Advantage: The high accuracy for Twitter (82%) is primarily due to the
precision of geotagged tweets (latitude and longitude), which allowed for highly
accurate location inference.
The researchers grouped users into three classes based on their activity patterns, similar to the
Foursquare analysis in the previous lecture:
1. Class 0: Users with only a single activity in the data set.
2. Class 1: Users with multiple activities but one clearly predominant location.
3. Class 2: Users with multiple activities and no single dominant location. (Inference
models are focused only on Class 0 and Class 1).
The study also analyzed the physical distance (error) between the inferred home location and
the user's declared home location, highlighting the precision of the privacy leak:
Home City (within 50 km): The model could make correct inferences within a 50-
kilometer radius of the user's declared home city with the following accuracies:
o Twitter: 87%
o Foursquare: 78.5%
o Google+: 64%
Home Residence (within 20 km): When attempting to infer the user's exact residence
(not just the city), the results were particularly stark for platforms with highly precise
location data:
o Twitter: 73.67% of inferences were within a 20-kilometer radius, with 35% being
exactly 0 km away from the residence.
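The distance error behind figures such as "within 50 km" or "within 20 km" is the great-circle distance between the inferred and declared coordinates. A common way to compute it is the haversine formula, sketched below; the paper's exact distance computation is not given here, so this is an assumption, and the error values in the example are invented.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def share_within(errors_km, radius_km):
    """Fraction of inferences whose error is at most `radius_km`."""
    return sum(1 for e in errors_km if e <= radius_km) / len(errors_km)


# One degree of latitude is roughly 111 km.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))

# Invented error values for four users, in kilometers.
print(share_within([0.0, 10.0, 30.0, 60.0], 20))  # 0.5
```

Applying share_within with radii of 50 and 20 yields the kind of accuracy-by-distance percentages reported above.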
The paper concludes that users must beware of what they share, as location-based social
networks leak significant privacy by allowing highly accurate inference of a user's home city and
residence using only publicly available data.
"Week 10.2: On the dynamics of username change behavior on Twitter," analyzes the patterns,
frequency, and motivations behind why users change their usernames on Twitter.
The study tracked 8.7 million users over two months and found that about 10% of users
changed their username at least once. The behavior follows a classic Pareto principle (power-
law distribution), showing that a small fraction of users are responsible for the majority of
changes.
Frequency: 20% of users were responsible for 85% of all username changes, with these
users often changing their names five times or more. One extreme user changed their
username 113 times in 14 months.
New vs. Old Names: About 65% of users choose a new username that is unrelated to
their old one, while 35% later reuse an old name.
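The concentration behind a statement like "20% of users were responsible for 85% of all changes" can be checked with a short calculation. The counts below are invented for illustration and do not reproduce the study's figures; share_of_top is a hypothetical helper.

```python
def share_of_top(change_counts, top_fraction=0.2):
    """Fraction of all changes made by the most active `top_fraction` of users."""
    ordered = sorted(change_counts, reverse=True)
    top_n = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:top_n]) / sum(ordered)


# Invented data: username changes per user for ten users.
changes_per_user = [12, 5, 1, 1, 1, 1, 1, 1, 1, 1]
print(share_of_top(changes_per_user))  # 0.68: top 2 users made 17 of 25 changes
```

A value far above the chosen fraction (here 0.68 against 0.2) is the signature of the skewed, Pareto-like behavior described above.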
The authors identified several reasons why users change their handles, some benign and some
malicious:
Space Gain: Since Twitter had a character limit, many users with long usernames
(greater than the median length of 11 characters) changed to shorter names to save
characters for their tweets.
Suit a Trending Event: Users change their handle to align with a current event (like a
sports season) to gain traffic or traction.
Anonymity: Users change to a more generic name (e.g., from a real name to an
anonymous phrase) to gain privacy, or vice versa, to make the account more personal.
Adjust to Real-Life Events: A user might update their handle to reflect a new profession
or status (e.g., changing from "gradstudent" to "professor").
Malicious Intent:
The study also observed instances of collaborative username sharing where a single username
was used by different accounts within a small group at different timestamps, suggesting
coordinated behavior.
"Week 10.3: Boston Marathon - Analyzing Fake Content on Twitter," examines the spread and
characteristics of fake, rumor, and malicious content on Twitter during and immediately
following the Boston Marathon bombing on April 15, 2013.
The study collected 7.9 million tweets using a set of event-related keywords and manually
annotated the most popular content to categorize it.
o 29% of the most viral content was classified as rumors and fake content.
o The frequency of tweets directly correlated with real-world events, such as one
hour after the blast, the release of suspect pictures, and the subsequent
manhunt.
o The authors identified over 6,000 malicious user profiles that were created right
after the blast and were later suspended by Twitter.
o These profiles were not isolated; they often formed connected networks (e.g.,
closed communities, star topologies) to collaboratively spread the fake content.
Characteristics of Propagators
The analysis of user attributes revealed features of the accounts involved in spreading the fake
content:
Device Used: Approximately 75% of the fake tweets were propagated via mobile
phones (e.g., iPhone).
Follower Count: Fake content was often propagated by accounts with a low number of
followers (low popularity). However, it became viral because it was retweeted by users
higher up in the network chain, including some with verified accounts who did not check
the content's validity.
Predicting Virality: The study found that it is possible to predict how viral fake content
will become based on the attributes of the users currently propagating it, using features
like social engagement and credibility.
⚠️ Misinformation Examples
Spreading false claims about victims (e.g., a non-existent child who died).
Tweets asking for retweets with the false promise that "$1 will be donated to Boston
Marathon victims for every retweet."