Understanding PSOSM in Social Media

The NPTEL PSOSM course focuses on the significance of social media in daily interactions and business, emphasizing concepts of privacy and security. It explores the predictive power of social media data, case studies of its impact, and methodologies for analyzing misinformation. The course also covers technical aspects of data collection using APIs and Python, alongside practical tutorials for platforms like Reddit.

NPTEL PSOSM Course

Content
Course Focus and Motivation

 Ubiquity of Social Media: The instructor emphasizes that almost everyone uses
platforms like Instagram, Facebook, LinkedIn, or Twitter, highlighting their role in daily
interactions and business.

 Data Volume and Activity: The massive increase in user activity on various platforms like
Tinder, Netflix, and Twitch between 2019 and 2020 demonstrates the phenomenal
growth and data generation.

 Platform Variety: Social media is a broad term, encompassing networks based on
different content types:

o Video: YouTube

o Images: Flickr, Pinterest

o Professional: LinkedIn

o Short Video: TikTok

o Anonymous: Sarahah, Whisper

o Ephemeral: Snapchat (where content disappears after a time).

Key Concepts for the Course

The lecture establishes the difference between the two main topics of the course:

 Privacy: Defined as the state of not being watched or tracked. It centers on
expectations about information: the ability to be selective about sharing personal data
with only chosen people.

 Security: Concerns the protection of information during transfer and storage, ensuring
confidentiality, integrity, and availability.

 Course Objective: The class will use these concepts to study problems like fake news,
identifying fake/bot accounts, and determining a user's location from posted pictures.

Phenomena in Social Media

The lecture covers two foundational concepts from network science that are highly relevant to
social platforms:

1. Six Degrees of Separation: This is the finding that any two people in the world are
connected by an average of about six social connections or "hops". The instructor notes
that on modern platforms like Facebook, this distance has become even shorter,
estimated at around 3.5 hops.

2. Strength of Weak Ties: This concept suggests that acquaintances (weak ties)—people
you don't know well—are more likely to introduce you to new information and
opportunities compared to your close friends or family (strong ties).
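
The "hops" behind the six-degrees finding can be read as shortest-path lengths in a friendship graph. The sketch below is a minimal breadth-first search over a toy graph; the names and connections are hypothetical and are not data from the lecture.

```python
from collections import deque

def hops_between(graph, source, target):
    """Return the number of hops on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# Toy undirected friendship graph (hypothetical names).
friends = {
    "asha": ["bilal", "chen"],
    "bilal": ["asha", "dev"],
    "chen": ["asha"],
    "dev": ["bilal", "esha"],
    "esha": ["dev"],
    "farid": [],
}

print(hops_between(friends, "asha", "esha"))   # 3 (asha -> bilal -> dev -> esha)
print(hops_between(friends, "asha", "farid"))  # None (no connection)
```

Averaging this hop count over many pairs of users is one way to arrive at figures like "six" or "3.5".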

Prediction and Case Studies


A significant portion of the lecture is dedicated to the predictive power and real-world impact of
social media data:

 Data Characteristics (5 Vs): Social media is characterized by Velocity (speed of
generation), Variety (different types of content), Volume (sheer amount of data),
Veracity (difficulty in content verification), and Value (the ability to generate business
and predictive insights).

 Predictive Power: Social media data can be used to predict various events before they
happen, such as protests, stock market value changes, or health pandemics (like the flu).
The speaker references the movie Minority Report for the concept of predicting crime.

 Case Studies:

o Hudson River Plane Landing (2009): One of the first incidents where social
media was used to disseminate information during a crisis.

o Arab Spring/Delhi Gang Rape (2012): Instances where social media was used
effectively to mobilize large groups of people for protest and societal change.

o Capitol Riots (2021): Mentioned as a recent event where platforms like Parler
were used for organization, illustrating the rapid spike in conversation around an
event.

Recommended Resources

The instructor highly recommends watching two documentaries for better course appreciation:

 The Social Dilemma (Netflix): Focuses on how social media algorithms manipulate user
behavior and perception.

 The Great Hack: Discusses the Cambridge Analytica incident, showing how collected
data can be used to profile users and influence political affiliations and elections.

The lecture also introduces the concept of a Shadow Profile, where a social network creates a
profile and collects information about a person who is not yet on the network.

Video URL: [Link]

Overview of Social Media

The lecture categorizes social media into different types of services based on the content they
generate:

 Social Networks: (e.g., Facebook, Twitter, LinkedIn)

 Publish/Crowdsource: (e.g., Wikipedia)

 Social Games

 Virtual World Games

The speaker notes that platforms are characterized by their primary content type: YouTube for
video, Flickr for images, Foursquare for location, and LinkedIn for professional connections.

Different Types of Social Networks

The lecture explores the fundamental characteristics of several popular platforms:


Platform | Primary Function / Key Feature | Building Blocks / Terminology | Network Type
--- | --- | --- | ---
Facebook | Combination of content types (text, image, video). | Posts, Likes, Comments, Shares, Friends, Pages, Groups. | Bidirectional (Friendship requires mutual agreement).
Twitter | Micro-blogging (short content, max 140 characters at the time of the lecture). | Tweets, Retweets, Replies, Likes, Followers/Following, Mentions, Hashtags, Trends. | Unidirectional (Following does not require permission).
LinkedIn | Professional connections and job activity. | Connections. | N/A (Focused on professional profile and activity).
Foursquare | Location-based social network. | Check-in, Tips (for locations/restaurants). | N/A (Based on location data).
Pinterest | Image sharing, visually oriented. | Images, Boards. | N/A (Focused on visual content organization).
Periscope | Live streaming of videos in real-time. | Live video stream. | N/A (Focused on ephemeral, real-time video).
Tinder | Location-based matchmaking for connecting people nearby. | Left/Right Swipe. | N/A (Focused on relationship connections).
Whisper | Anonymous social network. | Anonymous posts. | N/A (Focused on anonymity).

The "Vs" of Social Media Data

The large-scale content generation on social media relates to the characteristics of Big Data,
summarized by the 5 Vs:

1. Velocity: The speed at which data is generated (e.g., 400 hours of video uploaded to
YouTube every 60 seconds).

2. Variety: The different types of content (text, image, video, location, etc.).

3. Veracity: The confirmation or legitimacy of the posted content, which is often hard to
verify.

4. Volume: The size or scale of the content being generated and stored.

5. Value: The utility of the data, which must be present to make analysis worthwhile.

Reality vs. Perception

The lecture concludes with a segment on the difference between reality and perception on
social media, using a short clip to illustrate how users curate their content to present a desired,
often exaggerated or false, image of their lives, such as pretending a bad presentation was great
or faking a morning run.

Video URL: [Link]

Key Incidents and Implications of Social Media

The lecture categorizes the impact of social media through various examples:

Positive Use (Crisis Management & Aid)

 Hudson River Plane Landing (2009): This was one of the first major incidents where social
media (specifically Twitter) was used for crisis management. A civilian's post alerted people
before first responders arrived.

 Finding Missing Persons: Social media has been effectively used to help locate lost
children, often through the quick sharing of photographs and tagging relevant
authorities.

 Disaster Relief: Platforms were used to organize aid and connect citizens during events
like the Nepal earthquake.

Negative Use (Misinformation, Manipulation, and Security)

 UK Riots (2011): Social media was used to propagate and organize unrest. Instead of
reporting on an incident, the platforms were used to coordinate it.

 Misinformation (Fake Content): Misinformation is a major concern:

o Fake News and Hoaxes: False claims, such as the one about a child dying in the
Boston Marathon bombing or a tweet promising a donation for every retweet,
are spread rapidly.

o Fake Images: Edited or out-of-context images, like a picture of a crocodile on the
street during the Chennai floods or a shark during Hurricane Sandy, are used to
spread panic or false information.

 Security Problems (Compromised Accounts): Even verified, legitimate accounts are
vulnerable. The Associated Press Twitter account was compromised, leading to a false
report of an explosion at the White House, which impacted public perception and
credibility.

 Privacy Implications: Personal information and pictures posted by users or those around
them can have severe real-world consequences, as seen when a military intelligence
chief lost his job because of his wife's social media posts.

 Job Loss/Company Security: Employees have been fired for excessive social media use
or for posting sensitive information about their work projects, which jeopardizes
company security.

Video URL: [Link]


Conclusion of Week 1

The lecture wraps up the first week of the course by reviewing the material covered:

 Social Media Fundamentals: The scale, content types, and basic building blocks of
platforms like Facebook, Twitter, and LinkedIn.

 The 5 Vs of Data: Volume, Velocity, Variety, Veracity, and Value.

 Incidents: Case studies of both positive and negative impacts of social media.

 Next Steps: Students were instructed to set up their environments (Linux and Python
tutorials were provided) to begin hands-on data collection and analysis in the coming
weeks.

The video, "Week 2.1 OSM APIs and tools for data collection", introduces the tools and
techniques necessary for programmatically collecting and analyzing data from Online Social
Media (OSM) platforms like Facebook and Twitter.

Key Topics and Frameworks

The lecture covers the technical aspects of data collection:

1. Application Programming Interfaces (APIs)

 An API allows a program to interact with social media services to collect data, essentially
creating a secure channel between the program and the platform.

 The course focuses on the Facebook and Twitter APIs.

 Rate Limits are a key constraint, as social media companies restrict the amount of data a
user can collect within a specific period.

2. Programming Language (Python)

 Python is highlighted as the programming language used for collecting and analyzing
data due to its popularity and extensive libraries for interacting with APIs, parsing data,
and handling JSON objects.

3. Data Format (JSON)

 When a request is sent to a social media API (like Facebook's Graph API), the data is
typically returned in JSON (JavaScript Object Notation) format.

 The lecture shows how a JSON object, containing information like a user's ID and name,
is structured. Tools like a JSON viewer can be used to visually inspect this structured
data.
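
As a rough illustration of handling such a response in Python, the sketch below parses a small JSON string with the standard json module. The payload is a hypothetical stand-in with only an ID and a name; real API responses carry many more fields.

```python
import json

# Hypothetical response body, shaped like the ID/name object described above.
raw = '{"id": "1234567890", "name": "Example User"}'

user = json.loads(raw)           # parse JSON text into a Python dict
print(user["id"], user["name"])

# Pretty-print for inspection, similar to what a JSON viewer shows.
pretty = json.dumps(user, indent=2, sort_keys=True)
print(pretty)
```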

4. Data Storage and Visualization

 Once collected, the data must be stored:

o MySQL: A relational database used to store data in rows and columns, allowing
simple queries to be run (e.g., SELECT user_ID, user_name).

o NoSQL/MongoDB: An alternative, non-relational database for storage.

 Tools for viewing the stored data include phpMyAdmin (for MySQL) and MongoVUE (for
MongoDB).
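
The relational storage step can be sketched as follows. To keep the example runnable without a database server, it uses SQLite from Python's standard library as a stand-in for MySQL; the table layout and sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical collected records: (user_ID, user_name).
users = [("101", "alice"), ("102", "bob")]

conn = sqlite3.connect(":memory:")   # in-memory database, nothing written to disk
conn.execute("CREATE TABLE users (user_ID TEXT PRIMARY KEY, user_name TEXT)")
conn.executemany("INSERT INTO users (user_ID, user_name) VALUES (?, ?)", users)
conn.commit()

# The kind of simple query mentioned above.
rows = conn.execute(
    "SELECT user_ID, user_name FROM users ORDER BY user_ID"
).fetchall()
print(rows)
conn.close()
```

With MySQL the SQL statements would look much the same; only the connection setup differs.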

5. Facebook's Graph Data Model

 Facebook's data is stored in a Graph format:

o Nodes: Represent objects such as users, friends, pictures, videos, and status
updates.

o Edges: Represent interactions or relationships, such as friendship, likes, and
comments.

 The Facebook API is therefore called the Graph API.
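
The node/edge idea can be made concrete with plain Python data structures. This is only a toy model of a graph, not Facebook's actual schema or API; all IDs and relationship names are hypothetical.

```python
# Nodes: objects keyed by a hypothetical ID, each with a type.
nodes = {
    "u1": {"type": "user", "name": "Alice"},
    "u2": {"type": "user", "name": "Bob"},
    "p1": {"type": "photo", "caption": "Sunset"},
}

# Edges: (source, relationship, target) triples.
edges = [
    ("u1", "friend", "u2"),
    ("u1", "posted", "p1"),
    ("u2", "likes", "p1"),
    ("u2", "commented", "p1"),
]

def neighbours(node_id, relationship):
    """Targets reachable from node_id over edges of one relationship type."""
    return [dst for src, rel, dst in edges if src == node_id and rel == relationship]

def who(relationship, target_id):
    """Sources that have the given relationship to target_id."""
    return [src for src, rel, dst in edges if rel == relationship and dst == target_id]

print(neighbours("u1", "friend"))   # ['u2']
print(who("likes", "p1"))           # ['u2']
```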

Video URL: [Link]

The video "Week-2.2 Trust and Credibility on OSM" explores the challenges of misinformation
on online social media (OSM) platforms like Twitter, particularly in the context of major
real-world events. The lecture focuses on methodologies for identifying and classifying content as
trustworthy or fake.

Key Concepts and Case Studies

1. The Problem of Rumors

 Analyzing the Boston Marathon bombing, the lecture demonstrates that rumors spread
significantly faster on Twitter than legitimate, true information.

 The key challenges are reducing the propagation of false information and ensuring true
information is posted as quickly as possible.

 Multiple examples of misinformation are provided, including fake tweets about the
Boston blast and the use of unverified or old images during Hurricane Sandy and other
events.

2. Methodology for Analysis

The general process for studying misinformation involves several steps:

1. Data Collection: Collecting data from platforms like Twitter (e.g., 1.7 million tweets
related to Hurricane Sandy).

2. Data Characterization: Understanding the volume, type, and source of the collected
data.

3. Ground Truth Generation: Annotating posts as true, false (rumor/fake), or general
content. This often involves human annotation by multiple people to ensure high
inter-annotator agreement (measured by Cronbach's Alpha).

4. Feature Extraction: Identifying characteristics that distinguish fake posts from real ones.

5. Model Evaluation: Using machine learning techniques like Naive Bayes or Decision
Trees to automatically classify posts.
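
For the agreement check in step 3, Cronbach's Alpha can be computed directly from the annotators' labels. The sketch below uses the standard formula, alpha = k/(k-1) * (1 - sum of per-annotator variances / variance of per-post totals), where k is the number of annotators. The labels (1 = fake, 0 = not fake) are made-up examples, not data from the study.

```python
from statistics import variance

def cronbach_alpha(ratings):
    """ratings: one list of scores per annotator, all of the same length."""
    k = len(ratings)
    if k < 2:
        raise ValueError("need at least two annotators")
    item_vars = sum(variance(r) for r in ratings)        # per-annotator variances
    totals = [sum(scores) for scores in zip(*ratings)]   # summed score per post
    return (k / (k - 1)) * (1 - item_vars / variance(totals))

# Three hypothetical annotators labelling six posts.
annotators = [
    [1, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 0],
]
alpha = cronbach_alpha(annotators)
print(round(alpha, 3))
```

Values closer to 1 indicate that the annotators label posts consistently; identical labels from every annotator give exactly 1.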

3. Features for Classification

The lecture divides the distinguishing characteristics into three broad categories:

 User Features (Source-based): Characteristics of the user profile, such as the number of
friends/followers, the follower-to-following ratio, how many lists the user is on, whether
the user is verified, and the age of the user account.

 Tweet Features (Message-based): Characteristics of the post itself, such as the length of
the tweet, the number of words, and the presence of question marks, exclamation
marks, or emoticons.

 Network Features: The user's connections and how their content diffuses through the
social network.

The analysis of Hurricane Sandy data showed that combining tweet features and user features
performed best in classification. The top 10 influencing features include: number of characters,
tweet word count, user location, number of retweets, and age of the tweet.
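
A minimal sketch of the feature-extraction step for the tweet and user categories follows. The field names, the emoticon list, and the sample tweet are assumptions made for illustration; the study's actual feature set is larger and is not reproduced here.

```python
def tweet_features(text):
    """Message-based features computed from the tweet text alone."""
    return {
        "num_chars": len(text),
        "num_words": len(text.split()),
        "num_question_marks": text.count("?"),
        "num_exclamation_marks": text.count("!"),
        "has_emoticon": any(e in text for e in (":)", ":(", ":D", ";)")),
    }

def user_features(user):
    """Source-based features from a user record (a plain dict in this sketch)."""
    following = user["following"]
    return {
        "followers": user["followers"],
        "follower_following_ratio": user["followers"] / following if following else 0.0,
        "listed_count": user["listed"],
        "is_verified": int(user["verified"]),
        "account_age_days": user["account_age_days"],
    }

# Hypothetical tweet and author.
tweet = "Is this real?! Shark swimming on the street :("
user = {"followers": 50, "following": 200, "listed": 0,
        "verified": False, "account_age_days": 3}

features = {**tweet_features(tweet), **user_features(user)}
print(features)
```

A dictionary like this, computed per tweet, is what a classifier such as Naive Bayes or a decision tree would take as input.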

4. Event Analysis Insights (Boston Marathon)

 Fake Accounts: New fake Twitter accounts were rapidly created around the time of the
event to propagate malicious content. Over 32,000 new accounts were created, with a
high percentage eventually suspended or deleted.

 Tweet Source: A higher percentage of fake posts were made through mobile devices
compared to true or general posts.

 Community: The users posting fake content were found to be closely connected,
suggesting a small, coordinated community of malicious actors.

5. TweetCred: A Real-Time Credibility Tool

The video presents TweetCred, a Chrome browser extension built on this research. It uses a
real-time model to calculate a credibility score (on a scale of 1 to 7) for tweets directly in the
user’s timeline. It also allows users to provide feedback to update and improve the underlying
classification model.

Video URL: [Link]

The video "Week 2 Reddit tutorial" provides a practical guide on how to collect data from the
Reddit social networking platform using the PRAW (Python Reddit API Wrapper) library.

Reddit Platform Overview

 Interactions: Reddit relies on upvotes and comments as its primary interaction patterns.

 Subreddits: The platform is organized into communities called subreddits (e.g.,
r/olympics), which focus on specific topics.

 Posts: A post contains a title, a score (upvotes), and a body, and records the user who
submitted it.

 Comments: Comments allow for multiple levels of interaction, forming nested
discussions.

 Flairs: Reddit uses flairs as a concept similar to hashtags, connecting a post to a specific
topic.

 Community Details: Each subreddit has a list of rules and moderators who enforce
them, as well as a list of related communities.

Data Collection using PRAW

The tutorial demonstrates the step-by-step process of setting up and using the PRAW Python
library to programmatically collect Reddit data:

1. Authentication Setup:

o To collect data, users must first sign up/log in to Reddit.

o They must then go to preferences > apps and create an app by selecting the
"script" option.

o This process generates the necessary Client ID (a short alphanumeric string, 14
characters in the tutorial) and Client Secret (long key) for authentication.

2. PRAW Configuration: The user authenticates the PRAW object by supplying the
following credentials:

o client_id

o client_secret

o password (Reddit password)

o user_agent (app name)

o username (Reddit username)

3. Collecting Posts:

o Posts can be collected from a specific subreddit (e.g., r/india) or from all
subreddits by setting the subreddit variable to "all".

o The PRAW object is used to retrieve posts, with options for limiting the number
(e.g., limit=100) and the sorting type (e.g., hot posts).

4. Data Structuring and Storage:

o The collected data points for each post include the title, subreddit, score, ID,
URL, number of comments, creation time, and body (selftext).

o The information is converted into a Pandas DataFrame for a structured and
efficient way to manage the data.

o It's advisable to save the collected data to a file, such as a CSV file, using the
Pandas to_csv() function.

5. Further Exploration: The documentation for PRAW ([Link]) allows users to
explore other functions, such as getting user-specific information (friends, karma points)
and collecting different data points like comments and submissions.
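
The steps above can be sketched roughly as follows. This is not the tutorial's own code: the helper names are hypothetical, the credentials are placeholders, and the CSV step uses Python's csv module instead of Pandas so the helpers run without extra dependencies. PRAW is imported only when a client is actually created.

```python
import csv

FIELDS = ["title", "subreddit", "score", "id", "url",
          "num_comments", "created_utc", "selftext"]

def make_reddit(client_id, client_secret, username, password, user_agent):
    """Build an authenticated PRAW client (needs `pip install praw`)."""
    import praw  # imported lazily so the other helpers work without PRAW
    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        username=username,
        password=password,
        user_agent=user_agent,
    )

def submission_to_row(submission):
    """Pick the data points listed above from one submission object."""
    row = {name: getattr(submission, name) for name in FIELDS}
    row["subreddit"] = str(row["subreddit"])  # Subreddit object -> its name
    return row

def collect_hot_posts(reddit, subreddit_name="india", limit=100):
    """Fetch up to `limit` hot posts from one subreddit (or "all")."""
    hot = reddit.subreddit(subreddit_name).hot(limit=limit)
    return [submission_to_row(s) for s in hot]

def save_rows(rows, path):
    """Write the collected rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

# Usage with real credentials (placeholders shown):
# reddit = make_reddit("CLIENT_ID", "CLIENT_SECRET", "USERNAME", "PASSWORD", "my-app-name")
# save_rows(collect_hot_posts(reddit, "india", limit=100), "india_hot.csv")
```

With Pandas installed, `pandas.DataFrame(rows).to_csv(path, index=False)` would follow the DataFrame route described in step 4.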

The video "Week 3.1 Misinformation on Social Media" discusses how the methodologies for
identifying misinformation developed for Twitter can be adapted and applied to other social
networks, focusing specifically on Facebook.

Transitioning from Twitter to Facebook

The core principles for detecting misinformation—such as data collection, feature extraction,
ground truth generation, and classification modeling—are carried over from Twitter. However,
the models must account for the structural differences between the platforms:

 Network Structure: Twitter is a unidirectional network (followers/followings), while
Facebook is a bi-directional network (friends).

 Trust Dynamics: Facebook connections are more personal. Users tend to believe a post
shared by a friend to be more truthful compared to a post from a random person on a
public platform. This difference necessitates adapting the features and the model to
weigh the influence of the source's credibility.

Facebook Inspector and Web of Trust

The video introduces a tool called Facebook Inspector, a browser plugin that functions similarly
to the earlier TweetCred tool.

Facebook Inspector

 Architecture: It uses a supervised learning model to take a post from the Facebook
Graph API, perform feature extraction, and compute a credibility score.

 Functionality: The plugin annotates posts directly in the user’s news feed with a visual
warning. For example, it may display a red mark to indicate a post is malicious or a
message that the confidence is low on the decision.

Web of Trust (WOT)

 A key feature integrated into the Facebook Inspector is the Web of Trust (WOT) score.

 WOT is an external service used to assess the credibility and safety of URLs or domains
that are shared in a post.

 It returns a rating (e.g., Excellent, Good, Satisfactory, Poor, or Very Poor) and a
confidence score, which is incorporated into the overall model to judge a post's
trustworthiness.

The Facebook Inspector is available as a browser extension for both Chrome and Firefox,
allowing users to get real-time feedback on the posts they view.

The video "Week 3.2 Privacy and Social Media" discusses the complex nature of privacy,
particularly in the context of online social media, and presents findings from a large-scale study
on privacy perceptions in India.

Defining Privacy

 Contextual Nature: Privacy is difficult to define because expectations are highly
dependent on the context (e.g., privacy at home vs. privacy in a public place).

 The Alan Westin Model: Professor Alan Westin's long-term research classified U.S.
citizens into three categories based on their privacy preferences:

o Fundamentalists: People who are unwilling to provide personal details and have
very strong privacy expectations (approx. 25% of the U.S. population).

o Pragmatists: People who make decisions about sharing information depending
on the situation and the value they receive in return.

o Unconcerned: People who do not care about privacy and may give away
personally identifiable information for minimal returns.

Privacy Perceptions in India

The lecture presents data from a large survey on privacy perceptions in India, covering over
10,000 respondents:

 Trust in Privacy Settings: When asked about the security of their personal information
on online social networks:

o 42% of respondents said their data is secure from a privacy breach because they
specified their privacy settings (highlighting a potentially misplaced confidence).

o 23% expressed concern about privacy even though they specified their settings.

 Accepting Friendship Requests: When asked which people they would add as friends on
their favorite social network (e.g., Facebook):

o 27% accepted a friend request simply because the person was of the opposite
gender.

o 10.12% accepted a request simply because the person had a nice profile picture.

o 3% would accept anyone.

Video URL: [Link]

The video "Week 3 Tutorial 3 1 Twitter API" provides a tutorial on how to collect data from
Twitter (now X) using its Application Programming Interface (API), specifically leveraging the
Python library Tweepy.

Twitter API Basics

 Twitter API: An interface provided by Twitter that lets third-party programs interact with
Twitter to perform tasks like searching, posting, or collecting data.

 Authentication Keys: To use the API, you need four keys for authentication:

o Consumer Key and Consumer Secret Key (for authenticating the
program/application).

o Access Token and Access Token Secret (for authenticating you as a user to the API).

 Developer Account: You must go to [Link] and apply for a developer
account to get these keys.

 Rate Limits: Twitter enforces rate limits to prevent misuse and ensure smooth
operation. These limits restrict the number of requests you can send within a given time.

Data Collection with Tweepy

The tutorial uses the Tweepy library to connect to the Twitter API in a Python environment.

1. The Search API

The Search API is used to collect historical data that is already present on the platform.

 Query Strings: You can search for tweets using keywords, hashtags, or user mentions.

 Filters: Results can be filtered by:

o Location (if the user provided location data).

o Language.

o Result Type (e.g., most popular or most recent tweets).

o Time Restriction (using since_id or max_id, which bound the results by tweet ID).

 Limitation: The Search API generally only works for tweets that are less than 7 days old.

 Information Collected: Each collected tweet is a JSON object containing extensive
information, including:

o ID and Date/Time created.

o The tweet's Text.

o Source (e.g., iPhone, Android, or desktop).

o User information (ID, Username, Location, follower/friend count).

o Interaction metrics (number of replies and retweets).
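
A rough sketch of a Search API collection is shown below. It assumes Tweepy 4.x names (OAuth1UserHandler, API.search_tweets, Cursor) and working v1.1 API credentials, neither of which is confirmed by the tutorial; older Tweepy versions use different names. The `keys` dictionary, the helper names, and the sample tweet are hypothetical. The flattening helper works on any tweet JSON dict and can be tried without network access.

```python
def flatten_tweet(tweet):
    """Reduce one tweet JSON object (a dict) to the fields listed above."""
    user = tweet.get("user", {})
    return {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "source": tweet.get("source"),
        "user_id": user.get("id"),
        "username": user.get("screen_name"),
        "user_location": user.get("location"),
        "followers": user.get("followers_count"),
        "friends": user.get("friends_count"),
        "retweets": tweet.get("retweet_count"),
    }

def search_recent(query, keys, limit=100, lang="en"):
    """Search recent tweets; assumes Tweepy 4.x and valid API credentials."""
    import tweepy  # imported lazily so flatten_tweet works without Tweepy
    auth = tweepy.OAuth1UserHandler(
        keys["consumer_key"], keys["consumer_secret"],
        keys["access_token"], keys["access_token_secret"],
    )
    api = tweepy.API(auth, wait_on_rate_limit=True)  # sleep when rate-limited
    cursor = tweepy.Cursor(api.search_tweets, q=query, lang=lang,
                           result_type="recent")
    return [flatten_tweet(status._json) for status in cursor.items(limit)]

# Hypothetical tweet object, trimmed to a few fields.
sample = {
    "id": 1,
    "created_at": "Mon Oct 29 20:00:00 +0000 2012",
    "text": "Stay safe everyone #sandy",
    "source": "Twitter for iPhone",
    "retweet_count": 4,
    "user": {"id": 42, "screen_name": "example_user", "location": "New York",
             "followers_count": 120, "friends_count": 80},
}
print(flatten_tweet(sample))
```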

2. The Streaming API

The Streaming API is used to collect data in real-time.

 Process: To use this, you must create a custom listener class with an on_status function
that defines what to do with a tweet the moment it is received (e.g., print the text or
save it to a file).

 Real-time Collection: The API starts filtering and delivering tweets to your listener class
as soon as they are posted, making it ideal for collecting live data.

 Filtering: You can filter the live stream by keywords, user IDs, or locations.
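
The listener idea can be illustrated without a live connection. The sketch below is library-independent: it defines a class with an on_status method and feeds it a few hand-made tweets through a hypothetical replay function that stands in for the stream. With the real Streaming API, the library would call on_status for each incoming tweet and the keyword filtering would happen on Twitter's side rather than locally.

```python
import io
import json

class KeywordListener:
    """Writes each matching tweet to `out` as one JSON object per line."""

    def __init__(self, out, keywords):
        self.out = out
        self.keywords = [k.lower() for k in keywords]
        self.saved = 0

    def on_status(self, status):
        # Called once per tweet; `status` is a plain dict in this sketch.
        text = status.get("text", "").lower()
        if any(k in text for k in self.keywords):
            self.out.write(json.dumps(status) + "\n")
            self.saved += 1

def replay(listener, tweets):
    """Stand-in for the live stream: hands tweets to the listener one by one."""
    for tweet in tweets:
        listener.on_status(tweet)

buffer = io.StringIO()   # an open file would work the same way
listener = KeywordListener(buffer, keywords=["flood", "#sandy"])
replay(listener, [
    {"id": 1, "text": "Flood water rising near the station"},
    {"id": 2, "text": "Good morning everyone"},
    {"id": 3, "text": "Stay safe #Sandy"},
])
print(listener.saved)    # 2
```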

The video "Week-4.1 Privacy and Pictures on Online Social Media" continues the discussion on
privacy by focusing on the serious implications of sharing images on social media and how
publicly available data can be used to re-identify individuals and infer sensitive information.

Defining Privacy and Control

The lecture reinforces the difficulty of defining privacy, noting that it is contextual and can have
contradictory dimensions. The foundational definition of privacy is based on control over
information:

 Control over Information: An individual's claim to "determine for themselves when, how,
and to what extent information about them is communicated to others."

 Balancing Act: Every individual constantly balances the desire for privacy with the
desire for disclosure/communication.

 Forms of Privacy: The lecture primarily focuses on information privacy (internet privacy)
but mentions communication privacy (telephones), territorial privacy (living space), and
bodily privacy (physical presence).

The Threat of Image Re-Identification

The core of the video highlights that four converging trends are making it easier to compromise
user privacy through images:

1. Increase in Self-Disclosure: Users are voluntarily uploading massive amounts of personal
photos, often with location data (e.g., uploading a selfie from a landmark).

2. Improving Face Recognition Accuracy: Technology, particularly deep learning models
built with frameworks like TensorFlow, is getting significantly better at identifying faces,
even when compared against a large user base (e.g., a social network's friend graph).

3. Cheaper Cloud Computing: The ability to store and compute large datasets of images
has become cheaper and more efficient.

4. Better Re-identification Techniques: Advanced techniques for de-anonymizing users are
constantly being developed.

Real-World De-anonymization Experiments

The video details three influential research experiments demonstrating how easily individuals can
be de-anonymized using public and offline data:

Experiment 1: Online-to-Online Re-identification


 Goal: To connect unidentified profiles from a popular dating website (where users use
pseudonyms) to identified profiles on Facebook.

 Method: Researchers used a face recognition tool to compare public Facebook images
(identified) against photos from dating site profiles (unidentified), focusing on a single
U.S. city.

 Finding: Approximately 10% of the dating site profiles were successfully re-identified
and linked back to the user's real Facebook profile, confirmed by crowdsourced workers.
This means 1 in 10 anonymous dating profiles could be mapped to a real identity.

Experiment 2: Offline-to-Online Re-identification

 Goal: To connect real-world, offline images to online social media profiles.

 Method:

1. Researchers took a picture of university students participating in a study on
campus (offline, unidentified image).

2. While the student filled out a survey, a system compared their photo against
25,000 public profile images scraped from the university's Facebook network.

3. The system then presented the student with a matching image and asked them
to confirm if it was their Facebook picture.

 Finding: For 38% of the participants, the photo taken on campus was successfully matched
to their correct Facebook profile photo.

Experiment 3: Sensitive Information Inference

 Goal: To determine if facial recognition could lead to the inference of sensitive private
data, like a U.S. Social Security Number (SSN).

 Finding: Researchers were able to correctly predict the first five digits of the SSN for
27% of the subjects by combining face images with public data.

The collective results show that individuals' faces, even when captured offline or on an
anonymous site, can be easily linked to their fully identified social media profiles, leading to the
potential exposure of sensitive information.

Video URL: [Link]

The video "Week 4 Tutorial Part 1 numpy" is a tutorial on using the NumPy library in Python,
which is essential for scientific computing, particularly for efficient handling of arrays.

💻 Working with Jupyter Notebook

The tutorial begins by showing how to use Jupyter Notebook, an interactive environment
well-suited for Python coding and documentation.

 Launching Jupyter: You can launch the notebook by typing jupyter notebook in your
terminal. This opens a new session in your default browser.

 Running Code: You can execute a cell by pressing Shift + Enter.

 Documentation: Jupyter Notebook allows you to mix code, output, and documentation
(using Markdown), making it great for demos and tutorials.

📊 NumPy Arrays

The primary focus is the NumPy (Numerical Python) library, which provides powerful
N-dimensional array objects and tools for working with them.

Array Creation

NumPy arrays are more efficient than Python lists for numerical operations.

 Importing: The library is typically imported as import numpy as np.

 From Lists: You can create an array from a standard Python list using [Link]().

 Data Type: Arrays can hold elements of a single type (e.g., integer or float). You can
explicitly set the data type using the dtype argument, for example, dtype='float64'. You
can check the array's data type using .dtype.

 Built-in Functions: NumPy offers functions to quickly create arrays:

o [Link](size): Creates an array of a given size initialized with zeros.

o [Link](size): Creates an array of a given size initialized with ones.

o [Link](start, stop, step): Similar to Python's range, it creates an array of
evenly spaced values within a given interval.

o [Link](start, stop, num): Returns a specified number (num) of evenly
spaced samples between a start and end point.

o [Link](size): Creates an array with random values.
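
A short sketch of array creation follows. The function names are standard NumPy calls chosen to match the descriptions above; since the identifiers were lost from the notes, treating them as the ones used in the tutorial is an assumption.

```python
import numpy as np

a = np.array([1, 2, 3])                    # from a Python list
b = np.array([1, 2, 3], dtype="float64")   # explicit data type
print(a.dtype, b.dtype)

zeros = np.zeros(4)               # four zeros
ones = np.ones((2, 3))            # 2x3 array of ones
evens = np.arange(0, 10, 2)       # 0, 2, 4, 6, 8 (stop is excluded)
grid = np.linspace(0.0, 1.0, 5)   # 0.0, 0.25, 0.5, 0.75, 1.0 (stop is included)
noise = np.random.random(3)       # three uniform samples in [0, 1)

print(evens)
print(grid)
```

Note the difference between the last two range-style helpers: arange takes a step size and excludes the stop value, while linspace takes a sample count and includes it.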

Array Properties and Indexing

 Shape and Dimension:

o .shape: Returns a tuple representing the size of each dimension (e.g., (3, 4) for a
3x4 matrix).

o .ndim: Returns the number of dimensions.

o .size: Returns the total number of elements in the array.

 Indexing: Accessing elements in an array.

o Single-Dimensional: Works like Python lists (array[2]).

o Negative Indexing: Access elements from the end (array[-1]).

o Multi-Dimensional: Accessed using comma-separated indices (matrix[row,
column]).

 Slicing: Selecting subsets of an array.

o Syntax: Uses the colon operator in the format [start:stop:step]. The element at
the stop index is excluded.

o Multi-Dimensional Slicing: You can slice across multiple axes, for example,
matrix[row_slice, column_slice]. Using a colon : alone selects all elements along
that dimension.

 Joining and Splitting: You can combine or divide arrays:

o Joining: Functions like [Link](), [Link]() (vertical stack), and
[Link]() (horizontal stack) are used to merge arrays.

o Splitting: Functions like [Link](), [Link](), and [Link]() divide an array into
multiple sub-arrays.
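
The properties, indexing, slicing, and joining/splitting operations above can be tried on a small matrix. The specific functions used here (reshape, vsplit, hsplit, vstack, hstack, concatenate) are standard NumPy calls picked for illustration and may not be exactly the ones shown in the tutorial.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)   # rows: [0..3], [4..7], [8..11]
print(m.shape, m.ndim, m.size)    # (3, 4) 2 12

print(m[1, 2])       # row 1, column 2 -> 6
print(m[-1, -1])     # last row, last column -> 11
print(m[:, 1])       # every row, column 1 -> [1 5 9]
print(m[0:2, ::2])   # rows 0-1, every second column

top, bottom = np.vsplit(m, [1])           # split after the first row
stacked = np.vstack([top, bottom])        # vertical stack restores m
left, right = np.hsplit(m, 2)             # two 3x2 halves
wide = np.hstack([left, right])           # horizontal stack restores m
joined = np.concatenate([m, m], axis=0)   # 6x4: m stacked on itself
```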

➕ Array Operations (Universal Functions)

NumPy's strength lies in its vectorized operations using Universal Functions (ufuncs), which
apply operations to every element of an array without the need for explicit Python loops,
resulting in significantly faster computation.

 Arithmetic Operations: Standard Python arithmetic operators (+, -, *, /) are overloaded
to perform element-wise operations:

o Array-Scalar: An operation with a scalar value is applied to every element in the
array (e.g., array * 2).

o Array-Array: Operations are performed element-by-element between two arrays
of the same shape.

 Mathematical Functions: NumPy provides functions for trigonometric (np.sin(),
np.cos()), exponential (np.exp()), and logarithmic (np.log()) operations, which are
applied element-wise.

 Statistical Operations: You can perform quick statistical analysis:

o np.sum(): Calculates the sum of all elements. Can be used along a specific axis
(rows or columns) to find marginal sums.

o np.min(), np.max(): Find the minimum and maximum values.

o np.mean(), np.std(): Calculate the mean and standard deviation.

o np.argmin(), np.argmax(): Return the index of the minimum and maximum
elements.
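The element-wise and statistical operations above can be sketched as follows. The 2x3 array of scores is made-up sample data.

```python
import numpy as np

scores = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

doubled = scores * 2                    # array-scalar: every element multiplied by 2
summed = scores + scores                # array-array: element-by-element addition
round_trip = np.log(np.exp(scores))     # ufuncs applied element-wise

total = np.sum(scores)                  # sum of all elements
col_sums = np.sum(scores, axis=0)       # one sum per column
row_sums = np.sum(scores, axis=1)       # one sum per row
mean, std = np.mean(scores), np.std(scores)
largest_at = np.argmax(scores)          # index into the flattened array

print(total, col_sums, row_sums, mean, std, largest_at)
```

With axis=0 the rows are collapsed (giving column sums), and with axis=1 the columns are collapsed (giving row sums).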


The video "Week 4 Tutorial Part 2 pandas and matplotlib" is a practical tutorial covering the
use of the Pandas and Matplotlib Python libraries for data analysis and visualization.

🐼 Pandas: Data Analysis and Manipulation

The Pandas library is introduced as a highly effective tool for working with tabular data. Its
primary structures are the Series (a one-dimensional labeled array) and the DataFrame (a two-
dimensional labeled structure, like a spreadsheet or SQL table).

Key Data Handling Functions

 Data I/O: The tutorial shows how to easily read data from common file types using
functions like pd.read_csv() and pd.read_excel().

 Inspection:

o df.head(): Used to quickly preview the first few rows of a DataFrame.

o df.describe(): Provides a statistical summary of the numerical columns in the
DataFrame (count, mean, standard deviation, min/max, quartiles).

o df.dtypes: Used to check the data type of each column.

 Type Conversion: The function pd.to_datetime() is highlighted for converting columns
from text strings into proper datetime objects, which is crucial for time-series analysis.

Data Grouping and Aggregation

One of Pandas' most powerful features is grouping data to perform aggregated calculations:

 df.groupby(): This function is used to split the data into groups based on unique values
in a specified column (e.g., grouping by 'country' or 'category').

 Aggregation: Once grouped, you can apply aggregation functions like .mean(), .sum(),
or .count() to calculate results for each group.

 .apply(): This allows you to apply a custom function to each group for more
complex operations.
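A minimal sketch of grouping and aggregation is shown below. The DataFrame, its column names ("country", "likes"), and its values are hypothetical sample data, not taken from the tutorial.

```python
import pandas as pd

# Hypothetical data: one row per post
df = pd.DataFrame({
    "country": ["IN", "IN", "US", "US", "US"],
    "likes": [10, 30, 5, 15, 40],
})

likes_by_country = df.groupby("country")["likes"]

mean_likes = likes_by_country.mean()     # average likes per country
post_counts = likes_by_country.count()   # number of posts per country

# Custom function per group: spread between the most and least liked post
like_range = likes_by_country.apply(lambda s: s.max() - s.min())

print(mean_likes, post_counts, like_range, sep="\n")
```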

Combining DataFrames

The video covers techniques for combining data from multiple sources:

 pd.concat(): Used to stack DataFrames either row-wise (adding more rows) or column-
wise (adding more columns).

 pd.merge(): Used to join DataFrames based on common key columns, similar to SQL
joins (e.g., 'inner', 'outer', 'left', 'right').

Handling Missing Data

 df.dropna(): Removes rows that contain missing (NaN) values.

 df.fillna(): Replaces missing values with a specified value, such as the mean or median of
the column.
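The combining and missing-data functions can be sketched together, since a left join naturally produces missing values. The two tables and their columns ("user_id", "name", "likes") are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing the key column "user_id"
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["asha", "ben", "chen"]})
posts = pd.DataFrame({"user_id": [1, 1, 3], "likes": [10.0, None, 7.0]})

# Left join: every user is kept, users without posts get NaN in "likes"
merged = pd.merge(users, posts, on="user_id", how="left")

dropped = merged.dropna()                                   # keep only complete rows
filled = merged.fillna({"likes": merged["likes"].mean()})   # replace NaN with the column mean

# Row-wise stacking of two DataFrames with the same columns
stacked = pd.concat([users, users], ignore_index=True)

print(merged, dropped, filled, stacked, sep="\n\n")
```

Whether to drop or fill depends on the analysis: dropping loses rows, while filling with the mean keeps them at the cost of flattening the distribution.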

📈 Matplotlib: Data Visualization

Matplotlib is the foundational plotting library in Python, typically used through its pyplot
interface (import matplotlib.pyplot as plt).

Visualization Basics

 Basic Plot: A simple line plot is created using plt.plot(x, y).

 Customization:

o plt.xlabel() and plt.ylabel() are used to label the axes.

o plt.title() is used to set the plot's title.

o plt.legend() is used to show labels for multiple data series.

 Subplots: The tutorial touches on creating multiple plots within a single figure using
subplots.

Plot Types

Matplotlib supports various chart types for different data representations:

 Scatter Plots: plt.scatter()

 Bar Charts: plt.bar()

 Histograms: plt.hist() (useful for showing the distribution of a single variable)
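A minimal sketch of a labeled line plot next to a histogram, using subplots, is given below. The data is made up, and two choices are assumptions rather than the tutorial's code: the non-interactive "Agg" backend (so the script runs without a display) and the Axes methods such as set_xlabel(), which are the per-subplot equivalents of plt.xlabel() and friends.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# One figure with two side-by-side subplots
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

ax_line.plot(x, y, label="y = x^2")   # line plot
ax_line.set_xlabel("x")
ax_line.set_ylabel("y")
ax_line.set_title("Line plot")
ax_line.legend()

ax_hist.hist(y, bins=4)               # distribution of a single variable
ax_hist.set_title("Histogram of y")

# Write the figure to an in-memory PNG; a file path would work the same way
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```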

The video "Week-5.1 Policing and Online Social Media" is a lecture from a course on Privacy
and Security in online social media. It covers the privacy implications of location-based social
networks and the increasing adoption of social media by police organizations, particularly in the
Indian context.

🔒 Privacy Implications of Location Data


The video first concludes the previous week's discussion on privacy by focusing on location
data:

 Inferring Home Location: Researchers have shown that information from location-based
social networks like Foursquare can be used to infer a user's home location with high
confidence, primarily based on their check-ins and mayor status.

 Mobility: A key finding in this research is that people's mobility is often limited and
predictable, making location inference easier.

👮 Policing and Online Social Media

The main topic of the lecture focuses on how police organizations around the world, and
specifically in India, are utilizing social media.

Early Crisis Response

 The first major instance of social media being used for real-world crisis response was
during the 2009 US Airways plane landing on the Hudson River. A person on the
riverside posted a picture and a tweet about the event before first responders arrived,
marking a shift from social media being purely for personal updates.

Global and Local Adoption

 NYPD Campaign: The New York Police Department (NYPD) ran a campaign using the
hashtag #myNYPD, inviting the public to share photos with officers. While intended to
build rapport, the campaign was quickly repurposed by users to share photos of alleged
police misconduct, demonstrating the unpredictable nature of public interaction on
social platforms.

 Adoption in India: In India, police organizations like the Bangalore City Police, Delhi
Traffic Police, and Hyderabad City Police have widely adopted verified pages on
platforms like Facebook and Twitter.

 Typical Use Cases: These handles are used for:

o Keeping citizens informed about decisions and activities (e.g., traffic
synchronization).

o Posting about cash rewards for public help.

o General interaction and appreciation of police officers.

⚠️ Challenges: Fake Accounts and Verification

A significant problem highlighted is the proliferation of fake handles and accounts that mimic
legitimate police organizations.

 Masquerading: These fake accounts often use the same profile pictures and names as
the official pages, making it very difficult for citizens to determine which account is
legitimate.

 Need for Verification: The speaker emphasizes that the blue verification tick (a verified
account) is crucial for police organizations on platforms like Twitter, as it is the only way
to confirm their authenticity to the public.

 Research Efforts: The lecture mentions research efforts to create a comprehensive list of
genuine police social media handles across India and to collect data from them to
analyze their activity (likes, comments, post timings, and trends).
The video "Week-5.2 Policing and Online Social Media" discusses a research study on how
police organizations, specifically the Bangalore City Police (BCP) in India, can use online social
media data to gather actionable information and understand public opinion about their
activities.

🎯 Research Objective

The study aimed to determine if online social media could help the police obtain:

1. Actionable Information about crime and local issues (e.g., traffic problems, car
breakdowns, potholes).

2. Residents' Opinions about policing activities in urban Indian cities.

📊 Methodology and Data Analysis

The researchers collected and analyzed posts and comments from the BCP's public Facebook
page.

Data Collected

 Posts: 255 posts made by the Bangalore City Police.

 Comments: Approximately 1,600 public comments and replies.

Analysis Categories

The analysis focused on three main aspects of the communication:

1. Content: What the public was talking about (e.g., misinformation, traffic details,
neighborhood concerns).

2. Style: The tone of the communication (formal vs. informal).

3. Police Response Types: How the police responded (e.g., acknowledge, reply, follow up,
ignore).

Key Findings

 Public Concerns: The majority of posts from citizens focused on neighborhood
concerns, followed by appreciation for the police's efforts.

 Engagement: Posts related to satisfaction, appreciation, and success stories received a
higher number of likes compared to other types of posts, showing the content the
public engages with most.

 Actionable Information: Citizens often provided posts with temporally and
geographically specific details (e.g., exact time, location) which the police could use to
take action.

 Communication Style: The police's responses were generally formal.

 Police Accountability: The platform creates mutual accountability because:

o Police are publicly responding to citizens' posts, making them accountable for the
issues raised.

o Citizens express their concerns and expectations (wants and needs), making their
role in city safety more public.

 Response Time: There was a large variance in police response times, ranging from a
minimum of 4 minutes to a maximum of 211 hours.

💡 Practical Applications
The study emphasizes that analyzing the textual content through techniques like word trees
(e.g., analyzing citizen posts starting with words like "worried," "why," or "need") can help
police:

 Understand Public Needs and Wants: Identify common fears, anxieties, and
expectations.

 Improve Communicative Policing: Increase interaction with the community, thereby
enhancing safety and productivity.

The video "Week 5.3 Policing and Online Social Media" continues the analysis of police-citizen
communication on social media by examining attributes of the interactions across different
police departments.

❓ Research Questions

The study explored the feasibility of using social media content to quantify communication
attributes and identify behavioral attributes such as emotional expression, engagement, and
cognitive response processes. The specific questions focused on:

 Topical Characteristics: The nature of content and topics discussed.

 Engagement Characteristics: How citizens and police engage in discussions.

 Emotional Exchanges: The nature of emotions and affective expression (e.g., positive,
negative sentiment).

 Cognitive and Social Orientation: The linguistic attributes characterizing the response
process.

📊 Key Findings on Interactions

The analysis used a large dataset of wall posts and status updates from 85 police organization
Facebook pages in India.

1. Topical Characteristics

 Police-Initiated Posts: Police focus on official content like advisories, case updates,
rules, and safety violations. These discussions are generally more focused and narrow.

 Citizen-Initiated Posts: Citizens' posts are more diverse and generally request the police
to take action (e.g., "please take action") or discuss issues like neighborhood problems,
missing people, and appreciation.

2. Engagement Characteristics

 Engagement Rate: Posts initiated by the police receive more comments and more likes
on average than posts initiated by citizens.

 Discussion Closure: When a citizen posts, and the police quickly respond and interact
(Police and Citizen comment), the discussion tends to close earlier. This is believed to
be because the police's response satisfies the initial query, leading to a lower average
number of comments in police-involved threads compared to citizen-only discussions.

3. Emotional Exchanges

 Negative Sentiment: Citizen-initiated threads in which both police and citizens
comment (CPC) show higher negative affect, anger, and arousal. This is
attributed to citizens using the platform to strongly express their views and push the
police to resolve crime or neighborhood problems.
 Reduced Anxiety: Conversely, when the police engage in a discussion, the level of
anxiety decreases for the citizens. A quick response from the police acts as a form of
emotional support, reducing the residents' anxiousness about a safety issue.

4. Cognitive and Social Orientation

 Self-Focus: Most citizen-initiated posts are self-driven, using words like "I" frequently.
This indicates that citizens are primarily posting about issues directly affecting them or
their own experiences.

🚀 Societal and Technological Implications

The study concludes that using social media data can significantly help improve policing by:

 Community Sensing: Collecting data to understand behavioral attributes, emotions, and
social support needs.

 Early Warning System: The data can be used for predictive analytics to anticipate safety
issues and understand changes in public sentiment over time.

 Enhanced Emotional Support: Police interaction reduces citizen anxiety, fostering a
stronger, more supportive relationship with the community.

The video "Week 6.1: eCrime on Online Social Media" introduces the topic of e-crime
(cybercrime) specifically within the context of social media and outlines several common types
of scams and malicious activities.

💻 eCrime on Social Media

The main part of the lecture focuses on various forms of crime and scams prevalent on online
social platforms.

Types of Scams and Crimes Discussed:

 Phishing: The act of tricking users into revealing their login details or credentials. This
traditional email-based scam has spread to social media, using links on platforms like
Twitter or Facebook to direct users to fake websites that mimic legitimate services.

o Examples include notifications about Facebook technical support issues or new
login systems.

 Fake Customer Service Accounts: Scammers create accounts that closely resemble those
of real organizations (like banks) and reply to real customer posts (e.g., on Twitter) about
a problem. They then direct the user to a fake, "secure" sign-on channel to steal
credentials.

 Fake Comments on Popular Posts: Scammers pretend to be legitimate users and post
malicious links in the comment sections of popular, trending content (like posts about
the Olympics or a major event) to maximize visibility and lure users.

 Fake Live Streaming Videos: This scam lures users, especially during major events (like
sports matches), by posting a link claiming to offer a live stream. Clicking the link takes
the user to a fake page to potentially steal information or install malware.

 Fake Online Discounts & Contests/Surveys:

o Discounts: Fake accounts that look like real businesses offer bogus discounts
(e.g., 10% off Netflix) to trick users into providing personal or financial details.
o Contests/Surveys: Scams that promise money or prizes for filling out surveys,
often leading to personal information theft.

 Fake Tips (Location-Based Social Networks): On platforms like Foursquare, scammers
post "tips" at specific locations that contain irrelevant links, often for advertisements or
malicious sites.

 Social Reputation Manipulation: The lecture discusses how social influence (e.g.,
number of likes, followers, endorsements, and product reviews) is manipulated through
fake accounts, followers, and reviews (e.g., on Amazon or Flipkart) to influence public
perception.

 Click-Baiting: Using sensational or misleading headlines/images to get a user to click a
link, which directs them to a fake or malicious website.

 Hashtag Hijacking: Using a popular or trending hashtag (like a major news event) that is
irrelevant to a product or promotion to gain exposure for commercial or malicious
purposes.

 Compromised Accounts: Gaining unauthorized access to a legitimate account (e.g., a
news agency) to post false, often high-impact, information (like a fake breaking news
alert).

 Impersonation: Creating a fake profile of an individual or organization, often using their
publicly available details and pictures, to deceive others.

 Work-from-Home Scams: Posting lucrative, often too-good-to-be-true, job offers (e.g.,
filling out surveys for high pay) to collect personal information.

The lecture concludes by emphasizing the importance of studying these crimes to develop
technological solutions that can help police and citizens interact better and make the
community safer.

The video "Week 6.2 eCrime on Online Social Media" focuses on link farming within the
context of Twitter and analyzes the presence of spammers based on their influence
(PageRank/indegree).

🌐 Link Farming on Twitter

Link farming, traditionally a technique on the web to artificially boost a website's PageRank by
exchanging non-benign reciprocal links, has a similar parallel on Twitter:

 Goal: To increase a user's indegree (number of followers) to make their tweets more
likely to appear high in search results, thus increasing their social reputation and
influence (which is sometimes measured by scores like Klout).

 Mechanism: Spammers and even some legitimate/popular users follow other accounts,
betting on the high reciprocity rate—the probability that the followed user will follow
them back—to quickly inflate their follower count.

 Similarity to Web Link Farming: Both methods are used to spam the index of a search
engine (Google or Twitter Search) by artificially boosting the link structure
(indegree/PageRank).

📊 Research on Twitter Spam

The lecture cites several research results to provide context on the scale and nature of spam on
Twitter:
1. Persistence of Spam: One study found that five spam campaigns, controlling 145,000
accounts, were able to persist for months at a time.

2. Malicious URLs: Another paper found that 8% of 25 million URLs posted on Twitter
pointed to malicious content like phishing, malware, and scams.

3. High Click-Through Rate: Twitter is a highly successful platform for coercing users to visit
spam pages, with a click-through rate of 0.13%, which is significantly higher than rates
previously reported for email spam. This is likely because the malicious content often
comes from a seemingly trustworthy source (a friend or verified account).

4. Automation: Up to 16% of active accounts were found to exhibit a high degree of
automation. Simple detection techniques include analyzing the unusually high frequency
of posts, which is not typical for human users.

📈 Spammers and Influence

The lecture presents an analysis of a 2009 dataset with 54 million users to study the link
farming problem, using node rank (PageRank, based on indegree) as a measure of influence:

 Spam Targets (accounts receiving spam) were about 30 million.

 Spam Followers (accounts used to follow spammers) were about 248,000.

 Finding: The analysis revealed that spammers themselves have surprisingly high
influence:

o 7 spammers were found within the top 10,000 users ranked by PageRank.

o 2,131 spammers were found within the top 1 million users.

This conclusion highlights that spammers successfully create accounts with high indegree (many
followers) through link farming, meaning they are influential enough to have their malicious
content appear high in search results.
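To make the link between in-links and rank concrete, here is a small pure-Python sketch of PageRank by power iteration on a toy follower graph. It is an illustration under stated assumptions, not the study's code: the account names, the damping factor of 0.85, and the fixed iteration count are all hypothetical choices. An edge (a, b) means "a follows b".

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (follower, followed) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Toy graph: three users follow "spam"; "spam" follows one of them back
follows = [("a", "spam"), ("b", "spam"), ("c", "spam"), ("spam", "a"), ("c", "b")]
ranks = pagerank(follows)
top = max(ranks, key=ranks.get)

for user, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{user}: {score:.3f}")
```

In this toy graph the account that collects the most in-links ends up with the highest rank, which is the effect link farming exploits.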

The video "Tutorial 4 Social Network Analysis" explains how to represent social media data as a
graph and introduces the fundamental concepts and metrics of Social Network Analysis (SNA).

📊 Representing Social Media Data as a Graph

A graph is a data structure consisting of nodes and edges:

 Nodes (Vertices): Represent entities in a social network, such as users, pages, or groups.

 Edges (Links): Define the relationships between nodes.

o A directed edge from user A to user B can mean "A follows B" (e.g., on Twitter).

o An edge between user A and a page can mean "A likes the page."

Graph Representation Methods

The video discusses three common ways to represent a node-edge graph for a computer:

1. Adjacency Matrix:

o A two-dimensional square matrix where the size equals the number of nodes.

o A cell at row I and column J is 1 if an edge exists between node I and node J;
otherwise, it's 0.
o Drawback: Can be sparse and space-consuming for graphs with many nodes and
few edges.

2. GraphML Format:

o An XML file format for graphs.

o It contains sequences of node and edge elements.

o Each node has a distinct ID, and each edge has source and target attributes
identifying the connection's endpoints.

3. CSV Files (mentioned, but not detailed).
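The first two representations can be sketched with the standard library alone. The three-node graph is made-up sample data, and the GraphML output is deliberately minimal (nodes and edges only, no attribute keys), so treat it as an illustration of the structure rather than a complete exporter.

```python
import xml.etree.ElementTree as ET

nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C")]   # directed: A follows B, B follows C

# Adjacency matrix: cell [i][j] is 1 if there is an edge from node i to node j
index = {name: i for i, name in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for src, dst in edges:
    matrix[index[src]][index[dst]] = 1

# GraphML: an XML document with one <node> per node and one <edge> per edge
graphml = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
graph = ET.SubElement(graphml, "graph", id="G", edgedefault="directed")
for name in nodes:
    ET.SubElement(graph, "node", id=name)
for src, dst in edges:
    ET.SubElement(graph, "edge", source=src, target=dst)
xml_text = ET.tostring(graphml, encoding="unicode")

print(matrix)
print(xml_text)
```

The matrix stores a cell for every possible pair of nodes, while the GraphML file stores only the edges that exist, which is why the matrix becomes wasteful for large, sparse networks.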

Collecting Twitter Data (Demo)

The tutorial demonstrates using a command-line tool called twecoll to collect a Twitter follower
network graph in the GraphML format:

1. Initialization (twecoll init): Authorize a Twitter application using the consumer key,
consumer secret, and a verification PIN.

2. Fetch (twecoll fetch): Get the list of followers of your followers (friends-of-friends
information).

3. Edge List (twecoll edgelist): Generate the edges (relationships) between your followers
and their followers, resulting in a GML file.

This GML file is then used with a visualization tool.

🔍 Social Network Analysis (SNA) Metrics

SNA metrics help analyze the structure and importance of nodes within the network:

1. Degree

In a directed graph, the degree is split into two components:

 In-degree: The number of edges entering a node.

o In a Twitter follower graph, this signifies the number of followers a user has.

 Out-degree: The number of edges leaving a node.

o In a Twitter follower graph, this signifies the number of users a user is following.

 Total Degree: The sum of the in-degree and out-degree.
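A minimal sketch of the degree calculations on a toy follower graph follows. The edge list is hypothetical, and an edge (a, b) is read as "a follows b".

```python
from collections import Counter

# Hypothetical follower edges: (follower, followed)
follows = [("a", "b"), ("c", "b"), ("b", "a"), ("d", "b")]

in_degree = Counter(dst for _, dst in follows)    # followers per user
out_degree = Counter(src for src, _ in follows)   # followings per user

users = sorted(set(in_degree) | set(out_degree))
total_degree = {user: in_degree[user] + out_degree[user] for user in users}

# In-degree centrality in its simplest form: the user with the most followers
most_followed = max(users, key=lambda user: in_degree[user])

print(dict(in_degree), dict(out_degree), total_degree, most_followed)
```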

2. Centrality

Centrality helps find the most important or central node in the graph:

 In-degree Centrality: Finds the node with the highest in-degree (e.g., the most
influential node or the user with the most followers).

 Out-degree Centrality: Helps locate the node with the highest out-degree.

 Betweenness Centrality: The number of shortest paths between all other pairs of
vertices that pass through the specific node. It measures how often a node acts as a
bridge or intermediary.

 Closeness Centrality: Helps find the node with the lowest total distance from all other
nodes. A node with high closeness can reach other nodes faster.
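Betweenness and closeness can be sketched in pure Python with breadth-first search. The assumptions here are an undirected, connected, unweighted toy graph (a five-node chain), closeness defined as (n - 1) divided by the sum of distances, and betweenness counted as the fraction of shortest paths between each pair that pass through the node. This brute-force version is only meant for small graphs.

```python
from collections import deque


def bfs(graph, source):
    """Return (distances, shortest-path counts) from source to every reachable node."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                count[neighbor] = 0
                queue.append(neighbor)
            if dist[neighbor] == dist[node] + 1:
                count[neighbor] += count[node]
    return dist, count


# Toy chain: a - b - c - d - e
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
nodes = sorted(graph)
info = {node: bfs(graph, node) for node in nodes}

closeness = {node: (len(nodes) - 1) / sum(info[node][0].values()) for node in nodes}

betweenness = {node: 0.0 for node in nodes}
for i, s in enumerate(nodes):
    for t in nodes[i + 1:]:
        for v in nodes:
            if v in (s, t):
                continue
            # v lies on a shortest s-t path when the distances add up exactly
            if info[s][0][v] + info[v][0][t] == info[s][0][t]:
                betweenness[v] += info[s][1][v] * info[v][1][t] / info[s][1][t]

print(closeness)
print(betweenness)
```

In the chain, the middle node c has both the highest closeness and the highest betweenness, matching the intuition that it is the bridge between the two ends.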

3. Community Detection
 Community: A group of similar or strongly connected nodes.

 Modularity: A measure used to define the strength of a community, representing the
fraction of edges that fall within the given group compared to those connecting to other
groups.
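As a rough illustration, the sketch below computes modularity for a given partition of an undirected graph. It assumes the standard Newman formulation, in which each community contributes its fraction of internal edges minus the fraction expected if edges were placed at random with the same degrees; the notes above describe the measure more loosely. The graph (two triangles joined by one edge) and the community labels are made-up sample data.

```python
def modularity(edges, community):
    """Modularity of a partition: sum over communities of L_c/m - (d_c / 2m)^2."""
    m = len(edges)
    internal_edges = {}   # L_c: edges with both endpoints in community c
    degree_sum = {}       # d_c: total degree of the nodes in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1
    return sum(
        internal_edges.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles (1-2-3 and 4-5-6) connected by the single edge 3-4
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "x", 2: "x", 3: "x", 4: "y", 5: "y", 6: "y"}
one_group = {node: "all" for node in two_groups}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
print(q_split, q_single)
```

Putting every node into one community gives a modularity of zero, while splitting the graph along the bridge edge gives a clearly positive score.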

📈 Graph Visualization (Gephi)

The free and open-source tool Gephi is used to visualize and analyze the collected graph data:

 It allows for the generation of network statistics (e.g., average degree, network
diameter).

 It can run the Modularity algorithm to identify communities.

 It supports customization of node appearance (color and size) based on SNA metrics like
In-degree or Modularity Class to highlight important nodes or communities visually.

 It includes a Data Laboratory tab to browse the raw node and edge data.

The video "Week-7.1: Link Farming in Online Social Media" continues the discussion on e-
crime by delving deeper into the characteristics and behavior of link farmers on platforms like
Twitter, often blurring the lines between malicious and legitimate activity.

🔄 Reciprocity and Link Farmer Behavior

The research analyzed in the video highlights the mechanics and success factors of link farming:

 High Reciprocity: Spammers rely heavily on the reciprocity principle: the high
probability that a user will follow back if a spammer follows them first.

 Targeting Responsive Users:

o The top-ranked spam followers (accounts that tend to follow spammers back)
are responsible for approximately 60% of all in-links acquired by spammers.

o Counterintuitively, users with low in-degree (fewer than a thousand followers)
are less likely to reciprocate. Responsiveness to follow requests increases with
the number of followers (in-degree) a user has.
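Reciprocity can be measured directly from a follow graph. The sketch below uses a hypothetical edge set, where (a, b) means "a follows b", and computes the share of accounts followed by a given user that follow back.

```python
# Hypothetical follow edges: (follower, followed)
follows = {
    ("spam", "u1"), ("spam", "u2"), ("spam", "u3"),
    ("u1", "spam"), ("u3", "spam"),
    ("u2", "u1"),
}


def reciprocity(follows, user):
    """Fraction of the accounts followed by `user` that follow `user` back."""
    followed = [dst for src, dst in follows if src == user]
    if not followed:
        return 0.0
    followed_back = sum((dst, user) in follows for dst in followed)
    return followed_back / len(followed)


spam_rate = reciprocity(follows, "spam")
u2_rate = reciprocity(follows, "u2")
print(f"spam: {spam_rate:.2f}, u2: {u2_rate:.2f}")
```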

👤 Characteristics of Top Link Farmers

The analysis found that many accounts engaging in link farming were not traditional, malicious
spammers:

 High Proportion of Real Accounts: Out of a sample of top link farmers, 76% of accounts
were still active (not suspended). A small portion included verified accounts (blue tick),
confirming they belonged to legitimate individuals.

 Thematic Focus: The most common bio topics among these real link-farming users
included business, internet marketing, entrepreneurship, and social media, suggesting
their goal is to maximize visibility and self-promotion.

 Surprising Participants: The study found that link farming behavior was exhibited by
legitimate, popular, and highly active users, including bloggers, experts, and even
celebrities (e.g., Britney Spears, Obama).

📈 Network Metrics Analysis

By comparing top link farmers to traditional spammers and a random user sample, the
following network characteristics were observed:
| Metric | Top Link Farmers | Comparison to Spammers/Random Sample | Insight |
|---|---|---|---|
| In-degree (Followers) | Very High | Significantly higher | Link farmers successfully inflate their influence. |
| Out-degree (Followings) | Very High | Significantly higher | Link farmers are highly active in following others. |
| In-degree/Out-degree Ratio | Close to One (Followers ≈ Followings) | Closer to one than spammers | This ratio mimics that of many active, legitimate users, making link farmers difficult to detect using this measure alone. |

💡 Conclusion on Link Farming

The overall conclusion is that link farming is an effective mechanism for increasing an account's
social capital and influence. The fact that seemingly legitimate and popular accounts engage in
this behavior complicates efforts to detect and eliminate the activity, as it benefits real users
seeking to boost their social standing.

This video, "Week-7.2: Nudges," discusses the problem of users not reading privacy policies and
explores the use of nudges—small informational interventions—to help people make more
informed and less regrettable decisions about their online disclosures, particularly on social
media.

📑 The Problem with Privacy Policies

 Low Reading Rate: Research shows that users rarely read privacy policies.

 Economic Cost: One study estimated that if every U.S. citizen read the privacy policy for
each website they visited once a month, the national opportunity cost would be $781
billion annually (equivalent to 244 hours per year per person). This cost justifies the
need for simpler, more immediate decision-support tools.

💡 Nudges for Informed Disclosure

The core goal of using nudges is to help individuals avoid regrettable online disclosures by
providing contextual information at the point of action.

Email Nudge Example

An early example from MIT was designed to prevent users from sending emails to the wrong
mailing list. When composing an email, the tool would:

 Show profile pictures of the people who would receive the email.

 Visually highlight the profile picture of the person the user had interacted with most
frequently in the past.

Facebook Nudge Experiments


A study built a Chrome browser extension for Facebook to test the effectiveness of three types
of nudges when a user was about to make a post:

| Nudge Type | Description | Observed Effect |
|---|---|---|
| Picture Nudge | Temporarily stops the user and displays profile pictures of the friends/public who can see the post. | Caused participants to change their inline privacy settings (e.g., from "Public" to "Friends"), or cancel the post altogether, due to realizing the wide visibility. |
| Timer Nudge | Displays a countdown (e.g., 10 seconds) before the post is finalized, giving the user a chance to cancel. | Participants found it "annoying and handy"; it forced them to reflect on the post before publishing. |
| Sentiment Nudge | Analyzes the text and informs the user if the post carries a negative sentiment (e.g., "Other people may perceive your post as negative"). | Participants canceled posts (especially negative ones), and overall post frequency was reduced. However, the accuracy of the sentiment analysis was a concern for users. |

🎯 Conclusion

The research demonstrated that these small interventions are helpful for users in making better
decisions about sharing information, especially when it comes to controlling the audience and
the tone of their posts. However, the video concludes that more work is needed to understand
which types of nudges are most effective in specific contexts without becoming overly intrusive
or annoying to the user.

The video "Week-7.3: Semantic attacks: Spear phishing" discusses semantic security attacks,
focusing on phishing, particularly spear phishing, which leverages personal information from
social media to increase attack success.

🧠 Semantic Attacks and the Semantic Barrier

The video introduces the concept of Semantic Attacks, a category of security threats targeting
human perception and meaning.

 Semantic Attacks target the way humans assign meaning to content. Unlike physical
attacks (accessing a machine) or syntactic attacks (buffer overflows), semantic attacks
exploit the mental model of the user.

 Semantic Barrier: This is the difference between:

o The User's Mental Model: What the user thinks is happening (e.g., who is
sending the message, the meaning of the content).

o The System Model: What the technical system thinks the user is doing (e.g.,
which remote machine is being accessed, the URL).

o A larger semantic barrier makes it easier for attackers to deceive users.


🎣 Phishing and Social Phishing

Phishing is a common semantic attack where attackers attempt to acquire sensitive information
(like usernames, passwords, and credit card details) by disguising themselves as a trustworthy
entity in an electronic communication.

Types of Phishing:

| Term | Target/Method |
|---|---|
| Phishing | Classical attack, often through bulk email (e.g., urgent notification from eBay). |
| Spear Phishing | Highly targeted attack, often using personal information to craft a convincing message. |
| Whaling | Targeted attack specifically aimed at Chief Executive Officers (CEOs) or high-level executives. |
| Vishing | Phishing conducted over voice (phone calls). |
| Smishing | Phishing conducted over SMS (text messages). |

🔍 Social Phishing (Spear Phishing via Social Media)

Social phishing refers to conducting a phishing attack by first collecting personal information
from social networks to make the phishing attempt more credible.

The video details a classical study conducted at Indiana University to test the effectiveness of
social phishing:

1. Data Harvesting: Researchers collected publicly available personal information (birth
dates, interests, friends) from blogging and social networking sites related to Indiana
University students.

2. Email Crafting: Spoofed emails were crafted, either from an unknown person at the
university (Control Group) or from an alleged friend (Experimental/Social Condition),
using details gleaned from the social data to appear highly relevant and trustworthy.

3. Authentication Attempt: When users clicked the link, they were directed to a fake login
page that checked their credentials against the university authenticator and returned a
non-committal error ("server overloaded," "try again later").

📊 Key Results of the Social Phishing Study

The study demonstrated that personalization significantly increased the success rate of the
attack:

 Success Rate:

o Control Condition (Email from unknown sender at the university): 16% success
rate.

o Social Condition (Email crafted from an alleged friend): 72% success rate.
 Urgency and Takedown Time: 70% of successful authentications occurred in the first 12
hours, highlighting the immediate risk and the need for phishing websites to be taken
down rapidly.

 Repeated Attempts: Users exhibited high persistence, with some trying to authenticate
up to 80 times after receiving an error message, confirming their high trust in the
message's legitimacy.

👤 Vulnerability Factors

 Gender: Females were found to be more vulnerable overall. The success rate was
highest when the email came from the opposite gender (78% from male to female; 68%
from female to male).

 Age/Status: Freshmen (younger targets) were significantly more vulnerable than
upperclassmen.

 Department: Participants from Science departments showed the largest difference
between social and control conditions (higher vulnerability to social phishing), while those
in Technology had the smallest difference (lower vulnerability).

🛑 Ethical Concerns and Solutions

The study received significant negative backlash from participants who felt the experiment was
unethical, inappropriate, and fraudulent, highlighting the psychological costs of such research.

The video concludes with necessary steps to mitigate phishing:

 Education: Extensive educational campaigns are needed to inform users about the risks
of sharing personal information and recognizing phishing attacks.

 Technological Solutions:

o Faster takedown of malicious websites.

o More prevalent use of digitally signed emails to confirm sender authenticity.

o Browser solutions that identify and flag fraudulent websites.

The video "Week 7 - Tutorial 6- Analyzing text with NLTK" provides a tutorial on using the
Natural Language Toolkit (NLTK) in Python to perform simple text analysis and gain insights
from a collected dataset of tweets.

The tutorial walks through a process to clean and analyze text, using tweets about the Serum
Institute as an example.

Text Analysis Steps with NLTK

The goal is to move from raw text to meaningful word insights by systematically cleaning the
data:

1. Tokenization

 Action: Breaks the raw text (the collected tweets) into individual units, or tokens (words
and punctuation).

 Tool: nltk.word_tokenize.

 Result: A list of all individual words, but the data is noisy, including punctuation,
capitalization differences, and common, uninformative words.
2. Lowercasing

 Action: Converts all tokens to lowercase.

 Tool: Python's .lower() string function.

 Result: Standardizes words so that "Fire" and "fire" are counted as the same token,
improving the total count for meaningful terms.

3. Punctuation Removal

 Action: Deletes standard punctuation marks (like colons, commas, and full stops) from
the tokens.

 Tool: Python's string.punctuation constant and the .translate() string method.

 Result: Removes uninformative characters, making the list of common tokens cleaner.

4. Stop Word Removal and Length Filtering

 Action: Removes common English words (like "the," "of," "at," "in") that do not
contribute to the topic and removes any remaining single-character tokens.

 Tool: stopwords.words('english') to get the list of common words, and
filtering out tokens shorter than two characters.

 Result: This final cleaning step significantly clarifies the data, revealing the core subjects
of the tweets.
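
The four cleaning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library so that it runs without downloading NLTK data, a regular expression stands in for NLTK's tokenizer, STOP_WORDS is a tiny hypothetical stand-in for NLTK's English stop-word list, and the sample tweets are invented rather than taken from the tutorial's dataset.

```python
import re
import string
from collections import Counter

# Hypothetical sample tweets (not the tutorial's real data).
TWEETS = [
    "Fire at the Serum Institute in Pune: five lives lost.",
    "FIRE breaks out at Serum Institute, India's vaccine maker!",
    "Covid vaccine production is safe after the fire in Pune.",
]

# Tiny stand-in for NLTK's English stop-word list (assumption).
STOP_WORDS = {"the", "of", "at", "in", "is", "a", "an", "after", "out"}


def clean_tokens(text):
    """Apply the four steps: tokenize, lowercase, strip punctuation, filter."""
    # 1. Tokenization: words and punctuation marks become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # 2. Lowercasing, so "Fire" and "fire" count as the same token.
    tokens = [t.lower() for t in tokens]
    # 3. Punctuation removal with string.punctuation and str.translate.
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]
    # 4. Stop-word removal and dropping tokens shorter than two characters.
    return [t for t in tokens if len(t) >= 2 and t not in STOP_WORDS]


def top_tokens(tweets, n=5):
    """Count cleaned tokens over all tweets and return the n most common."""
    counts = Counter()
    for tweet in tweets:
        counts.update(clean_tokens(tweet))
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_tokens(TWEETS))
```

With real data, the regular expression and STOP_WORDS would be swapped for the NLTK tokenizer and stop-word list named in the steps; the rest of the pipeline stays the same.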

🔍 Gaining Insight

After completing the cleaning process, the analysis of the top-occurring tokens provides a clear
picture of the tweet content:

 Identified Topics: The most frequent terms included "serum institute," "india," "fire,"
"vaccine," "pune," "five lives loss," and "covid 19."

 Conclusion: The analysis reveals that the tweets are centered on a tragic incident—a fire
at the Serum Institute in Pune that resulted in five lives lost, which was significant due
to the institute's role as the largest COVID-19 vaccine producer in the country.

This demonstrates the power of NLP and basic string operations to rapidly extract actionable
intelligence from unstructured text data.

The video "Week 8.1: Profile Linking on Online Social Media" explores the problem of profile
linking or identity resolution, which is the process of matching different accounts across various
online social networks (OSNs) to confirm they belong to the same individual.

The primary motivation is to track a user's digital footprint and avoid duplicating effort, such as
preventing advertisers from wasting resources by sending the same ad to a person's Facebook,
Twitter, and LinkedIn accounts.

🧩 The Challenge of Profile Linking

Users often have multiple accounts on OSNs, and linking these profiles is challenging because:

 Different Networks, Different Data: Different platforms (e.g., YouTube, Tinder, LinkedIn,
Facebook) offer different types of information (video opinions, dating data, professional
details, personal connections).
 Attribute Change Over Time: Users frequently change their profile attributes, making
static matching difficult:

o Usernames: A study tracking 376 million users showed that 7% of accounts
changed their username at least once during the tracking period.

o Profile Pictures: Up to 40% of users changed their profile picture multiple times
within a two-month period.

o Descriptions: About 35% of people changed their profile description at least
twice.

 Ambiguity: It is difficult to link accounts when they use different names (e.g., "Pam
Kumuru" vs. "pnr2468") or when a user is inactive on one platform.

🤝 Approaches to Profile Linking

To overcome these challenges, researchers use various attributes and features, often
incorporating the historical evolution of these attributes:

1. Explicit Self-Identification (The Easiest Way)

 A user explicitly links their accounts on a public page (e.g., adding a Twitter link to their
Tumblr profile).

 A user posts the same content (like a photo album link) simultaneously across two
different networks.

2. Common/Past Attribute Comparison (Feature Engineering)

Researchers compare a diverse set of attributes, categorized by their structural and temporal
characteristics:

 Username Creation Behavior: Features derived from the usernames themselves:

o Similarity: Similar length, choice, or arrangement of characters (e.g., using
"Jaccard distance" to measure how close two handles are).

o Evolution: Tracking the changes in username length or character choice over
time.

o Reused Patterns: Checking if the same core username is reused across platforms
(e.g., "swampson" on Twitter and Instagram).

 Profile Metadata: Comparing attributes like:

o Profile Picture (even if changed, a sequence of similar pictures can be a feature)

o Description, Location, and Name
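
As an illustration of the username-similarity feature, the sketch below computes a Jaccard distance between two handles. It assumes the handles are compared as sets of character bigrams; the lecture does not say which representation the researchers used, so the bigram choice and the function names are hypothetical.

```python
def char_ngrams(name, n=2):
    """Return the set of character n-grams of a lowercased username."""
    name = name.lower()
    return {name[i:i + n] for i in range(len(name) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Size of the n-gram intersection divided by the size of the union."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two handles too short to have n-grams: treat as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def jaccard_distance(a, b, n=2):
    """Distance = 1 - similarity; 0 means identical n-gram sets."""
    return 1.0 - jaccard_similarity(a, b, n)


if __name__ == "__main__":
    print(jaccard_distance("swampson", "swampson_1"))  # small: likely same user
    print(jaccard_distance("swampson", "pnr2468"))     # large: no shared bigrams
```

A distance like this would be one column in the feature vector, alongside features built from past usernames, pictures, and descriptions.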

3. Machine Learning Classification

Using the engineered features (current and historical), a classifier (like an SVM) can be trained
to calculate the probability that two handles belong to the same person. Studies have shown
that including past username information can significantly improve the accuracy of profile
linking.

🚨 Broader Motivations (Beyond Advertising)

Identity resolution is critical for several other applications:


 Measuring Public Sentiment: By identifying unique individuals, analysts can accurately
measure the true volume of positive or negative sentiment about a topic, preventing
duplication when one person expresses the same opinion on two different networks.

 Law Enforcement: To identify a single malicious actor who is operating under different
handles across multiple platforms (e.g., posting a malicious video on both YouTube and
Twitter) to avoid wasting resources chasing multiple false leads.

The video "Week 8.2: Anonymous Networks" discusses the nature, use, and characteristics of
anonymous social networks, using Whisper as the primary example.

👻 What Are Anonymous Networks?

Anonymous Networks are social platforms where it is difficult or impossible to identify the
person posting the content.

 Examples: Whisper, Secret, 4chan, Yik Yak, and Wickr.

 Motivation: They meet the growing demand for privacy and anonymity online, often
driven by high-profile privacy incidents (like the Snowden revelations) or consequences
faced by users for controversial posts on traditional networks. Users turn to them to
express thoughts they wouldn't want attributed to their real identity.

📱 Case Study: Whisper

Whisper is a confessional app where users post "whispers" (a post with text layered over a
generated image, similar to a meme). It prioritizes anonymity through several features:

 No Personal Information: It claims not to collect or associate personal information like
email or contacts with a user ID.

 Changeable Identity: Users can change their temporary usernames whenever they
want, making long-term identity tracking extremely difficult.

 Non-Persistent Social Links: Unlike Facebook or Twitter, Whisper does not maintain a
persistent graph of user relationships, even for actions like "hearts" (likes) or replies.

📊 Interaction Analysis: Anonymous vs. Traditional Networks

Analysis of Whisper data reveals that user interactions and the network structure are
fundamentally different from traditional online social networks (OSNs) like Facebook and
Twitter.

1. User Interaction and Attention

 Short Attention Span: 54% of replies to a whisper arrive within the first hour of posting,
and 94% arrive within one day. If a post doesn't get attention quickly, it is unlikely to get
any later.

 Low Engagement: 55% of all whispers receive no replies at all.

 Content Deletion: 18% of content generated on Whisper is deleted, compared to only
4% on Twitter. The peak deletion time is typically between 3 and 9 hours after posting,
with most deletes occurring within the first 24 hours. Categories like "sexting," "selfie,"
and "chat" were most frequently deleted.

2. Network Structure (Random Graph)

A network analysis comparing Whisper to Facebook and Twitter found that Whisper behaves
like a random graph, suggesting interactions are not based on existing relationships:

| Metric | Whisper (Anonymous) | Facebook (Traditional) | Implication for Whisper |
|---|---|---|---|
| Average Degree | Highest (9.47) | Lowest (1.78) | Users interact with a larger sample of other users. |
| Clustering Coefficient | Lowest (0.033) | Highest (0.59) | Users are likely to interact with complete strangers who are highly unlikely to interact with each other. |
| Average Path Length | Lowest (4.28) | Highest (10.13) | Any two users are quickly reachable, characteristic of a random graph. |
| Assortativity | Lowest (-0.011) | Higher (0.038) | Confirms the network is random, as nodes do not link to other nodes with similar degrees. |
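
To make the first three metrics in the comparison concrete, here is a minimal pure-Python sketch that computes them on a toy undirected graph. The edge list is invented for illustration and the helper names are hypothetical; assortativity is left out to keep the example short, and the path-length function assumes the graph is connected.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def local_clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = sorted(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Toy graph: a triangle a-b-c with one extra node d attached to c.
TOY = build_graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

if __name__ == "__main__":
    print(average_degree(TOY), average_clustering(TOY), average_path_length(TOY))
```

A low clustering coefficient combined with a short average path length, as reported for Whisper, is the signature of a random-like graph; the triangle in the toy graph is what pushes its clustering up.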

3. User Engagement and Stickiness

Despite the number of new users increasing, the daily volume of new posts and replies in the
entire network remains relatively stable. This suggests that the lack of persistent identity and
strong social links prevents the formation of stickiness—the factor that keeps users continually
engaged in traditional OSNs. New users either quickly leave or do not contribute significantly,
leading to flat overall content generation.

🌐 What is Gephi?

Gephi is an open-source tool widely used in academia, journalism, and digital humanities for
analyzing and visualizing complex network data, such as social graphs from platforms like
Facebook and Twitter.

⚙️ Core Functions and Tabs

The tutorial walks through the three main tabs in Gephi and their functionalities:

1. Data Laboratory

This tab allows you to manage the raw data.

 Data Loading: Gephi supports various file formats, including GML, GraphML, and CSV
(requiring separate files for nodes with unique IDs, and edges with Source and Target
IDs).

 Data Manipulation: You can view, edit, and modify attributes for both nodes (e.g.,
changing the number of followers) and edges.
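
The two-file CSV layout described above can be produced with Python's csv module. This is a small sketch with invented sample data: the Id, Label, Source, and Target column names follow Gephi's spreadsheet import convention, while the followers attribute column and the function names are hypothetical.

```python
import csv
import io


def nodes_csv(nodes):
    """Serialize (id, label, followers) tuples as a Gephi nodes table."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Id", "Label", "followers"])
    for node_id, label, followers in nodes:
        writer.writerow([node_id, label, followers])
    return buf.getvalue()


def edges_csv(edges):
    """Serialize (source_id, target_id) tuples as a Gephi edges table."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    for source, target in edges:
        writer.writerow([source, target])
    return buf.getvalue()


if __name__ == "__main__":
    # Invented example: two accounts, one follow relationship.
    print(nodes_csv([(1, "alice", 120), (2, "bob", 45)]))
    print(edges_csv([(1, 2)]))
```

Writing each string to its own file (for example nodes.csv and edges.csv) gives the pair of files that can then be loaded in the Data Laboratory.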

2. Overview

This is the main workspace for visualization and calculation.

 Statistics: This panel is used to calculate various network metrics by clicking the "Run"
button for each:
o Average Degree: The average number of connections per node.

o Network Diameter: The longest of the shortest paths between any pair of nodes; the same run also reports the average path length.

o Graph Density: Measures how close the network is to a complete graph.

o Modularity: A measure used to identify communities or clusters within the
network.

o PageRank & Centrality: Algorithms to determine the importance of nodes (e.g.,
Eigenvector Centrality).

 Appearance: Customize the visual properties of the nodes and edges, such as changing
size or color based on calculated attributes (e.g., sizing a node by its in-degree or
coloring it by its modularity class).

 Layout: Apply various algorithms (e.g., Fruchterman Reingold, Force Atlas 2) to organize
the graph visually and separate overlapping nodes.

 Filters: Apply filters to select specific subsets of nodes and edges based on attributes or
network topology (e.g., nodes with a certain range of followers).
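
For readers who want to see what two of the Statistics entries compute, the sketch below calculates graph density and network diameter by hand. It assumes a connected, undirected graph with at least two nodes; the edge list and function names are invented for illustration and are not part of Gephi.

```python
from collections import deque


def to_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def graph_density(adj):
    """Edges present divided by edges possible: 2m / (n * (n - 1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def eccentricity(adj, src):
    """Longest shortest-path distance from src to any node (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj):
    """Largest eccentricity over all nodes."""
    return max(eccentricity(adj, node) for node in adj)


# Toy graph: a simple path a - b - c - d.
PATH = to_adjacency([("a", "b"), ("b", "c"), ("c", "d")])

if __name__ == "__main__":
    print(graph_density(PATH), diameter(PATH))
```

A complete graph has density 1.0, so the density tells how far the network is from every node being linked to every other node.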

3. Preview

This tab is for advanced adjustments and export.

 Final Visualization: Allows for fine-tuning the visual representation (e.g., smoothing
curved edges, showing labels) for presentation purposes.

 Export: You can export the final network visualization in formats like PDF, PNG, or SVG.

Additional Features

 Workspaces: You can export a filtered subset of the graph to a new workspace to
perform separate, detailed analysis on a sub-graph without affecting the original data.

 T-Cloud: A feature to generate a tag cloud representation of the node labels.

This video, "Week 9.1: Privacy in Location Based Social Networks Part 1," discusses privacy
concerns and research related to Location-Based Social Networks (LBSNs), using Foursquare as
a primary case study.

📍 Location-Based Social Networks (LBSNs) and Privacy

The lecture focuses on the shift in the course toward reviewing research papers to understand
how techniques learned previously are applied to draw meaningful inferences.

 Examples of LBSNs: Foursquare, Yelp, Gowalla, and geo-location features on platforms
like Facebook and Twitter.

 Privacy Concern: The core issue is the leakage of location information. Data from
surveys suggests that a significant portion of users (up to 39%) set their location sharing
to "Everyone" on platforms like Facebook. An example from the survey showed that 44%
of Indian participants felt it was privacy-invasive when a mobile service provider used a
regional language to report a switched-off phone, as it revealed their geographical
location.

Research Case Study: Inferring Home City from Foursquare


The paper "We Know Where You Live: Privacy Characterization of Foursquare Behavior"
investigates how a user's private information, specifically their home location, can be inferred
using only publicly available LBSN data.

Foursquare Terminology and Public Data

Foursquare offers a gamified platform with the following key terms and public/private data:

| Term | Description | Privacy Status |
|---|---|---|
| Check-ins | A user sharing their current location (Venue) via GPS. | Private (usually visible only to friends). |
| Venue | A specific location (e.g., airport, restaurant, monument). | |
| Mayorship | The status given to the user who has checked into a specific venue the most times in the last 60 days. | Publicly Available |
| Tips | User-posted comments, feedback, or reviews about a venue. | Publicly Available |
| Dones/To Do | Marking a tip as "done" (agreed/useful) or "to do" (a place to visit). | Publicly Available |

Findings on Privacy Leakage

The research found that by analyzing only the publicly available information (Mayorships, Tips,
and Dones), the authors could infer the home city of approximately 78% of the analyzed users
within 50 kilometers.

The video also references the defunct website "Please Rob Me," which demonstrated the real-
world risk of location sharing by aggregating tweets that indicated users were not home,
effectively providing a target list for burglars.

This video, "Week 9.2: Privacy in Location Based Social Networks Part 2," continues the
analysis of the research paper "We Know Where You Live: Privacy Characterization of
Foursquare Behavior," detailing the methodology used to infer users' home locations from their
publicly available Foursquare data.

📊 Data Set Characterization and Analysis

The lecture begins by characterizing the massive data set, which included approximately 13
million users and millions of tips, dones, and mayorships.

 Power Law Distribution: The distribution of mayorships, tips, and dones per user and
per city was found to be highly skewed, following a Power Law (or Pareto Principle). This
means a small minority of users or cities account for the vast majority of the content
posted.

 User Activity Patterns:

o Time: The distribution of time between consecutive tips or dones by a user
showed a clear daily pattern, with activity spiking at multiples of 24 hours. This
reflects the common behavior of checking into locations like an office or an
institute every day.

o Distance: The displacement between consecutive tips or dones showed that
most users (about 70%) have an average travel displacement of at most 150
kilometers, indicating movement within or between local cities.
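
A quick way to check for the skew described above is to measure what share of all activity comes from the most active users. The sketch below does this for a list of per-user activity counts; the counts are invented for illustration and the function name is hypothetical.

```python
def top_share(counts, fraction):
    """Share of total activity contributed by the top `fraction` of users."""
    ordered = sorted(counts, reverse=True)
    k = max(1, round(len(ordered) * fraction))  # at least one user
    return sum(ordered[:k]) / sum(ordered)


# Invented tips-per-user counts for ten users.
TIPS_PER_USER = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]

if __name__ == "__main__":
    share = top_share(TIPS_PER_USER, 0.2)
    print(f"Top 20% of users posted {share:.0%} of all tips")
```

In a heavily skewed dataset this share is far above the fraction of users considered, which is the pattern the lecture describes for mayorships, tips, and dones.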

🏠 Inferring Home Location

The core of the analysis was determining the accuracy of inferring a user's home location using
the public information (mayorships, tips, and dones) and a majority voting scheme.

User Classification for Prediction

Users were categorized to better target the prediction model:

1. Class 0: Users with only a single activity (one tip, one done, or one mayorship) in the
data set.

2. Class 1: Users with multiple activities but a predominant location that stands out as the
most frequent.

3. Class 2: Users with multiple activities where no single location is clearly predominant.
(Prediction efforts focused on Class 0 and Class 1, as the home location could not be
determined for Class 2).
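
The classification and majority-voting idea can be sketched as follows. One assumption is made loudly here: "predominant" is taken to mean that a single city has strictly more activities than any other. The paper's exact criterion is not given in the lecture summary, and the function names are hypothetical.

```python
from collections import Counter


def classify_user(activity_cities):
    """Return 0, 1, or 2 following the three user classes described above."""
    if not activity_cities:
        raise ValueError("user has no public activities")
    if len(activity_cities) == 1:
        return 0  # a single tip, done, or mayorship
    ranked = Counter(activity_cities).most_common()
    # Assumption: predominant = strictly more activities than the runner-up.
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return 1
    return 2  # tie for the most frequent city


def infer_home_city(activity_cities):
    """Majority vote over activity cities; None when no city stands out."""
    if classify_user(activity_cities) == 2:
        return None
    return Counter(activity_cities).most_common(1)[0][0]


if __name__ == "__main__":
    print(infer_home_city(["Pune"]))                   # Class 0
    print(infer_home_city(["Pune", "Pune", "Delhi"]))  # Class 1
    print(infer_home_city(["Pune", "Delhi"]))          # Class 2 -> None
```

Each entry in the input list is the city of one public activity (a mayorship, tip, or done), so no private check-in data is needed.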

Key Finding

By comparing the inferred home location with the user-declared home city, the research
established a high degree of accuracy for privacy leakage:

The model can correctly infer the home city of around 78% of the analyzed users within a 50-
kilometer distance, using only their publicly available activity data. This demonstrates a
significant privacy risk on location-based social networks, as home location can be inferred
without access to private check-ins.

The video "Tutorial 7: Visualization - Highcharts" introduces how to use Highcharts, an interactive
JavaScript charting library, to create visualizations for data collected from online social media.
The tutorial covers using a Python wrapper to generate the charts and briefly demonstrates a
cloud-hosting platform.

💻 Visualization with Python and Highcharts

The first part of the tutorial focuses on installing and using the Python Highcharts wrapper to
build interactive graphs.

Setting Up

1. Installation: The wrapper is installed using the command sudo pip install python-highcharts.

2. Basic Structure: Code involves importing the pychart module, defining a container,
setting chart options (like title and chart type), and finally adding the data series. The
script generates an HTML file that can be viewed in a web browser.
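
The same "options, then data series, then HTML file" flow can be illustrated without the wrapper. The sketch below is a hand-rolled stand-in, not the wrapper's API: it builds a Highcharts options object as a Python dict, serializes it with json, and embeds it in a page that loads Highcharts from its public CDN. The function name and the follower numbers are invented.

```python
import json


def bar_chart_html(title, categories, series):
    """Return a standalone HTML page that renders a Highcharts bar chart."""
    options = {
        "chart": {"type": "bar"},          # chart type
        "title": {"text": title},          # chart title
        "xAxis": {"categories": categories},
        "series": series,                  # list of {"name": ..., "data": [...]}
    }
    return (
        "<!DOCTYPE html>\n<html><head>\n"
        '<script src="https://code.highcharts.com/highcharts.js"></script>\n'
        "</head><body>\n"
        '<div id="container"></div>\n'
        "<script>\n"
        f"Highcharts.chart('container', {json.dumps(options)});\n"
        "</script>\n</body></html>\n"
    )


# Invented data: followers of two users on two consecutive days.
PAGE = bar_chart_html(
    "Followers per user",
    ["user_a", "user_b"],
    [
        {"name": "Day 1", "data": [120, 45]},
        {"name": "Day 2", "data": [130, 44]},
    ],
)

if __name__ == "__main__":
    with open("followers.html", "w", encoding="utf-8") as fh:
        fh.write(PAGE)
```

Opening followers.html in a browser shows the chart; each series appears in the legend and can be toggled by clicking on it.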

Chart Types Demonstrated

The tutorial provides step-by-step examples for creating four fundamental chart types:
1. Bar Chart:

o Purpose: Presents grouped data using rectangular bars, where the length of the
bar is proportional to the value.

o Example: Visualizing the number of followers for five different users across four
consecutive days.

o Interaction: Allows users to filter out specific data series (e.g., a particular day)
by clicking on the legend.

2. Line Chart:

o Purpose: Shows a series of data points connected by straight lines, typically used
to visualize data changes over time.

o Example: Tracking the follower growth of multiple users over the months of a
year.

o Interaction: Demonstrates how to hover over data points to see specific values
and filter users from the graph.

3. Scatter Plot:

o Purpose: Draws a point for each data point to show the relationship or pattern
between two variables without connecting the points.

o Example: Analyzing the Friends-to-Followers pattern for celebrity accounts
versus normal accounts on Twitter.

4. Bubble Chart:

o Purpose: Similar to a scatter plot, but it incorporates a third dimension
represented by the size of the bubble. It is also used to demonstrate how to
handle and format date/time objects for the axes.

o Example: Visualizing a Twitter user's tweeting pattern, where:

 X-axis: Represents the timestamp (Date/Time).

 Y-axis: Indicates whether a tweet was posted (1 for tweet, 0 for no
tweet).

 Bubble Size (Z-axis): Represents the volume or number of tweets posted
on that day.
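
Preparing the bubble-chart data mostly means turning tweet timestamps into [x, y, z] points. The sketch below shows one way to do that, assuming the timestamps are already in UTC; Highcharts datetime axes expect x values in milliseconds since the Unix epoch. The function name and sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def bubble_points(timestamps):
    """Return one [x_ms, y, z] point per day between the first and last tweet.

    x_ms: midnight UTC of the day, in milliseconds since the epoch
    y:    1 if the user tweeted that day, else 0
    z:    number of tweets posted that day (bubble size)
    """
    per_day = Counter(ts.date() for ts in timestamps)
    day, last = min(per_day), max(per_day)
    points = []
    while day <= last:
        midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
        count = per_day.get(day, 0)
        points.append([int(midnight.timestamp() * 1000), 1 if count else 0, count])
        day += timedelta(days=1)
    return points


# Invented timestamps: two tweets on one day, none the next, one the day after.
SAMPLE = [
    datetime(2021, 1, 21, 10, 0, tzinfo=timezone.utc),
    datetime(2021, 1, 21, 15, 0, tzinfo=timezone.utc),
    datetime(2021, 1, 23, 9, 0, tzinfo=timezone.utc),
]

if __name__ == "__main__":
    for point in bubble_points(SAMPLE):
        print(point)
```

The resulting list can be used directly as the data of a bubble series on a chart whose x-axis type is set to datetime.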

☁️ Highcharts Cloud Platform

The video briefly introduces Highcharts Cloud, an alternative platform for visualization. Users
can simply copy and paste their data or upload a CSV file, and the platform automatically
generates the chart. This provides a non-coding method to create visualizations and allows for
easy customization through various templates.

The video "Week 10.1: Beware of What You Share Inferring Home Location in Social Networks"
discusses a research paper that performs a large-scale inference study to determine a user's
home city and residence using public data from three popular social networks: Foursquare,
Google+, and Twitter.

📊 Data Sets and Accuracy of Inference

The study analyzed millions of data points from the three networks:

| Social Network | Primary Attributes Analyzed | Home City Inference Accuracy |
|---|---|---|
| Foursquare | Mayorships, Tips, Likes/Dones, Friends | 67% |
| Google+ | Friends, Places Lived, Education, Employment | 72% |
| Twitter | User Location, Geotagged Tweets | 82% |

Key Data Insights:

 Data Volume: The data sets were large, with Foursquare data comprising 13 million
users and 16 million venues, Google+ including 27 million profiles, and Twitter using 20
million geotagged tweets.

 Geographic Granularity: For all three platforms, the city level was the most common
and precise level of location information available for user attributes (e.g., user home
city, venue city, education location).

 Twitter's Advantage: The high accuracy for Twitter (82%) is primarily due to the
precision of geotagged tweets (latitude and longitude), which allowed for highly
accurate location inference.

🏠 Inference Methodology and Results

The researchers grouped users into three classes based on their activity patterns, similar to the
Foursquare analysis in the previous lecture:

1. Class 0: Users with only one location-based activity.

2. Class 1: Users with multiple activities but one clearly predominant location.

3. Class 2: Users with multiple activities and no single dominant location. (Inference
models are focused only on Class 0 and Class 1).

Error Distance Analysis

The study also analyzed the physical distance (error) between the inferred home location and
the user's declared home location, highlighting the precision of the privacy leak:

 Home City (within 50 km): The model could make correct inferences within a 50-
kilometer radius of the user's declared home city with the following accuracies:

o Twitter: 87%

o Foursquare: 78.5%

o Google+: 64%

 Home Residence (within 20 km): When attempting to infer the user's exact residence
(not just the city), the results were particularly stark for platforms with highly precise
location data:

o Twitter: 73.67% of inferences were within a 20-kilometer radius, with 35% being
exactly 0 km away from the residence.

o Foursquare: 77% of inferences were within a 20-kilometer radius.

o Google+: Only 5.23% of exact residence inferences were possible, as the data
used (employment/education) is less tied to daily home location.
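
The "within 50 km" and "within 20 km" figures are shares of users whose inferred location falls inside a given radius of the declared one. The sketch below shows how such an error distance could be computed with the haversine formula; the paper's exact distance computation is not described in the lecture summary, so the formula choice, the Earth radius constant, and the function names are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def share_within(pairs, radius_km):
    """Fraction of (inferred, declared) coordinate pairs within radius_km."""
    hits = sum(
        1
        for inferred, declared in pairs
        if haversine_km(*inferred, *declared) <= radius_km
    )
    return hits / len(pairs)


if __name__ == "__main__":
    # Invented pairs: one exact match, one about 111 km off.
    pairs = [((0.0, 0.0), (0.0, 0.0)), ((0.0, 0.0), (1.0, 0.0))]
    print(share_within(pairs, 50))
```

Running the same function with a 20 km radius instead of 50 km gives the stricter residence-level measure.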

The paper concludes that users must beware of what they share, as location-based social
networks leak significant privacy by allowing highly accurate inference of a user's home city and
residence using only publicly available data.

The video "Week 10.2: On the dynamics of username change behavior on Twitter" analyzes the patterns,
frequency, and motivations behind why users change their usernames on Twitter.

📈 Key Findings on Username Change Dynamics

The study tracked 8.7 million users over two months and found that about 10% of users
changed their username at least once. The behavior follows a classic Pareto principle (power-
law distribution), showing that a small fraction of users are responsible for the majority of
changes.

 Frequency: 20% of users were responsible for 85% of all username changes, with these
users often changing their names five times or more. One extreme user changed their
username 113 times in 14 months.

 Correlation: There is only a weak positive correlation between a user's popularity
(number of followers) or activity (number of tweets posted) and the frequency of their
username changes. This suggests that change behavior is not strictly driven by high
visibility.

 New vs. Old Names: About 65% of users choose a new username that is unrelated to
their old one, while 35% later reuse an old name.

🎯 Reasons for Username Changes

The authors identified several reasons why users change their handles, some benign and some
malicious:

 Space Gain: Since Twitter had a character limit, many users with long usernames
(greater than the median length of 11 characters) changed to shorter names to save
characters for their tweets.

 Suit a Trending Event: Users change their handle to align with a current event (like a
sports season) to gain traffic or traction.

 Anonymity: Users change to a more generic name (e.g., from a real name to an
anonymous phrase) to gain privacy, or vice versa, to make the account more personal.

 Adjust to Real-Life Events: A user might update their handle to reflect a new profession
or status (e.g., changing from "gradstudent" to "professor").

 Malicious Intent:

o Username Squatting: Registering a popular or desired username and holding it
until a popular figure or brand wants it, often to monetize the handle.

o Obscurity/Promotion: Changing a handle to one very similar to a popular user to
get promotion or avoid being tracked.

The study also observed instances of collaborative username sharing where a single username
was used by different accounts within a small group at different timestamps, suggesting
coordinated behavior.

The video "Week 10.3: Boston Marathon - Analyzing Fake Content on Twitter" examines the spread and
characteristics of fake, rumor, and malicious content on Twitter during and immediately
following the Boston Marathon bombing on April 15, 2013.

🚨 Key Findings on Fake Content

The study collected 7.9 million tweets using a set of event-related keywords and manually
annotated the most popular content to categorize it.

 Rumors and Fake Content Dominated Virality:

o 29% of the most viral content was classified as rumors and fake content.

o 51% consisted of general comments, opinions, and condolences (neutral/other).

o Only 20% was confirmed true information.

 Temporal Spread of Information:

o Rumor information generally started to spread earlier and propagated faster
than true information.

o The frequency of tweets directly correlated with real-world events, such as one
hour after the blast, the release of suspect pictures, and the subsequent
manhunt.

 Malicious Account Creation:

o The authors identified over 6,000 malicious user profiles that were created right
after the blast and were later suspended by Twitter.

o These profiles were not isolated; they often formed connected networks (e.g.,
closed communities, star topologies) to collaboratively spread the fake content.

Characteristics of Propagators

The analysis of user attributes revealed features of the accounts involved in spreading the fake
content:

 Device Used: Approximately 75% of the fake tweets were propagated via mobile
phones (e.g., iPhone).

 Follower Count: Fake content was often propagated by accounts with a low number of
followers (low popularity). However, it became viral because it was retweeted by users
higher up in the network chain, including some with verified accounts who did not check
the content's validity.

 Predicting Virality: The study found that it is possible to predict how viral fake content
will become based on the attributes of the users currently propagating it, using features
like social engagement and credibility.

⚠️Misinformation Examples

Examples of the fake and malicious tweets that circulated include:

 Spreading false claims about victims (e.g., a non-existent child who died).

 Tweets asking for retweets with the false promise that "$1 will be donated to Boston
Marathon victims for every retweet."
