Understanding Data
Data Types
Structured Data
Data in an organized form that conforms to a data model
Semi-Structured Data
Does not conform to a data model but has some structure
Unstructured Data
Data not in an organized form and does not conform to a data model
Digital Data
Structured data: 10%
Semi-structured data: 10%
Unstructured data: 80%
Structured Data
Conforms to a data model
Similar entities are grouped
Attributes in a group are the same
Data is stored in rows and columns
Data resides in fixed fields in a record or file
Definition, format and meaning of data are explicitly known
Structured Data Sources
Databases
Spreadsheets
SQL
OLTP Systems
Ease with Structured Data
Storage
Scalability
Security
Update and Delete
Retrieving Information
Indexing and Searching
Mining Data
BI Operations
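A minimal sketch of why these operations are easy on structured data, using Python's built-in sqlite3 module. The `customers` table, its columns and its rows are made-up assumptions for illustration and do not come from the slides.

```python
import sqlite3

# In-memory database; the "customers" table and its rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi"), (3, "Meera", "Pune")],
)

# Indexing: an index on a fixed field supports fast searches on that field.
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")

# Searching: every row has the same fields, so the query is unambiguous.
pune_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY id", ("Pune",)
    )
]

# Update and delete are single statements against known fields.
conn.execute("UPDATE customers SET city = ? WHERE id = ?", ("Mumbai", 2))
conn.execute("DELETE FROM customers WHERE id = ?", (3,))
remaining = conn.execute(
    "SELECT id, name, city FROM customers ORDER BY id"
).fetchall()

print(pune_names)
print(remaining)
```

Because the definition and format of every field is known in advance, each operation is one declarative statement rather than custom parsing code.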
Semi-structured Data
Does not conform to any data model but contains tags and elements
Similar entities are grouped
Cannot be stored in rows and columns
Attributes in a group may not be the same
Not sufficient metadata
The tags and elements describe how the data is stored
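A small sketch of these properties using JSON, one common tagged format. The two records and their keys are hypothetical; the point is only that entities of the same kind can carry different attributes.

```python
import json

# Hypothetical records: the same kind of entity, but with differing attributes.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "tags": ["vip"]}
]
"""
records = json.loads(raw)

# The keys (tags) describe each value, but no fixed schema is enforced.
all_keys = sorted({key for record in records for key in record})

# Forcing the records into rows and columns leaves gaps (None)
# wherever a record lacks an attribute.
rows = [[record.get(key) for key in all_keys] for record in records]

print(all_keys)
for row in rows:
    print(row)
```

The gaps in `rows` show why such data does not fit a fixed row-and-column layout without extra handling.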
Semi-Structured Data Sources
Email
XML
Zipped Files
TCP/IP Packets
Mark-Up Languages
Integration of data from heterogeneous sources
Semi-Structured Data
Challenges: Storage cost, RDBMS, irregular and partial structure, evolving schemas, schemas and data
Possible solutions: XML, RDBMS, special-purpose RDBMS, OEM
Unstructured Data
Does not conform to any data model
Has no easily identifiable structure
Does not follow any rules or semantics
Not in any particular format or sequence
Cannot be stored in rows and columns
Not easily usable by a program
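To illustrate why unstructured data is hard for a program to use, here is a sketch that pulls an amount and dates out of a made-up memo. The memo text and both patterns are assumptions; a differently worded memo would defeat them.

```python
import re

# Made-up free-text memo; nothing marks which part is a date or an amount.
memo = (
    "Met the vendor on 12 March. They quoted Rs. 4500 for the repair "
    "and want payment by 20 March."
)

# Heuristic patterns: guesses about how the text happens to be written.
months = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
amounts = re.findall(r"Rs\.\s*(\d+)", memo)
dates = re.findall(r"\b(\d{1,2} (?:" + months + r"))\b", memo)

print(amounts)
print(dates)
```

With structured data the same values would simply be read from named fields; here the program has to infer structure the text never declares.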
Unstructured Data Sources
Web pages, memos, videos, images, content of mail, surveys, Word documents, PPTs, chats, reports, white papers, etc.
Unstructured Data Challenges
Challenges: Storage space, scalability, retrieving information, security, update and delete, indexing and searching
Solutions: Change formats, new hardware, RDBMS/BLOBs, XML, CAS
Measurement Properties
Distinctness
Different objects receive different scores (= or ≠)
Order
Ordering of the numbers reflects ordering of the variable (< or >)
Addition & Subtraction
Equal differences between numbers reflect equal differences in the variable (+ or -)
Multiplication & Division
A value of zero indicates absence of the variable being measured, so ratios of values are meaningful (× or ÷)
Nominal Scale
Numbers are assigned as labels or tags for identifying and classifying objects
Numbers do not reflect the amount of the characteristic possessed by the objects
Counting is the only permissible operation on a nominal scale, as the numbers are arbitrary
Only limited statistical measures can be used to summarize the data
Ordinal Scale
Numbers are assigned to objects to indicate their relative position
These numbers also indicate whether an object has more or less of a characteristic than some other object
Orders categories logically
Few statistical measures can be used to summarize these numbers
Interval Scale
Has equal intervals between points on the scale
Arbitrary zero point
Almost all the descriptive statistical measures can be used to summarize and analyze
Ratio Scale
Possesses all the properties of nominal, ordinal and interval scales
It has an absolute zero point indicating the absence of the variable
Categories have equal intervals
Almost all the descriptive measures can be used to summarize and analyze
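A short worked example of the difference an absolute zero makes, assuming two illustrative temperatures. Celsius is an interval scale (arbitrary zero) and Kelvin is a ratio scale (absolute zero).

```python
# Temperatures of two objects in degrees Celsius (illustrative values).
a_celsius, b_celsius = 20.0, 10.0


def to_kelvin(celsius):
    """Convert degrees Celsius to kelvin."""
    return celsius + 273.15


# Differences are meaningful on both scales, and they agree.
diff_celsius = a_celsius - b_celsius
diff_kelvin = to_kelvin(a_celsius) - to_kelvin(b_celsius)

# Ratios are only meaningful on the ratio scale. The Celsius ratio is
# computable, but "twice as hot" is not a valid conclusion from it.
ratio_celsius = a_celsius / b_celsius
ratio_kelvin = to_kelvin(a_celsius) / to_kelvin(b_celsius)

print(diff_celsius, round(diff_kelvin, 2))
print(ratio_celsius, round(ratio_kelvin, 3))
```

The differences match on both scales, while the ratio changes once the zero point is no longer arbitrary.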
Scales of Measurement
Nominal
Examples: Ethnicity, religion, ZIP, gender, ID
Properties: Distinctness
Mathematical operations: None
Statistical measures: Mode, contingency table, chi-square
Ordinal
Examples: Class rank, letter grade, mineral grading
Properties: Distinctness and order
Mathematical operations: Rank order (< or >)
Statistical measures: Weighted mean, median, rank correlation, percentile
Interval
Examples: Temperature in Celsius or Fahrenheit, marks
Properties: Distinctness, order and difference
Mathematical operations: Add, subtract
Statistical measures: All descriptive measures (mean, SD, Pearson's correlation and t-test)
Ratio
Examples: Weight, age, height, time, temperature in Kelvin
Properties: Distinctness, order, difference and multiplication
Mathematical operations: All arithmetic operations
Statistical measures: All descriptive measures (geometric mean, percent variation, etc.)
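The pairing of scales with statistical measures can be sketched with Python's standard statistics module. All data values below are made up for illustration, and each scale is shown with just one or two of the measures it permits.

```python
import statistics

# Illustrative data, one variable per scale of measurement.
religion = ["A", "B", "A", "C", "A"]        # nominal: labels only
grades = ["B", "A", "C", "B", "A"]          # ordinal: A > B > C
temps_celsius = [20.0, 22.0, 24.0, 26.0]    # interval: arbitrary zero
weights_kg = [2.0, 8.0]                     # ratio: absolute zero

# Nominal: counting is the only permissible operation, so use the mode.
nominal_mode = statistics.mode(religion)

# Ordinal: map categories to ranks, then take a median that is an
# actual observed rank (median_low never averages two ranks).
rank = {"C": 1, "B": 2, "A": 3}
grade_of = {r: g for g, r in rank.items()}
ordinal_median = grade_of[statistics.median_low([rank[g] for g in grades])]

# Interval: differences are meaningful, so mean and SD apply.
interval_mean = statistics.mean(temps_celsius)
interval_sd = statistics.stdev(temps_celsius)  # sample standard deviation

# Ratio: ratios are meaningful, so the geometric mean applies as well.
ratio_gm = statistics.geometric_mean(weights_kg)

print(nominal_mode, ordinal_median)
print(interval_mean, round(interval_sd, 3), round(ratio_gm, 3))
```

Each scale also permits the measures of the scales before it; for example, a mode can be taken for ratio data, but a geometric mean cannot be taken for nominal data.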
Data Availability & Preparation
Data availability, seeking permission and third-party data
Data Understanding Phases
Collect Initial Data
Locating, assessing and obtaining data
Describe Data
Properties of data: amount of data, data types, etc.
Explore Data
Cross tabulations, Associations & Descriptive Statistics
Verify Data Quality
Complete data, data on all variables & missing values
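The describe, explore and verify phases above can be sketched on a tiny in-memory data set. The survey records and variable names are hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical survey records; None marks a missing value.
records = [
    {"gender": "F", "owns_car": "yes", "age": 34},
    {"gender": "M", "owns_car": "no", "age": None},
    {"gender": "F", "owns_car": "no", "age": 29},
    {"gender": "M", "owns_car": "yes", "age": 41},
    {"gender": "F", "owns_car": "yes", "age": None},
]
variables = ["gender", "owns_car", "age"]

# Describe: amount of data and the type observed for each variable.
n_records = len(records)
types = {
    v: sorted({type(r[v]).__name__ for r in records if r[v] is not None})
    for v in variables
}

# Explore: cross tabulation of gender by car ownership.
crosstab = Counter((r["gender"], r["owns_car"]) for r in records)

# Verify quality: count missing values per variable.
missing = {v: sum(r[v] is None for r in records) for v in variables}

print(n_records, types)
print(dict(crosstab))
print(missing)
```

In practice these checks are usually run with a statistics package; the sketch only shows what each phase computes.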
Data Preparation
Select Data
Selection of variables and data
Clean Data
Handle Missing values
Construct Data
Transformation of Variables
Integrate Data
Merging Data from different sources
Format Data
Structure of Variables
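The five preparation steps above can be sketched end to end on two small hypothetical sources. The field names, the choice of mean imputation, the log transform and the default of 0 orders are all illustrative assumptions, not methods prescribed by the slides.

```python
import math
import statistics

# Two hypothetical sources that share a customer id.
customers = [
    {"id": 1, "name": "Asha", "income": 40000.0},
    {"id": 2, "name": "Ravi", "income": None},
    {"id": 3, "name": "Meera", "income": 60000.0},
]
orders = {1: 3, 3: 5}  # id -> number of orders, from a second source

# Select: keep only the variables needed for the analysis.
selected = [{"id": c["id"], "income": c["income"]} for c in customers]

# Clean: replace missing income with the mean of the observed values
# (one of several possible strategies).
observed = [row["income"] for row in selected if row["income"] is not None]
mean_income = statistics.mean(observed)
for row in selected:
    if row["income"] is None:
        row["income"] = mean_income

# Construct: derive a transformed variable.
for row in selected:
    row["log_income"] = math.log10(row["income"])

# Integrate: merge in order counts; a customer absent from the second
# source is assumed here to have 0 orders.
for row in selected:
    row["orders"] = orders.get(row["id"], 0)

# Format: fix the column order so every record has the same structure.
columns = ["id", "income", "log_income", "orders"]
table = [[row[col] for col in columns] for row in selected]

for line in table:
    print(line)
```

The order of the steps matters: the transformation is applied after cleaning so that no missing value reaches `math.log10`.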
Explore
The Data Warehouse - Digital Chapter, HBSP
SPSS Introduction [Link]
Review of BRM/MR course