Data Analytics Overview for COMP 333
Data wrangling prepares data for further analysis by transforming messy raw data into a structured format. This involves cleansing errors, handling missing values, normalizing values so they are comparable, and enforcing data consistency and integrity. The result is 'Tidy Data,' which is suitable for sophisticated analysis and model building.
Data wrangling is crucial because raw data is typically 'messy,' and analysis is subject to the principle of 'Garbage-In-Garbage-Out (GIGO)': flawed inputs produce flawed results. The process involves steps such as discovering, structuring, cleansing, enriching, validating, and publishing data, and it often accounts for over 60% of the time and effort in data analytics. The goal is to output 'Tidy Data,' in which each variable and each observation is clearly defined and organized (typically one variable per column and one observation per row).
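The cleansing, missing-value, and normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed course method: the records, the field names (name, city, income), and the choice of median imputation and min-max scaling are all assumptions made for the example.

```python
from statistics import median

# Hypothetical raw records: inconsistent casing, stray whitespace,
# a missing value, and numbers stored as text.
raw = [
    {"name": " Alice ", "city": "montreal", "income": "52000"},
    {"name": "BOB", "city": "Montreal ", "income": ""},
    {"name": "carol", "city": "TORONTO", "income": "61000"},
]

def cleanse(record):
    """Structure and cleanse one record: trim text, unify case, parse numbers."""
    income = record["income"].strip()
    return {
        "name": record["name"].strip().title(),
        "city": record["city"].strip().title(),
        "income": float(income) if income else None,  # None marks a missing value
    }

def impute_median(rows, field):
    """Fill missing values in `field` with the median of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def min_max_normalize(rows, field):
    """Add a `<field>_scaled` column rescaled to [0, 1] for comparability."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field + "_scaled": (r[field] - lo) / span} for r in rows]

cleaned = [cleanse(r) for r in raw]
tidy = min_max_normalize(impute_median(cleaned, "income"), "income")
for row in tidy:
    print(row)
```

After these steps every row has the same fields, the two spellings of Montreal agree, and the missing income has been filled in. Whether median imputation is appropriate depends on the dataset; dropping the row or flagging it are equally valid choices in other settings.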
Understanding the business goal is foundational in the data analytics process because it determines which problems the analytics initiative is trying to solve. This understanding shapes the approach to data collection, feature generation, model building, and evaluation, so that the analytics outputs contribute directly to the business objectives.
Identifying and dealing with outliers during data wrangling is challenging because genuine outliers must be distinguished from errors, which requires thorough domain knowledge and statistical analysis. Outliers can disproportionately affect a model's performance, so robust methods are needed, such as trimming, transformation, or resistant statistics (for example, the median instead of the mean). Decisions on handling outliers also require balancing data integrity against model performance.
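One common way to flag candidate outliers is the 1.5 x IQR rule, sketched below with Python's statistics module. The data values are invented for illustration, and the 1.5 multiplier is a convention rather than something these notes prescribe. A flagged value is only a candidate: deciding whether it is an error or a genuine observation still requires domain knowledge.

```python
from statistics import mean, median, quantiles

# Hypothetical measurements with one suspicious value.
data = [12, 13, 13, 14, 15, 15, 16, 17, 95]

# Quartiles and the inter-quartile range (IQR).
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as candidate outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Trimming: drop the flagged values before summarizing.
trimmed = [x for x in data if lower_fence <= x <= upper_fence]

print("flagged:", outliers)
print("mean with outlier:", round(mean(data), 2))
print("mean after trimming:", mean(trimmed))
print("median (resistant):", median(data))
```

The mean shifts noticeably when the flagged value is removed, while the median barely depends on it. That is the sense in which the median is a resistant statistic.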
The main purpose of data analytics in organizations is to add value to data so that it becomes actionable and supports decision-making. In a business context, this is often termed 'monetization of data': the data is leveraged to generate insights or economic benefits, driving decisions that can increase profits or efficiency.
Exploratory Data Analysis (EDA) contributes significantly to data analytics by helping identify underlying patterns, relationships, and anomalies within the data through an iterative approach. It combines data wrangling with descriptive data analysis, which gives a deeper understanding of data types and distributions and covers handling missing values and transforming features to improve model outputs. EDA is a critical step in developing a data-driven solution because it explores which data features are beneficial for modeling and, ultimately, for decision-making.
Feature engineering is crucial in Exploratory Data Analysis because it creates new features from raw data that can make machine learning models more effective. The process includes adding, changing, or removing data features in order to find those that capture significant insights or patterns, which in turn improves the model's explanatory power and accuracy.
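The three operations named above (adding, changing, and removing features) can be illustrated with a small sketch. The order records and field names here are hypothetical, and which derived features actually help a model is something to test during EDA rather than assume.

```python
from datetime import date

# Hypothetical raw order records.
orders = [
    {"order_id": 101, "order_date": "2024-03-02", "quantity": 3, "unit_price": 19.99},
    {"order_id": 102, "order_date": "2024-03-05", "quantity": 1, "unit_price": 250.00},
]

def engineer(order):
    """Return a feature dictionary derived from one raw order."""
    day = date.fromisoformat(order["order_date"])
    return {
        # Added: a feature combining two raw columns.
        "total": round(order["quantity"] * order["unit_price"], 2),
        # Changed: the raw date string becomes a boolean a model can use.
        "is_weekend": day.weekday() >= 5,  # Monday is 0, so 5 and 6 are the weekend
        # Kept as-is.
        "quantity": order["quantity"],
        # Removed: order_id is an arbitrary identifier with no predictive meaning.
    }

features = [engineer(o) for o in orders]
for row in features:
    print(row)
```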
The choice of descriptive statistics for evaluating a dataset depends on the nature of the data (categorical or continuous, structured or unstructured) and on the characteristics to be described, such as central tendency, variation, outliers, skewness, and kurtosis. These factors determine whether metrics such as the mean, median, mode, standard deviation, or inter-quartile range are appropriate for describing the dataset effectively.
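As a rough sketch of how the data type drives the choice of summary, the function below reports mean, median, standard deviation, and inter-quartile range for numeric values, and the mode with frequency counts for categorical values. The function name describe and the sample data are assumptions made for this example.

```python
from collections import Counter
from statistics import mean, median, mode, quantiles, stdev

def describe(values):
    """Choose summary statistics according to the type of data."""
    if all(isinstance(v, (int, float)) for v in values):
        q1, _, q3 = quantiles(values, n=4)
        return {
            "kind": "continuous",
            "mean": mean(values),      # central tendency, sensitive to outliers
            "median": median(values),  # central tendency, resistant to outliers
            "stdev": stdev(values),    # variation, sensitive to outliers
            "iqr": q3 - q1,            # variation, resistant to outliers
        }
    # Means and standard deviations are undefined for categories,
    # so report the most frequent value and the frequency table instead.
    return {
        "kind": "categorical",
        "mode": mode(values),
        "counts": dict(Counter(values)),
    }

ages = [21, 22, 22, 23, 24, 25, 58]             # hypothetical, right-skewed
majors = ["CS", "Math", "CS", "Physics", "CS"]  # hypothetical categories

print(describe(ages))
print(describe(majors))
```

For the skewed ages sample the mean sits well above the median, which is a quick signal that the median and IQR describe the typical value and spread better than the mean and standard deviation do.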
Descriptive data analytics summarizes past data to understand trends and patterns, answering 'What has happened?' Predictive data analytics uses models to forecast future outcomes from historical data, answering 'What could happen?' Prescriptive data analytics evaluates potential interventions and recommends actions to optimize outcomes, answering 'What should we do?' Each type serves a distinct objective: descriptive for understanding, predictive for forecasting, and prescriptive for decision-making.
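The three questions can be shown side by side on a toy example. Everything below is hypothetical: the monthly sales figures, the straight-line trend used for the forecast, and the price, cost, and candidate order quantities used for the recommendation. Real predictive and prescriptive work would use richer models and would validate them.

```python
from statistics import mean

# Hypothetical unit sales for months 1 to 4.
months = [1, 2, 3, 4]
sales = [100, 110, 120, 130]

# Descriptive: what has happened?
average_sales = mean(sales)

# Predictive: what could happen? Fit a least-squares line and extrapolate.
mx, my = mean(months), mean(sales)
slope = sum((x - mx) * (y - my) for x, y in zip(months, sales)) / sum(
    (x - mx) ** 2 for x in months
)
intercept = my - slope * mx
forecast = intercept + slope * 5  # expected sales in month 5

# Prescriptive: what should we do? Pick the order quantity with the best
# profit, given the forecast demand (price and cost are assumed values).
PRICE, COST = 5.0, 3.0

def profit(quantity, demand):
    """Revenue on units actually sold minus the cost of all units ordered."""
    return PRICE * min(quantity, demand) - COST * quantity

candidates = [120, 140, 160]
best_quantity = max(candidates, key=lambda q: profit(q, forecast))

print("average monthly sales:", average_sales)
print("forecast for month 5:", forecast)
print("recommended order quantity:", best_quantity)
```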
The iterative nature of the data analytics process makes model deployment more effective by allowing data inputs, models, and assumptions to be continuously refined and re-evaluated. Each iteration helps uncover new insights, adjust strategies based on model feedback, and keep the model aligned with evolving business goals and data characteristics, which ultimately increases the accuracy and usefulness of the deployed solution.