STA 121 Textbook

Introduction to descriptive and inferential statistics for year 1 students.
A First Course in
Department of Mathematics, University of Lagos, Akoka-Yaba, Lagos, Nigeria.

CHAPTER ONE
INTRODUCTION TO STATISTICS

1.1 WHAT IS STATISTICS?
Statistics is the science of collecting, classifying, presenting, and interpreting data. Our society has developed into one where science and technology affect everything around us. Statistics is one of the most important of these scientific tools. Virtually all facets of our lives are affected by statistics. Statistics has become a necessary tool in many academic fields including the sciences, engineering, economics, psychology, sociology, nursing, and other health-related areas. It is the universal language of the sciences. Statistics is more than a "kit of tools". As potential users of statistics, we need to master the art of using these tools correctly. Careful use of statistics enables us, among other things, to understand the findings of scientific research.

Statistics is the science of collecting, classifying, presenting, and interpreting data. It involves the following:
1. Collection of data.
2. After collection of data, they are presented in tabular and graphical form.
3. After collection and presentation of data, the next step is analysis. To analyze the data we use averages and similar measures.
4. The last step is the interpretation of the result.

1.2 TWO TYPES OF STATISTICS
Statistics is subdivided into two general areas: descriptive statistics and inferential statistics. Descriptive statistics is what most people think of when they hear the word statistics. It includes the collection, presentation, and description of data. Inferential statistics refers to the technique of interpreting the values resulting from the descriptive techniques and then using them to make decisions and draw conclusions about the population.

Example: Consider the event of tossing a die. The die is rolled 100 times. Descriptive statistics is used to group the sample data into the following table:

Outcome of the roll             1    2    3    4    5    6
Frequency in the sample data   10   20   18   16   11   25

Inferential statistics can now be used to verify whether the die is fair or not.

1.3 INTRODUCTION OF BASIC TERMS
Population and Sample
Population and sample are two basic concepts of statistics.
Population is the entire collection, or set, of objects whose properties are to be analyzed. The population is the most fundamental idea in statistics. The population of concern must be carefully defined and is considered fully defined only when its membership list of elements is specified. The set of "all students who take Algebra" is an example of a well-defined population; another is "all lecturers in University of Lagos". A population is not only a collection of people but can also be a collection of animals, etc. There are two types of population. When the members in a population can be physically counted, the population is finite; otherwise it is infinite. The set of students in University of Lagos is a finite population.

[Diagram: types of data (continuous, ordinal, categorical)]

Non-numeric data are data that cannot be quantified, for example tribe, country, etc. Data in this form are either categorical or ordinal. Examples of ordinal non-numeric data are students' height and age group, while sex, country, etc. are examples of categorical non-numeric data.

Numeric data can be subdivided into two classifications: (1) discrete numeric data and (2) continuous numeric data. Counts will always yield discrete numeric data, e.g. the number of students in a school; a measure of a quantity will usually be continuous, e.g. the weight of weight lifters.

e) Statistic
A statistic is a quantity whose numerical value can be obtained from data. A statistic is a value that describes a sample. Most sample statistics are found with the aid of formulas. Examples are the mean, median, mode, etc.

f) Experiment
A planned activity whose results yield a set of data is known as an experiment.

1.4 DATA COLLECTION
One of the first problems a statistician faces is obtaining data. Data can be collected directly from respondents or from an established data bank. Data collected directly from the source or respondents are known as primary data and those from an established data bank are known as secondary data.
Primary data collection for statistical analysis is an involved process and includes the following important steps.
1. Defining the objectives of the survey or experiment. Example: estimate the average height of female students in UNILAG.
2. Identifying the target population.
3. Defining the strategy and method to be used for data collection.

There are two methods used to collect data. These are experiments and surveys. In an experiment, the investigator controls or modifies the environment and observes the effect on the variable. This is common in laboratories. In a survey, data are obtained by sampling some population of interest. Various methods used in order to obtain sample data from surveys are presented below.
When selecting a sample for a survey, it is necessary to construct a sampling frame. A sampling frame is a list of the elements that belong to the population from which the sample is drawn. An example is a list of all students in year one, Mathematics department.

EXERCISES 1.5
1. Select fifty students currently enrolled in your department and collect data for these three variables:
   i. number of courses enrolled in
   ii. total cost of textbooks
   iii. method of payment used for textbooks
   a) What is the population?
   b) Is the population finite or infinite?
   c) What is the sample?
   d) Classify the responses for each of the three variables as non-numeric data, discrete data, or continuous data.
2. Identify each of the following as examples of (1) non-numeric, (2) discrete, or (3) continuous variables:
   a) The hair colour of people in a concert show.
   b) The number of hours required to heal a patient of a disease.
   c) The length of time required to answer a telephone call at a business center.
   d) The number of pages per job coming off a computer printer.
   e) The kind of trees used as Christmas trees.
   f) The number of voters in a community.
   g) Whether a statement is true or false.
   h) The number of books in a library.
3. Define and explain the following terms: Population, Sample, Statistic, Statistics, Variable, Data, Experiment.

CHAPTER TWO
EXPLORATORY DATA ANALYSIS

Listing a large set of data does not present much of a picture to the reader. Exploratory data analysis (EDA) can help to condense the data into a more manageable form and therefore provide a better overall picture of the data. This can be accomplished with the aid of frequency distribution (tabular), numerical, and graphical EDA methods. The specific method of EDA that is appropriate depends on the type of data that are being investigated.

2.1 TABULAR EDA METHODS (FREQUENCY DISTRIBUTION)
A table listing all classes and their frequencies is called a frequency distribution. Let us demonstrate the concept of a frequency distribution by using the following set:

…

Let x represent these data values. We can use a frequency distribution to represent this set of data by listing the x values with their frequencies.

For a larger set of data it makes sense to combine the data into classes. We demonstrate this with this example:

… 61 … 78 72 … 49 65 79 83 41 86 … 47 … 64 78 47 54 43 73 85 48 66 48 85 86 82 …

The following guidelines and terminology will be used to group continuous-type data into classes of equal length. These guidelines can also be used for sets of discrete data that have a large range.
1. Determine the largest (maximum) and smallest (minimum) observations. The range is the difference, R = maximum - minimum.
2. A frequency distribution should have a minimum of 5 classes and a maximum of 20. For small data sets, use between 5 and 10 classes. For large data sets, use up to 20 classes.
3. Each data entry must fall into one and only one class.
4. There should be no gaps. Moreover, if there are no entries for a particular class, that class must still be included with a frequency of 0.
5. The first interval should begin about as much below the smallest value as the last interval ends above the largest.
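The grouping guidelines above can be sketched in code. This is a minimal illustration, not the textbook's own procedure: the function name `frequency_distribution`, its parameters, and the ten marks used as input are all hypothetical, and the sketch assumes integer-valued data and equal-width classes.

```python
from collections import Counter


def frequency_distribution(data, first_lower, width, n_classes):
    """Group integer data into equal-width classes (hypothetical helper).

    Class k covers first_lower + k*width up to first_lower + (k+1)*width - 1,
    so each value falls into exactly one class and there are no gaps.
    """
    last_upper = first_lower + n_classes * width - 1
    if any(x < first_lower or x > last_upper for x in data):
        raise ValueError("the classes do not cover every observation")
    counts = Counter((x - first_lower) // width for x in data)
    n = len(data)
    table = []
    for k in range(n_classes):
        lower = first_lower + k * width
        upper = lower + width - 1
        f = counts.get(k, 0)  # an empty class is kept with frequency 0
        table.append((lower, upper, f, f / n))
    return table


# Hypothetical marks, chosen only to exercise the function.
marks = [31, 45, 47, 52, 58, 58, 63, 71, 79, 95]
table = frequency_distribution(marks, first_lower=30, width=10, n_classes=7)
for lower, upper, f, rel in table:
    print(f"{lower}-{upper}: frequency {f}, relative frequency {rel:.0%}")
```

Keeping the empty 80-89 class in the output mirrors the guideline that a class with no entries must still appear with a frequency of 0.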
The intervals are called class intervals and the boundaries are called class boundaries. The class limits are the smallest and largest possible observed values in each class. The class mark is the midpoint of a class.
Using the classes 30 - 39, 40 - 49, ..., 90 - 99 for the above data, we obtain the summary table below in Table 2.2.

Table 2.2: Frequency distribution
Class limits   Tally              Frequency   Class mark   Relative frequency
30 - 39        |||||                  5           35             10%
40 - 49        ||||| ||||| |||       13           45             26%
50 - 59        ||||| ||||             9           55             18%
60 - 69        ||||| |                6           65             12%
70 - 79        ||||| |||              8           75             16%
80 - 89        ||||| |||              8           85             16%
90 - 99        |                      1           95              2%
Total                                50                         100%

Tables like this show us how the data are spread out or distributed; we call this a frequency distribution table or simply a frequency distribution.
The relative frequency for a class is the number of entries in the class divided by the total number of entries. It is calculated as

$\text{Relative frequency} = \frac{\text{frequency in the class}}{\text{total number of observations}}$

$\text{Percent} = 100 \times \text{relative frequency}$

For example, the relative frequency of class 50 - 59 is $\frac{9}{50} \times 100\% = 18\%$.

Example: Sample size = number of observations = n = 200

Category    Frequency   Relative frequency   Percentage
Wood           50            0.25               25%
Tiles          10            0.05                5%
Linoleum       20            0.10               10%
…
Total         200            1.00              100%

Another type of display is known as a cumulative frequency distribution which (as its name suggests) contains a column for the running total of frequencies for all classes. The cumulative frequency of a class is the total of all class frequencies up to and including the present class.
The cumulative frequency distribution of the example given above is as follows:

Table 2.3: Cumulative Frequency Table
Class   Class limits   Frequency   Cumulative frequency   Relative cum. frequency
1       30 - 39            5              5                    10%
2       40 - 49           13             18                    36%
3       50 - 59            9             27                    54%
4       60 - 69            6             33                    66%
5       70 - 79            8             41                    82%
6       80 - 89            8             49                    98%
7       90 - 99            1             50                   100%

Relative Cumulative Frequency is also called Percentage Cumulative Frequency. For example the Relative Cumulative Frequency for class 60 - 69 is 33
/ 50 × 100% = 66%.

2.2 GRAPHIC EDA METHODS
One of the most helpful ways to become acquainted with data is to use an initial exploratory technique that will result in a pictorial representation of the data. The displays visually reveal patterns of behaviour of the variable being studied. There are several graphic (pictorial) ways to describe data. The type of data and the idea to be presented determine the method used. Data can be presented graphically in many ways, such as the line graph, bar chart, pie chart, histogram and cumulative frequency curve (ogive).

To construct a bar chart, we start with horizontal and vertical axes. We label the variable being studied horizontally from left to right. The markings along the horizontal axis should correspond to the limits of the classes in the frequency distribution. The corresponding frequency in each class is measured vertically upward. A vertical bar is then drawn across each class interval with height equal to the frequency for that class. We could also draw a bar chart using relative frequencies instead of the frequencies for each class. The relative frequencies are then measured along the vertical axis as percentages.

Example 2.1: Use Table 2.2 to construct a frequency bar chart and a relative frequency bar chart.

Solution
[Figure 2.1: Bar chart. Vertical axis: Frequency. Horizontal axis marked at 30, 39, 49, 59, 69, 79, 89, 99.]

Example 2.2: A computer anxiety questionnaire was given to 300 children in a computer course. One of the questions was "I enjoy …". The responses to this particular question were:

Response             Number
Strongly Agree         60
Agree                  85
Slightly Agree         40
Slightly Disagree      50
Disagree               35
Strongly Disagree      30

Solution
[Figure 2.2: Bar chart of responses. Vertical axis: Frequency. Horizontal axis: Response.]

Example 2.3: The following table shows the intake through JAMB by the Faculty of Science of a certain University in three consecutive years.
Table 2.5
Department   2002   2003   2004
…              43     40     35
…              28     35     42
…              45     40     35
…              33     25     28
…              38     45      …

Solution
[Figure 2.3: Component bar chart for JAMB Admission. Vertical axis: Frequency. Horizontal axis: Department.]
[Figure 2.4: Multiple bar chart for JAMB Admission, with bars for 2002, 2003 and 2004. Horizontal axis: Department.]

Pie Chart
A pie chart (circle graph) is used to display relative frequencies for the data. We draw a circle and then divide it into a series of wedges or slices to represent each class in the relative frequency distribution. The size of each slice is proportional to the percentage of the data that fall into the corresponding class.

Example 2.4: Represent the responses in Example 2.2 in a pie chart.

Solution

$\text{Angle for each class} = \frac{\text{number in the class}}{\text{total number of observations}} \times 360^\circ$

Response             Number   Angle
Strongly Agree         60      72°
Agree                  85     102°
Slightly Agree         40      48°
Slightly Disagree      50      60°
Disagree               35      42°
Strongly Disagree      30      36°

[Figure 2.5: Pie chart for responses. Strongly Agree 20%, Agree 28%, Slightly Agree 13%, Slightly Disagree 17%, Disagree 12%, Strongly Disagree 10%.]

Histogram
A histogram is a type of bar chart representing an entire set of data. The histogram is made up of the following components:
a. A title, which identifies the population of concern.
b. A vertical scale, which identifies the frequencies in the various classes.
c. A horizontal scale, which identifies the variable. Values for the class boundaries, class limits, or class marks may be labeled along the x-axis. Use whichever one of these sets of class numbers best represents the variable.

Using Table 2.2, we draw the histogram.

Table 2.6
Class   Class limits   Frequency   Class boundaries   Class center
1       30 - 39            5         29.5 - 39.5         34.5
2       40 - 49           13         39.5 - 49.5         44.5
3       50 - 59            9         49.5 - 59.5         54.5
4       60 - 69            6         59.5 - 69.5         64.5
5       70 - 79            8         69.5 - 79.5         74.5
6       80 - 89            8         79.5 - 89.5         84.5
7       90 - 99            1         89.5 - 99.5         94.5

[Figure 2.6: Histogram. Vertical axis: Frequency. Horizontal axis: Marks, with class boundaries from 29.5 to 99.5.]

There is a single vertical line between the first two bars, lying between 39 and 40. However, it is not clear what value this line should represent: is it 39 or 40 or what? To resolve this, we agree that the vertical line represents 39.5, which is the class boundary between the two classes. In the same way, the next vertical line represents 49.5 and so forth.

Another type of graphical display is the frequency polygon. To construct this type of graph, we first determine the measurement corresponding to the midpoint of each class. This value is called the class mark, or class center, or class midpoint and is given by

$\text{Class center} = \frac{\text{lower limit} + \text{upper limit}}{2}$

For example, in the class 29.5 to 39.5, the class center is

$\text{Class center} = \frac{29.5 + 39.5}{2} = 34.5$

[Figure 2.7: Frequency polygon. Vertical axis: Frequency. Horizontal axis: Marks.]

The mode is the value of the piece of data that occurs with the greatest frequency. From the histogram in Figure 2.6, the mode is 46. To obtain this, use a ruler to connect both the right and left edges of the tallest bar to the bars on both sides of the tallest bar, then locate their point of intersection and trace this down to the horizontal axis using a vertical line. Where this line meets the x-axis is the mode.

The modal class is the class with the highest frequency. A data set with two modes is called bimodal. A data set with three modes is called trimodal; if there are more than three modes, it is called multimodal.

We now present a relative frequency histogram. Note that the total area of this histogram is equal to one. The shape of this and that of the histogram is the same.
The relative frequency histogram g(x) is given by

$g(x) = \frac{\text{frequency of the class}}{\text{total number of observations} \times \text{class interval}}$

Example 2.5
The following 30 gains of some private entrepreneurs were recorded to the nearest 1 million Naira.

… 36 39 40 3 3 4 … 14 28 32 34 1 3 3 12 12 13 …

Construct the relative frequency histogram.

Solution
Let $c_0, c_1, c_2, c_3$ and $c_4$ represent the class boundaries. Let $c_0 = 0.5$ and $c_1 = 3.5$ with 14 observations in between; $c_2 = 10.5$ with 6 observations; $c_3 = 29.5$ with 5 observations and $c_4 = 40.5$ with 5 observations. For instance, the height over the first class is $\frac{14}{30 \times 3} \approx 0.156$. This yields the following relative frequency histogram:

[Figure: Relative frequency histogram for Example 2.5.]

[Figure 2.8: Cumulative frequency curve. Vertical axis: Cumulative Frequency.]

a) Quartiles
   Q1 = 25th percentile = 46.5
   Q2 = 50th percentile = 59.5
   Q3 = 75th percentile = 72
b) The median is the 50th percentile and it is equal to 59.5.
c) 30th percentile = …
   70th percentile = 68
d) i. Range = highest mark - lowest mark = 95 - 31 = 64, from the raw data in section 2.1.
   ii. Interquartile range = Q3 - Q1 = 72 - 46.5 = 25.5
   iii. Semi-interquartile range = $\frac{Q_3 - Q_1}{2} = \frac{72 - 46.5}{2} = \frac{25.5}{2} = 12.75$
e) The 60% mark intercepts the curve at a cumulative frequency of 25 students and the 80% mark intercepts the curve at a cumulative frequency of 43. Therefore, the number of students that scored between the 60% and 80% marks is 43 - 25 = 18 students.
f) If 60% of the students failed, the pass mark will be the 60th percentile mark. Trace this to the curve and the pass mark will be 67.

Stem-and-Leaf
A stem-and-leaf display combines the visual impact of the histogram or bar chart with the detail of the original list of data entries. This technique, very simple to create and use, is a combination of a graphic technique and a sorting technique. The data values themselves are used to do this sorting. The stem is the leading digit(s) of the data, and the leaf is the trailing digit(s).
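The split into leading and trailing digits can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `stem_and_leaf` and the ten scores are hypothetical, the data are taken to be non-negative whole numbers, and the last digit is used as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: sorted leaves}, using the last digit as the leaf.

    Hypothetical helper; assumes non-negative whole numbers.
    """
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leading digit(s), trailing digit
        display[stem].append(leaf)
    return dict(display)


# Hypothetical scores, used only to show the layout of the display.
scores = [52, 33, 44, 48, 49, 36, 50, 61, 65, 72]
display = stem_and_leaf(scores)
for stem in sorted(display):
    print(f"{stem} | {' '.join(str(leaf) for leaf in display[stem])}")
```

Sorting before splitting is what makes the display double as a sorting technique: each row lists its leaves in increasing order.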
For example, the numerical data value 325 might be split 32 - 5 as shown:

Leading Digits   Trailing Digits
     32                5

Example 2.7
Construct the stem-and-leaf display of the following sets of data:

i. 52 33 44 48 49 36 … 50 61 65 72 68 55 60 53 33 41 68 70 82 85 … 37 45 58 65 43 45 61 81 …

ii. 4.9 3.2 7.5 3.3 3.8 4.5 … 5.7 … 8.3 4.1 2.1 … 7.4 3.5 2.5 5.6 …

Solution
i. In a stem-and-leaf plot, we consider all entries. Let us look at the groups of entries in the 30s, 40s, 50s, 60s, 70s and 80s:

30s: …
40s: …
50s: …
60s: 61 65 68 60 68 65 …
70s: 72 70 …
80s: 82 85 81

We separate the last digit of each entry from the primary groups 30, 40, 50, 60, 70, 80 and we display the results with the stems on the left and the leaves on the right.

[Stem-and-leaf display for data set i]

ii. For this data set, the stem is the whole number including the decimal point while the leaf is the trailing decimal digit. The groups of entries are:

1: …
2: … 2.8 2.4 2.5
3: 3.5 3.8 3.3 3.4 … 3.3
4: … 4.5 4.8 4.1 4.5
5: … 5.7 5.6
6: …
7: … 7.8 7.1
8: … 8.3

The corresponding stem-and-leaf display is:

[Stem-and-leaf display for data set ii]

Unfortunately, not all data can be organized into a stem-and-leaf display. There may be too much spread in the data, e.g. more than 10000, or there may be very little spread. Further, the values in the data should not be extremely large. For example, if the values were in hundreds of thousands, such as 345,005 and …,281, then just separating the last digit would be meaningless.

Line Graph
Line graphs are diagrammatical representations of the relation between two variables x and y. The co-ordinate points of these variables are joined together to give the line graph.
Example 2.2.8
Draw a line graph to represent the information below:

Before   14   20   21   24   22   25   26
After     …    …    …    …   30   27   34

Solution
[Figure 2.9: Line graph]

BOX-AND-WHISKER DIAGRAM
Box plots are extremely useful graphical devices for describing interval and numerical variables. This plot is based on the five-number summary of a set of data (smallest, first quartile, median or second quartile, third quartile and the largest) and is called a box-and-whisker diagram or, more simply, a box plot. It is built around the three values Q1, Q2 and Q3. It is also useful for identifying outliers. The ends of the box are the lower and upper sample quartiles and the length of the box is the IQR for the variable. The median is marked by a line inside the box. The lines extending from the box reach the smallest and largest observations that are within the interval (Q1 - 1.5IQR, Q3 + 1.5IQR). Points that are in the interval (Q1 - 3IQR, Q1 - 1.5IQR) are negative outliers and points that are in the interval (Q3 + 1.5IQR, Q3 + 3IQR) are positive outliers, with those outside the interval (Q1 - 3IQR, Q3 + 3IQR) being extreme outliers.

[Figure 9: Box plot, marked with Minimum, Q1, Q2, Q3 and Maximum.]

To construct a horizontal box-and-whisker diagram, draw a horizontal axis that is scaled to the data. Above the axis draw a rectangular box with the left and right sides drawn at Q1 and Q3 and with a vertical line segment drawn at the median, Q2. A left whisker is drawn as a horizontal line segment from the minimum to the midpoint of the left side of the box, and a right whisker is drawn as a horizontal line segment from the midpoint of the right side of the box to the maximum. Note that the length of the box is equal to the interquartile range (Q3 - Q1). The left and right whiskers contain the first and fourth quarters of the data.
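The quantities needed to draw a box plot can be computed as in the sketch below. It rests on two assumptions that the description above does not fix: quartiles are taken by the (n + 1)p interpolation rule that Section 3.5 introduces, and the function names `percentile` and `box_plot_summary` are hypothetical. The 24 values are a sample whose first quartile works out to 47.25 under that rule.

```python
def percentile(sorted_data, p):
    """(100p)th sample percentile by the (n + 1)p rule (assumed convention)."""
    n = len(sorted_data)
    position = (n + 1) * p
    r = int(position)        # rank of the lower order statistic
    fraction = position - r  # weight given to the next order statistic
    if r < 1:
        return sorted_data[0]
    if r >= n:
        return sorted_data[-1]
    lower, upper = sorted_data[r - 1], sorted_data[r]
    return lower + fraction * (upper - lower)


def box_plot_summary(data):
    """Quartiles, whisker ends and the points lying outside the whiskers."""
    xs = sorted(data)
    q1, q2, q3 = (percentile(xs, p) for p in (0.25, 0.50, 0.75))
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low < x < high]
    return {
        "min": xs[0], "q1": q1, "median": q2, "q3": q3, "max": xs[-1],
        "whiskers": (inside[0], inside[-1]),
        "outside": [x for x in xs if not (low < x < high)],
    }


sample = [24, 31, 31, 40, 45, 47, 48, 48, 48, 49, 50, 50,
          50, 50, 50, 50, 51, 53, 53, 56, 60, 70, 71, 76]
summary = box_plot_summary(sample)
print(summary)
```

Other quartile conventions exist and give slightly different box edges for small samples, so the rule in use is worth stating alongside any box plot.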
Example 2.2.9: Draw the box plot of the data below.

24 31 31 40 45 47 48 48 48 49 50 50 50 50 50 50 51 53 53 56 60 70 71 76

Solution
The five-number summary is: minimum = 24, Q1 = 47.25, Q2 = 50, Q3 = 53, and maximum = 76.

[Figure 10: Box plot. Horizontal axis marked at 20, 40, 60, 80.]

EXERCISES 2.3
1. A police constable checked the speed of cars as they were travelling down …

   … 38 40 … 40 55 … 55 40 38 55 60 50 38 … 60

   Construct a dot plot of these data.

2. On the first day of last semester, 30 students were asked for their travel times from home to the University (to the nearest 5 minutes). The resulting data were as follows:

   … 25 … 10 20 … 15 5 20 30 25 45 40 40 35 20

   Construct a stem-and-leaf display.

3. The following 45 amounts are the fees that Fast Delivery charged for delivering small freight items last Monday morning:

   2.57 4.21 1.05 3.06 4.50 5.05 3.45 2.15 0.92
   3.12 2.67 0.76 4.13 5.93 4.15 2.03 0.57 1.85
   4.10 3.41 1.86 2.53 1.46 3.85 5.12 3.24 1.89
   2.51 0.95 1.04 2.21 5.86 3.57 2.18 4.29 3.50
   0.91 0.82 1.47 4.25 3.81 2.48 1.27 5.35 3.33

   a) Classify these data into a grouped frequency distribution by using classes of 0.01 - 1.00, 1.01 - 2.00, ..., 5.01 - 6.00.
   b) Find the class width.
   c) For the class 4.01 - 5.00, name the value of: (i) the class center, (ii) the class limits, (iii) the class boundaries.
   d) Construct a relative frequency histogram of these data.

4. The incomes of 80 employees of a company are recorded as follows in ₦'000 per annum.

   480 650 730 450 357 379 680 880 720 500
   535 600 710 375 481 639 700 850 650 400
   885 730 650 480 537 300 495 755 800 450
   633 741 839 395 485 631 737 810 561 492
   … 810 750 653 495 849 675 800 795
   … 713 874 815 873
   … 414 312 481 672 411 813 817 361 845
   …

   Use an appropriate class interval to construct a frequency distribution. Draw a histogram to represent the data and a frequency polygon on your histogram. Estimate the mode from your histogram.
5. The following table shows the frequency distribution of marks of 200 students in a mathematics examination.

   Mark        10-19   20-29   30-39   40-49   50-59   60-69   70-79   80-89
   Frequency     4      26      52      16      36      34      20      12

   i) Draw a cumulative frequency curve and estimate the quartiles.
   ii) Calculate the interquartile and semi-interquartile range from your graph.
   iii) Find the pass mark if only 20% of the students should pass.
   iv) How many of the students scored between 60% and 85%?

6. Use the table in Exercise 2.3.5 to answer the following questions:
   i. Draw a bar chart of the frequency distribution.
   ii. Draw a pie chart of the frequency distribution.
   iii. Draw a histogram of the frequency distribution.

7. The following table shows the number admitted into the postgraduate programme for 2 years.

   Department          2003   2004
   Chemical Engn.        41     37
   Electrical Engn.      40     48
   Surveying             35     25
   …                     45     48
   Mechanical Engn.      50     45
   Geology               35     45

   Draw component and multiple bar charts for this data.

8. The year, x, and the birthrate, y, for 1980 - 2000 were as follows:

   Year   Birthrate
   1980   25,004
   1981   25,100
   …      24,345
   …      24,850
   …      23,563
   …      23,236
   1985   24,450
   1986   18,053
   1987   19,245
   1988   18,348
   1989   15,434
   1990   13,347
   1991   …
   1992   14,111
   1993   15,243
   1994   16,172
   1995   18,815
   1996   17,345
   1997   16,457
   1998   18,413
   1999   19,400
   2000   18,721

   i. Construct a line graph of these birthrates.
   ii. Interpret your output.

9. A country's foreign reserves are as given for some period of months in '000,000 of dollars:

   … 24 … 32 49 … 100 120 150 200 250 300 350 400 450 500

   i. Group these data into six to eight unequal classes.
   ii. Construct a relative frequency histogram.
   iii. Describe the distribution of the reserve.

CHAPTER THREE
NUMERICAL EDA METHODS

Numerical EDA methods are of two parts, namely:
i. Measures of location or central tendency
ii. Measures of dispersion, variation or spread

MEASURES OF LOCATION OR CENTRAL TENDENCY
Measures of central tendency are numerical values that tend to locate in some sense the middle of a set of data. The term average is often associated with these measures. Each of the several measures of central tendency can be called the average value. They are the mean, median, and mode.

3.1 THE SIGMA (Σ) NOTATION
The Greek capital letter sigma (Σ) is used in Mathematics to indicate the summation of a set of addends. Each of these addends must be of the form of the variable following Σ. For example,
i. $\sum x$ means sum the variable x.
ii. $\sum (x - 3)$ means sum the set of addends that are 3 less than the values of each x.

When large quantities of data are collected, it is usually convenient to index the responses so that at a future time their source will be known. This indexing is shown in the notation by using i (or j or k) and affixing the index of the first and last addend at the bottom and top of the Σ. For example,

$\sum_{i=1}^{4} x_i^2$

means add the values of the squares of the x's, starting with source number 1 and proceeding to source number 4.

Example 3.1
For the data 1, 2, 4, 6, 5, 7, 3, find (i) $\sum x$ (ii) $\sum x^2$ (iii) $(\sum x)^2$

Solution
$\sum x = 1 + 2 + 4 + 6 + 5 + 7 + 3 = 28$
$\sum x^2 = 1 + 4 + 16 + 36 + 25 + 49 + 9 = 140$
$(\sum x)^2 = (28)^2 = 784$

Example 3.2
Simplify $\sum_{i=1}^{3} (3x_i + 1)$ and find its value when $x_1 = x_2 = x_3 = 1$.

Solution
$\sum_{i=1}^{3} (3x_i + 1) = (3x_1 + 1) + (3x_2 + 1) + (3x_3 + 1)$
$= (3x_1 + 3x_2 + 3x_3) + (1 + 1 + 1)$
$= 3\sum_{i=1}^{3} x_i + 3$
$= 3(1 + 1 + 1) + 3 = 9 + 3 = 12$

3.2 MEAN
To find the mean, $\bar{x}$ (read "x bar"), you add all the values of the variable x and divide by the number of these values, n. We express this in formula form as

$\text{Sample mean} = \bar{x} = \frac{\sum x}{n}$   (3.1)

Example 3.3: Find the mean of the following numbers: 2, 3, 4, 2, 3, 2, 4, 8.

$\bar{x} = \frac{2 + 3 + 4 + 2 + 3 + 2 + 4 + 8}{8} = \frac{28}{8} = 3.5$

When the sample data has the form of a frequency distribution, we will need to make a slight adaptation in order to find the mean. Consider the frequency distribution of Table 3.1.

Table 3.1: Ungrouped frequency distribution
x   1   2   3   4   5
f   4   8   5   4   7

To calculate the mean $\bar{x}$ using the above formula, we have

$\bar{x} = \frac{1 + 1 + 1 + 1 + 2 + 2 + \dots + 5 + 5}{28}$

$\sum x = 4(1) + 8(2) + 5(3) + 4(4) + 7(5) = 86 = \sum fx$

Therefore, the mean of a frequency distribution may be found by dividing the sum of the data, $\sum fx$, by the sample size, $\sum f$. We can rewrite formula (3.1) for use with a frequency distribution as:

$\bar{x} = \frac{\sum fx}{\sum f}$   (3.2)

Table 3.2
x       f     fx
1       4      4
2       8     16
3       5     15
4       4     16
5       7     35
Total  28     86

$\bar{x} = \frac{\sum fx}{\sum f} = \frac{86}{28} = 3.07$

3.2.1 Mean of Grouped Data
The class centers (marks) are now used as representative values of the observed data.

Example 3.4: What is the mean of this distribution?

Table 3.3
Mark      Frequency (f)   Class center (x)     fx
30 - 39        5               34.5           172.5
40 - 49       10               44.5           445
50 - 59       15               54.5           817.5
60 - 69       10               64.5           645
70 - 79        5               74.5           372.5
Total         45                             2452.5

$\text{Mean} = \bar{x} = \frac{\sum fx}{\sum f} = \frac{2452.5}{45} = 54.5$

3.2.2 Using Assumed Mean
The method of using an assumed mean makes strenuous calculations with large numbers easier.
For the ungrouped data, we use

$\bar{x} = A + \frac{\sum d}{N}$   (3.3)

and for the grouped data, we use

$\bar{x} = A + \frac{\sum f_i d_i}{\sum f_i}$   (3.4)

where A is the assumed mean and $d_i = x_i - A$ are the deviations of $x_i$ from A.

Example 3.5: Using the data in Example 3.4, let A = 44.5.

Table 3.4
Mark      Frequency (f)   Class center (x)   d = x - A     fd
30 - 39        5               34.5            -10         -50
40 - 49       10               44.5              0           0
50 - 59       15               54.5             10         150
60 - 69       10               64.5             20         200
70 - 79        5               74.5             30         150
Total         45                                           450

$\bar{x} = A + \frac{\sum fd}{\sum f} = 44.5 + \frac{450}{45} = 44.5 + 10 = 54.5$

which is the same as in the previous example.

3.2.3 Harmonic Mean
This is the reciprocal of the average of reciprocals. It is usually represented by $\bar{x}_H$ and defined by

$\bar{x}_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$   (3.5)

Example 3.6: Find the harmonic mean for the following data: 2, 5, 3, 6, 7.

Solution
$\bar{x}_H = \frac{5}{1/2 + 1/5 + 1/3 + 1/6 + 1/7} = \frac{5}{0.5 + 0.2 + 0.333 + 0.167 + 0.143} = \frac{5}{1.343} \approx 3.72$

3.2.4 Geometric Mean
This is the nth root of the product of the n numbers in a data set. It is usually represented by $\bar{x}_G$ and defined by

$\bar{x}_G = \sqrt[n]{x_1 x_2 \cdots x_n} = (x_1 x_2 \cdots x_n)^{1/n}$   (3.6)

Example 3.7: Find the geometric mean for the data above in Example 3.6.

$\bar{x}_G = \sqrt[5]{2 \times 5 \times 3 \times 6 \times 7} = \sqrt[5]{1260} = 4.17$

3.2.5 Arithmetic Mean
This has been dealt with earlier. It is represented by $\bar{x}_A$ and defined as stated in (3.1). The arithmetic mean of the sample above is

$\bar{x}_A = \frac{2 + 5 + 3 + 6 + 7}{5} = \frac{23}{5} = 4.6$

Note that the expression $\bar{x}_H \le \bar{x}_G \le \bar{x}_A$ is true for any data.

3.3 MEDIAN
The median is the value of the data that occupies the middle position when the data are ranked in order according to size. The depth (number of positions from either end), or position, of the median is determined by the formula

$\text{Depth of median} = \frac{n + 1}{2}$   (3.8)

If the number of measurements n is an odd number, the median is the middle value. If the number of measurements is even, the median is the average of the middle two values. For example, find the median of these numbers: 2, 4, 6, 8, …
In our example, n = 5, and therefore the depth of the median is

$\text{depth} = \frac{5 + 1}{2} = 3$

That is, the median is the third number from either end in the ranked data, i.e. the median is 6.
Let us look at these data: 4, 6, 7, 8, 10, 12.
Here n = 6, and therefore the median depth is

$\text{depth} = \frac{6 + 1}{2} = 3.5$

This is to say that the median is halfway between the third and fourth data values. To find the number halfway between any two values, add the two values together and divide by 2. In this case, add 7 and 8, then divide by 2. The median is 7.5.

For grouped data, the median is obtained by interpolation and given by

$\text{Median} = L_1 + \left( \frac{\frac{N}{2} - F_b}{f_m} \right) c$   (3.9)

where $L_1$ is the lower class boundary of the median class, $c$ is the size (width) of the median class interval, $N$ is the total frequency, $F_b$ is the sum of frequencies of all classes below the median class, and $f_m$ is the frequency of the median class.

Example 3.8: Find the median mark in the table below:

Marks       30-39   40-49   50-59   60-69   70-79
Frequency     5      10      15      10       5

Solution

Table 3.5
Mark      Class boundaries   Frequency   Cumulative frequency
30 - 39     29.5 - 39.5           5               5
40 - 49     39.5 - 49.5          10              15
50 - 59     49.5 - 59.5          15              30
60 - 69     59.5 - 69.5          10              40
70 - 79     69.5 - 79.5           5              45
Total                            45

The median class is 50 - 59, so $L_1 = 49.5$, $N = 45$, $F_b = 15$, $f_m = 15$ and $c = 10$.

$\text{Median} = 49.5 + \left( \frac{\frac{45}{2} - 15}{15} \right) \times 10 = 49.5 + \frac{22.5 - 15}{15} \times 10 = 49.5 + 0.5 \times 10 = 49.5 + 5 = 54.5$

Comparing mean and median
The mean for the data is the same as the median due to the symmetry of the data. In a symmetric distribution the mean and median are equal. In a positively skewed distribution the mean is greater than the median. In a negatively skewed distribution the mean is smaller than the median.

3.4 MODE
The mode for a set of data is the value that occurs most frequently.

Example 3.9: Find the modes of the following data.

1, 1, 1, 2, 2, 2, 2, …

Solution
The values with the highest number of occurrences are 2 and 4. They both have an equal frequency of 4. That is, we have a bimodal case.
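The interpolation in formula (3.9) and the search for modes can be sketched as below. The function names are hypothetical, the grouped table is the one from Example 3.8, and the short list passed to `modes` is made-up data chosen so that 2 and 4 each occur four times.

```python
from collections import Counter


def grouped_median(boundaries, freqs):
    """Median by interpolation inside the median class.

    boundaries holds the class boundaries in increasing order
    (one more entry than freqs).
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("empty frequency table")
    half = n / 2
    below = 0  # sum of frequencies of all classes below the current one
    for k, f in enumerate(freqs):
        if below + f >= half:
            width = boundaries[k + 1] - boundaries[k]
            return boundaries[k] + (half - below) / f * width
        below += f


def modes(data):
    """All values that share the highest frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


boundaries = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5]
freqs = [5, 10, 15, 10, 5]
median = grouped_median(boundaries, freqs)

hypothetical = [1, 1, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4]
print(median, modes(hypothetical))
```

Returning every value with the top count, rather than a single one, is what lets the same function report bimodal and multimodal cases.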
For grouped data, the mode is obtained by

$\text{Mode} = L_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} (L_2 - L_1)$   (3.10)

where
$f_0$ = the frequency of the group before the group that appears most often,
$f_1$ = the frequency of the group that appears most often,
$f_2$ = the frequency of the group after the group that appears most often,
$L_1$ = the lower limit of the group with $f_1$, and
$L_2$ = the upper limit of the group with $f_1$,

OR

$\text{Mode} = L_1 + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) c$   (3.11)

where
$L_1$ = lower class boundary of the modal class,
$\Delta_1$ = excess of the modal frequency over the next lower class frequency,
$\Delta_2$ = excess of the modal frequency over the next higher class frequency, and
$c$ = size of the modal class interval.

Example 3.10: Find the mode of the data given in Example 3.8 using the two methods given above.

Solution
Method I - (3.10)

$\text{Mode} = 49.5 + \frac{15 - 10}{2(15) - 10 - 10} (59.5 - 49.5) = 49.5 + \frac{5}{10} \times 10 = 49.5 + 5 = 54.5$

Method II - (3.11)
$L_1 = 49.5$, $\Delta_1 = 15 - 10 = 5$, $\Delta_2 = 15 - 10 = 5$, $c = 59.5 - 49.5 = 10$

$\text{Mode} = 49.5 + \left( \frac{5}{5 + 5} \right) \times 10 = 49.5 + \frac{5}{10} \times 10 = 49.5 + 5 = 54.5$

Note: Mean = Median = Mode for the data considered above, due to the symmetry of the data.
Also

$\text{Mean} - \text{Mode} \approx 3 (\text{Mean} - \text{Median})$   (3.12)

3.5 DECILES, PERCENTILES, QUARTILES
If the sample observations are ordered from small to large, the resulting ordered data are called the order statistics of the sample. Let us have the following data:

24 31 31 40 45 47 48 48 48 49 50 50
50 50 50 50 51 53 53 56 60 70 71 76

We give ranks to these order statistics and use the rank as the subscript on x. The first order statistic $x_1 = 24$ has rank 1, the second order statistic $x_2 = 31$ has rank 2, the third order statistic $x_3 = 31$ has rank 3, ..., and the 24th order statistic $x_{24} = 76$ has rank 24. It is clear here that $x_1 \le x_2 \le \dots \le x_{24}$.
From these order statistics, it is rather easy to find the sample percentiles. If 0 < p < 1, the (100p)th sample percentile has approximately np sample observations less than it and also n(1 - p) sample observations greater than it.
One way of achieving this is to take the (100p)th sample percentile as the (n+1)p-th order statistic, provided that (n+1)p is an integer. If (n+1)p is not an integer but is equal to r plus some proper fraction, say a/b, use a weighted average of the r-th and the (r+1)-st order statistics. That is, define the (100p)th sample percentile as

$$\tilde{\pi}_p = x_r + \frac{a}{b}\,(x_{r+1} - x_r) = \left(1 - \frac{a}{b}\right) x_r + \frac{a}{b}\, x_{r+1} \tag{3.13}$$

Note that this is simply a linear interpolation between $x_r$ and $x_{r+1}$.

For illustration, consider the 24 ordered examination scores. With p = 1/2, we find the 50th percentile by averaging the 12th and 13th order statistics, since (n+1)p = 25/2 = 12.5:

$\tilde{\pi}_{0.50}$ = (1/2)$x_{12}$ + (1/2)$x_{13}$ = (50 + 50)/2 = 50

With p = 1/4, we have (n+1)p = 25/4 = 6.25, and thus the 25th sample percentile is

$\tilde{\pi}_{0.25}$ = (1 − 0.25)$x_6$ + 0.25$x_7$ = (0.75)(47) + (0.25)(48) = 35.25 + 12 = 47.25

With p = 3/4, so that (n+1)p = (25)(3/4) = 18.75, the 75th sample percentile is

$\tilde{\pi}_{0.75}$ = (1 − 0.75)$x_{18}$ + (0.75)$x_{19}$ = (0.25)(53) + (0.75)(53) = 13.25 + 39.75 = 53

Note that approximately 50%, 25% and 75% of the sample observations are less than 50, 47.25 and 53, respectively.

As discussed in Chapter Two, the 50th percentile is the median of the sample. The 25th, 50th and 75th percentiles are the first, second and third quartiles of the sample, denoted by $Q_1$, $Q_2$ and $Q_3$. The 10th, 20th, 30th, ..., 90th percentiles are the deciles of the sample. So the 50th percentile is also the median, the second quartile and the fifth decile.

For example, the 2nd and 9th deciles would be calculated as follows: (n+1)p = (25)(2/10) = 5 for the second decile and (n+1)p = (25)(9/10) = 22.5 for the ninth decile.

$\tilde{\pi}_{0.20}$ = (1 − 0)$x_5$ + 0·$x_6$ = $x_5$ = 45

$\tilde{\pi}_{0.90}$ = (1 − 0.5)$x_{22}$ + 0.5$x_{23}$ = (0.5)(60) + (0.5)(70) = 30 + 35 = 65

Example 3.12: Let x denote the concentration of acid in milligrams per liter. Twenty observations of x are:

115 116 117 118 118 118 119 121 122 125 126 128 129 129 130 131 ...

(a) Find the midrange, interquartile range and median.
(b) Draw a box-and-whisker diagram.
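The (n+1)p rule of (3.13) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `sample_percentile` and the four-value list `scores` are hypothetical, and p is converted to an exact fraction so that a product such as (n+1)p = 5 is not disturbed by floating-point rounding.

```python
import math
from fractions import Fraction


def sample_percentile(data, p):
    """(100p)th sample percentile by the (n + 1)p rule of formula (3.13)."""
    xs = sorted(data)
    n = len(xs)
    # Exact arithmetic: 0.25 becomes 1/4, 0.9 becomes 9/10, and so on.
    position = (n + 1) * Fraction(p).limit_denominator(10_000)
    r = math.floor(position)      # integer part r
    frac = position - r           # proper fraction a/b
    if r < 1 or r > n or (r == n and frac > 0):
        raise ValueError("(n + 1)p falls outside the ranks 1..n")
    if frac == 0:
        return float(xs[r - 1])   # (n + 1)p is an integer: take x_r itself
    # Linear interpolation between x_r and x_(r+1); lists are 0-based.
    return float((1 - frac) * xs[r - 1] + frac * xs[r])


# Hypothetical data, chosen only to make the arithmetic easy to follow.
scores = [10, 20, 30, 40]
print(sample_percentile(scores, 0.25))  # 12.5
print(sample_percentile(scores, 0.50))  # 25.0
print(sample_percentile(scores, 0.75))  # 37.5
```

With n = 4 and p = 1/4, (n+1)p = 1.25, so the result is 0.75 × 10 + 0.25 × 20 = 12.5, which mirrors the hand calculations in the text. Statistical libraries often use slightly different percentile conventions, so their results may not match this rule exactly.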
Solution

(a) Midrange = average of the extremes = (115 + 134)/2 = 249/2 = 124.5

With p = 1/4, we have (n+1)p = 21/4 = 5.25 and the 25th sample percentile is

$Q_1 = \tilde{\pi}_{0.25}$ = (0.75)$x_5$ + (0.25)$x_6$ = (0.75)(118) + (0.25)(118) = 118

With p = 1/2, we have (n+1)p = 21/2 = 10.5 and the 50th sample percentile is

$Q_2 = \tilde{\pi}_{0.50}$ = 0.5$x_{10}$ + 0.5$x_{11}$ = (0.5)(125) + (0.5)(126) = 62.5 + 63 = 125.5

With p = 3/4, we have (n+1)p = 21 × 3/4 = 15.75 and the 75th sample percentile is

$Q_3 = \tilde{\pi}_{0.75}$ = (1 − 0.75)$x_{15}$ + 0.75$x_{16}$ = (0.25)(130) + (0.75)(131) = 32.5 + 98.25 = 130.75

Interquartile range = $Q_3 - Q_1$ = 130.75 − 118 = 12.75

Median = $Q_2$ = 125.5

(b) [Figure 3.3: box plot of the data, drawn on an axis running from 110 to 135]

Tukey suggested a method for defining outliers that is resistant to the effect of one or two extreme values and makes use of the interquartile range. In a box-and-whisker diagram, construct inner fences to the left and right of the box at a distance of 1.5 times the interquartile range. Outer fences are constructed in the same way at a distance of 3 times the interquartile range. Observations that lie between the inner and outer fences are called suspected outliers. Observations that lie beyond the outer fences are called outliers.

3.6 MEASURES OF SPREAD OR VARIATION

It is not enough just to report a number that describes the centre of a sample. The spread in a sample is also an important characteristic of the sample. Once the middle of a set of data has been determined, our search for information immediately turns to the measures of dispersion (spread). The measures of dispersion include the range, variance and standard deviation. These numerical values describe the amount of spread, or variability, that is found among the data.

3.7 MEAN ABSOLUTE DEVIATION (MAD)

This is the average amount by which values in a distribution differ from the mean.
Mean absolute deviation for ungrouped data:

$$\text{MAD} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \tag{3.14}$$

Mean absolute deviation for ungrouped data with frequencies and for grouped data:

$$\text{MAD} = \frac{\sum f\,|x - \bar{x}|}{\sum f} \tag{3.15}$$

Example 3.13: Find the mean deviation for the following data:

3 4 5 8 15

First, we find the mean of the data:

Mean = $\bar{x}$ = Σx/n = (3 + 4 + 5 + 8 + 15)/5 = 35/5 = 7

The absolute deviations from the mean are 4, 3, 2, 1 and 8, so Σ|x − $\bar{x}$| = 18 and

MAD = 18/5 = 3.6

This implies that the average distance of a data value from the mean is 3.6.

Example 3.14: Find the mean absolute deviation of the following data

316 24 -1.16 40 -0.16 7 1.84 322.84 183.84 Σfx = 129

Example 3.15: The following distribution of commuting distances was obtained for a sample of employees.

Table 3.7
Distance (kilometres)   Frequency
 1.0 -  2.9                 2
 3.0 -  4.9                 6
 5.0 -  6.9                12
 7.0 -  8.9                50
 9.0 - 10.9                35
11.0 - 12.9                15
13.0 - 14.9                 5

Find the mean deviation for the commuting distances.

Solution

Distance       f     Class mark x     fx       |x − x̄|   f|x − x̄|
 1.0 -  2.9     2       1.95           3.90      6.8        13.6
 3.0 -  4.9     6       3.95          23.70      4.8        28.8
 5.0 -  6.9    12       5.95          71.40      2.8        33.6
 7.0 -  8.9    50       7.95         397.50      0.8        40.0
 9.0 - 10.9    35       9.95         348.25      1.2        42.0
11.0 - 12.9    15      11.95         179.25      3.2        48.0
13.0 - 14.9     5      13.95          69.75      5.2        26.0
Total      Σf = 125             Σfx = 1093.75          Σf|x − x̄| = 232

$\bar{x}$ = Σfx/Σf = 1093.75/125 = 8.75

Mean deviation = Σf|x − $\bar{x}$|/Σf = 232/125 = 1.856

3.8 VARIANCE AND STANDARD DEVIATION

Variance is a useful measure of the spread of the original values about the mean.
When we are concerned with a population, the variance is written in terms of the Greek letter σ (lower case sigma) and is denoted by σ². The population variance is given by the following formula:

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \quad\text{or}\quad \sigma^2 = \frac{N\sum x^2 - \left(\sum x\right)^2}{N^2} \tag{3.16}$$

where N is the size of the population.

However, a far more useful measure of the spread or variability in a set of data is the standard deviation, which is defined as the square root of the variance.

$$\text{Standard Deviation (SD)} = \sqrt{\text{Variance}} \tag{3.17}$$

Since the population standard deviation is the square root of the variance σ², it is denoted by σ and is found from the formula

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \quad\text{or}\quad \sigma = \frac{\sqrt{N\sum x^2 - \left(\sum x\right)^2}}{N} \tag{3.18}$$

One special advantage of working with the standard deviation is that it is in the same units as the original data. Thus, if the original set of numbers represents weights, then the mean and the standard deviation are both measured in units of weight.

The larger σ is for a set of numbers, the greater the spread or variability among those numbers. The smaller the value of σ, the smaller the amount of variation in the data.

All the above ideas for the variance and standard deviation were developed in the context of a population. Very similar ideas exist for the variance and standard deviation of a sample drawn from a population, with one significant difference. When we deal with a sample, we cannot average the sum of the squared deviations, Σ(x − $\bar{x}$)², over the entire set of data. Instead, it is necessary to make the following modification:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \quad\text{or}\quad s^2 = \frac{n\sum x^2 - \left(\sum x\right)^2}{n(n - 1)} \tag{3.19}$$

and

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \quad\text{or}\quad s = \sqrt{\frac{n\sum x^2 - \left(\sum x\right)^2}{n(n - 1)}} \tag{3.20}$$

That is, instead of dividing by the n data points, we divide by n − 1. Just as σ² and σ represent the variance and standard deviation of a population, respectively, we use the symbols s² and s to stand for the variance and standard deviation, respectively, of a sample.
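The measures of spread defined in (3.14) to (3.20) can be checked numerically with a short Python sketch. The function names are assumptions made for this illustration. The five values are those of Example 3.13 and the grouped data are the class marks and frequencies of Table 3.7; treating the five values first as a population and then as a sample is done here purely to show the effect of dividing by N versus n − 1.

```python
import math


def mean_absolute_deviation(xs):
    """Formula (3.14): average absolute distance from the mean."""
    mean = sum(xs) / len(xs)
    return sum(abs(x - mean) for x in xs) / len(xs)


def grouped_mad(marks, freqs):
    """Formula (3.15), using class marks and their frequencies."""
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(marks, freqs)) / total
    return sum(f * abs(x - mean) for x, f in zip(marks, freqs)) / total


def population_variance(xs):
    """Definitional form of (3.16): divide by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def sample_variance(xs):
    """Definitional form of (3.19): divide by n - 1."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)


def sample_variance_shortcut(xs):
    """Computational form of (3.19)."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))


data = [3, 4, 5, 8, 15]                       # values of Example 3.13
print(mean_absolute_deviation(data))          # 3.6
print(population_variance(data))              # 18.8
print(sample_variance(data))                  # 23.5
print(sample_variance_shortcut(data))         # 23.5
print(round(math.sqrt(sample_variance(data)), 3))  # 4.848

marks = [1.95, 3.95, 5.95, 7.95, 9.95, 11.95, 13.95]   # Table 3.7 class marks
freqs = [2, 6, 12, 50, 35, 15, 5]
print(round(grouped_mad(marks, freqs), 3))    # 1.856
```

The sum of squared deviations is 94 in both variance calculations; only the divisor changes, which is why the sample variance (23.5) is larger than the population variance (18.8).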
The variance for data with frequency counts and for grouped data is

$$\sigma^2 = \frac{\sum f\,(x - \mu)^2}{\sum f} \quad\text{or}\quad \sigma^2 = \frac{\sum f x^2 - \dfrac{\left(\sum f x\right)^2}{\sum f}}{\sum f} \tag{3.21}$$

and

$$s^2 = \frac{\sum f\,(x - \bar{x})^2}{\sum f - 1} \quad\text{or}\quad s^2 = \frac{\sum f x^2 - \dfrac{\left(\sum f x\right)^2}{\sum f}}{\sum f - 1} \tag{3.22}$$

The standard deviations σ and s are the square roots of (3.21) and (3.22), respectively.

3.9 Variance Scaling Fact, The foregoing down of d to | The formulas and
