Since visualization of the data is so essential, we're starting with more basics.
Statistics for Business: Decision Making and Analysis, 3rd Edition.
VISUALIZATION
bar chart: A display using horizontal or vertical bars to show the distribution of a categorical variable. We use bar charts to show the frequencies of a categorical variable, so a bar chart should be a display of a frequency table.
Pareto chart: A bar chart with categories sorted by frequency.
pie chart: A display that uses wedges of a circle to show the distribution of a categorical variable.
doughnut chart: A pie chart with the center of the disk removed.
Bar vs Pie Charts: Because it’s easier to compare the sizes of categories using a bar chart, we prefer bar charts.
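The frequency table behind a bar chart, and the sorting that turns it into a Pareto chart, can be sketched in a few lines of Python. The data are hypothetical (a made-up list of referring hosts), and the text "bars" only stand in for a real plot.

```python
from collections import Counter

# Hypothetical categorical data: the host that referred each site visit.
hosts = ["google", "msn", "google", "yahoo", "google", "msn"]

# Frequency table: each category mapped to its count.
freq = Counter(hosts)

# Pareto ordering: categories sorted from most to least frequent.
pareto = freq.most_common()

# Text bar chart, one row per category, bar length equal to the count.
for category, count in pareto:
    print(f"{category:<8}{'#' * count} {count}")
```

Comparing the lengths of the printed bars is the same judgment a reader makes with a drawn bar chart, which is why sorted bars are easier to read than wedges.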
histogram: A plot of the distribution of a numerical variable as counts of occurrences within adjacent intervals. (The distribution of a numerical variable is the collection of possible values and their frequencies, just as for categorical variables.)
histogram mode: Position of an isolated peak in the histogram. A histogram with one mode is unimodal, one with two modes is bimodal, and one with three or more is multimodal.
uniform: A uniform histogram is a flat histogram with bars of roughly equal height.
tails: The left and right sides of a histogram where the bars become short.
skewed: An asymmetric histogram is skewed if the heights of the bins gradually decrease toward one side.
Histogram vs Bar Chart: The two are similar because they both display the distribution of a variable. Histograms describe numerical data, such as the sizes of these songs, that can take on any value. Bar charts show counts of discrete categories and are poorly suited for numerical data. A bar chart of the sizes of these songs would need to show thousands of bars, one for each size found in the data.
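To make "counts of occurrences within adjacent intervals" concrete, here is a minimal sketch that bins a numerical variable by hand. The song sizes, the bin width, and the starting edge are all assumptions chosen for illustration.

```python
# Hypothetical numerical data: song sizes in megabytes.
sizes = [2.1, 3.4, 3.9, 4.2, 4.4, 4.8, 5.1, 5.5, 6.7, 9.3]

bin_width = 2.0  # assumed width of each interval
start = 2.0      # assumed left edge of the first interval
n_bins = 4       # intervals [2, 4), [4, 6), [6, 8), [8, 10)

# Count how many values fall in each adjacent interval.
counts = [0] * n_bins
for x in sizes:
    index = int((x - start) // bin_width)
    counts[index] += 1

# Text histogram: one row per interval, bar length equal to the count.
for i, count in enumerate(counts):
    left = start + i * bin_width
    print(f"[{left:>4.1f}, {left + bin_width:>4.1f}) {'#' * count}")
```

With these made-up values the tallest bar sits in the second interval (a single mode), and the bars shorten toward the right, which is the kind of right tail described above.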
White Space Rule: If a plot has too much white space, refocus it.
boxplot: A graphic consisting of a box, whiskers, and points that summarize the distribution of a numerical variable using the median and quartiles.
bell shaped: A bell-shaped distribution represents data that are symmetric and unimodal.
contingency table: A table that shows counts of the cases of one categorical variable contingent on the value of another. (Visually: think of purchase values on the Y axis and hosts on the X axis, with the counts broken down per host.)
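A contingency table is just a count of each combination of two categorical variables. The sketch below builds one from hypothetical (host, purchase) pairs; the category names and data are invented for illustration.

```python
from collections import Counter

# Hypothetical cases: (host, purchase) pairs, one per site visit.
visits = [
    ("google", "yes"), ("google", "no"), ("msn", "no"),
    ("google", "no"), ("yahoo", "yes"), ("msn", "no"),
]

# Count each combination of the two categorical variables.
cells = Counter(visits)

hosts = sorted({host for host, _ in visits})
outcomes = sorted({purchase for _, purchase in visits})

# Print the table: one row per purchase outcome, one column per host.
print(f"{'':<6}" + "".join(f"{h:>8}" for h in hosts))
for outcome in outcomes:
    row = "".join(f"{cells[(h, outcome)]:>8}" for h in hosts)
    print(f"{outcome:<6}{row}")
```

Combinations that never occur (here, a purchase from an "msn" visit) show up as 0 because a Counter returns 0 for missing keys.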
cross-sectional: Data that measure attributes of different objects observed at the same time.
STATISTICS
median: Value in the middle of a sorted list of numbers; the 50th percentile.
quartile: The 25th or 75th percentile.
interquartile range (IQR): Distance between the 25th and 75th percentiles.
range: Distance between the smallest and largest values.
five-number summary: The minimum, lower quartile, median, upper quartile, and maximum.
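The median, quartiles, IQR, and range can be computed together as a five-number summary. This is a sketch on hypothetical data using Python's statistics module. Quartile conventions differ between textbooks and software; the "inclusive" method used here is one assumption among several reasonable choices.

```python
import statistics

# Hypothetical numerical data, with one unusually large value.
values = [12, 15, 18, 21, 24, 30, 33, 40, 95]

# Quartiles: the 25th, 50th, and 75th percentiles.
# method="inclusive" is an assumed convention; other methods can differ slightly.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

five_number = (min(values), q1, median, q3, max(values))
iqr = q3 - q1                          # distance between the quartiles
data_range = max(values) - min(values)  # distance between the extremes

print("five-number summary:", five_number)
print("IQR:", iqr, "range:", data_range)
```

Note how the single large value inflates the range much more than the IQR, since the IQR depends only on the middle half of the data.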
mode: The mode of a categorical variable is the most common category. The category with the highest frequency.
mean: The average, the ratio of the sum of the values to the number of values. Shown as a symbol with a line over it, as in ȳ. (In general, a bar over a symbol or name denotes the mean of a variable.)
variance: The average of the squared deviations from the mean, denoted by the symbol s².
standard deviation: A measure of variability found by taking the square root of the variance. It is abbreviated as SD in text and identified by the symbol s in formulas.
coefficient of variation: The ratio of the SD to the mean, s/ȳ. The coefficient of variation cv has no units because the mean and SD share the same units, so the units cancel. The coefficient of variation is more useful for data with a positive mean that’s not too close to 0. Values of cv larger than 1 suggest a distribution with considerable variation relative to the typical size. Values of cv near 0 indicate relatively small variation. In these cases, even though the SD may be large in absolute terms, it is small relative to the mean.
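The mean, variance, SD, and coefficient of variation build on one another, which a short calculation makes visible. The data are hypothetical. One assumption to flag: the sketch divides the sum of squared deviations by n - 1, the usual sample-variance convention, rather than by n as a literal "average" would.

```python
import math
import statistics

# Hypothetical numerical data.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)

# Mean: sum of the values divided by the number of values.
mean = sum(values) / n

# Variance: squared deviations from the mean, summed and divided by n - 1
# (assumed sample convention; dividing by n gives a slightly smaller value).
squared_deviations = [(y - mean) ** 2 for y in values]
variance = sum(squared_deviations) / (n - 1)

# SD: square root of the variance, back in the units of the data.
sd = math.sqrt(variance)

# Coefficient of variation: SD relative to the mean, with no units.
cv = sd / mean

print(f"mean={mean:.3f} variance={variance:.3f} sd={sd:.3f} cv={cv:.3f}")
```

The hand calculation agrees with statistics.variance and statistics.stdev, which use the same n - 1 divisor.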
Empirical Rule: 68% of data within 1 SD of mean, 95% within 2 SDs, and almost all within 3 SDs. Caution: The catch to using the Empirical Rule is that this relationship among ȳ, s, and a histogram works well only when the distribution is bell shaped.
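One way to see the Empirical Rule at work is to check it against simulated bell-shaped data. This is only a sketch: the data come from a random number generator, and the mean of 100 and SD of 15 are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated bell-shaped data; the mean of 100 and SD of 15 are arbitrary.
data = [random.gauss(100, 15) for _ in range(10_000)]

mean = statistics.mean(data)
sd = statistics.stdev(data)

def share_within(k):
    """Fraction of the data lying within k SDs of the mean."""
    inside = sum(1 for y in data if abs(y - mean) <= k * sd)
    return inside / len(data)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

The printed shares should land close to 68%, 95%, and nearly 100%. Running the same check on skewed data would typically give different shares, which is the caution stated above.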
z-score: The distance from the mean, counted as a number of standard deviations.
standardizing: Converting deviations into counts of standard deviations from the mean.
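Standardizing can be sketched as subtracting the mean and dividing by the SD, so each value becomes a z-score. The data below are hypothetical, and the SD uses the n - 1 convention from Python's statistics.stdev, which is an assumption about the intended formula.

```python
import statistics

# Hypothetical numerical data.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # n - 1 divisor (assumed convention)

# z-score: the deviation from the mean, counted in SDs.
z_scores = [(y - mean) / sd for y in values]

for y, z in zip(values, z_scores):
    print(f"value={y:>4.1f}  z={z:+.2f}")
```

Values below the mean get negative z-scores and values above it get positive ones; after standardizing, the z-scores themselves have mean 0 and SD 1.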
* Bar chart confidence intervals: Confidence intervals tell you how much higher or lower the percent could be. The I-bar at the tip of each bar shows the spread between the lowest and highest values you would be likely to see if you were to survey the entire population. *