Hi, I'm Adriene Hill, and this is Crash Course Statistics.
So, for the last few episodes we've discussed ways to summarize data using numbers.
We used measures of central tendency and measures of spread.
But sometimes it can be helpful to actually *see* your data in addition to having numbers
to describe it.
Data visualizations are important to understand because you'll see them every day.
In the news, on Facebook, in magazines.
Maybe I'll make an infographic of all the places we see data visualizations.
INTRO
There are two main types of data that we might encounter: categorical and quantitative.
Quantitative data are quantities, numbers that have both order and consistent spacing.
For example, how many ounces of olive oil are in each American home.
If three families told you how many ounces of olive oil they have, you could put them
in a meaningful order--from least to greatest, or greatest to least.
This order also has consistent spacing: an increase of 1 ounce of olive oil is the same
whether you go from 0 to 1 ounce, or from 100 to 101 ounces.
These properties allow us to do simple math with the data--like taking the mean or calculating
the standard deviation.
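As a minimal sketch of that "simple math", here is how three families' olive oil amounts could be ordered and summarized. The three values are made up for illustration; they are not from the episode.

```python
import statistics

# Hypothetical ounces of olive oil reported by three families (illustrative values)
ounces = [8, 16, 48]

# Quantitative data can be put in a meaningful order
ordered = sorted(ounces)

# Consistent spacing is what makes these summaries meaningful
mean_oz = statistics.mean(ounces)
sd_oz = statistics.stdev(ounces)  # sample standard deviation (divides by n - 1)

print(ordered, mean_oz, round(sd_oz, 2))
```

The same calls would not make sense on categorical answers like "penne" or "rotini", which is the contrast the next part draws.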
Categorical data doesn't have a meaningful order or consistent spacing.
For example, favorite kind of pasta.
You might like penne, rotini, linguine, or even Angel Hair, but there's no objective
way to put those pastas into a meaningful order.
Is penne truly better than linguine?
Where does rotini fit in?
It would be pasta madness to try to put them in order.
The simplest way to display categorical data is to make a frequency table.
A frequency table shows you all of the categories and the number of data points that fall in
that category (in other words, its frequency).
To change a frequency table into a relative frequency table, we just need to take each
raw frequency and divide by the number of total points to get a decimal between 0 and 1.
Some of you may be used to reading decimals as percentages, but if you're not, just
multiply by 100 to get the percentage.
For linguine we have 10/50 which is 0.2 or 20% of the group.
Relative frequency tables have the benefit of being easy to compare.
No matter what we're measuring or how many data points we have, it's easy to compare
percentages.
If 20% of people like linguine, we can see that's a smaller percent than the 67% of
people who like pineapple on pizza or greater than the 10% of my family who thinks statistics
are scary.
The relative frequency table for favorite pasta might look like this.
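A table like that can be built in a few lines. This is only a sketch: the linguine count (10 out of 50) comes from the episode, while the other three counts are assumptions chosen so the total is 50 and penne comes out on top.

```python
from collections import Counter

# Hypothetical survey of 50 people; only linguine's 10 of 50 is from the episode
responses = (
    ["penne"] * 22 + ["rotini"] * 12 + ["linguine"] * 10 + ["angel hair"] * 6
)

freq = Counter(responses)   # frequency table: category -> count
total = sum(freq.values())  # number of data points

# Relative frequency = raw frequency / total, a decimal between 0 and 1
rel_freq = {pasta: count / total for pasta, count in freq.items()}

for pasta, count in freq.most_common():
    share = rel_freq[pasta]
    print(f"{pasta:<12}{count:>4}{share:>8.2f}{share:>8.0%}")
```

The last two columns show the same number twice, once as a decimal and once multiplied by 100 as a percentage.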
We can also add more than one variable to our frequency table.
We could ask people to rate their favorite pasta sauce and make a combined frequency
table, or a contingency table, of both pasta and sauce preference.
If I were planning a party, and needed to pick some pasta for the group, my best bets
would be Rotini with Red Sauce and Penne with Red or White sauce.
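Here is a rough sketch of a contingency table for pasta and sauce. Every count below is an assumption, picked only so that the same three combinations come out as the best bets.

```python
# Hypothetical counts for each (pasta, sauce) pair; not the episode's actual table
counts = {
    ("penne", "red"): 11, ("penne", "white"): 11,
    ("rotini", "red"): 10, ("rotini", "white"): 2,
    ("linguine", "red"): 5, ("linguine", "white"): 5,
    ("angel hair", "red"): 3, ("angel hair", "white"): 3,
}
pastas = ["penne", "rotini", "linguine", "angel hair"]
sauces = ["red", "white"]

# Header row, one row per pasta with a row total, then the column totals
print(f"{'':<12}" + "".join(f"{s:>7}" for s in sauces) + f"{'total':>7}")
for p in pastas:
    row = [counts[(p, s)] for s in sauces]
    print(f"{p:<12}" + "".join(f"{n:>7}" for n in row) + f"{sum(row):>7}")
col_totals = [sum(counts[(p, s)] for p in pastas) for s in sauces]
print(f"{'total':<12}" + "".join(f"{n:>7}" for n in col_totals) + f"{sum(col_totals):>7}")

# The three most popular combinations are the "best bets" for the party
top_three = sorted(counts, key=counts.get, reverse=True)[:3]
print(top_three)
```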
And because I'm planning a party and because I'm having food, I did look it up: the chance
of death by choking on food in the US in a given year is 1 in 100,686.
But, sometimes we don't want just numbers in our visualization.
Earlier in the series, I talked about how it can be hard to wrap your head around numbers--especially
when they get really big or really small.
There are other more visual ways to represent categorical data.
One way to do this is with a bar chart.
A bar chart uses the frequencies that we saw in our frequency table to create bars that
have a height equal to the frequency.
That way, we can compare the height of bars instead of looking at raw numbers.
Here's a bar chart representing the pasta data we saw in our original frequency table.
You can see that penne is by *far* the most chosen pasta, and how it compares to Angel Hair.
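To make the idea concrete without a plotting library, here is a small text-only sketch. It draws the bars sideways, so bar length stands in for height, and it uses assumed counts (only linguine's 10 comes from the episode).

```python
# Hypothetical pasta frequencies; only linguine's count is from the episode
freq = {"penne": 22, "rotini": 12, "linguine": 10, "angel hair": 6}

def bar_chart(freq, symbol="#"):
    """Return one text line per category, with bar length equal to the frequency."""
    lines = []
    # Sort from most to least chosen so the tallest bar comes first
    for label, count in sorted(freq.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"{label:<12}{symbol * count} {count}")
    return lines

for line in bar_chart(freq):
    print(line)
```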
Bar charts display a lot of information in a very simple graph. They can also display
the frequencies of multiple variables.
Let's say we want to compare each of these pasta types with either white or red sauce.
We can either stack frequencies so it gives us the same information as our contingency
table, or we can have bar charts side by side.
Pie charts are another way of displaying categorical data.
They use the relative frequency of categories to portion out pieces of a circle, just like
a pie.
The higher the relative frequency, the bigger the slice of pie a category gets.
Pie charts are useful because our eyes are pretty good at comparing slices.
Our pasta data in a pie chart looks like this.
Pie charts are great at visually displaying one variable.
But they struggle to effectively display more than one variable, like our pasta and sauces
contingency table.
Another way to display categorical data is a pictograph.
Pictographs represent frequency with pictures.
A picture, like the ball in this basketball participation graph, will represent some number
of units, say 100 kids.
So if Riverdale High had 550 students participate in their basketball programs, then the graph
would show 5.5 basketballs.
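The arithmetic behind that is just a division. In this sketch the function name and the text icons ("O" for a whole ball, "o" for a half) are made up for illustration.

```python
def pictograph(count, units_per_icon=100, icon="O", half_icon="o"):
    """Return (number of icons, a text drawing) for a count.

    Assumes each icon stands for units_per_icon units. The drawing shows a
    half icon when the leftover part is at least half a unit, and drops
    anything smaller.
    """
    icons = count / units_per_icon
    whole = int(icons)
    has_half = (icons - whole) >= 0.5
    return icons, icon * whole + (half_icon if has_half else "")

icons, drawing = pictograph(550)  # Riverdale High: 550 students, 100 per ball
print(icons, drawing)
```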
Sometimes pictographs represent frequencies by increasing the size of the picture instead.
That's not wrong, but sizes are more difficult for us to visually compare, especially for
small differences, and that can be misleading.
Plus, at a casual glance, we don't know what the size difference means.
Are we comparing the diameter of the basketballs?
Or are we comparing their areas?
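A quick sketch shows why the question matters, treating each basketball as a circle (an assumption made here for the arithmetic): doubling the diameter makes the picture four times as large in area.

```python
import math

def circle_area(diameter):
    """Area of a circle with the given diameter."""
    return math.pi * (diameter / 2) ** 2

# If the frequency doubles and the artist doubles the diameter...
area_ratio = circle_area(2.0) / circle_area(1.0)

# ...the picture covers four times the area, not twice.
print(round(area_ratio, 2))

# To make the *area* double instead, the diameter should grow by sqrt(2)
diameter_scale_for_double_area = math.sqrt(2)
print(round(diameter_scale_for_double_area, 3))
```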
*BREAKING NEWS*
This is Channel 2 News.
Looks like all you students out there are really hitting the books!
Data from the US Department of Education shows the graduation rate has been climbing!
So way to go everybody!
You're passing the test of life with flying colors!
Let's push that stack of books even higher!
So, that last pictograph...not at all to scale.
See how the stacks of books are not proportionate?
It shows a difference of 5 percentage points (from 75% to 80%) with a stack of books that is over *double*
the height of the 75% stack.
This makes the difference seem huge because the axis doesn't start at 0.
And yet, an increase from 80% to 81% is shown by two stacks that are BARELY different in height,
even though the 5-point difference looks huge.
Always keep an eye on those axes.
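Here is a small sketch of how much a cut-off axis can exaggerate. The starting value of 71 is an assumption chosen to give a stack a bit over double the height; the episode does not say where the graphic's axis actually starts.

```python
def drawn_height_ratio(a, b, axis_start=0.0):
    """How many times taller bar b is drawn than bar a when the axis starts at axis_start."""
    return (b - axis_start) / (a - axis_start)

# Axis starting at 0: the 80% stack is only slightly taller than the 75% stack
honest = drawn_height_ratio(75, 80)

# Hypothetical axis starting at 71: the same two values look very different
truncated = drawn_height_ratio(75, 80, axis_start=71)

print(round(honest, 3), truncated)
```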
Let's loop back to quantitative data, which as you'll remember, have a meaningful order
and consistent spacing.
Frequency tables can be used to display quantitative data, like age, or height, or ounces of olive
oil in your house.
We just have to create categories out of our quantitative data first.
We do that with a process called "binning".
Binning takes a quantitative variable and bins it into categories that are either pre-existing
or made up.
For example I can say that 0-15 oz of olive oil is "Very Little", 16-32 oz is "Average",
33-49 oz is "A Lot" and 50+ oz is "Excessive"--like suspiciously Excessive.
Like Will's 14 cats excessive.
Why do you need so much olive oil?
Anyway, once I've binned my data, I can create a frequency table or relative frequency
table, just like with our pasta example.
It might look something like this.
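As a sketch, the four olive oil bins described above could be applied like this. The household amounts are made up, and the cutoffs assume amounts are recorded in whole ounces.

```python
from collections import Counter

def bin_olive_oil(ounces):
    """Map an amount in whole ounces to one of the four bins from the episode."""
    if ounces <= 15:
        return "Very Little"
    if ounces <= 32:
        return "Average"
    if ounces <= 49:
        return "A Lot"
    return "Excessive"

# Hypothetical household amounts in ounces (illustrative values only)
households = [0, 8, 12, 16, 20, 24, 25, 30, 33, 40, 51, 85]

binned = Counter(bin_olive_oil(oz) for oz in households)
total = len(households)

for label in ["Very Little", "Average", "A Lot", "Excessive"]:
    print(f"{label:<12}{binned[label]:>3}{binned[label] / total:>7.0%}")
```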
Binning is most useful when there are pre-existing "bins" for our data.
Like, you can divide age-in-years into the bins "Child", "Teen", "Adult"
and "Older Adult" because those are pre-existing categories.
We can also take a score on a depression test and create two bins: "clinically depressed"
and "not clinically depressed".
You can see from this example that bins don't HAVE to be equally spaced, but if you see
quantitative data that has been binned, make sure that the way it was divided up was appropriate
for the situation.
Unequally spaced bins can be misleading unless there's a real world distinction to back
it up.
Say politician X wants to make himself look popular, but it seems like people in their
30s really hate him
(probably because he said that the reason they can't afford a house is their brunch
habit).
Politician X wants to hide the fact that over 80% of people in their 30s said they won't
vote for him.
So he does some "re-binning".
Traditionally the data are binned roughly by decade: 18 years old to 29 years old, 30
years old to 39 years old, 40 to 49...you get the point.
But Mr. X needs to hide these hateful 30-somethings in the data.
The old chart looked like this:
But Politician X decided to split up the 30-somethings to make his numbers look better:
He moved the data around to hide the glaring group of 30-something dissenters.
Instead of showing the truth that 30-somethings despise him, we see a more...positive view
of his popularity.
By splitting the 30-somethings and putting some of them into two other, larger groups,
he can obscure their political dissatisfaction.
Looking at this new table, he'd win the popularity vote in each of the 5 new bins.
If I don't show you the number of voters per bin, it seems legit...
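Here is a simplified sketch of how that trick works. All of the poll numbers and the exact age cutoffs are invented for illustration, and it uses four new bins rather than the five on screen.

```python
# Hypothetical poll: (low_age, high_age) -> (supporters, respondents)
fine = {
    (18, 29): (130, 200),
    (30, 34): (8, 50),
    (35, 39): (10, 50),
    (40, 49): (130, 200),
    (50, 59): (60, 100),
    (60, 120): (60, 100),
}

def support_by_bin(fine, bins):
    """Share of supporters in each (low, high) bin, pooling the fine groups inside it."""
    result = {}
    for low, high in bins:
        inside = [v for (lo, hi), v in fine.items() if low <= lo and hi <= high]
        supporters = sum(s for s, n in inside)
        respondents = sum(n for s, n in inside)
        result[(low, high)] = supporters / respondents
    return result

# Binned by decade, the 30-somethings stand out: only 18% support
by_decade = support_by_bin(fine, [(18, 29), (30, 39), (40, 49), (50, 59), (60, 120)])

# Split the 30-somethings into two larger neighbors and they disappear
rebinned = support_by_bin(fine, [(18, 34), (35, 49), (50, 59), (60, 120)])

for name, table in [("by decade", by_decade), ("re-binned", rebinned)]:
    print(name, {k: f"{v:.0%}" for k, v in table.items()})
```

Printing the respondents per bin next to each percentage would give the game away, since the new bins are noticeably larger than the old ones.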
Another categorical graphing method we can apply to quantitative data is bar charts.
When we use bar charts for quantitative data, we squish the bars together so that they're
touching and we call them histograms.
The bars are squished together because the data are 'continuous', which means the
values in one bar flow into the next bar; there's no separation like in our categorical
bar charts.
In histograms, like bar charts, the height of the bars tells us how frequently data in
a certain range occur.
A histogram also gives us information about how the data is distributed.
We can estimate where the mean, median and mode of our data are as well as see how spread
out the data is.
Look at this histogram for our olive oil data.
For this histogram, we can see that the range of the data is approximately 85 since it covers
values from 0 to 85 ounces, that it's right skewed (the tail is to the right), and that its
center is around 25 ounces.
The histogram gives us more information about the data than a frequency table does, but
it still obscures WHAT the specific data values are.
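A rough text-only sketch of a histogram follows. The olive oil values are invented to be right skewed and to run from 0 to 85 ounces, and the bin width of 10 is an arbitrary choice.

```python
def histogram_counts(values, bin_width):
    """Count how many values fall in each equal-width bin starting at 0.

    Assumes all values are non-negative. Bin i covers
    [i * bin_width, (i + 1) * bin_width).
    """
    n_bins = int(max(values) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int(v // bin_width)] += 1
    return counts

# Hypothetical ounces of olive oil per household (illustrative values only)
ounces = [0, 5, 8, 12, 14, 16, 18, 20, 22, 22, 24, 25, 26, 28,
          30, 33, 35, 38, 42, 47, 55, 63, 85]

bin_width = 10
counts = histogram_counts(ounces, bin_width)

# Adjacent bins touch: each one starts where the previous one ends
for i, c in enumerate(counts):
    lo = i * bin_width
    print(f"{lo:>2}-{lo + bin_width - 1:<3}{'#' * c}")

data_range = max(ounces) - min(ounces)
print("range:", data_range)
```

Once the values are counted into bins, the individual numbers are gone: the 20-29 bar says seven households fall in that range, but not which amounts they reported.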
If you read the news--or watch the news--you will see these representations over and over
and over.
You will likely see far more of these charts and graphs than you will create.
The big take away here, as a consumer of these things, is to look closely at what the visualization
is actually telling you.
Or maybe trying to hide from you.
These charts and graphs give us another way to comprehend numbers--to see the big picture.
Thanks for watching!
I'll see you next week.