The Renaissance Biologist: Just Plain Data Analysis

These are my reading notes from the book of this title by Gary M. Klass (2nd ed., Rowman and Littlefield Publishers, 2012). I am sharing them because many of us need practical data analysis skills regardless of our position.

1. Measuring Political, Social, and Economic Conditions

When presented with data, how do we respond? Do we cry "opinion"? Do we analyze our motives and presuppositions, and those of the ones interpreting the data?
Be aware of reliable sources of statistics such as The National Center for Health Statistics and the U.S. Census Bureau.
To interpret social indicators, one needs to understand the survey questions used, and what standards were used to determine counts. What is the numerator? The denominator? The form of comparison (cross-sectional, cross-time, or cross-demographic)? Why was each chosen?
Measurement validity relates to an indicator's context and how the numerator and denominator were determined. That is, we look at how well a measure assesses an underlying idea.
Measurement reliability, the repeatability of a measurement whose underlying value is relatively constant, is affected by sampling error (was the sample random? large enough?), response rate, and reasons for response or non-response.
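To make the "large enough?" question concrete, here is a minimal sketch of how sample size drives sampling error. It assumes a simple random sample and the normal approximation for a proportion; the function name and the sample sizes are made up for illustration and do not come from the book.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation;
    z=1.96 corresponds to a 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be between 0 and 1")
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# Hypothetical survey: 50% of respondents answer "yes".
for n in (100, 1000, 10000):
    moe = margin_of_error(0.5, n)
    print(f"n={n:>6}: +/- {moe * 100:.1f} percentage points")
```

The margin shrinks only with the square root of the sample size, and it says nothing about bias from a low response rate or from who chooses to respond.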

2. Measuring Racial and Ethnic Inequality

Shift the discussion away from personal motives and toward the evidence in the data.
The U.S. uniquely classifies citizens by race and asks about both race and ethnicity. Most data come from the Current Population Survey (also used for education data) and its Annual Social and Economic Supplement, which cover non-institutionalized individuals. However, net family wealth is rarely reported, even though it varies among families with the same income.
Most health data are in the CDC's National Center for Health Statistics.
Crime data mainly come from the FBI's Uniform Crime Reporting program and the National Crime Victimization Survey (NCVS, which relies on self-reports and is therefore more subjective).

3. Statistical Fallacies, Paradoxes, and Threats to Validity

Some fallacies in daily conversations and media:

Ad hominem - something is false because its author has unacceptable character or motives
Post hoc, ergo propter hoc - because something occurred after something else, it was caused by that temporally prior thing
Appeals to authority or popularity - something is true because an authority, or the majority, says so
Appeals to loyalty - we should support others doing things for us
Slippery slope
Straw man
Begging the question
Red herring
Hasty generalization

Statistical fallacies to be aware of:

Cherry picking - choosing only the evidence that supports one's claim (compounded by confirmation bias, i.e. people more readily accept data that agree with their position)
History - ignoring the potential for other past events to have influenced results
Reverse causation - an unclear cause-effect relationship that could go either way
Self-selection - remember that comparison groups might not be equivalent to start
Sample mortality - disproportionate dropout in one group versus another group
Maturation - potential for change due to simple participant aging
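Simpson's paradox - a pattern seen within subgroups disappears, or even reverses, when the subgroups are pooled into one sample
Regression fallacy - mistaking regression toward the mean for a real effect; extreme scores naturally drift back toward the average on re-measurement (especially when participants were selected for extreme scores at the outset)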
Instrumentation/measurement reliability - can one trust the tools?
Ecological fallacy - drawing conclusions about individuals from aggregate data, such as data for a geographical group
External validity - do randomized experiments apply to real-world situations?
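
Simpson's paradox from the list above is easier to believe once you see the arithmetic. The sketch below uses entirely hypothetical counts for two programs, "A" and "B", split into two subgroups; the names and numbers are invented for illustration only.

```python
# Hypothetical counts: (successes, total) for each program within each subgroup.
data = {
    "easy cases": {"A": (90, 100), "B": (800, 1000)},
    "hard cases": {"A": (300, 1000), "B": (20, 100)},
}

def rate(successes, total):
    """Success rate as a fraction between 0 and 1."""
    return successes / total

def pooled_rate(data, program):
    """Success rate for one program after pooling all subgroups."""
    successes = sum(groups[program][0] for groups in data.values())
    total = sum(groups[program][1] for groups in data.values())
    return rate(successes, total)

# Within every subgroup, program A has the higher success rate...
for subgroup, programs in data.items():
    a = rate(*programs["A"])
    b = rate(*programs["B"])
    print(f"{subgroup}: A={a:.0%}  B={b:.0%}")

# ...but once the subgroups are pooled, program B looks better.
print(f"pooled: A={pooled_rate(data, 'A'):.1%}  B={pooled_rate(data, 'B'):.1%}")
```

The reversal happens because program A handles mostly hard cases while program B handles mostly easy ones, so the pooled rates mostly reflect the case mix rather than the programs themselves.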

4. Examining a Relationship: New York City Crime Rates

Case: Rudy Giuliani claimed responsibility for a major reduction in crime rates in NYC. However, this claim can be weakened by several fallacies:

Regression artifact: he may have been elected when crime was unusually high. The evidence does not support this.
Maturation and long-term processes: a general national effect might have been taking place at the same time, including an aging population
Historical events: actions of his predecessors
Instrumentation: possibility of manipulation of statistics (or changing definitions) of crime. Likely - a police hiring binge took place years before he took office.
Other causes: could have included semi-related policy changes, decades before

Conclusion (p. 59): "In cities across America in the 1990s, mayors touted their success in fighting crime in their reelection campaigns. For most, it was dumb luck; they just happened to be in office at the right time. As for Giuliani, the evidence presented here offers no final proof that the mayor's policies reduced crime, but most of the counterarguments, with the exception of Nevin's lead paint hypothesis, do not hold up."

5. Tabulating the Data and Writing about the Numbers

Two kinds of people read research: those who focus on the text over the tables/charts, and those who focus on the tables/charts while skimming the text. Both should complement each other.
General writing principles:

Meaningful measures/comparisons

Rates and ratios compare between groups and over time
If measuring over time, select an appropriately long time frame (e.g., 5 years)
Clearly state whether you are referring to net, percentage, or percentage-point change
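
The difference between these kinds of change is easy to blur in writing, so here is a minimal sketch with hypothetical numbers. The function names are my own, and I am treating "net" change as the simple difference between two values.

```python
def net_change(old, new):
    """Simple difference between two values (in the values' own units)."""
    return new - old

def percent_change(old, new):
    """Change relative to the starting value, in percent."""
    if old == 0:
        raise ValueError("percent change is undefined when the old value is 0")
    return (new - old) * 100 / old

# Hypothetical count: 500 cases one year, 600 the next.
print("net change in count:", net_change(500, 600))
print("percent change in count:", percent_change(500, 600))

# Hypothetical rate: 10% one year, 12% the next.
# The net change of a percentage is measured in percentage points.
print("percentage-point change in rate:", net_change(10.0, 12.0))
print("percent change in rate:", percent_change(10.0, 12.0))
```

A rate that moves from 10% to 12% has risen by 2 percentage points but by 20 percent, which is why the wording needs to say which one is meant.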

Unambiguous data presentation

Organize by rows and columns with precise headers
Define both numerator and denominator when applicable (rates/ratios) and always the count, divisor, and comparison
Vary the amount of detail by the intended audience
Labels should be brief while complete
Cite sources precisely in footnotes to allow fact-checking

Efficient communication of key ideas about the data

Organize rows and columns to present similar types of data in any one table
Sort data by high-low numbers, not alphabetically
For large numbers, use 2-3 significant digits and 1 decimal place at the most. "There is no need for any correlation coefficient, R-Square, standardized regression coefficient, or even a measure of statistical significance to be displayed with more than two decimal places" (p. 73).
When writing in-text about numbers, round them even more than in a table
If using ordered categories in multiple tables, keep the same order
Use as neutral a table title as possible
Highlight critical numbers in tables for comparisons
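
As a rough sketch of two of the points above (sorting from high to low and rounding large numbers), the snippet below uses made-up row labels and values. The helper `round_sig` is a hypothetical name of my own, not something from the book.

```python
import math

def round_sig(x, sig=3):
    """Round x to `sig` significant digits."""
    if x == 0:
        return 0
    # Position of the leading digit decides how many places to keep.
    digits = sig - 1 - math.floor(math.log10(abs(x)))
    return round(x, digits)

# Hypothetical table rows: (label, raw value).
rows = [
    ("Alpha", 1234567),
    ("Beta", 98765432),
    ("Gamma", 4567890),
]

# Sort by value, high to low, rather than alphabetically by label.
rows_sorted = sorted(rows, key=lambda row: row[1], reverse=True)

print(f"{'Group':<8}{'Value (millions)':>18}")
for label, value in rows_sorted:
    # Report in millions with one decimal place instead of the raw figure.
    print(f"{label:<8}{round_sig(value) / 1_000_000:>18.1f}")
```

Rounding in the table keeps the reader's attention on the comparison; the exact figures can still live in the cited source.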

If you have a paragraph with 5 or more numbers, use a table! If you want more precision than a chart gives, use a table! If you use a table, reference it in the text! Get to the point in your writing.

6. The Graphical Display of Data

"Good information design is clear thinking made visible, while bad design is stupidity in action" (p. 79, quoting Edward Tufte, Visual Explanations)
Avoid the problems of both hiding information and distracting the reader

D. Huff, How to Lie with Statistics
E. Tufte, The Visual Display of Quantitative Information

Rules:

Self-explanatory charts
Precisely and concisely defined data
Meaningful, interesting numerical comparisons
Efficient presentation of numerical information (no 3-D effects)
Organized, sorted data by most to least meaningful variables
Show data unclouded by design and scale
Be scrupulously honest
Use the most appropriate type of chart for the data
Be consistent across chart formatting

Chart parts (never 3-D):

Title - neutral definition
Axis titles/labels - vertical text for y axis
Axis scale - limit to 5 increments
Data labels - may make y axis labels and gridlines unneeded
Legends - for charts with multiple data series; label the trendline
Gridlines - minimize ink
Sources - use complete citations

Pie charts:

Avoid them; use only for data that sum to a meaningful 100%
Avoid legends and cross-chart comparisons
Prefer pie charts over doughnut, cone, pyramid, radar, and cylinder

Bar charts:

Minimize ink, color, and shading
Sort data by most important variable; left-right time
Place legends in the plot area
Avoid scaling distortions
If 8-10 or more categories, use rotated charts

Time series/line charts:

Make lines distinct and directly labeled
Avoid for unordered categorical data; time goes left-right on x axis

Stacked charts:

Use only for meaningfully ordered data series; each stack must be a meaningful addition
Place the most meaningful data on the bottom of the stack

Scatterplots:

Use 2 fully defined, interval-level variables
Title should state both variables and units of analysis
If an independent variable exists, place it on the x axis
Adjust axis scale to maximize area for data points; prefer labels to dots

Boxplots:

Show the median and quartiles (the four quarters of the data) for interval-level variables
Use to compare one variable's distribution across multiple groups or time points
Can compare one case with many other cases
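
The numbers behind a boxplot can be sketched with Python's standard library. This is only an illustration: the data are hypothetical, the function name is my own, and quartile conventions differ between tools (here I assume the "inclusive" method of `statistics.quantiles`).

```python
import statistics

def five_number_summary(values):
    """Minimum, quartiles, and maximum: the values a boxplot is drawn from."""
    ordered = sorted(values)
    # n=4 returns the three cut points between the four quarters of the data.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }

# Hypothetical scores for two groups, compared on the same variable.
groups = {
    "Group 1": [52, 61, 64, 70, 71, 75, 80, 88, 93],
    "Group 2": [40, 45, 58, 60, 62, 66, 69, 72, 99],
}

for name, scores in groups.items():
    print(name, five_number_summary(scores))
```

Printing the summaries side by side is the tabular equivalent of placing two boxplots next to each other.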

7. Voting and Elections

Are American voters really disengaged (indicating an association with bad government all around)? A better explanation is voter fatigue due to the number of voting opportunities and the number of offices on the ballot; other citizen participation opportunities also exist (e.g., contacting officials).
How should voter turnout be measured? Is the numerator the number of people who went to the polls, or the number of valid votes for the highest office? Is the denominator the voting-age population, or the voting-eligible population?
Election day registration has potential to increase turnout while reducing fraud and costs, but only in non-presidential elections.
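
To show how much the choice of numerator and denominator matters for turnout, here is a small sketch. Every figure is hypothetical, and the variable names are my own; the voting-eligible population is simplified here to the voting-age population minus residents who are not eligible to vote.

```python
# All figures below are hypothetical and chosen only for illustration.
voting_age_population = 1_000_000
not_eligible = 80_000  # voting-age residents who cannot legally vote
voting_eligible_population = voting_age_population - not_eligible

ballots_cast = 560_000          # everyone who went to the polls
votes_highest_office = 550_000  # valid votes for the top office on the ballot

def turnout(numerator, denominator):
    """Turnout as a percentage of the chosen population."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator * 100 / denominator

measures = {
    "highest-office votes / voting-age": (votes_highest_office, voting_age_population),
    "highest-office votes / voting-eligible": (votes_highest_office, voting_eligible_population),
    "ballots cast / voting-age": (ballots_cast, voting_age_population),
    "ballots cast / voting-eligible": (ballots_cast, voting_eligible_population),
}

for label, (num, den) in measures.items():
    print(f"{label}: {turnout(num, den):.1f}%")
```

The same election yields four different turnout figures, so any comparison across states or years needs to use one definition consistently.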

8. Measuring Educational Achievement

Unique features of American education (which make comparison of scores and achievements statistically difficult internationally) include high localization, self-selection bias, and inclusion of students with disabilities in the same classrooms.
"A general finding of much of the research on educational achievement is that school resources, measured by factors such as the amount of money spent per pupil, teacher salaries, and class size, have little effect on what students learn" (p. 139). However, family resources are much more of a determining factor.
Are standardized tests valid, culturally biased, or predictive of academic achievement? Conclusions depend on the closeness to the test's intended use. No Child Left Behind has been subject to severe reliability and validity issues, as well as cherry-picking misinterpretation.

9. Measuring Poverty and Inequality

"You are entitled to your own opinion, but you're not entitled to your own facts." (p. 157, quoting Daniel Patrick Moynihan)
Poverty is defined relatively and is measured differently in developing vs developed nations. In the U.S., the poverty threshold is periodically adjusted using the Consumer Price Index and is based on a not-necessarily-representative hypothetical family of four.
Statistics require thought to interpret.

Sunday, May 10, 2020
