October 25, 2017

General

Announcements

Reading posts

  • New reading posted. Note that we're using the stats-based textbook now.

  • The count of question and answer posts has reset for the second half of the semester

  • 5 question posts (2 questions each), 5 answer posts (1 answer each)

Data Collection: Case study

Treating Chronic Fatigue Syndrome

  • Objective: Evaluate the effectiveness of cognitive-behavior therapy for chronic fatigue syndrome.

  • Participant pool: 142 patients who were recruited from referrals by primary care physicians and consultants to a hospital clinic specializing in chronic fatigue syndrome.

  • Actual participants: Only 60 of the 142 referred patients entered the study. Some were excluded because they didn't meet the diagnostic criteria, some had other health issues, and some refused to be a part of the study.

Reference: Deale et al. Cognitive behavior therapy for chronic fatigue syndrome: A randomized controlled trial. The American Journal of Psychiatry 154.3 (1997).

Study design

  • Patients randomly assigned to treatment and control groups, 30 patients in each group:

    • Treatment: Cognitive behavior therapy – collaborative, educative, and with a behavioral emphasis. Patients were shown how activity could be increased steadily and safely without exacerbating symptoms.

    • Control: Relaxation – No advice was given about how activity could be increased. Instead progressive muscle relaxation, visualization, and rapid relaxation skills were taught.
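Random assignment of this kind can be sketched in a few lines of Python. This is only an illustration: the patient IDs 1 to 60 and the fixed seed are assumptions, and the paper's actual randomization procedure is not described in these notes.

```python
import random

# Hypothetical patient IDs 1..60 (the notes do not say how the
# study actually carried out its randomization).
rng = random.Random(2017)  # fixed seed so the sketch is reproducible
patient_ids = list(range(1, 61))
rng.shuffle(patient_ids)

# The first 30 shuffled IDs go to treatment, the remaining 30 to control.
treatment_group = sorted(patient_ids[:30])  # cognitive behavior therapy
control_group = sorted(patient_ids[30:])    # relaxation

print("Treatment:", treatment_group)
print("Control:  ", control_group)
```

Because chance alone decides who lands in which group, patient characteristics should be roughly balanced between the two groups on average.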

Results

The table below shows the distribution of patients with good outcomes at 6-month follow-up. Note that 7 patients dropped out of the study: 3 from the treatment and 4 from the control group.

             Good outcome
             Yes    No    Total
Treatment     19     8       27
Control        5    21       26
Total         24    29       53



  • Proportion with good outcomes in treatment group: \(19 / 27 \approx 0.70 \rightarrow 70\%\)

  • Proportion with good outcomes in control group: \(5 / 26 \approx 0.19 \rightarrow 19\%\)
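The two proportions and their difference can be checked with a short Python sketch. The counts come from the results table; the dictionary layout and the helper name `proportion_good` are just illustrative choices.

```python
# Counts from the results table (good outcome: yes / no).
counts = {
    "Treatment": {"yes": 19, "no": 8},
    "Control": {"yes": 5, "no": 21},
}


def proportion_good(group):
    """Share of patients in `group` with a good outcome."""
    yes = counts[group]["yes"]
    total = yes + counts[group]["no"]
    return yes / total


p_treatment = proportion_good("Treatment")
p_control = proportion_good("Control")
difference = p_treatment - p_control

print(f"Treatment:  {p_treatment:.2f}")
print(f"Control:    {p_control:.2f}")
print(f"Difference: {difference:.2f}")
```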

Understanding the results

Do the data show a "real" difference between the groups?

  • Suppose you flip a coin 100 times. While the chance a coin lands heads in any given coin flip is 50%, we probably won't observe exactly 50 heads. This type of fluctuation is part of almost any type of data generating process.
  • The observed difference between the two groups (70% - 19% = 51 percentage points) may be real, or may be due to natural variation.
  • Since the difference is quite large, it is more believable that the difference is real.
  • We need statistical tools to determine if the difference is so large that we should reject the notion that it was due to chance.
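One such tool is a randomization (permutation) test, sketched below in Python. It is not necessarily the method used in the paper. The idea: if the therapy made no difference, the 24 good outcomes would have occurred regardless of group, so we can shuffle the outcomes, deal 27 of them to a pretend "treatment" group, and see how often chance alone produces a difference at least as large as the observed one. The seed and the number of simulations are arbitrary choices.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# The 53 patients who completed the study: 24 good outcomes (1), 29 not (0).
outcomes = [1] * 24 + [0] * 29
n_treatment = 27
observed_diff = 19 / 27 - 5 / 26


def simulated_difference():
    """Shuffle outcomes, deal 27 to 'treatment', return the difference in proportions."""
    shuffled = outcomes[:]
    rng.shuffle(shuffled)
    treat = shuffled[:n_treatment]
    ctrl = shuffled[n_treatment:]
    return sum(treat) / len(treat) - sum(ctrl) / len(ctrl)


n_simulations = 10_000
simulated = [simulated_difference() for _ in range(n_simulations)]

# Count simulations at least as extreme as the observed difference
# (small tolerance guards against floating-point rounding).
n_extreme = sum(1 for d in simulated if d >= observed_diff - 1e-9)
p_estimate = n_extreme / n_simulations

print(f"Observed difference: {observed_diff:.3f}")
print(f"Simulations at least as extreme: {n_extreme} of {n_simulations}")
```

If only a tiny fraction of the shuffles reach the observed difference, chance alone is an implausible explanation for it.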

Generalizing the results

Are the results of this study generalizable to all patients with chronic fatigue syndrome?

These patients had specific characteristics and volunteered to be a part of this study; therefore, they may not be representative of all patients with chronic fatigue syndrome. While we cannot immediately generalize the results to all patients, this first study is encouraging. The method works for patients with some narrow set of characteristics, and that gives hope that it will work, at least to some degree, with other patients.

Overview of data collection principles

Populations and samples

  • Research question: Can people become better, more efficient runners on their own, merely by running?
  • Question: What is the population of interest?
  • Answer: All people
  • Study Sample: Group of adult women who recently joined a running group
  • Question: Population to which results can be generalized?
  • Answer: Adult women, if the data are randomly sampled

Anecdotal evidence and early smoking research

  • Anti-smoking research started in the 1930s and 1940s when cigarette smoking became increasingly popular. While some smokers seemed to be sensitive to cigarette smoke, others were completely unaffected.

  • Anti-smoking research was faced with resistance based on anecdotal evidence such as "My uncle smokes three packs a day and he's in perfectly good health", evidence based on a limited sample size that might not be representative of the population.

  • It was concluded that "smoking is a complex human behavior, by its nature difficult to study, confounded by human variability."

  • In time researchers were able to examine larger samples of cases (smokers), and trends showing that smoking has negative health impacts became much clearer.

Source: Brandt, The Cigarette Century (2009), Basic Books.

Sampling from a population: Census

  • Wouldn't it be better to just include everyone and "sample" the entire population?
  • This is called a census.
  • There are problems with taking a census:
    • It can be difficult to complete a census: there always seem to be some individuals who are hard to locate or hard to measure. And these difficult-to-find people may have certain characteristics that distinguish them from the rest of the population.
    • Populations rarely stand still. Even if you could take a census, the population changes constantly, so it's never possible to get a perfect measure.
    • Taking a census may be more complex than sampling.

Exploratory analysis to inference

  • Sampling is natural
  • Think about sampling something you are cooking — you taste (examine) a small part of what you're cooking to get an idea about the dish as a whole.
  • When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, that's exploratory analysis.
  • If you generalize and conclude that your entire soup needs salt, that's an inference.
  • For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).


  • If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot.
  • If you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.

Sampling bias

  • Non-response: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population.

  • Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the population.

Source: cnn.com, Jan 14, 2012

  • Convenience sample: Individuals who are easily accessible are more likely to be included in the sample.

Sampling bias example: Landon vs. FDR

A historical example of a biased sample yielding misleading results:

In 1936, Alf Landon ran as the Republican presidential candidate, opposing the re-election of FDR.

The Literary Digest Poll

  • The Literary Digest polled about 10 million Americans, and got responses from about 2.4 million.

  • The poll showed that Landon would likely be the overwhelming winner and FDR would get only 43% of the votes.

  • Election result: FDR won, with 62% of the votes.

  • The magazine was completely discredited because of the poll, and was soon discontinued.

The Literary Digest Poll – what went wrong?

  • The magazine had surveyed:
    • its own readers,
    • registered automobile owners, and
    • registered telephone users.
  • These groups had incomes well above the national average of the day (remember, this is Great Depression era) which resulted in lists of voters far more likely to support Republicans than a truly typical voter of the time, i.e. the sample was not representative of the American population at the time.

Large samples are preferable, but…

  • The Literary Digest election poll was based on a sample size of 2.4 million, which is huge, but since the sample was biased, the sample did not yield an accurate prediction.

  • Back to the soup analogy: If the soup is not well stirred, it doesn't matter how large a spoon you have, it will still not taste right. If the soup is well stirred, a small spoon will suffice to test the soup.
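The point that size does not cure bias can be illustrated with a small simulation. Everything numeric below is made up for illustration (population size, share of wealthier voters, support rates); it is not historical data. The sketch compares a very large sample drawn only from wealthier voters with a small simple random sample from the whole electorate.

```python
import random

rng = random.Random(1936)  # fixed seed so the sketch is reproducible

# Hypothetical electorate: every number here is an assumption.
POPULATION_SIZE = 200_000
SHARE_WEALTHY = 0.30
P_FDR_WEALTHY = 0.35   # assumed FDR support among wealthier voters
P_FDR_OTHER = 0.735    # assumed FDR support among everyone else

# Each voter is a pair (is_wealthy, votes_fdr).
population = []
for _ in range(POPULATION_SIZE):
    wealthy = rng.random() < SHARE_WEALTHY
    p_fdr = P_FDR_WEALTHY if wealthy else P_FDR_OTHER
    population.append((wealthy, rng.random() < p_fdr))


def fdr_share(people):
    """Fraction of the given voters who support FDR."""
    return sum(1 for _, votes_fdr in people if votes_fdr) / len(people)


true_share = fdr_share(population)

# Huge but biased sample: drawn only from wealthier voters.
wealthy_voters = [person for person in population if person[0]]
biased_sample = rng.sample(wealthy_voters, 40_000)

# Small simple random sample from the whole electorate.
random_sample = rng.sample(population, 1_000)

print(f"True FDR share:             {true_share:.3f}")
print(f"Biased sample (n = 40,000): {fdr_share(biased_sample):.3f}")
print(f"Random sample (n = 1,000):  {fdr_share(random_sample):.3f}")
```

Under these assumptions the small random sample should land much closer to the true share than the far larger biased one, which is the well-stirred small spoon from the soup analogy.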

Practice

A school district is considering whether it will no longer allow high school students to park at school after two recent accidents where students were severely injured. As a first step, they survey parents by mail, asking them whether or not the parents would object to this policy change. Of 6,000 surveys that go out, 1,200 are returned. Of these 1,200 surveys that were completed, 960 agreed with the policy change and 240 disagreed. Which of the following statements are true?

  I. Some of the mailings may have never reached the parents.
  II. The school district has strong support from parents to move forward with the policy approval.
  III. It is possible that the majority of the parents of high school students disagree with the policy change.
  IV. The survey results are unlikely to be biased because all parents were mailed a survey.

  (a) Only I
  (b) I and II
  (c) I and III
  (d) III and IV
  (e) Only IV

  • Solution: I and III
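A quick calculation shows why statement III is possible. The Python sketch below uses only the numbers from the problem and bounds the overall share of parents who agree by considering the two extreme cases for the 4,800 non-respondents. The variable names are illustrative.

```python
mailed = 6000
returned = 1200
agreed = 960
disagreed = 240

response_rate = returned / mailed      # only a fifth of surveys came back
non_respondents = mailed - returned    # parents whose opinion is unknown

share_agree_among_returned = agreed / returned

# Extreme case 1: every non-respondent disagrees with the change.
lowest_possible_agree = agreed / mailed
# Extreme case 2: every non-respondent agrees with the change.
highest_possible_agree = (agreed + non_respondents) / mailed

print(f"Response rate:              {response_rate:.0%}")
print(f"Agree among returned:       {share_agree_among_returned:.0%}")
print(f"Agree among all, lowest:    {lowest_possible_agree:.0%}")
print(f"Agree among all, highest:   {highest_possible_agree:.0%}")
```

With an 80% non-response rate, the true share of parents who agree could be anywhere from 16% to 96%, so the 80% agreement among returned surveys does not establish strong support.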

Explanatory and response variables

  • To identify the explanatory variable in a pair of variables, identify which of the two is suspected of affecting the other:

  • explanatory variable \(\xrightarrow{\text{might affect}}\) response variable

  • Labeling variables as explanatory and response does not guarantee the relationship between the two is actually causal, even if there is an association identified between the two variables. We use these labels only to keep track of which variable we suspect affects the other.

Observational studies and experiments

  • Observational study: Researchers collect data in a way that does not directly interfere with how the data arise, i.e. they merely "observe", and can only establish an association between the explanatory and response variables.

  • Experiment: Researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables.

  • If you're going to walk away with one thing from the second half of this class, let it be "correlation does not imply causation".
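The difference between the two designs can be illustrated with a small simulation. This is a sketch under made-up assumptions: a hidden confounder drives both the explanatory variable and the response, and the explanatory variable has no effect on the response at all. In the "observational" data the two are still clearly correlated; when the explanatory variable is instead assigned at random, the association disappears.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible
N = 5_000


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hidden confounder (hypothetical, e.g. overall health).
confounder = [rng.gauss(0, 1) for _ in range(N)]

# The response depends on the confounder only, never on the explanatory variable.
response = [z + rng.gauss(0, 1) for z in confounder]

# Observational study: the explanatory variable is also driven by the confounder.
observed_x = [z + rng.gauss(0, 1) for z in confounder]

# Experiment: the researcher assigns the explanatory variable at random.
assigned_x = [rng.choice([0, 1]) for _ in range(N)]

r_observational = pearson_r(observed_x, response)
r_experiment = pearson_r(assigned_x, response)

print(f"Observational correlation: {r_observational:.2f}")
print(f"Experimental correlation:  {r_experiment:.2f}")
```

Random assignment breaks the link between the confounder and the explanatory variable, which is why experiments can support causal conclusions and observational studies cannot on their own.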