Statistical analysis
Statistical analysis is used to calculate means and standard deviations, t-tests, and correlations between data sets. We do not cover this as a specific unit; instead, the information will be incorporated into our curriculum as well as your Internal Assessments. This information has been adapted from the old IB Biology curriculum.
Guidance for Statistical Analysis in Practical Work
1. Importance of Statistical Analysis
- Statistical analysis helps to determine the significance of experimental results, providing insights into the reliability and validity of findings.
- It allows students to make data-driven decisions and strengthens conclusions by quantifying differences and patterns.
- Mean and Standard Deviation: Used to calculate the central tendency and dispersion of data. Essential for understanding the spread and reliability of the data.
- T-Test: Compares means between two groups to assess if they are statistically different from each other.
- Chi-Squared Test: Assesses the association between categorical variables. Useful in genetics or categorical datasets.
- ANOVA (Analysis of Variance): For comparing the means of more than two groups to see if at least one mean is different. This is especially useful in ecological studies with multiple factors.
- Correlation and Regression: Used to determine relationships between variables and predict outcomes. Helpful in environmental science and ecological studies.
- Clearly define hypotheses before testing.
- Select an appropriate statistical test based on the research question and data type.
- Ensure correct use of software or manual calculations and follow guidelines on significance thresholds (e.g., p < 0.05 for significance).
Developing Statistical Hypotheses
- Purpose of Hypotheses in Statistical Analysis
- Hypotheses provide a clear focus for your investigation and define the relationships or differences you expect to find.
- They guide the choice of statistical tests and the interpretation of data.
- When to Develop a Hypothesis
- Before Data Collection: Hypotheses should be created during the planning phase of practical work, before data collection, to ensure unbiased data analysis.
- After Observing Patterns: If observations suggest trends, you can form hypotheses to test these patterns formally.
- Types of Hypotheses
- Null Hypothesis (H₀): Assumes no effect or no difference; it is what you test against.
- Example: "There is no difference in growth rate between plants in sunlight and shade."
- Alternative Hypothesis (H₁): Suggests an effect or difference exists.
- Example: "Plants in sunlight grow at a faster rate than those in shade."
- Key Considerations in Hypothesis Formation
- Measurability: Hypotheses must be based on measurable data and clear variables.
- Specificity: Be precise about the population or variables (e.g., age group, species, environmental condition).
- Relevance: Ensure hypotheses relate directly to your research question or aim.
- Examples of Hypothesis Statements
- Comparison: "There is a significant difference in the heart rates of students before and after exercise."
- Relationship: "There is a positive correlation between sunlight exposure and plant height."
- Final Tips
- Always define your hypothesis prior to analysis.
- Align your hypothesis with the aims of your study for a focused investigation.
Understanding the P-Value, Confidence Level, and Significance
What is a P-Value?
- The p-value is a probability metric that helps determine the statistical significance of your test results.
- It indicates the likelihood of obtaining results as extreme as the observed ones, assuming the null hypothesis (H₀) is true.
Interpreting the P-Value
- Low P-Value (p < 0.05): Strong evidence against the null hypothesis. You can reject H₀, suggesting a significant effect or difference.
- High P-Value (p ≥ 0.05): Weak evidence against the null hypothesis. You do not reject H₀, suggesting that observed differences might be due to chance.
- What It Tells You: Indicates whether observed results are likely due to chance but does not show the effect size or practical relevance.
Significance Threshold
- A common significance level is 0.05, meaning there’s a 5% chance that results are due to random variation. However, levels like 0.01 or 0.1 might also be used depending on the study.
Confidence Level:
- Reflects certainty in your results (e.g., a 95% confidence level means the true effect would appear within the same range in 95 out of 100 repeats).
- Higher confidence levels (e.g., 99%) suggest more certainty but require larger sample sizes.
Significance Level (α):
- The threshold for determining statistical significance, often set at 0.05.
- Determines whether you reject or fail to reject H₀, based on comparing the p-value to α.
Using the P-Value in Your Analysis
- Calculate the p-value after performing your statistical test (e.g., t-test, chi-square).
- Compare the p-value to your chosen significance level to determine if your results are statistically significant.
- Example 1: You test whether two groups have different means. If p = 0.03, it’s less than 0.05, so you reject H₀ and conclude a significant difference.
- Example 2: Testing for correlation, if p = 0.08, it’s greater than 0.05, so you fail to reject H₀, indicating no significant correlation.
Making Conclusions and Understanding Significance
- Statistically Significant ≠ Practically Important:
- A significant result (low p-value) does not mean the effect is large or meaningful outside statistical context.
- Consider the real-world implications and context when evaluating significance.
- Non-Significant Results Aren’t Always Unimportant:
- Non-significance might suggest small sample size or high variability, not necessarily the absence of any effect.
- Even without statistical significance, the observed trend might still have practical relevance.
- Discuss the potential impact of the findings, regardless of p-value, and consider reporting effect size and confidence intervals for a fuller analysis.
- When Results Are Not Statistically Significant:
- Do not assume no effect exists. Instead, consider sample size, potential experimental errors, and if further investigation is warranted.
- Reporting non-significant findings transparently adds value to the research by documenting potential influences and insights gained.
Applying Statistical Tests and Presenting Data
- Common Tests:
- Mean and Standard Deviation: For understanding data spread and variability.
- T-Test: To compare means between two groups.
- Chi-Squared Test: For assessing associations in categorical data.
- ANOVA: For comparing means across more than two groups.
- Correlation and Regression: For assessing relationships and predicting outcomes.
- Data Presentation:
- Use tables for organized data display with units, averages, and uncertainties.
- Visualize data with labeled graphs and error bars to clearly represent findings.
Correlation and Causation
One of the most common errors we find is the confusion between correlation and causation in science. In theory, these are easy to distinguish — an action or occurrence can cause another (such as smoking causes lung cancer), or it can correlate with another (such as smoking is correlated with alcoholism). If one action causes another, then they are most certainly correlated. But just because two things occur together does not mean that one caused the other, even if it seems to make sense.
Correlation describes the strength and direction of a linear relationship between two variables
Can be positive (y increases as x increases) or negative (y decreases as x increases)
Causation describes the relationship between two variables, where one variable has a direct effect on another
Correlation does not automatically indicate causation – just because two variables change in relation to one another, does not mean they are linked
E.g. CO2 levels and crime have both risen, but CO2 levels don't cause crime
Mean
The sum of all the data points divided by the number of data points.
Measure of central tendency for normally distributed data.
DO NOT calculate a mean from values that are already averages.
DO NOT calculate a mean when the measurement scale is not linear (e.g., pH units are measured on a logarithmic, not linear, scale).
Standard Deviation
Averages do not tell us everything about a sample. Samples can be very uniform, with the data all bunched around the mean, or they can be spread out a long way from the mean. The statistic that measures this spread is called the standard deviation. The wider the spread of scores, the larger the standard deviation. For data that has a normal distribution, 68% of the data lies within one standard deviation of the mean.
How to Calculate the Standard Deviation:
- Calculate the mean (x̅) of a set of data
- Subtract the mean from each point of data to determine (x-x̅). You'll do this for each data point, so you'll have multiple (x-x̅).
- Square each of the resulting numbers to determine (x-x̅)^2. As in step 2, you'll do this for each data point, so you'll have multiple (x-x̅)^2.
- Add the values from the previous step together to get ∑(x-x̅)^2. Now you should be working with a single value.
- Calculate (n-1) by subtracting 1 from your sample size. Your sample size is the total number of data points you collected.
- Divide the answer from step 4 by the answer from step 5
- Calculate the square root of your previous answer to determine the standard deviation.
- Report your standard deviation to the same precision as your raw data, so you may need to round your answer.
- The standard deviation has the same unit as the raw data you collected. For example, SD = +/- 0.5 cm.
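The steps above can be sketched in Python. This is a minimal illustration; the plant heights are made-up values, and the built-in `statistics.stdev()` performs the same n − 1 calculation if you want a cross-check.

```python
import math

def sample_std_dev(data):
    """Sample standard deviation, following the steps above (n - 1 divisor)."""
    n = len(data)
    mean = sum(data) / n                             # step 1: the mean
    squared_devs = [(x - mean) ** 2 for x in data]   # steps 2-3: (x - mean)^2
    total = sum(squared_devs)                        # step 4: sum of squares
    return math.sqrt(total / (n - 1))                # steps 5-7

# Hypothetical plant heights in cm
heights = [10.0, 12.0, 9.0, 11.0, 13.0]
sd = sample_std_dev(heights)  # about 1.58, so report SD = +/- 1.6 cm
```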
Where To Start
It can be very hard to know which statistical tests are the most relevant for your research question and your data. Here is a flow chart that helps guide you in the best direction.
Student t-Test
What is a Student’s T-Test?
- A Student’s T-Test is a statistical test used to determine if there is a significant difference between the means of two groups.
- It assesses whether the observed difference is likely due to chance or reflects a true effect in your data.
When to Use a T-Test
- Comparing Two Independent Groups: e.g., comparing plant growth in sunlight vs. shade.
- Paired (Dependent) T-Test: For comparing two measurements from the same group (e.g., before and after treatment in the same subjects).
- Conditions:
- Data is approximately normally distributed.
- Groups have similar variances (in cases where variances differ, consider using a Welch’s T-Test).
Types of T-Tests
- Independent (Two-Sample) T-Test: For comparing the means of two independent groups.
- Paired (Dependent) T-Test: For comparing means within the same group across two conditions (e.g., before and after intervention).
Important Considerations
- Effect Size: While p-values indicate significance, consider calculating effect size to understand the magnitude of the difference.
- Assumptions: Ensure data meets t-test assumptions, like normality and equal variances.
- Limitations: The t-test is sensitive to outliers, which can skew results, and is best suited for small sample sizes.
In any significance test, there are two possible hypotheses:
Null Hypothesis:
"There is not a significant difference between the two groups; any observed differences may be due to chance and sampling error."
Alternative Hypothesis:
"There is a significant difference between the two groups; the observed differences are most likely not due to chance or sampling error."
How to calculate T:
A p-value is the probability of concluding there is a significant difference between the groups when the null hypothesis is actually true (meaning, the probability of making the WRONG conclusion). In biology, we use a standard p-value of 0.05. This means that five times out of a hundred you would find a statistically significant difference between the means even if there was none.
- Calculate the mean (X) of each sample
- Find the absolute value of the difference between the means
- Calculate the standard deviation for each sample
- Square the standard deviation for each sample
- Divide each squared standard deviation by the sample size of its group.
- Add these two values
- Take the square root of the number to find the "standard error of the difference."
- Divide the difference in the means (step 2) by the standard error of the difference (step 7). The answer is your "calculated T-value."
- Determine the degrees of freedom (df) for the test. In the t-test, the degrees of freedom is the sum of the sample sizes of both groups minus 2.
- Determine the “Critical T-value” in a table by cross-referencing your df with the “p-value” of 0.05.
- Draw your conclusion:
If your calculated t value is greater than the critical T-value from the table, you can conclude that the difference between the means for the two groups is significantly different. We reject the null hypothesis and conclude that the alternative hypothesis is correct.
If your calculated t-value is lower than the critical T-value from the table, you can conclude that the difference between the means for the two groups is NOT significantly different. We fail to reject the null hypothesis.
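The calculation steps above can be sketched in Python. The two samples below are hypothetical leaf counts, and the critical value quoted in the comment is the standard two-tailed value for df = 6 at p = 0.05.

```python
import math

def students_t(sample1, sample2):
    """Calculated t for two independent samples, following the steps above."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1                                  # step 1: means
    mean2 = sum(sample2) / n2
    diff = abs(mean1 - mean2)                                  # step 2
    var1 = sum((x - mean1) ** 2 for x in sample1) / (n1 - 1)   # steps 3-4: SD squared
    var2 = sum((x - mean2) ** 2 for x in sample2) / (n2 - 1)
    se = math.sqrt(var1 / n1 + var2 / n2)                      # steps 5-7: standard error
    t = diff / se                                              # step 8: calculated t
    df = n1 + n2 - 2                                           # step 9: degrees of freedom
    return t, df

# Hypothetical leaf counts for plants in two light conditions
t, df = students_t([4, 5, 6, 5], [1, 2, 3, 2])
# Compare t to the critical T-value for df = 6 at p = 0.05 (2.447):
# t is about 5.20 > 2.447, so we would reject the null hypothesis here.
```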
ANOVA
What is ANOVA?
- ANOVA (Analysis of Variance) is a statistical test used to compare the means of three or more groups to determine if at least one group mean is significantly different from the others.
- It tells you whether any observed differences among group means are likely due to chance.
When to Use ANOVA
- Use ANOVA when comparing three or more independent groups (e.g., plant growth in different light conditions: full sunlight, partial shade, and full shade).
- Conditions for ANOVA:
- Data is approximately normally distributed within each group.
- Variances among groups are similar (homogeneity of variance).
- Samples are independent.
Types of ANOVA
- One-Way ANOVA: Compares means across one independent variable with multiple levels (e.g., testing plant growth across three types of light).
- Two-Way ANOVA: Compares means across two independent variables (e.g., testing effects of both light and water levels on plant growth).
- Repeated Measures ANOVA: Used when the same subjects are measured across multiple conditions (e.g., testing the same plants under different light levels over time).
Steps to Conduct One-Way ANOVA
- Step 1: State the hypotheses.
- Null Hypothesis (H₀): Assumes all group means are equal (e.g., “There is no difference in plant growth across different light conditions.”).
- Alternative Hypothesis (H₁): At least one group mean is significantly different from the others.
- Step 2: Choose the significance level (α), commonly set at 0.05.
- Step 3: Calculate the F-Statistic.
- The F-Statistic compares the variance between groups to the variance within groups.
- Formula for F-Statistic: F = variance between groups / variance within groups
- Step 4: Find the p-value.
- The p-value is based on the F-Statistic and degrees of freedom.
- A low p-value (p < α) indicates that at least one group mean is significantly different.
- Step 5: Post-Hoc Test (if significant).
- If ANOVA is significant, use a post-hoc test (e.g., Tukey’s HSD) to determine which specific group means differ from each other.
Reporting Results
- Report the F-statistic, degrees of freedom, and p-value.
- Example: “The one-way ANOVA indicated a significant difference in plant growth across light conditions (F(2, 27) = 4.76, p = 0.01). Post-hoc tests showed that plants in full sunlight grew significantly taller than those in full shade.”
Example Calculation (One-Way ANOVA)
- Data: Plant height in three groups (full sunlight, partial shade, full shade).
- Calculate:
- Group means and overall mean.
- Variance between groups and variance within groups.
- F-Statistic: F = (SS between / df between) / (SS within / df within)
- Where SS is the sum of squares, and df refers to degrees of freedom.
- Interpretation: If the p-value is below 0.05, conclude there is a significant difference between at least two groups.
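The F-statistic calculation above can be sketched in Python. This is a minimal illustration using made-up plant heights for three light conditions; it stops at F, and in practice you would look up the p-value for F with (df between, df within) degrees of freedom.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA, following the steps above."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # SS between / df between: variability of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    df_between = k - 1
    # SS within / df within: variability of data points around their own group mean
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_within = n_total - k
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical plant heights (cm) in full sunlight, partial shade, full shade
f = one_way_anova_f([[12, 14, 13], [10, 11, 12], [8, 9, 7]])
```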
Important Considerations
- Effect Size: Consider calculating effect size (e.g., eta-squared) to measure the magnitude of the difference among groups.
- Assumptions: Check for normality and homogeneity of variances (e.g., using Levene’s Test).
- Limitations: ANOVA does not tell you which groups differ; follow with post-hoc tests if needed.
After you have run a one-way ANOVA and found significant results, you should run Tukey’s HSD Test to find out which specific groups’ means differ. The test compares all possible pairs of group means.
Chi-square Tests
What is the Chi-Squared Test?
- The Chi-Squared (χ²) Test is a statistical test used to determine if there is a significant association between two categorical variables or if observed frequencies differ from expected frequencies.
- It is commonly used for genetics studies, survey data, and categorical data (e.g., testing if there’s an association between plant color and insect preference).
2. When to Use the Chi-Squared Test
- Goodness-of-Fit Test: To see if observed data fits an expected distribution (e.g., testing if a set of observed genetics data fits the Mendelian ratio of 3:1).
- Test of Independence: To determine if there’s an association between two categorical variables (e.g., testing if gender and course choice are related).
- Conditions for Using Chi-Squared:
- Data must be categorical (e.g., color, category, or group type).
- Sample size should be large enough (typically each expected frequency should be 5 or more).
- Observations must be independent.
3. Types of Chi-Squared Tests
- Chi-Squared Goodness-of-Fit Test: Compares observed frequencies to expected frequencies based on a known ratio or distribution.
- Chi-Squared Test of Independence: Examines the relationship between two categorical variables, usually presented in a contingency table.
4. Steps to Conduct a Chi-Squared Test
- Step 1: State the hypotheses.
- Null Hypothesis (H₀): Assumes no association between variables, or that observed frequencies match expected frequencies.
- Alternative Hypothesis (H₁): Suggests a significant association exists or that observed frequencies differ from expected ones.
- Step 2: Calculate Expected Frequencies.
- For goodness-of-fit, use known ratios or distributions.
- For independence, calculate expected frequencies for each cell in the table based on row and column totals.
- Step 3: Calculate the Chi-Squared Statistic: χ² = Σ((O − E)² / E), where O is each observed frequency and E is the corresponding expected frequency.
- Step 4: Determine Degrees of Freedom (df).
- Goodness-of-Fit: df = number of categories - 1.
- Test of Independence: df = (rows - 1) * (columns - 1).
- Step 5: Find the p-value.
- Compare the calculated χ² value to the critical value from the chi-square distribution table based on the df and chosen significance level (e.g., 0.05).
- Step 6: Interpret Results.
- If p < α, reject H₀, indicating a significant association or difference from the expected distribution.
5. Reporting Results
- Report the chi-square statistic, degrees of freedom, and p-value.
- Example: “The chi-squared test showed a significant association between plant color and insect preference (χ²(3) = 12.6, p = 0.01).”
6. Example Calculation (Goodness-of-Fit)
- Data: Observed frequencies of red, pink, and white flowers in a genetic cross.
- Calculate:
- Expected frequencies based on Mendelian ratio (e.g., 1:2:1 for a cross).
- χ² value using the formula above.
- Determine significance by comparing to chi-square critical values.
- Interpretation: If the p-value is below 0.05, conclude there’s a significant difference from the expected Mendelian ratio.
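The goodness-of-fit calculation can be sketched in Python. The flower counts below are hypothetical, and the critical value in the comment is the standard chi-square value for df = 2 at p = 0.05.

```python
def chi_squared(observed, expected):
    """Goodness-of-fit statistic: chi^2 = sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical cross of 100 flowers; a 1:2:1 ratio predicts 25 red : 50 pink : 25 white
observed = [30, 48, 22]
expected = [25, 50, 25]
chi2 = chi_squared(observed, expected)
# df = 3 categories - 1 = 2; the critical value at p = 0.05 is 5.99.
# chi2 is 1.44 < 5.99, so we would fail to reject H0: the data fit the 1:2:1 ratio.
```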
7. Important Considerations
- Sample Size: Chi-squared tests are sensitive to small expected frequencies; ensure expected counts are generally ≥ 5.
- Limitations: Chi-squared does not indicate the strength of association; it only tests for significance.
- Alternative Tests: If assumptions aren’t met, consider Fisher’s Exact Test for small sample sizes.
Pearson’s Correlation Coefficient:
1. What is Pearson’s Correlation Coefficient?
- Pearson’s Correlation Coefficient (r) measures the strength and direction of the linear relationship between two continuous variables.
- Values of r range from -1 to +1:
- r = +1: Perfect positive correlation (as one variable increases, the other increases proportionally).
- r = -1: Perfect negative correlation (as one variable increases, the other decreases proportionally).
- r = 0: No correlation.
2. When to Use Pearson’s Correlation Coefficient
- Use when assessing the linear relationship between two continuous, normally distributed variables (e.g., height and weight).
- Conditions for using Pearson’s correlation:
- Both variables should be continuous (e.g., height, temperature, concentration).
- Data should approximate a normal distribution.
- The relationship should be linear (check this by plotting the data on a scatterplot).
3. Interpreting Pearson’s r
- Strength of Correlation:
- 0.0 to ±0.3: Weak correlation.
- ±0.3 to ±0.7: Moderate correlation.
- ±0.7 to ±1.0: Strong correlation.
- Direction of Correlation:
- Positive value (+): Indicates a positive relationship (both variables increase together).
- Negative value (-): Indicates a negative relationship (one variable increases as the other decreases).
4. Steps to Calculate Pearson’s r
- Step 1: Organize your data into two sets of paired values (e.g., height and weight).
- Step 2: Calculate the mean of each variable (x̄ and ȳ).
- Step 3: Use the formula to calculate Pearson’s r: r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² × Σ(y − ȳ)²)
- Step 4: Find the significance of r by comparing it to critical values based on your sample size (n) or using software to obtain a p-value.
- Step 5: Interpret the results:
- If p < α (e.g., 0.05), the correlation is statistically significant.
5. Reporting Results
- Report r, sample size (n), and p-value.
- Example: “There was a significant positive correlation between temperature and enzyme activity (r = 0.75, n = 25, p < 0.01), indicating that as temperature increased, enzyme activity also increased.”
6. Example Calculation
- Data: Height and weight of a sample group.
- Calculate:
- Mean values for height and weight.
- Pearson’s r using the formula above.
- Interpret the correlation and test for statistical significance.
- Interpretation: If r = 0.75, this suggests a strong positive correlation between height and weight, indicating that as height increases, weight tends to increase.
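The calculation steps can be sketched in Python; the height and weight values below are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r, following the steps above:
    r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n          # step 2: means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)                # step 3

# Hypothetical heights (cm) and weights (kg) of a sample group
r = pearson_r([150, 160, 170, 180], [50, 55, 65, 70])
# r is close to +1: a strong positive correlation in this made-up sample
```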
7. Important Considerations
- Linearity: Pearson’s r only measures linear relationships. For curved relationships, consider other tests (e.g., Spearman’s rank correlation).
- Outliers: Extreme values can distort r, making it appear stronger or weaker than it is.
- Causation: Correlation does not imply causation. A significant r value suggests association, but not that one variable causes the other to change.
Regression Analysis
1. What is Regression Analysis?
- Regression analysis is a statistical method that examines the relationship between an independent variable (predictor) and a dependent variable (outcome).
- It helps determine how much the independent variable influences the dependent variable and can be used to make predictions.
2. When to Use Regression Analysis
- Use regression when you want to understand or predict the effect of one variable on another (e.g., how light intensity affects plant growth).
- Conditions for regression:
- The relationship between variables should be approximately linear.
- Both variables should be continuous, and data should be normally distributed.
- The independent variable (predictor) should cause or influence the dependent variable (outcome).
3. Key Terms in Regression Analysis
- Independent Variable (X): The predictor or explanatory variable.
- Dependent Variable (Y): The outcome or response variable.
- Slope (b): Indicates the rate of change in Y for a one-unit change in X.
- Intercept (a): The point where the regression line crosses the Y-axis, representing the value of Y when X is zero.
- R-Squared (R²): Measures the proportion of variation in Y explained by X. Ranges from 0 to 1:
- R² close to 1: Strong relationship.
- R² close to 0: Weak relationship.
4. Steps to Conduct Linear Regression
- Step 1: Plot the data on a scatterplot to ensure a linear relationship.
- Step 2: Calculate the regression equation: Y = a + bX
- Where a is the intercept and b is the slope.
- Step 3: Interpret the Slope (b).
- A positive b indicates that as X increases, Y also increases.
- A negative b indicates that as X increases, Y decreases.
- Step 4: Calculate the R-Squared (R²) value to understand the strength of the relationship.
- Step 5: Test for statistical significance.
- Find the p-value for the slope (b) to determine if the relationship is statistically significant.
NOTE: If your data has a linear relationship, you can progress to the Pearson Correlation Coefficient, which tells us the type of linear relationship (positive, negative, none) between two variables, as well as the strength of that relationship (weak = 0.0 to 0.29, moderate = 0.30 to 0.49, strong = 0.50+).
5. Reporting Results
- Report the regression equation, slope, intercept, R², and p-value.
- Example: “A linear regression analysis showed that light intensity significantly predicted plant growth (Y = 2.3 + 0.5X, R² = 0.68, p < 0.01), with an increase in light intensity resulting in greater plant growth.”
Note:
- If your regression analysis reveals a linear relationship, you can continue onto calculating the Pearson Correlation Coefficient.
- If your regression analysis reveals a monotonic relationship (this could be a polynomial, exponential or logistic relationship), you can continue onto calculating the Spearman Rank Correlation.
6. Example Calculation
- Data: Light intensity (X) and plant growth rate (Y).
- Calculate:
- Plot data and ensure a linear relationship.
- Use the least-squares method to find the slope (b) and intercept (a).
- Calculate R² to evaluate the strength of the relationship.
- Test for significance of the slope using a p-value.
- Interpretation: A significant positive slope (b = 0.5) suggests that as light intensity increases, plant growth rate also increases, explaining 68% of the variation in plant growth (R² = 0.68).
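The least-squares steps can be sketched in Python. The light-intensity and growth values below are made up for illustration; the sketch returns the intercept (a), slope (b), and R² but leaves the significance test for the slope to a table or software.

```python
def linear_regression(xs, ys):
    """Least-squares fit Y = a + bX with R-squared, following the steps above."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))           # slope
    a = mean_y - b * mean_x                              # intercept
    # R-squared: proportion of variation in Y explained by the fitted line
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return a, b, r_squared

# Hypothetical light intensity (X) vs plant growth (Y)
a, b, r2 = linear_regression([1, 2, 3, 4, 5], [2.1, 2.9, 4.2, 4.8, 6.0])
# A positive slope b with R-squared near 1 suggests light intensity
# explains most of the variation in growth for this made-up sample.
```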
7. Important Considerations
- Linearity: Only use linear regression when the relationship between variables is linear. For non-linear relationships, consider polynomial or non-linear regression.
- Extrapolation: Be cautious about predicting values outside the data range, as the relationship may not hold.
- Outliers: Outliers can heavily influence the regression line, so check for unusual points that may skew results.
- Causation: Regression shows association, but does not prove causation. Consider other factors that might influence the relationship.
Advanced Regression Techniques
1. Logistic Regression
- Purpose: Logistic regression is used to model the relationship between one or more predictor variables and a binary outcome (e.g., yes/no, present/absent).
- When to Use: Use logistic regression when the dependent variable is categorical with two possible outcomes (e.g., whether someone develops diabetes or not).
- Example: Researchers investigating how exercise and weight impact the probability of developing diabetes can use logistic regression. This method allows them to predict the likelihood of diabetes based on the predictor variables.
- Note: An online logistic regression calculator can be used to simplify computations. [Link to Calculator]
2. Multiple Linear Regression
- Purpose: Multiple linear regression finds the line of best fit for data with multiple independent variables (X1, X2, etc.) and one continuous dependent variable (Y).
- When to Use: Use when you have more than one predictor variable and want to see their combined effect on the dependent variable.
- Example: If you collect data on plant height, age, and the number of flowers, multiple regression allows you to predict the number of flowers based on both height and age.
- Note: An online multiple linear regression calculator can assist with calculations. [Link to Calculator]
3. Polynomial Regression
- Purpose: Polynomial regression is used when the relationship between predictor variables and a response variable is better represented by a curve rather than a straight line.
- When to Use: Use polynomial regression if a linear model does not adequately fit the data and a curved relationship is observed.
- Example: In the case of plant growth over time, if the growth curve is non-linear, polynomial regression can provide a better fit, as indicated by a higher R² value (e.g., 0.9749 versus 0.8928 for linear regression).
- Note: An online polynomial regression calculator can be used for ease of calculation. [Link to Calculator]
Mann Whitney U-Test
1. What is the Mann-Whitney U-Test?
- The Mann-Whitney U-Test is a non-parametric test that compares the differences between two independent groups when the data do not follow a normal distribution.
- It evaluates whether one group tends to have higher or lower values than the other, making it useful for comparing two independent groups with ordinal or non-normally distributed data.
2. When to Use the Mann-Whitney U-Test
- Use when comparing two independent groups (e.g., scores from two different classes).
- It is suitable for ordinal data (ranked data) or when data is not normally distributed and does not meet the assumptions for a t-test.
- Examples:
- Comparing exam scores between two different classrooms.
- Comparing plant growth between two soil types when data are skewed.
3. Assumptions of the Mann-Whitney U-Test
- Independence: Each group should consist of independent observations (no pairing or repeated measurements).
- Ordinal or Continuous Data: Data should be ordinal or continuous.
- Non-Normal Distribution: Use when data does not follow a normal distribution or sample sizes are small.
4. Steps to Conduct the Mann-Whitney U-Test
- Step 1: Organize data into two groups and rank all values across both groups from lowest to highest.
- Step 2: Sum the ranks for each group separately (R₁ and R₂).
- Step 3: Calculate the U statistic for Group 1 using the formula:
- U₁ = n₁n₂ + n₁(n₁ + 1)/2 − R₁
- where n₁ and n₂ are the sample sizes for each group and R₁ is the sum of ranks for Group 1.
- Calculate U₂ similarly, using R₂, the sum of ranks for Group 2.
- Step 4: The smaller of U1 and U2 is the Mann-Whitney U value.
- Step 5: Compare the U value to the critical value in the Mann-Whitney table or obtain a p-value to determine statistical significance.
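The steps above can be carried out in one call with SciPy's mannwhitneyu function (the exam scores below are invented for illustration):

```python
# Sketch of the ranking steps above using SciPy; exam scores are invented.
from scipy.stats import mannwhitneyu

class_a = [72, 85, 78, 90, 66, 81, 74, 88]
class_b = [60, 70, 65, 75, 58, 68, 62, 71]

# mannwhitneyu ranks all scores across both groups internally and returns
# the U statistic for the first sample with a two-sided p-value.
u_stat, p_value = mannwhitneyu(class_a, class_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```

This replaces the manual ranking and table lookup; if the p-value is below 0.05, the difference in ranks is statistically significant.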
5. Reporting Results
- Report the U statistic, sample sizes, and p-value.
- Example: “The Mann-Whitney U-Test indicated a significant difference in exam scores between Class A and Class B (U = 45, n₁ = 15, n₂ = 15, p = 0.03), suggesting that one class generally performed better than the other.”
6. Example Calculation
- Data: Exam scores from two independent classrooms.
- Calculate:
- Rank all scores across both classrooms.
- Sum ranks for each classroom and calculate U₁ and U₂.
- Compare the smaller U value to critical values or calculate the p-value.
- Interpretation: If the p-value is less than 0.05, you conclude there is a statistically significant difference between the two groups.
7. Important Considerations
- Non-Parametric: As a non-parametric test, the Mann-Whitney U-Test does not assume normality, making it ideal for non-normal or small datasets.
- Effect Size: Consider reporting effect size to show the strength of the difference between groups.
- Limitations: It only indicates if there’s a significant difference in ranks, not in the means of the groups.
Class Materials:
Error Analysis
Significant Figures
Precision Measurements and Uncertainties
Precision Lab
Topic 1 Statistics (ppt)
Biostatistics Practical Problems
Graphing in Excel
Graphing in Excel Practice problems
Standard Deviation (ppt)
Standard Deviation (notes)
Standard Deviation Practice problems
Hydroponics Standard Deviation Practice problems
t-Test (ppt)
t-Test (notes)
Correlation and Causation (ppt)
Correlation and Causation (notes)
Correlation reading
Correlations of cancer (pdf)
Data set #1 (pdf)
Data set #2 (pdf)
Data set #3 (pdf)
T-test reading
T-Testing in Biology University of
Statistics Review
Useful Links
Review of means
Click here for calculating SD with tools
Click here for Flash Card questions on Statistical Analysis
Click here for tips on Excel graphing.
“Using error bars in experimental Biology” by Geoff Cumming, Fiona Fidler, and David L. Vaux. (Journal of Cell Biology)
Are two sets of data really different? Click here to perform Student’s t-test
Click here to perform Student’s t-test via copy and paste
Example graph (from The Biology Teacher, September 2013)
Graphic Calculator Tour
Easy Calculation
Statistics calculator
MERLIN software for Excel
Chi-square calculator
Chi-square table
T-test calculator
Standard deviation reading
T-Test Table, Excel and calculations can be found here.
There are many statistical tools for establishing a statistically significant correlation. Read more here, or read an article about Cause and Correlation from Wisegeek here.
Difference Between Correlation and Causation article
Excellent Handbook of Biological Statistics from John McDonald
Basic Statistical Tools, from the Natural Resources Management Department
And The Little Handbook of Statistical Practice is very useful.
Sumanas statistics animations
Field Studies Council stats page, including the t-test
Open Door Website stats page and help with graphs and tables.
Making Population Pyramids on Excel
Spreadsheet Data Analysis Tutorial
Video over Table
Making Tables
This is an ecocolumn design you can use for the long-term IA project - from learner.org
Here’s another ecocolumn design you can use for the long-term IA project - from fastplants.org
Video Clips
Watch Hans Rosling’s brilliant Joy of Statistics here. For a short clip: