Difference in learning among students doing pen-and-paper homework compared to web-based homework in an introductory statistics course

A repeated crossover experiment comparing learning among students handing in pen-and-paper homework (PPH) with students handing in web-based homework (WBH) has been conducted. The system used in the experiments, the tutor-web, has been used to deliver homework problems to thousands of students in mathematics and statistics over several years. Since 2011, experimental changes have been made regarding how the system allocates items to students, how grading is done and the type of feedback provided. The experiment described here was conducted annually from 2011 to 2014. Approximately 100 students in an introductory statistics course participated each year. The main goals were to determine whether the above-mentioned changes had an impact on learning, as measured by test scores, and to compare learning among students doing PPH with students handing in WBH. The difference in learning between students doing WBH and PPH, measured by test scores, increased significantly from 2011 to 2014 with an effect size of 0.634. This is a strong indication that the changes made in the NAMEOFSYSTEM have a positive impact on learning. Using the data from 2014 alone, a significant difference in learning between WBH and PPH was detected with an effect size of 0.416, supporting the use of WBH as a learning tool.


Introduction
Enrolment to universities has increased substantially in the past decade in most OECD (Organisation for Economic Co-operation and Development) countries. In COUNTRYOFAUTHORS, the increase in tertiary level enrolment was 40% between 2000 and 2010 (OECD, 2013). This increase has resulted in larger class sizes at the University of COUNTRYOFAUTHORS, especially in undergraduate courses. As stated in Black and Wiliam (1998), several studies have shown firm evidence that innovations designed to strengthen the frequent feedback that students receive about their learning yield substantial learning gains. Providing students with frequent quality feedback is time consuming and in large classes this can be very costly. It is therefore important to investigate whether web-based homework (WBH), which does not require marking by teachers but provides feedback to students, can replace (at least to some extent) traditional pen-and-paper homework (PPH).
To investigate this, an experiment has been conducted over a four-year period in an introductory course in statistics at the University of COUNTRYOFAUTHORS. About 100 students participated each year. The experiment is a repeated crossover experiment, so the same students were exposed to both methods, WBH and PPH.
The learning environment NAMEOFSYSTEM (http://NAMEOFSYSTEM.net) used in the experiments has been under development during the past decade at the University of COUNTRYOFAUTHORS. Two research questions are of particular interest:
1. Have changes made in the NAMEOFSYSTEM had an impact on learning, as measured by test performance?
2. Is there a difference in learning, as measured by test performance, between students doing WBH and PPH after the changes made in the NAMEOFSYSTEM?
In this section, an overview of different learning environments in the context of the functionality of the NAMEOFSYSTEM is given (Section 1.1), focusing on how to allocate exercises (problems) to students. A literature review of studies conducted to investigate a potential difference in learning between WBH and PPH is given in Section 1.2, followed by a brief discussion about formative assessment and feedback (Section 1.3). Finally, a short description of the NAMEOFSYSTEM is given in Section 1.4.

Web-based learning environments
A number of web-based learning environments are available on the web, some open and free to use, others commercial products. Several types of systems have emerged, including learning management systems (LMS), learning content management systems (LCMS) and adaptive and intelligent web-based educational systems (AIWBES). The LMS is designed for planning, delivering and managing learning events, usually adding little value to the learning process and not supporting internal content processes, while the primary role of an LCMS is to provide a collaborative authoring environment for creating and maintaining learning content (Ismail, 2001). In AIWBES the focus is on the student. Such systems adapt to the needs of each and every student (Brusilovsky & Peylo, 2003), in contrast to many systems that are merely a network of static hypertext pages (Brusilovsky, 1999).
A number of web-based learning environments use intelligent methods to provide personalized content or navigation, such as the one described in Own (2006). However, only a few systems use intelligent methods for exercise item allocation (Barla et al., 2010). The use of intelligent item allocation algorithms (IAA) is, however, common practice in testing. Computerized Adaptive Testing (CAT) (Wainer, 2000) is a form of computer-based testing where the test is tailored to the examinee's ability level by means of Item Response Theory (IRT). IRT is the framework used in psychometrics for the design, analysis and grading of computerized tests to measure abilities (Lord, 1980). As Wauters, Desmet, and Van Den Noortgate (2010) argue, IRT is potentially a valuable method for adapting the item sequence to the learner's knowledge level. However, the IRT methods are designed for testing, not learning, and as shown in AUTHREF and AUTHREF the IRT models are not appropriate since they do not take learning into account. New methods for IAA in learning environments are therefore needed.
Several systems can be found that are specifically designed for providing content in the form of exercise items. Examples of systems providing homework exercises are the WeBWork system (Gage, Pizer, & Roth, 2002), ASSiSTments (Razzaq et al., 2005), ActiveMath (Melis et al., 2001), OWL (Hart, Woolf, Day, Botch, & Vining, 1999), LON-CAPA (Kortemeyer, Kashy, Benenson, & Bauer, 2008) and WebAssign (Brunsmann, Homrighausen, Six, & Voss, 1999). None of those systems use intelligent methods for item allocation; instead, a fixed set of items is submitted to the students or items are drawn randomly from a pool of items.

Web-based homework vs. pen-and-paper homework
A number of studies have been conducted to investigate a potential difference in learning between WBH and PPH. In most of the studies reviewed, no significant difference was detected (Bonham, Deardorff, & Beichner, 2003; Cole & Todd, 2003; Demirci, 2007; Gok, 2011; Kodippili & Senaratne, 2008; LaRose, 2010; Lenz, 2010; Palocsay & Stevens, 2008; Williams, 2012). In three of the studies reviewed, WBH was found to be more effective than PPH as measured by final exam scores. In the first study, described in Dufresne, Mestre, Hart, and Rath (2002), data was gathered in various offerings of two large introductory physics courses taught by four lecturers over a three-year period. The OWL system was used to deliver WBH. The authors found that WBH led to higher overall exam performance, although the difference in average gain for the five instructor-course combinations was not statistically significant. In the second paper, VanLehn et al. (2005) describe Andes, a physics tutoring system. The performance of students working in the system was compared to that of students doing PPH for four years. Students using the system did significantly better on the final exam than the PPH students. However, the study has one limitation: the two groups were not taught by the same instructors. Finally, Brewer and Becker (2010) describe a study in multiple sections of college algebra. The WBH group used an online homework system developed by the textbook publisher. The authors concluded that the WBH group generally scored higher on the final exam but that no significant difference existed between the mathematical achievement of the control and treatment groups, except in low-skilled students where the WBH group exhibited significantly higher mathematical achievement.
Even though most of the studies comparing WBH and PPH show no difference in learning, the fact that WBH students do not do worse than PPH students makes WBH a favourable option, especially in large classes where correcting PPH is very time consuming. Also, students' perception of WBH has been shown to be positive (Demirci, 2007; Hauk & Segalla, 2005; Hodge, Richardson, & York, 2009; LaRose, 2010; Roth, Ivanchenko, & Record, 2008; Smolira, 2008; VanLehn et al., 2005).
All the studies reviewed were conducted using a quasi-experimental design, i.e. students were not randomly assigned to the treatment groups. Either multiple sections of the same course were tested, where some sections did PPH while the other(s) did WBH, or the two treatments were assigned in different semesters. This could lead to some bias, e.g. due to differences in the student groups or lecturers participating in the two treatment arms of the experiments. The experiment described in this paper is a repeated randomized crossover experiment, so the same students were exposed to both WBH and PPH, resulting in a more accurate estimate of the potential difference between the two methods.

Assessment and feedback
Assessments are frequently used by teachers to assign grades to students (assessment of learning), but a potential use of assessment is to use it as a part of the learning process (assessment for learning) (J. Garfield et al., 2011). The term summative assessment (SA) is often used for the former and formative assessment (FA) for the latter. The concepts of feedback and FA overlap strongly and, as stated in Black and Wiliam (1998), the terms do not have a tightly defined and widely accepted meaning. Therefore, some definitions will be given below. Taras (2005) defines SA as "... a judgement which encapsulates all the evidence up to a given point. This point is seen as a finality at the point of the judgement" (p. 468) and about FA she writes "... FA is the same process as SA. In addition for an assessment to be formative, it requires feedback which indicates the existence of a 'gap' between the actual level of the work being assessed and the required standard" (p. 468). A widely accepted definition of feedback is then provided in Ramaprasad (1983): "Feedback is information about the gap between the actual level and the reference level of a system parameter which is used to alter the gap in some way" (p. 4). Stobart (2008) suggests making the following distinction between levels of feedback complexity: knowledge of results (KR), which only states whether the answer is incorrect or correct; knowledge of correct response (KCR), where the correct response is given when the answer is incorrect; and elaborated feedback (EF), where, for example, an explanation of the correct answer is given.

ACCEPTED MANUSCRIPT ACCEPTED MANUSCRIPT
The terms formative assessment, feedback and the distinction between the different types of feedback will be used here as defined above.
The functionalities of the system have changed considerably during the past decade. A pilot version, written only in HTML and Perl, is described in AUTHREF. A newer version, implemented in Plone (Nagle, 2010), is described in detail in AUTHREF. The newest version, described in AUTHREF, is a mobile-web site and runs smoothly on tablets and smart phones. Also, users do not need to be connected to the Internet when answering exercises, but only when downloading the item banks.
The NAMEOFSYSTEM is an LCMS including exercise item banks within mathematics and statistics. The system is open and free to use for everyone having access to the web. At the heart of the system is formative assessment. Intelligent methods are used for item allocation in such a way that the difficulty of the items allocated adapts to the student's ability level. Since the focus of the experiment described here is on the effect of doing exercises (answering items) in the system, only functionalities related to that will be described. A more detailed description of the NAMEOFSYSTEM is given in the above-mentioned papers.

Item allocation algorithm
In the systems used for WBH named in Section 1.1, a fixed set of items is allocated to students or items are drawn randomly, with uniform probability, from a pool of items. This was also the case in the first version of the NAMEOFSYSTEM. A better way might be to implement an IAA so that the difficulty of the items adapts to the student's ability. As discussed in Section 1.1, current IRT methods are not appropriate when the focus is on learning; therefore a new type of IAA has been developed using the following basic criteria:
• increase the difficulty level as the student learns
• select items so that a student can only complete a session with a high grade by completing the most difficult items
• select items from previous sessions to refresh memory.
Items are grouped into lectures in the NAMEOFSYSTEM, where each lecture covers a specific topic. This could be discrete distributions in material used in an introductory course in statistics or limits in a basic course in calculus. Within a lecture, the difficulty of an item is simply calculated as the ratio of incorrect responses to the total number of responses. The items are then ranked according to their difficulty, from the easiest item to the most difficult one.
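The difficulty calculation and ranking described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's code: the function names, the example item identifiers and response counts, and the tie-breaking rule are all assumptions made for the sketch.

```python
from typing import Dict, List, Tuple


def item_difficulty(incorrect: int, total: int) -> float:
    """Difficulty = ratio of incorrect responses to all responses."""
    if total <= 0:
        raise ValueError("an item needs at least one response")
    return incorrect / total


def rank_items(responses: Dict[str, Tuple[int, int]]) -> List[Tuple[int, str, float]]:
    """Return (rank, item_id, difficulty) triples; rank 1 is the easiest item.

    `responses` maps an item id to (incorrect responses, total responses).
    Ties are broken by item id, which is an assumption of this sketch.
    """
    scored = sorted(
        (item_difficulty(inc, tot), item_id)
        for item_id, (inc, tot) in responses.items()
    )
    return [(rank, item_id, diff) for rank, (diff, item_id) in enumerate(scored, start=1)]


# Hypothetical response counts for three items in one lecture.
lecture = {"item_a": (12, 40), "item_b": (30, 40), "item_c": (5, 50)}
ranking = rank_items(lecture)
```

Note that the rank discards the spacing between difficulties: in the hypothetical data, the items with difficulty 0.10 and 0.30 are as far apart in rank as those with difficulty 0.30 and 0.75.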
The implementation of the first criterion (shown above) has changed over the years. In the first version of the NAMEOFSYSTEM, all items within a lecture were assigned uniform probability of being chosen for every student. This was changed in 2012 with the introduction of a probability mass function (pmf) that calculates the probability of an item being chosen for a student. The pmf (Eq. 1) is exponentially related to the ranking of the item and also depends on the student's grade. In the pmf, q is a constant (0 ≤ q ≤ 1) controlling the steepness of the function, N is the total number of items belonging to the lecture, r is the difficulty rank of the item (r = 1, 2, ..., N), g is the grade of the student (0 ≤ g ≤ 1) and c is a normalizing constant, $c = \sum_{i=1}^{N} q^i$. Finally, m is a constant (0 < m < 1) chosen so that when g < m the pmf is strongly decreasing and the mass is mostly located at the easy items, when g = m the pmf is uniform, and when g > m the pmf is strongly increasing with the mass mostly located at the difficult items.
This was changed in 2013 in such a way that the mode of the pmf moves to the right with increasing grade, which is achieved by using a pmf based on the beta distribution (Eq. 2).
The last criterion for the IAA is related to how people forget. Ebbinghaus (1913) was one of the first to research this issue. He proposed the forgetting curve and showed in his studies that learning and the recall of learned information depend on the frequency of exposure to the material. It was therefore decided in 2012 to change the IAA in such a way that students are now occasionally allocated items from previous lectures to refresh memory.
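The behaviour of the exponential pmf introduced in 2012 can be illustrated with a short Python sketch. The mixture form used below is an assumption chosen only so that the stated properties hold (decreasing in rank when g < m, uniform when g = m, increasing when g > m, with normalizing constant c equal to the sum of q^i); it should not be read as the system's exact formula. The function names and the default values of q and m are hypothetical.

```python
import random
from typing import List


def exponential_pmf(n_items: int, grade: float, q: float = 0.8, m: float = 0.5) -> List[float]:
    """Probability of drawing the item with difficulty rank r = 1..n_items.

    Assumed form: a mixture of a geometric-type pmf and the uniform pmf,
    with mixing weight given by the distance between the grade and m.
    """
    if n_items < 1:
        raise ValueError("need at least one item")
    if not (0.0 < q <= 1.0 and 0.0 < m < 1.0 and 0.0 <= grade <= 1.0):
        raise ValueError("require 0 < q <= 1, 0 < m < 1 and 0 <= grade <= 1")
    ranks = range(1, n_items + 1)
    c = sum(q ** i for i in ranks)  # normalizing constant c = sum_{i=1}^{N} q^i
    uniform = 1.0 / n_items
    if grade <= m:
        w = (m - grade) / m  # weight on the decreasing part
        return [w * q ** r / c + (1.0 - w) * uniform for r in ranks]
    w = (grade - m) / (1.0 - m)  # weight on the increasing part
    return [w * q ** (n_items - r + 1) / c + (1.0 - w) * uniform for r in ranks]


def draw_item_rank(pmf: List[float], rng: random.Random) -> int:
    """Sample one difficulty rank (1-based) according to the pmf."""
    return rng.choices(range(1, len(pmf) + 1), weights=pmf, k=1)[0]


low = exponential_pmf(5, grade=0.0)   # mass concentrated on easy items
mid = exponential_pmf(5, grade=0.5)   # uniform, since grade equals m
high = exponential_pmf(5, grade=1.0)  # mass concentrated on difficult items
rank = draw_item_rank(high, random.Random(1))
```

Under these assumptions, a student whose grade rises past m sees the allocation shift from easy towards difficult items, which is the behaviour the first criterion asks for.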

Grading
Although the main goal of making the students answer exercises in the NAMEOFSYSTEM is learning, there is also a need to evaluate the students' performance. The students are permitted to continue to answer items until they (or the instructor) are satisfied, which makes grading a non-trivial issue.
In the first version of the NAMEOFSYSTEM, the last eight answers counted (with equal weight) towards the NAMEOFSYSTEM grade. Students were given one point for a correct answer and minus half a point for an incorrect one. The idea was that old sins should be forgotten when students are learning. This had some bad side effects, with students often quitting answering items after seven correct attempts in a row AUTHREF, which is a perfectly logical result since a student who has a sequence of seven correct answers and one incorrect answer will need another eight correct answers in sequence to increase the grade. The NAMEOFSYSTEM grade was also found to be a bad predictor of students' performance on a final exam, the grade being too high AUTHREF. It was therefore decided in 2014 to change the grading scheme (GS) and use min(max(n/2, 8), 30) items after n attempts when calculating the NAMEOFSYSTEM grade. That is, use a minimum of eight answers, then after eight answers use n/2, but no more than 30 answers. Using this GS, the weight of each answer is less than before (when n > 16, so that more than eight answers are used), thus reducing the fear of answering the eighth item incorrectly, while simultaneously making it more difficult for students to get a top grade since more answers are used when calculating the grade.
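The two grading schemes can be compared with a small Python sketch. The window rule min(max(n/2, 8), 30) and the points per answer are taken from the text; rounding n/2 down, dividing by the window size and flooring the grade at zero are assumptions of the sketch, and the function names are hypothetical.

```python
from typing import Sequence


def answers_counted(n_attempts: int) -> int:
    """Number of most recent answers used in the 2014 scheme.

    Implements min(max(n/2, 8), 30); rounding n/2 down is an assumption.
    """
    return min(max(n_attempts // 2, 8), 30)


def grade(answers: Sequence[bool], window: int) -> float:
    """Grade on a 0-1 scale from the most recent `window` answers.

    Each correct answer gives 1 point, each incorrect one -0.5 points.
    Dividing by the window size and flooring at 0 are assumptions.
    """
    recent = answers[-window:]
    points = sum(1.0 if correct else -0.5 for correct in recent)
    return max(points, 0.0) / window


# Old scheme: fixed window of eight. Seven correct answers, then one incorrect.
history = [True] * 7 + [False]
old_now = grade(history, 8)                         # 6.5 / 8
old_after_7 = grade(history + [True] * 7, 8)        # incorrect answer still in window
old_after_8 = grade(history + [True] * 8, 8)        # incorrect answer has dropped out

# New scheme: the window grows with the number of attempts.
new_after_8 = grade(history + [True] * 8, answers_counted(16))
new_after_40 = grade(history + [True] * 40, answers_counted(48))
```

In the old scheme the grade stays at 6.5/8 until eight further correct answers have pushed the incorrect one out of the window, which matches the quitting behaviour described above. In the new scheme the window widens as n grows, so each single answer carries less weight.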

Feedback
The quality of the feedback is a key feature in any procedure for formative assessment (Black & Wiliam, 1998). In the first version of the NAMEOFSYSTEM, only KR/KCR type feedback was provided. Sadler (1989) suggested that KR type feedback is insufficient if the feedback is to facilitate learning, so in 2012 an explanation was added to items in the NAMEOFSYSTEM item bank, thus providing students with EF. A question from a lecture covering inferences for proportions is shown in Figure 2. Here the student has answered incorrectly (marked in red). The correct answer is marked in green and an explanation is given below it.

Summary of changes in the NAMEOFSYSTEM
In the sections above, changes related to the IAA, grading and feedback were reviewed. A summary of the changes discussed is shown in Table 1.

The experiment conducted is a repeated randomized crossover experiment. The design of the experiment is shown in Figure 3.
Each year the class was split randomly into two groups. One group was instructed to do exercises in the NAMEOFSYSTEM in the first homework assignment (WBH) while the other group handed in written homework (PPH). The exercises on the PPH assignment and in the NAMEOFSYSTEM were similar and covered the same topics. Shortly after the students handed in their homework they took a test in class. The groups were crossed before the next homework, that is, the former WBH students handed in PPH and vice versa, and again the students were tested. Each year this procedure was repeated and the test scores from the four exams were registered. The students were not made aware of the experiment but were told that the groups were formed to manage the number of PPH assignments that needed to be corrected at a time. There were no indications that the students were aware of the experiment. The number of students taking each exam is shown in Table 2.
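The random split and crossing of groups described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the function name is hypothetical, and the sketch simply alternates the two homework types over the four assignments of a course, whereas the text does not say whether the class was re-randomized before the third assignment.

```python
import random
from typing import Dict, List, Optional, Sequence


def crossover_schedule(
    student_ids: Sequence[str],
    n_assignments: int = 4,
    seed: Optional[int] = None,
) -> Dict[str, List[str]]:
    """Randomly split students into two groups and alternate WBH/PPH.

    Half of the class (rounded down) starts with WBH, the rest with PPH;
    the groups are crossed before every subsequent assignment (assumption).
    """
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    starts_with_wbh = set(shuffled[: len(shuffled) // 2])

    schedule: Dict[str, List[str]] = {}
    for sid in student_ids:
        first, second = ("WBH", "PPH") if sid in starts_with_wbh else ("PPH", "WBH")
        schedule[sid] = [first if k % 2 == 0 else second for k in range(n_assignments)]
    return schedule


# Hypothetical class of 101 students.
students = [f"s{i}" for i in range(101)]
schedule = crossover_schedule(students, seed=1)
```

Because every student receives both treatments, each student serves as their own control, which is what allows the student effect to be separated from the homework effect in the analysis.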
To answer the first research question, stated in Section 1, the following linear mixed model is fitted to the data from 2011-2014 and nonsignificant factors removed:

$$g_{mlhyi} = \mu + \alpha_m + \beta_l + \gamma_h + \delta_y + (\alpha\gamma)_{mh} + (\beta\gamma)_{lh} + (\delta\gamma)_{yh} + s_i + \varepsilon_{mlhyi} \tag{3}$$

where g is the test grade, α is the math background (m = weak, strong), β is the lecture material (l = discrete distributions, continuous distributions, inference about means, inference about proportions), γ is the type of homework (h = PPH, WBH), δ is the year (y = 2011, 2012, 2013, 2014) and s is the random student effect ($s_i \sim N(0, \sigma_s^2)$). The interaction term (αγ) measures whether the effect of type of homework is different between students with strong and weak math background and (βγ) whether the effect of type of homework is different for the lecture material covered. The interaction term (δγ) is of special interest since it measures the effect of changes made in the NAMEOFSYSTEM during the four years of experiments.
To answer the second research question, only data gathered in 2014 is used and the following linear mixed model is fitted to the data:

$$g_{mlhi} = \mu + \alpha_m + \beta_l + \gamma_h + (\alpha\gamma)_{mh} + (\beta\gamma)_{lh} + s_i + \varepsilon_{mlhi} \tag{4}$$

with α, β, γ and s as above. If the interaction terms are found to be nonsignificant, the γ factor is of special interest since it measures the potential difference in learning between students doing WBH and PPH.

In addition to collecting the exam grades, the students answered a survey at the end of each semester. In total, 442 students responded to the surveys (121 in 2011, 88 in 2012, 131 in 2013 and 102 in 2014). Two of the questions are related to the use of the NAMEOFSYSTEM and the students' perception of WBH and PPH:
1. Do you learn by answering items in the tutor-web? (yes/no)
2. What do you prefer for homework? (PPH/WBH/Mix of PPH and WBH)

Results

Analysis of exam scores
In order to see which factors relate to exam scores, the linear mixed model in Eq. (3) was fitted to the exam score data using R (R Core Team, 2014). The lmer function in the lme4 package, which includes functions to fit linear and generalized linear mixed-effects models (Bates, Maechler, Bolker, & Walker, 2014), was used. The interaction terms (mh) and (lh) were found to be nonsignificant and were therefore removed from the model. This indicates that the effect of homework type does not depend on math background or on the lecture material covered. However, the (yh) interaction was found to be significant, implying that the effect of the type of homework is not the same during the four years. The resulting final model can be written as:

$$g_{mlhyi} = \mu + \alpha_m + \beta_l + \gamma_h + \delta_y + (\delta\gamma)_{yh} + s_i + \varepsilon_{mlhyi}$$

The estimates of the parameters and the associated t-values are shown in Table 3 along with p-values calculated using the lmerTest package (Kuznetsova, Brockhoff, & Christensen, 2013). Estimates of the variance components were $\hat{\sigma}_s^2 = 1.84$ and $\hat{\sigma}^2 = 3.33$. The reference group (included in the intercept) consists of students in the 2011 course with weak math background handing in PPH on discrete distributions. Residual plots revealed no violation of model assumptions, such as non-normal errors or random effects.

By looking at the estimate for the year2014:tw term, it can be seen that the difference between the WBH and PPH groups differs significantly between 2011 (the reference group) and 2014 (p = 0.012), indicating that the changes made to the NAMEOFSYSTEM had a positive impact on learning. The difference in effect size between WBH and PPH in 2011 and 2014 is 0.634. It should also be noted that the effect size of math background is large (1.680).
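The interpretation of the year-by-homework interaction can be made explicit with a short derivation. It assumes reference-level (treatment) coding with 2011 and PPH as reference levels, as the stated reference group suggests, so that $\gamma_{\mathrm{PPH}} = 0$ and $(\delta\gamma)_{2011,h} = (\delta\gamma)_{y,\mathrm{PPH}} = 0$; the symbol $\Delta_y$ is introduced here only for illustration.

```latex
\begin{aligned}
\Delta_y &= E[g \mid h = \mathrm{WBH},\, y] - E[g \mid h = \mathrm{PPH},\, y]
          = \gamma_{\mathrm{WBH}} + (\delta\gamma)_{y,\mathrm{WBH}}, \\
\Delta_{2014} - \Delta_{2011} &= (\delta\gamma)_{2014,\mathrm{WBH}} .
\end{aligned}
```

Under these assumptions the interaction coefficient for 2014 is a difference of differences: it measures how much larger the WBH advantage over PPH was in 2014 than in 2011, with math background and lecture material held fixed.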

In order to answer the second question, the model in Eq. (4) was fitted to the data from 2014. The interaction terms were both nonsignificant and were therefore removed from the model. The final model can be written as:

$$g_{mlhi} = \mu + \alpha_m + \beta_l + \gamma_h + s_i + \varepsilon_{mlhi}$$

The estimates of the parameters and the associated t- and p-values are shown in Table 4. Estimates of the variance components were $\hat{\sigma}_s^2 = 1.48$ and $\hat{\sigma}^2 = 2.84$. The reference group (included in the intercept) consists of students with weak math background handing in PPH on discrete distributions. By looking at the table it can be noted that the difference between the WBH and PPH groups is significant (p = 0.009) and the estimated effect size is 0.416, indicating that the students did better after handing in WBH than PPH. Again, the effect size of math background is large (1.379).

Analysis of student surveys
In general, the students' perception of the NAMEOFSYSTEM is very positive. In student surveys conducted over the four years, over 90% of the students feel they learn using the system.
Despite the positive attitude towards the system, about 80% of the students prefer a mixture of PPH and WBH over PPH or WBH alone.
It is interesting to look at the difference in perception over the four years shown in Figure 4. As stated above, the GS was changed in 2014, making it more difficult to get a top grade for homework in the system and more difficult than in PPH. This led to general frustration in the student group.
The fraction of students preferring to hand in only PPH, compared to WBH or a mix of the two, more than tripled compared to the previous years.

1. Have changes made in the NAMEOFSYSTEM had an impact on learning, as measured by test performance?
2. Is there a difference in learning, as measured by test performance, between students doing PPH and WBH after the changes made in the NAMEOFSYSTEM?
The experiment was conducted over four years in an introductory course on statistics. It is a repeated crossover experiment, so students were exposed to both methods, WBH and PPH.
The difference between the WBH and PPH groups was found to differ significantly between 2011 and 2014 (p = 0.012), indicating that the changes made to the NAMEOFSYSTEM have had a positive impact on learning as measured by test scores. The difference in effect size between WBH and PPH in 2011 and 2014 is 0.634. Several changes were made in the system between 2011 and 2014, as shown in Table 1. As can be seen in the table, the changes are somewhat confounded, but moving from uniform probability to the pmf shown in Eq. 2 when allocating items, allocating items from old material to refresh memory, changing the grading scheme so that min(max(n/2, 8), 30) items count in the grade instead of eight, providing EF instead of KR/KCR type feedback and having a mobile version appear to have had a positive impact on learning.
To answer the second research question, only data gathered in 2014 was used. The difference between the WBH and PPH groups was found to be significant (p = 0.009) with effect size 0.416, indicating that the students did better after handing in WBH than PPH. In both models the effect size of math background was large (1.680 and 1.379).
The NAMEOFSYSTEM project is an ongoing research project and the NAMEOFSYSTEM team will continue to work on improvements to the system. Improvements related to the exercise items concern the quality of items and feedback, the grading scheme (GS) and the item allocation algorithm (IAA).

Quality of items and feedback
As pointed out in J. B. Garfield (1994), it is important to have items that require student understanding of the concepts, not only test skills in isolation from a problem context. It is therefore important to have items that encourage deep learning rather than surface learning (Biggs, 1987).
One goal of the NAMEOFSYSTEM team is to collect metadata for each item in the item bank.
One classification of the items will reflect how deep an understanding is required using e.g. the Structure of the Observed Learning Outcomes (SOLO) taxonomy (Biggs & Collis, 1982). According to SOLO the following three structural levels make up a cycle of learning. "Unistructural: The learner focuses on the relevant domain, and picks one aspect to work with. Multistructural: The learner picks up more and more relevant or correct features, but does not integrate them. Relational: The learner now integrates the parts with each other, so that the whole has a coherent structure and meaning" (p. 152).
In addition to the SOLO framework, to reflect the difficulty of items in statistics courses, items could also be classified based on the cognitive statistical learning outcomes suggested by delMas (2002), J. Garfield and Ben-Zvi (2008) and J. Garfield and delMas (2010). These learning outcomes have been defined as (J. Garfield & Franklin, 2011): "Statistical literacy, understanding and using the basic language and tools of statistics. Statistical reasoning, reasoning with statistical ideas and making sense of statistical information. Statistical thinking, recognizing the importance of examining and trying to explain variability and knowing where the data came from, as well as connecting data analysis to the larger context of a statistical investigation" (pp. 4-5). Items measuring these concepts could be ranked in hierarchical order in terms of difficulty, starting with statistical literacy items as the least difficult and ending with the most difficult items, measuring statistical thinking.

Grading scheme
The GS used in a learning environment such as the NAMEOFSYSTEM influences the behaviour of the students AUTHREF. The GS used in the NAMEOFSYSTEM was changed in 2014, eliminating some problems but introducing a new one: the students found it unfair. The following criteria will be used to develop the GS further. The GS should:
• entice students to continue to request items, thus learning more
• reflect current knowledge well
• be fair in students' minds.
Currently a new grading scheme is being implemented. Instead of giving equal weight to the items used to calculate the grade, newer items are given more weight using a formula in which l is the lagged item number (l = 1 being the most recent item answered), α is the weight given to the most recent answer, $n_g$ is the number of answers included in the grade and s is a parameter controlling the steepness of the function. Some weight functions for a student who has answered 30 items are shown in Figure 5. As can be seen in the figure, the newest answers get the most weight and older answers ("old sins") get less.
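A weight function with the properties described above can be sketched in Python. The functional form below is an assumption made for illustration (the most recent answer gets weight α and the remaining weight 1 − α is spread over older answers in proportion to (1 − l/(n_g + 1))^s); it is not necessarily the formula being implemented. Function names and default parameter values are hypothetical.

```python
from typing import List, Sequence


def answer_weights(n_g: int, alpha: float = 0.3, s: float = 1.0) -> List[float]:
    """Weights for lags l = 1..n_g, where l = 1 is the most recent answer.

    Assumed form: weight alpha for l = 1; for l > 1 the remaining mass
    (1 - alpha) is shared in proportion to (1 - l / (n_g + 1)) ** s.
    """
    if n_g < 1:
        raise ValueError("need at least one answer")
    if not 0.0 < alpha <= 1.0 or s < 0.0:
        raise ValueError("require 0 < alpha <= 1 and s >= 0")
    if n_g == 1:
        return [1.0]
    raw = [(1.0 - lag / (n_g + 1)) ** s for lag in range(2, n_g + 1)]
    total = sum(raw)
    return [alpha] + [(1.0 - alpha) * x / total for x in raw]


def weighted_grade(answers: Sequence[bool], weights: Sequence[float]) -> float:
    """Weighted share of correct answers; `answers` is ordered oldest to newest."""
    recent_first = list(reversed(answers))[: len(weights)]
    return sum(w for w, correct in zip(weights, recent_first) if correct)


weights = answer_weights(30, alpha=0.3, s=2.0)
# 29 correct answers followed by a single, most recent, incorrect one.
recent_mistake = weighted_grade([True] * 29 + [False], weights)
# A single early mistake followed by 29 correct answers.
old_mistake = weighted_grade([False] + [True] * 29, weights)
```

With weights of this kind an early mistake costs far less than a recent one, which is one way of letting "old sins" fade while still using many answers in the grade.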
The students will be informed of their current grade as well as what their grade will be if they answer the next item correctly, to entice them to continue requesting items. Studies investigating the effect of the new GS will be conducted in 2016-2017.

Item allocation algorithm
In the current version of the IAA, the items are ranked according to difficulty level, calculated as the ratio of incorrect responses to the total number of responses. This is, however, not optimal since the ranking places the items at equal distances apart on the difficulty scale, regardless of how much their difficulties actually differ. A solution to this problem could be to use the ratio of incorrect responses to the total number of responses directly in the IAA instead of the ranking. Another solution would be to implement a more sophisticated method for estimating the difficulty of the items using IRT but, as mentioned earlier, those methods are designed for testing, not learning. However, it would be interesting to extend the IRT models by including a learning parameter, which would make the models more suitable in a learning environment. Finally, it is of interest to investigate formally the impact of allocating items from old material to refresh memory.

Figure 1: The different probability mass functions used in the item allocation algorithm. Left: uniform. Middle: exponential. Right: beta.

Figure 2: A question from a lecture on inferences for proportions. The students are informed what the correct answer is and shown an explanation of the correct answer.

Figure 3: The design of the experiment. The experiment was repeated four times, from 2011 to 2014.

Figure 4: Results from the student survey. Left: "Do you learn from the NAMEOFSYSTEM?". Right: "What is your preference for homework?".

Table 1: Summary of changes in the NAMEOFSYSTEM.

The data used for the analysis was gathered in an introductory course in statistics at the University of COUNTRYOFAUTHORS from 2011 to 2014. Every year some 200 first-year students in chemistry, biochemistry, geology, pharmacology, food science, nutrition, tourism studies and geography were enrolled in the course. The course was taught by the same instructor over the timespan of the experiment. About 60% of the students had already taken a course in basic calculus the semester before, while the rest of the students had a much weaker background in mathematics. Around 60% of the students were female and 40% male. The students needed to hand in homework four times during the course. The subjects of the homework were: discrete distributions, continuous distributions, inference about means and inference about proportions. The students were told at the beginning of the course that there would be several in-class tests during the semester, but they were not told how many, at what time points or on which topics they would be examined. The final grade in the course consisted of four parts: the final exam (50%), the four homework assignments (10%), in-class tests (15%) and assignments in the statistical software R (25%).

Table 2: Number of students taking the tests.

Table 3: Parameter estimates for the final model used to answer research question 1. The reference group consists of students in the 2011 course with weak math background handing in PPH on discrete distributions. Grades were given on the 0-10 scale.

Table 4: Parameter estimates for the final model used to answer research question 2. The reference group (included in the intercept) consists of students with weak math background handing in PPH on discrete distributions. Grades were given on the 0-10 scale.