Abstract
Extracting an individual’s scientific knowledge is essential for improving educational assessment and understanding cognitive tasks in engineering activities such as reasoning and decision-making. However, knowledge extraction is an almost impossible endeavor if the domain of knowledge and the available observational data are unrestricted. The objective of this paper is to quantify individuals’ theory-based causal knowledge from their responses to given questions. Our approach uses directed-acyclic graphs (DAGs) to represent causal knowledge for a given theory and a graph-based logistic model that maps individuals’ question-specific subgraphs to question responses. We follow a hierarchical Bayesian approach to estimate individuals’ DAGs from observations. The method is illustrated using 205 engineering students’ responses to questions on fatigue analysis in mechanical parts. In our results, we demonstrate how the developed methodology provides estimates of the population-level DAG and of the DAGs for individual students. This dual representation is essential for remediation since it allows us to identify parts of a theory that a population or individual struggles with and parts they have already mastered. An added benefit of the method is that it enables predictions about individuals’ responses to new questions based on the inferred individual-specific DAGs. This predictive capability has implications for the descriptive modeling of human problem-solving, a critical ingredient in sociotechnical systems modeling.
1 Introduction
The use of scientific knowledge is prominent in engineering education and design. Engineering students use the theoretical knowledge of mechanics of materials, thermodynamics, and control engineering to devise mechanical and electrical components. Designers use scientific knowledge to extrapolate from experiments to real-world applications. Having the ability to quantify individuals’ scientific knowledge can advance both engineering education practices and design research. First, this ability would make it possible to assess students’ knowledge accurately [1] and to develop personalized educational support tools [2,3]. Second, scientific knowledge is an essential ingredient of engineering design expertise necessary for design problem framing and problem-solving [4,5]. A descriptive decision-making model incorporating prior knowledge can better understand how designers carry out inductive and deductive reasoning tasks [6–8]. Furthermore, quantifying individuals’ knowledge structures is essential for understanding design cognition, expert-novice behaviors, and systems that mimic human problem-solving [9].
The representation of scientific knowledge requires quantifying causal knowledge about specific relationships among the concepts that make up a theory. There is a need for approaches that extract such detailed causal knowledge from individuals’ responses. Primary methods in student response modeling (e.g., three-parameter logistic model in item response theory (IRT) [10]) represent student knowledge using a single node, the so-called “ability” [11–13]. In such methods, a small number of parent nodes (e.g., ability, educational history, family background) predict whether a student will succeed or fail in an exam. But modeling the ability using a single parameter is not adequate for situations where a large amount of domain knowledge is required, such as in engineering education. Recent advances in student modeling propose dynamic Bayesian networks to explicitly model prerequisite skill hierarchies and yield meaningful instructional policies [14–16]. Nevertheless, these approaches still lack the methods for statistical inference of individual-specific Bayesian networks from observed exam responses.
To address this knowledge gap, the objective of this paper is to quantify individuals’ theory-based causal knowledge from their responses to given questions. As the objective suggests, the paper focuses on a specific type of scientific knowledge, called theory-based causal knowledge. Theory-based causal knowledge relies on widely accepted principles for explaining physical phenomena, where relations between physical variables are governed by causal relations. Our approach builds on directed-acyclic graphs (DAGs), i.e., graphs with directed links whose paths form no cycles, to represent causal knowledge. For example, the DAG in Fig. 1 (far-left) shows the causal knowledge associated with the distortion energy theory of static failure. There are six physical variables, X = {F, G, Sy, M, σ, ny}, represented as nodes in the graph. Variable F represents the loading applied to a mechanical component with geometry G. The external loading F causes the component to develop internal moment M, as shown using a link directed from F to M. The internal moment M further induces normal stress σ. Normal stress σ and yield strength Sy are used to calculate the yield factor of safety ny. Figure 1 also shows a schematic of the proposed knowledge extraction methodology. We assume an ideal DAG dictates how different physical variables are interconnected for a given theory. Then, we assume that each individual has a unique DAG, not directly observable, and we model the probability that causal relations are correctly identified (prior). Given an individual’s graph, we devise a graph-based logistic (GrL) model that maps question-specific subgraphs of an individual’s DAG to the probability of responding correctly to the given question (likelihood). In particular, we assume that the probability of a correct response to a question is proportional to the fraction of question-specific causal links from the true DAG that an individual knows correctly.
Finally, we use hierarchical Bayesian inference to estimate the posterior over individuals’ DAGs conditioned on the observed responses to different questions. The advantage of the hierarchical Bayesian inference methodology is that it can infer the population- and individual-level uncertainty in model parameters when few observed responses are available. This paper provides multiple improvements in the Bayesian methodology compared to our previous work in Ref. [17]. These improvements are noted throughout the paper.
We illustrate the approach by modeling the DAGs of undergraduate engineering students about the theory of fatigue failure of mechanical components. The dataset includes questions testing the students’ knowledge about internal stresses, the endurance limit, adjustments to the endurance limit, and the factor of safety against fatigue failure [18].
The results from the study highlight the merits of our approach for quantification of individual-specific causal knowledge as well as for predictions of individuals’ responses to unseen questions. Our approach enables the identification of parts of a theory that a subject struggles with and the features they have already mastered, both essential for providing individual feedback for personalized education. Moreover, our approach makes it possible to draw inferences from an individual’s estimated knowledge structure to new situations based on the same theory. Such extrapolation is needed because there are many different problems for a given theory that individuals can solve, and circumstances never repeat themselves perfectly.
The organization of this paper is as follows. In Sec. 2, we review related work and provide the necessary background on item response theory. Section 3 provides mathematical details of DAGs, a graph-based logistic model, and the hierarchical Bayesian inference approach. In Sec. 4, we describe the experimental dataset. In Sec. 5, we present our results and highlight the salient features of the method. Finally, we discuss the implications of these results from the engineering education and design perspectives. Section 6 summarizes the key conclusions.
2 Related Work
2.1 Knowledge Representation in Engineering Design.
Domain-specific knowledge can be structured in different ways, e.g., causal relations, taxonomies, rules, and procedural knowledge [19–21]. Many studies undertake computational approaches for representing knowledge of design processes and design artifacts, e.g., in product systems design [22,23]. The goal of these computational studies is to discover generalized and specialized product knowledge from design databases to support the development of tools for improved analogical design. Dong and Sarkar [24] represent complex products and processes as matrices where nodes are product elements and cells are structural, functional, or behavioral relationships between nodes. Then, they derive generalized design knowledge as the macroscopic level information from matrix representations using singular value decomposition. With the goal of quantifying a product’s innovativeness in terms of component-level decisions, Rebhuhn et al. [25] represent the product design process as a hierarchy of product, function, and components. They use multi-agent models to propagate novelty scores of products down to the component level. Siddharth et al. [26] develop an engineering knowledge graph by aggregating entities and their relationships from a patent database. Fu et al. [27] analyze the US patent database and discover different structural forms such as hierarchy and ring. Despite these developments, computational approaches for representing and estimating an individual’s theory-driven causal knowledge are lacking. The proposed methodology addresses this gap by modeling theory-specific causal knowledge as a probabilistic causal graph and estimating person-specific causal graphs using Bayesian inference.
2.2 Background on Item Response Theory.
Item response theory describes fundamental principles for formal assessment of individuals’ characteristics [10,28,29]. IRT is based on the relationship between an individual’s performance on test items and the individual’s overall ability characteristics, which the test is designed to measure. Several statistical models are used to model individual characteristics and test items. IRT-based models have two standard components: (i) they represent an individual’s ability in terms of single- or multidimensional latent parameters and (ii) they represent the probability of correct response as a monotonically increasing function of ability. This function of ability is sometimes called the item characteristic curve. An example application of IRT is the force concept inventory which tests the Newtonian concepts along six dimensions such as kinematics, impetus, active force, action–reaction pairs, concatenation of influence, and other effects such as centrifugal forces [30].
For example, the three-parameter logistic (3PL) model characterizes each question l using the following parameters:
Problem discrimination αl: This measures how the probability of answering a question correctly changes with ability.
Problem difficulty βl: This measures problem difficulty based on the ability required to get the correct answer. A problem that requires a higher ability to solve has a greater difficulty.
Pseudo-guessing parameter cl: This accounts for the probability of getting a correct answer by guessing in a multiple-choice question and is one over the number of choices.
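To make the role of the three parameters concrete, the following minimal sketch evaluates the standard 3PL item characteristic curve from the IRT literature. The formula itself is not reproduced in this section, so the functional form below and the illustrative parameter values are assumptions rather than quantities taken from the paper.

```python
import math


def three_pl_probability(theta, alpha_l, beta_l, c_l):
    """Standard 3PL curve: probability of a correct response given ability theta."""
    logistic = 1.0 / (1.0 + math.exp(-alpha_l * (theta - beta_l)))
    # c_l sets the lower asymptote (guessing); the logistic term covers the rest.
    return c_l + (1.0 - c_l) * logistic


# Illustrative values only (not estimates from the paper).
p_at_difficulty = three_pl_probability(theta=0.5, alpha_l=2.0, beta_l=0.5, c_l=0.25)
p_low_ability = three_pl_probability(theta=-3.0, alpha_l=2.0, beta_l=0.5, c_l=0.25)
```

When the ability equals the difficulty, the probability lies halfway between the guessing level and one; for very low ability it approaches the guessing level cl.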
Building on IRT, multidimensional IRT (MIRT) models represent an individual’s ability using more than one dimension [31]. In such models, a vector of independent dimensions replaces a unidimensional ability parameter. However, there are limitations to applying MIRT when measuring interconnected ability characteristics. MIRT models assume that all ability dimensions are required for answering any question correctly, and the probability of a correct response increases with every dimension (monotonicity assumption). MIRT models do not allow for the possibility that a question may require only a subset of ability dimensions to answer correctly. They also assume that the responses to different questions are uncorrelated and based on independent ability dimensions (local independence assumption). This assumption does not allow us to make predictions of answers to unseen questions.
IRT models mainly represent unidimensional or independent multidimensional ability parameters. They do not define more complex, interconnected ability characteristics, such as a knowledge graph. The graphical ability representation requires only a subset of the most relevant ability dimensions to explain an individual’s performance on a question. Also, current IRT models can only predict responses for test questions used while training the model. The question-specific parameters in the item characteristic curve do not allow predictions to be made on unseen questions. There is also little work in the literature on how we can accurately infer multidimensional ability from test responses. The presented approach addresses the above limitations of the existing IRT models in the following manner: (i) it imposes a theory-specific structure on individual-specific ability, which is essential because the nature of human knowledge is best represented through the strong constraints of domain knowledge [8,32,33] and (ii) it incorporates Bayesian statistical inference to estimate the individual-specific ability dimensions and represents uncertainty in these estimates. Section 3 presents mathematical details of this approach.
3 Methodology
The proposed method involves the following steps for representing individuals’ theory-based causal knowledge: (i) define a structure over the true causal knowledge for a given theory, (ii) model a priori uncertainty in how much individuals know of the true causal scientific knowledge, (iii) define the relationship between an individual’s knowledge and the probability of correct response to theory-related questions, and (iv) characterize the posterior probability over individuals’ causal scientific knowledge conditional on the observed question responses.
3.1 Representing Causal Knowledge As Directed Acyclic Graph.
We use DAGs to represent causal relationships in a scientific theory. DAGs are graphs consisting of directed links connecting pairs of causally related physical variables with the additional requirement that there are no cyclic paths of directed links. The DAG is an abstraction of structural equations, which may include a diverse set of mathematical models, computer algorithms, etc. [34]. In such structural equations, some variables are inputs, and some are outputs, and the interpretation is that the input variables cause the output variables. These causal relationships are represented with directed links, putting aside the specific equations. For example, Fig. 5 shows a simplified graphical representation of the causal relationships between variables in the stress-based theory of fatigue failure.
An important assumption is that the causal knowledge being modeled is propositional (i.e., the person knows a functional relationship between two variables) rather than procedural (i.e., the person knows a rule) [15]. Furthermore, the physical variables in a DAG can take any real value. Still, the causal links between physical variables are binary, i.e., a link either exists or does not exist. This simple representation still lets us quantify the effects of individuals’ knowledge on their responses to theory-related questions.
We also assume that a true knowledge graph, including a set of physical variables and their causal links, is completely known for the theory under study. Subject matter experts (e.g., experienced engineers or teachers) or a prior knowledge database (e.g., records of theory-based experiments) can help to construct such true causal knowledge [12]. Representation of scientific theories in DAGs is feasible because DAGs involve directed links, and their connected paths are acyclic. Every directed link connecting two variables assumes that the starting variable is the cause and the ending variable is the effect. The nonexistence of cyclic paths ensures that a variable cannot be its own cause. Note that any feedback loops may be represented by appropriately expanding the DAG in time. Mathematically, let X = {x1, x2, …, xN} be the set of physical variables relevant to a given scientific theory. The true knowledge graph for a specific theory is an N × N binary matrix, $K^{\mathrm{True}} = [k^{\mathrm{True}}_{ij}]$, where $k^{\mathrm{True}}_{ij}$ is 1 if the variable xi is a direct cause of xj and 0 otherwise.
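As an illustration of this matrix representation, the sketch below encodes the static-failure example from the Introduction as a binary adjacency matrix and checks that it is acyclic. Only the four links stated explicitly in the text are encoded; the links of the geometry variable G are not spelled out there, so G is left unconnected here, which is an assumption of this sketch. The helper name `is_acyclic` is hypothetical.

```python
import numpy as np

variables = ["F", "G", "Sy", "M", "sigma", "ny"]
index = {name: i for i, name in enumerate(variables)}

# (cause, effect) pairs stated in the text; G's links are omitted (assumption).
links = [("F", "M"), ("M", "sigma"), ("sigma", "ny"), ("Sy", "ny")]

K_true = np.zeros((len(variables), len(variables)), dtype=int)
for cause, effect in links:
    K_true[index[cause], index[effect]] = 1  # entry (i, j) = 1: x_i directly causes x_j


def is_acyclic(adjacency):
    """Kahn-style check: repeatedly remove nodes with no incoming links."""
    remaining = set(range(adjacency.shape[0]))
    while remaining:
        rows = sorted(remaining)
        sources = {j for j in remaining if adjacency[rows, j].sum() == 0}
        if not sources:
            return False  # every remaining node has a parent, so a cycle exists
        remaining -= sources
    return True


# Adding a link from ny back to F closes a cycle F -> M -> sigma -> ny -> F.
K_cyclic = K_true.copy()
K_cyclic[index["ny"], index["F"]] = 1
```

Here `is_acyclic(K_true)` holds, while the modified matrix `K_cyclic` fails the check.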
3.2 Modeling Individuals’ Directed Acyclic Graphs and Their Relationship to the Correctness of Responses.
We assume that a person’s knowledge graph is always a subgraph of the true knowledge graph of the theory. This means that if the theory has no direct link from xi to xj, then a person does not make up such a link. This ensures that a subgraph is acyclic if the true graph is acyclic. We can only test an individual’s knowledge in intersection with the true knowledge graph. If the individual has the wrong knowledge graph, they will get the wrong answer. Without the constraint of the true knowledge graph, there will be N(N − 1)/2 possible links and $2^{N(N-1)/2}$ possible knowledge graphs for N known variables. The inference of the individual-specific knowledge graph then becomes intractable even for moderate N.
To quantify a priori uncertainty, we assign a prior probability measure over the space of knowledge graphs. Hierarchical Bayesian modeling further allows us to represent the causal knowledge of each individual and the population in terms of parameters of the prior distribution (hyperparameters). Then, a graph-based logistic model quantifies the effect of individuals’ causal knowledge on their responses to theory-related questions. Figure 3 represents the plate-notation diagram for the proposed hierarchical Bayesian model.
3.2.1 Prior Over Knowledge Graphs of Individuals.
To capture known correlations between different causal links and reduce the number of parameters, we impose an additional assumption on the link probabilities. Namely, we assume that some link probabilities are identical based on an expert belief about whether or not they require the knowledge of the same structural equations. For instance, Fig. 5 represents a knowledge graph in which different subgroups of variables are enclosed in separate boxes. We assume that different directed links connecting variables between two fixed subgroups require the knowledge of the same causal relationships. Then those links are assigned the same link probability. For example, the probability of all links between Marin Factors and variable Se is equal.
3.2.2 Hyperprior Over the Population-Level Knowledge Graph.
This model uses a different prior distribution for every link in the knowledge graph and a noncentered parameterization for representing the hyperpriors. In contrast, our previous work [17] defined the same prior distribution over each link and used a centered parameterization for hyperpriors. The rationale for these changes is that separate prior distributions allow us to better study the correlation between individuals’ link-specific abilities, and the noncentered parameterization improves posterior exploration.
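The prior and hyperprior equations are not reproduced in this section, so the following is only a sketch of what a noncentered parameterization with group-shared link probabilities could look like. The logit-normal form, the hyperparameter values, and the group label `stress_links` are all assumptions for illustration; only the grouping of the five Marin-factor links into Se follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_individuals = 5

# Number of links sharing one probability per group.
# "marin_to_Se" follows the text; "stress_links" is a hypothetical group.
links_per_group = {"marin_to_Se": 5, "stress_links": 4}

# Hypothetical population-level hyperparameters on the logit scale.
mu = {"marin_to_Se": -0.5, "stress_links": 1.0}
sigma = {"marin_to_Se": 0.8, "stress_links": 0.8}


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


link_probability = {}
individual_links = {}
for group, n_links in links_per_group.items():
    # Noncentered form: sample a standard normal offset, then shift and scale it,
    # instead of sampling directly from Normal(mu, sigma).
    z = rng.standard_normal(n_individuals)
    link_probability[group] = sigmoid(mu[group] + sigma[group] * z)
    # Each link in the group is a Bernoulli draw with the individual's shared probability.
    draws = rng.random((n_individuals, n_links))
    individual_links[group] = draws < link_probability[group][:, None]
```

Sampling the standardized offset `z` separately from the population mean and spread is what distinguishes the noncentered form from the centered one; the two describe the same distribution but can differ in how easily a sampler explores the posterior.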
3.2.3 The Likelihood of Correct Responses by Individuals.
In contrast to IRT, our model requires detailed knowledge about the subgraph of the true knowledge graph that each question tests. Each question involves a set of input variables and an output variable to be evaluated. A question using multiple output variables may be divided into separate questions, each with a single output variable. A person answers the question by providing a value of the output variable. The knowledge relevant to answering question l is the part of the knowledge graph that connects the input variables to the output variable. Mathematically, we can get the relevant subgraph from the knowledge graph using an N × N reduction matrix Ql, whose cell value ql,ij is 1 if variables xi and xj both belong to the set of variables relevant to question l and 0 otherwise. Then, the true knowledge subgraph for question l is the Hadamard product (elementwise product) of the reduction matrix Ql and the true knowledge graph KTrue, denoted as $K^{\mathrm{True}}_{l} = Q_l \circ K^{\mathrm{True}}$. In matrix $K^{\mathrm{True}}_{l}$, the entries for irrelevant variables have been replaced by zeros. Furthermore, we assume that the rth individual’s response to question l depends only on the relevant subgraph $K_{r,l} = Q_l \circ K_r$, where $K_r$ denotes the rth individual’s knowledge graph.
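The reduction step can be sketched as follows for the stress-related part of the fatigue graph, using the causal links listed in Table 1. This is a minimal illustration: the helper names `reduction_matrix` and `links_of` are hypothetical, and the set of relevant variables for a question is assumed to contain both its input variables and its output variable.

```python
import numpy as np

variables = ["F", "G", "Kf", "M", "sigma_o", "sigma"]
index = {name: i for i, name in enumerate(variables)}

# Stress-related links as listed in Table 1 (cause, effect).
true_links = [("F", "M"), ("G", "M"), ("G", "sigma_o"),
              ("M", "sigma_o"), ("sigma_o", "sigma"), ("Kf", "sigma")]

K_true = np.zeros((len(variables), len(variables)), dtype=int)
for cause, effect in true_links:
    K_true[index[cause], index[effect]] = 1


def reduction_matrix(relevant_variables):
    """Q_l: entry (i, j) is 1 only if both x_i and x_j are relevant to the question."""
    mask = np.zeros(len(variables), dtype=int)
    mask[[index[name] for name in relevant_variables]] = 1
    return np.outer(mask, mask)


def links_of(matrix):
    """Return the set of (cause, effect) pairs encoded in a binary matrix."""
    return {(variables[i], variables[j]) for i, j in zip(*np.nonzero(matrix))}


# Question 2: inputs F and G, output M.
Q2 = reduction_matrix(["F", "G", "M"])
K_true_Q2 = Q2 * K_true  # Hadamard (elementwise) product

# Question 1: inputs F, G, M and output sigma_o.
K_true_Q1 = reduction_matrix(["F", "G", "M", "sigma_o"]) * K_true
```

For question 2 the product retains only the links (F, M) and (G, M), and for question 1 it retains four links, in agreement with the corresponding rows of Table 1.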
Similar to the 3PL model in IRT, we have the following parameters to represent the problem-related effects when modeling the probability of correct response:
Slope parameter α: The sensitivity of the probability of correct answer to the normalized problem-specific ability ϕ.
Threshold parameter β: The minimum fraction of correctly identified links required to answer correctly with probability greater than 0.5. This parameter quantifies the relevance of selected question-specific links for correctly answering a given set of questions.
Pseudo-guessing parameter cl: The probability of correct answer by guessing alone. This parameter is dependent on how many choices are available and how responses are evaluated. Generally, the pseudo-guessing probability equals one over the number of possible answers.
Slip parameter s: This accounts for the possibility that an individual may know the right causal graph but may not be able to use it correctly, for reasons such as limited available information, unknown grading criteria, or numerical errors.
We call this likelihood function the GrL model. Parameters α and β are invariant across different questions because each question’s explanatory quantity ϕ is normalized. Moreover, global parameters α and β are meant to quantify the suitability of the questions for testing the true knowledge graph, or equivalently, the utility of the true knowledge graph for answering the questions. A large slope α would signify an abrupt change in the correctness probability and, thus, significant sensitivity to the fraction of correctly identified links. Additionally, a threshold β closer to 1 would imply that individuals need to know all the relevant links to answer the given questions correctly. Finding that α is of the order of 10 and β is close to 1 would indicate that the true knowledge graph and given questions are perfectly compatible with each other. For illustration, Fig. 4 visualizes the probability of answering question l correctly as a function of the fraction of perfectly matched links.
The GrL model assumes that all relevant links have equal importance in answering a question. Other models assign weights to relevant links for quantifying their relative importance [1], which introduces additional model parameters. But the GrL model focuses on quantifying individuals’ causal knowledge and does not infer relationships between a scientific theory and given questions. To quantify errors due to incorrect identification of relevant links, the GrL model allows for the possibility that knowing a fraction of relevant links can result in a correct answer (through threshold β). Therefore, the GrL model is an extension of models requiring all relevant links for a correct answer (the AND-type influence) and models requiring at least one link for a correct answer (the OR-type influence) [15].
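The likelihood equation is not reproduced in this section, so the sketch below shows only one functional form that is consistent with the parameter descriptions above: a logistic curve in the fraction ϕ of relevant links known, with lower asymptote cl and upper asymptote 1 − s. Both this form and the numerical values are assumptions, and the function names are hypothetical.

```python
import math


def link_fraction(known_links, relevant_links):
    """phi: fraction of a question's relevant links that the individual knows."""
    relevant = set(relevant_links)
    return len(relevant & set(known_links)) / len(relevant)


def grl_probability(phi, alpha, beta, c_l=0.0, s=0.0):
    """Assumed GrL-style curve: logistic in phi, bounded below by c_l and above by 1 - s."""
    logistic = 1.0 / (1.0 + math.exp(-alpha * (phi - beta)))
    return c_l + (1.0 - s - c_l) * logistic


# Relevant links of question 1 in Table 1; the individual is missing (M, sigma_o).
q1_links = [("G", "M"), ("G", "sigma_o"), ("F", "M"), ("M", "sigma_o")]
known = [("G", "M"), ("G", "sigma_o"), ("F", "M")]

phi_q1 = link_fraction(known, q1_links)
p_partial = grl_probability(phi_q1, alpha=10.0, beta=0.9)
p_full = grl_probability(1.0, alpha=10.0, beta=0.9)
```

With cl = 0 and s = 0, the assumed curve passes through 0.5 exactly at ϕ = β, which matches the description of the threshold parameter. Letting β approach 1 with a large α approximates AND-type behavior, whereas a small β approximates OR-type behavior.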
3.3 Conditioning Individuals’ Directed Acyclic Graphs on Observed Responses.
Next, a posterior distribution over causal links in a DAG quantifies the uncertainty about an individual’s causal knowledge, given the observed question responses.
3.3.1 Posterior Over Individuals’ Knowledge Graphs.
3.3.2 Procedure for Sampling From the Posterior.
4 Dataset
An anonymized dataset for training and testing the proposed model was collected from the responses to questions in a final exam of an undergraduate machine design course. Note that the exam was not explicitly designed for this paper; instead, it was a part of an observational study. The dataset consists of responses to 13 questions by 205 undergraduate mechanical engineering students. The exam tested the students’ aggregated knowledge about the concepts of fatigue failure analysis using a circular shaft design problem. The students did not receive monetary incentives for their participation; however, because this was the final exam, they were motivated to achieve the best possible grade. The questions tested each student’s domain-specific knowledge, with an overall goal of estimating the factor of safety against fatigue failure. Refer to the Appendix for the problem statement and a list of questions provided to the students during the exam.
The questions were intended to test the knowledge of causal relationships between variables shown in Fig. 5. Here variable F represents the external loading applied to the steel shaft with geometry G, which is operated at room temperature T. The external loading, F, causes the bar to develop bending moment M. Variable R is the reliability requirement for the bar. The ultimate tensile strength Sut is a material property. The theoretical endurance limit Se′ is defined in terms of the ultimate tensile strength Sut using empirical relations [18]. The nominal stress σo is adjusted by multiplying with the fatigue stress-concentration factor for bending Kf. The adjusted stresses are shown as σ. The endurance limit Se′ is adjusted through multiplication by Marin Factors for different surface finish conditions, size, loading, temperature, and various factors. This adjusted endurance limit is denoted as Se. Finally, the factor of safety is shown as nf.
Each question included input variables (design parameters) and expected the students to calculate an output variable. Table 1 summarizes the input variables, output variables, and the relevant causal links for all 13 questions. For illustration, consider question 2 and question 9, which are highlighted using loosely spaced dashes and densely spaced dots, respectively, in Fig. 5. In question 2, the subjects are required to calculate the bending moment Mmax in terms of the force Fa. To answer this question correctly, the subjects need to know how the external loading F causes a bar with geometry G to develop internal loads (bending moment) M. Therefore, for question 2, the bending moment M becomes the output variable, and force F and geometry G become the input variables. Similarly, for question 9, nodes Se′, ka, kb, kc, kd, and ke are the input variables and endurance limit Se becomes the output variable. The input variables cause the output variable, and the answer to a given question depends on the parent nodes for that question.
Question | Design parameters | Output parameters | Relevant causal links |
---|---|---|---|
Q1 | F, G, M | σo | (G, M), (G, σo), (F, M), (M, σo) |
Q2 | F, G | M | (G, M), (F, M) |
Q3 | G, M, σo, Kf | σ | (G, M), (G, σo), (M, σo), (σo, σ), (Kf, σ) |
Q4 | Sut | Se′ | (Sut, Se′) |
Q5 | G, Sut | ka | (G, ka), (Sut, ka) |
Q6 | F, G | kb | (F, kb), (G, kb) |
Q7 | R | ke | (R, ke) |
Q8 | F, T | kc, kd | (F, kc), (T, kd) |
Q9 | Se′, ka, kb, kc, kd, ke | Se | (Se′, Se), (ka, Se), (kb, Se), (kc, Se), (kd, Se), (ke, Se) |
Q10 | σ, Se | nf | (σ, nf), (Se, nf) |
Q11 | F, G | M | (G, M), (F, M) |
Q12 | G, M, σo, Kf | σ | (G, M), (G, σo), (M, σo), (σo, σ), (Kf, σ) |
Q13 | F, G, M | σo | (G, M), (G, σo), (F, M), (M, σo) |
5 Results and Discussion
The results include posterior estimates of model parameters and checks for model accuracy for both the 3PL model and the GrL model. Specifically, the following four parts constitute the results: (i) model checking where we analyze the model fit, (ii) comparing the estimates of individual-specific aggregate ability with the observed exam score, (iii) evaluating question difficulty in terms of estimated model parameters, and (iv) analyzing individuals’ causal knowledge in terms of estimated directed acyclic graphs. The analysis uses the training dataset and testing dataset as two partitions of the exam questions to perform model checking. The responses to questions Q1 to Q10 form the training dataset, whereas the responses to questions Q11, Q12, and Q13 form the testing dataset. Note that the relevant links for the testing dataset should be a subset of the relevant links for the training dataset.
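The subset requirement on the testing questions can be verified directly from Table 1. The sketch below transcribes the relevant causal links of each question (with ASCII names such as `sigma_o` and `Se_prime` standing in for σo and Se′) and checks that every link tested by Q11 to Q13 also appears among the links tested by Q1 to Q10.

```python
# Relevant causal links per question, transcribed from Table 1 as (cause, effect) pairs.
stress_1 = {("G", "M"), ("G", "sigma_o"), ("F", "M"), ("M", "sigma_o")}
moment = {("G", "M"), ("F", "M")}
stress_2 = {("G", "M"), ("G", "sigma_o"), ("M", "sigma_o"),
            ("sigma_o", "sigma"), ("Kf", "sigma")}

table_1_links = {
    "Q1": stress_1,
    "Q2": moment,
    "Q3": stress_2,
    "Q4": {("Sut", "Se_prime")},
    "Q5": {("G", "ka"), ("Sut", "ka")},
    "Q6": {("F", "kb"), ("G", "kb")},
    "Q7": {("R", "ke")},
    "Q8": {("F", "kc"), ("T", "kd")},
    "Q9": {("Se_prime", "Se"), ("ka", "Se"), ("kb", "Se"),
           ("kc", "Se"), ("kd", "Se"), ("ke", "Se")},
    "Q10": {("sigma", "nf"), ("Se", "nf")},
    "Q11": moment,
    "Q12": stress_2,
    "Q13": stress_1,
}

training_questions = [f"Q{i}" for i in range(1, 11)]
testing_questions = ["Q11", "Q12", "Q13"]

training_links = set().union(*(table_1_links[q] for q in training_questions))
testing_links = set().union(*(table_1_links[q] for q in testing_questions))

# Links tested only in the held-out questions; empty if the requirement holds.
untrained_links = testing_links - training_links
```

For this split, `untrained_links` is empty, so every link needed for the testing questions is informed by at least one training question.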
In both the GrL and the 3PL models, we set the pseudo-guessing parameter to zero, cl = 0. The pseudo-guessing probability cl is not estimated from observations; rather, it represents the intrinsic uncertainty in randomly guessing the correct answer. Of course, this uncertainty depends on how many choices are available and how responses are evaluated. In our case, the dataset consists of written responses graded by humans. Given that the problems can be answered in an arbitrary way, the number of possible answers is infinite and thus the probability of guessing the answer correctly is zero.
We train both models using the NUTS sampler of the PyMC3 library in a Python environment [38]. Posterior parameter samples are computed on Dell compute clusters with two 64-core AMD Epyc 7662 “Rome” processors (128 cores per node) and 256 GB of memory. The computational time for the GrL model, with reparameterization of the binary link variable kr,ij, is approximately 180 min for 60,000 iterations. Per iteration, this is significantly faster than the case without reparameterization, for which 2000 iterations take approximately 118 min. The computational time for the 3PL model is about 16 min for 60,000 iterations. The computational times were averaged over four separate runs.
5.1 Checking Model Accuracy.
To estimate the predictive accuracy of the 3PL model and the GrL model, we use three separate approaches: (i) using an information criterion, specifically the Watanabe–Akaike information criterion (WAIC) [39], for finding the in-sample deviance with an adjustment for the number of model parameters, (ii) using posterior predictive checks to perform visual verification of how close the models’ predictions are to observed responses (for both training and testing datasets) and to calculate test quantities such as the total number of correct answers, and (iii) using prediction accuracy scores, which represent the fraction of model predictions that exactly match the observed training and testing data.
The in-sample WAIC estimates suggest that the GrL model can better represent the observed training data than the 3PL model. Table 2 presents the values of WAIC, pWAIC, and standard error (SE) for WAIC computations. The lower the WAIC, the better the predictive accuracy. These results indicate that the GrL model incurs a larger complexity penalty (a higher pWAIC value) than the 3PL model, but the overall WAIC for the GrL model (638.86) is still lower than that of the 3PL model (1026.18). This implies that the additional model complexity of the GrL model is justified, at least according to WAIC. Note that the model fit in this work is better than in our previous work [17]: the WAIC score of the present model, 638, is lower than that of the previous model, 721, even with a much larger number of model parameters.
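For readers unfamiliar with how WAIC and pWAIC relate, the following sketch computes both from a matrix of pointwise log-likelihoods using the commonly used deviance-scale definition, in which lower values are better. It is not the routine used to produce Table 2, and the input values are made up for illustration.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def sample_variance(values):
    """Unbiased sample variance; defined as 0 for a single value."""
    if len(values) < 2:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def waic(log_lik):
    """log_lik[s][i] is the log-likelihood of observation i under posterior draw s."""
    n_obs = len(log_lik[0])
    columns = [[draw[i] for draw in log_lik] for i in range(n_obs)]
    lppd = sum(log_mean_exp(col) for col in columns)      # log pointwise predictive density
    p_waic = sum(sample_variance(col) for col in columns)  # effective number of parameters
    return -2.0 * (lppd - p_waic), p_waic


# Made-up log-likelihoods for 3 posterior draws and 2 observations.
example_log_lik = [
    [math.log(0.8), math.log(0.6)],
    [math.log(0.7), math.log(0.5)],
    [math.log(0.9), math.log(0.7)],
]
example_waic, example_p_waic = waic(example_log_lik)
```

The penalty term grows with the posterior variability of each observation's log-likelihood, which is why a model with many more parameters can have a larger pWAIC and still achieve a lower overall WAIC if it fits the data sufficiently better.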
Model | WAIC | SE | pWAIC | #Parameters |
---|---|---|---|---|
Three-parameter logistic | 1026.18 | 24.51 | 58.60 | 231 |
Graph-based logistic | 638.86 | 32.23 | 245.20 | 3299 |
According to the posterior predictive checking on the training dataset in Fig. 6, both the GrL model and the 3PL model seem to match the observed response patterns. This result is further supported in Fig. 7, where both models adequately explain the total number of correct responses. Bayesian p-values close to 0.5 signify that about half of the posterior samples are greater than the observed test quantity; see Chap. 6 of [40]. For questions Q3 and Q10 in the 3PL model posterior samples (Fig. 6), the prediction accuracy is low. This highlights the 3PL model’s ineffectiveness in predicting responses to questions based on a single ability parameter θ.
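A Bayesian p-value of this kind can be computed as the fraction of replicated test quantities that are at least as large as the observed one. The sketch below uses made-up replicated totals of correct responses; neither the numbers nor the function name come from the paper.

```python
def bayesian_p_value(replicated_quantities, observed_quantity):
    """Fraction of posterior predictive replicates at or above the observed test quantity."""
    exceed = sum(1 for value in replicated_quantities if value >= observed_quantity)
    return exceed / len(replicated_quantities)


# Hypothetical total numbers of correct responses in eight replicated datasets.
replicated_totals = [1490, 1512, 1478, 1530, 1505, 1498, 1521, 1485]
observed_total = 1503

p_value = bayesian_p_value(replicated_totals, observed_total)
```

A value near 0.5 means the observed quantity sits in the middle of the replicated distribution, whereas values near 0 or 1 indicate that the model systematically over- or under-predicts the quantity.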
To investigate the model accuracy at the level of individual students, we look at the predictive accuracy score. Suppose sl is an individual’s observed response to question l; then the average predictive accuracy score is the fraction of predicted responses that match the observed responses, $\frac{1}{ML}\sum_{m=1}^{M}\sum_{l=1}^{L}\mathbb{1}\left(\hat{s}_l^{(m)} = s_l\right)$. Here $\hat{s}_l^{(m)}$ is the mth of M samples from the posterior, L is the number of questions, and $\mathbb{1}(\cdot)$ is an indicator function which is 1 if $\hat{s}_l^{(m)}$ equals sl and 0 otherwise. Figure 8 shows histograms of the student population’s average predictive accuracy score on the training dataset. Under the 3PL model, the average predictive accuracy score is 80% or higher for 84% of the students. In contrast, under the GrL model, the average predictive accuracy score is 90% or higher for over 95% of the student population.
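As a minimal sketch of this score for one student, assume the posterior predictions are stored as a 0/1 array with one row per posterior sample and one column per question. The function name `predictive_accuracy` and the toy arrays are hypothetical.

```python
import numpy as np

def predictive_accuracy(pred_samples, observed):
    """Fraction of sampled predictions (rows: samples, columns: questions) matching the observed responses."""
    pred_samples = np.asarray(pred_samples)
    observed = np.asarray(observed)
    # Broadcasting compares every sampled prediction with the observed response.
    return float(np.mean(pred_samples == observed))

# Toy example: two posterior samples, three questions.
pred_samples = [[1, 0, 1],
                [1, 1, 1]]
observed = [1, 0, 1]
score = predictive_accuracy(pred_samples, observed)  # 5 of 6 predictions match
```

Repeating this for every student yields the per-student scores that a histogram such as Fig. 8 summarizes.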
An essential distinction of the GrL model is its ability to make predictions on unseen questions Q11, Q12, and Q13 using the posterior link probabilities of respective relevant links. The 3PL model cannot make such predictions because the question-specific parameters αl and βl are unknown for the questions in the testing dataset.
From the results in Fig. 9, the GrL model generally predicts the observed patterns correctly except for question Q13. Figure 10 shows that the total number of correct responses in the testing dataset is lower than the corresponding prediction made using the GrL model. Overall, the average predictive accuracy score for the testing dataset is higher than 90% for approximately 79% of the students, as seen in Fig. 11.
The lower predictive accuracy for some questions in the testing set under the GrL model may be attributed to inconsistencies in the observed responses. If two questions have the same relevant links, then an individual’s responses to those questions should be the same (either both correct or both incorrect). However, this is not always the case in the observed responses. For instance, consider student #34 in Fig. 11 (marked using a star), for whom the average predictive accuracy score is 64%. This student answered training questions Q1 and Q2 correctly but answered question Q3 incorrectly. Consequently, we should expect a correct response for Q11, an incorrect response for Q12, and a correct response for Q13, because questions Q1, Q2, and Q3 have the same relevant causal links as questions Q13, Q11, and Q12, respectively. However, the student’s actual response to Q11 is correct, to Q12 is incorrect, and to Q13 is incorrect. In total, 30 students with scores between 60% and 80% show such an inconsistency for one of the three testing dataset questions (see Fig. 11). Three students with accuracy scores close to 33% are inconsistent for two testing dataset questions. Two students with accuracy scores below 30% are inconsistent for all testing dataset questions. The exact reason for such errors in the dataset is unclear, but they could arise because students make mistakes even when they know the causal links correctly.
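The consistency argument above can be checked mechanically. The sketch below encodes the stated pairing of testing and training questions and counts, for one student, how many testing responses disagree with the paired training response. The dictionary and function names are hypothetical; the responses are those described for student #34.

```python
# Pairing stated in the text: Q13 shares links with Q1, Q11 with Q2, Q12 with Q3.
PAIRED_QUESTIONS = {"Q11": "Q2", "Q12": "Q3", "Q13": "Q1"}

def count_inconsistencies(train, test, paired=PAIRED_QUESTIONS):
    """Number of testing questions answered differently from the paired training question."""
    return sum(test[t] != train[tr] for t, tr in paired.items())

# Responses described for student #34 (1 = correct, 0 = incorrect).
student_train = {"Q1": 1, "Q2": 1, "Q3": 0}
student_test = {"Q11": 1, "Q12": 0, "Q13": 0}

n_inconsistent = count_inconsistencies(student_train, student_test)
# With three testing questions, each inconsistency removes roughly one third
# from the score a link-based model can reach.
best_case_accuracy = 1 - n_inconsistent / len(PAIRED_QUESTIONS)
```

For this student, one inconsistent response out of three gives a best-case accuracy of about two thirds, which is in line with the score reported in the text.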
5.2 Representing Aggregate-Level Ability.
Knowledge assessment practices commonly use a single number (total test score) to measure an individual’s aggregate ability. We investigate whether model parameters in the 3PL and GrL model can be rearranged to reflect individuals’ aggregate ability accurately. In the case of the fatigue questions, the aggregate ability is observed from the students’ total exam score, which quantifies overall knowledge of the topic.
5.3 Representing Question Difficulty.
Question difficulty is a latent property of different questions used in knowledge elicitation. Because we do not observe question difficulty directly in the given dataset, we analyze comparative question difficulty based on the estimated model parameters.
In the 3PL model, the threshold parameter βl signifies the difficulty of a question. Based on the posterior estimates of βl in Fig. 13, we may infer that questions Q1, Q2, and Q4 are easy questions and questions Q6 and Q9 are difficult for individuals across the population. This estimation of difficulty is mainly along the lines of the percentage of wrong responses. For instance, high fractions of the students, approximately , get questions 6 and 9 wrong.
The GrL model lacks a specific model parameter to quantify problem difficulty. Instead, the number of relevant links (as listed in Table 1) can serve as a proxy for problem difficulty. The higher the number of relevant links required to answer a question correctly, the more difficult the question can be. However, it is essential to emphasize that problem difficulty varies across questions and across students, depending on the students’ abilities. An accurate measure of difficulty would need a precise representation of individuals’ question-specific knowledge, achievable by quantifying individuals’ knowledge about causal links as discussed in Sec. 5.4.
The threshold parameter β in the GrL model represents the fraction of relevant links required to answer a question pertinent to a given theory. The posterior estimate of threshold β is close to 0.25, as shown in Fig. 14. A deviation of the posterior β from 1 signifies the partial relevance of the selected causal links for predicting a correct answer. This deviation may arise from various external factors. For example, the written responses in the dataset were graded by multiple graders, which may induce variation. Also, a written response can be partially correct, in which case a grader uses their judgment to mark the answer correct or incorrect.
The posterior estimates of the slip parameter s under the GrL model are of order 10⁻², which indicates a low probability that knowledgeable students answer incorrectly. Furthermore, the posterior estimates of the slope parameter α are close to 100, signifying that the model sharply differentiates between students with varying abilities.
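To see what these parameter values imply, consider a small numerical sketch. The response function below, guess + (1 − guess − slip) · logistic(α(ϕ − β)), is an assumed form chosen to be consistent with the parameter names in the Nomenclature; it is not copied from the model definition in Sec. 3. The helper names `phi` and `grl_prob_correct`, the toy matrices, and the zero guessing probability are hypothetical.

```python
import numpy as np

def phi(known_links, relevant_links):
    """Fraction of a question's relevant links that the individual holds (0/1 matrices)."""
    known_links = np.asarray(known_links, dtype=bool)
    relevant_links = np.asarray(relevant_links, dtype=bool)
    return (known_links & relevant_links).sum() / relevant_links.sum()

def grl_prob_correct(frac, alpha=100.0, beta=0.25, slip=0.01, guess=0.0):
    """ASSUMED response function; defaults mirror the posterior values quoted in the text."""
    logistic = 1.0 / (1.0 + np.exp(-alpha * (frac - beta)))
    return guess + (1.0 - guess - slip) * logistic

# Toy 3-variable example: the question needs three links, the student holds one.
relevant = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
known = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]])
frac_known = phi(known, relevant)

# Probability of a correct answer for several fractions of known links.
probs = {f: grl_prob_correct(f) for f in (0.0, 0.2, 0.25, 0.3, 1.0)}
```

Under this assumed form, a slope near 100 makes the curve almost a step at ϕ = β: the probability of a correct response moves from close to the guessing level to close to 1 − s within a narrow band around the threshold.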
5.4 Representing Causal Knowledge.
Unlike the 3PL model, the GrL model can quantify causal knowledge in terms of link probabilities. Figure 15 shows the distributions of estimated link probabilities across different causal links for the entire student population. Here, the x-axis represents the population-level estimate of link probability (the colormap corresponds to the x-axis), and the y-axis represents the number of samples. We observe that the students know some links better than others. For example, the students know links (Sut, Se′), (T, kd), (R, ke), etc., with high certainty. Conversely, for some causal links, such as (G, ka) and (G, kb), the probability density is skewed toward 0, signifying poor knowledge of these links. An instructor can utilize this link-specific knowledge to focus on the concepts that a population finds challenging. Some causal links, such as (ka, Se) and (kb, Se), have identical probability densities because of the assumption that causal links between fixed subgroups have equal link probability.
The GrL model is also helpful for categorizing the students based on their knowledge of causal links. Consider two students, a high-scoring student who answered all ten training questions correctly versus a low-scoring student who answered four training questions correctly. The differences in the knowledge structures of these two students are evident from the estimated link probabilities in Fig. 16. The high-scoring student seems to have high knowledge of all relationships such as (F, kb) and (F, kc), whereas the low-scoring student appears to know only some relationships, such as (Sut, Se′), (T, kd), and (R, ke), with higher probability. In Fig. 16, it is essential to note that even for the high-scoring student, some links, such as (G, M) and (M, σo), have low probability. One possible reason for this could be the low threshold β for the GrL model. This might cause the model to assume that a particular link is not required to answer a given question.
Furthermore, for some causal links, such as (Se′, Se) and (σ, nf), the model is uncertain at the population level (Fig. 15). This uncertainty could arise because these links either do not repeat often enough across questions or, when they do repeat, co-occur with many other links. For example, links (σ, nf) and (Se, nf) appear only in Q10. If an individual gets Q10 incorrect, the model is uncertain about whether one or both links are weak. On the other hand, the model is quite sure about the link (Sut, Se′), because it is the only link required to answer Q4 correctly. Also, note that the training dataset was a field dataset and was not collected specifically for this model. An ideally designed set of questions would have the following features: (i) a question tests only one causal link to maximize the learning about link-specific knowledge, and (ii) if multiple causal links constitute a single question, they repeat in other questions so that the trained model has substantial certainty about individual link probabilities.
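The two design features can be checked on a question set before data collection. The sketch below uses only the link memberships mentioned in this section (Q4 and Q10); the dictionary is therefore a partial, hypothetical stand-in for Table 1, and the function names are illustrative.

```python
from collections import defaultdict

# Partial, illustrative mapping from questions to relevant causal links.
relevant_links = {
    "Q4": {("Sut", "Se'")},
    "Q10": {("sigma", "nf"), ("Se", "nf")},
}

def isolated_links(question_links):
    """Links that are the only relevant link of at least one question (feature i)."""
    return {next(iter(links)) for links in question_links.values() if len(links) == 1}

def always_together(question_links):
    """Pairs of links that occur in exactly the same questions (violating feature ii)."""
    occurs = defaultdict(set)
    for question, links in question_links.items():
        for link in links:
            occurs[link].add(question)
    ordered = sorted(occurs)
    return [(a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:]
            if occurs[a] == occurs[b]]

well_identified = isolated_links(relevant_links)
confounded_pairs = always_together(relevant_links)
```

Links returned by `isolated_links` can be learned from a single response, whereas pairs returned by `always_together` cannot be separated by any response pattern, which matches the contrast between (Sut, Se′) and the two Q10 links described above.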
5.5 Discussion: Implications for Engineering Education and Design.
The probabilistic graphical method proposed in this paper has implications for learning and teaching in engineering education. Quantification of students’ causal knowledge can support a detailed representation of students’ ability [11]. The method provides an in-depth understanding of the concepts that a given population (or an individual) understands poorly and the concepts that the population (or an individual) knows well. Based on this understanding, instructors and intelligent tutoring systems (ITS) [41,42] can provide personalized feedback to help improve students’ knowledge. In the context of adaptive tests, instructors and ITS can potentially use estimated causal knowledge structures to generate new questions of varying difficulty from different combinations of causal links. This would allow instructors to test the same concepts using different questions and help reveal students’ knowledge of multiple concepts. The probabilistic graphical method can also support scaffolded learning by helping an instructor understand how much support a learner needs to complete learning tasks. Problem-specific scaffolding is achievable by assessing the threshold parameter (β) and the link probabilities (aij) from the estimated theory-based causal knowledge. This assessment helps determine the degree of assistance needed to support the learner’s development [43].
Another application of the probabilistic graphical method is modeling the dynamic nature of learning. The method can aid dynamic Bayesian networks of individuals’ learning [44,45] through quantitative assessment of causal knowledge at different time-steps. For example, an individual typically takes more than one assessment, such as exams or quizzes, in a given course. At each of these assessments, the individual’s estimated DAG can capture the state of the individual’s knowledge and indicate how that knowledge increases over time.
The computational modeling of human decision-making in engineering design is another avenue that can benefit from quantifying causal knowledge. Using the estimated DAGs, expertise research can differentiate experts from novices and test design theories, such as the theory that novice designers implement situation-independent rules whereas experienced designers tend to think in a pattern-based way [46]. For design practitioners, a better understanding of knowledge structures can help reduce the inefficiencies caused by a poor comprehension of relevant physical variables and their interrelations. Fast inference of an individual’s causal knowledge can inform the design of personal support tools that help human designers in decision-making [7,47,48] and knowledge-based inductive reasoning [8,49]. The quantification of knowledge structures can also enable better human-machine interaction (e.g., co-robotics) and improved design of partially automated artificial intelligence (AI)-based products and systems that work with humans [9].
To realize the applications above, we have created a tool that implements the proposed method.2 Given a theory-specific true DAG, a set of questions, and individuals’ responses to these questions, such a tool would follow the steps in Sec. 3 and infer DAGs for the population and individuals (similar to Figs. 15 and 16). The purpose of such a tool is to augment existing educational tools for knowledge assessment [50] and adaptive tests [2] with predictive functionalities.
6 Conclusions
This paper quantifies individuals’ theory-based causal knowledge using an approach based on DAGs, a GrL model, and hierarchical Bayesian inference. The approach uses relational constraints from a given theory to model individuals’ abilities (causal knowledge). It predicts the correct response based on individuals’ question-specific causal understanding. This approach is domain-general and can be implemented for any causal theory. In the illustrative study, we tested the approach on engineering students’ response data to questions related to fatigue failure. The results suggest that hierarchical Bayesian inference quantifies uncertainty in the population’s causal knowledge as well as uncertainty in individual-specific causal knowledge. The posterior estimates of individual-specific DAGs allow us to identify low-knowledge and high-knowledge subjects across different causal links as shown in Fig. 16. Furthermore, the GrL model can leverage the estimated individual-specific DAGs to predict individuals’ responses to unseen questions, given that a new question requires the same theoretical knowledge and the pseudo-guessing parameter cl is pre-defined.
Further work is necessary for validating the performance of the GrL model on multiple-choice questions, for which the responses are likely to have fewer errors from external factors such as variation in grading criteria. Future work should consider improvements in the representation of individuals’ causal knowledge, e.g., through modeling knowledge of functional relationships connecting parent variables to a child variable, instead of simply modeling link probabilities. This will also help to better model the complexity of questions. The presented method requires a priori definition of theoretical knowledge in terms of a true knowledge graph. Future extensions could model procedural knowledge by developing novel prior distributions for unconstrained graphs. Additionally, a Bayesian inference tool for causal knowledge representation is required to augment the existing intelligent tutoring systems for educational remediation and existing decision support systems for engineering design.
Acknowledgment
The authors gratefully acknowledge financial support from the National Science Foundation through NSF CMMI (Grant Nos. 1662230 and 1728165). An earlier version of this work was presented at ASME IDETC 2020 [17].
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.
Nomenclature
- s = slip parameter
- cl = pseudo-guessing probability for question l
- Ar = N × N matrix of link probabilities representing prior belief about individual r’s knowledge
- Kr = N × N matrix representing individual r’s knowledge graph with same encoding as KTrue
- = N × N binary matrix where ijth cell is 1 if the relationship between variables xi and xj is relevant to question l, or 0 otherwise
- Erl = binary variable denoting whether individual r’s response to question l is correct (1) or incorrect (0)
- X = {xi}i=1:N = a collection of N physical variables relevant to a given theory
- KTrue = N × N binary matrix representing the true knowledge graph for a specific theory
- α = slope parameter
- β = threshold parameter
- μij = the group mean of the population’s ability for causal link ij
- τr,ij = individual r’s offset from the mean ability μij
- ϕ = a feature function calculating the fraction of correctly identified links, i.e., matching links between Kr and KTrue
Appendix: Questions for Testing the Knowledge of Fatigue Analysis
Figure 17 sketches a circular shaft under cyclic loading that the students of a machine design class analyze as part of the final exam. The particular exam includes 13 subquestions listed in Table 3. Each exam question requires estimating one output variable given the values of the output variable’s parent variables. The students have additional information through the following problem statement:
A circular steel bar is fixed to the floor, as shown in Fig. 17. The bar has an ultimate tensile strength Sut = 180 kpsi, a yield strength Sy = 140 kpsi, and it has a machined surface. The bar operates at room temperature. The fatigue stress-concentration factors for bending and shear at the fillet are known to be Kf = 2.3 and Kfs = 1.8, respectively.
Question | Question statements |
---|---|
Q1 | For the critical plane at the shoulder identify the critical points in Fig. 17 |
Q2 | Expression for the bending moment Mmax at the critical plane in terms of Fa |
Q3 | Expression for the maximum normal stress adjusted for stress concentration for the critical point (ignore shear stresses due to transverse load) as a function of Fa |
Q4 | Calculate the theoretical endurance limit Se′ |
Q5 | Calculate the Marin factor ka |
Q6 | Calculate the Marin factor kb |
Q7 | For a reliability of 99%, calculate ke |
Q8 | For the given conditions determine the Marin factors kc, kd, and kf |
Q9 | Calculate the endurance strength Se of the bar for a reliability of 99% |
Q10 | The magnitude of the load, Fa, in pounds, for which the infinite life fatigue factor of safety at the critical point is nf = 1.5 |
Q11 | Expression for the bending moment Mmin at the critical plane in terms of Fa |
Q12 | Expression for the minimum normal stress adjusted for stress concentration for the critical point (ignore shear stresses due to transverse load) as a function of Fa |
Q13 | Show a plot of stress versus time for Q3 |
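To illustrate the kind of one-step causal calculation that questions Q4, Q5, and Q7 test, the sketch below applies widely used textbook correlations for steel (as found in standard machine design texts such as Shigley's). Whether the exam was graded against exactly these correlations is an assumption, and the function and constant names are hypothetical. Only Sut = 180 kpsi, the machined surface, and the 99% reliability come from the problem statement.

```python
SUT_KPSI = 180.0  # ultimate tensile strength from the problem statement

def endurance_limit_prime(sut_kpsi):
    """Assumed textbook estimate for steel: Se' = 0.5*Sut up to 200 kpsi, else 100 kpsi."""
    return 0.5 * sut_kpsi if sut_kpsi <= 200.0 else 100.0

def marin_surface_factor(sut_kpsi, a=2.70, b=-0.265):
    """Assumed textbook form ka = a * Sut**b with machined-surface constants (Sut in kpsi)."""
    return a * sut_kpsi ** b

# Assumed textbook reliability factor ke for 99% reliability.
RELIABILITY_FACTOR_99 = 0.814

se_prime = endurance_limit_prime(SUT_KPSI)  # Q4: parent Sut -> child Se'
ka = marin_surface_factor(SUT_KPSI)         # Q5: surface factor
ke = RELIABILITY_FACTOR_99                  # Q7: parent R -> child ke
```

Each line maps parent variables to one child variable, which mirrors how a question's relevant links connect nodes in the true knowledge graph. The remaining Marin factors and the endurance strength Se depend on quantities not reproduced in this appendix (for example, the bar diameter), so they are left out.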