ChatGPT Assistance in Creating Chemistry Practice Problems: Pitfalls, Positives, and Possibilities
Michael A. Christiansen
Abstract
ChatGPT, a relatively new Large Language Model (LLM) artificial intelligence (AI) chatbot, has gained significant attention as the fastest-growing consumer software application in history, accruing over 100 million users within two months of its release. This software can generate quick, articulate responses to virtually any textual query. Many educators are concerned about its potential for enabling student cheating. However, it currently suffers from significant limitations in solving chemistry problems—documented through peer-reviewed articles and from the author’s experience—suggesting that chemistry students will perform far better, on average, by studying than by relying on ChatGPT. This is particularly true of math-centric problems, which the current chatbot frequently gets wrong. Although ChatGPT cannot yet replace human intelligence—especially when heavy math or technical expertise is needed—it can be useful when coupled with enough subject knowledge to recognize and fix errors. In this article, the author summarizes his own experience with ChatGPT and highlights its potential benefits and limitations in chemistry education, as well as its ability to quickly generate practice questions for students. Such mass-question databases are of particular benefit for enabling students to reap the rewards of repeated practice, also known as the “practice effect” and “testing effect.”
Keywords: chemistry, chemistry education, ChatGPT, artificial intelligence, AI, course design
Background
Although college first-year General Chemistry (GC) and second-year Organic Chemistry (OC) are subjects steeped in theory, they also bear tremendous real-world application, often taught through a combination of lecture, demonstrations, videos, and worked-out practice problems. In my 12 years of teaching GC and OC at Utah State University, such problems have been a staple of my curriculum, because when well designed, they can fuse theory, practice, and application together in students’ minds (Fendos, 2021; Graham et al., 2016; Szu et al., 2011). Thus, my homework assignments are saturated with practice problems, which I frequently recommend that students do several times apiece as they prepare for exams (Chemistry Unleashed, 2019). This repeated problem-solving approach is rationalized by two educational phenomena called the practice effect (Bartels et al., 2010) and the testing effect (Rowland, 2014), the latter being succinctly described in James M. Lang’s book Distracted as follows:
Tests and quizzes not only measure learning, but actually produce it. Although we don’t often think of them this way, tests and quizzes are practice memory exercises. They force students to retrieve what they have learned and articulate it or apply it. In so doing, they are strengthening their ability to repeat those steps in the future. So we should not by any means apologize for, or shy away from, tests and quizzes. (Lang, 2020)
Although I have given my students hundreds of sample exercises to practice in each course, I have found through anonymous surveys that they want more. In fact, my perceived failure to deliver enough practice problems is one of the most consistent criticisms I receive in course evaluations. Many chemistry question databases exist—some free, others not—and each has varied quality and applicability to my course objectives (Apodaca, 2011; Reaxys, 2023; University of Texas Libraries, 2023). Hence, while striving to expand student access to an ever-growing number of sample problems, I must strike a balance between minimizing cost, choosing relevant questions, and not using others’ questions without permission.
This latter issue is of particular concern because I often film myself solving questions for my students and then upload those videos to YouTube (Chemistry Unleashed, n.d.). Although it is unlikely that a publisher or another instructor would sue me for using their exercises and posting video answers online, I would consider such actions unethical. Thus, the quandary arises: how can I quickly create my own unique database of practice problems for students? As I found during the spring 2023 semester, ChatGPT is one tool for accomplishing this goal. However, caution must be taken, for reasons explained below. I will accordingly provide a brief history of ChatGPT, a summary of some of its shortcomings and uses (including my own experiences making chemistry questions), and possible future applications.
What is ChatGPT and How Does It Work?
In 2022, an American artificial intelligence (AI) research lab called OpenAI launched ChatGPT (Chat Generative Pretrained Transformer; Hu, 2023), an online Large Language Model (LLM) chatbot that responds in human-like ways to any question. ChatGPT quickly became the fastest-growing consumer software application in history, gaining over 100 million users in two months (Hu, 2023). Google, Microsoft, and Meta responded by accelerating creation of their own LLM chatbots—Bard, Bing Chat, and LLaMA—which remain at various levels of development and use (Pathak, 2023; Ray, 2023; “What’s,” 2023). Image- and video-generating AI programs have also emerged but will not be covered here (Deep Dream Generator, n.d.; VAGPE Media, 2023; Howfinity, 2023; OpenAI DALL-E2, n.d.).
ChatGPT’s online interface is simple. Once users visit the site (OpenAI ChatGPT, n.d.), create an account, and log in, they are presented with two top-screen tabs: one for GPT-3.5 (the free version) and another for GPT-4 (the subscription version). The bottom of the homepage features a “Send a message” field, where users can type any question or prompt and hit ENTER. The chatbot then replies onscreen. In some ways, ChatGPT seems like a next-generation search engine, but instead of responding with myriad websites, it directly answers the exact questions posed. Users can also modify its parameters—for example, by asking it to create a biography of a historical figure with a specific length. It seems to command virtually limitless knowledge on any topic. But despite its superficial word-generating omnipotence, ChatGPT has flaws that users should understand to avoid pitfalls. To do so, we must know to some extent how it works, which one source summarizes as follows:
Large Language Models, such as GPT-3, are trained on vast amounts of text data from the internet and are capable of generating human-like text, but they may not always produce output that is consistent with human expectations or desirable values. In fact, their objective function is a probability distribution over word sequences (or token sequences) that allows them to predict what the next word is in a sequence. (Ramponi, 2022)
In other words, ChatGPT functions through a probability distribution (ostensibly updated through iterative interactions with human users and “trainers”) to predict and deliver the statistically most likely next word or phrase a human would give to answer a prompt (Clark, 2023). Thus, ChatGPT is really a high-speed statistical text prediction algorithm. It does not have goals or sentience. Instead, much like an advanced version of the word prediction tech on smartphones or search engines, it rapidly anticipates the most likely next words a human would use in an ideally correct response and then delivers those words. It then repeats this until fulfilling the user’s query.
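The next-word-prediction loop described above can be sketched in a few lines of Python. This is a toy illustration only: the vocabulary, contexts, and probabilities below are invented for the example, and a real LLM learns its distribution from training data rather than storing it in a lookup table.

```python
# Toy "language model": for each context, a probability distribution
# over possible next words (all values invented for illustration).
toy_model = {
    "the acid": {"dissolves": 0.5, "reacts": 0.3, "is": 0.2},
    "acid dissolves": {"the": 0.6, "metal": 0.4},
}

def next_word(context, model):
    """Return the statistically most likely next word for a context."""
    dist = model[context]
    return max(dist, key=dist.get)

print(next_word("the acid", toy_model))  # -> dissolves
```

A real chatbot repeats this prediction step, appending each chosen word to the context, until the response is complete; nothing in the loop checks whether the emitted words are factually true.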
Narrow AI, General AI, and Super AI
Though still young, ChatGPT already sits centerstage in many online discussions, articles, and videos. Wide-ranging concerns have arisen regarding the potential long-term impact of both ChatGPT and of AI in general. These are best understood by differentiating between narrow AI, general AI, and super AI, which AI researcher Vijay Kanade (2022) explains as follows:
Artificial narrow intelligence (ANI), also referred to as weak AI or narrow AI, is application- or task-specific AI. It is programmed to perform singular tasks such as facial recognition, speech recognition in voice assistants, or driving a car. Narrow AI simulates human behavior based on a limited set of parameters, constraints, and contexts. . . .
Artificial general intelligence (AGI), also referred to as strong AI or deep AI, is the ability of machines to think, comprehend, learn, and apply their intelligence to solve complex problems, much like humans. Strong AI uses a theory of mind AI framework to recognize other intelligent systems’ emotions, beliefs, and thought processes. A theory of mind-level AI refers to teaching machines to truly understand all human aspects, rather than only replicating or simulating the human mind. . . .
Artificial superintelligence (ASI) is a type of AI that surpasses human intelligence and can perform any task better than a human. ASI systems not only understand human sentiments and experiences but can also evoke emotions, beliefs, and desires of their own, similar to humans. (Kanade, 2022)
Concerns about AI reaching an apocalyptic, humanity-ending state—such as those articulated by Stephen Hawking in Brief Answers to the Big Questions (Hawking, 2018; Rutschman, 2018; Cellan-Jones, 2014)—depend on AGI or ASI, which currently remain either in development or relegated to theory (Kanade, 2022). ChatGPT itself is a form of ANI.
Pitfalls
Whatever future scenarios might unfold as AI evolves lie beyond the scope of this chapter; I will instead focus here on reports and applications from the field of Chemistry Education.
Pitfall 1: ChatGPT is Currently Bad at Math
In a recent Journal of Chemical Education (JCE) article, authors Humphry and Fuller claimed that “ChatGPT can be very useful in helping students with some calculations” (Humphry & Fuller, 2023). However, they then shared an example in which ChatGPT was given four measured values and asked to calculate their average, standard deviation, and 95% confidence interval about the mean. ChatGPT did this, but got all three answers wrong, seemingly nullifying the prior assertion of the chatbot’s mathematical usefulness. Although Humphry and Fuller did not address this error, chemistry educator Julian Tyson did in a later article (Tyson, 2023). Separately, Fergus and coworkers independently entered six math-centered chemistry questions into ChatGPT, using three separate accounts (Fergus et al., 2023). The AI responded with five answers (in one trial it gave no response). All five were different, and each one was wrong.
In other work, researcher Ted M. Clark (2023) gave ChatGPT and a class of first-year undergraduates the same set of 34 GC questions. The chatbot correctly answered just 40% of math-based questions and 47% of non-math questions, well below the student average scores of 76% and 65%, respectively. On average, then, students will likely outperform ChatGPT in its current state on conceptual and quantitative chemistry questions, underscoring Tyson’s claim that “ChatGPT is hopeless at math” (Tyson, 2023). In a separate study by Frieder et al. (2023), ChatGPT and GPT-4 were unable to pass graduate-level math exams. Thus, in its current state—which will undoubtedly improve over time—I recommend warning students about ChatGPT’s below-average output, thereby disincentivizing them from mindlessly using its responses to answer chemistry questions.
Pitfall 2: ChatGPT Sometimes Outputs Plausible-Sounding-but-False Information as True
This chatbot phenomenon, known as “hallucination” or “confabulation” (Lakshmanan, 2022; Tyson, 2023), is unsurprising when considering that ChatGPT operates by drawing from vast amounts of inputted data to predict and produce the statistically most likely next words a human would give in a query response. Factual accuracy does not necessarily play into its outcomes. One of the more interesting results of this is the chatbot’s creation of completely fictional references. For instance, Tyson (2023) asked ChatGPT to answer a question about global arsenic cyclization with references cited in its response. The chatbot’s output read with an authoritative tone and included three citations. However, having expertise in this area, Tyson recognized that at least one of the statements in ChatGPT’s response was completely opposite of the truth. Moreover, none of the references looked familiar. On further inspection, each of them was completely made up. In one case, a cited reference’s Digital Object Identifier (DOI) did exist but linked to a completely different article.
In a more well-known example, a law firm submitted a 10-page legal brief “citing more than half a dozen court decisions . . . in support of their argument that [their] suit [for which the statute of limitation had expired] should be allowed to proceed” (Weisner, 2023). After the defense attorneys were unable to find any of the cases cited, it was discovered that the plaintiff’s attorney had written the brief using ChatGPT. None of the cases were real (Weisner, 2023).
Though other pitfalls may exist, an exhaustive summary of every possible example lies beyond this chapter’s scope, and over time, ChatGPT will undoubtedly improve. Suffice it to say, in its present state, indiscriminately copying ChatGPT’s responses will not yield a passing grade in undergraduate GC or OC courses—as Clark’s research solidly shows (Clark, 2023; Tyson, 2023). Additionally, the chatbot’s outputs may even include misleading or incorrect statements or fabricated references. Thus, for classes whose instructors allow ChatGPT, students should be warned to use it as a potentially helpful tool, but not to uncritically accept its answers. In Tyson’s words: “ChatGPT is useless as a chemistry tutor” (Tyson, 2023).
A Summary of My Own Experiences Writing Chemistry Questions
ChatGPT seemingly has two overarching uses in education: helping students do assignments and helping teachers create those assignments (Clark, 2023). As explained at the beginning of this chapter, my chemistry curriculum centers on repeated practice-problem exercises, and my course evaluations often mention students’ desire for more. Wanting to provide large databases of sample questions and produce explanation videos on YouTube for my students without violating others’ copyrights, I recently used ChatGPT to create 18 new GC questions in about four total hours of work time—a task that would normally have taken at least twice as long. In this initial trial run, I found the chatbot to be useful, but only when coupled with human expertise. The quality of its output also depends significantly on the details given in the user query, in a refinement process sometimes called “Prompt Fu” (Prompt Fu, n.d.). Congruent with Clark’s findings, I observed that “questions in which more general information was appropriate were answered at a higher level” (Clark, 2023). Instead of an exhaustive retelling of the entire endeavor, I will summarize only a few examples where ChatGPT’s output was lackluster or incorrect (Tables 1-6), with some details omitted for brevity. If you read through each case dialog below, you will notice the areas where the chatbot erred. The take-home from these exchanges is that ChatGPT can be a useful tool for quickly generating a sizable chemistry question database, but it requires expert scrutiny to correct the errors it creates.
Table 1: ChatGPT References Something That is Not There
Example 1: In this example, ChatGPT required multiple prompts before giving a satisfactory answer. However, its response to the penultimate prompt claimed that “hydrogen sulfide (H2S) is oxidized to sulfur (S).” This was incorrect because molecular sulfur, whose formula is S not bonded to any other elements, does not appear anywhere in the reaction ChatGPT created. In other words, the chatbot referenced a species, (S), that did not exist in the chemical reaction it had written.

Question (me): Here’s an example redox reaction: Fe(CO)5 (l) + HI (s) → Fe(CO)4I2 (s) + CO (g) + H2 (g). Can you write me a new reaction that is similar to this one, is chemically correct, but does not involve CO?

Answer (ChatGPT): Sure, here’s an example: CuCl2 (aq) + Mg (s) → MgCl2 (aq) + Cu (s)

Follow-up Question (me): This is okay, but can you give me another example that involves more than 3 elements?

Response (ChatGPT): Sure, here’s an example: 2Fe(NO3)3 (aq) + 3NaOH (aq) + H2S (g) → 2FeS (s) + 6NaNO3 (aq) + 3H2O (l). In this reaction, iron (III) nitrate (Fe(NO3)3) is reduced to iron (II) sulfide (FeS) while hydrogen sulfide (H2S) is oxidized to sulfur (S). Sodium hydroxide (NaOH) is used as a base to facilitate the reaction and form sodium nitrate (NaNO3) and water (H2O). The reaction involves four elements – iron (Fe), nitrogen (N), oxygen (O), and sulfur (S).

Follow-up Question (me): Can you explain this further? I do not see molecular sulfur (S) on the product side.

Response (ChatGPT): I apologize for the confusion in my previous response. You are correct that the molecular sulfur (S) is not present in the product side of the given reaction. Here’s a corrected example of a redox reaction that involves more than 3 elements and does not involve carbon monoxide: 2Hg(NO3)2 (aq) + Mg (s) → 2Hg (l) + Mg(NO3)2 (aq).
Table 2: ChatGPT Creates a Question That is Missing Essential Information
Example 2: After two prompts, ChatGPT created an acceptable “percent yield” question. However, it did not give a mass value for one of the reaction products, MgCl2, and this type of question cannot be solved without that value. Instructors could supply the missing value themselves, so this kind of result could still be used.

Question (me): Please create a percent yield question, complete with a balanced chemical equation and four multiple-choice options, where one of them is correct and the others are wrong.

Answer (ChatGPT): Sure, here’s a question: When 50.0 grams of methane (CH4) is burned in excess oxygen gas (O2), 100.0 grams of carbon dioxide (CO2) and 54.0 grams of water (H2O) are produced. What is the percent yield of CO2? (Balanced equation: CH4 + 2O2 → CO2 + 2H2O)

Follow-up Question (me): That’s a great try! I would like you to write me a new one where the question does NOT tell me which reactant is added in excess. In other words, give me a certain number of grams of each reactant, so I have to determine which reactant is the limiting reactant.

Response (ChatGPT): Sure thing! Here’s a question for you: When 25.0 grams of magnesium (Mg) reacts with 35.0 grams of hydrochloric acid (HCl) to produce magnesium chloride (MgCl2) and hydrogen gas (H2), what is the percent yield of MgCl2? (Balanced equation: Mg + 2HCl → MgCl2 + H2)

Follow-up Question (me): That’s pretty good, except you didn’t give me a gram amount of MgCl2 product. Which of your answers is correct, and how many grams of MgCl2 product lead to that answer?

Response (ChatGPT): You’re right, my apologies for the oversight. Here’s the revised question with the answer and the amount of MgCl2 product: When 25.0 grams of magnesium (Mg) reacts with 35.0 grams of hydrochloric acid (HCl) to produce magnesium chloride (MgCl2) and hydrogen gas (H2), what is the percent yield of MgCl2 if 29.4 grams of MgCl2 are produced?
Table 3: ChatGPT Creates a Set of Multiple-Choice Options That Does Not Include the Correct Answer
Example 3: In this example, ChatGPT provided a correct answer (2.82 x 10^23 atoms) in its written explanation, but it did not include this correct answer in its multiple-choice option set. An instructor could easily fix this by updating the multiple-choice options to include the correct answer.

Question (me): Please give me four multiple-choice options (one correct option and three incorrect options) for the following: “How many atoms are there in 15 grams of sulfur?”

Answer (ChatGPT): Here are the options: A) 6.02 x 10^23 atoms. B) 1.50 x 10^23 atoms. C) 3.01 x 10^23 atoms. D) 9.03 x 10^23 atoms. Correct answer: B) 1.50 x 10^23 atoms. Explanation: The number of atoms present in a given amount of a substance can be calculated using Avogadro’s number (6.02 x 10^23) and the molar mass of the substance. The molar mass of sulfur is approximately 32 g/mol. Therefore, the number of atoms in 15 grams of sulfur can be calculated as follows: Number of moles of sulfur = mass of sulfur / molar mass of sulfur = 15 g / 32 g/mol = 0.469 mol. Number of atoms of sulfur = Number of moles of sulfur x Avogadro’s number = 0.469 mol x 6.02 x 10^23 atoms/mol = 2.82 x 10^23 atoms.

Follow-up Question (me): Wow, that’s interesting. Your answer in your explanation, 2.82 x 10^23 atoms, is correct, but that’s not the answer you put for B). Can you please update your A) through D) options to actually show 2.82 x 10^23 atoms as Option B)?

Response (ChatGPT): I apologize for the confusion. You’re correct that the correct answer is 2.82 x 10^23 atoms. Here are the updated options: A) 6.02 x 10^23 atoms. B) 2.82 x 10^23 atoms. C) 3.01 x 10^23 atoms. D) 9.03 x 10^23 atoms.
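The correct arithmetic in Example 3 is easy to verify with a short script (molar mass and Avogadro's number are the same approximate values used in the dialog):

```python
AVOGADRO = 6.02e23       # atoms per mole
MOLAR_MASS_S = 32.0      # g/mol, approximate molar mass of sulfur

mass = 15.0                     # grams of sulfur
moles = mass / MOLAR_MASS_S     # 0.469 mol
atoms = moles * AVOGADRO        # 2.82e23 atoms

print(f"{atoms:.3g} atoms")  # -> 2.82e+23 atoms
```

Three lines of deterministic arithmetic catch the mismatch between the chatbot's explanation (2.82 x 10^23) and its option B (1.50 x 10^23) immediately.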
Table 4: ChatGPT Gives a Correct Explanation but Outputs a Wrong Answer
Example 4: In this example, ChatGPT correctly derived a 1:1:2 ratio of carbon to hydrogen to oxygen, but then claimed that the corresponding formula was CH2O, which has a 1:2:1 ratio of carbon to hydrogen to oxygen. The question generated by the chatbot was good enough, however, that an instructor could fix it and make it usable.

Question (me): Please solve this problem: “An 8.523-gram sample of a compound contains 2.695 grams of carbon, 0.248 grams of hydrogen, and 5.580 grams of oxygen. What is the compound’s empirical formula?”

Answer (ChatGPT): To find the empirical formula of a compound, we need to determine the ratio of the elements in the compound. We can do this by converting the mass of each element to moles and then dividing by the smallest number of moles to get the simplest whole-number ratio. First, let’s find the number of moles of each element: Moles of carbon = 2.695 g / 12.011 g/mol = 0.2246 mol. Moles of hydrogen = 0.248 g / 1.008 g/mol = 0.2460 mol. Moles of oxygen = 5.580 g / 15.999 g/mol = 0.3488 mol. Next, we divide each of these numbers by the smallest number of moles (0.2246 mol): Carbon: 0.2246 mol / 0.2246 mol = 1. Hydrogen: 0.2460 mol / 0.2246 mol = 1.095. Oxygen: 0.3488 mol / 0.2246 mol = 1.553. We need to get whole-number ratios, so we round these values to the nearest whole number: Carbon: 1, Hydrogen: 1, Oxygen: 2. Therefore, the empirical formula of the compound is CH2O.

Follow-up Question (me): By following your logic, shouldn’t the answer be CHO2?

Response (ChatGPT): You are correct, I made a mistake in my previous response. Thank you for bringing this to my attention. Using the same method as before, we can find the number of moles of each element: Moles of carbon = 2.695 g / 12.011 g/mol = 0.2246 mol. Moles of hydrogen = 0.248 g / 1.008 g/mol = 0.2460 mol. Moles of oxygen = 5.580 g / 15.999 g/mol = 0.3488 mol. Next, we divide each of these numbers by the smallest number of moles (0.2246 mol): Carbon: 0.2246 mol / 0.2246 mol = 1. Hydrogen: 0.2460 mol / 0.2246 mol = 1.095. Oxygen: 0.3488 mol / 0.2246 mol = 1.553. We need to get whole-number ratios, so we round these values to the nearest whole number: Carbon: 1, Hydrogen: 1, Oxygen: 2. Therefore, the empirical formula of the compound is CHO2.
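The mole-ratio bookkeeping in Example 4 can be scripted directly. Rounding each ratio to the nearest whole number reproduces the 1:1:2 (CHO2) result the dialog arrives at; note this rounding step is taken straight from the dialog's own method, not a general-purpose empirical-formula algorithm (a ratio like 1.553 would often be scaled up rather than rounded).

```python
masses = {"C": 2.695, "H": 0.248, "O": 5.580}        # grams, from the problem
molar_mass = {"C": 12.011, "H": 1.008, "O": 15.999}  # g/mol

# Convert each mass to moles, then divide by the smallest mole count.
moles = {el: m / molar_mass[el] for el, m in masses.items()}
smallest = min(moles.values())
ratios = {el: round(n / smallest) for el, n in moles.items()}

print(ratios)  # -> {'C': 1, 'H': 1, 'O': 2}
```

With the ratios computed mechanically, the mismatch between a 1:1:2 ratio and the formula CH2O is obvious at a glance.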
Table 5: ChatGPT Makes a Math Error
Example 5: In this example, ChatGPT claimed that 35 grams of hydrochloric acid, HCl, is equal to 1.03 moles of HCl, which is incorrect. To correctly calculate a substance’s number of moles from a given mass in grams, one must divide that mass by the substance’s molecular weight. In this case, 35 grams of HCl should be divided by HCl’s molecular weight of 36.46 grams per mole to arrive at 0.960 moles of HCl, not 1.03 moles. It’s possible that ChatGPT either miscalculated HCl’s molecular weight or miscalculated 35 divided by 36.46. Ultimately, an instructor with enough knowledge to catch this error could modify the question to fix the mistake.

Question (me): Please solve this: When 25.0 grams of magnesium reacts with 35.0 grams of hydrochloric acid, what is the percent yield of MgCl2 if 29.4 grams of MgCl2 are produced?

Answer (ChatGPT): [With a lengthy explanation, ChatGPT gives the answer, which includes the miscalculation of HCl’s number of moles as 1.03 mol x 2 = 2.06 mol. The correct figure should be 35 g / (36.46 g/mol) = 0.960 mol.] Therefore, the percent yield of MgCl2 is 30.0%.

Follow-up Question (me): [I find its error.] Good try, but you’re incorrect. The question says that we have 35 grams of HCl, which is not 2.06 moles of HCl. Please reattempt.

Response (ChatGPT): [With a lengthy explanation, ChatGPT responds with the correct answer.]
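A short script makes the mole bookkeeping in Example 5 explicit. The molar masses are approximate textbook values, and the limiting-reactant and percent-yield steps are my own sketch of the full solution, not a transcript of ChatGPT's corrected response. (Coincidentally, the erroneous 1.03 mol figure matches the moles of Mg, suggesting the chatbot may have swapped the two reactants.)

```python
MM_HCl = 36.46     # g/mol, approximate
MM_Mg = 24.305     # g/mol
MM_MgCl2 = 95.21   # g/mol

mol_HCl = 35.0 / MM_HCl   # 0.960 mol, not ChatGPT's 1.03 mol
mol_Mg = 25.0 / MM_Mg     # 1.03 mol (the figure ChatGPT gave for HCl)

# Mg + 2 HCl -> MgCl2 + H2: HCl is limiting, since 0.960/2 < 1.03/1.
mol_MgCl2 = mol_HCl / 2
theoretical_g = mol_MgCl2 * MM_MgCl2        # ~45.7 g
percent_yield = 29.4 / theoretical_g * 100  # ~64.3 %

print(f"{mol_HCl:.3f} mol HCl, {percent_yield:.1f}% yield")
```

Running the numbers this way exposes the error in seconds, which is exactly the kind of check an instructor should apply before adopting a chatbot-generated answer key.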
Table 6: ChatGPT Makes Multiple Math Errors
Example 6: In this example, ChatGPT did correct algebra to solve for a volume as (1.39 x 0.0821 x 298.15) divided by 1.00. However, the answer it produced, 31.4, is incorrect; the correct answer is 34.0. When I pointed this out, ChatGPT reattempted the calculation and got another wrong answer, 30.5. In a follow-up prompt that simplified the math by removing all units, ChatGPT still got a wrong answer of 30.4. It’s unclear why ChatGPT could not perform simple multiplication and division correctly. However, an instructor with enough knowledge to catch this could modify the question to fix the mistake.

Question (me): Please solve this question: If 25.0 grams of Al react with excess HCl according to the equation 2 Al + 6 HCl → 2 AlCl3 + 3 H2, then what volume (in liters) of hydrogen gas, measured at 25.0°C and 760.0 mmHg, is produced?

Answer (ChatGPT): To solve this problem, we need to use the stoichiometry of the balanced chemical equation . . . Now we can plug in the values and solve for Volume: V = nRT / P. V = (1.39 mol)(0.0821 L•atm/mol•K)(298.15 K) / 1.00 atm = 31.4 L.

Follow-up Question (me): Really close, but when I multiply 1.39 x 0.0821 x 298.15, I get a slightly different answer.

Answer (ChatGPT): You’re right, thank you for catching that mistake. When I multiplied 1.39 x 0.0821 x 298.15, I made a calculation error. The correct calculation is . . . V = (1.39 mol) x (0.0821 L·atm/(mol·K)) x (298.15 K) / (1.00 atm) = 30.5 L.

Follow-up Question (me): Still wrong. Please solve (1.39)(0.0821)(298.15).

Answer (ChatGPT): I apologize for the mistake. The correct calculation is . . . V = (1.39 mol) x (0.0821 L·atm/(mol·K)) x (298.15 K) / (1.00 atm) = 30.4 L.

Follow-up Question (me): Nope. It’s 34.0 L.
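The full calculation in Example 6, from grams of Al through the ideal gas law, takes only a few lines and confirms the 34.0 L answer (the molar mass of Al is an approximate value I supply here):

```python
MM_Al = 26.98   # g/mol, approximate molar mass of aluminum
R = 0.0821      # L*atm/(mol*K), ideal gas constant

mol_Al = 25.0 / MM_Al    # 0.927 mol
mol_H2 = mol_Al * 3 / 2  # 2 Al : 3 H2 stoichiometry -> 1.39 mol

T = 25.0 + 273.15        # 298.15 K
P = 760.0 / 760.0        # 760 mmHg = 1.00 atm

V = mol_H2 * R * T / P   # ideal gas law, V = nRT/P
print(f"{V:.1f} L")      # -> 34.0 L
```

Since the chatbot set up V = nRT/P correctly and only botched the multiplication, delegating the final arithmetic to any deterministic tool (a calculator, a spreadsheet, or a script like this) fixes the problem.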
Congruent with others’ assertions described above, my conclusions from working with ChatGPT are that (1) repeated queries are sometimes needed to fully obtain desired results; (2) some amount of expertise may be needed to catch incorrect outputs; and (3) in its current state, ChatGPT often errs in its mathematical calculations. Thus, it can be a useful tool for quickly creating a sizable question database, but expert scrutiny is needed to discern and correct errors in the process. Moreover, because of its shortcomings with math, users must beware and check its work. In other words, when using ChatGPT to answer questions outside one’s own areas of expertise, I recommend caution: double- or triple-checking its responses with reliable sources to make sure they are correct.
Positives
Despite its shortcomings, ChatGPT has a long list of impressive features, especially in the upgraded, subscription-model “plus” version, GPT-4. Unlike its free GPT-3.5 counterpart, GPT-4 has dozens of third-party browser plugins, and it can access the internet, which might improve its output quality and decrease the frequency of its “hallucinations” (Choudhary, 2023).
For example, in two recent videos whose scripts (according to their presenter) were ironically written by ChatGPT, YouTuber “Hayls World” summarizes various interesting features, demonstrated using Google Chrome and presumably limited to GPT-4, as every attempt I made at replicating them with GPT-3.5 failed (Hayls World, 2023). In particular, “Hayls World” claimed that in ChatGPT (presumably GPT-4), users can copy and paste HTML from online articles into the query field and ask the AI to summarize them. In GPT-4, this even works by entering the prompt “TL;DR in 2 sentences” (short for “Too Long; Didn’t Read: summarize in 2 sentences”) and then pasting hundreds of pages of text; the chatbot supposedly then provides a quick summary. This purportedly also works by entering web links for images into the chatbot and asking it to explain them. Additionally, she explained that ChatGPT and GPT-4 can solve math problems entered into the query field; users can also paste their own written essays into the chatbot and ask it to grade their work based on their teachers’ grading criteria. “Hayls World” further claimed that by installing the plugin “ChatGPT summary for Chrome,” GPT-4 can create a condensed audio or written summary of any YouTube video just by entering that video’s link.
These features could have great potential for chemistry educators. For instance, if the summary feature were accurate, educators could read a larger number of research articles (via synopses) in less time. The grading feature might help educators grade and improve their own work before sending it to colleagues for review, or do a “first-run” mass-grade of student work. The utility of ChatGPT solving math problems needs no explanation.
Nevertheless, as one might deduce from the explanations above, I would be skeptical of output summarizing lengthy articles, dissertations, or books, and even more skeptical of its math. Still, such features are intriguing, especially as GPT-4 iteratively improves, and could prove genuinely beneficial if users scrutinize and crosscheck its responses with outside sources. Even with full internet access, though, one should exercise caution, for despite providing access to a wealth of information, the internet is also:
Packed with untruths, hate speech, and other garbage, [and] Chatbots absorb it all . . . And because of the surprising way they mix and match what they’ve learned to generate entirely new text, they often create convincing language that is flat-out wrong, or does not exist in their training data. (Tyson, 2023)
In areas where mistakes seem less serious, both ChatGPT and GPT-4 can give teaching instructions for skills like playing the guitar, learning new languages, or using software like Adobe Photoshop. It can also create recipes, including ones based on ingredients specified by users, give meal plans with shopping lists, and with GPT-4 only, generate workout plans that include pictures and detailed animations showing how to perform the exercises (Hayls World, March 2023). Of note, users can ask both ChatGPT and GPT-4 to rewrite messages in the style of specific people, including celebrities or the users themselves. In the latter case, this is done by entering samples of one’s own writing, thus enabling the chatbot to create an AI writing “clone” of oneself (Hayls World, March 2023). This feature, enhanced by a Gmail-compatible plugin called ChatSonic, can be used to ask ChatGPT to write emails on particular subjects or with particular points, in one’s own writing style, and with a specific tone (chosen from various options such as “professional,” “personal,” “angry,” etc.), potentially saving time (Hayls World, September 2023). ChatGPT also has settings that can be adjusted to change its output, such as “length” and “temperature” (ChatGPT Settings, n.d.; Murdok, 2023), and its output can vary depending on the user’s personal attributes. For instance, if users tell the chatbot they are minors, then it may moderate its responses to be age appropriate.
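The “temperature” setting mentioned above controls how sharply a language model favors its top-ranked next words. Its effect can be illustrated with a temperature-scaled softmax; the scores below are invented for the example, and this is a generic sketch of the technique, not OpenAI’s implementation.

```python
import math

def softmax(scores, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature
    concentrates probability on the highest-scoring option."""
    scaled = [s / temperature for s in scores]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]                 # invented next-word scores
print(softmax(scores, temperature=1.0))  # moderately spread out
print(softmax(scores, temperature=0.2))  # nearly all weight on the top word
```

Low temperature makes output more predictable and repetitive; higher temperature makes it more varied, which is one reason identical prompts (as in the Fergus et al. study above) can yield different answers.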
ChatGPT and GPT-4 have other features with various levels of potential utility or impact; thus, caution is advised. For instance, users can describe in the chatbot’s query field challenges they are experiencing with specific interpersonal relationships and ask for advice on how to solve or manage them (Hayls World, March 2023). Or, by using the “PlaylistAI” plugin and linking to a Spotify account, GPT-4 will generate music playlists according to the user’s specifications (Hayls World, September 2023). It also has a plugin called “show me diagrams” that enables users to create charts and diagrams based on written prompts, and a Chrome extension called “Talk-to-ChatGPT” allows users to interact with GPT-4 completely through speech (no typing), with the chatbot responding in one of 38 different voice options (Hayls World, September 2023).
However effective or reliable these features may be at present, GPT’s functionality will undoubtedly improve over time. Attempts to stop students from using it may thus resemble bygone mathematicians’ efforts to stop students from using calculators (Hochman, 1986; Pang, 2016). In other words, as Clark affirms, “ignoring this technology may be unwise. A more prudent approach would be to incorporate chatbot responses into assignments involving critical thinking, to continue to monitor its capability, and to adapt assignments as the technology improves” (Clark, 2023). Instructors should therefore craft an AI policy and include it in their syllabi (Tyson, 2023). Moreover, if they are concerned about cheating, they should warn their students of the poor to mediocre quality of much of ChatGPT’s work, along with the likelihood that using it without scrutiny will result in a failing grade.
Possibilities and Conclusions
Many educators worry about ChatGPT’s potential for facilitating cheating. For instance, one colleague in my department recently tested its ability to answer some of his GC exam questions, and it answered them correctly. Thus, especially for instructors who administer computer-proctored exams online, it may be necessary to alter web browser settings to prevent opening extra tabs, lest students plug questions into ChatGPT while taking tests. However, based on the above-described findings and my own observations, ChatGPT’s current ability to correctly answer chemistry questions—especially when math is involved—is limited and will likely yield a far-below-average score. At present, then, students should be more worried about it than teachers. Obviously, ChatGPT’s shortcomings may narrow as the AI algorithm is iteratively refined; thus, teachers should be mindful as we design future homework, quiz, and exam problems.
There are many potentially positive uses of ChatGPT in teaching and learning. For instance, Clark suggests: “The shortcomings that make ChatGPT unsuitable for completing assignments could be considered assets if the chatbot is used to pursue goals pertaining to information literacy including critical thinking, critical analysis, and evaluation of information. . . . This could include open-ended assignments in which students use the chatbot to solve a problem and evaluate its responses” (Clark, 2023). Thus, instructors could have students enter questions into the chatbot, analyze and evaluate its responses, and then correct ones that are wrong.
Overall, I found the chatbot useful for building a sizable chemistry question database relatively quickly: 18 high-quality questions in about four total hours. This would normally have taken me at least twice as long, though it did require multiple prompts, content expertise to spot errors, and careful rechecking of all its math. Perhaps with future iterations, such as GPT-4 or beyond, larger question databases might be constructed even more quickly, thus giving instructors faster access to the potential benefits of the practice effect and testing effect for their students.
One serious question remains: because ChatGPT statistically predicts and outputs the most likely next word or phrase, is it not doomed to produce answers that are statistically average—at the “top” of the bell curve, so to speak—and thus forever unable to generate high-quality, groundbreaking, or cutting-edge results? Though the answer remains unclear, users should be mindful that ChatGPT may be unable to create high-end results or push the boundaries of current knowledge toward and beyond the cutting edge. Nevertheless, though ChatGPT cannot presently serve as a full substitute for human intelligence—especially when heavy math or technical knowledge is required—when balanced with enough crosschecking or subject knowledge to recognize and fix errors, it can be very useful.
Questions to Guide Reflection and Discussion
- Discuss the reliability of ChatGPT in generating accurate chemistry and STEM practice problems. What are the risks of relying on AI for academic content?
- Reflect on the author’s experiences with ChatGPT’s limitations in solving math-centric chemistry problems. How does this impact its utility in educational settings?
- Explore the potential benefits of using ChatGPT to create practice questions. How can educators effectively integrate this tool while ensuring content accuracy?
- Consider the ethical implications of using AI-generated content in education. What measures should be in place to address potential misinformation?
- How can educators balance the use of AI tools like ChatGPT with traditional teaching methods to enhance student learning without fostering dependency?
References
Apodaca, R. L. (2011, October 12). Sixty-four free chemistry databases. Depth-First. https://depth-first.com/articles/2011/10/12/sixty-four-free-chemistry-databases/
Bartels, C., Wegrzyn, M., Wiedl, A., Ackermann, V., & Ehrenreich, H. (2010). Practice effects in healthy adults: A longitudinal study on frequent repetitive cognitive testing. BMC Neuroscience, 11(118). https://doi.org/10.1186/1471-2202-11-118
Cellan-Jones, R. (2014, December 2). Stephen Hawking warns artificial intelligence could end mankind. BBC News. https://www.bbc.com/news/technology-30290540
ChatGPT Settings [Optimize Prompts & Outputs]. (n.d.). ChatGPT settings guide. Retrieved September 9, 2023, from https://settingsguide.com/chatgpt-settings/
Chemistry Unleashed. (2019, September 26). How to study for exams [Video]. YouTube. https://www.youtube.com/watch?v=cX6qhAo0KhM
Chemistry Unleashed. (n.d.). Home [YouTube channel]. YouTube. Retrieved September 9, 2023, from https://www.youtube.com/@chemistryunleashed4348
Choudhary, G. (2023, May 17). OpenAI empowers ChatGPT with internet browsing and third-party plugins in latest GPT-4 update for plus subscribers. Mint. https://www.livemint.com/technology/tech-news/openai-empowers-chatgpt-with-internet-browsing-and-third-party-plugins-in-latest-gpt-4-update-for-plus-subscribers-11684311903318.html
Clark, T. (2023). Investigating the use of an artificial intelligence chatbot with general chemistry exam questions. Journal of Chemical Education, 100(5), 1905–1916. https://doi.org/10.1021/acs.jchemed.3c00027
Deep Dream Generator. (n.d.). Retrieved September 9, 2023, from https://deepdreamgenerator.com/
Fendos, J. (2021). Combining jigsaws, rule-based learning, and retrieval practice improves IUPAC nomenclature competence. Journal of Chemical Education, 98(5), 1503–1517. https://doi.org/10.1021/acs.jchemed.0c01235
Fergus, S., Botha, M., & Ostovar, M. (2023). Evaluating academic answers generated using ChatGPT. Journal of Chemical Education, 100(4), 1672–1675. https://doi.org/10.1021/acs.jchemed.3c00087
Frieder, S., Pinchetti, L., Chevalier, A., Griffiths, R.-R., Salvatori, T., Lukasiewicz, T., Petersen, P., & Berner, J. (2023). Mathematical capabilities of ChatGPT. arXiv. https://doi.org/10.48550/arXiv.2301.13867
Graham, J., McIntee, E., & Schaller, C. (2016). Web-based 2D NMR spectroscopy practice problems. Journal of Chemical Education, 93(8), 1483–1485. https://doi.org/10.1021/acs.jchemed.6b00007
Hawking, S. (2018). Brief answers to the big questions. Bantam Books.
Hayls World. (2023, March 21). 10 ChatGPT life hacks that’ll change your life!! [Video]. YouTube. https://www.youtube.com/watch?v=-fopYsgFdzc
Hayls World. (2023, September 4). 10 ChatGPT hacks that take it to the next level!! [Video]. YouTube. https://www.youtube.com/watch?v=LHNghEPMZIs
Hochman, A. (1986, April 4). Math teachers stage a calculated protest. The Washington Post. https://www.washingtonpost.com/archive/local/1986/04/04/math-teachers-stage-a-calculated-protest/c003ddaf-b86f-4f2b-92ca-08533f3a5896/
Howfinity. (2023, May 20). How to use Midjourney – AI text to image generator – beginner’s guide [Video]. YouTube. https://www.youtube.com/watch?v=_MF6qxPZDHs
Howfinity. (2023, August 15). Photoshop AI – Biggest AI update for design [Video]. YouTube. https://www.youtube.com/watch?v=mmB0QHV0pCM
Hu, K. (2023, February 2). ChatGPT sets record for fastest-growing user base – analyst note. Reuters. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
Humphry, T., & Fuller, A. (2023). Potential ChatGPT use in undergraduate chemistry laboratories. Journal of Chemical Education, 100(4), 1434–1436. https://doi.org/10.1021/acs.jchemed.3c00006
Kanade, V. (2022, March 25). Narrow AI vs. general AI vs. super AI: Key comparisons. Spiceworks. https://www.spiceworks.com/tech/artificial-intelligence/articles/narrow-general-super-ai-difference/
Lakshmanan, L. (2022, December 16). Why large language models like ChatGPT are bullshit artists, and how to use them effectively. Becoming Human: Artificial Intelligence Magazine. https://web.archive.org/web/20221217075021/https:/becominghuman.ai/why-large-language-models-like-chatgpt-are-bullshit-artists-c4d5bb850852
Lang, J. (2020). Assessed attention. In Distracted: Why students can’t focus and what you can do about it (pp. 208–209). Basic Books.
Murdok. (2023, August 1). Understanding and adjusting the “temperature” parameter in ChatGPT. MDOK. https://www.murdok.org/artificial-intelligence/understanding-and-adjusting-the-temperature-parameter-in-chatgpt/
OpenAI ChatGPT. (n.d.). Retrieved September 9, 2023, from https://openai.com/chatgpt
OpenAI DALL-E2. (n.d.). Retrieved September 9, 2023, from https://openai.com/dall-e-2
Pang, W. (2016, December 22). The common high-school tool that’s banned in college. The Atlantic. https://www.theatlantic.com/education/archive/2016/12/the-conundrum-of-calculators-in-the-classroom/493961/
Pathak, K. (2023, July 31). The war between ChatGPT, Bing Chat, and Google Bard. Lifehacker. https://lifehacker.com/the-war-between-chatgpt-bing-chat-and-google-bard-1850575579
Prompt Fu. (n.d.). Prompt fu: Tech tips and command line fu found within this dojo. Retrieved September 9, 2023, from https://www.promptfu.com/
Ramponi, M. (2022, December 23). How ChatGPT actually works. AssemblyAI. https://www.assemblyai.com/blog/how-chatgpt-actually-works/
Ray, T. (2023, February 7). Google’s Bard builds on controversial LaMDA bot that engineer called ‘sentient’. ZDNet. https://www.zdnet.com/article/googles-bard-builds-on-controversial-lamda-bot-that-engineer-had-called-sentient/
Reaxys. (n.d.). An expert-curated chemistry database. Retrieved September 9, 2023, from https://beta.elsevier.com/products/reaxys?trial=true
Rowland, C. (2014). The effect of testing versus restudy on retention: A meta-analytic review of the testing effect. Psychological Bulletin, 140(6), 1432–1463. https://doi.org/10.1037/a0037559
Rutschman, A. (2018, March 15). Stephen Hawking warned about the perils of artificial intelligence – yet AI gave him a voice. The Conversation. https://theconversation.com/stephen-hawking-warned-about-the-perils-of-artificial-intelligence-yet-ai-gave-him-a-voice-93416
Szu, E., Nandagopal, K., Shavelson, R., Lopez, E., Penn, J., Scharberg, M., & Hill, G. (2011). Understanding academic performance in organic chemistry. Journal of Chemical Education, 88(9), 1238–1242. https://doi.org/10.1021/ed900067m
Tyson, J. (2023). Shortcomings of ChatGPT. Journal of Chemical Education, 100(8), 3098–3101. https://doi.org/10.1021/acs.jchemed.3c00361
University of Texas Libraries. (2023, September 8). Chemistry: Free resources. https://guides.lib.utexas.edu/chemistry/free
VAGPE Media. (2023, August 15). 3D Animated Disney cartoon story with free AI tools in 5 mins | AI video generator image to animation [Video]. YouTube. https://www.youtube.com/watch?v=PcVQf3m_pxQ
Weisner, B., & Schweber, N. (2023, June 9). The lawyer who relied on ChatGPT explains himself. It was awkward. The New York Times (Digital Edition).
What’s the next word in large language models? (2023, April 24). Nature Machine Intelligence, 5, 331–332. https://doi.org/10.1038/s42256-023-00655-z