Can ChatGPT reduce human financial analysts’ optimistic biases?

Abstract This paper examines the potential of ChatGPT, a large language model, as a financial advisor for listed firm performance forecasts. We focus on the constituent stocks of the China Securities Index 300 and compare ChatGPT’s forecasts for major financial performance measures with human analysts’ forecasts and the realised values. Our findings suggest that ChatGPT can correct the optimistic biases of human analysts. This study contributes to the literature by exploring the potential of ChatGPT as a financial advisor and demonstrating its role in reducing human biases in financial decision-making.


Introduction
The recent development of large language models (LLMs) has led to numerous applications across various disciplines. With a large number of parameters, these models can be fine-tuned to receive input instructions and generate human-like responses. In such a context, a body of literature has emerged to explore their applications in the field of finance. In this paper, we investigate the capability of ChatGPT, the most renowned LLM, in forecasting listed firm performance. In addition, we examine whether ChatGPT can reduce the optimistic biases of human analysts.
Although ChatGPT is primarily a language model and not specifically designed for financial decision-making (Ko and Lee 2023), its ability to efficiently extract and process a wide range of information makes it suitable to serve as a financial advisor. We provide a detailed review of the usage of ChatGPT in finance in the next section.
In this paper, we investigate ChatGPT's role in forecasting listed firm performance. Analyst forecasts involve intensive information extraction and processing activities in which machines have an advantage, given their high computation efficiency, over humans (Boyacı, Canyakmaz, and de Véricourt 2023). Research consistently shows that machines, or machine-augmented analysts, outperform humans in earnings and stock price predictions (e.g. Chen et al. 2022; Coleman, Merkley, and Pacelli 2022; Cao et al. 2023). Therefore, we expect ChatGPT to produce more accurate performance forecasts than human analysts do.
Furthermore, we explore the channels through which improved forecast accuracy is achieved. Human analysts often exhibit optimistic biases (Easterwood and Nutt 1999; Lim 2001; Wu et al. 2018), which stem from their involvement in the forecast process (Duru and Reeb 2002) and conflicts of interest (Hovakimian and Saenyasiri 2010). Machines, however, are impartial (Tantri 2021) and less affected by human biases (Liaudinskas 2022; Liu 2022). These biases can explain many anomalies (van Binsbergen, Han, and Lopez-Lira 2023) and may be related to the memory process. Drawing evidence from the pricing of artistic works, Aubry et al. (2023) show that machines can reduce human experts' conscious rational biases and unconscious behavioural biases, thus improving the forecast accuracy of auction outcomes. Similarly, we posit that ChatGPT's superior ability in forecasting firm performance results from mitigating human analysts' optimistic biases.
One of the major challenges in studying ChatGPT's forecasts involves restricting the information set. It is crucial to ensure that the model does not use future information that includes the realised outcomes. However, as ChatGPT operates as a black box, it is not possible to prevent it from using data beyond a certain time point simply by giving it instructions. Fortunately, ChatGPT's training data extend only up to September 2021. Leveraging this setting, we instruct ChatGPT to forecast the performance of each firm in the China Securities Index 300 (CSI 300) from 2021 to 2023 and then compare its forecasts with those of human analysts and the realised performance. For human analysts' forecasts, we only use analyst reports issued in September 2021 to ensure that human analysts and ChatGPT have access to a comparable information set. Empirically, we focus on seven major financial performance measures, namely the price-to-earnings ratio (PE), the price-to-book ratio (PB), earnings per share (EPS), the return on assets (ROA), the return on equity (ROE), Revenue Growth, and Profit Growth.
In the comparison between ChatGPT and human analysts, we find that ChatGPT is significantly more conservative than human analysts across all performance dimensions and forecast horizons, i.e. the end of the years 2021, 2022 and 2023. Using realised performance as a benchmark, human analysts exhibit systematic and persistent optimistic biases. They overestimate all seven performance measures across all time horizons. The upward biases are more pronounced for the long-term horizon, with the exception of the PE forecasts. In contrast, ChatGPT does not exhibit one-sided upward biases. For the short-term horizon of 2021, its forecasts are not significantly different from the realised performance for five of the seven measures, while the forecasted values of the remaining two are lower than the realised ones. For the long-term horizon of 2022, ChatGPT exhibits biases towards higher values for five of the seven measures. However, the forecast errors are quantitatively smaller than those of human analysts, indicating that ChatGPT at least partially mitigates the optimistic biases in human analysts' forecasts.
We quantify the humans' optimistic biases corrected by ChatGPT in a formal regression setting. We use the upward forecast errors, calculated as the differences between forecasted and realised values, as the dependent variables, and regress them against a ChatGPT dummy variable indicating whether the forecast is issued by ChatGPT. ChatGPT exhibits fewer optimistic biases in all seven measures than human analysts, and the differences are statistically significant for ROA, ROE, Revenue Growth, and Profit Growth. This implies that ChatGPT has the potential to help correct the optimistic biases of human analysts.
This study contributes to the literature by exploring the potential of ChatGPT as a financial advisor, and it deepens our understanding of the strengths and limitations of investment based on advice from artificial intelligence (AI). In addition, it contributes to the discussion on the interaction between machines and humans by demonstrating how machines can reduce human biases.

Literature review
We summarise the research on the application of ChatGPT in finance in three aspects: financial concept comprehension, academic use, and investment decision-making.
The first strand of literature explores whether ChatGPT is able to comprehend key financial concepts, explain financial reporting to non-professionals, and assume the role of a personal financial advisor. ChatGPT can accurately explain financial concepts such as alpha values, crowdfunding, alternative finance, financial risk, financial crises, the Basel framework, and banking products (Wenzlaff and Spaeth 2022; Hofert 2023; Lakkaraju et al. 2023; Ren, Lee, and Hu 2023; Yue et al. 2023), although its elaboration of mathematical facts needs improvement (Hofert 2023). Niszczota and Abbas (2023) examine whether ChatGPT is capable of serving as a financial advisor and find that it exhibits a higher level of financial literacy than human investors who make random guesses. Overall, ChatGPT is comparable to financial professionals and demonstrates high levels of accuracy and expertise (Ren, Lee, and Hu 2023). In a similar setting, Wei, Wu, and Chu (2023) find that ChatGPT's answers to auditing questions imitate those from experienced financial auditors.
In addition to helping laypeople comprehend financial and accounting concepts, ChatGPT is capable of explaining the jargon in plain language. Ren, Lee, and Hu (2023) show that ChatGPT's answers to financial and accounting questions are more understandable to laypeople than those from human experts. Neilson (2023) provides more direct evidence that ChatGPT can recommend superannuation contribution plans to non-professionals. By asking ChatGPT questions or giving it directions such as 'explain the meaning of alpha in finance to my grandmother', Yue et al. (2023) show that ChatGPT can further customise the complexity of its explanations when given indications of its audience. Notwithstanding its merits, Lakkaraju et al. (2023) and Neilson (2023) warn of the possible limits of ChatGPT in numeric reasoning, its inconsistency, and its ignorance of various relevant issues, such as local regulatory requirements.
Regarding the academic use of ChatGPT in economics and finance research, evidence of ChatGPT's performance is mixed depending on the specific jobs that it undertakes. ChatGPT does well in coding support, data analyses, and the interpretation of findings (Alshater 2022; Dowling and Lucey 2023; Feng, Hu, and Li 2023; Korinek 2023). However, its performance is unsatisfactory in literature synthesis, the development of testing frameworks, domain-specific expertise, and idea origination (Alshater 2022; Dowling and Lucey 2023).
The literature investigating the role of ChatGPT in investment decision-making is closely related to our research context. The literature is inconclusive regarding ChatGPT's understanding and interpretation of financial texts in a zero-shot setting (i.e. no example of expected responses provided in prompts). Some papers conclude that ChatGPT is accurate and efficient in extracting the opinions and sentiments in news, Fedspeak, and corporate disclosures (Hansen and Kazinnik 2023; Jha et al. 2023). Conversely, others claim that ChatGPT struggles in tasks such as financial named entity recognition (FinNER) and sentiment analyses (Lan et al. 2023; Li et al. 2023). Wang et al. (2023) demonstrate that assessing the performance of ChatGPT is complicated: it generates reasonable answers, which may not always be relevant to the prompts given; at the same time, it outperforms fine-tuned bidirectional encoder representations from transformers (BERT) models in sentiment analyses. One possible explanation of the conflicting results may be the use of different data sources and performance measures. Studies relying on only one data set tend to overstate the performance of ChatGPT (Jha et al. 2023; Hansen and Kazinnik 2023), whereas those using multiple data sets and different tasks observe more of its limitations (Li et al. 2023; Wang et al. 2023).
Recent papers extend this literature by directly examining whether ChatGPT can extract value-relevant signals from the financial context in investment decision-making. Lopez-Lira and Tang (2023) conduct a sentiment analysis of news headlines for stocks using ChatGPT, classifying news as good, bad, or irrelevant. They construct a 'ChatGPT score' for each stock and find that it is positively correlated with subsequent daily stock returns. Kim, Muhn, and Nikolaev (2023) instruct ChatGPT to summarise information contained in management discussions and analyses, annual reports, and earnings conference calls, and generate refined summaries with pronounced sentiments. The refined summaries exhibit more significant explanatory power over the abnormal returns surrounding the disclosure days than the original texts. However, ChatGPT has limitations. Xie et al. (2023) use ChatGPT to predict the direction of future stock movements with inputs of historical stock prices. Results show that the performance of ChatGPT is poor, as the predictions are less accurate than those of logistic regressions.
Applying ChatGPT to investment, Ko and Lee (2023) find that ChatGPT outperforms a randomly selected portfolio. Using a three-stage procedure, Chen et al. (2023) provide further evidence on how ChatGPT utilises the information it extracts to achieve superior investment performance. They first give financial news prompts to ChatGPT and ask which companies are positively or negatively affected. Then, they construct graphs that visualise the relationships. Finally, they use machine learning methods, including graph neural networks and long short-term memory neural networks, to make predictions of stock price movements with higher accuracy than those of ChatGPT.
Our paper differs from the above research in two distinct aspects. First, we examine ChatGPT's use in investment using a comprehensive set of performance measures. We investigate the accuracy of ChatGPT in predicting stock valuation, profitability, and growth. Second, in addition to comparing ChatGPT and humans, we discuss the use of ChatGPT to reduce human biases. By showing how ChatGPT is more impartial than human analysts and less subject to optimistic biases, this paper reveals the potential for collaboration between ChatGPT and humans in investment decision-making.

Data description
We restrict our sample to the constituent stocks of the CSI 300. The CSI 300 was introduced on 8 April 2005 by the Shanghai and Shenzhen stock exchanges. It consists of the 300 most actively traded Chinese A-share stocks, which account for over 70% of the combined market capitalisation of the two exchanges. The index is widely recognised as a comprehensive indicator of broad movements in the Chinese stock markets (Hou and Li 2014).
We designate seven measures of firm performance, which are commonly used and discussed by human analysts in evaluating firm performance. For stock valuation, we use PE, calculated as the share price divided by the earnings per share, and PB, constructed as the share price over the book value per share. Regarding profitability, we use profit divided by the outstanding shares (EPS), net income over total assets (ROA), and net income over total equity (ROE). We measure firm growth using the rates of Revenue Growth and Profit Growth.
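For concreteness, the seven measures reduce to simple ratios of standard accounting items. The sketch below computes them from hypothetical inputs; the argument names are illustrative and do not correspond to any particular database's field names.

```python
# Illustrative computation of the seven performance measures.
# All inputs are hypothetical placeholders, not data from the study.
def performance_measures(price, book_value_ps, net_income, revenue,
                         total_assets, total_equity, shares_outstanding,
                         prior_net_income, prior_revenue):
    eps = net_income / shares_outstanding      # earnings per share
    return {
        "PE": price / eps,                     # price-to-earnings
        "PB": price / book_value_ps,           # price-to-book
        "EPS": eps,
        "ROA": net_income / total_assets,      # return on assets
        "ROE": net_income / total_equity,      # return on equity
        "RevenueGrowth": revenue / prior_revenue - 1,
        "ProfitGrowth": net_income / prior_net_income - 1,
    }
```
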
To obtain ChatGPT's forecasts, we input each firm's name and stock code and request ChatGPT to forecast PE, PB, EPS, ROA, ROE, Revenue Growth, and Profit Growth at the end of 2021, 2022, and 2023. The responses from ChatGPT are highly dependent on the prompts, which are essentially the user's questions. Korinek (2023) finds that minor tweaks in the prompts might result in different outcomes. Therefore, we try different prompts to retrieve the desired results. Following the guidance by OpenAI and Alshater (2022), we construct the prompt as follows: Provide a table of price-to-earnings (P/E) forecasts at the end of 2021, 2022 and 2023 for the firms below, as of September 2021:

[A list of firm names and tickers]
There should be four columns in the table: Ticker, P/E 2021, P/E 2022, P/E 2023.
We make our prompt as clear as possible. The measure, time horizons, firms, and as-of time are specified explicitly in the prompt. In the absence of time horizons, ChatGPT responds with the latest realised values as of September 2021. In the absence of the as-of time, ChatGPT declines the assignment and explains that it is unable to provide information after September 2021. ChatGPT makes the forecasts that we need only when the prompt contains both the time horizon and the as-of time. The table output format that we request hides ChatGPT's language analyses of each firm's future performance and keeps the forecasted values only. We demarcate the prompt with delimiters: a single newline separates firms, and two newlines separate the instructions from the firm list.
There is a context length limit on a single prompt-response pair. We therefore split the 300 sample firms into 10 batches of 30 firms each. This reasonably small batch size avoids exceeding the length limit, so a firm's forecasts are never broken off across horizons mid-response and no resumption prompts are needed that could interfere with the output, enabling forecasting to proceed without interruption.
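The batching and prompt construction described above can be sketched as follows. This is a simplified illustration: the firm names are placeholders, and only the wording shown earlier for the P/E prompt is taken from the paper.

```python
# Split firms into batches and build one delimiter-structured prompt per batch:
# single newlines separate firms; double newlines separate instructions
# from the firm list.
def build_prompts(firms, measure="price-to-earnings (P/E)", batch_size=30):
    prompts = []
    for start in range(0, len(firms), batch_size):
        batch = firms[start:start + batch_size]
        firm_list = "\n".join(f"{name} ({ticker})" for name, ticker in batch)
        prompt = (
            f"Provide a table of {measure} forecasts at the end of 2021, 2022 "
            f"and 2023 for the firms below, as of September 2021:\n\n"
            f"{firm_list}\n\n"
            "There should be four columns in the table: "
            "Ticker, P/E 2021, P/E 2022, P/E 2023."
        )
        prompts.append(prompt)
    return prompts
```

With 300 firms and a batch size of 30, this yields the paper's 10 batches.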
In some text generation tasks, few-shot prompting, or providing dialogue examples in the prompt, can improve generation quality. By adopting zero-shot prompting instead of few-shot prompting, we take a strictly neutral stance and do not bias ChatGPT with human forecast examples (Zhao et al. 2021). We always start a new chat when moving to the next performance measure to ensure that the outputs are not affected by previous instructions.
Another challenge is that ChatGPT may produce slightly different forecasts even when given the same prompt multiple times. Fortunately, the results are highly similar and comparable despite minor differences (Ko and Lee 2023). To mitigate generation idiosyncrasy, we independently repeat the process 35 times and take the average. This approach gives us three-year forecasts on the seven measures for the 300 firms of the CSI 300.
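The averaging over repeated runs can be expressed as a small helper. The `query_model` callable is a hypothetical stand-in for issuing one prompt and parsing one response; it is not part of any real API.

```python
import statistics

# Aggregate forecasts over repeated independent runs to smooth out
# generation noise. `query_model` is a hypothetical stand-in that
# returns one parsed forecast value per call.
def aggregate_forecasts(query_model, n_runs=35, stat=statistics.mean):
    runs = [query_model() for _ in range(n_runs)]
    return stat(runs)
```

Passing `statistics.median` as `stat` instead of the mean gives a more outlier-robust aggregate from the same 35 runs.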
We obtain financial analysts' forecasts from the China Stock Market & Accounting Research (CSMAR) database. To ensure that human analysts and ChatGPT use a comparable information set, we restrict the human forecasts to those issued in September 2021. Some reports provide forecasts for different time horizons. We obtain a total of 10,550 forecast-year observations from 582 analyst reports. The realised performance data come from CSMAR and Wind, but we only have the data for 2021 and 2022. The 2023 data were not available at the time of writing this paper. Table 1 presents the summary statistics of ChatGPT's forecasts, human analysts' forecasts, and the realised performance, which are displayed in three rows for each measure. The first row contains ChatGPT's forecasts with 900 observations (3 years × 300 firms), the second row contains the human analysts' forecasts, and the last row contains the 600 observations of realised performance (2 years × 300 firms). As many stocks in our sample are followed by multiple analyst teams, a stock may have more than one forecast from human analysts for a given year. Therefore, the sample of human analysts' forecasts can be larger than 900. In addition, an analyst report may not include all seven measures, leading to an unbalanced number of observations across different performance measures.
Table 1 reveals four noticeable patterns. First, the average and median forecasted values from ChatGPT are lower than those from human analysts in all seven measures. Second, human analysts consistently exhibit upward biases. Third, ChatGPT's forecasted values are closer to the realised ones, except for the forecast of PE. Fourth, ChatGPT's forecast errors are two-sided. It overestimates PB, EPS, ROE, Revenue Growth, and Profit Growth, but its forecasted values of PE and ROA are lower than the realised values. The summary statistics in Table 1 are consistent with our conjecture that ChatGPT outperforms human analysts by reducing their optimistic biases.
The above results pool the forecasts for different time horizons together, making it difficult to interpret the differences. In Table 2, we perform mean difference t-tests for each performance measure and each time horizon between ChatGPT and human analysts in Columns (1)-(3). We find evidence that human analysts are more optimistic than ChatGPT. In Columns (4) and (5), we compare human analysts' forecasts with the realised performance. We only have results for 2021 and 2022 because the 2023 data of realised performance are not available yet. Human analysts overestimate all performance measures significantly, and the magnitude of upward biases increases with the time horizon. In contrast, ChatGPT does not exhibit any optimistic biases for the 2021 horizon (Column [6]); its forecasts are not significantly different from the realised performance for five measures, except for ROA and Revenue Growth, which are slightly underestimated. However, ChatGPT's accuracy does not persist for the long-term horizon of 2022. The forecast errors in Column (7) are significantly non-zero in six of the seven performance measures, with the exception of PE. ChatGPT overestimates five of the seven measures, but the biases are much smaller than those of human analysts.
Notes to Table 2: The table presents mean differences between forecasts of ChatGPT and human analysts, and the forecast errors. Columns (1)-(3) compare ChatGPT with human analysts, Columns (4) and (5) show the forecast errors of human analysts, and Columns (6) and (7) show the forecast errors of ChatGPT. Mean differences are tested against zero, with t values in parentheses. ** and *** denote significance at the 5% and 1% levels, respectively.
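The tests in Table 2 are standard one-sample t-tests (forecast errors against zero) and two-sample mean-difference tests (ChatGPT versus human analysts). A minimal stdlib-only sketch of both statistics, using made-up numbers rather than the study's data:

```python
import math
import statistics

# One-sample t statistic: are forecast errors different from zero?
def t_stat_one_sample(errors, mu0=0.0):
    n = len(errors)
    mean = statistics.mean(errors)
    sd = statistics.stdev(errors)        # sample standard deviation
    return (mean - mu0) / (sd / math.sqrt(n))

# Welch two-sample t statistic: do two groups' mean forecasts differ?
# (Unequal variances and sample sizes are allowed.)
def welch_t(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
```
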

Empirical results
We present formal regression-based evidence to quantify the human analysts' optimistic biases that ChatGPT corrects. We use a full sample of forecasts from both ChatGPT and human analysts for the horizons of 2021 and 2022 (i.e. forecasted values for firm performance at the end of 2021 and 2022). Our dependent variables are upward forecast errors, calculated as the differences between the forecast, E(y), and the realised performance, y. Our focal variable is ChatGPT, a dummy variable indicating whether a forecast is issued by ChatGPT or human analysts.
We include a set of control variables for factors that are associated with analyst forecast errors as documented in the literature. Abarbanell (1991) shows that analyst forecasts do not fully incorporate past stock price information and are insufficiently efficient. This implies that price increases predict downward-biased forecast errors. To control for this effect, we include the annualised return over the 52 weeks prior to September 2021.
Uncertainty affects forecast errors (Das, Levine, and Sivaramakrishnan 1998; Lim 2001). Analysts tend to issue forecasts in favour of firms in exchange for private information from the management team. Following Lim (2001), we use the annualised volatility over the 52 weeks prior to September 2021 as a market-based control for uncertainty.
Analysts perform poorly in forecasting long-term performance (Harris 1999). Forecast errors increase with the time horizon, which means that it becomes harder for analysts to forecast performance accurately in the more distant future. Following Dong et al. (2021) and Bolliger (2004), we control for the forecast horizon.
The market accumulates more information about a firm as it ages. Maskara and Mullineaux (2011) illustrate that both forecast errors and firm age are related to information asymmetry. Following Amir, Lev, and Sougiannis (2003), we include firm age as a control variable.
Moreover, ownership structure affects analyst forecast errors (Ackert and Athanassakos 2003), particularly in the unique setting of the Chinese capital market (Huang and Wright 2015; Liu 2016), where sharp distinctions exist between state-owned and non-state-owned firms. We include a control variable indicating whether a firm is state-owned.
We specify the regression model in the equation below. The subscripts denote firm i and performance measure j in horizon year t:

UpwardError(i,j,t) = α + β1·ChatGPT + β2·Return(i) + β3·Volatility(i) + β4·Horizon(t) + β5·Age(i) + β6·SOE(i) + ε(i,j,t)

Table 3 presents the cross-sectional regression results. Compared with human analysts, ChatGPT exhibits significantly fewer optimistic biases in four of the seven measures (ROA, ROE, Revenue Growth, and Profit Growth). The coefficients of ChatGPT are also negative (but not significant) when the dependent variables are the upward forecast errors of PE, PB, and EPS. These results are consistent with the results of the mean difference t-tests in Table 2. The results concerning control variables also align with expectations. The optimistic biases increase with the forecast horizon and decrease with the annualised return at the time when the forecast is made, which is consistent with the finding of Abarbanell (1991).
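A stylised sketch of this regression design follows. All data are simulated, the coefficients are illustrative rather than the paper's estimates, and the paper's firm-level clustered standard errors are omitted for brevity; the point is only that the coefficient on the ChatGPT dummy captures the reduction in upward bias.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400

# Simulated regressors: a ChatGPT dummy plus stylised controls.
chatgpt = rng.integers(0, 2, size=n).astype(float)
ret = rng.normal(size=n)              # prior annualised return
vol = rng.normal(size=n)              # prior annualised volatility
horizon = rng.integers(1, 3, size=n).astype(float)
noise = rng.normal(scale=0.1, size=n)

# Simulated truth: ChatGPT forecasts carry 0.8 less upward bias.
error = 1.0 - 0.8 * chatgpt - 0.2 * ret + 0.1 * vol + 0.3 * horizon + noise

# OLS via least squares; beta[1] estimates the ChatGPT coefficient.
X = np.column_stack([np.ones(n), chatgpt, ret, vol, horizon])
beta, *_ = np.linalg.lstsq(X, error, rcond=None)
```

A negative `beta[1]` corresponds to the paper's finding that the ChatGPT dummy reduces upward forecast errors.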
Motivated by Adiwardana et al. (2020), who propose the sample-and-rank approach, we address the randomness in ChatGPT's forecasts by selecting its most confident forecast to conduct a robustness test. In the sample-and-rank approach, an LLM picks the candidate text sequence with the highest predicted probability as the final output. Following the same idea, we pick the median of the 35 responses as the final forecast of ChatGPT. We report the robustness test in Table 4, and the results are even stronger than those in Table 3. We observe that ChatGPT reduces human analysts' optimistic biases in six of the seven dimensions, with PE being the only exception.

Conclusion
In this paper, we utilise ChatGPT, an LLM, to forecast the performance of CSI 300 firms and compare its forecasts with those of human analysts issued in September 2021, which coincides with the cut-off date of ChatGPT's training data. Using the realised performance as a benchmark, we consistently find that ChatGPT outperforms human analysts, achieving smaller upward forecast errors. Human analysts tend to provide optimistic forecasts, whereas ChatGPT is more conservative. The superior accuracy of ChatGPT's forecasts can be attributed to its ability to correct the optimistic biases inherent in human analysts' forecasts.
We consider that LLM applications such as ChatGPT are not meant to fundamentally replace human financial analysts. Rather, ChatGPT can assist and improve human financial forecasting. In this article, we provide evidence of ChatGPT's ability to forecast financial performance and reduce humans' optimistic biases, which suggests the potential for ChatGPT to assist analysts and investors. ChatGPT holds the promise of reducing overconfidence or conflicts of interest among human analysts. On a cautionary note, we warn against overextending our results: investors should neither rely solely on ChatGPT nor treat its forecasts as the 'correct' answers. The result that ChatGPT outperforms human analysts on average does not mean it is always more accurate than human analysts. We interpret our findings as demonstrating the value of ChatGPT in supplementing human analysts' and investors' forecasts. As this paper represents the first attempt to uncover the forecast differences between ChatGPT and human analysts, we encourage future researchers to explore in depth the reasons for these differences.
This research has some limitations, which also serve as warnings to our readers. First, the analysis covers only a brief period of two-year forecasts and thus cannot capture a wide range of market dynamics. As a result, more evidence is needed about ChatGPT's forecast performance across long cycles. The short time span raises concerns regarding the generalisability of our findings, especially when market conditions change. As data availability increases, future researchers should compare ChatGPT's forecasts with those of human analysts over a longer historical span.
Second, due to the black-box nature of LLMs, we know little about the internal processes by which ChatGPT makes financial forecasts and delivers its forecasts with fewer optimistic biases than humans. Possible channels may include ChatGPT's superior ability to process fundamental information and synthesise beliefs and/or its greater impartiality compared with humans. This limitation suggests another direction for future research, that is, to leverage novel research designs and uncover the internal mechanisms of ChatGPT through which less optimistically biased forecasts are made.
To conclude, this paper provides empirical evidence on the applications of ChatGPT, one of a growing number of LLMs, in forecasting listed firms' performance, and highlights ChatGPT's potential in the provision of financial advice. In addition, our results elucidate the role of LLMs in mitigating human biases in financial decision-making. Moving forward, future researchers may consider exploring other applications of LLMs in finance and investigating their effectiveness in various decision-making contexts.
Table 1.
Summary statistics.
Notes: This table presents the summary statistics of forecasts for PE, PB, EPS, ROA, ROE, Revenue Growth, and Profit Growth for the CSI 300 constituent stocks. 'GPT' is short for ChatGPT, and 'Analyst' for human analysts. For each measure, statistics of ChatGPT's forecasts, human analysts' forecasts, and realised values are displayed in three rows.

Table 2.
Mean difference results.

Table 3.
Cross-sectional regressions of forecast errors. This table presents the cross-sectional regression results for forecast errors of ChatGPT and human analysts. The dependent variables are the upward forecast errors, calculated as the differences between the forecasted and realised values. The independent variable is ChatGPT, a binary indicator taking a value of 1 if a forecast is made by ChatGPT and 0 otherwise. L.variable indicates a variable lagged for one period. We report regression coefficients with t values in parentheses. Standard errors are clustered at the firm level. *, **, and *** denote significance at the 10%, 5%, and 1% levels, respectively.

Table 4.
Cross-sectional regressions of forecast errors (robustness). This table presents the cross-sectional regression results for forecast errors of ChatGPT and human analysts. The dependent variables are the upward forecast errors, calculated as the differences between the forecasted and realised values. We use the median of the 35 candidate forecasts to replace the mean as the final forecast of ChatGPT for a robustness test. The independent variable is ChatGPT, a binary indicator that takes a value of 1 if a forecast is made by ChatGPT and 0 otherwise. We report regression coefficients with t values in parentheses. Standard errors are clustered at the firm level. * and *** denote significance at the 10% and 1% levels, respectively.