Application and Measurement Validity Evaluation of Generative Artificial Intelligence in Content Analysis
程萧潇,吴栎骞
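Abstract:
This study examines the prospects for applying generative artificial intelligence models, represented by GPT, in content analysis research, and the potential loss of validity that such use entails. Using Chinese and English social media texts about climate change, it systematically evaluates how the validity of GPT models in coding core concepts of journalism and communication research (cognition, emotion, and stance) differs along three dimensions: language/dataset, prompt tuning strategy, and model version. It also explores the likely reasons behind these differences. The results show that GPT tends to over-identify and over-interpret textual content and exhibits a bias with respect to "neutral texts". In the multidimensional comparison, the study finds no clear cross-language/dataset difference in the validity of GPT's concept coding; GPT-4 shows higher measurement validity than GPT-3.5 in some categories; and prompt-tuned GPT models improve coding accuracy to some extent, although adding more example samples may cause some loss of validity. The study further finds that the lexical and semantic features of a text affect GPT's measurement validity.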
Keywords: GPT; large language models; content analysis; generative artificial intelligence; validity
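Funding: This article is an interim output of the 2023 National Social Science Fund of China Youth Project "Research on the Effects of and Strategies for Improving the International Communication of China's Climate Issues" (Grant No. 23CXW034).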
Authors: 程萧潇, 吴栎骞
- ① https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset
- ② https://www.kaggle.com/datasets/die9origephit/climate-change-tweets
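- ③ The F1 score is the harmonic mean of precision and recall. Precision is the share of texts that the GPT model predicts as positive that are actually positive; recall is the share of all actually positive samples that the model correctly predicts as positive. For binary variables (sub-frames and discrete emotions), we follow convention and report the F1 score of the positive class; for multi-class variables (main frame, stance, and sentiment valence), given the imbalanced distribution of texts across subcategories, this study uses the weighted F1 score (F1-weighted).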
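
The two F1 variants described in note ③ can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' actual evaluation code: the function names, the toy label lists, and the stance labels ("support", "neutral", "oppose") are hypothetical. The weighted score here weights each class's F1 by its share of the human-coded labels, which is the usual meaning of "F1-weighted"; the source does not spell out the exact implementation.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 with `positive` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        count / total * precision_recall_f1(y_true, y_pred, label)[2]
        for label, count in support.items()
    )


# Hypothetical binary variable (e.g. whether a discrete emotion is present).
human_binary = [1, 0, 1, 1, 0, 0]
gpt_binary = [1, 1, 1, 0, 0, 1]
_, _, f1_positive = precision_recall_f1(human_binary, gpt_binary, positive=1)

# Hypothetical multi-class variable (e.g. stance) with imbalanced categories.
human_stance = ["support", "neutral", "oppose", "support", "support", "neutral"]
gpt_stance = ["support", "support", "oppose", "support", "oppose", "neutral"]
f1_weighted = weighted_f1(human_stance, gpt_stance)

print(f"positive-class F1: {f1_positive:.3f}")
print(f"weighted F1: {f1_weighted:.3f}")
```

Weighting by class share keeps a rare category from dominating or vanishing in the summary score, which is the motivation note ③ gives for using F1-weighted on the imbalanced multi-class variables.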