These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.[1] High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.[2][3][4]
Many organizations, including governments, publish and share their datasets. The datasets are classified, based on their licenses, as open data or non-open data.
The datasets from various governmental bodies are presented in List of open government data sites. The datasets are hosted on open data portals, where they can be searched, deposited and accessed through interfaces such as Open API. The datasets are made available in various sorted types and subtypes.
Sarcasm, Perceived and Intended, by Reactive Supervision (SPIRS)
Intended and perceived sarcastic tweets along with their context collected using reactive supervision; an equal number of negative (non-sarcastic) samples
The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenges focused on improving the state of the art in tracking the state of spoken dialog systems.
L. Zheng; N. Guha; B. Anderson; P. Henderson; D. Ho
Caselaw Access Project
All official, book-published state and federal United States case law — every volume or case designated as an official report of decisions by a court within the United States.
260 hours of speech, from 543 speakers (302 male, 241 female) from across the United States, for around 2,400 two-sided telephone conversations, collected by Texas Instruments in 1990-1991.
audio, text transcript, word-level timestamps, phonetic transcriptions
260 hours of speech, from 543 speakers (302 male, 241 female) from across the United States, for around 2,400 two-sided telephone conversations, at ~3 million words. Collected by Texas Instruments in 1990-1991.
audio, text transcript, word-level timestamps, phonetic transcriptions
speech recognition, phonetic transcription. The most commonly used test set for this dataset is called "Hub5'00".
7805 gesture captures of 14 different social touch gestures performed by 31 subjects. The gestures were performed in three variations: gentle, normal and rough, on a pressure sensor grid wrapped around a mannequin arm.
Touch gestures performed are segmented and labeled.
2D maps and 3D grids from thousands of N-body and state-of-the-art hydrodynamic simulations spanning a broad range in the value of the cosmological and astrophysical parameters
Each map and grid has 6 cosmological and astrophysical parameters associated with it
A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people. Public domain.
A five-step method to infer birth and death years, gender, and occupation from community-submitted data to all language versions of the Wikipedia project.
Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed.
There are two markups: one for outlier detection (point anomalies) and one for changepoint detection (collective anomalies) problems
This section includes datasets that deal with structured data.
Dataset Name
Brief description
Preprocessing
Instances
Format
Default Task
Created (updated)
Reference
Creator
DBpedia Neural Question Answering (DBNQA) Dataset
A large collection of question-to-SPARQL pairs specially designed for open-domain neural question answering over the DBpedia knowledge base.
This dataset contains a large collection of Open Neural SPARQL Templates and instances for training Neural SPARQL Machines; it was pre-processed by semi-automatic annotation tools as well as by three SPARQL experts.
This section includes datasets that contain multi-turn text with at least two actors, a "user" and an "agent". The user makes requests of the agent, which carries them out.
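As a rough illustration of this structure, a multi-turn dialog can be modelled as an ordered list of turns, each attributed to one of the two actors. The class and field names below (`Turn`, `Dialog`, `speaker`, `text`, `dialog_id`) are hypothetical and are not taken from any dataset in this section; each real dataset defines its own schema.

```python
from dataclasses import dataclass, field
from typing import List

SPEAKERS = ("user", "agent")  # the two actors named in the text


@dataclass
class Turn:
    """One utterance, attributed to either the user or the agent."""
    speaker: str
    text: str

    def __post_init__(self) -> None:
        # Reject any actor other than the two this sketch assumes.
        if self.speaker not in SPEAKERS:
            raise ValueError(f"unknown speaker: {self.speaker!r}")


@dataclass
class Dialog:
    """An ordered sequence of turns (hypothetical schema)."""
    dialog_id: str
    turns: List[Turn] = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        self.turns.append(Turn(speaker, text))

    def has_both_actors(self) -> bool:
        """True if the user and the agent have each spoken at least once."""
        return {t.speaker for t in self.turns} == set(SPEAKERS)


dialog = Dialog("example-1")
dialog.add("user", "Book a table for two at 7 pm.")
dialog.add("agent", "Your table for two is booked for 7 pm.")
print(dialog.has_both_actors())
```

Keeping the speaker as an explicit field on every turn, rather than inferring it from position, allows for dialogs in which one actor speaks twice in a row.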
Dataset Name
Brief description
Preprocessing
Instances
Format
Default Task
Created (updated)
Reference
Creator
Taskmaster
"The Taskmaster corpus consists of THREE datasets, Taskmaster-1 (TM-1), Taskmaster-2 (TM-2), and Taskmaster-3 (TM-3), comprising over 55,000 spoken and written task-oriented dialogs in over a dozen domains."[339]
Taskmaster-1: goal-oriented conversational dataset. It includes 13,215 task-based dialogs comprising six domains.
Taskmaster-2: 17,289 dialogs in the seven domains (restaurants, food ordering, movies, hotels, flights, music and sports).
Taskmaster-3: 23,757 movie ticketing dialogs.
Taskmaster-1 and Taskmaster-2: conversation id, utterances, Instruction id
"LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word."[343]
MITRE Engage is a framework for planning and discussing adversary engagement operations that empowers you to engage your adversaries and achieve your cybersecurity goals.
A dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate change collected on the internet.
Each claim is accompanied by five manually annotated evidence sentences retrieved from the English Wikipedia that support, refute, or do not give enough information to validate the claim, totalling 7,675 claim-evidence pairs.[393]
Large collection of monolingual corpora extracted from web data (Common Crawl dumps) covering 150+ languages
Various (filtering, language classification, adult-content detection and other labelling)
3.4 TB English text, 1.4 TB Chinese text, 1.1 TB Russian text, 595 MB German text, 431 MB French text, and data for 150+ languages (figures for version 23.01)
As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research.
OpenML:[495] Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms.
PMLB:[496] A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API.
Metatext NLP: A community-maintained web repository (https://metatext.io/datasets) containing nearly 1,000 benchmark datasets. Covers many tasks, from classification to question answering, and various languages, including English, Portuguese and Arabic.
Appen: Off The Shelf and Open Source Datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.[497][498]
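To make the idea of a standardized format concrete, the following is a minimal sketch of converting two source formats (CSV and JSON Lines) into one common `(features, labels)` representation. It does not show how OpenML, PMLB or any other repository above works internally; the function names `from_csv` and `from_json_lines`, the column names and the sample data are all assumptions made for illustration.

```python
import csv
import io
import json
from typing import List, Tuple

Row = List[float]


def from_csv(text: str, target: str) -> Tuple[List[Row], List[str]]:
    """Parse CSV text with a header row into (features, labels)."""
    features, labels = [], []
    for record in csv.DictReader(io.StringIO(text)):
        labels.append(record.pop(target))
        # Sort column names so the feature order is the same for every source.
        features.append([float(record[k]) for k in sorted(record)])
    return features, labels


def from_json_lines(text: str, target: str) -> Tuple[List[Row], List[str]]:
    """Parse one JSON object per line into (features, labels)."""
    features, labels = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        labels.append(str(record.pop(target)))
        features.append([float(record[k]) for k in sorted(record)])
    return features, labels


# Hypothetical sample data: the same two instances in both formats.
csv_text = "width,height,label\n1.0,2.0,a\n3.0,4.0,b\n"
jsonl_text = (
    '{"width": 1.0, "height": 2.0, "label": "a"}\n'
    '{"width": 3.0, "height": 4.0, "label": "b"}\n'
)

# Both loaders should yield the same standardized representation.
print(from_csv(csv_text, "label") == from_json_lines(jsonl_text, "label"))
```

Once every source is reduced to the same representation, the same evaluation code can run on any dataset without format-specific handling, which is the practical benefit of curated repositories.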
^Weiss, G. M.; Provost, F. (October 2003). "Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction". Journal of Artificial Intelligence Research. 19: 315–354. doi:10.1613/jair.1199.
^Žliobaitė, Indrė; Bifet, Albert; Pfahringer, Bernhard; Holmes, Geoff (2011). "Active Learning with Evolving Streaming Data". Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science. Vol. 6913. pp. 597–612. doi:10.1007/978-3-642-23808-6_39. ISBN 978-3-642-23807-9.
^James Bennett; Stan Lanning (12 August 2007). "The Netflix Prize" (PDF). Proceedings of KDD Cup and Workshop 2007. Archived from the original (PDF) on 27 September 2007. Retrieved 25 August 2007.
^McAuley, Julian; Targett, Christopher; Shi, Qinfeng; Anton van den Hengel (2015). "Image-based Recommendations on Styles and Substitutes". arXiv:1506.04757 [cs.CV].
^Lv, Yuanhua; Lymberopoulos, Dimitrios; Wu, Qiang (2012). "An exploration of ranking heuristics in mobile local search". Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. pp. 295–304. doi:10.1145/2348283.2348325. ISBN 978-1-4503-1472-5.
^Harper, F. Maxwell; Konstan, Joseph A. (2015). "The MovieLens Datasets: History and Context". ACM Transactions on Interactive Intelligent Systems. 5 (4): 19. doi:10.1145/2827872. S2CID 16619709.
^Koenigstein, Noam; Dror, Gideon; Koren, Yehuda (2011). "Yahoo! Music recommendations: Modeling music ratings with temporal dynamics and item taxonomy". Proceedings of the fifth ACM conference on Recommender systems. pp. 165–172. doi:10.1145/2043932.2043964. ISBN 978-1-4503-0683-6.
^McFee, Brian; Bertin-Mahieux, Thierry; Ellis, Daniel P.W.; Lanckriet, Gert R.G. (2012). "The million song dataset challenge". Proceedings of the 21st International Conference on World Wide Web. pp. 909–916. doi:10.1145/2187980.2188222. ISBN 978-1-4503-1230-1.
^Lim, Tjen-Sien; Loh, Wei-Yin; Shih, Yu-Shan (2000). "A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms". Machine Learning. 40 (3): 203–228. doi:10.1023/a:1007608224229. S2CID 17030953.
^Nguyen, Kiet Van; Nguyen, Vu Duc; Nguyen, Phu X. V.; Truong, Tham T. H.; Nguyen, Ngan Luu-Thuy (2018). "UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis". 2018 10th International Conference on Knowledge and Systems Engineering (KSE). pp. 19–24. doi:10.1109/KSE.2018.8573337. ISBN 978-1-5386-6113-0.
^Nhung Thi-Hong Nguyen; Phuong Ha-Dieu Phan; Luan Thanh Nguyen; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen (24 April 2021). "Vietnamese Open-domain Complaint Detection in E-Commerce Websites". arXiv:2104.11969 [cs.CL].
^Phu Gia Hoang; Canh Duc Luu; Khanh Quoc Tran; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen (26 January 2023). "ViHOS: Hate Speech Spans Detection for Vietnamese". arXiv:2301.10186 [cs.CL].
^Dermouche, Mohamed; Velcin, Julien; Khouas, Leila; Loudcher, Sabine (2014). "A Joint Model for Topic-Sentiment Evolution over Time". 2014 IEEE International Conference on Data Mining. IEEE. pp. 773–778. doi:10.1109/icdm.2014.82. ISBN 978-1-4799-4302-9.
^Rose, Tony; Stevenson, Mark; Whitehead, Miles (2002). "The Reuters Corpus Volume 1-from Yesterday's News to Tomorrow's Language Resources". LREC. 2. S2CID 9239414.
^Al-Harbi, S; Almuhareb, A; Al-Thubaity, A; Khorsheed, M. S.; Al-Rajeh, A (2008). "Automatic Arabic Text Classification". Proceedings of the 9th International Conference on the Statistical Analysis of Textual Data, Lyon, France.
^Kossinets, Gueorgi; Kleinberg, Jon; Watts, Duncan (2008). "The Structure of Information Pathways in a Social Communication Network". arXiv:0806.3201 [physics.soc-ph].
^Androutsopoulos, Ion; Koutsias, John; Chandrinos, Konstantinos V.; Paliouras, George; Spyropoulos, Constantine D. (2000). "An evaluation of Naive Bayesian anti-spam filtering". In Potamias, G.; Moustakis, V.; van Someren, M. (eds.). Proceedings of the Workshop on Machine Learning in the New Information Age. 11th European Conference on Machine Learning, Barcelona, Spain. Vol. 11. pp. 9–17. arXiv:cs/0006013. Bibcode:2000cs........6013A.
^Zafarani, Reza, and Huan Liu. "Social computing data repository at ASU." School of Computing, Informatics and Decision Systems Engineering, Arizona State University (2009).
^Abdulla, N., et al. "Arabic sentiment analysis: Corpus-based and lexicon-based." Proceedings of the IEEE conference on Applied Electrical Engineering and Computing Technologies (AEECT). 2013.
^Abooraig, Raddad; Al-Zu'bi, Shadi; Kanan, Tarek; Hawashin, Bilal; Al Ayoub, Mahmoud; Hmeidi, Ismail (June 2018). "Automatic categorization of Arabic articles based on their political orientation". Digital Investigation. 25: 24–41. doi:10.1016/j.diin.2018.04.003.
^Lowe, Ryan; Pow, Nissan; Serban, Iulian; Pineau, Joelle (2015). "The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems". arXiv:1506.08909 [cs.CL].
^Williams, Jason; Raux, Antoine; Henderson, Matthew. "[1]". Dialogue & Discourse. April 2016.
^Hoppe, Travis (16 December 2021), The-Pile-FreeLaw, retrieved 11 January 2023
^K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification", 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 364–371. doi:10.1109/ICMLA.2017.0-134
^K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "Web of Science Dataset", doi:10.17632/9rw3vkcfy4.6
^Galgani, Filippo, Paul Compton, and Achim Hoffmann. "Combining different summarization techniques for legal text." Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data. Association for Computational Linguistics, 2012.
^Schler, Jonathan; et al. (2006). "Effects of Age and Gender on Blogging" (PDF). AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs. 6. Archived from the original (PDF) on 14 November 2020. Retrieved 6 August 2019.
^Anand, Pranav, et al. "Believe Me-We Can Do This! Annotating Persuasive Acts in Blog Text." Computational Models of Natural Argument. 2011.
^Traud, Amanda L., Peter J. Mucha, and Mason A. Porter. "Social structure of Facebook networks." Physica A: Statistical Mechanics and its Applications 391.16 (2012): 4165–4180.
^Richard, Emile; Savalle, Pierre-Andre; Vayatis, Nicolas (2012). "Estimation of Simultaneously Sparse and Low Rank Matrices". arXiv:1206.6474 [cs.DS].
^Weston, Jason; Bordes, Antoine; Chopra, Sumit; Rush, Alexander M.; Bart van Merriënboer; Joulin, Armand; Mikolov, Tomas (2015). "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks". arXiv:1502.05698 [cs.AI].
^Luyckx, Kim; Daelemans, Walter (2008). "Personae: a corpus for author and personality prediction from text". Proceedings of LREC-2008, the Sixth International Language Resources and Evaluation Conference. hdl:10067/687330151162165141. ISBN 978-2-9517408-4-6.
^Baumgartner, Jason; Zannettou, Savvas; Keegan, Brian; Squire, Megan; Blackburn, Jeremy (23 January 2020). "The Pushshift Reddit Dataset". arXiv:2001.08435 [cs.SI].
^Ciarelli, Patrick Marques; Oliveira, Elias (2009). "Agglomeration and Elimination of Terms for Dimensionality Reduction". 2009 Ninth International Conference on Intelligent Systems Design and Applications. pp. 547–552. doi:10.1109/ISDA.2009.9. ISBN 978-1-4244-4735-0.
^Zhou, Mingyuan; Padilla, Oscar Hernan Madrid; Scott, James G. (2 July 2016). "Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes". Journal of the American Statistical Association. 111 (515): 1144–1156. arXiv:1404.3331. doi:10.1080/01621459.2015.1075407.
^Ning, Yue; Muthiah, Sathappan; Rangwala, Huzefa; Ramakrishnan, Naren (2016). "Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning". arXiv:1602.08033 [cs.SI].
^Buza, Krisztian. "Feedback prediction for blogs." Data analysis, machine learning and knowledge discovery. Springer International Publishing, 2014. 145–152.
^Soysal, Ömer M (2015). "Association rule mining with mostly associated sequential patterns". Expert Systems with Applications. 42 (5): 2582–2592. doi:10.1016/j.eswa.2014.10.049.
^Zhu, Yukun, et al. "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books." Proceedings of the IEEE international conference on computer vision. 2015.
^Bowman, Samuel R.; Angeli, Gabor; Potts, Christopher; Manning, Christopher D. (2015). "A large annotated corpus for learning natural language inference". arXiv:1508.05326 [cs.CL].
^Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". arXiv:1804.07461 [cs.CL].
^To, Quoc Huy; Nguyen, Van Kiet; Nguyen, Luu Thuy Ngan; Nguyen, Gia Tuan Anh (2020). "Gender Prediction Based on Vietnamese Names with Machine Learning Techniques". Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval. pp. 55–60. arXiv:2010.10852. doi:10.1145/3443279.3443309. ISBN 9781450377607. S2CID 224814110.
^Nguyen, Luan Thanh; Van Nguyen, Kiet; Nguyen, Ngan Luu-Thuy (18 March 2021). "Constructive and Toxic Speech Detection for Open-Domain Social Media Comments in Vietnamese". Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices. Lecture Notes in Computer Science. Vol. 12798. pp. 572–583. arXiv:2103.10069. doi:10.1007/978-3-030-79457-6_49. ISBN 978-3-030-79456-9. S2CID 232269671.
^Saxton, David, et al. "Analysing Mathematical Reasoning Abilities of Neural Models." International Conference on Learning Representations. 2018.
^M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, A. Jansen, and E. Dupoux (2015). "The Zero Resource Speech Challenge 2015," in INTERSPEECH-2015.
^Sakar, Betul Erdogdu; et al. (2013). "Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings". IEEE Journal of Biomedical and Health Informatics. 17 (4): 828–834. doi:10.1109/jbhi.2013.2245674. PMID 25055311. S2CID 15491516.
^Zhao, Shunan; Rudzicz, Frank; Carvalho, Leonardo G.; Marquez-Chin, Cesar; Livingstone, Steven (2014). "Automatic detection of expressed emotion in Parkinson's Disease". 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 4813–4817. doi:10.1109/ICASSP.2014.6854516. ISBN 978-1-4799-2893-4.
^Hammami, Nacereddine; Bedda, Mouldi (July 2010). "Improved tree model for arabic speech recognition". 2010 3rd International Conference on Computer Science and Information Technology. pp. 521–526. doi:10.1109/ICCSIT.2010.5563892. ISBN 978-1-4244-5537-9.
^Zue, Victor; Seneff, Stephanie; Glass, James (1990). "Speech database development at MIT: TIMIT and beyond". Speech Communication. 9 (4): 351–356. doi:10.1016/0167-6393(90)90010-7.
^Kapadia, S.; Valtchev, V.; Young, S.J. (1993). "MMI training for continuous phoneme recognition on the TIMIT database". IEEE International Conference on Acoustics Speech and Signal Processing. Vol. 2. pp. 491–494. doi:10.1109/ICASSP.1993.319349. ISBN 0-7803-0946-4.
^Ghandoura, Abdulkader; Hjabo, Farouk; Al Dakkak, Oumayma (June 2021). "Building and benchmarking an Arabic Speech Commands dataset for small-footprint keyword spotting". Engineering Applications of Artificial Intelligence. 102: 104267. doi:10.1016/j.engappai.2021.104267.
^Zhou, Fang; Claire, Q.; King, Ross D. (2014). "Predicting the Geographical Origin of Music". 2014 IEEE International Conference on Data Mining. pp. 1115–1120. doi:10.1109/ICDM.2014.73. ISBN 978-1-4799-4302-9.
^Saccenti, Edoardo; Camacho, José (2015). "On the use of the observation-wise k-fold operation in PCA cross-validation". Journal of Chemometrics. 29 (8): 467–478. doi:10.1002/cem.2726. hdl:10481/55302. S2CID 62248957.
^Bertin-Mahieux, Thierry, et al. "The million song dataset." ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, 24–28 October 2011, Miami, Florida. University of Miami, 2011.
^Defferrard, Michaël; Benzi, Kirell; Vandergheynst, Pierre; Bresson, Xavier (6 December 2016). "FMA: A Dataset For Music Analysis". arXiv:1612.01840 [cs.SD].
^Lagrange, Mathieu; Lafay, Grégoire; Rossignol, Mathias; Benetos, Emmanouil; Roebel, Axel (2015). "An evaluation framework for event detection using a morphological model of acoustic scenes". arXiv:1502.00141 [stat.ML].
^Chen, Zesheng; Ji, Chuanyi (2007). "Optimal worm-scanning method using vulnerable-host distributions". International Journal of Security and Networks. 2 (1/2): 71. doi:10.1504/IJSN.2007.012826.
^Kachuee, Mohamad; Kiani, Mohammad Mahdi; Mohammadzade, Hoda; Shabany, Mahdi (2015). "Cuff-less high-accuracy calibration-free blood pressure estimation using pulse transit time". 2015 IEEE International Symposium on Circuits and Systems (ISCAS). pp. 1006–1009. doi:10.1109/ISCAS.2015.7168806. ISBN 978-1-4799-8391-9.
^Goldberger, Ary L.; Amaral, Luis A. N.; Glass, Leon; Hausdorff, Jeffrey M.; Ivanov, Plamen Ch.; Mark, Roger G.; Mietus, Joseph E.; Moody, George B.; Peng, Chung-Kang; Stanley, H. Eugene (13 June 2000). "PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals". Circulation. 101 (23): E215-20. doi:10.1161/01.CIR.101.23.e215. PMID 10851218.
^Korotcenkov, G.; Cho, B. K. (2014). "Engineering approaches to improvement of conductometric gas sensor parameters. Part 2: Decrease of dissipated (consumable) power and improvement stability and reliability". Sensors and Actuators B: Chemical. 198: 316–341. Bibcode:2014SeAcB.198..316K. doi:10.1016/j.snb.2014.03.069.
^Torres-Sospedra, Joaquin, et al. "UJIIndoorLoc-Mag: A new database for magnetic field-based localization problems." Indoor Positioning and Indoor Navigation (IPIN), 2015 International Conference on. IEEE, 2015.
^Theodoridis, Theodoros; Huosheng Hu (2007). "Action classification of 3D human models using dynamic ANNs for mobile robot surveillance". 2007 IEEE International Conference on Robotics and Biomimetics (ROBIO). pp. 371–376. doi:10.1109/ROBIO.2007.4522190. ISBN 978-1-4244-1761-2.
^Etemad, Seyed Ali; Arya, Ali (2009). "3D human action recognition and style transformation using resilient backpropagation neural networks". 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems. pp. 296–301. doi:10.1109/ICICISYS.2009.5357690. ISBN 978-1-4244-4754-1.
^ a b Andrianesis, Konstantinos; Tzes, Anthony (2015). "Development and control of a multifunctional prosthetic hand with shape memory alloy actuators". Journal of Intelligent & Robotic Systems. 78 (2): 257–289. doi:10.1007/s10846-014-0061-6. S2CID 207174078.
^Bacciu, Davide; et al. (2014). "An experimental characterization of reservoir computing in ambient assisted living applications". Neural Computing and Applications. 24 (6): 1451–1464. doi:10.1007/s00521-013-1364-4. hdl:11568/237959. S2CID 14124013.
^Palumbo, Filippo; Barsocchi, Paolo; Gallicchio, Claudio; Chessa, Stefano; Micheli, Alessio (2013). "Multisensor Data Fusion for Activity Recognition Based on Reservoir Computing". Evaluating AAL Systems Through Competitive Benchmarking. Communications in Computer and Information Science. Vol. 386. pp. 24–35. doi:10.1007/978-3-642-41043-7_3. ISBN 978-3-642-41042-0.
^Reiss, Attila; Stricker, Didier (2012). "Introducing a New Benchmarked Dataset for Activity Monitoring". 2012 16th International Symposium on Wearable Computers. pp. 108–109. doi:10.1109/ISWC.2012.13. ISBN 978-0-7695-4697-1.
^Sztyler, Timo; Stuckenschmidt, Heiner (2016). "On-body localization of wearable devices: An investigation of position-aware activity recognition". 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom). pp. 1–9. doi:10.1109/PERCOM.2016.7456521. ISBN 978-1-4673-8779-8.
^Dolatabadi, Elham; Zhi, Ying Xuan; Ye, Bing; Coahran, Marge; Lupinacci, Giorgia; Mihailidis, Alex; Wang, Rosalie; Taati, Babak (2017). "The toronto rehab stroke pose dataset to detect compensation during stroke rehabilitation therapy". Proceedings of the 11th EAI International Conference on Pervasive Computing Technologies for Healthcare. pp. 375–381. doi:10.1145/3154862.3154925. ISBN 978-1-4503-6363-1.
^Jung, Merel M.; Poel, Mannes; Poppe, Ronald; Heylen, Dirk K. J. (March 2017). "Automatic recognition of touch gestures in the corpus of social touch". Journal on Multimodal User Interfaces. 11 (1): 81–96. doi:10.1007/s12193-016-0232-9.
^Aeberhard, S., D. Coomans, and O. De Vel. "Comparison of classifiers in high dimensional settings." Dept. Math. Statist., James Cook Univ., North Queensland, Australia, Tech. Rep 92-02 (1992).
^Kaya, Heysem, Pınar Tüfekci, and Fikret S. Gürgen. "Local and global learning methods for predicting power of a combined gas & steam turbine." International conference on emerging trends in computer and electronics engineering (ICETCEE'2012), Dubai. 2012.
^Ortigosa, I.; Lopez, R.; Garcia, J. "A neural networks approach to residuary resistance of sailing yachts prediction". Proceedings of the International Conference on Marine Engineering MARINE. 2007.
^Gerritsma, J., R. Onnink, and A. Versluis. Geometry, resistance and stability of the delft systematic yacht hull series. Delft University of Technology, 1981.
^Palmer, Christopher R.; Faloutsos, Christos (2003). "Electricity Based External Similarity of Categorical Attributes". Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science. Vol. 2637. pp. 486–500. doi:10.1007/3-540-36175-8_49. ISBN 978-3-540-04760-5.
^Tsanas, Athanasios; Xifara, Angeliki (2012). "Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools". Energy and Buildings. 49: 560–567. Bibcode:2012EneBu..49..560T. doi:10.1016/j.enbuild.2012.03.003.
^De Wilde, Pieter (2014). "The gap between predicted and measured energy performance of buildings: A framework for investigation". Automation in Construction. 41: 40–49. doi:10.1016/j.autcon.2014.02.009.
^Brooks, Thomas F., D. Stuart Pope, and Michael A. Marcolini. Airfoil self-noise and prediction. Vol. 1218. National Aeronautics and Space Administration, Office of Management, Scientific and Technical Information Division, 1989.
^Lavine, Michael (1991). "Problems in extrapolation illustrated with space shuttle O-ring data". Journal of the American Statistical Association. 86 (416): 919–921. doi:10.1080/01621459.1991.10475132.
^Wang, J.; Yu, B.; Gasser, L. (2002). "Concept tree based clustering visualization with shaded similarity matrices". 2002 IEEE International Conference on Data Mining, 2002. Proceedings. pp. 697–700. doi:10.1109/ICDM.2002.1184032. ISBN 0-7695-1754-4.
^Bock, R. K.; et al. (2004). "Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope". Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment. 516 (2): 511–528. Bibcode:2004NIMPA.516..511B. doi:10.1016/j.nima.2003.08.157.
^Sikora, Marek; Sikora, Beata (2012). "Rough Natural Hazards Monitoring". Rough Sets: Selected Methods and Applications in Management and Engineering. Advanced Information and Knowledge Processing. pp. 163–179. doi:10.1007/978-1-4471-2760-4_10. ISBN 978-1-4471-2759-8.
^Yeh, I–C (1998). "Modeling of strength of high-performance concrete using artificial neural networks". Cement and Concrete Research. 28 (12): 1797–1808. doi:10.1016/s0008-8846(98)00165-3.
^Zarandi, MH Fazel; et al. (2008). "Fuzzy polynomial neural networks for approximation of the compressive strength of concrete". Applied Soft Computing. 8 (1): 488–498. Bibcode:2008ApSoC...8...79S. doi:10.1016/j.asoc.2007.02.010.
^Yeh, I. "Modeling slump of concrete with fly ash and superplasticizer." Computers and Concrete 5.6 (2008): 559–572.
^Gencel, Osman; et al. (2011). "Comparison of artificial neural networks and general linear model approaches for the analysis of abrasive wear of concrete". Construction and Building Materials. 25 (8): 3486–3494. doi:10.1016/j.conbuildmat.2011.03.040.
^Buscema, Massimo; Tastle, William J.; Terzi, Stefano (2013). "Meta Net: A New Meta-Classifier Family". Data Mining Applications Using Artificial Adaptive Systems. pp. 141–182. doi:10.1007/978-1-4614-4223-3_5. ISBN 978-1-4614-4222-6.
^Barnard, Amanda; Sun, Baichuan; Motevalli Soumehsaraei, Ben; & Opletal, George (2019): Silver Nanoparticle Data Set. v3. CSIRO. Data Collection. https://doi.org/10.25919/5d22d20bc543e
^Barnard, Amanda; Sun, Baichuan; & Opletal, George (2019): Platinum Nanoparticle Data Set. v2. CSIRO. Data Collection. https://doi.org/10.25919/5d3958d9bf5f7
^Donchin, Emanuel; Spencer, Kevin M.; Wijesinghe, Ranjith (2000). "The mental prosthesis: assessing the speed of a P300-based brain-computer interface". IEEE Transactions on Rehabilitation Engineering. 8 (2): 174–179. doi:10.1109/86.847808. PMID 10896179. S2CID 84043.
^Detrano, Robert; et al. (1989). "International application of a new probability algorithm for the diagnosis of coronary artery disease". The American Journal of Cardiology. 64 (5): 304–310. doi:10.1016/0002-9149(89)90524-9. PMID 2756873.
^Street, W. N.; Wolberg, W. H.; Mangasarian, O. L. (1993). "Nuclear feature extraction for breast tumor diagnosis". In Acharya, Raj S.; Goldgof, Dmitry B. (eds.). Biomedical Image Processing and Biomedical Visualization. Vol. 1905. pp. 861–870. doi:10.1117/12.148698.
^Abuse, Substance. "Mental Health Services Administration, Results from the 2010 National Survey on Drug Use and Health: Summary of National Findings, NSDUH Series H-41, HHS Publication No.(SMA) 11-4658." Rockville, MD: Substance Abuse and Mental Health Services Administration 201 (2011).
^Hong, Zi-Quan; Yang, Jing-Yu (1991). "Optimal discriminant plane for a small number of samples and design method of classifier on the plane". Pattern Recognition. 24 (4): 317–324. Bibcode:1991PatRe..24..317H. doi:10.1016/0031-3203(91)90074-f.
^ a b Li, Jinyan; Wong, Limsoon (2003). "Using Rules to Analyse Bio-medical Data: A Comparison between C4.5 and PCL". Advances in Web-Age Information Management. Lecture Notes in Computer Science. Vol. 2762. pp. 254–265. doi:10.1007/978-3-540-45160-0_25. ISBN 978-3-540-40715-7.
^Fung, Glenn; Dundar, Murat; Bi, Jinbo; Rao, Bharat (2004). "A fast iterative algorithm for fisher discriminant using heterogeneous kernels". In Greiner, Russell; Schuurmans, Dale (eds.). Proceedings of the Twenty-first International Conference on Machine Learning. ACM. p. 40. doi:10.1145/1015330.1015409. ISBN 978-1-58113-838-2.
^Quinlan, J. R.; Compton, P. J.; Horn, K. A.; Lazarus, L. (1987). "Inductive knowledge acquisition: a case study". In Quinlan, John Ross (ed.). Applications of Expert Systems: Based on the Proceedings of the Second Australian Conference. Turing Institute Press. pp. 137–156. ISBN 978-0-201-17449-6.
^ a b Zhi-Hua Zhou; Yuan Jiang (2004). "NeC4.5: Neural ensemble based C4.5". IEEE Transactions on Knowledge and Data Engineering. 16 (6): 770–773. doi:10.1109/tkde.2004.11.
^Er, Orhan; et al. (2012). "An approach based on probabilistic neural network for diagnosis of Mesothelioma's disease". Computers & Electrical Engineering. 38 (1): 75–81. doi:10.1016/j.compeleceng.2011.09.001.
^Er, Orhan; Tanrikulu, A. Çetin; Abakay, Abdurrahman (10 May 2015). "Use of artificial intelligence techniques for diagnosis of malignant pleural mesothelioma". Dicle Medical Journal / Dicle Tip Dergisi. 42 (1). doi:10.5798/diclemedj.0921.2015.01.0520 (inactive 23 November 2024).
^Javadi, Soroush; Mirroshandel, Seyed Abolghasem (June 2019). "A novel deep learning method for automatic assessment of human sperm images". Computers in Biology and Medicine. 109: 182–194. doi:10.1016/j.compbiomed.2019.04.030. PMID 31059902.
^Clark, David, Zoltan Schreter, and Anthony Adams. "A quantitative comparison of dystal and backpropagation." Proceedings of 1996 Australian Conference on Neural Networks. 1996.
^Ontañón, Santiago; Plaza, Enric (2009). "On Similarity Measures Based on a Refinement Lattice". Case-Based Reasoning Research and Development. Lecture Notes in Computer Science. Vol. 5650. pp. 240–255. doi:10.1007/978-3-642-02998-1_18. ISBN 978-3-642-02997-4.
^Cortez, Paulo, and Aníbal de Jesus Raimundo Morais. "A data mining approach to predict forest fires using meteorological data." (2007).
^Farquad, M. A. H.; Ravi, V.; Raju, S. Bapi (2010). "Support vector regression based hybrid rule extraction methods for forecasting". Expert Systems with Applications. 37 (8): 5577–5589. doi:10.1016/j.eswa.2010.02.055.
^Mallah, Charles; Cope, James; Orwell, James (2013). "Plant Leaf Classification using Probabilistic Integration of Shape, Texture and Margin Features". Computer Graphics and Imaging / 798: Signal Processing, Pattern Recognition and Applications. doi:10.2316/P.2013.798-098. ISBN 978-0-88986-944-8.
^Yahiaoui, Itheri; Mzoughi, Olfa; Boujemaa, Nozha (2012). "Leaf Shape Descriptor for Tree Species Identification". 2012 IEEE International Conference on Multimedia and Expo. pp. 254–259. doi:10.1109/ICME.2012.130. ISBN 978-1-4673-1659-0.
^Sanchez, Mauricio A.; et al. (2014). "Fuzzy granular gravitational clustering algorithm for multivariate data". Information Sciences. 279: 498–511. doi:10.1016/j.ins.2014.04.005.
^Blackard, Jock A.; Dean, Denis J. (December 1999). "Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables". Computers and Electronics in Agriculture. 24 (3): 131–151. Bibcode:1999CEAgr..24..131B. doi:10.1016/s0168-1699(99)00046-0.
^Fürnkranz, Johannes (2001). "Round Robin Rule Learning" (PDF). In Danyluk, Andrea Pohoreckyj; Brodley, Carla E. (eds.). Machine Learning: Proceedings of the Eighteenth International Conference (ICML 2001): Williams College, June 28–July 1, 2001. Morgan Kaufmann Publishers. pp. 146–153. ISBN 978-1-55860-778-1.
^Nilsback, Maria-Elena, and Andrew Zisserman. "A visual vocabulary for flower classification." Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. IEEE, 2006.
^Giselsson, Thomas M.; et al. (2017). "A Public Image Database for Benchmark of Plant Seedling Classification Algorithms". arXiv:1711.05458 [cs.CV].
^Rahman, Abdur; Lu, Yuzhen; Wang, Haifeng (February 2023). "Performance evaluation of deep learning object detectors for weed detection for cotton". Smart Agricultural Technology. 3: 100126. doi:10.1016/j.atech.2022.100126.
^Nakai, Kenta; Kanehisa, Minoru (1991). "Expert system for predicting protein localization sites in gram-negative bacteria". Proteins: Structure, Function, and Bioinformatics. 11 (2): 95–110. doi:10.1002/prot.340110203. PMID 1946347. S2CID 27606447.
^Ling, Charles X., et al. "Decision trees with minimal costs." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
^Mahé, Pierre; Arsac, Maud; Chatellier, Sonia; Monnin, Valérie; Perrot, Nadine; Mailler, Sandrine; Girard, Victoria; Ramjeet, Mahendrasingh; Surre, Jérémy; Lacroix, Bruno; van Belkum, Alex; Veyrieras, Jean-Baptiste (May 2014). "Automatic identification of mixed bacterial species fingerprints in a MALDI-TOF mass-spectrum". Bioinformatics. 30 (9): 1280–1286. doi:10.1093/bioinformatics/btu022. PMID 24443381.
^Campos, Guilherme O.; Zimek, Arthur; Sander, Jörg; Campello, Ricardo J. G. B.; Micenková, Barbora; Schubert, Erich; Assent, Ira; Houle, Michael E. (July 2016). "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study". Data Mining and Knowledge Discovery. 30 (4): 891–927. doi:10.1007/s10618-015-0444-8.
^Zampieri, Marcos; Malmasi, Shervin; Nakov, Preslav; Rosenthal, Sara; Farra, Noura; Kumar, Ritesh (16 April 2019). "Predicting the Type and Target of Offensive Posts in Social Media". arXiv:1902.09666 [cs.CL].
^Harshaw, Christopher R.; Bridges, Robert A.; Iannacone, Michael D.; Reed, Joel W.; Goodall, John R. (5 April 2016). "GraphPrints". Proceedings of the 11th Annual Cyber and Information Security Research Conference. CISRC '16. New York, NY, USA: Association for Computing Machinery. pp. 1–4. doi:10.1145/2897795.2897806. ISBN 978-1-4503-3752-6.
^Mehra, Srishti; Louka, Robert; Zhang, Yixun (2022). "ESGBERT: Language Model to Help with Classification Tasks Related to Companies' Environmental, Social, and Governance Practices". Embedded Systems and Applications. pp. 183–190. doi:10.5121/csit.2022.120616. ISBN 978-1-925953-65-7.
^ This article incorporates text available under the CC BY 4.0 license.
^Diggelmann, Thomas; Boyd-Graber, Jordan; Bulian, Jannis; Ciaramita, Massimiliano; Leippold, Markus (2 January 2021). "CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims". arXiv:2012.00614 [cs.CL].
^Brown, Michael Scott; Pelosi, Michael J.; Dirska, Henry (2013). "Dynamic-Radius Species-Conserving Genetic Algorithm for the Financial Forecasting of Dow Jones Index Stocks". Machine Learning and Data Mining in Pattern Recognition. Lecture Notes in Computer Science. Vol. 7988. pp. 27–41. doi:10.1007/978-3-642-39712-7_3. ISBN 978-3-642-39711-0.
^Shen, Kao-Yi; Tzeng, Gwo-Hshiung (2015). "Fuzzy Inference-Enhanced VC-DRSA Model for Technical Analysis: Investment Decision Aid". International Journal of Fuzzy Systems. 17 (3): 375–389. doi:10.1007/s40815-015-0058-8. S2CID 68241024.
^Shmueli, Galit; Russo, Ralph P.; Jank, Wolfgang (December 2007). "The BARISTA: A model for bid arrivals in online auctions". The Annals of Applied Statistics. 1 (2). doi:10.1214/07-AOAS117.
^Peng, Jie; Müller, Hans-Georg (September 2008). "Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions". The Annals of Applied Statistics. 2 (3). doi:10.1214/08-AOAS172.
^Eggermont, Jeroen; Kok, Joost N.; Kosters, Walter A. (2004). "Genetic Programming for data classification: Partitioning the search space". Proceedings of the 2004 ACM symposium on Applied computing. pp. 1001–1005. doi:10.1145/967900.968104. ISBN 978-1-58113-812-2.
^Payne, Richard D.; Mallick, Bani K. (2014). "Bayesian Big Data Classification: A Review with Complements". arXiv:1411.5653 [stat.ME].
^Akbilgic, Oguz; Bozdogan, Hamparsum; Balaban, M. Erdal (2014). "A novel Hybrid RBF Neural Networks model as a forecaster". Statistics and Computing. 24 (3): 365–375. doi:10.1007/s11222-013-9375-7. S2CID 17764829.
^Jabin, Suraiya (20 August 2014). "Stock Market Prediction using Feed-forward Artificial Neural Network". International Journal of Computer Applications. 99 (9): 4–8. Bibcode:2014IJCA...99i...4J. doi:10.5120/17399-7959.
^Yeh, I-Cheng; Lien, Che-hui (2009). "The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients". Expert Systems with Applications. 36 (2): 2473–2480. doi:10.1016/j.eswa.2007.12.020. S2CID 15696161.
^Lin, Shu Ling (2009). "A new two-stage hybrid approach of credit risk in banking industry". Expert Systems with Applications. 36 (4): 8333–8341. doi:10.1016/j.eswa.2008.10.015.
^Xu, Yumo; Cohen, Shay B. (2018). "Stock Movement Prediction from Tweets and Historical Prices". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1970–1979. doi:10.18653/v1/P18-1183.
^Pelckmans, Kristiaan; et al. (2005). "The differogram: Non-parametric noise variance estimation and its use for model selection". Neurocomputing. 69 (1): 100–122. doi:10.1016/j.neucom.2005.02.015.
^Bay, Stephen D.; Kibler, Dennis; Pazzani, Michael J.; Smyth, Padhraic (December 2000). "The UCI KDD archive of large data sets for data mining research and experimentation". ACM SIGKDD Explorations Newsletter. 2 (2): 81–85. doi:10.1145/380995.381030.
^Pales, Jack C.; Keeling, Charles D. (1965). "The concentration of atmospheric carbon dioxide in Hawaii". Journal of Geophysical Research. 70 (24): 6053–6076. Bibcode:1965JGR....70.6053P. doi:10.1029/jz070i024p06053.
^Sigillito, Vincent G., et al. "Classification of radar returns from the ionosphere using neural networks." Johns Hopkins APL Technical Digest 10.3 (1989): 262–266.
^Zhang, Kun; Fan, Wei (March 2008). "Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond". Knowledge and Information Systems. 14 (3): 299–326. doi:10.1007/s10115-007-0095-1.
^Kohavi, Ron (1996). "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". KDD. 96.
^Oza, Nikunj C., and Stuart Russell. "Experimental comparisons of online and batch versions of bagging and boosting." Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001.
^Bay, Stephen D. (November 2001). "Multivariate Discretization for Set Mining". Knowledge and Information Systems. 3 (4): 491–512. doi:10.1007/pl00011680.
^Ruggles, Steven (1995). "Sample designs and sampling errors". Historical Methods. 28 (1): 40–46. doi:10.1080/01615440.1995.9955312.
^Zhan, Xianyuan; et al. (2013). "Urban link travel time estimation using large-scale taxi data with partial information". Transportation Research Part C: Emerging Technologies. 33: 37–49. Bibcode:2013TRPC...33...37Z. doi:10.1016/j.trc.2013.04.001.
^Hwang, Ren-Hung; Hsueh, Yu-Ling; Chen, Yu-Ting (2015). "An effective taxi recommender system based on a spatio-temporal factor analysis model". Information Sciences. 314: 28–40. doi:10.1016/j.ins.2015.03.068.
^Jagadish, H. V.; Gehrke, Johannes; Labrinidis, Alexandros; Papakonstantinou, Yannis; Patel, Jignesh M.; Ramakrishnan, Raghu; Shahabi, Cyrus (July 2014). "Big data and its technical challenges". Communications of the ACM. 57 (7): 86–94.
^Kushmerick, Nicholas (1999). "Learning to remove Internet advertisements". Proceedings of the third annual conference on Autonomous Agents. pp. 175–181. doi:10.1145/301136.301186. ISBN 978-1-58113-066-9.
^Fradkin, Dmitriy; Madigan, David (2003). "Experiments with random projections for machine learning". Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 517–522. doi:10.1145/956750.956812. ISBN 978-1-58113-737-8.
^This data was used in the American Statistical Association Statistical Graphics and Computing Sections 1999 Data Exposition.
^Ma, Justin; Saul, Lawrence K.; Savage, Stefan; Voelker, Geoffrey M. (2009). "Identifying suspicious URLs: An application of large-scale online learning". Proceedings of the 26th Annual International Conference on Machine Learning. pp. 681–688. doi:10.1145/1553374.1553462. ISBN 978-1-60558-516-1.
^Levchenko, K.; Pitsillidis, A.; Chachra, N.; Enright, B.; Felegyhazi, M.; Grier, C.; Halvorson, T.; Kanich, C.; Kreibich, C.; Liu, He; McCoy, D.; Weaver, N.; Paxson, V.; Voelker, G. M.; Savage, S. (2011). "Click Trajectories: End-to-End Analysis of the Spam Value Chain". 2011 IEEE Symposium on Security and Privacy. pp. 431–446. doi:10.1109/SP.2011.24. ISBN 978-0-7695-4402-1.
^Singh, Ashishkumar; Rumantir, Grace; South, Annie; Bethwaite, Blair (2014). "Clustering Experiments on Big Transaction Data for Market Segmentation". Proceedings of the 2014 International Conference on Big Data Science and Computing. pp. 1–7. doi:10.1145/2640087.2644161. ISBN 978-1-4503-2891-3.
^Bollacker, Kurt; Evans, Colin; Paritosh, Praveen; Sturge, Tim; Taylor, Jamie (2008). "Freebase: A collaboratively created graph database for structuring human knowledge". Proceedings of the 2008 ACM SIGMOD international conference on Management of data. pp. 1247–1250. doi:10.1145/1376616.1376746. ISBN 978-1-60558-102-6.
^Mintz, Mike, et al. "Distant supervision for relation extraction without labeled data." Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2. Association for Computational Linguistics, 2009.
^Mesterharm, Chris; Pazzani, Michael J. (2011). "Active learning using on-line algorithms". Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 850–858. doi:10.1145/2020408.2020553. ISBN 978-1-4503-0813-7.
^"The Pile". pile.eleuther.ai. Retrieved 14 April 2022.
^"JSON Lines". jsonlines.org. Retrieved 14 April 2022.
^Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv:2101.00027 [cs.CL].
^"OSCAR". oscar-project.org. Retrieved 12 August 2023.
^Ortiz Suarez, Pedro, et al. "Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures." CMLC-7, 2019.
^Abadji, Julien, et al. "Towards a Cleaner Document-Oriented Multilingual Crawled Corpus." LREC, 2022.
^Cohen, Vanya. "OpenWebTextCorpus". OpenWebTextCorpus. Retrieved 9 January 2023.
^Belsley, David A., Edwin Kuh, and Roy E. Welsch. Regression diagnostics: Identifying influential data and sources of collinearity. Vol. 571. John Wiley & Sons, 2005.
^Li, Lihong; Chu, Wei; Langford, John; Wang, Xuanhui (2011). "Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms". Proceedings of the fourth ACM international conference on Web search and data mining. pp. 297–306. arXiv:1003.5956. doi:10.1145/1935826.1935878. ISBN 978-1-4503-0493-1.
^Yeung, Kam Fung; Yang, Yanyan (2010). "A Proactive Personalized Mobile News Recommendation System". 2010 Developments in E-systems Engineering. pp. 207–212. doi:10.1109/DeSE.2010.40. ISBN 978-1-4244-8044-9.
^Gass, Susan E.; Roberts, J. Murray (2006). "The occurrence of the cold-water coral Lophelia pertusa (Scleractinia) on oil and gas platforms in the North Sea: colony growth, recruitment and environmental controls on distribution". Marine Pollution Bulletin. 52 (5): 549–559. Bibcode:2006MarPB..52..549G. doi:10.1016/j.marpolbul.2005.10.002. PMID 16300800.
^Obradovic, Zoran, and Slobodan Vucetic. Challenges in Scientific Data Mining: Heterogeneous, Biased, and Large Samples. Technical Report, Center for Information Science and Technology, Temple University, 2004.
^Van Der Putten, Peter; van Someren, Maarten (2000). "CoIL challenge 2000: The insurance company case". Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report. 9: 1–43.
^Mao, K. Z. (2002). "RBF neural network center selection based on Fisher ratio class separability measure". IEEE Transactions on Neural Networks. 13 (5): 1211–1217. doi:10.1109/tnn.2002.1031953. PMID 18244518.
^Lizotte, Daniel J.; Madani, Omid; Greiner, Russell (2012). "Budgeted Learning of Naive-Bayes Classifiers". arXiv:1212.2472 [cs.LG].
^Lebowitz, Michael (1984). Concept Learning in a Rich Input Domain: Generalization-Based Memory (Report). doi:10.7916/D8KP8990.
^Yeh, I-Cheng; Yang, King-Jang; Ting, Tao-Ming (2009). "Knowledge discovery on RFM model using Bernoulli sequence". Expert Systems with Applications. 36 (3): 5866–5871. doi:10.1016/j.eswa.2008.07.018.
^Sariyar, Murat; Borg, Andreas; Pommerening, Klaus (2011). "Controlling false match rates in record linkage using extreme value theory". Journal of Biomedical Informatics. 44 (4): 648–654. doi:10.1016/j.jbi.2011.02.008. PMID 21352952.
^Candillier, Laurent; Lemaire, Vincent (August 2013). "Active learning in the real-world design and analysis of the Nomao challenge". The 2013 International Joint Conference on Neural Networks (IJCNN). Vol. 8. pp. 1–8. doi:10.1109/IJCNN.2013.6706908. ISBN 978-1-4673-6129-3.