Sunday, April 22, 2012

What is data mining, and does it encourage the creation of a specific kind of history?



Kurt Thearling defines data mining in its simplest form as 'the process of efficient discovery of nonobvious valuable patterns from a large collection of data.'[1] His definition was aimed at businesses. While the definition remains apt, academics, historians and scholars expand the notion further by describing data mining as the ''mining of knowledge' as distinct from the 'retrieval of data.''[2] This consensus extends beyond historians and adds depth to the process of data mining. It marks the difference between methodologies such as 'keyword' searches, which retrieve a specific piece of data (a word), and approaches such as the Semantic Web, which surface an implied meaning within the results.[3] Data mining can be summarised in three steps: classification, clustering and regression, and the different methodologies within each are demonstrated below. In addition, methods such as Ngrams and topic modeling are used to evaluate how mined data is presented, and how historians interpret it. Finally, the question of whether analysing data online creates a different kind of history is developed further on.


      
Data mining is the extraction of relevant information from a large corpus of data, which is then used to formulate a conclusion. However, the complexity of the process varies. One can organise data into tables and charts and from there draw a conclusion based on a pattern in the information given. This is feasible with small amounts of data, yet when dealing with Big Data the data sets are too large and too complex for one person to analyse. Data mining accomplishes this by using algorithms to find consistent patterns and compare them with inconsistent data; this is done in marketing, usually to make a prediction.[4] In addition, data mining draws on databases that are not apparent to the user, because they provide a larger variety of information. When variables are associated with each other, predictions can be made which either contradict or confirm the user's theory. Data mining can form different links between different variables. Acting on some of these can be risky, because the user does not know how the mining process arrived at its conclusion. The data is presented in different ways, and visual presentation is a key part of that: a graph, table or image helps the user understand the relations more coherently. Data mining consists of many tools which analyse information from different perspectives. It is used particularly to condense large databases which span different fields.[5] These tools process data in order to make it usable.

Classification is the first step in data mining; it comprises different techniques for separating data depending on the variables. The tree method is notably used to draw out relations between these variables. When using the tree method, the facts gathered from a range of databases are divided into categories. These are divided further until each group contains only a few factors, creating many branches in the process. A limit is needed to prevent 'overfitting', where categories divide repeatedly until as little as one factor remains in each; this is one of the dangers and disadvantages of the decision tree methodology. For analytical evaluation the tree primarily highlights key variables.[6] This means it is quick and easy to locate variables that branched out first, showing their immediate importance in the relation, rather than those further down on the second or third branch. Interestingly, Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID) are decision tree techniques that accept unclassified data when creating new predictions.[7] Another type of classification is the Naïve Bayesian, which is broadly reliable and deals specifically with textual data.[8] It differs from the decision tree in one fundamental way: the Naïve Bayesian is based on conditional probability. Instead of categorising words based on their association with each other, the Bayesian system relies on the number of times a word appears or does not appear in the text. The extent to which this type of text mining is useful for historians or researchers is debatable. The algorithms would return more results when searching for a word, yet the relevance of the search may not always be practical, because words may have more than one significant connotation, meaning some results will be unrelated to the question. Matthew Kirschenbaum applied this kind of analysis to poems written by Emily Dickinson, using the Naïve Bayesian method to test whether the system could detect an erotic poem against scholarly judgements. He argued that if the technology proved effective, it would confirm the information scholars already possessed.[9] This may be the case, but it does not produce new and exciting relationships to debate, or spark a new outlook to be discussed. It might, however, give readers more confidence in the method when evaluating other poems or literature in this way.
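To make the Naïve Bayesian idea concrete, the short sketch below (in Python, using the scikit-learn library) trains a word-count classifier on a handful of invented example texts and labels. It is an illustration of the general technique only, not a reconstruction of Kirschenbaum's actual experiment or data; the labels and test sentence are assumptions made for the example.

```python
# A minimal Naive Bayes text-classification sketch (scikit-learn).
# The tiny training "corpus" and labels are invented placeholders,
# not the Dickinson data discussed above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "wild nights wild nights were I with thee",
    "the carriage held but just ourselves",
    "hope is the thing with feathers",
    "because I could not stop for death",
]
train_labels = ["erotic", "not-erotic", "not-erotic", "not-erotic"]  # hypothetical tags

# Bag-of-words counts: the classifier only sees how often each word occurs,
# which is exactly the word-frequency basis described above.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

model = MultinomialNB()
model.fit(X_train, train_labels)

# Classify a new, unseen text.
X_new = vectorizer.transform(["night and thee and wild desire"])
print(model.predict(X_new))  # e.g. ['erotic']
```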

David Blei's paper on topic modeling caused quite a commotion. With the rise of its popularity, people applied the methodology with ease yet had not quite grasped the concept.[10] In one of Blei's lectures, topic modeling is described as related to the Bayesian system, but as a 'hierarchical Bayesian system', which provides the most relevant information first. Essentially, topic modeling means selecting topics from a body of data; Blei's example was Wikipedia. Then, by selecting a file, you can connect the topics, which means 'annotating' the file with the use of algorithms to locate different topics.[11] This is equivalent to the classification step in data mining, but achieved by a different method. Once topics are located, there are different ways in which they can be presented. For example, topics changing over time can be depicted through a graph, similar to an Ngram; to see connections or relations between topics, branching would assist. Finally, images can also be annotated by the algorithms, or gridded, so that an image is treated like a document. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are forms of topic modeling and, by association, of classification in data mining. LDA helps associate words with a topic, but the topic itself is not named.
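As a rough illustration of what LDA does with a corpus, the sketch below fits a two-topic model to four invented sentences using scikit-learn; as noted above, the topics come back as unnamed word lists that the researcher still has to interpret. The documents and the choice of two topics are assumptions made purely for the example, and real topic modelling needs far more text.

```python
# A toy Latent Dirichlet Allocation (LDA) sketch with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "parliament passed the act and the vote was recorded",
    "the harvest failed and grain prices rose in the market",
    "the vote in parliament divided the members",
    "merchants traded grain and cloth at the market",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Ask for two unnamed topics; LDA only returns word distributions,
# and the historian still has to decide what each topic "means".
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [words[j] for j in topic.argsort()[-4:]]
    print(f"topic {i}: {top_words}")
```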

Secondly, clustering techniques are used to group data to detect a possible outcome. This step causes further complications within data mining, because the results need to be grouped logically before being analysed.[12] K-means is commonly used due to its manageability; its main feature is the K centroids, one present in each cluster. The position of each centroid is vital during the process: as the centroids move from their initial locations they generate different results, and keeping a fair distance between centroids optimises those results. Once the centroids are in place, data points are assigned to the closest centroid; after each pass new centroids are computed from the assigned data, and they continue to move until they reach a final position. The algorithm works by 'minimizing an objective function.'[13] Instead of K-means, search engines use TF*IDF in order to return the documents most significant to the input data (the question) first. Historians have credited the Inverse Document Frequency (IDF) for its experimental nature; when it is combined with the term frequency (TF), it advances 'text retrieval into methods for retrieval of other media, and into language processing techniques for other purposes.'[14] This proves to be a good way of correlating data to provide maximum results.
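A minimal sketch of the clustering step described above, combining TF*IDF weighting with K-means in scikit-learn. The four toy 'documents' and the choice of two clusters are invented for illustration; a real corpus would need far more data before the clusters meant anything.

```python
# Sketch: cluster short documents with K-means over TF*IDF weights.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "trial at the old bailey for theft of silver spoons",
    "verdict of guilty and sentence of transportation",
    "price of bread and wages of london labourers",
    "cost of coal and household expenditure",
]

# TF*IDF weights rare-but-telling words more heavily than common ones.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# K-means places k centroids and moves them until assignments settle,
# minimising the objective function mentioned above.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)  # e.g. [0 0 1 1] (cluster numbers are arbitrary)
```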



Lastly, regression in data mining usually uses mathematical formulas; these algorithms determine the prediction or outcome of the results and establish connections between them.[15] Regression is used for quantitative, numerical data, yet textual data can also be drawn out and used effectively for conclusions and interpretations through the Semantic Web. Within the Semantic Web there are web agents; one of these tools, called Armadillo, exhibits its findings in a Resource Description Framework (RDF). Despite this useful tool, humanities resources are generally equipped to function without it. That does not mean they are equipped to deal with the 'black box' problem in data mining.[16] The 'black box' problem arises when some output data does not correspond with the input data and thus presents unsatisfactory results. In some ways it is similar to the Bayesian system in the earlier classification step, where the output is on some occasions not relevant to the input. This is especially impractical for historians.
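The sketch below shows the regression idea in its simplest form: fitting a straight line to numerical data and using the resulting formula to extrapolate a prediction. The year-and-count figures are made up for illustration and do not come from any of the sources cited here.

```python
# A minimal linear-regression sketch: fit a line and predict a future value.
import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([[1800], [1820], [1840], [1860], [1880]])
mentions = np.array([12, 18, 25, 31, 40])  # hypothetical counts of a term per decade

model = LinearRegression()
model.fit(years, mentions)

# The fitted slope and intercept are the "mathematical formula"
# that the regression step uses to make its prediction.
print(model.coef_[0], model.intercept_)
print(model.predict(np.array([[1900]])))  # predicted count for 1900
```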


Historian Ben Schmidt created an Ngram on Google, but like most historians he found it difficult to extract much information from it. Ngrams present data in a graph format. The relationship between the two axes, producing a linear sequence of data, can be interpreted visually. From the rise and fall of the lines on the graph obvious correlations can be seen, but further interpretation is quite difficult to deduce.[17] Nevertheless, more than one variable at a time can be analysed, and the data can be displayed as a direct comparison or similarity depending on the lines representing each word on the graph. For these reasons Ngrams would not be entirely useful for historians, as no further analysis is made. The horizontal axis always displays the year; therefore, in searching a specific question, for example 'when was there depression in America?', the words 'Depression' cross-referenced against 'America' would be analysed as separate entities, counting how many times each was mentioned rather than the significance the words have together. Matthew Hurst analysed Ngrams from a language perspective: he accumulated data on the words themselves rather than correlating them visually. He compared words across different versions of English, American English and British English, to see how they had changed over the years, which was hardly at all. Subsequently, he compared the same word beginning with a capital letter and with a lower-case letter; from the results he gathered that the capitalised forms were used at the start of sentences and appeared more often than the lower-case ones. Hurst clearly enjoyed drawing these kinds of conclusions and seemed quite excited about the topic.[18]
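For readers wondering what an Ngram line actually measures, the small sketch below computes the relative frequency of one word per year over a tiny, invented corpus; each point on a Google Ngram line is essentially this calculation carried out over millions of digitised books. The sample sentences and years are placeholders, not real corpus data.

```python
# Sketch of the count behind an Ngram graph: relative frequency of a word per year.
from collections import Counter

corpus = {
    1930: "the depression deepened and america struggled through the depression",
    1950: "america prospered and the economy grew",
}

for year, text in corpus.items():
    tokens = text.split()
    counts = Counter(tokens)
    # Frequency of "depression" relative to all words that year,
    # which is what each point on an Ngram line represents.
    print(year, counts["depression"] / len(tokens))
```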




It can be argued that digitising history has ushered in a new era in the way information is researched. Stephen Ramsay credits this in his appraisal of big data, but he states that it would nonetheless further encourage traditional humanist research. Methodologies in data mining assist historians by categorising information and producing new, perhaps unthought-of perspectives on it. Furthermore, Tim Hitchcock and some of his colleagues consider historians who discredit this type of history to be rather old-fashioned figures, isolated within the archives.[19] Intriguingly, this view of historians may have stemmed from the debate of books versus online books and journals: whether reading a physical book is better than reading text off a screen, and vice versa. In relation to data mining, it could be argued that digital history is easier and more precise for research. On the other hand, books were the original form of information, and they cannot be replaced when it comes to the personal interaction and sentiment one feels with the document. This is a further point to discuss; everyone has their own preference, but what can be noted is that more and more data is being uploaded.

To conclude, data mining is a mixture of classification, clustering and regression. Classification organises the data, which can be done through a number of methods; for text mining in particular, the decision tree or the Naïve Bayesian would be appropriate. When the data is passed on to clustering it is grouped through systems such as K-means, and lastly, algorithms use mathematical methods during regression to correlate the data for future predictions. By comparison, topic modeling has been credited as a well-known form of classification, and it has the potential to develop the way we perceive data and to form new correlations between it. Ngrams may be considered a step towards opening new questions, yet most scholars would say they are limited in what information can be extracted from them. Furthermore, this is the case for some historians; military historians would argue that 'this ontology does not represent their knowledge.'[20] This shows that categorised data cannot replace the deeper meaning that historians derive when analysing documents, rather than simply forming parallels.



[1] A Data Mining Glossary, http://www.thearling.com/glossary.htm; consulted 15 April 2012.
[2] Fabio Ciravegna, Mark Greengrass, Tim Hitchcock, Sam Chapman, Jamie McLaughlin and Ravish Bhagdev, 'Finding Needles in the Haystacks: Data-mining in Distributed Historical Datasets', in Mark Greengrass and Lorna Hughes (eds), The Virtual Representation of the Past (Surrey, 2008), p. 66.
[3] Ciravegna et al., 'Finding Needles in the Haystacks', pp. 65-67.
[4] Data Mining Techniques, http://www.obgyn.cam.ac.uk/cam-only/statsbook/stdatmin.html; consulted 15 April 2012.
[5] Data Mining: What is Data Mining?, http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm; consulted 14 April 2012.
[6] Data Mining for Process Improvement, http://www.crosstalkonline.org/storage/issue-archives/2011/201101/201101-Below.pdf; consulted 15 April 2012.
[7] Data Mining: What is Data Mining?, http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm.
[8] New Methods for Humanities Research, http://people.lis.illinois.edu/~unsworth/lyman.htm; consulted 15 April 2012.
[9] New Methods for Humanities Research, http://people.lis.illinois.edu/~unsworth/lyman.htm; consulted 15 April 2012.
[10] Topic Modeling and Network Analysis, http://www.scottbot.net/HIAL/?p=221; consulted 18 April 2012.
[11] Topic Models, http://videolectures.net/mlss09uk_blei_tm/; consulted 18 April 2012.
[12] New Methods for Humanities Research, http://people.lis.illinois.edu/~unsworth/lyman.htm.
[13] New Methods for Humanities Research, http://people.lis.illinois.edu/~unsworth/lyman.htm.
[14] Stephen Robertson, 'Understanding Inverse Document Frequency: On Theoretical Arguments for IDF', Journal of Documentation, 60, no. 5, pp. 503-520.
[15] Introduction to Data Mining, http://www.youtube.com/watch?v=_QH4oIOd9nc; consulted 13 April 2012.
[16] Ciravegna et al., 'Finding Needles in the Haystacks', pp. 67-78.
[17] Sapping Attention, http://sappingattention.blogspot.co.uk/; consulted 18 April 2012.
[19] With Criminal Intent, http://criminalintent.org/; consulted 18 April 2012.
[20] Ciravegna et al., 'Finding Needles in the Haystacks', p. 72.


Creative Commons Licence
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.