Text mining uses statistical and machine learning techniques to learn the structural elements of text so that useful information can be found in previously unseen text. The need for these techniques has emerged from the rapid growth of the information era. Token identification is an important component of any text mining tool, and doing it well improves a wide range of applications that search for patterns in textual data. Several identification methods have been reported in the literature. Hidden Markov models (HMMs) and prediction by partial matching (PPM) models have been used successfully in language processing tasks, and each has been applied separately to learning-based token identification. Most existing systems are domain- and language-dependent. In this thesis, we implement a system that bridges these two well-known methods through words that are new to the identification model. The system is fully domain- and language-independent: no code changes are needed to apply it to other domains or languages, and the only requirement is an annotated corpus. The system has been tested on two corpora and achieved an overall F-measure of 76.59% for TCC and 69.02% for BIB. This is not as good as would be expected from a system that includes language-dependent components, but our system is more general. Date identification gives the best result, with 73% and 92% of tokens correctly identified for TCC and BIB respectively. The system also performs reasonably well on people's names, with 68% of tokens correctly identified for TCC and 76% for BIB.

Acknowledgements

During my MPhil study, I have been lucky to receive a great deal of academic, financial and personal help from a number of people. First and foremost, I would like to thank my chief supervisor, Ian Witte...