This project considers a number of methods for instance/example selection in training data for language models, with the most promising being experimented on and evaluated via hypothesis testing. The most successful, an expansion of the perplexity-based work of Robert C. Moore, was selected for further development due to its strong test results and its ability to locate related sentences. A number of filtering methods were produced to improve the performance and results of this method. Each of these filters was tested, yielding reductions in data size of between 2.6% and 75%. The best-performing filter, which reduced the data by 57%, was then selected, and after some fine-tuning a combination of it and the original method was tested to gauge its full capabilities. The results show that the combined methods form a scalable solution to the problem, producing datasets with on average 48% lower perplexity than a baseline approach. The additional optimization features were shown to reduce running time by between 50% and 60%.

Acknowledgements

Many thanks to my supervisor Miles Osborne for his advice and guidance, and to my colleagues, whose opinions helped me gain a full perspective on my work. Also to my proof readers for dealing with countless unnecessary commas.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.