Machine learning algorithms can be an interesting way to reach conclusions in your thesis. However, things can very easily become overly complicated. To keep you from getting stuck on code, data, and programming environments, here are a few tips and tricks.
Firstly, make sure you understand what it takes to start programming in a given language. If you do not have any experience yet, Python is probably the best way to go. Before committing to a language, it is advisable to find code similar to what you intend to build and to replicate it. This is a great way to practice and to determine whether you can actually work with the language. There are plenty of libraries and environments in which this can be done, for example TensorFlow. For my thesis, I used TensorFlow because it is simple to use and has a large amount of documentation available. A great source of code examples is GitHub.
Secondly, verify that the machine learning algorithm you intend to use can actually produce the output you need. For example, determine whether you want a categorical or a continuous variable as output, and choose the appropriate technique: classification for the former, regression for the latter.
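As a rough illustration of that choice, the sketch below inspects a target variable and suggests a task type. The function name `suggest_task`, the `max_classes` threshold, and the sample values are all hypothetical and only serve the example; the real decision should follow from your research question, not from a heuristic like this one.

```python
def suggest_task(target_values, max_classes=10):
    """Suggest 'classification' or 'regression' for a target column.

    Heuristic (an assumption, not a rule): non-numeric targets, or numeric
    targets with only a few distinct whole-number values, count as categories.
    """
    distinct = set(target_values)
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if not all_numeric:
        return "classification"
    all_whole = all(float(v).is_integer() for v in distinct)
    if all_whole and len(distinct) <= max_classes:
        return "classification"
    return "regression"


# Made-up targets for illustration only.
bankrupt_labels = ["yes", "no", "no", "yes"]   # categorical output
stock_returns = [0.031, -0.012, 0.104, 0.007]  # continuous output

print(suggest_task(bankrupt_labels))
print(suggest_task(stock_returns))
```

A check like this is mainly useful early on, to confirm that the technique you have in mind matches the kind of output your thesis needs.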
Thirdly, and probably most importantly, make sure that the data you want to use is available, and that there is enough of it. When using data from financial statements, Wharton Research Data Services (WRDS) may be a good way to go. You can get free access to this database through Tilburg University. Moreover, carefully consider how much time you need to prepare the data. For example, combining data from companies in many different industries, or from many different years, can be painful to clean and align. Therefore, try to be as consistent as possible in collecting your data. Start by making a framework of the (meta)data you need, such as company information, the dates on which data was collected, etc. If you find later in your research that you lack some data, this framework makes it relatively easy to repair your mistakes.
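One possible way to set up such a (meta)data framework is sketched below. The class `ObservationMetadata`, its fields, and the sample records are hypothetical; which fields you need depends on your own research design.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ObservationMetadata:
    """Metadata stored alongside each collected observation (hypothetical fields)."""
    company_id: str
    industry: Optional[str]
    fiscal_year: Optional[int]
    source: Optional[str]
    collected_on: Optional[str]  # ISO date string, e.g. "2020-03-15"


def missing_fields(record):
    """Return the names of metadata fields that are still empty (None)."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Made-up records for illustration only.
records = [
    ObservationMetadata("C001", "Retail", 2018, "financial statements", "2020-03-15"),
    ObservationMetadata("C002", None, 2018, "financial statements", None),
]

for record in records:
    gaps = missing_fields(record)
    if gaps:
        print(f"{record.company_id} is missing: {', '.join(gaps)}")
```

Writing the fields down once, in one place, makes it easy to see which observations are incomplete before you start training anything.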
Fourthly, developing machine learning algorithms and being able to explain what actually happens inside them requires some statistical and mathematical knowledge. Make sure you read up on what the algorithm does, or make sure you have access to the right people to explain it to you. There are quite a few standardized packages for, for example, neural networks (such as scikit-learn). However, you may still be asked to explain what happens inside the neural network.
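To give an idea of what a packaged neural network computes, here is a minimal sketch of a forward pass through a tiny network, written by hand. It assumes sigmoid activations and uses hand-picked, hypothetical weights; a real package would learn the weights from data and may use different activation functions.

```python
import math


def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then sigmoid."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(weighted_sum))
    return outputs


# Hypothetical, hand-picked numbers; a trained network would learn these.
features = [0.5, -1.0]                      # two input features
hidden_weights = [[0.4, -0.6], [0.3, 0.8]]  # one row of weights per hidden neuron
hidden_biases = [0.0, 0.1]
output_weights = [[1.2, -0.7]]              # a single output neuron
output_biases = [0.05]

hidden = dense_layer(features, hidden_weights, hidden_biases)
prediction = dense_layer(hidden, output_weights, output_biases)[0]

print(f"hidden activations: {hidden}")
print(f"predicted probability: {prediction:.3f}")
```

If you can walk someone through these few lines (weighted sums, biases, activation), you are in a much better position to explain what a library does for you at a larger scale.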
Lastly, when you compare your algorithm to other algorithms, it is important that you measure the performance of your model in the same way the other models were measured. Otherwise, your comparison may be inaccurate. The choice of measure also matters: for example, the F1 score is generally considered a better measure of a model's ability to discriminate than, for example, the hit ratio, particularly when one class is much rarer than the other. Reading up on the meaning of these measures and choosing the appropriate one may be fundamental to your research.
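The difference between these two measures can be shown with a small, made-up example. The sketch below takes "hit ratio" to mean the share of correct predictions (an assumption about the term) and computes F1 from precision and recall for the positive class; the labels and predictions are invented for illustration.

```python
def hit_ratio(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented, imbalanced sample: 2 positive cases among 20 observations.
y_true = [1, 1] + [0] * 18
always_negative = [0] * 20              # never predicts the positive class
some_model = [1, 0] + [0] * 17 + [1]    # finds one positive, raises one false alarm

for name, y_pred in [("always negative", always_negative), ("some model", some_model)]:
    print(f"{name}: hit ratio = {hit_ratio(y_true, y_pred):.2f}, "
          f"F1 = {f1_score(y_true, y_pred):.2f}")
```

Both sets of predictions get the same hit ratio here, yet only one of them ever identifies a positive case, which is what the F1 score picks up.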
Good luck on your thesis!
This letter is part of the collection “letter to the younger self” and has been written to help the “new generation of students” learn from those who were there before. You can see all the letters at the following link: