*by Hugo Herrero Antón de Vez, MD*

Recently, I have begun studying machine learning (ML) algorithms for medical applications. I have applied some of them at a very simple level (regression, neural networks (NN), SVM, k-means…).

These algorithms are especially valuable as a brute-force approximation to problems that are difficult to solve analytically. My first impression is that, on many occasions, it may be more worthwhile to think a little and program a few concise rules to obtain the result we want than to use an ML algorithm to find a solution.

I must add that using ML is not as simple as it might seem a priori. First, we must choose the variables carefully and devise a model that approximates our problem, at the risk that our hypothesis ends up overfitting or underfitting (high variance and high bias, respectively). Next, we must train the parameters of our hypothesis on one piece of the sample (the training set); this involves the cost function and the gradient descent calculation (do we want batch or stochastic gradient descent?) and, in the case of NN, backpropagation to compute the derivatives of the cost function. Then we must evaluate the resulting hypothesis on another piece of the sample (the cross-validation set) to check how well it generalizes, and analyze (for example, by plotting the learning curve) how we could improve our approach: getting more examples, regularizing, adding or removing variables, reducing dimensions (eliminating redundant or highly correlated variables using principal component analysis), and so on. Finally, we must test our improved hypothesis on yet another piece of the sample (the test set).
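The workflow above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the data are synthetic, the hypothesis is a one-variable linear model, the 60/20/20 split is an arbitrary choice, and the helper names (`split_dataset`, `batch_gradient_descent`, `mse`) are hypothetical.

```python
import random

def split_dataset(rows, train=0.6, cv=0.2, seed=0):
    """Shuffle rows and split them into training, cross-validation and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_cv = int(len(rows) * cv)
    return rows[:n_train], rows[n_train:n_train + n_cv], rows[n_train + n_cv:]

def mse(theta, rows):
    """Mean squared error of the hypothesis h(x) = theta0 + theta1 * x."""
    return sum((theta[0] + theta[1] * x - y) ** 2 for x, y in rows) / len(rows)

def batch_gradient_descent(rows, alpha=0.5, steps=2000):
    """Fit theta0 and theta1 by batch gradient descent on the squared-error cost."""
    t0, t1 = 0.0, 0.0
    m = len(rows)
    for _ in range(steps):
        # Each step uses the whole training set (batch, not stochastic).
        g0 = sum(t0 + t1 * x - y for x, y in rows) / m
        g1 = sum((t0 + t1 * x - y) * x for x, y in rows) / m
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1
    return t0, t1

# Synthetic data (assumption): y = 2x + 1 plus small Gaussian noise, x in [0, 1).
rng = random.Random(42)
data = [(i / 100, 2 * (i / 100) + 1 + rng.gauss(0, 0.05)) for i in range(100)]

train_set, cv_set, test_set = split_dataset(data)
theta = batch_gradient_descent(train_set)

train_error = mse(theta, train_set)  # error on the data used for fitting
cv_error = mse(theta, cv_set)        # used to compare models and tune choices
test_error = mse(theta, test_set)    # looked at only once, at the end
```

Comparing `train_error` with `cv_error` is the basic diagnostic: both high suggests high bias, while a low training error with a much higher cross-validation error suggests high variance.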

Throughout this process we have to remember that a low training error is not a good predictor of the success of our hypothesis. We also have to take into account “skewed classes”, where one class is far more frequent than the other in our sample (for example, many more positive than negative examples), so that a low error rate by itself can be misleading. We can therefore say that it is a rather complex process.
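A small sketch shows why a low error rate can mislead when classes are skewed. The sample is hypothetical (1,000 patients, 10 of them positive), and the same reasoning applies whichever class dominates. A "classifier" that always predicts the majority class is compared using accuracy, precision, recall and the F1 score.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical skewed sample: 10 positive cases out of 1,000.
y_true = [1] * 10 + [0] * 990

# A trivial model that always predicts the majority (negative) class.
always_negative = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_negative)
# accuracy is 0.99, yet recall is 0.0: not a single positive case is detected.
```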

In addition, if we want to build a system that applies these algorithms in real time, I think we should first address a basic issue to avoid problems in the future: the structure of the database on which we are going to work.

We must bear in mind that the quantitative data provided by the radiological image are much more valuable when they are related to the clinical data of the patient (this integration is known as the Patient-Specific Model). Therefore, we should have a database (DB) that allows us to easily search and update the existing relationships between the different variables that our model integrates. Remember that the value of our data is proportional to the number of meaningful relationships we can find.

Nowadays we usually work with relational databases (RDB). These become especially complex and rigid when we want to express relationships, particularly complex ones (for example, when a query must obtain information from two or more related tables through multilevel joins). Incidentally, they are also difficult to scale horizontally.
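To make the contrast concrete, here is a minimal sketch. The schema (`patient`, `study`, `finding`) and its contents are invented for illustration, and a plain Python dictionary stands in for a graph store; it is not a real graph database. The relational query needs one join per hop, while the graph form simply follows edges.

```python
import sqlite3

# Hypothetical relational schema: patient -> study -> finding.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE study   (id INTEGER PRIMARY KEY,
                      patient_id INTEGER REFERENCES patient(id), modality TEXT);
CREATE TABLE finding (id INTEGER PRIMARY KEY,
                      study_id INTEGER REFERENCES study(id), label TEXT);
INSERT INTO patient VALUES (1, 'patient-A'), (2, 'patient-B');
INSERT INTO study   VALUES (10, 1, 'CT'), (11, 2, 'MRI');
INSERT INTO finding VALUES (100, 10, 'nodule'), (101, 11, 'edema');
""")

# Relational form: two joins are needed to go patient -> study -> finding.
rows = conn.execute("""
    SELECT p.name, s.modality, f.label
    FROM patient p
    JOIN study s   ON s.patient_id = p.id
    JOIN finding f ON f.study_id = s.id
    WHERE p.name = 'patient-A'
""").fetchall()

# Graph form: the same relationships stored as edges and followed hop by hop.
edges = {
    "patient-A": ["study-10"],
    "patient-B": ["study-11"],
    "study-10": ["finding-nodule"],
    "study-11": ["finding-edema"],
}

def reachable(graph, start):
    """Return every node reachable from start by following edges (depth-first)."""
    seen, stack = [], [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.append(nxt)
                stack.append(nxt)
    return seen

linked = reachable(edges, "patient-A")
```

Adding a deeper relationship (say, a measurement attached to a finding) would require another table and another join in the relational form, whereas the traversal function stays unchanged.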

I think one approach worth assessing would be to structure patient data in graph databases (GDB). A GDB can model the complexity of the dependencies between the data that we see in real life. In addition, most machine learning algorithms work much better with vectorized implementations, and such implementations make it much easier to use libraries that process data in parallel. Since a graph can be represented as an adjacency (affinity) matrix, it shares the same structure as vectorized ML applications (incidentally, a function on a graph can be represented as a vector).
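As a rough illustration of the matrix view, the sketch below builds the adjacency matrix of a tiny, hypothetical patient graph and applies it to vectors. The node names are invented, and the pure-Python `matvec` helper is only a stand-in for the single vectorized call that an array library would provide.

```python
# Hypothetical undirected graph linking a patient to studies and findings.
nodes = ["patient", "ct_study", "nodule", "smoking_history"]
edges = [
    ("patient", "ct_study"),
    ("ct_study", "nodule"),
    ("patient", "smoking_history"),
]

# Build the adjacency matrix A, where A[i][j] = 1 if nodes i and j are linked.
index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)
A = [[0] * n for _ in range(n)]
for a, b in edges:
    A[index[a]][index[b]] = 1
    A[index[b]][index[a]] = 1  # undirected: the matrix is symmetric

def matvec(matrix, vector):
    """Matrix-vector product written out explicitly."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Multiplying A by a vector of ones gives the degree of every node.
degree = matvec(A, [1] * n)

# A function on the graph is just a vector with one value per node.
# Here f is 1 on "nodule" and 0 elsewhere; A applied to f sums f over neighbours.
f = [0.0, 0.0, 1.0, 0.0]
neighbour_sum = matvec(A, f)
```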

Another advantage of a GDB is that the same environment serves both to analyze information and to perform searches in an efficient and even compartmentalized way, even while keeping the patient information encrypted.

While relational DBs are optimized for data aggregation, GDBs are optimized for data connections.

Enabling seamless interaction between graph analysis and ML in a single environment or language allows us to take advantage of powerful graph algorithms to complement the ML process with graph metrics calculated as predictor variables.
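One way this could look in practice is sketched below. Everything here is an assumption made for illustration: the patient-to-patient graph (for example, links between similar cases), the clinical features `[age, smoker]`, and the choice of degree centrality and clustering coefficient as metrics. The idea is only that values computed on the graph are appended to each patient's feature vector before training.

```python
# Hypothetical undirected graph of patients (e.g. edges link similar cases).
graph = {
    "p1": {"p2", "p3"},
    "p2": {"p1", "p3"},
    "p3": {"p1", "p2", "p4"},
    "p4": {"p3"},
}

# Hypothetical clinical features per patient: [age, smoker (0/1)].
clinical = {"p1": [63, 1], "p2": [58, 0], "p3": [71, 1], "p4": [45, 0]}

def degree_centrality(g, node):
    """Fraction of the other nodes that this node is directly connected to."""
    return len(g[node]) / (len(g) - 1)

def clustering_coefficient(g, node):
    """Fraction of pairs of neighbours that are themselves connected."""
    nbrs = sorted(g[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in g[nbrs[i]]
    )
    return 2 * links / (k * (k - 1))

# Append the graph metrics to the clinical features as extra predictors.
features = {
    node: clinical[node]
    + [degree_centrality(graph, node), clustering_coefficient(graph, node)]
    for node in graph
}
```

Each row of `features` can then be fed to any of the algorithms mentioned earlier, exactly like a row built from clinical variables alone.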

In my opinion, ML is currently used for very specific applications, but until the integration between these algorithms and the databases is dealt with in a standardized way, I am afraid that the full potential of this promising tool will not be realized.
