{"id":537,"date":"2025-01-07T15:42:39","date_gmt":"2025-01-07T15:42:39","guid":{"rendered":"https:\/\/statisticalbiophysicsblog.org\/?p=537"},"modified":"2025-01-07T15:42:39","modified_gmt":"2025-01-07T15:42:39","slug":"machine-learning-for-outsiders-1-very-basics","status":"publish","type":"post","link":"https:\/\/statisticalbiophysicsblog.org\/?p=537","title":{"rendered":"Machine Learning for Outsiders 1 &#8211; Very basics"},"content":{"rendered":"<p>I want to introduce machine learning (ML) to people outside the field, both non-mathematical scientists totally new to machine learning and quantitative folks who are novices.  My qualifications for this are that I\u2019m an outsider to ML myself, maybe an \u201cadvanced beginner\u201d \u2013 with several years of experience.  As I have learned ML, I try to keep an eye out for what\u2019s important and what isn\u2019t.<br \/>\n<!--more--><br \/>\nOne thing that\u2019s definitely not important for beginners is the equations, so we will keep those to a minimum.<\/p>\n<p><strong>Jargon<\/strong> is a bit of a different story, as we need to be comfortable interpreting and potentially designing ML studies.  So we want to embrace the essential jargon, but of course with clear definitions.  We want to train ourselves not to be intimidated by the jargon.<\/p>\n<p><strong>What is machine learning?<\/strong><\/p>\n<p>Let\u2019s start with the most obvious jargon, ML itself.  The \u2018machine\u2019 of course is just the computer and \u2018learning\u2019 is a stretch term.  Computers themselves know nothing, even if it seems they do, especially in the artificial intelligence era.  
Computers only follow instructions, i.e., computer programs or code.<\/p>\n<p>Put the M and L together and we merely mean a computer program that takes in data and fits some <strong>parameters<\/strong> (numerical constants optimized for the data) of a \u201c<strong>model<\/strong>.\u201d  In ML, a model is just the set of equations chosen by the person doing the programming.<\/p>\n<p>Let\u2019s make it concrete.  It\u2019s a good idea to construct a simple example in your mind, or on paper, or on a computer if that\u2019s natural for you.  There\u2019s little chance to learn a new and complicated thing in the abstract.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"639\" height=\"536\" alt=\"A graph of a graph with black dots and red text\n\nDescription automatically generated\" src=\"https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-graph-of-a-graph-with-black-dots-and-red-text-d.png\" class=\"wp-image-539\" srcset=\"https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-graph-of-a-graph-with-black-dots-and-red-text-d.png 639w, https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-graph-of-a-graph-with-black-dots-and-red-text-d-300x252.png 300w\" sizes=\"auto, (max-width: 639px) 100vw, 639px\" \/><\/p>\n<p>The simplest example is <strong>linear fitting<\/strong>, as shown in the figure.  For now, this just means fitting a line through points, as you\u2019ve likely seen before.  It\u2019s the simplest form of machine \u2018learning\u2019 because you can use the equation of the line to <strong>predict<\/strong> a y value for any new x value you\u2019re given.<\/p>\n<p>It\u2019s the ability to make a prediction, based on learning (aka <strong>training<\/strong>) from data, that makes it ML.  
Easy as pie.<\/p>\n<p>For example, imagine we want to understand the relationship between the length and weight of a certain species of turtle, based on a whole bunch of measurements of each quantity (from a whole bunch of turtles).  We can choose to treat the length as our single \u2018<strong>independent variable<\/strong>\u2019 and so for each turtle, the length will be the x value and the weight will be the y value, i.e., the <strong>dependent variable<\/strong>.  In a scatter plot of (x, y) points, each point would represent one turtle.  Always keep track of what each point in a figure represents.<\/p>\n<p>In ML we build models for our dependent variable(s) \u2013 what we want to be able to predict \u2013 based on the chosen independent variables, also called <strong>features<\/strong>.<\/p>\n<p>This is called <strong>supervised ML<\/strong> because we are using known y values (e.g., turtle weights) to train the parameters of the model.  For now, we will focus only on supervised ML.  For reference, <strong>unsupervised ML<\/strong> typically involves clustering data into groups or something similar.<\/p>\n<p>Thus far, we have been talking about <strong>one-dimensional (1D) data<\/strong>, by which we mean just a bunch of numbers as x values, as opposed to vectors.  A higher-dimensional approach could build a model for the weight of a turtle based on more than one \u2018feature,\u2019 say the length and height of the shell.  
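To make this concrete in code, here is a minimal sketch of supervised 1D regression. The turtle measurements are invented for illustration (not real data), and the fit uses the standard closed-form least-squares formulas for a line rather than anything fancy.

```python
# Invented turtle measurements (not real data): x = shell length (cm),
# y = weight (kg). Each (length, weight) pair is one turtle.
lengths = [10.0, 12.0, 15.0, 18.0, 20.0]   # independent variable (feature)
weights = [0.8, 1.1, 1.6, 2.1, 2.4]        # dependent variable (what we predict)

def fit_line(xs, ys):
    """Fit y = a*x + b by the standard closed-form least-squares formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

a, b = fit_line(lengths, weights)

def predict(x):
    """The 'learned' parameters let us predict the weight of a new turtle."""
    return a * x + b

print(predict(16.0))  # predicted weight for a 16 cm turtle
```

The point is simply that, once trained, the model answers questions about turtles it has never seen.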
Visualizing this data would require a three-dimensional plot, but we don\u2019t need to pursue that now.<\/p>\n<p>It&#8217;s also useful to know that any kind of data fitting based on continuous x and y values is called <strong>regression<\/strong>.<\/p>\n<p><strong>How does machine learning work?<\/strong><\/p>\n<p>Supervised machine learning is just fitting data, such as the linear model we just described.<\/p>\n<p>The linear model assumes that y is simply a multiple of x, possibly with an additive constant: y = ax + b, where a and b are the parameters of the model.<\/p>\n<p>The essence of the process is not hard to understand.  We can simply try every possible combination of the a and b parameters, and then choose the pair that gives the best fit.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1290\" height=\"548\" alt=\"A diagram of data space\n\nDescription automatically generated\" src=\"https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-diagram-of-data-space-description-automatically.png\" class=\"wp-image-540\" srcset=\"https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-diagram-of-data-space-description-automatically.png 1290w, https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-diagram-of-data-space-description-automatically-300x127.png 300w, https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-diagram-of-data-space-description-automatically-1024x435.png 1024w, https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-diagram-of-data-space-description-automatically-768x326.png 768w, https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-diagram-of-data-space-description-automatically-788x335.png 788w\" sizes=\"auto, (max-width: 1290px) 100vw, 1290px\" \/><\/p>\n<p>This procedure would be a <strong>grid search<\/strong> because we are effectively creating a plane with a and b as axes (instead of x and y), upon which we can place a 
grid of evenly spaced values, i.e., (a, b) pairs, to try. This new plane is sometimes called a <strong>parameter space<\/strong> to distinguish it from the space of (x, y) data values.<\/p>\n<p>How do we evaluate each pair of (a, b) values?  We do so using a <strong>loss function<\/strong>, which is a mathematical expression that quantifies our idea of goodness of fit.  (I stress the word idea because one can invent different loss functions for different purposes, and this is a key but little-discussed point for more advanced practitioners.  I hope to return to this point in a future post.)<\/p>\n<p>The most common loss function measures the average squared (vertical) distance of points from the line.  That is, for every point, draw a vertical line up or down to the candidate fit line \u2013 which is based on candidate (a, b) values \u2013 and measure that distance, then square it, and average over all such distances.  See the red lines in the first figure, above.<\/p>\n<p>In sum, machine learning for this case means writing a computer program to evaluate the loss function for every (a, b) pair on the grid, and to take the pair with the minimum \u201closs\u201d as the best fit.  Then for any new value of x, the function y = ax + b with the best (a, b) values yields a prediction for y.<\/p>\n<p>If you understand that, consider yourself a machine learner!<\/p>\n<p><strong>Say it in jargon: Optimization in parameter space<\/strong><\/p>\n<p>We already mentioned the plane of (a, b) values as a parameter space.  By doing a grid search over this space \u2013 to find the best (a, b) pair which minimizes our choice of loss function \u2013 we are performing what\u2019s called an <strong>optimization<\/strong>.  No big deal, it\u2019s just finding the smallest value.<\/p>\n<p><strong>What model should I use?<\/strong><\/p>\n<p>We have illustrated the essentials of machine learning by a linear fit.  But there\u2019s no rule of science that says fits have to be linear.  
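The whole recipe just described, a grid search over (a, b) parameter space scored by the mean-squared loss, fits in a short program. The numbers here are invented for illustration, generated roughly as y = 2x + 1 with a little noise added:

```python
# Sketch of the grid-search procedure: try evenly spaced (a, b) pairs and
# score each candidate line with the mean-squared-error loss function.
# The data are invented, roughly y = 2x + 1 with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

def loss(a, b):
    """Mean squared vertical distance of the points from the line y = a*x + b."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# An evenly spaced grid of candidate (a, b) pairs: a in [0, 4], b in [-2, 2],
# both in steps of 0.1. This grid is the 'parameter space' we search.
candidates = [(ia / 10, ib / 10) for ia in range(0, 41) for ib in range(-20, 21)]

# The 'optimization': keep the pair with the smallest loss.
best_a, best_b = min(candidates, key=lambda ab: loss(ab[0], ab[1]))
print(best_a, best_b)  # lands near a = 2, b = 1 for this data
```

In practice linear fits are done with closed-form formulas rather than a grid search, but the grid version shows the logic that carries over to models where no formula exists.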
In the case of turtles, we would not expect the weight to vary linearly with length, but rather with a higher power of length.  (A na\u00efve guess would be the 3<sup>rd<\/sup> power, because weight should be proportional to volume, which in turn should vary with the cube of the linear dimension.)<\/p>\n<p>You could imagine trying more complicated models, such as the quadratic y = ax<sup>2<\/sup> + bx + c.  These more complicated models are certain to fit the data as well as or better than the simpler model.  This is true even if the underlying process actually was linear because, with noise, the data won\u2019t come out perfectly on a line, and thus the more complicated quadratic function is almost certain to provide a better fit.<\/p>\n<p>But a better fit to the training data is not always desirable, as we\u2019ll discuss.<\/p>\n<p><strong>The overfitting problem and test\/validation data<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"841\" height=\"625\" alt=\"A diagram of a graph\n\nDescription automatically generated\" src=\"https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-diagram-of-a-graph-description-automatically-ge.png\" class=\"wp-image-541\" srcset=\"https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-diagram-of-a-graph-description-automatically-ge.png 841w, https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-diagram-of-a-graph-description-automatically-ge-300x223.png 300w, https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-diagram-of-a-graph-description-automatically-ge-768x571.png 768w, https:\/\/statisticalbiophysicsblog.org\/wp-content\/uploads\/2025\/01\/a-diagram-of-a-graph-description-automatically-ge-788x586.png 788w\" sizes=\"auto, (max-width: 841px) 100vw, 841px\" \/><\/p>\n<p>The overfitting issue, in my view, is the single most important thing to know about machine learning.  
It teaches us about the limitations of ML, but also points the way toward improving models.<\/p>\n<p>When we build a machine learning model, such as one for the weight of turtles based on their length, the data used to fit the parameters is called the <strong>training data<\/strong>.  It is critical to appreciate that the quality of the model cannot be evaluated based on the quality of the fit to the training data.  Even if there is zero error in the fit, the model could be terrible in the sense of not being predictive for new data.<\/p>\n<p>An ML model must be evaluated on a set of <strong>test data<\/strong> (sometimes called <strong>validation data<\/strong>, although we won\u2019t go into the nuances here) which has not been used in the training of the model.  That is, the test data must not have been used to fit the parameters, such as a and b in the linear example.<\/p>\n<p>The figure above (which is made up and not for turtles) suggests that a quadratic model, y = ax<sup>2<\/sup> + bx + c, was used to fit the training data in black, while the red test data suggests the trend is actually linear.  Thus the black curve overfits the training data because it poorly describes independent data from the same system.<\/p>\n<p><strong>Overfitting<\/strong> describes the situation where an overly complex model was used to fit the training data, such that the model is a poor fit to new, test data.<\/p>\n<p>Hence, any time you see a machine learning study, make sure you check that suitable, independent testing data was used.  We will discuss some nuances of validation in a future post.<\/p>\n<p><strong>Summary<\/strong><\/p>\n<p><strong>Supervised ML<\/strong>, known as <strong>regression<\/strong> for continuous variables (regular numbers), entails fitting <strong>parameters<\/strong>, namely, the numerical constants in the equations chosen by the practitioner.  Parameters are chosen to minimize the error in the fit.  Testing on independent data is essential to check for <strong>overfitting<\/strong>.  
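A tiny sketch can make the train/test logic concrete. With invented, roughly linear data, we can compare a fitted line against a model that simply memorizes the training points (an extreme overfit): the memorizer achieves zero training error, yet loses badly on the held-out test data.

```python
# Invented data from a roughly linear process (about y = 2x + 1 plus noise),
# split into training data and held-out test data.
train = [(0.0, 1.2), (1.0, 2.8), (2.0, 5.3), (3.0, 6.9), (4.0, 9.2)]
test = [(0.5, 2.1), (1.5, 3.9), (2.5, 6.1), (3.5, 8.0)]

def mse(model, data):
    """Mean squared error of a model's predictions on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Model 1: a line fit to the training data by closed-form least squares.
xs = [x for x, _ in train]
mean_x = sum(xs) / len(xs)
mean_y = sum(y for _, y in train) / len(train)
a = (sum((x - mean_x) * (y - mean_y) for x, y in train)
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

def line(x):
    return a * x + b

# Model 2: 'memorize' the training set -- return the y value of the nearest
# training x. An extreme overfit: perfect on training data by construction.
def memorize(x):
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

print(mse(memorize, train))                   # 0.0, a 'perfect' training fit
print(mse(line, train))                       # small but nonzero
print(mse(line, test) < mse(memorize, test))  # True: the line generalizes better
```

The memorizing model stands in for any overly complex model (such as the quadratic in the figure): judged on training data alone it looks unbeatable, and only the test data reveals the problem.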
If the model does not provide a good fit for the test data, it\u2019s likely a simpler model should be used.  A future post will explore the important issue of constructing simple enough models.<\/p>\n<p>It may seem ironic that I am emphasizing the importance of simpler models in the era of deep neural networks and artificial intelligence.  The fact is that doing good science with ML requires \u201cright-sized\u201d models, not necessarily the model with the most intimidating name.  Stay tuned.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I want to introduce machine learning (ML) to people outside the field, both non-mathematical scientists totally new to machine learning and quantitative folks who are novices. My qualifications for this are that I\u2019m an outsider to ML myself, maybe an \u201cadvanced beginner\u201d \u2013 with several years of experience. As I have learned ML, I try [&hellip;]<\/p>\n","protected":false},"author":6,"featured_media":538,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[35],"tags":[],"class_list":["post-537","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning"],"_links":{"self":[{"href":"https:\/\/statisticalbiophysicsblog.org\/index.php?rest_route=\/wp\/v2\/posts\/537","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/statisticalbiophysicsblog.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/statisticalbiophysicsblog.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/statisticalbiophysicsblog.org\/index.php?rest_route=\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/statisticalbiophysicsblog.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=537"}],"version-history":[{"count":2,"href":"https:\/\/statisticalbiophysicsblog.org\/index.php?rest_route=\/wp\/v2\/posts\/537\/rev
isions"}],"predecessor-version":[{"id":543,"href":"https:\/\/statisticalbiophysicsblog.org\/index.php?rest_route=\/wp\/v2\/posts\/537\/revisions\/543"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/statisticalbiophysicsblog.org\/index.php?rest_route=\/wp\/v2\/media\/538"}],"wp:attachment":[{"href":"https:\/\/statisticalbiophysicsblog.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=537"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/statisticalbiophysicsblog.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=537"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/statisticalbiophysicsblog.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=537"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}