
Nexosis @ Work & Play

How Much Data is Needed to Train a (Good) Model?

August 4, 2017 @ 2:42 PM | Technical



We've gotten some questions recently about how much data is needed to train a good time series model. While there is no "one-size-fits-all" approach, there are some general best practices to follow and questions to ask about your data before building a model. Ryan elaborates below.


How much data do you need to train a model? Arguably, only a single data point. Is this model likely to make accurate predictions? Probably not. So how much data is necessary to train a decent model that will generalize well, i.e. make accurate predictions?

A typical image classification problem could require tens of thousands of images or more in order to create a classifier. Sentiment analysis or document classification could require thousands of examples due to the sheer number of words and phrases, i.e., n-grams. For many regression problems, it’s suggested that you have 10x as many observations as you do features. A more general rule of thumb is that the number of observations should be proportional to 1/d^p, where p is the number of features and d is the maximum spacing between consecutive or neighboring data points after each feature is scaled to the range 0-1 (yikes!).
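To make those two rules of thumb a little less scary, here is a minimal Python sketch. The function names are made up for illustration, and it assumes d is the largest gap found in any single feature after scaling. Because the rule only says "proportional to", the second function returns the scaling factor 1/d^p, not an actual observation count.

```python
def ten_x_rule(n_features, factor=10):
    """Regression rule of thumb: about 10 observations per feature."""
    return factor * n_features


def max_scaled_gap(values):
    """Largest gap between neighboring values after scaling to the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant feature cannot be scaled to 0-1")
    scaled = sorted((v - lo) / (hi - lo) for v in values)
    return max(b - a for a, b in zip(scaled, scaled[1:]))


def spacing_scale(rows):
    """Return 1 / d**p for a dataset given as a list of equal-length rows."""
    columns = list(zip(*rows))   # one tuple of values per feature
    p = len(columns)             # number of features
    # Assumption: d is the largest neighboring gap across all features.
    d = max(max_scaled_gap(col) for col in columns)
    return 1 / d ** p


rows = [(0, 10), (1, 20), (2, 30), (4, 40)]
print(ten_x_rule(n_features=2))  # 20
print(spacing_scale(rows))       # 4.0
```

Note how quickly the factor grows with p: the same spacing of 0.5 across ten features would give 1/0.5^10 = 1024.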


Let's talk time series forecasting

Time series forecasting also tends to follow the same general mathematical rule of thumb, but it’s certainly not the most intuitive. Additionally, the equation is not helpful when the time series has no explicit additional features, which is pretty common in time series forecasting. So when you have just a single data feed you want to forecast, what rule of thumb should you follow to make sure you have enough data? From a data science perspective, a decent model should always have more observations in the time series than parameters. For most time series applications, this means that the submitted data should have at least as many observations as the period of the maximum expected seasonality.

For example, if you have daily sales data and you expect that it exhibits annual seasonality, you should have more than 365 data points to train a successful model. If you have hourly data and you expect your data exhibits weekly seasonality, you should have more than 7*24 = 168 observations to train a model.
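The arithmetic in those examples can be sketched in a few lines of Python. The helper name `seasonal_period` is hypothetical; it simply divides the length of the longest expected season by the spacing between observations.

```python
from datetime import timedelta
from math import ceil


def seasonal_period(granularity, seasonality):
    """Number of observations that span one full seasonal cycle."""
    if granularity <= timedelta(0):
        raise ValueError("granularity must be a positive duration")
    # Dividing two timedeltas yields a plain float.
    return ceil(seasonality / granularity)


# Daily data with annual seasonality.
print(seasonal_period(timedelta(days=1), timedelta(days=365)))   # 365
# Hourly data with weekly seasonality.
print(seasonal_period(timedelta(hours=1), timedelta(weeks=1)))   # 168
```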


However, this is the bare minimum number of points needed to train these types of models. More data is required if you want to effectively test how accurately your model performs at making predictions (see Training Set vs. Test Set for a quick review!). Your test set should be about 25% of the size of your training set. So with a dataset that is expected to exhibit annual seasonality, the minimum number of points required to train and test multiple models is 365 + 365/4 ≈ 456 observations.
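Here is a rough sketch of that calculation, plus one way to carve out the test set. Both function names are invented for illustration, and holding out the most recent observations for testing is an assumption on our part, though it is a common choice for time series because it keeps future data out of training.

```python
def minimum_observations(seasonal_period, test_ratio=0.25):
    """Training points for one full season plus a test set sized relative to it."""
    test_size = round(seasonal_period * test_ratio)
    return seasonal_period + test_size


def chronological_split(series, test_ratio=0.25):
    """Split so the test set is about test_ratio times the training set.

    Assumption: the series is ordered oldest to newest, and the newest
    points are the ones held out for testing.
    """
    train_size = round(len(series) / (1 + test_ratio))
    return series[:train_size], series[train_size:]


print(minimum_observations(365))  # 456 (daily data, annual seasonality)
print(minimum_observations(168))  # 210 (hourly data, weekly seasonality)

train, test = chronological_split(list(range(456)))
print(len(train), len(test))      # 365 91
```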


There's still no "Golden Rule"

Of course, these are rough generalizations, not golden rules to follow for every time series problem. Depending on how correlated different variables are, you may require more or less data. However, to avoid some fatal mistakes, you should ask a few questions about your data before jumping in and building a forecasting model.

  • What’s the granularity of my data, e.g., seconds, hours, years? A year’s worth of data can imply 365 data points, 52 data points, 12 data points, or even a single data point depending on how the data was recorded, and all are equally valid.
  • What are my underlying assumptions about my data? If you expect your data is annually seasonal, make sure you have at least 365 days, 52 weeks, or 12 months of data, plus some additional data points for testing. Note how important the granularity of data is in this scenario.
  • How far out am I trying to predict? If you’re trying to predict 12 months into the future, you should have at least 12 months’ worth of data (a data point for every month) to train on before you can expect to have trustworthy results.
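The three questions above can be turned into a quick sanity check before any modeling starts. This is only a sketch: `review_history` is a hypothetical helper, and it assumes the granularity can be inferred as the median gap between timestamps, which will not suit irregularly sampled data.

```python
from datetime import datetime, timedelta
from math import ceil
from statistics import median


def review_history(timestamps, expected_season, horizon):
    """Compare the available history against the season and forecast horizon."""
    stamps = sorted(timestamps)
    if len(stamps) < 2:
        raise ValueError("need at least two timestamps to infer granularity")

    # Question 1: granularity, taken as the median gap between observations.
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    step = timedelta(seconds=median(gaps))
    if step <= timedelta(0):
        raise ValueError("timestamps must not all be identical")

    # Question 2: observations needed to cover one expected season.
    per_season = ceil(expected_season / step)
    # Question 3: observations needed to cover the forecast horizon.
    per_horizon = ceil(horizon / step)

    warnings = []
    if len(stamps) < per_season:
        warnings.append("history is shorter than one expected season")
    if len(stamps) < per_horizon:
        warnings.append("history is shorter than the forecast horizon")
    return {
        "granularity": step,
        "per_season": per_season,
        "per_horizon": per_horizon,
        "warnings": warnings,
    }


# 200 days of daily data, expected annual seasonality, 30-day forecast.
start = datetime(2017, 1, 1)
stamps = [start + timedelta(days=i) for i in range(200)]
report = review_history(stamps, timedelta(days=365), timedelta(days=30))
print(report["warnings"])
```

With 200 daily points, the check flags the missing season but not the horizon. Remember that the season count is a floor for training only; testing needs additional points on top of it.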

Modeling is a tricky task that really should be left to the experts. Our API simplifies it for you so you can easily power your application with the best machine learning model. 




Ryan West

Ryan is one of our machine learning engineers. In addition to being the unofficial face of Nexosis, he spends his days building and testing models, fine-tuning algorithms, and generally being a nice guy.