How to Frame a Machine Learning Problem | How to Plan a Data Science Project Effectively

If you are a junior data scientist on a team working on an important project, you will probably be assigned a small part of that project at first. For example, if the company is building a recommendation system or a prediction system, you might be asked to preprocess the data. Once you gain experience and become the team lead, you will have to plan the whole project yourself from the moment it arrives.

Here are the 7 main steps a good data scientist and data science team lead should follow when planning a project:

1. Business Problem
2. Type of Problem
3. Current Solution
4. Getting Data
5. Metrics to Measure
6. Online or Batch Training
7. Checking Assumptions

Business Problem:

Suppose you are the head of the data science department at Netflix. You are in a meeting about how to generate more revenue, and the other department heads ask for your opinion. From your point of view, there are three options: first, bring more users to Netflix through marketing; second, lower the prices of Netflix plans; and third, reduce the churn rate, meaning you focus on existing users so they don't quit the platform. Suppose you go with the third option. The churn rate is currently 4%, so you call a meeting with your team to discuss it and set a 6-month target for bringing it down.

Type of Problem:

After setting your goal, you decide what type of problem this is. Is it a regression problem or a classification problem? Suppose you want to predict which users might leave the platform next month so that you can offer them a 50% discount. This is a classification problem, because a user either leaves or does not (yes or no). If the model also gives you the probability that a user leaves, you can scale the discount to it: a user with a 10% chance of leaving gets a smaller discount than a user with a 100% chance of leaving.
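As a rough sketch of the discount idea, the function below maps a predicted churn probability to a discount. Only the 50% discount comes from the example above; the function name `discount_for` and the tier boundaries (0.9, 0.5, 0.1) are assumptions made up for illustration, and a real team would pick them from business constraints.

```python
def discount_for(churn_probability: float) -> int:
    """Map a predicted churn probability to a discount percentage.

    The tiers are hypothetical; only the 50% top discount is from the example.
    """
    if not 0.0 <= churn_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if churn_probability >= 0.9:
        return 50  # almost certain to leave: full retention offer
    if churn_probability >= 0.5:
        return 25  # more likely to leave than to stay
    if churn_probability >= 0.1:
        return 10  # small risk: small discount
    return 0       # very unlikely to leave: no discount


# A user with a 100% chance of leaving gets more than a user with a 10% chance.
high_risk = discount_for(1.0)
low_risk = discount_for(0.1)
```

Treating the output as a probability instead of a hard yes/no is what makes graded discounts like this possible.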

Current Solution:

For the current solution, you ask the CTO whether any model already predicts user behavior. Suppose there is already a model that predicts churn at Netflix: you can then check how well it works and talk to the team that built it.

Getting Data:

This is a crucial step. You work with the data engineering team and tell them which features you need for this problem. For example, you might need each user's movie watching time and searches.
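To make the idea of "features" concrete, here is a minimal sketch that turns raw event logs into per-user watch time and search counts. The event format `(user_id, event_type, value)`, the feature names, and the sample rows are all assumptions for illustration; the real data would come from the data engineering team.

```python
from collections import defaultdict


def build_features(events):
    """Aggregate raw events into one feature row per user (hypothetical schema)."""
    features = defaultdict(lambda: {"watch_minutes": 0, "search_count": 0})
    for user_id, event_type, value in events:
        if event_type == "watch":
            features[user_id]["watch_minutes"] += value  # value = minutes watched
        elif event_type == "search":
            features[user_id]["search_count"] += 1       # value is ignored for searches
    return dict(features)


# Made-up sample events, not real data.
events = [
    ("u1", "watch", 90),
    ("u1", "search", None),
    ("u2", "watch", 15),
    ("u1", "watch", 45),
    ("u2", "search", None),
    ("u2", "search", None),
]
user_features = build_features(events)
```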

Metrics to Measure:

You have to define metrics that measure whether the model is predicting correctly. For instance, when the model is sure that a user will leave the platform, does that user really leave, and did you give them a discount in time or not?
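Two common metrics for a yes/no churn model are precision (of the users flagged as leaving, how many really left) and recall (of the users who really left, how many were flagged). The sketch below computes both by hand; choosing these two metrics is an assumption on my part, and the labels are made-up sample values.

```python
def precision_recall(actual, predicted):
    """Compute precision and recall for binary labels (1 = churned, 0 = stayed)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up labels for six users.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
precision, recall = precision_recall(actual, predicted)
```

Low precision means discounts go to users who would have stayed anyway; low recall means users leave without ever getting an offer.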

Online or Batch Training:

Once your model works well, you need to keep training it on new data as it comes in. Online training means the model is connected directly to the data warehouse and is updated continuously as new data arrives, while batch training means you retrain the model later on an accumulated batch of data.
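The difference can be sketched with a tiny logistic regression written by hand. `TinyChurnModel`, `train_batch`, `train_online`, the single feature (fraction of the day spent watching), and the data are all hypothetical; a real project would more likely use a library model. Batch training here makes repeated passes over a stored dataset, while online training applies one update per record as it arrives.

```python
import math


class TinyChurnModel:
    """Hand-written logistic regression, for illustration only."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One gradient step on a single (features, label) example."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


def train_batch(model, rows, labels, epochs=200):
    """Batch style: repeated passes over a dataset collected earlier."""
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            model.update(x, y)


def train_online(model, stream):
    """Online style: update once per record, as each record arrives."""
    for x, y in stream:
        model.update(x, y)


# Made-up data: users who watch little (small x) churned (label 1).
rows = [[0.1], [0.2], [0.8], [0.9]]
labels = [1, 1, 0, 0]

model = TinyChurnModel(n_features=1)
train_batch(model, rows, labels)
weights_after_batch = list(model.weights)

# Later, new records trickle in and the same model keeps learning.
train_online(model, [([0.15], 1), ([0.85], 0)])
```

The same `update` step is used in both cases; what changes is when it runs and on how much data.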

Checking Assumptions:

Now you check your assumptions. For example, is one model good for all regions? Most likely not, because each region has its own likes and dislikes, so you have to test assumptions like this one before relying on the model.
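One simple way to test the "one model fits all regions" assumption is to score the same model separately per region. The sketch below does that with accuracy; the region codes, the record format `(region, actual, predicted)`, and the values are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_region(records):
    """Return the share of correct predictions for each region."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for region, actual, predicted in records:
        total[region] += 1
        correct[region] += int(actual == predicted)
    return {region: correct[region] / total[region] for region in total}


# Made-up results from one model applied to two regions.
records = [
    ("US", 1, 1), ("US", 0, 0), ("US", 1, 1), ("US", 0, 0),
    ("IN", 1, 0), ("IN", 0, 0), ("IN", 1, 0), ("IN", 0, 1),
]
scores = accuracy_by_region(records)
```

A large gap between regions, as in this toy data, would suggest that a single model is not enough and that region-specific features or models are worth trying.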
