# Choosing a Project

### Focus on Problems First, Then Data

Rather than starting with a dataset and trying to find something interesting to do with it, identify a meaningful problem that could benefit from machine learning. This approach tends to lead to more impactful and realistic projects.

### Look for Human Pain Points

Consider processes or tasks that are:

* Time-consuming for humans
* Tedious or repetitive
* Prone to human error

These often make excellent candidates for machine learning solutions.

### Project Sources

#### Bring Your Own Project

If you already have a problem in mind or access to interesting data:

* Present your idea during a course session or in the course chat
* If you have doubts about the project, formulate specific questions and ask them in the course chat

#### Partner with Organizations

Local companies and academic departments often have real-world problems waiting for ML solutions:

* Reach out to businesses or research groups in your area
* Inquire about data-intensive problems they're facing
* Ask us for help defining the project scope with your partner
* If data confidentiality is a concern, use the [NDA template](#non-disclosure-agreement-nda) provided at the end of this page

#### Public Datasets

Public datasets are convenient, but they often come with limitations:

* Projects tend to focus more on model optimization than real-world implementation
* You miss valuable experience in data preparation and feature engineering
* The problem may be artificially clean compared to real-world scenarios

However, if you choose this route, consider adding complexity by:

* Combining multiple datasets
* Creating your own validation methodology
* Adding constraints that reflect real-world conditions
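
As a rough illustration of the first two ideas, the sketch below joins two toy record sets on a shared key and then splits the result by date rather than at random, so that validation data always lies after the training data. Everything in it is an assumption made for illustration: the records, the field names (`station`, `day`, `temp_c`, `region`), and the helper functions `combine` and `time_based_split` are hypothetical and not part of any course material.

```python
from datetime import date

# Made-up toy records standing in for two public sources (illustrative only).
weather = [
    {"station": "A", "day": date(2024, 1, 1), "temp_c": 2.0},
    {"station": "B", "day": date(2024, 1, 2), "temp_c": 5.5},
    {"station": "A", "day": date(2024, 1, 3), "temp_c": 1.0},
    {"station": "C", "day": date(2024, 1, 4), "temp_c": 7.0},  # no metadata below
    {"station": "B", "day": date(2024, 1, 5), "temp_c": 6.0},
]
stations = {"A": {"region": "north"}, "B": {"region": "south"}}


def combine(records, lookup, key):
    """Attach lookup attributes to each record; drop records without a match."""
    return [{**r, **lookup[r[key]]} for r in records if r[key] in lookup]


def time_based_split(records, cutoff, time_key="day"):
    """Train on rows strictly before the cutoff, validate on rows from it onward."""
    train = [r for r in records if r[time_key] < cutoff]
    valid = [r for r in records if r[time_key] >= cutoff]
    return train, valid


combined = combine(weather, stations, "station")
train, valid = time_based_split(combined, cutoff=date(2024, 1, 3))
```

A date-based split like this is one possible way to mimic deployment, where a model only ever sees the past. Whether it fits depends on the problem: for data without a time dimension, grouping by customer, site, or device may be a more realistic constraint.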

### Data Resources

If you're looking for public datasets, here are some valuable repositories:

* [**ChallengeData**](https://challengedata.ens.fr/challenges/challenges_search): Datasets from real-world challenges
* [**Hugging Face Datasets**](https://huggingface.co/datasets): Collections ready for NLP and other ML tasks
* [**UCI Machine Learning Repository**](https://archive.ics.uci.edu/): Classic, well-documented datasets
* [**Kaggle**](https://www.kaggle.com/datasets): Competitive datasets with community solutions
* [**Roboflow Universe**](https://universe.roboflow.com/): Computer vision datasets
* [**Papers with Code**](https://paperswithcode.com/datasets): Datasets linked to research papers

### Examples of Past Projects

For inspiration, review the projects nominated for the 2025 VDE Machine Learning Prize, presented in the document below.

{% file src="<https://4020123021-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MHobCAnoTQkN71lOgdv%2Fuploads%2FGa55oICkEufV5t6lcIDL%2FVDE%20Machine%20Learning%20Prize%202025.pdf?alt=media&token=cae28e6f-a79b-44e1-b2fe-38a461352a76>" %}

### Non-Disclosure Agreement (NDA)

If you need an NDA for data you receive from an organization or partner, you can use the following template:

{% file src="<https://4020123021-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MHobCAnoTQkN71lOgdv%2F-MOVSu6Vf8w7NUH23Etm%2F-MOXBtzubKR3rCJmwf6R%2FNDA_Projects_engl_v0.2.pdf?alt=media&token=3df79608-5200-4cb9-a0dc-94fd558185be>" %}
Non-Disclosure Agreement (NDA)
{% endfile %}
