Data Collection Assistant
Our organization is building a robust, evidence-based dataset that will be used to train a large language model and a corresponding retrieval-augmented generation (RAG) agent. We want to collect data in the form of PDFs from Canadian cancer websites such as CCO and Cancer.ca. There are over 100 different cancers for which we need to collect fact sheets.

This will involve several steps for the students, including:
- Familiarizing themselves with the company's goals and mission.
- Searching the web for, downloading, and sorting PDFs of cancer information fact sheets and symptom management algorithm fact sheets.
- Verifying that data is recorded accurately.
- Cleaning data to ensure consistency and usability.
- Suggesting new fact sheets or additional data they believe would advance the company's mission.
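
As an illustration only, the sorting and verification steps could be supported by a small script like the sketch below. It assumes a hypothetical folder layout in which downloaded files are saved as `<root>/<cancer_type>/<name>.pdf`; the function names and manifest fields are our own placeholders, not an existing tool. The script checks that each file really is a PDF and flags exact duplicates by checksum.

```python
import hashlib
from pathlib import Path


def is_pdf(path: Path) -> bool:
    """Return True if the file starts with the PDF magic bytes."""
    with path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> list[dict]:
    """List every .pdf under root with its validity and duplicate status.

    Assumes the hypothetical layout <root>/<cancer_type>/<name>.pdf.
    """
    first_seen: dict[str, str] = {}  # checksum -> first file with that content
    rows = []
    for path in sorted(root.rglob("*.pdf")):
        rel = path.relative_to(root)
        checksum = sha256_of(path)
        rows.append({
            "cancer_type": rel.parts[0] if len(rel.parts) > 1 else "unsorted",
            "file": rel.as_posix(),
            "valid_pdf": is_pdf(path),
            "sha256": checksum,
            "duplicate_of": first_seen.get(checksum, ""),
        })
        first_seen.setdefault(checksum, rel.as_posix())
    return rows
```

Rows with `valid_pdf` set to `False` (for example, an HTML error page saved with a `.pdf` extension) or a non-empty `duplicate_of` would be candidates for manual review during the cleaning step.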