Project CDS: A Career Data Scraper for Teaching and Research
Centre de Recherche Créatech sur les Organisations Intelligentes
The current labour shortage in Canada has become a considerable risk for many organisations. Businesses struggle to recruit and retain employees, and the shortage is particularly acute in the technology sector. Despite the situation, little is known about the specific skills and knowledge that are in high demand. Instead, businesses compete on salary and working conditions in an attempt to solve their talent crisis. This reaction may be explained in part by the low digital literacy of Human Resources specialists, who turn to traditional management methods rather than use novel approaches such as Data Science to better understand their talent gap.
The Centre de Recherche Créatech sur les Organisations Intelligentes is a research center at Université de Sherbrooke that focuses on the use of data and information to improve the performance of organizations. This project is an initiative of Pr. Daniel Chamberland-Tremblay.
Pr. Chamberland-Tremblay teaches Business Technology Management and Business Intelligence at the École de gestion of Université de Sherbrooke. His research interests include data management, data governance, data security and data-based value creation.
Most job postings are free-form text meant to be read by humans. Job data analysis is therefore typically a manual, time-consuming and labour-intensive task. To better understand market trends in transferable and domain-specific competencies, abilities and skills, businesses and universities must turn to more sophisticated methods.
Objectives and limits
The main objective of this project is to develop a functional Web data scraper capturing job postings from different sources to support research and teaching in the field of business technology management. More specifically, the project goals are:
- Create a functional data scraper that works for specific job sites and, if time allows, that can be adapted to different scraping contexts.
- Collect, clean and store job posting data as input for future data science initiatives and research projects.
This project uses the open-source MIKE2.0 methodology (Method for an Integrated Knowledge Environment).
MIKE2.0 is an open-source methodology for Enterprise Information Management that provides a framework for iterative information development. The main goals are:
- Driving an overall approach through an organization's Information Strategy
- Enabling people with the right skills to build and manage new information systems while creating a culture of information excellence
- Moving to a new organisational model that delivers an improved information management competency
- Improving processes around information compliance, policies, practices, and measurement
- Delivering contemporary technology solutions that meet the needs of highly federated organizations
Information development starts with the assessment of the organizational business and technological contexts. Combined with the organizational information objectives, these assessments result in a gap analysis that details priorities for the implementation phases. The iterative nature of information development ensures that information resources are developed incrementally, in step with evolving organizational needs.
As the project progresses, steps can be added, modified, or removed to improve the different aspects of data management from data identification and collection to data preparation and analysis.
This project will rely on two well-known Python data collection frameworks. Scrapy is a fast, high-level, open-source web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Though Scrapy has greatly simplified Web access, users must have a strong understanding of HTML and CSS. An understanding of XML and XPath is also an asset.
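As a rough illustration, the sketch below shows the general shape of a Scrapy spider. The URL and CSS selectors are hypothetical placeholders; the actual job sites and page structures will be chosen during the internship.

    import scrapy

    class JobPostingSpider(scrapy.Spider):
        """Minimal sketch of a Scrapy spider for a hypothetical job board."""
        name = "job_postings"
        # Placeholder URL; the real sources will be selected during the project.
        start_urls = ["https://example.org/jobs"]

        def parse(self, response):
            # Illustrative selectors; each site requires its own selectors.
            for posting in response.css("div.job-card"):
                yield {
                    "title": posting.css("h2::text").get(),
                    "company": posting.css("span.company::text").get(),
                    "description": posting.css("div.description::text").get(),
                }
            # Follow pagination links when present.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)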
Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.
It provides extensions to emulate user interaction with browsers, a distribution server for scaling browser allocation, and the infrastructure for implementations of the W3C WebDriver specification that lets you write interchangeable code for all major web browsers.
The Selenium Client Driver in Python is available on PyPI. The project will be hosted on GitHub.
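For pages that render their content with JavaScript, a browser can be driven through the Selenium Python client. The sketch below uses a hypothetical URL and selector and assumes a compatible browser and driver are installed; it simply waits for job listings to appear before reading the page source, which can then be parsed with Scrapy selectors.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Launch a browser session (assumes a compatible browser/driver is installed).
    driver = webdriver.Firefox()
    try:
        # Placeholder URL; real sources will be chosen during the internship.
        driver.get("https://example.org/jobs")
        # Wait until at least one job listing (illustrative selector) is rendered.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.job-card"))
        )
        html = driver.page_source  # can then be handed to Scrapy selectors
    finally:
        driver.quit()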
The iterative nature of the project warrants regular monitoring. Typically, students are expected to present their progress every week or every two weeks in short briefings similar to Scrum stand-up meetings. During these meetings, the student will be asked to report on the work done, the tasks to come and the issues blocking progress.
Students can also contact the supervisor through email or Teams at any time to resolve blocking issues.
This project can be designed to fit internships ranging from 45 h to 225 h, or even more. Depending on the availability of the candidate, the project can be limited to a few data sources or expanded to a full-blown Data Science project, including a phase of Text Mining analysis.
The project is divided into five stages:
Stage 1: Project setup and kickoff
This stage is dedicated to the installation of the project's technological environment, including Python, Selenium, Scrapy and their dependencies. Functional tests will be carried out to ensure the adequate behavior of the components. During this stage, the candidate will be required to familiarise themselves with the basics of each framework.
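A minimal smoke test of the environment could look like the following; it only checks that the main frameworks import correctly and reports their versions.

    # Quick check that the main dependencies are installed and importable.
    import sys
    import scrapy
    import selenium

    print("Python  :", sys.version.split()[0])
    print("Scrapy  :", scrapy.__version__)
    print("Selenium:", selenium.__version__)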
Stage 2: Building a proof of concept
At this stage, the candidate will build a simple, yet complete, data scraper from a standardized source. Deliverables include the proof-of-concept scraper, a model of the data acquisition process and a first sample of data.
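One possible shape for the collected records is sketched below as a Scrapy Item. The field names are illustrative; the exact record structure will follow the data acquisition model produced at this stage.

    import scrapy

    class JobPostingItem(scrapy.Item):
        """Illustrative record structure for a scraped job posting."""
        title = scrapy.Field()        # posting title as displayed on the site
        company = scrapy.Field()      # employer name
        location = scrapy.Field()     # city / region / remote
        description = scrapy.Field()  # free-form posting text for later analysis
        url = scrapy.Field()          # source URL of the posting
        scraped_at = scrapy.Field()   # collection timestamp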
Stage 3: Scaling data acquisition
This stage is dedicated to expanding the capabilities of the scraper to multiple sites in a software architecture that enables reuse of the scraper. Deliverables include the scraper and a data set supporting Text Mining skills analysis. This stage can be scaled up or down depending on the internship duration by adding new sites or restricting the collection to a few important ones.
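One way to keep the code reusable across sites is a shared base spider that concentrates the common crawling logic, with per-site subclasses holding only the names, URLs and selectors. The class names and selectors below are hypothetical.

    import scrapy

    class BaseJobSpider(scrapy.Spider):
        """Shared crawling logic; subclasses provide site-specific selectors."""
        posting_selector = None   # CSS selector for one posting block
        title_selector = None     # CSS selector for the posting title

        def parse(self, response):
            for posting in response.css(self.posting_selector):
                yield {
                    "title": posting.css(self.title_selector).get(),
                    "url": response.url,
                }

    class ExampleBoardSpider(BaseJobSpider):
        """Hypothetical site-specific spider: only names, URLs and selectors."""
        name = "example_board"
        start_urls = ["https://example.org/jobs"]
        posting_selector = "div.job-card"
        title_selector = "h2::text"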
Stage 4: Data science from cleaning to analytics (optional stage)
If the candidate builds a substantial data set within a reasonable timeframe of the internship, data exploration and analysis will be encouraged. Various Text Mining techniques will be applied to prepare and analyse the data set. A basic knowledge of nltk and scikit-learn is required if the candidate wishes to include this stage in the internship.
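As a hint of what this stage could involve, the sketch below vectorizes posting descriptions with scikit-learn and lists the extracted vocabulary. The sample texts are made up and stand in for the collected data set.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Made-up posting descriptions standing in for the collected data set.
    descriptions = [
        "Python developer with SQL and data cleaning experience",
        "Business analyst, strong SQL and reporting skills",
        "Data scientist familiar with Python, nltk and scikit-learn",
    ]

    # Turn the free-form text into a TF-IDF term matrix.
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(descriptions)

    # Report the vocabulary extracted from the sample postings.
    print(sorted(vectorizer.get_feature_names_out()))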
Stage 5: Reporting on the project
The last, but mandatory, stage is project reporting. The candidate will be asked to package all development in a manner that supports reuse by others. A short project report will also be required before the completion of the internship.
The salary for the internship ranges from $18/h to $22/h depending on experience and skills.
The internship is open to undergraduate and graduate students. The ideal candidate should be knowledgeable in Python, HTML, CSS and SQL. A basic knowledge of data cleaning, XML and XPath would be an asset.
Candidates that want to know more about this project can contact Pr. Chamberland-Tremblay through email at firstname.lastname@example.org.