Project BDS: A Business Data Simulator for Teaching and Research
Centre de Recherche Créatech sur les Organisations Intelligentes
Business data analysis is a very diverse field that spans various specializations, technologies, and techniques. Many courses focus on one or a few aspects of business data analysis, effectively shielding students from the complexity of the entire process. This is partly due to the absence of data usable in different contexts, such as learning SQL, data architecture, data analysis, analytics, machine learning or even business intelligence as a whole. Most available databases and datasets tend to serve one specific purpose.
A business data simulator could, however, create databases and datasets in various formats with specific characteristics that fulfill the needs of different instructors. The data created could serve purposes as simple as illustrating the entire data value chain with a single example, or as elaborate as immersing the student in the mechanics of each step of a coherent data value creation process, in a controlled learning environment, over many courses or professional training sessions.
The Centre de Recherche Créatech sur les Organisations Intelligentes is a research center at Université de Sherbrooke that focuses on the use of data and information to improve the performance of organizations. This project is an initiative of Pr. Daniel Chamberland-Tremblay.
Pr. Chamberland-Tremblay teaches Business Technology Management and Business Intelligence at École de gestion at Université de Sherbrooke. His research interests include data management, data governance, data security and the data-based value creation process.
Creating interesting business databases and datasets is labor-intensive and time-consuming. For this reason, instructors tend to rely on the same data over the years and for all their students. This situation increases the risk of plagiarism and limits students' exposure to typical data problems across specializations.
Objectives and limits
The main objective of this project is to develop a business data simulator to support research and teaching in the field of business technology management. More specifically, the project goals are:
- Create a reference data set for the simulator, e.g., names, products, addresses.
- Develop a simulator to produce plausible business databases and datasets for different functional contexts with specific characteristics like size, span, trend, events, and typical data quality problems.
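To make the second goal more concrete, here is a minimal sketch of what a generator with configurable size, span, trend, events and data quality problems might look like. It uses only the Python standard library. The function name `simulate_monthly_sales`, its parameters, the 2023 start year and the choice of missing values as the quality problem are all assumptions made for illustration; they are not part of the project specification.

```python
import random
from datetime import date


def simulate_monthly_sales(n_months=24, base=1000.0, trend=25.0,
                           event_month=12, event_boost=400.0,
                           missing_rate=0.1, seed=42):
    """Hypothetical generator: monthly sales with a linear trend,
    one promotional event, and randomly missing amounts."""
    rng = random.Random(seed)  # a fixed seed makes the dataset reproducible
    rows = []
    for i in range(n_months):
        year_offset, month_index = divmod(i, 12)
        # Linear trend plus Gaussian noise around the base level.
        amount = base + trend * i + rng.gauss(0, 50)
        # A single business event (e.g., a promotion) boosts one month.
        if i == event_month:
            amount += event_boost
        # Typical data quality problem: the amount is sometimes missing.
        if rng.random() < missing_rate:
            amount = None
        rows.append({
            "month": date(2023 + year_offset, month_index + 1, 1).isoformat(),
            "amount": None if amount is None else round(amount, 2),
        })
    return rows


if __name__ == "__main__":
    for row in simulate_monthly_sales(n_months=6):
        print(row)
```

Here the size and span map to `n_months`, the trend to `trend`, the event to `event_month` and `event_boost`, and the quality problem to `missing_rate`. A real simulator would likely expose many more such characteristics.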
This project uses the open-source MIKE2.0 (Method for an Integrated Knowledge Environment) methodology.
MIKE2.0 is an open-source methodology for Enterprise Information Management that provides a framework for iterative information development. The main goals are:
- Driving an overall approach through an organization's Information Strategy
- Enabling people with the right skills to build and manage new information systems while creating a culture of information excellence
- Moving to a new organisational model that delivers an improved information management competency
- Improving processes around information compliance, policies, practices, and measurement
- Delivering contemporary technology solutions that meet the needs of highly federated organizations
Information development starts with the assessment of the organizational business and technological contexts. Combined with the organizational information objectives, these phases result in a gap analysis that details priorities for the implementation phases. The iterative nature of information development ensures that information resources are developed in an incremental manner that fits the evolving organizational needs.
As the project progresses, steps can be added, modified, or removed to improve the different aspects of data management from data identification and collection to data preparation and analysis.
This project will rely on the Python language and the data simulation capabilities of certain libraries like scikit-learn.
Scikit-learn is an open-source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.
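As an illustration of the kind of data simulation scikit-learn offers, the sketch below uses `sklearn.datasets.make_blobs` to generate customers grouped into segments. The wrapper `simulate_customers`, the column names and the idea of treating clusters as customer segments are assumptions for this example. Mapping the raw features to realistic business attributes is left open.

```python
from sklearn.datasets import make_blobs


def simulate_customers(n_customers=300, n_segments=3, seed=0):
    """Hypothetical helper: customers with two numeric features,
    clustered into n_segments groups by make_blobs."""
    features, segments = make_blobs(
        n_samples=n_customers,
        centers=n_segments,
        n_features=2,
        random_state=seed,  # fixed seed for reproducible output
    )
    customers = []
    for customer_id, (row, segment) in enumerate(zip(features, segments), start=1):
        customers.append({
            "customer_id": customer_id,
            # Generic feature names; a simulator would rescale these
            # into attributes such as spend or visit frequency.
            "feature_1": float(row[0]),
            "feature_2": float(row[1]),
            "segment": int(segment),
        })
    return customers


if __name__ == "__main__":
    for customer in simulate_customers(n_customers=5, n_segments=2):
        print(customer)
```

Other generators in `sklearn.datasets`, such as `make_regression` and `make_classification`, follow the same pattern and could serve datasets aimed at analytics or machine learning courses.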
The project will be hosted on GitHub.
The iterative nature of the project warrants regular monitoring. Typically, students are expected to present their progress every week, or every two weeks, in short briefings similar to Scrum stand-up meetings. During these meetings, the student will be asked to report on the work done, on the tasks to come and on the issues blocking progress.
Students can also contact the supervisor through email or Teams at any time to resolve blocking issues.
This project can be designed to fit internships ranging from 45h to 225h. Depending on the exact availability of the candidate, the project can be limited to a few data topics and attributes or be expanded to a full-blown business data simulator. The project is divided into five stages:
Stage 1: Project setup and kickoff
This stage is dedicated to the installation of the project's technological environment, including Python, scikit-learn and their dependencies. Functional tests will be carried out to ensure that the components behave as expected.
During this stage, the candidate will be required to become familiar with the basics of each framework.
Stage 2: Building a proof of concept
At this stage, the candidate will build a simple, yet complete, data simulator that can generate a simple data set. Deliverables include the proof-of-concept simulator, a model of the data generation process and a first sample dataset.
Stage 3: Scaling data simulation
This stage is dedicated to expanding the capabilities of the simulator from generating simple datasets to complex databases. The simulator should accommodate more objects and more complex data attributes. Deliverables include the simulator and different databases. This stage can be scaled up or down depending on the internship duration by adding new business areas and data attributes or restricting the generation to a few important business cases.
Stage 4: Personalization: introducing variations on a theme (optional stage)
If the simulator reaches a certain level of sophistication within a reasonable timeframe of the internship, the capability of slightly altering data models and content will be added to the project. This capability will allow instructors to offer personalized databases to students while keeping the alterations traceable to support and ease student evaluation. This will reduce the risk of plagiarism while retaining the possibility of student cooperation. This capability may include minor changes such as renamed database objects or slight alterations in values.
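One possible way to combine personalization with traceability is to derive every variation from a per-student seed and to record each alteration in a change log. The sketch below is only an illustration under that assumption; the function `personalize`, the `column_aliases` mapping and the log format are hypothetical and not prescribed by the project.

```python
import random


def personalize(rows, student_id, column_aliases, jitter=0.02):
    """Hypothetical helper: return (personalized_rows, change_log).

    rows           -- list of dicts sharing the same keys
    student_id     -- string used as the random seed, so results are repeatable
    column_aliases -- {column: [possible names]} for renaming objects
    jitter         -- maximum relative change applied to float values
    """
    rng = random.Random(student_id)
    # Pick one alias per column; sorting keeps the choice order stable.
    renames = {col: rng.choice(options)
               for col, options in sorted(column_aliases.items())}
    log = [{"change": "rename", "from": old, "to": new}
           for old, new in renames.items() if old != new]
    personalized = []
    for index, row in enumerate(rows):
        new_row = {}
        for col, value in row.items():
            if isinstance(value, float):
                # Slightly alter numeric values and record the alteration.
                new_value = round(value * (1 + rng.uniform(-jitter, jitter)), 2)
                log.append({"change": "value", "row": index, "column": col,
                            "from": value, "to": new_value})
                value = new_value
            new_row[renames.get(col, col)] = value
        personalized.append(new_row)
    return personalized, log


if __name__ == "__main__":
    base = [{"customer": "A", "amount": 100.0},
            {"customer": "B", "amount": 250.0}]
    aliases = {"amount": ["amount", "total", "sales_amount"]}
    data, changes = personalize(base, "student-001", aliases)
    print(data)
    print(changes)
```

Because the seed is the student identifier, an instructor could regenerate any student's database and its change log on demand, which supports evaluation without storing every variant.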
Stage 5: Reporting on the project
The last, but mandatory, stage is project reporting. The candidate will be asked to package all development in a manner that supports reuse by others. A short project report will also be required before the completion of the internship.
The salary for the internship ranges from $18/h to $22/h depending on experience and skills.
The internship is open to undergraduate and graduate students. The ideal candidate should be knowledgeable in Python and SQL. A basic knowledge of business processes and standard information systems is an asset.
Candidates who want to know more about this project or who wish to apply should contact Pr. Chamberland-Tremblay by email at firstname.lastname@example.org.