Data Mining Report: A Patchwork Assignment
The coursework is an individual piece of assessment, requiring you to analyse the ORGANICS dataset within SAS Enterprise Miner, using the directed data mining techniques covered in the IMAT3613 module, and detailing your results, interpretations, conclusions and recommendations in a well-structured technical report. You are provided with:
1. This Brief.
2. The ORGANICS dataset, which contains 10,000 observations and the 13 variables shown in Appendix B.
3. The marking grid in Appendix C, against which the coursework will be assessed.
4. The Self/Peer Assessment Rubric in Appendix D.
5. The Template Report in Appendix A.
Lab Journal and Reflection
To help you produce this report in a timely manner, the report is built up from four biweekly activities. You have an opportunity to modify your work from each activity in light of your own reflection and self-assessment feedback. Consider using a weekly diary to track your progress in this course. To help you produce this reflection you may make use of the self-assessment grid (Appendix D) to record your progress.
The fifth and final activity, which you will complete independently, is to produce an integrated report with conclusions and recommendations.
In the odd weeks it is suggested that you upload your answer to the activity in the journal.
In the even weeks, you are expected to comment on your work using the self-assessment rubric and assign a grade A, B, C, D or F. This has been designed to help you structure your work and pace the development of the report over the term.
You may modify your weekly contributions through your engagement in the lab journal. In fact, you are encouraged to do so. You should treat the lab journal as a notebook of your activities for the week.
In the final exercise you will integrate the four biweekly activities and the final activity into a single report.
In each activity you are expected to produce a piece of writing of between 200 and 400 words, leading to a final report of at most 2000 words, excluding the table of contents, diagrams and appendices. You are provided with a template report to complete; existing words in the template do not count towards the report maximum.
This type of assessment is known as a patchwork assessment and is more UDL (Universal Design for Learning) friendly than traditional forms of report assessment.
The patchwork assessment gives you an opportunity to improve your work over the term and reduces the stress of having to produce one piece of writing at the last minute.
The final report is summative and is marked by your tutor according to the attached marking grid.
Individual Data Set
You will each generate a unique model set that is personal to you.
Each of you will work on your own random sample of the data, typically generated by entering the last 5 digits of your DMU student id number into the random seed generator within the Data Partition node. You will be shown how to do this in the labs.
Note: If spurious output occurs for any of the models, enter the last 4 digits (or the last 3 digits) of your DMU student id number into the random seed generator instead, so that you can generate sensible output that you can interpret.
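The seed fallback described above can be sketched outside SAS Enterprise Miner as follows. This is a minimal illustration in Python only; the function name `candidate_seeds` and the example id are hypothetical, and the seed itself is still entered by hand in the Data Partition node.

```python
def candidate_seeds(student_id: str) -> list[str]:
    """Return the last 5, 4 and 3 digits of a student id, in the order to try them.

    The values are kept as strings so that any leading zero stays visible.
    """
    # Ignore any non-digit characters, such as a letter prefix (an assumption).
    digits = "".join(ch for ch in student_id if ch.isdigit())
    if len(digits) < 5:
        raise ValueError("expected at least five digits in the student id")
    return [digits[-n:] for n in (5, 4, 3)]


# "P12345678" is a made-up id used only for illustration.
for seed in candidate_seeds("P12345678"):
    print(seed)
```

Try the first value; move to the next only if the models produce spurious output, and record the value you finally used for the Appendix.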
In SAS Enterprise Miner
Submission
You will need to submit a copy of your report using the Learning Zone link in the assessments section of the Data Mining module shell on Learning Zone (to be made available prior to the coursework deadline).
SCENARIO: THE ORGANICS DATASET
1. A supermarket is beginning to offer a line of organic products. The supermarket’s management would like to determine which customers are likely to purchase these products.
2. The supermarket has a customer loyalty program. As an initial buyer incentive plan, the supermarket provided coupons for the organic products to all of its loyalty program participants and has now collected data that includes whether or not these customers have purchased any of the organic products.
You are a data miner and have been commissioned by the supermarket’s manager to analyse the ORGANICS data and to provide the manager with the best model that s/he should use to identify the customers who are likely to buy the supermarket’s new line of organic products.
The analysis you are conducting will represent the first flow of the virtuous cycle of data mining.
You will be assessed on producing a technical, well-structured, comprehensive but concise report to the manager of the supermarket. This report is broken up into five activities, four of which you are encouraged to complete biweekly and self-assess using the lab journal. The final activity integrates the pieces into one report detailing:
Activity 1: Week 3 – Week 4
a) Develop a description of the business problem and appropriate data mining problem and describe a data mining framework that is appropriate for your brief. Identify the target variable.
b) Make appropriate use of Exploratory Data Analysis on your data set to develop insights that will inform your data mining process. Suggest any transformations which might be appropriate.
Activity 2: Week 5 – Week 6
a) Apply regression analyses to your dataset, including the full model and the Selection Methods: Forward, Backward and Stepwise. Develop a regression equation which includes only parameters that are significant at the 95% confidence level (that is, the 5% significance level).
b) Conduct a Decision Tree analysis on the data set, varying the default parameters, and present an interpretation of your results. Identify the target path(s) and critical path.
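As an illustration of item a) above, the step from a fitted model to a regression equation containing only significant parameters might be sketched as below. This is a hedged Python sketch, not SAS output: the parameter names (`x1`, `x2`), estimates and p-values are made up, the helper names are hypothetical, and keeping the intercept regardless of its p-value is an assumption.

```python
import math

ALPHA = 0.05  # 5% significance level, i.e. 95% confidence


def significant_terms(estimates, alpha=ALPHA):
    """Keep the intercept plus every parameter whose p-value is below alpha.

    `estimates` maps a parameter name to (coefficient, p_value).
    """
    return {
        name: (coef, p)
        for name, (coef, p) in estimates.items()
        if name == "Intercept" or p < alpha
    }


def logit_equation(terms):
    """Write the retained terms as a logistic regression equation."""
    parts = []
    for name, (coef, _) in terms.items():
        parts.append(f"{coef:+.3f}" if name == "Intercept" else f"{coef:+.3f}*{name}")
    return "logit(p) = " + " ".join(parts)


def predicted_probability(terms, row):
    """Apply the equation to one observation given as {name: value}."""
    z = sum(
        coef * (1.0 if name == "Intercept" else row[name])
        for name, (coef, _) in terms.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Made-up estimates for illustration only; they are not results from ORGANICS.
example = {"Intercept": (-1.0, 0.001), "x1": (0.5, 0.01), "x2": (0.2, 0.30)}
kept = significant_terms(example)
print(logit_equation(kept))
print(predicted_probability(kept, {"x1": 2.0}))
```

In practice the estimates and p-values would be read from the output of the selected regression model, and the model would normally be refitted after parameters are dropped rather than reusing the original coefficients.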
Activity 3: Week 7 – Week 8
a) Conduct a Neural Network analysis on the data set, varying the default parameters, and present an interpretation of your results.
b) Try different neural network architectures. Identify the most important weights and provide a diagram showing the neural network architecture.
Activity 4: Remaining time
a) Justify your final selected model by considering appropriate data mining strategies: Cumulative Lift Charts, Non-Cumulative Lift Charts and Diagnostic Charts.
b) Conclusions
c) Recommendations on how to improve the quality of the supermarket’s data collection process in the future, to give you, as a data miner, the opportunity to improve the accuracy of the data mining model in further flows of the data mining cycle.
Develop and integrate your activities into a full technical report.
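As background to item a) above, the quantities plotted in cumulative and non-cumulative lift charts can be sketched as follows. This is a minimal Python illustration under stated assumptions: `actual` holds 0/1 outcomes, `scores` holds predicted probabilities from any model, the function name `lift_table` is hypothetical, and the binning may differ in detail from the charts SAS Enterprise Miner produces.

```python
def lift_table(actual, scores, n_bins=10):
    """Return (depth, lift, cumulative_lift) per bin, highest scores first.

    lift            = response rate within the bin / overall response rate
    cumulative_lift = response rate in all bins so far / overall response rate
    """
    if len(actual) != len(scores) or not actual:
        raise ValueError("actual and scores must be non-empty and the same length")
    # Rank observations from the highest to the lowest predicted score.
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    overall_rate = sum(ranked) / n
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no responders")

    rows, cum_hits, start = [], 0, 0
    for b in range(1, n_bins + 1):
        end = round(b * n / n_bins)
        bin_actual = ranked[start:end]
        if not bin_actual:
            continue  # more bins than observations
        hits = sum(bin_actual)
        cum_hits += hits
        lift = (hits / len(bin_actual)) / overall_rate
        cum_lift = (cum_hits / end) / overall_rate
        rows.append((end / n, lift, cum_lift))
        start = end
    return rows


# Tiny made-up example: two responders, both ranked in the top half.
for depth, lift, cum_lift in lift_table([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1], n_bins=2):
    print(f"depth={depth:.0%} lift={lift:.2f} cumulative lift={cum_lift:.2f}")
```

A lift above 1 in the top bins means the model finds responders faster than random selection; the cumulative lift always ends at 1 when the whole sample is included.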
Appendix
In the Appendix of the report, you need to include:
a) A table of the model roles and measurement levels of the variables (to produce sensible analyses).
b) A view of the random seed generator illustrating the digits of your DMU student id number that you have used (to produce sensible analyses).
c) A copy of the process flow diagram.
d) A reflection of at least 200 words describing how your interaction with the discussion board modified or shaped the development of your report during the patchwork process.
Check List for Written Report
(not all of the below will be relevant to your report)
1. Title page
Does this include the:
Title?
Author’s name?
Module/course details?
2. Acknowledgements
Have you acknowledged all sources of help?
3. Contents
Have you listed all the main sections in sequence?
Have you included a list of illustrations?
4. Abstract or summary
Does this state:
The main task?
The methods used?
The conclusions reached?
The recommendations made?
5. Introduction
Does this include:
Your terms of reference?
The limits of the report?
An outline of the method?
A brief background to the subject matter?
6. Methodology
Does this include:
The form your enquiry took?
The way you collected your data?
7. Reports and findings
Are your diagrams clear, labelled and simple?
Do they relate closely to the text?
If you have used colour keys in your diagrams, have you made provision that these keys are understandable if you have submitted a black and white report?
8. Discussion
Have you identified key issues?
Have you suggested explanations for your findings?
Have you outlined any problems encountered?
Have you presented a balanced view?
9. Conclusions and recommendations
Have you drawn together all of your main ideas?
Have you avoided any new information?
Are any recommendations clear and concise?
10. References
Have you listed all references?
Have you included all the necessary information for locating each reference?
Are your references accurate?
Are your references in Harvard Notation?
11. Appendices
Have you only included supporting information?
Does the reader need to read these sections?
12. Writing style
Have you used clear and concise language?
Are your sentences short and jargon free?
Are your paragraphs tightly focused?
Have you used the active or the passive voice?
Notes
Report Guidance:
Your contribution to the report should be no longer than 2000 words, using a minimum font size of 12. You are given a report template, with the first steps of the work already written, a recommended structure and a table of contents. You are free to modify the layout to suit your own style. The aim of the template is to provide you with guidance as to the level of presentation that is expected in a technical report. To get marks for this section of the work you must complete the blanks. You should also grey out the template text to indicate that these are not your words; the greyed words do not count towards the total word count.
Marks are awarded for technical correctness, descriptions of models, appropriate justification of node and parameter choices, appropriate actions to guard against overfitting, indications of model robustness, model limitations, data insights, and analysis supported by appropriate charts.
Reports are expected to be written to a professional standard and to be clear and concise. Text should be supported by a relevant choice of diagrams and by tables that summarise data. Avoid repetition and redundancy, make appropriate use of appendices, and include a table of contents, page numbers, table and figure numbering, and an informative abstract or executive summary.
All diagrams must be legible and appropriately labelled; if short of space, use appendices. If you use coloured diagrams to illustrate or contrast points, you must provide a key for the colours. Assume that the audience of the report is senior management with no knowledge of the technical details of data mining.
I do not expect you to describe everything that you attempted; poor models or models of no consequence can be summarised in a table in the appendix. You should make your report concise by providing only one model description each for logistic regression, decision tree and neural network, i.e. the best-performing model for each type of classifier. To preserve the word limit, report only surprising or contrasting details or exceptional model performance.
Time Management Guidance:
You should budget half your time for using the SAS Enterprise Miner software to generate informative models and to explore nodes and appropriate non-default options. The other half of your time should be budgeted for producing a clear, well-structured report which describes your work and addresses the assignment brief.
The SAS software is a professional data mining product, full of features and options. You are free to explore these options; however, if you use something not covered in the course you must justify its use to receive credit. You will be penalised for inappropriate use of features you cannot adequately justify or explain.
There will come a point of diminishing returns where, no matter how much effort you put into the software, you cannot improve on model performance. At that point you should focus on the report. You should start work on the assignment immediately and make allowance for any difficulties you might face. Leaving the work until the last minute will result in a poor-quality report. You should not underestimate the time it takes to become fluent in the use of the software. If you have followed all the labs attentively and with understanding, you should not face too many hurdles. The completion of the assignment represents a capstone moment which will integrate everything you have learnt on the course.