






Detailed project descriptions and tasks for students in an applied statistics course. The projects involve data preprocessing, visualization, descriptive statistics, and modeling using various methods. The datasets include house prices, student performance, diets, flights, salaries, insurance, and supermarket sales. The goal is to gain insights and draw conclusions from the data.
Due: Session 12.
The class will be divided into groups of 5 to 6 students, and each group will be assigned a topic to study and present in class. The objective of this assessment is to encourage students to do research in groups and to communicate their results in an oral presentation. Presentations should be created using PowerPoint and should address:
Presentations should generally not exceed 15 minutes, to allow time for questions and discussion.
The presenters will be evaluated by the lecturer (50%) and by the rest of the class (50%) based on the following criteria:
i. Content: Is the presentation clear and focused? Does it cover all important content of the assigned topic?
ii. Preparation: How well prepared is the group? How good are the slides and supporting materials? How well does the group know its material?
iii. Presentation and Communication: How well organized is the presentation? How effectively does the group present, interact with, and involve the rest of the class? Does the group use its time effectively?
iv. Addressing questions: How effectively does the group deal with questions and comments?
v. Interest and Creativity: How interesting and creative is the group's presentation?
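The 50/50 weighting above can be illustrated with a short sketch. It assumes that the class's marks are averaged before weighting and that all marks share one scale (0 to 10 here); neither detail is stated in the assignment, and the function name and example marks are hypothetical.

```python
from statistics import mean


def combined_score(lecturer_score, peer_scores, lecturer_weight=0.5):
    """Weighted average of the lecturer's mark and the mean mark from the class.

    Assumption: peer marks are averaged first, then weighted 50/50 with the
    lecturer's mark. The assignment only states the 50%/50% split.
    """
    peer_average = mean(peer_scores)
    return lecturer_weight * lecturer_score + (1 - lecturer_weight) * peer_average


# Hypothetical marks on an assumed 0-10 scale.
print(combined_score(8.0, [7.0, 9.0, 8.0, 6.0]))
```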
The file "houseprice.csv" contains house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015. Besides the sale prices, the dataset provides details of the houses that help determine the price. Use this dataset to build a regression model to predict the house price.
Main variables are:
price: price of the houses
floors: number of floors
condition: rating from 1 to 5 (from worst to best)
view: rating from 0 to 4 (from worst to best)
sqft_above: area of the house
sqft_living: living area (includes land around the house)
sqft_basement: area of the basement
bedrooms: number of bedrooms
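As a starting point for the regression task, the following is a minimal sketch of a one-predictor least-squares fit of price on sqft_living, using only the Python standard library. The five rows are made up for illustration and are not taken from "houseprice.csv"; the function name is hypothetical, and a real analysis would use several predictors and check the model assumptions.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up example rows (NOT from houseprice.csv).
sqft_living = [1000, 1500, 2000, 2500, 3000]
price = [200000, 290000, 410000, 500000, 600000]

intercept, slope = fit_simple_ols(sqft_living, price)
predicted = intercept + slope * 1800  # predicted price for an 1800 sqft home
print(intercept, slope, predicted)
```

The slope is the estimated change in price per additional square foot of living area in this toy data.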
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data
The dataset "Diet.csv" contains information on 78 people who each undertook one of three diets. It includes background information such as age, gender (Female=0, Male=1), and height. The aim of the study was to see which diet was best for losing weight. It was also thought that the best diet might differ between males and females, so the independent variables are diet and gender.
Main variables are:
Person: index of the participant
gender: gender of the participant (Female=0, Male=1)
Age:
Height:
pre.weight: weight before the diet
Diet: type of diet (1, 2, or 3)
weight6weeks: weight after 6 weeks on the chosen diet
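Since the study compares diets by weight lost, a natural derived variable is the difference between the weight before the diet and the weight after 6 weeks. The sketch below computes that difference and averages it per (Diet, gender) group. The rows are made up, the column name pre.weight is assumed to match the file header, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_loss_by_group(rows):
    """Average weight loss (pre.weight - weight6weeks) per (Diet, gender) group."""
    groups = defaultdict(list)
    for row in rows:
        loss = row["pre.weight"] - row["weight6weeks"]
        groups[(row["Diet"], row["gender"])].append(loss)
    return {key: mean(values) for key, values in groups.items()}


# Made-up rows (NOT from Diet.csv); gender: Female=0, Male=1.
rows = [
    {"gender": 0, "Diet": 1, "pre.weight": 60.0, "weight6weeks": 57.0},
    {"gender": 0, "Diet": 1, "pre.weight": 65.0, "weight6weeks": 60.0},
    {"gender": 1, "Diet": 1, "pre.weight": 80.0, "weight6weeks": 78.0},
    {"gender": 0, "Diet": 3, "pre.weight": 58.0, "weight6weeks": 52.0},
    {"gender": 1, "Diet": 3, "pre.weight": 85.0, "weight6weeks": 81.0},
]

for (diet, gender), loss in sorted(mean_loss_by_group(rows).items()):
    print(f"Diet {diet}, gender {gender}: mean loss {loss:.1f}")
```

Group means like these are the quantities a two-way comparison of diet and gender would examine.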
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data
The dataset "flights.csv" contains information about all flights that departed from the two major airports of the Pacific Northwest (PNW), SEA in Seattle and PDX in Portland, in 2014: 162,049 flights in total. The main goal of the project is to use this dataset to identify the major factors that cause flights to be delayed or postponed.
Main variables are:
year, month, day: Date of departure
carrier: Two letter carrier abbreviation. See airlines to get name.
origin, dest: Origin and destination. See airports for additional metadata.
dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local time zone.
distance: Distance between airports, in miles.
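Because the departure and arrival times are stored as HHMM or HMM integers, they cannot be subtracted directly. The sketch below converts such a value to minutes after midnight and, as an illustration, recovers the scheduled departure time from the actual time and the delay. The scheduled time is a derived quantity, not a column named in the list above, and both function names are hypothetical.

```python
def hhmm_to_minutes(value):
    """Convert an HHMM or HMM clock reading (e.g. 517 or 1432) to minutes after midnight."""
    hours, minutes = divmod(int(value), 100)
    if not (0 <= hours <= 23 and 0 <= minutes <= 59):
        raise ValueError(f"not a valid HHMM time: {value!r}")
    return hours * 60 + minutes


def scheduled_departure(dep_time, dep_delay):
    """Scheduled departure as 'HH:MM', i.e. actual time minus delay.

    The modulo wraps around midnight, so a flight leaving at 00:05 after a
    20-minute delay was scheduled for 23:45 on the previous day.
    """
    total = (hhmm_to_minutes(dep_time) - int(dep_delay)) % (24 * 60)
    return f"{total // 60:02d}:{total % 60:02d}"


print(hhmm_to_minutes(517))            # 5:17 am in minutes after midnight
print(scheduled_departure(1432, 12))   # left 12 minutes late
print(scheduled_departure(1432, -3))   # negative delay: left 3 minutes early
```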
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data
The dataset "insurance.csv" consists of 1338 records of insurance contracts. The aim of this project is to build a model to predict the insurance costs.
Main variables are:
age: age of primary beneficiary
sex: gender of the insurance contractor (female, male)
bmi: Body mass index, an objective index of body weight (kg/m^2) based on the ratio of weight to height. It indicates whether a person's weight is relatively high or low for their height; ideally 18.5 to 24.
children: Number of children covered by health insurance / Number of dependents
smoker: smoking or non-smoking
region: the beneficiary's residential area in the US: northeast, southeast, southwest, or northwest
charges: Individual medical costs billed by health insurance
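Several of these variables (sex, smoker, region) are categorical, so they need numeric coding before they can enter a regression model for charges. The following is a minimal sketch of dummy (indicator) coding with the first level as the baseline. The region levels come from the list above; the smoker labels "no"/"yes" are an assumption about how the file codes that column, and the function name is hypothetical.

```python
def dummy_code(value, levels):
    """Return 0/1 indicators for every level except the first (the baseline)."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


REGION_LEVELS = ["northeast", "southeast", "southwest", "northwest"]
SMOKER_LEVELS = ["no", "yes"]  # assumed labels; check the actual file

# One hypothetical record turned into numeric model inputs.
record = {"age": 40, "bmi": 27.5, "smoker": "yes", "region": "southwest"}
features = (
    [record["age"], record["bmi"]]
    + dummy_code(record["smoker"], SMOKER_LEVELS)
    + dummy_code(record["region"], REGION_LEVELS)
)
print(features)
```

Dropping the baseline level avoids perfect collinearity with the intercept; each remaining indicator is then read as a difference from that baseline.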
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data
The dataset "supermarket sales.csv" contains the historical sales of a supermarket company, recorded in 3 different branches over 3 months. The aim of this project is to investigate customer satisfaction, based on the rating, across the different branches.
Main variables are:
Invoice id: Computer generated sales slip invoice identification number
Branch: Branch of the supercenter (3 branches, identified as A, B, and C)
Customer type: Type of customer, recorded as Member for customers using a member card and Normal for customers without one
Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
Unit price: Price of each product in US dollars
Quantity: Number of products purchased by customer
Total: Total price including tax
Rating: Customer satisfaction rating of the overall shopping experience (on a scale of 1 to 10)
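Comparing satisfaction across branches starts with a per-branch summary of Rating. The sketch below groups ratings by Branch and reports the count, mean, and sample standard deviation. The rows are made up and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def rating_summary(rows):
    """Count, mean, and sample standard deviation of Rating for each Branch."""
    by_branch = defaultdict(list)
    for row in rows:
        by_branch[row["Branch"]].append(row["Rating"])
    summary = {}
    for branch, ratings in by_branch.items():
        summary[branch] = {
            "n": len(ratings),
            "mean": mean(ratings),
            # stdev needs at least two values; report None otherwise.
            "sd": stdev(ratings) if len(ratings) > 1 else None,
        }
    return summary


# Made-up rows (NOT from the supermarket sales file).
rows = [
    {"Branch": "A", "Rating": 9.0},
    {"Branch": "A", "Rating": 7.0},
    {"Branch": "B", "Rating": 6.0},
    {"Branch": "B", "Rating": 8.0},
    {"Branch": "B", "Rating": 7.0},
    {"Branch": "C", "Rating": 5.0},
]

for branch, stats in sorted(rating_summary(rows).items()):
    print(branch, stats)
```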
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data
The dataset "OnlineNewsPopularity.xlsx" summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares on social networks (popularity).
Main variables are:
n_tokens_title: Number of words in the title
n_tokens_content: Number of words in the content
num_hrefs: Number of links
num_imgs: Number of images
num_videos: Number of videos
data_channel
weekday
global_subjectivity: Text subjectivity
global_rate_positive_words: Rate of positive words in the content
global_rate_negative_words: Rate of negative words in the content
shares: Number of shares (target)
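One simple way to screen numeric features before modeling is the Pearson correlation between each feature and the target, shares. The sketch below implements the coefficient with the standard library. The values are made up, the function name is hypothetical, and a correlation only captures linear association, so it is a first look rather than a model.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Made-up values (NOT from OnlineNewsPopularity.xlsx).
num_imgs = [0, 1, 2, 3, 4]
shares = [500, 900, 700, 1300, 1100]

print(pearson_r(num_imgs, shares))
```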
Part 1. Data Preprocessing
Part 2. Visualization and Descriptive Statistics
Part 3. Models and Analyzing data