
China University MOOC: Data Mining for Transportation Answers (Complete 2023 MOOC Answers)

Week 1. Introduction to Data Mining

Test 1

1、Which one is not a description of data mining?
A、Extraction of interesting patterns or knowledge
B、Explorations and analysis by automatic or semi-automatic means
C、Discover meaningful patterns from large quantities of data
D、Appropriate statistical analysis methods to analyze the data collected

2、Which one describes the right process of knowledge discovery?
A、Selection-Preprocessing-Transformation-Data mining-Interpretation/Evaluation
B、Preprocessing-Transformation-Data mining-Selection-Interpretation/Evaluation
C、Data mining-Selection-Interpretation/Evaluation-Preprocessing-Transformation
D、Transformation-Data mining-Selection-Preprocessing-Interpretation/Evaluation

3、Which one does not belong to the process of KDD?
A、Data mining
B、Data description
C、Data cleaning
D、Data selection

4、Which one is not the right alternative name of data mining?
A、Knowledge extraction
B、Data archeology
C、Data dredging
D、Data harvesting

5、Which one is a nominal variable?
A、Occupation
B、Education
C、Age
D、Color

6、Which one is wrong about classification and regression?
A、Regression analysis is a statistical methodology that is most often used for numeric prediction.
B、We can construct classification models (functions) without some training examples.
C、Classification predicts categorical (discrete, unordered) labels.
D、Regression models predict continuous-valued functions.

7、Which one is wrong about clustering and outliers?
A、Clustering belongs to supervised learning.
B、Principles of clustering include maximizing intra-class similarity and minimizing interclass similarity.
C、Outlier analysis can be useful in fraud detection and rare events analysis.
D、Outlier means a data object that does not comply with the general behavior of the data.

8、About data process, which one is wrong?
A、When making data discrimination, we compare the target class with one or a set of comparative classes (the contrasting classes).
B、When making data classification, we predict categorical labels excluding unordered one.
C、When making data characterization, we summarize the data of the class under study (the target class) in general terms.
D、When making data clustering, we would group data to form new categories.

9、Outlier mining such as the density-based method belongs to supervised learning.

10、Support vector machines can be used for classification and regression.

Week 2. Data Pre-processing

Test 2

1、Which is not the reason we need to preprocess the data?
A、to save time
B、to avoid unreliable output
C、to eliminate noise
D、to make result meet our hypothesis

2、How to construct new feature space by PCA?
A、New feature space by PCA is constructed by choosing the most important features you think.
B、New feature space by PCA is constructed by normalizing input data.
C、New feature space by PCA is constructed by selecting features randomly.
D、New feature space by PCA is constructed by eliminating the weak components to reduce the size of the data.

3、Which one is right about wavelet transforms?
A、Wavelet transforms store large fractions of the strongest of the wavelet coefficients.
B、Wavelet transforms are completely different from discrete Fourier transform (DFT).
C、It can be used for reducing data and smoothing data.
D、Wavelet transforms mean applying to pairs of data, resulting in two sets of data of length L.

4、Which one is wrong about methods for discretization?
A、Histogram analysis and Binning are both unsupervised methods.
B、Clustering analysis only belongs to top-down split.
C、Interval merging by χ² analysis can be applied recursively.
D、Decision-tree analysis is Entropy-based discretization.

5、Which one is wrong about Equal-width (distance) partitioning and Equal-depth (frequency) partitioning?
A、Equal-width partitioning is the most straightforward, but outliers may dominate presentation.
B、Equal-depth partitioning divides the range into N intervals, each containing approximately same number of samples.
C、The interval of the former one is not equal.
D、The number of tuples is the same when using the latter one.

6、Which one is the wrong way to normalize data?
A、Min-max normalization
B、Simple scaling
C、Z-score normalization
D、Normalization by decimal scaling

7、Which are the major tasks in data preprocessing?
A、Cleaning
B、Integration
C、Transition
D、Reduction

8、Which are the right ways to fill in missing values?
A、Smart mean
B、Probable value
C、Ignore
D、Falsify

9、Which are the right ways to handle noisy data?
A、Regression
B、Cluster
C、WT
D、Manual

10、Which are the commonly used ways of sampling?
A、Simple random sample without replacement
B、Simple random sample with replacement
C、Stratified sample
D、Cluster sample

11、Discretization means dividing the range of a continuous attribute into intervals.

Assignment 2

1、Suppose you obtained a dataset which has some missing values, how will you deal with these missing values?

2、Given the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. (a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0]. (b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years. (c) Use normalization by decimal scaling to transform the value 35 for age. (d) Comment on which method you would prefer to use for the given data, giving reasons as to why.
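
Not part of the original assignment, but a minimal Python sketch of the three normalization formulas applied to this age data (the variable names are illustrative):

```python
import numpy as np

age = np.array([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
                30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])
v = 35  # value to transform

# (a) Min-max normalization onto [0.0, 1.0]
min_max = (v - age.min()) / (age.max() - age.min())

# (b) Z-score normalization (the assignment gives sigma = 12.94)
z_score = (v - age.mean()) / 12.94

# (c) Decimal scaling: divide by 10**j, where j is the smallest integer
#     such that all scaled values fall below 1; here max(age) = 70, so j = 2
decimal = v / 10 ** 2

print(min_max, z_score, decimal)
```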

Week 3. Instance-based Learning

Test 3

1、What's the difference between eager learner and lazy learner?
A、Eager learners would generate a model for classification while lazy learner would not.
B、Eager learners classify the tuple based on its similarity to the stored training tuple while lazy learner not.
C、Eager learners simply store data (or does only a little minor processing) while lazy learner not.
D、Lazy learner would generate a model for classification while eager learner would not.

2、How to choose the optimal value for K?
A、Cross-validation can be used to determine a good value by using an independent dataset to validate the K values.
B、Low values for K (like k=1 or k=2) can be noisy and subject to the effect of outliers.
C、A large k value can reduce the overall noise so the value for 'k' can be as big as possible.
D、Historically, the optimal K for most datasets has been between 3-10.
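
For the cross-validation approach to choosing k mentioned in question 2, here is a minimal scikit-learn sketch (scikit-learn and the synthetic data are assumptions, not part of the course material):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for an independent validation data set
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Score each candidate k with 5-fold cross-validation and keep the best one
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 11)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```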

3、What are the major components in KNN?
A、How to measure similarity?
B、How to choose "k"?
C、How are class labels assigned?
D、How to decide the distance?

4、Which the following ways can be used to obtain attribute weight for Attribute-Weighted KNN?
A、Prior knowledge / experience.
B、PCA, FA (Factor analysis method)
C、Information gain.
D、Gradient descent, simplex methods and genetic algorithm.

5、At the learning stage, KNN would find the K closest neighbors and then decide the class from the K identified nearest labels.

6、At the classification stage, KNN would store all instances or some typical ones.

7、Normalizing the data can solve the problem that different attributes have different value ranges.

8、By Euclidean distance or Manhattan distance, we can calculate the distance between two instances.

9、Data normalization before measuring distance is generally done to avoid errors caused by different dimensions, self-variations, or large numerical differences.

10、The way to obtain the regression for a new instance from the k nearest neighbors is to calculate the average value of k neighbors.

11、The way to obtain the classification for a new instance from the k nearest neighbors is to calculate the majority class of k neighbors.

12、The way to obtain instance weight for Distance-Weighted KNN is to calculate the reciprocal of the distance squared between object and neighbors.

Assignment 3

1、You are required to build a KNN model with the given data sets. In a section of the highway, 19 sensors are set up to collect the speed and volume of vehicles at each point. The travel time required to pass this section is also captured, so each instance contains 41 attributes: serial number, time tag, speed and volume at 19 positions, and travel time. There are 1605 instances in total. We generated 5 files in xlsx format with different random sample proportions. Four of the files consist of two sheets: the sheet named ‘train’ is the training set and ‘test’ is the testing set. One data set consists of one sheet named ‘cv-data’ generated by cross-validation. You need to finish the following tasks.
Input: Speed1, volume1, speed2, volume2, speed3, volume3, …, speed19, volume19
Output: Travel time
Note: 1) Different attributes have different value ranges, so normalization needs to be done before distances are calculated. 2) In tasks 3 and 4, you are required to apply DW-KNNA (the distance-weighted K-nearest neighbor algorithm) to predict travel time. The definition of DW-KNNA can be found in the following paper: [1] Song J, Zhao J, Dong F, et al. A Novel Regression Modeling Method for PMSLM Structural Design Optimization Using a Distance-Weighted KNN Algorithm[J]. IEEE Transactions on Industry Applications, 2018, 54(5): 4198-4206.
Task: (1) Use different k values (k=3, 4, 6), the number of neighbors, to predict travel time. How about their prediction accuracy? (Use the data sets from the file named 60% for training and 40% for testing_KNN.xlsx.)
(2) Use hold-one-out cross-validation to select the best k value. What is it? Plot the scatter diagram of predicted travel time against measured travel time. (Use the data set from the file named hold-one-out_cv_KNN.xlsx.)
(3) Use different k values (k=3, 4, 6), the number of neighbors, to predict travel time. How about their prediction accuracy when using DW-KNNA? (Use the data sets from the file named 60% for training and 40% for testing_DW-KNNA.xlsx.)
(4) Use different proportions of the training set (60%, 70%, 80%). How about the prediction accuracy when using DW-KNNA (k=10)? (Use the data sets from the files named 60% for training and 40% for testing_DW-KNNA.xlsx, 70% for training and 30% for testing_DW-KNNA.xlsx, and 80% for training and 20% for testing_DW-KNNA.xlsx.)
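
Not the official solution, but a minimal pandas/scikit-learn sketch for tasks (1) and (3). The sheet names ‘train’/‘test’ come from the assignment, while the column name 'travel-time' and the inverse-squared-distance weight function are assumptions (the exact DW-KNNA scheme should follow the cited paper):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_percentage_error

# Column names below are assumptions; adjust to the actual xlsx layout.
fname = "60% for training and 40% for testing_KNN.xlsx"
train = pd.read_excel(fname, sheet_name="train")
test = pd.read_excel(fname, sheet_name="test")
x_cols = [c for c in train.columns if c.lower().startswith(("speed", "volume"))]

# Normalize first, because the attributes have different value ranges.
scaler = MinMaxScaler().fit(train[x_cols])
X_train, X_test = scaler.transform(train[x_cols]), scaler.transform(test[x_cols])

# Weight each neighbor by the reciprocal of its squared distance (distance-weighted KNN).
inv_sq = lambda d: 1.0 / (d ** 2 + 1e-9)

for k in (3, 4, 6):
    knn = KNeighborsRegressor(n_neighbors=k, weights=inv_sq)
    knn.fit(X_train, train["travel-time"])
    pred = knn.predict(X_test)
    print(k, mean_absolute_percentage_error(test["travel-time"], pred))
```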

Week 4. Decision Trees

Test 4

1、Which description is right about nodes in decision tree?
A、Internal nodes test the value of particular features
B、Leaf nodes specify the class
C、Branch nodes decide the result
D、Root nodes decide the start point

2、Computing information gain for a continuous-valued attribute when using ID3 consists of the following procedure:
A、Sort the value A in increasing order.
B、Consider the midpoint between each pair of adjacent values as a possible split point.
C、Select the minimum expected information requirement as the split-point.
D、Split

3、Which is the typical algorithm to generate trees?
A、ID3
B、C4.5
C、CART
D、PCA

4、Which one is right about underfitting and overfitting?
A、Underfitting means poor accuracy both for training data and unseen samples.
B、Overfitting means high accuracy for training data but poor accuracy for unseen samples.
C、Underfitting implies the model is too simple that we need to increase the model complexity.
D、Overfitting occurs when there are too many branches, so we need to decrease the model complexity.

5、Which one is right about pre-pruning and post-pruning?
A、Both of them are methods to deal with overfitting problem.
B、Pre-pruning does not split a node if this would result in the goodness measure falling below a threshold.
C、Post-pruning removes branches from a “fully grown” tree.
D、There is no need to choose an appropriate threshold when making pre-pruning.

6、Post-pruning in CART consists of the following procedure:
A、First, consider the cost complexity of a tree.
B、Then, for each internal node, N, compute the cost complexity of the subtree at N.
C、And also compute the cost complexity of the subtree at N if it were to be pruned.
D、At last, compare the two values. If pruning the subtree at node N would result in a smaller cost complexity, the subtree is pruned. Otherwise, the subtree is kept.

7、The cost complexity pruning algorithm used in CART evaluates cost complexity by the number of leaves in the tree and the error rate.

8、Gain ratio is used as attribute selection measure in C4.5 and the formula is GainRatio(A) = Gain(A)/ SplitInfo(A).

9、A rule is created for each path from the root to its leaf nodes.

10、ID3 uses information gain as its attribute selection measure, and the attribute with the lowest information gain is chosen as the splitting attribute for node N.

Assignment 4

1、Calculation of Information Gain of a Traffic Conflict Problem. The questions can be seen in file Assignment 4.docx.
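
The .docx questions are not included here, but the following is a minimal Python sketch of the entropy and information-gain calculation that ID3 uses (the toy conflict data is illustrative, not the assignment data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class proportions."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, attribute_index):
    """Gain(A) = Info(D) - sum_j (|D_j|/|D|) * Info(D_j), splitting on attribute A."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    info_a = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a

# Toy example: attribute 0 = weather, labels = conflict / no conflict
rows = [("rain",), ("rain",), ("clear",), ("clear",), ("clear",)]
labels = ["conflict", "conflict", "no", "no", "conflict"]
print(info_gain(rows, labels, 0))
```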

Week 5. Support Vector Machine

Test 5

1、What is the feature of SVM?
A、Extremely slow, but are highly accurate.
B、Much less prone to overfitting than other methods.
C、Black box model.
D、Provide a compact description of the learned model.

2、Which is the typical common kernel?
A、Linear
B、Polynomial
C、Radial basis function (Gaussian kernel)
D、Sigmoid kernel

3、What adaptations can be made to allow SVM to deal with Multiclass Classification problem?
A、One versus rest (OVR)
B、One versus one (OVO)
C、Error correcting input codes (ECIC)
D、Error correcting output codes (ECOC)

4、What is the problem of OVR?
A、Sensitive to the accuracy of the confidence figures produced by the classifiers.
B、The scale of the confidence values may differ between the binary classifiers.
C、The binary classification learners see unbalanced distributions.
D、Only when the class distribution is balanced can balanced distributions be attained.

5、Which one is right about the advantages of SVM?
A、They are accurate in high-dimensional spaces.
B、They are memory efficient.
C、The algorithm is not prone for over-fitting compared to other classification method.
D、The support vectors are the essential or critical training tuples.

6、The kernel trick is used to avoid costly computation and deal with mapping problems.

7、There is no structured way and no golden rules for setting the parameters in SVM.

8、Error correcting output codes (ECOC) is a kind of problem transformation techniques.

9、Regression formulas include three types: linear, nonlinear, and general form.

10、If you have a big dataset, SVM is suitable for efficient computation.

Assignment 5

1、SVM for incident duration prediction. In this exercise, you are required to use support vector machines (SVMs) to predict incident duration. Through this assignment you can gain a deeper understanding of how to use SVMs. The data comes from the national incident management center for towing operations. These data were provided by towing officers, police, and Rijkswaterstaat road inspectors who perform incident handling. The data was collected from 1st May to 13th September 2005 in the region of Utrecht. You can find 1853 incident registrations in total in incidentduration.csv. Test_set.csv extracts 50% of the data and is used to test the SVM, while train_set.csv contains the remainder and is used to train the SVM.
Attributes are as follows: 1. Incident type (Stopped vehicle, lost load, accident); 2. Kind of vehicles involved (passenger cars, trucks, N/A); 3. Police required (yes, no); 4. Track research (yes, no); 5. Ambulance required (yes, no); 6. Fire brigades required (yes, no); 7. Repair service required (yes, no); 8. Tow truck required (yes, no); 9. Road inspector (yes, no); 10. Lane closer (yes, no); 11. Road repair required (yes, no); 12. Fluid to be cleaned (yes, no); 13. Damage of road equipment (yes, no); 14. Number of vehicles involved (single, two, more); 15. Type bergings task (onb, CMI, CMV); 16. By the week (workdays, weekend); 17. Start and end time (during peak hour, off peak hour); 18. Duration (short, long).
Build prediction models with SVM to complete the following tasks.
A. Use train_set.csv as the training data set to build a model, and test on test_set.csv. Report the training accuracy and test accuracy you get. (Suggestion: in data preprocessing, you can handle nominal independent variables with a one-hot representation, so that the corresponding value of each feature is guaranteed to be 0 or 1. This is easy to implement with the get_dummies(data) function in the pandas package.)
B. Use incidentduration.csv to build a model using 10-fold cross-validation. Report the training accuracy and test accuracy you get.
C. Build a prediction model again after feature reduction (keep 80% of the variance). Report the training accuracy and test accuracy you get.
D. Which model gives the highest accuracy on the test set? Why? Give your explanation.
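
Not the official solution, but a minimal scikit-learn sketch of parts A–C; the target column name "Duration" and the RBF kernel are assumptions:

```python
import pandas as pd
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train_set.csv")
test = pd.read_csv("test_set.csv")

# A. One-hot encode the nominal attributes, fit an SVM, report train/test accuracy
X_train = pd.get_dummies(train.drop(columns=["Duration"]))
X_test = pd.get_dummies(test.drop(columns=["Duration"])).reindex(columns=X_train.columns, fill_value=0)
clf = SVC(kernel="rbf").fit(X_train, train["Duration"])
print("train acc:", clf.score(X_train, train["Duration"]),
      "test acc:", clf.score(X_test, test["Duration"]))

# B. 10-fold cross-validation on the full data set
full = pd.read_csv("incidentduration.csv")
X_full = pd.get_dummies(full.drop(columns=["Duration"]))
print("10-fold CV acc:", cross_val_score(SVC(kernel="rbf"), X_full, full["Duration"], cv=10).mean())

# C. Feature reduction with PCA, keeping 80% of the variance, then refit
X_reduced = PCA(n_components=0.80).fit_transform(X_full)
print("CV acc after PCA:", cross_val_score(SVC(kernel="rbf"), X_reduced, full["Duration"], cv=10).mean())
```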

Week 6. Outlier Mining

Test 6

1、Which description is right to describe outliers?
A、Outliers caused by measurement error
B、Outliers reflecting ground truth
C、Outliers caused by equipment failure
D、Outliers needed to be dropped out always

2、What is an application case of outlier mining?
A、Traffic incident detection
B、Credit card fraud detection
C、Network intrusion detection
D、Medical analysis

3、Which one is the method to detect outliers?
A、Statistics-based approach
B、Distance-based approach
C、Bulk-based approach
D、Density-based approach

4、How to pick the right k by a heuristic method for the density-based outlier mining method?
A、K should be at least 10 to remove unwanted statistical fluctuations.
B、Pick 10 to 20 appears to work well in general.
C、Pick the upper bound value for k as the maximum of “close by” objects that can potentially be global outliers.
D、Pick the upper bound value for k as the maximum of “close by” objects that can potentially be local outliers.

5、Which one is right about three methods of outlier mining?
A、Statistics-based approach is simple and fast but difficult to deal with periodicity data and categorical data.
B、The efficiency of the distance-based approach is low for large data sets in high-dimensional space.
C、Distance-based approach cannot be used in multidimensional data set.
D、Density-based approach spends low cost on searching neighborhood.

6、Distance-based outlier mining is not suitable for data sets that do not fit any standard distribution model.

7、The statistics-based method requires knowing the distribution of the data and the distribution parameters in advance.

8、When identifying outliers with a discordancy test, the data point is considered as an outlier if it falls within the confidence interval.

9、Mahalanobis Distance accounts for the relative dispersions and inherent correlations among vector elements, which is different from Euclidean Distance.

10、An outlier is a data object that deviates significantly from the rest of the objects, as if it were generated by a different mechanism.

Assignment 6

1、You are required to use outlier mining methods to detect the outliers in the given data sets. In a section of a city road, several cameras are set up to collect vehicle plates from 2017-06-09 to 2017-06-12, as well as the date and time of passing the start point and the finish point. Travel time is calculated later. The time serial is another form of transformation of the start time. So each instance contains 8 attributes: serial number, license plate number, date and time passing the start/end point, time serial, and travel time. There are 4977 instances in total. You need to finish the following tasks.
Task: (1) Use the statistic-based approach to detect the outliers of travel time. Calculate the mean value and the variance of travel time. Write out the confidence interval. Take the time serial as the X-axis and the travel time as the Y-axis. Plot the scatter diagram and mark the outliers you have recognized.
(2) Use the distance-based approach to detect the outliers of travel time. An object o in a data set D is defined as an outlier with parameters r and π, written DB(r, π), if the fraction of the objects in D that lie at a distance less than r from o is less than π. Let parameter r vary from 0.1 to 0.3 with a step of 0.1, and π vary from 30 to 90 with a step of 30; find the outliers and the number of outliers. You can use the Euclidean distance.
(3) Use the density-based approach to detect the outliers of travel time. With different k (from 3 to 400 with a step of 5), the number of neighbors, calculate the LOF for each data point. Set 2.0 as a threshold for the LOF: an object is labeled as an outlier if its LOF exceeds 2.0. First, take the k value as the X-axis and the number of outliers as the Y-axis, and plot the line chart. Second, calculate the LOF for each data point and give the top 4 outliers, using k=350 and the Euclidean distance.
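
Not the official solution, but a minimal sketch of the statistic-based and density-based (LOF) tasks using scikit-learn; the input file name is hypothetical, and the 3-sigma rule stands in for whatever confidence interval the course expects:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical file holding one travel-time value per row
travel_time = np.loadtxt("travel_time.csv", delimiter=",").reshape(-1, 1)

# (1) Statistic-based: flag points outside mean +/- 3 standard deviations
mean, std = travel_time.mean(), travel_time.std()
stat_outliers = np.abs(travel_time - mean) > 3 * std
print("statistic-based outliers:", int(stat_outliers.sum()))

# (3) Density-based: LOF with k = 350 neighbors; LOF > 2.0 marks an outlier
lof = LocalOutlierFactor(n_neighbors=350, metric="euclidean")
lof.fit(travel_time)
scores = -lof.negative_outlier_factor_  # scikit-learn stores the negated LOF
print("number of LOF outliers:", int((scores > 2.0).sum()))
print("top 4 outlier indices:", np.argsort(scores)[-4:][::-1])
```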

Week 7. Ensemble learning

Test 7

1、How to deal with imbalanced data in 2-class classification?
A、Oversampling
B、Undersampling
C、Threshold-moving
D、Ensemble techniques

2、Which one is right when dealing with the class-imbalance problem?
A、Oversampling works by decreasing the number of minority positive tuples.
B、Undersampling works by increasing the number of majority negative tuples.
C、Smote algorithm adds synthetic tuples that are close to the minority tuples in tuple space.
D、Threshold-moving and ensemble methods were empirically observed to outperform oversampling and undersampling.

3、Which step is necessary when constructing an ensemble model?
A、Creating multiple data set
B、Constructing a set of classifiers from the training data
C、Combining predictions made by multiple classifiers to obtain final class label
D、Find the best performing predictions to obtain final class label

4、Ensembles tend to yield better results when there is a significant diversity among the base models.

5、Ensemble methods cannot be parallelized because not every base classifier can be allocated to a different CPU.

6、To generate single classifiers, different models may be used to deal with different data subsets.

7、In random forests, a random selection of attributes is used at each node to determine the split.

8、Forest RI creates new attributes that are a linear combination of the existing attributes.

9、The principle of threshold-moving is to move the decision threshold so that the rare class tuples are easier to classify.

10、Neural network classifiers can be used as classifiers in the threshold-moving approach.

Assignment 7

1、Suppose you have trained three support vector machines, h1, h2, and h3, returning binary classifications (+1 or −1). The observed accuracy of each of the hypotheses is 70%. Assuming that the errors of these hypotheses are independent, what is the predicted accuracy of an ensemble hypothesis H?
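
Not part of the original question, but a minimal sketch of the standard majority-vote calculation under the stated independence assumption:

```python
from math import comb

p = 0.7  # accuracy of each individual hypothesis
# The majority vote of 3 independent classifiers is correct when at least 2 are correct
ensemble_acc = sum(comb(3, k) * p ** k * (1 - p) ** (3 - k) for k in (2, 3))
print(ensemble_acc)  # 3*0.49*0.3 + 0.343 = 0.784
```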

2、Suppose you have done one iteration of Adaboost to produce classifier h1 and find that h1 has the following results on the training data. (Assume the initial weights on the training data are uniform.)

Coursework

Analysis of Driving Behavior

1、In this coursework, you are required to use data mining techniques to study abnormal driving behavior. Please download the attachment and read the detailed information about the coursework in the coursework.docx file. You need to choose one of task 1 and task 2 to complete, and then choose one of task 3 and task 4. We hope you gain a good understanding after finishing this course.

China University MOOC: Data Mining for Transportation

Data mining is an important tool in today's information age. By collecting, analyzing, and interpreting large amounts of data, it can uncover underlying patterns and regularities. Data mining is widely applied in many fields, including transportation.

To better apply data mining techniques to problems in transportation, Chinese universities have launched a research project named "Data Mining for Transportation." The project aims to use data mining techniques to address key problems in the field, such as traffic congestion and traffic accidents.

Data Collection

The first step of data mining is data collection. In transportation, data can be collected through various sensors and monitoring devices, such as traffic signals, traffic cameras, and in-vehicle sensors. These devices can collect many kinds of traffic data, such as vehicle counts, speed, and direction.

In addition, transportation-related data can be collected through social media, mobile devices, and other internet applications. Such data may include traffic incidents, road-condition information, and so on.

Data Preprocessing

The collected data also needs some preprocessing to make it suitable for data mining analysis. This preprocessing includes data cleaning, missing-value handling, and data transformation.

Data cleaning means removing noise and invalid records from the data. Missing-value handling means dealing with missing data points, typically by filling them in with interpolation techniques. Data transformation means converting the data's format or normalizing it for subsequent data mining analysis.
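
A minimal pandas sketch of these three preprocessing steps (the column names and thresholds are illustrative, not from the article):

```python
import pandas as pd

# Hypothetical traffic data with noise and gaps
df = pd.DataFrame({"speed": [55.0, 54.0, None, 300.0, 52.0],
                   "volume": [120, 118, 121, None, 119]})

# Cleaning: treat physically impossible speed readings as invalid
df.loc[df["speed"] > 200, "speed"] = None

# Missing-value handling: fill gaps by linear interpolation
df = df.interpolate()

# Transformation: min-max normalize each column to [0, 1]
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
```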

Data Analysis

After preprocessing, data mining analysis can begin. Data mining techniques include clustering analysis, classification analysis, regression analysis, and association rule mining.

Clustering analysis divides the data in a data set into groups, where the data within each group share similar characteristics. Classification analysis assigns the data in a data set to categories, where the data within each category share similar attributes. Regression analysis predicts unknown values based on known data. Association rule mining discovers relationships and patterns among different attributes in a data set.

Application Scenarios

Data mining techniques can be applied to many scenarios in transportation, including the following:

  • Traffic prediction: by analyzing historical traffic data, future traffic conditions can be predicted to support traffic scheduling and planning.
  • Traffic monitoring: by monitoring traffic data in real time, problems such as congestion and accidents can be detected and handled promptly.
  • Route optimization: by analyzing traffic conditions on different routes, the optimal route can be found to reduce travel time and cost.
  • Traffic safety: by analyzing accident data, the patterns and causes of traffic accidents can be identified so that preventive measures can be taken.

Conclusion

The application of data mining techniques in transportation has great potential. By collecting, analyzing, and interpreting large amounts of traffic data, underlying patterns and regularities can be discovered to support better traffic planning and management. The "Data Mining for Transportation" project of Chinese universities provides important support for advancing the transportation field.