# MatthewElkherjDiary

Check out:

- MatthewElkherjDiary/Winter2012LogRatio, TwitterMining, TrackingElections2012 for notes on processing Twitter data.
- GordonNotes, Setup_On_Gordon_Node for notes about the Gordon supercomputer.
- PI_Data_Notes for notes about processing the PI dataset.

# Education Project

## Predicting Correct Answers

### Motivation/Applications

The ideal problems to assign to students are ones that challenge but don't overwhelm. Predicting whether a student will be able to answer a particular question correctly, and how long it will take the student, can be seen as fundamentally important problems in online education. If prediction were accurate enough, just the right questions could be asked in the right order.

Specific Applications:

- Hints, easier problems, sections of the book, or the solution to a question part could be given if it is unlikely the student will give a correct answer in a reasonable time-frame.
- For motivation, students could be shown expected performance on exams/quizzes. Quizzes could be a randomly held-out selection of homework problems.
- Assuming prediction is done based on word-content of questions, potentially difficult topics could be identified.

### Correctness Prediction

78,994 total attempts at problem parts were made by all users. This is consistent with a rough estimate of (~7 parts/problem)(~100 problems)(~100 users) = 70,000.

The following results use very simple bit-predictors to predict student success/failure at answering some future question. The bit-predictor keeps track of the number of correct and incorrect attempts seen so far, and simply predicts the more likely of these two outcomes.

Grouping improves performance, because the probability that a question attempt is successful should vary across groups. More difficult questions will have lower attempt success probabilities, and stronger students will have higher success probabilities.

Correctness prediction accuracy:

- All True: 0.539
- Online bit predictor: 0.539
- Segmenting by question parts: 0.695
- Segmenting by users: 0.585

Applying a decay factor of gamma = 0.95 to the counts improves results for the baseline and per-user bit predictors, but not for the per-question-part predictor (0.695 to 0.692):

- Online bit predictor: 0.601
- Segmenting by question parts: 0.692
- Segmenting by users: 0.633
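As a rough illustration of the predictors described above, here is a minimal sketch of an online bit predictor with an optional decay factor and an optional grouping key. The class and function names, the (user, part, outcome) tuple layout, and the tie-breaking rule (predict "correct" on a tie or empty history) are assumptions made for this sketch, not the code that produced the numbers above.

```python
from collections import defaultdict


class BitPredictor:
    """Online majority predictor over a stream of True/False outcomes."""

    def __init__(self, gamma=1.0):
        self.gamma = gamma      # decay factor; 1.0 means no decay
        self.correct = 0.0      # (decayed) count of correct attempts
        self.incorrect = 0.0    # (decayed) count of incorrect attempts

    def predict(self):
        # Assumption: ties, including an empty history, predict "correct".
        return self.correct >= self.incorrect

    def update(self, outcome):
        # Decay both counts, then add the new observation.
        self.correct *= self.gamma
        self.incorrect *= self.gamma
        if outcome:
            self.correct += 1.0
        else:
            self.incorrect += 1.0


def online_accuracy(attempts, key=lambda attempt: None, gamma=1.0):
    """Predict each attempt before seeing it, then update that group's predictor.

    attempts: iterable of (user, part, outcome) tuples in time order.
    key: maps an attempt to its group; the default uses one global group.
    """
    predictors = defaultdict(lambda: BitPredictor(gamma))
    hits = total = 0
    for attempt in attempts:
        predictor = predictors[key(attempt)]
        outcome = attempt[2]
        hits += predictor.predict() == outcome
        predictor.update(outcome)
        total += 1
    return hits / total if total else 0.0
```

Segmenting by question part or by user then corresponds to passing `key=lambda a: a[1]` or `key=lambda a: a[0]`, and the decayed variant to passing `gamma=0.95`.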

Since we're computing accuracy over n = 78,994 examples, the standard deviation of an accuracy estimate is at most 1/(2*sqrt(n)) = 0.00178, so a difference larger than about (#stds)*(std) = 3 * 0.00178 = 0.00534 is statistically significant. This is not a rigorous analysis of statistical significance: independence doesn't hold for predicting student success. It gives a general order of magnitude though, and is somewhat convincing that a difference of ~0.10 is a very significant improvement over a baseline predictor.
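The arithmetic behind this threshold can be checked with a few lines of Python. This is only a sketch of the bound used above (the standard deviation of a Bernoulli mean is largest at p = 0.5); the function name is made up for the example, and the same independence caveat applies.

```python
import math


def accuracy_std_bound(n):
    """Upper bound on the std of an accuracy estimated from n independent trials.

    sqrt(p * (1 - p) / n) is maximized at p = 0.5, giving 1 / (2 * sqrt(n)).
    """
    return 1.0 / (2.0 * math.sqrt(n))


n = 78994
std = accuracy_std_bound(n)
threshold = 3 * std  # three standard deviations
print(f"std bound = {std:.5f}, 3-std threshold = {threshold:.5f}")
```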

Potential Improvements:

- Time Locality. Tune gamma, have predictors trained on multiple different values of gamma vote.
- In addition to grouping by user/question parts, group by question topics/words used in questions.
- Combine predictions from multiple groupings. Eg: an estimate of the probability user U answers question part Q should be based on a history of Q's difficulty, and U's history at successfully answering questions.

# PI Data Processing Fall/Winter/Spring 2013

## Processing Using Spark

The lines of all tag time series data (in ASCII form) were counted using the Spark Scala REPL. These lines (~300GB) were cached in memory, and repeatedly counted.

As expected, performance improved drastically after the first count. This is because the first count required moving data from disk into memory.

The times to count all lines (in seconds) were:

196.684725583 90.055808169 91.879617186 78.027200801 73.516663513 76.215198276 80.594169574 80.031273924

And the results of the computation:

res1: Long = 33615367

We ran the same experiment again:

317.98181764 ~138 (missed this time) 119.08671908 101.012987785

lines.filter(line => line.contains("EBU3B")).filter(line => line.contains("kW_tot")).count()

113.717242766 s res7: Long = 11910

We redid the same experiments with PySpark. The line count times (in seconds) for oledb_tag_aggregate were:

317.25871489 364.827935057 ~390

332.642289312 283.624932272 287.766959389

To ensure it wasn't a cluster issue (other jobs were being run), I re-ran the analysis using the spark-shell, and obtained results similar to the earlier ones (the time for count() went down to 99.509759298 seconds). It appears that PySpark run from IPython is significantly slower, by roughly a factor of three on these counts.
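For reference, the repeated-count timing described above might look roughly like this in PySpark. This is a sketch, not the script that was actually run: `time_repeated` and `main` are names invented here, and the input path and master URL are left as parameters.

```python
import time


def time_repeated(action, repeats=3):
    """Run action() `repeats` times; return (last result, list of wall-clock seconds)."""
    times = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        times.append(time.perf_counter() - start)
    return result, times


def main(path, master):
    # Imported lazily so time_repeated can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(master, "line-count-timing")
    # cache() is lazy: the first count() reads from disk and fills the cache,
    # so later counts should be faster.
    lines = sc.textFile(path).cache()
    count, times = time_repeated(lines.count, repeats=8)
    print(count, times)
    sc.stop()
```

Timing the action from the driver like this includes scheduling overhead, which is what the numbers above measure as well.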

Running the spark shell: MASTER=spark://ion-21-14.sdsc.edu:7077 ./spark-shell

Preprocessing the data and loading to memory: 482.624353031 seconds

Result: 51551214

## Newton's Cooling Analysis

A description of the model is available in Yogesh's_Diary.

We tested a generalized version of this model, allowing any linear relationship of the form <slope of inside temp> = alpha + beta * ( <outside temp> - <inside temp> ).

We approximated the slope of inside temperature at minute i during a day by the least-square best-fit line over minutes i:i+10.

To elaborate:

- We compared the EBU3B outside temperature to the inside temperature of various rooms across campus. Our goal was to test whether the slope of inside temperatures during times with no AC is proportional to the difference between inside and outside temperatures.
- We tested for any linear relationship between inside slope and the inside/outside temperature difference.
- For every building/room, for every date, we selected minutes of the day that had an air flow of at most 20. This threshold was chosen by eyeballing some air-flow plots, and selecting a threshold to discriminate on/off.

Slope was approximated by a sliding window over 10-minute intervals.
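The slope estimate and the linear fit can be sketched in plain Python as follows. Several details are assumptions made for the sketch: each window uses the 10 samples at minutes i..i+9, the air-flow filter is applied only at the window's first minute, and all function and parameter names are invented here.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def window_slopes(inside, window=10):
    """Best-fit slope of inside temperature over each run of `window` minutes."""
    minutes = list(range(window))
    return [least_squares_line(minutes, inside[i:i + window])[1]
            for i in range(len(inside) - window + 1)]


def fit_cooling(inside, outside, airflow, max_airflow=20, window=10):
    """Fit slope = alpha + beta * (outside - inside) over low-air-flow minutes.

    Returns (alpha, beta). Assumption: a minute is kept when the air flow at
    the start of its window is at most max_airflow.
    """
    diffs, kept = [], []
    for i, slope in enumerate(window_slopes(inside, window)):
        if airflow[i] <= max_airflow:
            diffs.append(outside[i] - inside[i])
            kept.append(slope)
    return least_squares_line(diffs, kept)
```

Under Newton's law of cooling proper, alpha would be zero and beta positive; the generalized fit lets the data decide.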

### RESULTS, filtering by airflow <= 20

- Rooms without windows: beta mean=1.893e-04, std=9.850e-05
- Rooms with windows: beta mean=3.816e-04, std=1.583e-04
- window/non-window t-test p-value: 2.084e-03
- The number of non-negative betas is >=24 standard deviations above the expected number of non-negative betas, assuming equal distributions of positive/negative betas
- 797 out of 1114 p-values are < 10^-6

### RESULTS, filtering by airflow >= 20

- Rooms without windows: beta mean=2.626e-04, std=1.409e-04
- Rooms with windows: beta mean=1.468e-03, std=6.453e-03
- window/non-window t-test p-value: 7.234e-01
- The number of non-negative betas is >=15 standard deviations above the expected number of non-negative betas, assuming equal distributions of positive/negative betas
- 710 out of 1119 p-values are < 10^-6
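The "standard deviations above the expected number" figures in both result lists are a sign test. A minimal sketch of that calculation, assuming each beta is independently non-negative with probability 1/2 under the null (the function name and the example counts are illustrative, not the counts from this analysis):

```python
import math


def sign_test_z(num_nonnegative, num_total):
    """Standard deviations by which the non-negative count exceeds num_total / 2.

    Under the null, the count is Binomial(num_total, 0.5), with mean
    num_total / 2 and standard deviation sqrt(num_total) / 2.
    """
    expected = num_total / 2.0
    std = math.sqrt(num_total) / 2.0
    return (num_nonnegative - expected) / std


# Hypothetical example: 75 non-negative betas out of 100.
z = sign_test_z(75, 100)
```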

Some documentation of the Pearson's r^2 p-values is available at: http://stackoverflow.com/questions/13653951/pearson-correlation-coefficient-2-tailed-p-value-meaning

There is a weak but statistically significant difference (p=0.002) between the beta values (strength of outside/inside temperature relationship) for rooms with and without windows. Note: only rooms with Pearson's p<10^-9 were considered. These are the rooms with the strongest linear relationship.

It is easy to see this difference visually:

## Jump Analysis over all Tags

The overarching goal here is to cluster tags by the times and days spikes (jumps) are seen in their time series.

### A First Approach: Correlation between "Jump Days"

For the first iteration of this method, we only considered whether a jump occurred on some day or not when clustering tags. We selected tags that have jumps on 150-200 days: if jumps happened on every day, or no days, there is less information content in the co-occurrence of two tags' jumps.

For every tag, define the "jump dates" as the dates on which we detect some 3-standard-deviation jump. We computed the similarity between two tags as the Jaccard similarity between their sets of jump dates (1.0 means the two tags jump on exactly the same dates).

We sorted pairs of tags by this jump date similarity metric. Some examples of tags with high jump date similarity:

- NAE-15_SAS.N2-2.RM-317.VAV3-7.VMA-011.SUP-FLOW.SUPPLY_AIR_FLOW / NAE-15_SAS.N2-2.RM-301.VAV3-2.VMA-006.SUP-FLOW.SUPPLY_AIR_FLOW: Similarity = 1.000000
- NAE-15_SAS.N2-2.RM-205.VAV3-37.VMA-041.ZN-T.ZONE_TEMPERATURE / NAE-15_SAS.N2-2.RM-329.VAV3-10.VMA-014.AHTG-STPT.ACTUAL_HEATING_SETPOINT: Similarity = 0.896104
- NAE-23_MUSIC.FC-1.MTWP4-VFD.VFD-FC1-10680.FREQUENCY.OUTPUT_FREQUENCY / NAE-08_PHARM.N2-2.PHARM-2ND_FL-RM-2180-CV-2S31.VMA-136.SUPFLOW.SUPPLY_AIR_FLOW: Similarity = 0.405882
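The similarity and ranking described above can be sketched as follows, assuming the jump dates for each tag are available as a dict from tag name to a set of dates. The function names and that data layout are assumptions for the example.

```python
from itertools import combinations


def jaccard_similarity(dates_a, dates_b):
    """|A intersect B| / |A union B| for two collections of jump dates."""
    a, b = set(dates_a), set(dates_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def rank_pairs(jump_dates):
    """jump_dates: dict tag -> set of dates.

    Returns [(similarity, tag_a, tag_b)] sorted from most to least similar.
    """
    scored = [(jaccard_similarity(jump_dates[a], jump_dates[b]), a, b)
              for a, b in combinations(sorted(jump_dates), 2)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

Scoring all pairs is quadratic in the number of tags, which is one practical reason to restrict attention to a subset of tags first, as done with the 150-200 jump-day filter.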

Looking through some pairs, we found a trend: clustering by jump similarity approximately clustered tags by building! To quantify this: increasing tag jump similarity increases the probability the first 10 characters of the tags are the same.
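One way to quantify this is to bucket tag pairs by similarity and compute, per bucket, the fraction of pairs whose first 10 characters match. The sketch below assumes scored pairs are given as (similarity, tag_a, tag_b) tuples; the function names and the bucket count are invented for illustration.

```python
def same_building(tag_a, tag_b, prefix_len=10):
    """Proxy for "same building": the first prefix_len characters agree."""
    return tag_a[:prefix_len] == tag_b[:prefix_len]


def same_building_rate_by_bucket(scored_pairs, num_buckets=10):
    """Fraction of same-building pairs in each similarity bucket.

    scored_pairs: iterable of (similarity in [0, 1], tag_a, tag_b).
    Returns a list of length num_buckets; empty buckets are None.
    """
    hits = [0] * num_buckets
    totals = [0] * num_buckets
    for similarity, tag_a, tag_b in scored_pairs:
        # A similarity of exactly 1.0 goes into the top bucket.
        bucket = min(int(similarity * num_buckets), num_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += same_building(tag_a, tag_b)
    return [h / t if t else None for h, t in zip(hits, totals)]
```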

If there were no relationship between jump similarity and the building for a tag, the following plot should be linear:

## Collecting PI Data from the Terminal Server

Collection was done in two phases:

- Phase 1: the dates in PI_data/phase1_dates were collected, and uploaded to PI_data/oledb_phase1
- Phase 2: some of this data is being collected/uploaded

## Air Flow Data Analysis

### Motivation

We looked at daily variation in tags containing both the words "air" and "flow". There are 3953 such tags, and 1230446 daily time series (tag/date pairs).

The following are common types of air flow plots:

Almost every air flow plot has a few large jumps, and has relatively constant noise for the remainder of the day. A first step in modelling this type of data is detecting spikes. Then we can characterize days by intervals separated by large jumps.

### Analysis

For the time series for each day, we did the following:

- Moved a 60-minute sliding window over that day, computing the variance over the intervals [1:60], [2:61], ..., [1381:1440].
- Took the median of these variances as the baseline noise level. We're assuming (based on skimming some air flow plots) that daily time series are mostly steady, with a few spikes. The mean might be skewed by these few spikes.
- Detected the sliding-window intervals that had a std of at least 3*(baseline std).
- Merged overlapping intervals, outputting (rightmost point - 60) as a representative of the merged interval. The middle value would make for a better interval representative: we'll use this in future work.
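The steps above can be sketched in plain Python as below. This is an interpretation rather than the original script: indices are 0-based, the representative of a merged run is taken to be the start of its last flagged window (its exclusive right end minus the window length), and windows with zero variance are never flagged so that a perfectly flat day yields no jumps.

```python
import math
from statistics import median, pvariance


def detect_jumps(series, window=60, factor=3.0):
    """Return one representative index per merged run of high-variance windows."""
    # Variance of every length-`window` sliding interval.
    variances = [pvariance(series[i:i + window])
                 for i in range(len(series) - window + 1)]
    if not variances:
        return []
    # The median is used instead of the mean so a few spikes don't skew the baseline.
    baseline_std = math.sqrt(median(variances))
    # Start indices of windows whose std is at least factor * baseline std.
    flagged = [i for i, v in enumerate(variances)
               if v > 0 and math.sqrt(v) >= factor * baseline_std]
    jumps = []
    last_start = None
    for start in flagged:
        # Two windows overlap when their starts are less than `window` apart.
        if last_start is not None and start - last_start >= window:
            jumps.append(last_start)  # close the previous merged run
        last_start = start
    if last_start is not None:
        jumps.append(last_start)
    return jumps
```

For a single clean step in an otherwise steady series, this representative lands on the sample just before the step.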

After this process, 644706 "jumps" were output. This is about 1/2 a jump per daily time series plot. Here is a histogram of the times of these jumps:

### Breakdown per Building

We broke down these jump histograms further into 16 per-building jump-event plots. These show that almost all the spikes in the previous jump histogram plot can be attributed to spikes from a single building.

The 16 buildings are: apandm, b2, biomed-lib, calit, ebu3b, emer-svcs, housing_and_dining, mayer, mpb, music, otterson, pharm, robert-paine-center, sas, sdsc, sverdru

Each of the plots fits one of the following descriptions:

- Has <=5 distinctive spikes in the jump histogram. The remainder of the histogram values are relatively uniform.
- Has very few values in the histogram: very few air flow jumps were found for this building

Buildings with distinctive spikes:

Buildings with spikes that are difficult to distinguish:

--Yoavfreund 11:51, 14 February 2014 (PST) SVD?