User:Sachin as/Diary


Project: Building Sensor Analysis and Compression

--Yoavfreund 08:25, 11 March 2016 (PST) Sachin, please add a pointer to the final report.

--Sachin as 13:44, 13 March 2016 (PDT)

  • The master's project report is available here[1]
  • The following S3 bucket contains all the relevant files: s3://building-sensor-analysis[2]
  • The PySpark and SQL notebooks are available on GitHub here[3]
  • The master's project presentation is available here[4]

Accessing data in s3

--Sachin as 10:32, 15 March 2016 (PDT) All the S3 files are listed here: http://data-analysis-sachin.s3.amazonaws.com/list.html. The directory structure is as follows.

directory: data-analysis-sachin

folder1: csv2jsons_new (has the JSON files after parsing all the CSVs). It contains 16441 JSON files, accessible via URLs of the form https://s3-us-west-2.amazonaws.com/data-analysis-sachin/csv2jsons_new/DF_i.json with i in [0, 16441), e.g. https://s3-us-west-2.amazonaws.com/data-analysis-sachin/csv2jsons_new/DF_0.json and https://s3-us-west-2.amazonaws.com/data-analysis-sachin/csv2jsons_new/DF_10.json

folder2: csvParquetFiles_new (the Spark dataframes saved as parquet files). It has 822 parquet directories, each containing 20+ parquet splits. The HTTP URL for each parquet directory follows the same pattern, e.g. https://s3-us-west-2.amazonaws.com/data-analysis-sachin/csvParquetFiles_new/processed0

file3: tags.json has all the tags data. The URL is https://s3-us-west-2.amazonaws.com/data-analysis-sachin/tags.json

In order to mount the files in Databricks, use the snippet below with the correct access and secret keys (ENCODED_SECRET_KEY is the secret key with any "/" characters URL-encoded):

 # Mount the S3 bucket under /mnt/assignment1, then list the parquet directories
 dbutils.fs.mount("s3n://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, 'data-analysis-sachin'), "/mnt/%s" % "assignment1")
 dbutils.fs.ls('/mnt/assignment1/csvParquetFiles_new/')
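
Once the bucket is mounted, any of the parquet directories can be read straight into a Spark dataframe. A minimal sketch, assuming the mount above succeeded and the usual Databricks sqlContext is available:

 # Load one parquet directory from the mounted bucket into a dataframe
 df = sqlContext.read.parquet('/mnt/assignment1/csvParquetFiles_new/processed0')
 df.printSchema()   # expect columns such as time, value, room, sensor_id, sensor_name
 df.count()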

Data preparation

Update-29 Feb 2016

Further processed the datasets and updated the summary spreadsheet. It is available here - https://docs.google.com/spreadsheets/d/1Sbah89Q6sKzdbmZ1RoK63_bOx-EAx_8hgPPnJ6zJbGw/edit?usp=sharing

This spreadsheet takes into account a difference of 5 days for missing data (since most of the analysis is done on at least a week's worth of data, only those room x template combinations that have at least a week's length of data are useful) and removes room x template combinations that have no values associated with them. The total number of sensorpoints (combinations) is now reduced to 4607, with an average of 14 templates per room.

Also, in order to speed up data loading, individual rooms are saved as Spark SQL tables, and a load function was added to the BSA-results notebook to load the corresponding datasets quickly.
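
A minimal sketch of this per-room save/load pattern (the table-name convention below is hypothetical; the actual helper lives in the BSA-results notebook):

 # Save each room's rows as its own Spark SQL table so it can be loaded quickly later
 def save_room_table(df, room):
     df.filter(df.room == room).write.saveAsTable('room_%s' % room)   # hypothetical naming

 # Load a single room's data back
 def load_room_table(sqlContext, room):
     return sqlContext.table('room_%s' % room)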

Update-26 Feb 2016

A summary of the entire dataset is available here https://docs.google.com/spreadsheets/d/1Sbah89Q6sKzdbmZ1RoK63_bOx-EAx_8hgPPnJ6zJbGw/edit?usp=sharing

The total row count is now 2,461,823,200, split between 4822 different sensorPoints (unique room x type). Each room has an average of 9 sensors. A detailed list with starting and ending timestamps for the signals is available in the neighboring sheet.

The processed JSON files are available here https://console.aws.amazon.com/s3/home?region=us-west-2#&bucket=data-analysis-sachin&prefix=csv2jsons_new/. There are around 16000 JSON files with an average size of 10 MB, amounting to about 160 GB. The parquet files are available here https://console.aws.amazon.com/s3/home?region=us-west-2#&bucket=data-analysis-sachin&prefix=csvParquetFiles_new/. There are 822 parquet files with an average size of 40 MB, amounting to about 30 GB of data.

Results

--Yoavfreund 08:23, 11 March 2016 (PST) Sachin, please give pointers to the notebooks used to generate this data. Also add (either there or here) analysis of what we see in each figure.

--Sachin as 10:42, 14 March 2016 (PDT) The results notebook is available here[5]

Below we have plotted the dataset for different rooms and sensors over 14-day time intervals. Each of these sub-datasets is compressed and then reconstructed, and the reconstruction is plotted alongside the original.


Below we have plotted the dataset for different rooms and sensors over 7-day time intervals. Each of these sub-datasets is compressed and then reconstructed, and the reconstruction is plotted alongside the original.

Figure: 'Room 3214, 2014-02-10:2014-02-20', tolerance of 96 (Rm3214 feb14.png)

--Sachin as 11:35, 14 March 2016 (PDT) The reconstruction very nearly follows the actual values. Zone temperature was modeled using piecewise linear and hence has low error at high tolerance values. Also, there is significant correlation between cooling command and HVAC zone power

Figure: 'Room 2138, 2013-12-01:2013-12-07', tolerance of 96 (Rm2138 Dec2103.png)

Again, the actual supply flow and the damper position show strong correlation, although their scales are very different.

Figure: 'Room 4226, 2013-12-01:2013-12-07', tolerance of 96 (Rm4226 Dec2013.png)

Here it is interesting to see that the actual zone temperature looks more broken and follows a step-function pattern, while the reconstructed one is smoother and changes linearly. The high degree of smoothness is mostly due to the high tolerance value. Also, the small aberration around Dec 05 is due to the linear interpolation happening in reverse.

Figure: 'Room 4226, 2014-03-01:2014-03-07', tolerance of 96 (Rm4226 mar2014.png)

Figure: 'Room 2138, 2013-07-01:2013-07-07', tolerance of 96 (Rm2138 jul103.png)

Most of the unruly data corresponds to the damper position, which does not seem to follow any particular pattern.


Figure: 'Actual Heating Setpoint', tolerance of 96 (Ahs.png)

Original size: 3544 points. Reduced size: 34 points. Method: piecewise constant. Tolerance: 96. Reconstruction error: 0.0. 99% compression, zero reconstruction error. The signal has a fixed pattern: a maximum during working hours and a minimum during the evening and night.
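
For reference, the compression percentages quoted in these captions appear to be one minus the ratio of reduced to original point counts; here, for example, 1 - 34/3544 ≈ 0.99, i.e. roughly 99% compression.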


Figure: 'Zone Temperature', tolerance of 96 (Zt.png)

Original size: 2504 points. Reduced size: 179 points. Method: piecewise linear. Tolerance: 36. Std dev: 1.97. Reconstruction error: 0.0327. Error/std = 0.016. 93% compression, 0.03 RMSE. It is interesting to see that while the zone temperature builds up linearly to a maximum, it then falls drastically to the low point.

Figure: 'Actual Supply Flow', tolerance of 16 (Asf.png)

Original size: 3544 points. Reduced size: 294 points. Method: piecewise constant. Tolerance: 16. Std dev: 225.030. Reconstruction error: 3.15. Error/std = 0.014. 92% compression, 3.15 RMSE. Although ASF looks very unruly, it does seem to have a fairly clear pattern: it goes up at the same time each day and falls in the same manner each day.


Figure: 'Actual Cooling Setpoint', tolerance of 96 (Acs.png)

Original size: 3544 points. Reduced size: 101 points. Method: piecewise constant. Tolerance: 96. Std dev: 3.511. Reconstruction error: 0.0546. Error/std = 0.015. 98% compression, 0.05 RMSE.


--Yoavfreund 18:20, 9 February 2016 (PST) I wonder what is "current". Seems like a signal that is sometimes fast varying and sometimes static. Might be air-current. We need a different representation for these. Something like mean+variance, or lower and upper bounds.

Figure: 'Current', tolerance of 16 (Current.png)

--Yoavfreund 18:20, 9 February 2016 (PST) In this and the next there seems to be a constant offset between the reconstruction and the original; the reconstruction is higher.

Figure: 'Supply Velocity Pressure', tolerance of 16 (Svp.png)

No idea why the reconstruction is different from the actual value. Will look into it.

Figure: 'Damper Position', tolerance of 96 (Dp.png)

The reconstruction is a notch higher here, mainly because the damper position has two values at the same time point (the step has infinite slope), so the interpolation fills in values by choosing one of the two.

Figure: 'Warm Cool Adjust', tolerance of 96 (Wca.png)

Figure: 'Actual Supply Flow SP', tolerance of 96 (Asf sp.png)

Sensors

High Temp Setpoint, Humidifier Problem, AI 1 Actual, TUNING POINTS PM2TI, Drive Ready Status, Low Suction Pressure, Occupied Command, Reheat Valve Command, Occupied Clg Min, Common Setpoint, Temp Occ Sts, Water Flush, Cooling Command, Occupied Htg Flow 

all have a single unique value and hence don't require any compression.

Objective:

To model the room sensor data for the UCSD Computer Science building and come up with a machine-learning-backed lossy compression method to store the data compactly without losing significant information.

Dataset:

11887 CSV files, with one CSV per sensor_id.

A sensor_id can correspond to a group of sensors, and each sensor's data is represented by a line in the CSV file. Each line is a {timeseries, values} set for that sensor.

A tags.json file has meta information for each sensor_id. A sample JSON structure for one sensor_id looks as follows:

import json

# Load the sensor metadata from tags.json
with open('tags.json') as f:
    metadata = json.load(f)

uuid = '4c3b1eca-77de-11e2-83c4-00163e005319'
oneSensor = metadata[uuid]
oneSensor['template']                     # the sensor's type
oneSensor['context']                      # its context: floor, building, etc.
sensorpoints = oneSensor['sensorpoints']  # list with information about each sensorpoint of this sensor
print sensorpoints[0]

# printed result:
[{u'active': True,
 u'created_time': u'2013-01-01T00:00:00+00:00',
 u'data_type': u'int',
 u'description': u'PresentValue',
 u'from_template': False,
 u'latest_datapoint': {u'2015-12-23T00:07:04+00:00': -0.97},
 u'max_val': None,
 u'min_val': None,
 u'readonly': False,
 u'shorthand_unit': None,
 u'timeseries_span': {u'begin': u'2013-07-02T18:29:14+00:00',
  u'end': u'2015-12-23T00:07:04+00:00'},
 u'timeseries_type': u'continuous',
 u'type': u'PresentValue',
 u'unit': None,
 u'update_period': None,
 u'uri': u'/admin/api/sensors/4c3b1eca-77de-11e2-83c4-00163e005319/sensorpoints/PresentValue'}]

Data preprocessing

The data is shared on Google Drive. Since there is no easy and free way to sync Google Drive natively on Linux, I used grive to download the files from Drive to a Linux machine and then processed them into Spark dataframes. Each dataframe has the columns time, value, room, sensor_id, sensor_name, and each row corresponds to one value for the sensor. The dataframes were saved as JSON parts, with one JSON per sensor in the CSV file, and pushed to S3. The JSON files are then read from S3, appended, and saved as a Databricks table.
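
A minimal sketch of this per-sensor conversion step (the helper name and output path are hypothetical; the real parsing is done in the preprocessing notebook before pushing to S3):

 import pandas as pd

 # Hypothetical helper: turn one parsed sensor series into the row layout described above
 # (time, value, room, sensor_id, sensor_name) and save it as a per-sensor JSON file
 def sensor_to_json(times, values, room, sensor_id, sensor_name, out_path):
     df = pd.DataFrame({'time': times,
                        'value': values,
                        'room': room,
                        'sensor_id': sensor_id,
                        'sensor_name': sensor_name})
     df.to_json(out_path, orient='records')   # one JSON record per row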

Data summary

There are 16441 JSON files for 11887 CSV files, with the extra JSONs corresponding to sensors from sensor_ids that contain multiple sensors.

These are converted into 822 Spark dataframes, with one dataframe for every 20 JSON files. These dataframes are then read and saved as a Databricks parquet table with 21,057,369 rows.
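
A rough sketch of that batching step (the file naming follows the DF_i.json pattern from the S3 section; the union-based read is an assumption about how the batching was done):

 # Read the per-sensor JSON files in batches of 20 and write each batch out as parquet
 batch_size = 20
 json_paths = ['/mnt/assignment1/csv2jsons_new/DF_%d.json' % i for i in range(16441)]

 for batch_id, start in enumerate(range(0, len(json_paths), batch_size)):
     batch = json_paths[start:start + batch_size]
     dfs = [sqlContext.read.json(p) for p in batch]
     df = reduce(lambda a, b: a.unionAll(b), dfs)      # one dataframe per 20 JSON files
     df.write.parquet('/mnt/assignment1/csvParquetFiles_new/processed%d' % batch_id)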

Note: Only 5850 CSV files are greater than 1 MB in size; the rest of the files are not timeseries files but correspond to command sensorpoints.

--Yoavfreund 18:30, 12 February 2016 (PST) Does a row correspond to one measurement or to one day? In the summary file there are 66 sensor X room pairs. For each of those there is 1-2 years of data. A back of the envelope calculation gives 66*700 < 50,000 days. This means that there are about 10M/50,000 = 2000 rows for each (room,day,sensor). There are 1440 minutes in 24 hours. Does this mean that there is one row per minute for each individual sensor?

--Sachin as Professor Freund, the entire table is available here. Each row has one (time, value) pair. Most of the sensors have values collected at 5-minute intervals between July 1 2013 and May 31 2015, i.e. about 700 days. Also, the total number of unique sensor x room pairs comes to 56, since around 10 of the pairs have a break at the start of the measurement and then go on to capture continuously. I have added the actual row count as found in the table for each sensor x room pair in the summary spreadsheet, and the sum tallies up to the total row count of 10,207,073 rows.


The summary of the entire dataset is available here: Data Summary

--Yoavfreund 23:09, 20 February 2016 (PST) I am confused, do you now have 56 pairs (above) or 4822 pairs (in the table)? Also, please use the signature button in the editor (second from the right), and please use reverse chronological order so that the latest entries are at the top.

Analysis

We have two compression classes:

a) Piecewise constant
b) Piecewise linear

We use a dynamic programming approach that takes a set of timeseries values, checks whether there is a significant change in value (piecewise constant) or slope (piecewise linear) relative to a tolerance parameter, and saves a new timeseries point only if a significant change is observed. These compressed (time, value) points are saved as a dataframe.
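
A minimal sketch of the piecewise-constant idea (a simple greedy pass for illustration, not the exact dynamic-programming routine used in the notebooks):

 def piecewise_constant_compress(times, values, tolerance):
     # Keep a (time, value) point only when the value drifts more than
     # `tolerance` away from the last kept value; otherwise drop it.
     kept = [(times[0], values[0])]
     last_value = values[0]
     for t, v in zip(times[1:], values[1:]):
         if abs(v - last_value) > tolerance:
             kept.append((t, v))
             last_value = v
     return kept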

Data is reconstructed using the ffill (forward-fill) method, which propagates the last valid observation forward to the next valid observation.
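
For example, with pandas (a sketch assuming the compressed (time, value) points are re-placed on the original time grid):

 import pandas as pd

 def reconstruct(kept, original_times):
     # Put the kept points back on the original timestamps and forward-fill the gaps
     compressed = pd.Series(dict(kept))
     return compressed.reindex(original_times).ffill()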

Tolerance is found by plotting the reconstruction error against the number of compressed points, in order to optimize the tradeoff between error and the size of the compressed set. The error decreases as the compressed size increases, and the plot has a knee where the error vs. compressed-size tradeoff is best. We choose that knee point as the tolerance value.
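
A sketch of that sweep, reusing the two helpers above (the candidate tolerance values would be chosen per signal):

 import matplotlib.pyplot as plt

 def tolerance_sweep(times, values, tolerances):
     # For each candidate tolerance, record the compressed size and the reconstruction RMSE
     sizes, errors = [], []
     for tol in tolerances:
         kept = piecewise_constant_compress(times, values, tol)
         recon = reconstruct(kept, times)
         rmse = (((recon.values - values) ** 2).mean()) ** 0.5
         sizes.append(len(kept))
         errors.append(rmse)
     plt.plot(sizes, errors, 'o-')              # look for the knee of this curve
     plt.xlabel('compressed size (points)')
     plt.ylabel('reconstruction error (RMSE)')
     return sizes, errors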