Data set
Explanation
MeteoSchweiz / MeteoSuisse / MeteoSvizzera / MeteoSwiss stn|time|rhs150m0|ths200m0 ||[mm]|[°C]
Data are separated by a “|” and do not contains blankspace.
# | Name | Example | Description |
---|---|---|---|
1 | stn | BAS |
Station's names BAS: Basel / Binningen BER: Bern / Zollikofen CHM: Chaumont CHD: Château-d'Oex GSB: Col du Grand St-Bernard DAV: Davos ENG: Engelberg GVE: Genève-Cointrin LUG: Lugano PAY: Payerne SIA: Segl-Maria SIO: Sion SAE: Säntis SMA: Zürich / Fluntern |
2 | time | 201503 | Year and months of the measure. Format: yyyyMM |
3 | rhs150m0 | 36.9 | Sum of precipitation in mm at 1.5 meter |
4 | ths200m0 | 4.5 | Mean temperature in degree Celsius at 2 meters |
Data update
Data Access
Warnings
Some of stations do not have data like Payerne or others stations did not exist in 1864. In consequence, data must be filtered before used for statistics.
Hive
Creating a database, a table and loading data
1. Downloading the data and remove the header
$ wget http://data.geo.admin.ch.s3.amazonaws.com/ch.meteoschweiz.homogenereihen/VQAA60.txt $ tail -n +5 VQAA60.txt > VQAA60.txt.new && mv VQAA60.txt.new VQAA60.txt
2. Copy the locally unzipped data into your home folder in HDFS (the tailing “.” points you to /user/$(whoami)). SeeHDFSHelloWorld if you’re not familiar with HDFS.
$ hdfs dfs -copyFromLocal VQAA60.txt
3. Create a database in Hive
$ hive create database ${env:USER}_meteo;
4. Create a temp table to load the file into
$ hive use ${env:USER}_meteo; create table temp_meteo (col_value STRING); LOAD DATA INPATH '/user/${env:USER}/VQAA60.txt' OVERWRITE INTO TABLE temp_meteo;
5. Create the final table and insert the data into. The date are in format YYYYMM, they will be cast in string because of the request language. It is easier to extract the year from a string than an int
$ hive use ${env:USER}_meteo; create table meteo (station STRING, date STRING, precipitation FLOAT, temperature FLOAT); insert overwrite table meteo SELECT regexp_extract(col_value, '^(?:([^\|]*)\.){1}', 1) station, regexp_extract(col_value, '^(?:([^\|]*)\.){2}', 1) date, regexp_extract(col_value, '^(?:([^\|]*)\.){3}', 1) precipitation, regexp_extract(col_value, '^(?:([^\|]*)\.){4}', 1) temperature from temp_meteo;
6. Run your first query
$ hive --database ${USER}_meteo SELECT station, avg(precipitation) as mean_precipitation, avg(temperature) as mean_temperature FROM meteo GROUP BY station;
7. Woot!
8. Run a more complex query. Summerize the precipitation by station and year
$ hive --database ${USER}_meteo SELECT station, dateYear, sum(precipitation) as sumPre FROM (SELECT substring(date,1,4) as dateYear, precipitation, station FROM meteo) as T1 GROUP BY dateYear, station ORDER BY sumPre desc