Ugarit
Check-in Differences

Difference From 20e685628c65ada83b3dd70aa0cf85224c1c2fb3 To eef210bde31057111521c6fa35625d07bf1ddbcc

2016-02-21 21:44
Changed the URL of Rommel's article. check-in: b7d078cbb6 user: alaric tags: trunk

2015-07-31 13:38
Merge from trunk, and added flushing of output on progress updates, doc updates, etc. Sloppy to forget I'd got uncommitted work when I did the merge, but now too late to undo... check-in: 235560787d user: alaric tags: alaricsp

2015-07-08 08:55
Updated the Ugarit tutorial link. check-in: 20e685628c user: alaric tags: alaricsp

2015-06-15 21:10
Updated installation manual to reflect forgotten things about the changes to local caching. check-in: eef210bde3 user: alaric tags: trunk

2015-06-14 20:53
Cleaned up the archive schema page. check-in: e1cd1c6f30 user: alaric tags: trunk

2015-06-12 22:27
Added sanity checks for input/output tags to "ugarit merge". check-in: 5cb794335d user: alaric tags: alaricsp

Changes to DOWNLOAD.wiki.

<h1>Releases</h1>

  *  [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.tar.gz?uuid=1.0|1.0]
  *  [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.1.tar.gz?uuid=1.0.1|1.0.1]
  *  [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.2.tar.gz?uuid=1.0.2|1.0.2]
  *  [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.3.tar.gz?uuid=1.0.3|1.0.3]
  *  [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.4.tar.gz?uuid=1.0.4|1.0.4]
  *  [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.5.tar.gz?uuid=1.0.5|1.0.5]
  *  [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.6.tar.gz?uuid=1.0.6|1.0.6]
  *  [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.7.tar.gz?uuid=1.0.7|1.0.7]
  *  [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.8.tar.gz?uuid=1.0.8|1.0.8]
  *  [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-1.0.9.tar.gz?uuid=1.0.9|1.0.9]
  *  [https://www.kitten-technologies.co.uk/project/ugarit/tarball/ugarit-2.0.tar.gz?uuid=2.0|2.0]

<h1>Source Control</h1>

You can obtain the latest sources, all history, and a local copy of
the ticket database using [http://www.fossil-scm.org/|Fossil], like so:

    fossil clone https://www.kitten-technologies.co.uk/project/ugarit ugarit.fossil

Changes to README.wiki.

<center><img src="https://www.kitten-technologies.co.uk/project/ugarit/doc/trunk/artwork/logo.png" /></center>

<h1>Introduction</h1>

Ugarit is a backup/archival system based around content-addressable
storage.

<h1>News</h1>

<p>Development priorities are: Performance, better error handling, and
fixing bugs! After I've cleaned house a little, I'll be focussing on
replicated backend storage (ticket [f1f2ce8cdc]), as I now have a
cluster of storage devices at home.</p>

<ul>

<li>Version 2.0 is released, containing rudimentary archive
mode, plus many minor improvements! See the release notes at the
bottom for more details.</li>

<li>2014-11-02: Chicken itself has gained
[http://code.call-cc.org/cgi-bin/gitweb.cgi?p=chicken-core.git;a=commit;h=a0ce0b4cb4155754c1a304c0d8b15276b11b8cd2|significantly
faster byte-vector I/O]. This is only on the trunk at the time of
writing; I look forward to it being in a formal release, as it sped up
Ugarit snapshot benchmarks (dumping a 256MiB file into an sqlite
backend) by a factor of twenty-something.</li>

<li>2014-02-21: User [http://rmm.meta.ph/|Rommel Martinez] has written
[https://ebzzry.github.io/blog/2014/02/21/an-introduction-to-ugarit/|An introduction to Ugarit]!</li>

</ul>

<h1>About Ugarit</h1>

<h2>What's content-addressable storage?</h2>

Traditional backup systems work by storing copies of your files
somewhere. Perhaps they go onto tapes, or perhaps they're in archive
files written to disk. They will either be full dumps, containing a
complete copy of your files, or incrementals or differentials, which
only contain files that have been modified since some point. This
saves making repeated copies of unchanging files, but it means that to
do a full restore, you need to start by extracting the last full dump
then applying one or more incrementals, or the latest differential,
to get the latest state.

Not only do differentials and incrementals let you save space, they
also give you a history - you can restore to a previous point in
time, which is invaluable if the file you want to restore was deleted
a few backup cycles ago!

This technology was developed when the best storage technology for
backups was magnetic tape, because each dump is written sequentially
(and restores are largely sequential, unless you're skipping bits to
pull out specific files).

However, these days, random-access media such as magnetic disks and
SSDs are cheap enough to compete with magnetic tape for long-term bulk
storage (especially when one considers the cost of a tape drive or
two). And having fast random access means we can take advantage of
different storage techniques.

A content-addressable store is a key-value store, except that the keys
are always computed from the values. When a given object is stored, it
is hashed, and the hash used as the key. This means you can never
store the same object twice; the second time you'll get the same hash,
see the object is already present, and re-use the existing
copy. Therefore, you get deduplication of your data for free.
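The idea is small enough to sketch in a few lines of Python. This is purely illustrative (Ugarit itself is written in Chicken Scheme, with configurable hash functions); the `ContentStore` class and its method names are hypothetical:

```python
import hashlib

class ContentStore:
    """Toy content-addressable store: keys are SHA-256 hashes of values."""
    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blocks:   # storing the same data twice is a no-op
            self.blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self.blocks[key]

store = ContentStore()
k1 = store.put(b"hello")
k2 = store.put(b"hello")             # deduplicated: same key, no new block
assert k1 == k2 and len(store.blocks) == 1
```

Note that `put` never overwrites: since the key is derived from the contents, an existing entry under that key is, by construction, the same data.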

But, I hear you ask, how do you find things again, if you can't choose
the keys?

When an object is stored, you need to record the key so you can find
it again later. In Ugarit, everything is stored in a tree-like
directory structure. Files are uploaded and their hashes obtained, and
then a directory object is constructed containing a list of the files
in the directory, and listing the key of the Ugarit objects storing
the contents of each file. This directory object itself has a hash,
which is stored inside the directory entry in the parent directory,
and so on up to the root. The root of a tree stored in a Ugarit vault
has no parent directory to contain it, so at that point, we store the
key of the root in a named "tag" that we can look up by name when we
want it.

Therefore, everything in a Ugarit vault can be found by starting with
a named tag and retrieving the object whose key it contains, then
finding keys inside that object and looking up the objects they refer
to, until we find the object we want.
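The bottom-up construction of that tree can be sketched as follows. The JSON directory encoding and the `store_tree` helper are inventions for this example, not Ugarit's actual block format:

```python
import hashlib
import json

blocks, tags = {}, {}

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    blocks.setdefault(key, data)
    return key

def store_tree(tree) -> str:
    if isinstance(tree, bytes):        # a file: store its contents, get a key
        return put(tree)
    # a directory: store each child first, then a block listing (name, key)
    entries = {name: store_tree(child) for name, child in tree.items()}
    return put(json.dumps(entries, sort_keys=True).encode())

root = store_tree({"etc": {"motd": b"welcome"}, "home": {"notes": b"todo"}})
tags["my-fs"] = root                   # the named tag is the only mutable pointer
```

Everything below the tag is immutable and reachable by following keys downward, which is exactly the lookup procedure described above.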

When you use Ugarit to back up your filesystem, it uploads a complete
snapshot of every file in the filesystem, like a full dump. But
because the vault is content-addressed, it automatically avoids
uploading anything it already has a copy of, so all we upload is an
incremental dump - but in the vault, it looks like a full dump, and so
can be restored on its own without having to restore a chain of incrementals.

Also, the same storage can be shared between multiple systems that all
back up to it - and the incremental upload algorithm will mean that
any files shared between the servers will only need to be uploaded
once. If you back up a complete server, then go and back up another
that is running the same distribution, then all the files in <tt>/bin</tt>
and so on that are already in the storage will not need to be backed
up again; the system will automatically spot that they're already
there, and not upload them again.

As well as storing backups of filesystems, Ugarit can also be used as
the primary storage for read-only files, such as music and photos. The
principle is exactly the same; the only difference is in how the files
are organised - rather than as a directory structure, the files are
referenced from metadata objects that specify information about the
file (so it can be found) and a reference to the contents. Sets of
metadata objects are pointed to by tags as well, so they can also be
found.

<h2>So what's that mean in practice?</h2>

<h3>Backups</h3>
You can run Ugarit to back up any number of filesystems to a shared
storage area (known as a <i>vault</i>), and on every backup, Ugarit
will only upload files or parts of files that aren't already in the
vault - be they from the previous snapshot, earlier snapshots,
snapshots of entirely unrelated filesystems, etc. Every time you do a
snapshot, Ugarit builds an entire complete directory tree of the
snapshot in the vault - but reusing any parts of files, files, or
entire directories that already exist anywhere in the vault, and
only uploading what doesn't already exist.

The support for parts of files means that, in many cases, gigantic
files like database tables and virtual disks for virtual machines will
not need to be uploaded entirely every time they change, as only the
changed sections will be identified and uploaded.
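Block-level deduplication of large files can be illustrated with fixed-size chunking (a simplification: the chunking policy Ugarit actually uses may differ):

```python
import hashlib

CHUNK = 1024 * 1024                      # 1 MiB blocks

def upload(data: bytes, store: dict) -> list[str]:
    """Split a file into fixed-size blocks; store only unseen ones."""
    keys = []
    for i in range(0, len(data), CHUNK):
        block = data[i:i + CHUNK]
        key = hashlib.sha256(block).hexdigest()
        store.setdefault(key, block)     # no-op if the block already exists
        keys.append(key)
    return keys

store = {}
v1 = b"A" * CHUNK + b"B" * CHUNK         # a 2 MiB "virtual disk image"
v2 = b"A" * CHUNK + b"C" * CHUNK         # second half changed
upload(v1, store)
upload(v2, store)                        # only the changed block is new
assert len(store) == 3                   # unchanged first half shared
```

Snapshotting the modified file costs one new block, not two, because the unchanged half hashes to a key already in the store.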

Because a complete directory tree exists in the vault for any
snapshot, the extraction algorithm is incredibly simple - and,
therefore, incredibly reliable and fast. Simple, reliable, and fast
are just what you need when you're trying to reconstruct the
filesystem of a live server.

Also, it means that you can do lots of small snapshots. If you run a
snapshot every hour, then only a megabyte or two might have changed in
your filesystem, so you only upload a megabyte or two - yet you end up
with a complete history of your filesystem at hourly intervals in the
vault.

Conventional backup systems usually either store a full backup then
incrementals to their archives, meaning that doing a restore involves
reading the full backup then reading every incremental since and
applying them - so to do a restore, you have to download *every
version* of the filesystem you've ever uploaded, or you have to do
periodic full backups (even though most of your filesystem won't have
changed since the last full backup) to reduce the number of
incrementals required for a restore. Better results are had from
systems that use a special backup server to look after the archive
storage, which accept incremental backups and apply them to the
snapshot they keep in order to maintain a most-recent snapshot that
can be downloaded in a single run; but they then restrict you to using
dedicated servers as your archive stores, ruling out cheaply scalable
solutions like Amazon S3, or just backing up to a removable USB or
eSATA disk you attach to your system whenever you do a backup. And
dedicated backup servers are complex pieces of software; can you rely
on something complex for the fundamental foundation of your data
security system?

<h3>Archives</h3>

You can also use Ugarit as the primary storage for read-only
files. You do this by creating an archive in the vault, and importing
batches of files into it along with their metadata (arbitrary
attributes, such as "author", "creation date" or "subject").

Just as you can keep snapshots of multiple systems in a Ugarit vault,
you can also keep multiple separate archives, each identified by a
named tag.

However, as it's all within the same vault, the usual de-duplication
rules apply. The same file may be in multiple archives, with different
metadata in each, as the file contents and metadata are stored
separately (and associated only within the context of each
archive). And, of course, the same file may appear in snapshots and in
archives; perhaps a file was originally downloaded into your home
directory, where it was backed up into Ugarit snapshots, and then you
imported it into your archive. The archive import would not have had
to re-upload the file, as its contents would have already been found
in the vault, so all that needs to be uploaded is the metadata.

Although we have mainly spoken of storing files in archives, the
objects in archives can be files or directories full of files, as
well. This is useful for storing MacOS-style files that are actually
directories, or for archiving things like completed projects for
clients, which can be entire directory structures.

<h2>System Requirements</h2>

Ugarit should run on any POSIX-compliant system that can run
[http://www.call-with-current-continuation.org/|Chicken Scheme]. It
stores and restores all the file attributes reported by the <code>stat</code>
system call - POSIX mode permissions, UID, GID, mtime, and optionally
atime and ctime (although the ctime cannot be restored due to POSIX
restrictions). Ugarit will store files, directories, device and
character special files, symlinks, and FIFOs.

Support for extended filesystem attributes - ACLs, alternative
streams, forks and other metadata - is possible, due to the extensible
directory entry format; support for such metadata will be added as
required.

Currently, only local filesystem-based vault storage backends are
complete: these are suitable for backing up to a removable hard disk
or a filesystem shared via NFS or other protocols. However, the
backend can be accessed via an SSH tunnel, so a remote server you are
able to install Ugarit on to run the backends can be used as a remote
vault.

However, the next backend to be implemented will be one for Amazon S3,
and an SFTP backend for storing vaults anywhere you can ssh
to. Other backends will be implemented on demand; a vault can, in
principle, be stored on anything that can store files by name, report
on whether a file already exists, and efficiently download a file by
name. This rules out magnetic tapes due to their requirement for
sequential access.

Although we need to trust that a backend won't lose data (for now), we
don't need to trust the backend not to snoop on us, as Ugarit
optionally encrypts everything sent to the vault.

<h2>Terminology</h2>

A Ugarit backend is the software module that handles backend
storage. An actual storage area - managed by a backend - is called a
storage, and is used to implement a vault; currently, every storage is
a valid vault, but the planned future introduction of a distributed
storage backend will enable multiple storages (which are not,
themselves, valid vaults as they only contain some subset of the
information required) to be combined into an aggregate storage, which
then holds the actual vault. Note that the contents of a storage are
purely a set of blocks, and a series of named tags containing
references to them; the storage does not know the details of
encryption and hashing, so cannot make any sense of its contents.

For example, if you use the recommended "splitlog" filesystem backend,
your vault might be <samp>/mnt/bigdisk</samp> on the server
<samp>prometheus</samp>. The backend (which is compiled along with the
other filesystem backends in the <code>backend-fs</code> binary) must
be installed on <samp>prometheus</samp>, and Ugarit clients all over
the place may then use it via ssh to <samp>prometheus</samp>. However,
even with the filesystem backends, the actual storage might not be on
<samp>prometheus</samp> where the backend runs -
<samp>/mnt/bigdisk</samp> might be an NFS mount, or a mount from a
storage-area network. This ability to delegate via SSH is particularly
useful with the "cache" backend, which reduces latency by storing a
cache of what blocks exist in a backend, thereby making it quicker to
identify already-stored files; a cluster of servers all sharing the
same vault might all use SSH tunnels to access an instance of the
"cache" backend on one of them (using some local disk to store the
cache), which proxies the actual vault storage to a vault on the other
end of a high-latency Internet link, again via an SSH tunnel.

A vault is where Ugarit stores backups (as chains of snapshots) and
archives (as chains of archive imports). Backups and archives are
identified by tags, which are the top-level named entry points into a
vault. A vault is based on top of a storage, along with a choice of
hash function, compression algorithm, and encryption that are used to
map the logical world of snapshots and archive imports into the
physical world of blocks stored in the storage.

A snapshot is a copy of a filesystem tree in the vault, with a header
block that gives some metadata about it. A backup consists of a number
of snapshots of a given filesystem.

An archive import is a set of filesystem trees, each along with
metadata about it. Whereas a backup is organised around a series of
timed snapshots, an archive is organised around the metadata; the
filesystem trees in the archive are identified by their properties.

<h2>So what, exactly, is in a vault?</h2>

A Ugarit vault contains a load of blocks, each up to a maximum size
(usually 1MiB, although other backends might impose smaller
limits). Each block is identified by the hash of its contents; this is
how Ugarit avoids ever uploading the same data twice, by checking to
see if the data to be uploaded already exists in the vault by
looking up the hash. The contents of the blocks are compressed and
then encrypted before upload.
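The per-block pipeline can be sketched like so. This assumes the dedup hash is computed over the uncompressed contents (so identical plaintext blocks dedup regardless of encryption); `prepare_block` and the pluggable `encrypt` callable are hypothetical names, and a real deployment would pass an actual cipher rather than the identity function used in the demo:

```python
import hashlib
import zlib

def prepare_block(data: bytes, encrypt) -> tuple[str, bytes]:
    """Hash the plaintext for dedup lookups, then compress and
    encrypt the payload that actually travels to the backend."""
    key = hashlib.sha256(data).hexdigest()
    return key, encrypt(zlib.compress(data))

# Identity "encryption" just for the demonstration:
key, payload = prepare_block(b"some block contents", lambda c: c)
assert zlib.decompress(payload) == b"some block contents"
```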

Every file uploaded is, unless it's small enough to fit in a single
block, chopped into blocks, and each block uploaded. This way, the
entire contents of your filesystem can be uploaded - or, at least,
only the parts of it that aren't already there! The blocks are then
tied together to create a snapshot by uploading blocks full of the
hashes of the data blocks, and directory blocks are uploaded listing
the names and attributes of files in directories, along with the
hashes of the blocks that contain the files' contents. Even the blocks
that contain lists of hashes of other blocks are subject to checking
for pre-existence in the vault; if only a few MiB of your
hundred-GiB filesystem has changed, then even the index blocks and
directory blocks are re-used from previous snapshots.

Once uploaded, a block in the vault is never again changed. After all,
if its contents changed, its hash would change, so it would no longer
be the same block! However, every block has a reference count,
tracking the number of index blocks that refer to it. This means that
the vault knows which blocks are shared between multiple snapshots (or
shared *within* a snapshot - if a filesystem has more than one copy of
the same file, still only one copy is uploaded), so that if a given
snapshot is deleted, then the blocks that only that snapshot is using
can be deleted to free up space, without corrupting other snapshots by
deleting blocks they share. Keep in mind, however, that not all
storage backends may support this - there are certain advantages to
being an append-only vault. For a start, you can't delete something by
accident! The supplied fs and sqlite backends support deletion, while
the splitlog backend does not yet. However, the actual snapshot
deletion command in the user interface hasn't been implemented yet
either, so it's a moot point for now...
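Reference counting makes safe deletion straightforward in principle; a toy version (the snapshot functions here are invented for illustration):

```python
from collections import Counter

blocks = {"a": b"...", "b": b"...", "c": b"..."}
refs = Counter()

def add_snapshot(keys):
    refs.update(keys)                 # each referencing snapshot bumps the count

def delete_snapshot(keys):
    refs.subtract(keys)
    for k in keys:
        if refs[k] == 0:              # no surviving snapshot uses this block
            del blocks[k]

add_snapshot(["a", "b"])
add_snapshot(["b", "c"])
delete_snapshot(["a", "b"])           # "b" survives: the second snapshot uses it
assert set(blocks) == {"b", "c"}
```

Only blocks whose count drops to zero are reclaimed, so snapshots sharing data are never corrupted by deleting a sibling.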

Finally, the vault contains objects called tags. Unlike the blocks,
the tags' contents can change, and they have meaningful names rather
than being identified by hash. Tags identify the top-level blocks of
snapshots within the system, from which (by following the chain of
hashes down through the index blocks) the entire contents of a
snapshot may be found. Unless you happen to have recorded the hash of
a snapshot somewhere, the tags are where you find snapshots from when
you want to do a restore.

Whenever a snapshot is taken, as soon as Ugarit has uploaded all the
files, directories, and index blocks required, it looks up the tag you
have identified as the target of the snapshot. If the tag already
exists, then the snapshot it currently points to is recorded in the
new snapshot as the "previous snapshot"; then the snapshot header,
containing the previous snapshot hash along with the date and time
and any comments you provide for the snapshot, is uploaded (as
another block, identified by its hash). The tag is then updated to
point to the new snapshot.
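The tag-update step above can be sketched as follows (a simplification with hypothetical names; real snapshot headers carry more fields, and "previous" is actually a list, as discussed later):

```python
import hashlib
import json

blocks, tags = {}, {}

def put(obj) -> str:
    """Store a JSON-encoded object as an immutable block, keyed by hash."""
    data = json.dumps(obj, sort_keys=True).encode()
    key = hashlib.sha256(data).hexdigest()
    blocks[key] = data
    return key

def snapshot(tag: str, root_key: str, comment: str):
    header = {"root": root_key,
              "previous": tags.get(tag),   # None for the first snapshot
              "comment": comment}
    tags[tag] = put(header)                # tag now points at the new head

snapshot("server1", "rootkey-1", "nightly")
snapshot("server1", "rootkey-2", "nightly")
head = json.loads(blocks[tags["server1"]])
assert head["previous"] is not None        # chain links back to the first
```

Walking the "previous" references from the tag recovers the full chronological chain of snapshots.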

This way, each tag actually identifies a chronological chain of
snapshots. Normally, you would use a tag to identify a filesystem
being backed up; you'd keep snapshotting the filesystem to the same
tag, resulting in all the snapshots of that filesystem hanging from
the tag. But if you wanted to remember any particular snapshot
(perhaps if it's the snapshot you take before a big upgrade or other
risky operation), you can duplicate the tag, in effect 'forking' the
chain of snapshots much like a branch in a version control system.

Archive imports cause the creation of one or more archive metadata
blocks, each of which lists the hashes of files or filesystem trees in
the archive, along with their metadata. Each import then has a single
archive import block pointing to the sequence of metadata blocks, and
pointing to the previous archive import block in that archive. The
same filesystem tree can be imported more than once to the same
archive, and the "latest" metadata always wins.

Generally, you should create lots of small archives for different
categories of things - such as one for music, one for photos, and so
on. You might well create separate archives for the music collections
of different people in your household, unless they overlap, and
another for Christmas music so it doesn't crop up in random shuffle
play! It's easy to merge archives if you over-compartmentalise them,
but harder to split an archive if you find it too cluttered with
unrelated things.

I've spoken of archive imports, and backup snapshots, each having a
"previous" reference to the last import or snapshot in the chain, but
it's actually more complex than that: they have an arbitrary list of
zero or more previous objects. As such, it's possible for several
imports or snapshots to have the same "previous", known as a "fork",
and it's possible to have an import or snapshot that merges multiple
previous ones.

Forking is handy if you want to basically duplicate an archive,
creating two new archives with the same contents to begin with, but
each then capable of diverging thereafter. You might do this to keep
the state of an archive before doing a big import, so you can go back
to the original state if you regret the import, for instance.

Forking a backup tag is a more unusual operation, but also
useful. Perhaps you have a server running many stateful services, and
the hardware becomes overloaded, so you clone the basic setup onto
another server, and run half of the services on the original and half
on the new one; if you fork the backup tag of the original server to
create a backup tag for the new server, then both servers' snapshot
history will share the original shared state.

Merging is most useful for archives; you might merge several archives
into one, as mentioned.

And, of course, you can merge backup tags, as well. If your earlier
splitting of one server into two doesn't work out (perhaps your
workload reduces, or you can now afford a single, more powerful,
server to handle everything in one place), you might rsync back the
service state from the two servers onto the new server, so it's all
merged in the new server's filesystem. To preserve this in the
snapshot history, you can merge the two backup tags of the two servers
to create a backup tag for the single new server, which will
accurately reflect the history of the filesystem.

Also, tags might fork by accident. I plan to introduce a distributed
storage backend, which will replicate blocks and tags across multiple
storages to create a single virtual storage to build a vault on top
of. If the network of actual storages suffers a failure, snapshots and
imports may be applied to only some of the storages, and subsequent
snapshots and imports to some other subset of them. When the network
is repaired and all the storages are again visible, they will have
diverged, inconsistent states for their tags. The distributed storage
system will resolve the situation by keeping the majority state as the
state of the tag on all the backends, while preserving any other
states by creating new tags, with the original name plus a
suffix. These can then be merged to "heal" the conflict.

<h1>Using Ugarit</h1>

<h2>Installation</h2>

Install [http://www.call-with-current-continuation.org/|Chicken Scheme] using their [http://wiki.call-cc.org/man/4/Getting%20started|installation instructions].

Ugarit can then be installed by typing (as root):

    chicken-install ugarit

See the [http://wiki.call-cc.org/manual/Extensions#chicken-install-reference|chicken-install manual] for details if you have any trouble, or wish to install into your home directory.

<h2>Setting up a vault</h2>

Firstly, you need to know the vault identifier for the place you'll
be storing your vaults. This depends on your backend. The vault
identifier is actually the command line used to invoke the backend for
a particular vault; communication with the vault is via standard
input and output, which makes it easy to tunnel via ssh.

<h3>Local filesystem backends</h3>

These backends use the local filesystem to store the vaults. Of
course, the "local filesystem" on a given server might be an NFS mount
or mounted from a storage-area network.

<h4>Logfile backend</h4>

The logfile backend works much like the original Venti system. It's
append-only - you won't be able to delete old snapshots from a logfile
vault, even when I implement deletion. It stores the vault in two
sets of files; one is a log of data blocks, split at a specified
maximum size, and the other is the metadata: an sqlite database used
to track the location of blocks in the log files, the contents of
tags, and a count of the logs so a filename can be chosen for a new one.

To set up a new logfile vault, just choose where to put the two
parts. It would be nice to put the metadata file on a different
physical disk to the logs directory, to reduce seeking. If you only
have one disk, you can put the metadata file in the log directory
("metadata" is a good name).

You can then refer to it using the following vault identifier:

      "backend-fs splitlog ...log directory... ...metadata file..."
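
For example, setting up the two parts and printing the resulting
identifier might look like this (a sketch; the paths are hypothetical,
and on a real server you'd pick something like <code>/srv/ugarit</code>):

```shell
# Sketch, with hypothetical paths: create the two parts of a splitlog vault.
# On a real server you might use /srv/ugarit instead of a temp directory.
BASE="${UGARIT_BASE:-$(mktemp -d)}"
mkdir -p "$BASE/logs"
# Ugarit creates the log files and the metadata database itself, so the
# metadata file need not exist yet; the vault identifier is then:
echo "backend-fs splitlog $BASE/logs $BASE/logs/metadata"
```

This is the single-disk layout, with the metadata file kept inside the
log directory under the name "metadata".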

<h4>SQLite backend</h4>

The sqlite backend works a bit like a
[http://www.fossil-scm.org/|Fossil] repository; the storage is
implemented as a single file, which is actually an SQLite database
containing blocks as blobs, along with tags and configuration data in
their own tables.

It supports unlinking objects, and the use of a single file to store
everything is convenient; but storing everything in a single file with
random access is slightly riskier than the simple structure of an
append-only log file; it is less tolerant of corruption, which can
easily render the entire storage unusable. Also, that one file can get
very large.

SQLite has internal limits on the size of a database, but they're
quite large - you'll probably hit a size limit at about 140
terabytes.

To set up an SQLite storage, just choose a place to put the file. I
usually use an extension of <code>.vault</code>; note that SQLite will
create additional temporary files alongside it with additional
extensions, too.

Then refer to it with the following vault identifier:

      "backend-sqlite ...path to vault file..."

<h4>Filesystem backend</h4>

The filesystem backend creates vaults by storing each block or tag
in its own file, in a directory. To keep the objects-per-directory
count down, it'll split the files into subdirectories. Because of
this, it uses a stupendous number of inodes (more than the filesystem
being backed up). Only use it if you don't mind that; splitlog is much
more efficient.

To set up a new filesystem-backend vault, just create an empty
directory that Ugarit will have write access to when it runs. It will
probably run as root in order to be able to access the contents of
files that aren't world-readable (although that's up to you), so
unless you access your storage via ssh or sudo to use another user to
run the backend under, be careful of NFS mounts that have
<code>maproot=nobody</code> set!

You can then refer to it using the following vault identifier:

      "backend-fs fs ...path to directory..."

<h3>Proxying backends</h3>

These backends wrap another vault identifier which the actual
storage task is delegated to, but add some value along the way.

<h3>SSH tunnelling</h3>

It's easy to access a vault stored on a remote server. The caveat
is that the backend then needs to be installed on the remote server!
Since vaults are accessed by running the supplied command, and then
talking to them via stdin and stdout, the vault identifier need
only be:

      "ssh ...hostname... '...remote vault identifier...'"

<h3>Cache backend</h3>

The cache backend is used to cache a list of what blocks exist in the
proxied backend, so that it can answer queries as to the existence of
a block rapidly, even when the proxied backend is on the end of a
high-latency link (e.g. the Internet). This should speed up snapshots,
as existing files are identified by asking the backend if the vault
already has them.

The cache backend works by storing the cache in a local sqlite
file. Given a place for it to store that file, usage is simple:

      "backend-cache ...path to cachefile... '...proxied vault identifier...'"

The cache file will be automatically created if it doesn't already
exist, so make sure there's write access to the containing directory.

 - WARNING - WARNING - WARNING - WARNING - WARNING - WARNING -

If you use a cache on a vault shared between servers, make sure
that you either:

  *  Never delete things from the vault

or

  *  Make sure all access to the vault is via the same cache

If a block is deleted from a vault, and a cache on that vault is
not aware of the deletion (as it did not go "through" the caching
proxy), then the cache will record that the block exists in the
vault when it does not. This will mean that if a snapshot is made
through the cache that would use that block, then it will be assumed
that the block already exists in the vault when it does
not. Therefore, the block will not be uploaded, and a dangling
reference will result!

Some setups which *are* safe:

  *  A single server using a vault via a cache, not sharing it with
   anyone else.

  *  A pool of servers using a vault via the same cache.

  *  A pool of servers using a vault via one or more caches, and
   maybe some not via the cache, where nothing is ever deleted from
   the vault.

  *  A pool of servers using a vault via one cache, and maybe some
   not via the cache, where deletions are only performed on servers
   using the cache, so the cache is always aware.

<h2>Writing a <code>ugarit.conf</code></h2>

<code>ugarit.conf</code> should look something like this:

<verbatim>(storage <vault identifier>)
(hash tiger "<salt>")
[double-check]
[(compression [deflate|lzma])]
[(encryption aes <key>)]
[(file-cache "<path>")]
[(rule ...)]</verbatim>

The hash line chooses a hash algorithm. Currently Tiger-192
(<code>tiger</code>), SHA-256 (<code>sha256</code>), SHA-384
(<code>sha384</code>) and SHA-512 (<code>sha512</code>) are supported;
if you omit the line then Tiger will still be used, but it will be a
simple hash of the block with the block type appended, which reveals
to attackers what blocks you have (as the hash is of the unencrypted
block, and the hash is not encrypted). This is useful for development
and testing or for use with trusted vaults, but not advised for use
with vaults that attackers may snoop at. Providing a salt string
produces a hash function that hashes the block, the type of block, and
the salt string, producing hashes that attackers who can snoop the
vault cannot use to find known blocks (see the "Security model"
section below for more details).
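
The idea behind the salted hash can be sketched like so (a
simplification for illustration only, NOT Ugarit's exact construction;
it uses SHA-256 from coreutils, and the block-type marker is invented):

```shell
# Sketch of the salted-hash idea (NOT Ugarit's exact construction):
# the name a block is stored under depends on the data, the block type,
# AND a secret salt, so a snooper can't precompute hashes of known data.
salt="my secret salt"          # taken from ugarit.conf
type="d"                       # hypothetical block-type marker
block="some file contents"
printf '%s%s%s' "$block" "$type" "$salt" | sha256sum | cut -d' ' -f1
```

Without knowledge of the salt, an attacker cannot compute the hash a
known block would be stored under.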

I would recommend that you create a salt string from a secure entropy
source, such as:

   dd if=/dev/random bs=1 count=64 | base64 -w 0

Whichever hash function you use, you will need to install the required
Chicken egg with one of the following commands:

    chicken-install -s tiger-hash  # for tiger
    chicken-install -s sha2        # for the SHA hashes

<code>double-check</code>, if present, causes Ugarit to perform extra
internal consistency checks during backups, which will detect bugs but
may slow things down.

<code>lzma</code> is the recommended compression option for
low-bandwidth backends or when space is tight, but it's very slow to
compress; deflate or no compression at all are better for fast local
vaults. To have no compression at all, just remove the
<code>(compression ...)</code> line entirely. Likewise, to use
compression, you need to install a Chicken egg:

       chicken-install -s z3       # for deflate
       chicken-install -s lzma     # for lzma

WARNING: The lzma egg is currently rather difficult to install, and
needs rewriting to fix this problem.

Likewise, the <code>(encryption ...)</code> line may be omitted to have no
encryption; the only currently supported algorithm is aes (in CBC
mode) with a key given in hex, as a passphrase (hashed to get a key),
or a passphrase read from the terminal on every run. The key may be
16, 24, or 32 bytes for 128-bit, 192-bit or 256-bit AES. To specify a
hex key, just supply it as a string, like so:

      (encryption aes "00112233445566778899AABBCCDDEEFF")

...for 128-bit AES,

      (encryption aes "00112233445566778899AABBCCDDEEFF0011223344556677")

...for 192-bit AES, or

      (encryption aes "00112233445566778899AABBCCDDEEFF00112233445566778899AABBCCDDEEFF")

...for 256-bit AES.

Alternatively, you can provide a passphrase, and specify how large a
key you want it turned into, like so:

      (encryption aes ([16|24|32] "We three kings of Orient are, one in a taxi one in a car, one on a scooter honking his hooter and smoking a fat cigar. Oh, star of wonder, star of light; star with royal dynamite"))

I would recommend that you generate a long passphrase from a secure
entropy source, such as:

   dd if=/dev/random bs=1 count=64 | base64 -w 0

Finally, the extra-paranoid can request that Ugarit prompt for a
passphrase on every run and hash it into a key of the specified
length, like so:

      (encryption aes ([16|24|32] prompt))

(note the lack of quotes around <code>prompt</code>, distinguishing it from a passphrase)

Please read the "Security model" section below for details on the
implications of different encryption setups.

Again, as it is an optional feature, to use encryption, you must
install the appropriate Chicken egg:

       chicken-install -s aes

A file cache, if enabled, significantly speeds up subsequent snapshots
of a filesystem tree. The file cache is a file (which Ugarit will
create if it doesn't already exist) mapping filenames to
(mtime,size,hash) tuples; as it scans the filesystem, if it finds a
file in the cache and the mtime and size have not changed, it will
assume it is already stored under the specified hash. This saves it
from having to read the entire file to hash it and then check if the
hash is present in the vault. In other words, if only a few files
have changed since the last snapshot, then snapshotting a directory
tree becomes an O(N) operation, where N is the number of files, rather
than an O(M) operation, where M is the total size of files involved.
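
The saving comes from a cheap (mtime, size) comparison instead of a
full read and hash; the check can be sketched like this (a
simplification of the idea, not Ugarit's actual code; GNU
<code>stat</code> syntax assumed):

```shell
# Sketch of the file-cache check (not Ugarit's actual code).
f=$(mktemp); echo "hello" > "$f"
# At snapshot time, record (mtime, size) alongside the file's hash:
sig=$(stat -c '%Y %s' "$f")
# At the next snapshot, if (mtime, size) are unchanged, reuse the stored
# hash instead of re-reading and re-hashing the whole file:
if [ "$(stat -c '%Y %s' "$f")" = "$sig" ]; then
  echo "cache hit: reuse stored hash"
else
  echo "cache miss: re-hash the file"
fi
```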

For example:

      (storage "ssh ugarit@spiderman 'backend-fs splitlog /mnt/ugarit-data /mnt/ugarit-metadata/metadata'")
      (hash tiger "i3HO7JeLCSa6Wa55uqTRqp4jppUYbXoxme7YpcHPnuoA+11ez9iOIA6B6eBIhZ0MbdLvvFZZWnRgJAzY8K2JBQ")
      (encryption aes (32 "FN9m34J4bbD3vhPqh6+4BjjXDSPYpuyskJX73T1t60PP0rPdC3AxlrjVn4YDyaFSbx5WRAn4JBr7SBn2PLyxJw"))
      (compression lzma)
      (file-cache "/var/ugarit/cache")

Be careful to put a set of parentheses around each configuration
entry. White space isn't significant, so feel free to indent things
and wrap them over lines if you want.

Keep copies of this file safe - you'll need it to do extractions!
Print a copy out and lock it in your fire safe! Ok, currently, you
might be able to recreate it if you remember where you put the
storage, but encryption keys and hash salts are harder to remember...

<h2>Your first backup</h2>

Think of a tag to identify the filesystem you're backing up. If it's
<code>/home</code> on the server <samp>gandalf</samp>, you might call it <samp>gandalf-home</samp>. If
it's the entire filesystem of the server <samp>bilbo</samp>, you might just call
it <samp>bilbo</samp>.

Then from your shell, run (as root):

      # ugarit snapshot <ugarit.conf> [-c] [-a] <tag> <path to root of filesystem>

For example, if we have a <code>ugarit.conf</code> in the current directory:

      # ugarit snapshot ugarit.conf -c localhost-etc /etc

Specify the <code>-c</code> flag if you want to store ctimes in the vault;
since it's impossible to restore ctimes when extracting from a
vault, doing this is useful only for informational purposes, so it's
not done by default. Similarly, atimes aren't stored in the vault
unless you specify <code>-a</code>, because otherwise, there will be a lot of
directory blocks uploaded on every snapshot, as the atime of every
file will have been changed by the previous snapshot - so with <code>-a</code>
specified, on every snapshot, every directory in your filesystem will
be uploaded! Ugarit will happily restore atimes if they are found in
a vault; their storage is made optional simply because uploading
them is costly and rarely useful.

<h2>Exploring the vault</h2>

Now you have a backup, you can explore the contents of the
vault. This need not be done as root, as long as you can read
<code>ugarit.conf</code>; however, if you want to extract files, run it as root
so the uids and gids can be set.

      $ ugarit explore ugarit.conf

This will put you into an interactive shell exploring a virtual
filesystem. The root directory contains an entry for every tag; if you
type <code>ls</code> you should see your tag listed, and within that
tag, you'll find a list of snapshots, in descending date order, with a
special entry <code>current</code> for the most recent
snapshot. Within a snapshot, you'll find the root directory of your
snapshot under <code>contents</code>, and the details of the snapshot itself in
<code>properties.sexpr</code>, and will be able to <code>cd</code> into
subdirectories, and so on:

      > <b>ls</b>
      localhost-etc/ <tag>
      > <b>cd localhost-etc</b>
      /localhost-etc> <b>ls</b>
      current/ <snapshot>
      2015-06-12 22:49:34/ <snapshot>
      2015-06-12 22:49:25/ <snapshot>
      /localhost-etc> <b>cd current</b>
      /localhost-etc/current> <b>ls</b>
      log.sexpr <file>
      properties.sexpr <inline>
      contents/ <dir>
      /localhost-etc/current> <b>cat properties.sexpr</b>
      ((previous . "a140e6dbe0a7a38f8b8c381323997c23e51a39e2593afb61")
       (mtime . 1434102574.0)
       (contents . "34eccf1f5141187e4209cfa354fdea749a0c3c1c4682ec86")
       (stats (blocks-stored . 12)
              (bytes-stored . 16889)
              (blocks-skipped . 50)
              (bytes-skipped . 6567341)
              (file-cache-hits . 0)
              (file-cache-bytes . 0))
       (log . "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
       (hostname . "ahe")
       (source-path . "/etc")
       (notes)
       (files . 112)
       (size . 6563588))
      /localhost-etc/current> <b>cd contents</b>
      /localhost-etc/current/contents> <b>ls</b>
      zoneinfo <symlink>
      vconsole.conf <symlink>
      udev/ <dir>
      tmpfiles.d/ <dir>
      systemd/ <dir>
      sysctl.d/ <dir>
      sudoers.tmp~ <file>
      sudoers <file>
      subuid <file>
      subgid <file>
      static <symlink>
      ssl/ <dir>
      ssh/ <dir>
      shells <symlink>
      shadow- <file>
      shadow <file>
      services <symlink>
      samba/ <dir>
      rpc <symlink>
      resolvconf.conf <symlink>
      resolv.conf <file>
      -- Press q then enter to stop or enter for more...
      <b>q</b>
      /localhost-etc/current/contents> <b>ls -ll resolv.conf</b>
      -rw-r--r--     0     0 [2015-05-23 23:22:41] 78B/-: resolv.conf
      key: #f
      contents: "e33ea1394cd2a67fe6caab9af99f66a4a1cc50e8929d3550"
      size: 78
      ctime: 1432419761.0

As well as exploring around, you can also extract files or directories
(or entire snapshots) by using the <code>get</code> command. Ugarit
will do its best to restore the metadata of files, subject to the
rights of the user you run it as.

Type <code>help</code> to get help in the interactive shell.

The interactive shell supports command-line editing, history and tab
completion for your convenience.

<h2>Extracting things directly</h2>

As well as using the interactive explore mode, it is also possible to
directly extract something from the vault, given a path.

Given the sample vault from the previous example, it would be possible
to extract the <code>resolv.conf</code> file with the following
command:

      ugarit extract ugarit.conf /localhost-etc/current/contents/resolv.conf

<h2>Forking tags</h2>

As mentioned above, you can fork a tag, creating two tags that
refer to the same snapshot and its history but that can then have
their own subsequent history of snapshots applied to each
independently, with the following command:

      $ ugarit fork <ugarit.conf> <existing tag> <new tag>

<h2>Merging tags</h2>

And you can also merge two or more tags into one. It's possible to
merge a bunch of tags to make an entirely new tag, or you can merge a
tag into an existing tag, by having the "output" tag also be one of
the "input" tags.

The command to do this is:

      $ ugarit merge <ugarit.conf> <output tag> <input tags...>

For instance, to import your classical music collection into your main
musical collection, you might do:

     $ ugarit merge ugarit.conf my-music my-music classical-music

Or if you want to create a new all-music archive from the archives
bobs-music and petes-music, you might do:

     $ ugarit merge ugarit.conf all-music bobs-music petes-music

<h2>Archive operations</h2>

<h3>Importing</h3>

To import some files into an archive, you must create a manifest file
listing them, and their metadata. The manifest can also list
metadata for the import as a whole, perhaps naming the source of the
files, or the reason for importing them.

The metadata for a file (or an import) is a series of named
properties. The value of a property can be any Scheme value, written
in Scheme syntax (with strings double-quoted unless they are to be
interpreted as symbols), but strings and numbers are the most useful
types.
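
For example, a few property values in this syntax (the property names
here are only illustrative):

```scheme
(dc:title = "Dummy")    ; a string value must be double-quoted
(rating = 5)            ; a number
(category = music)      ; an unquoted value is read as a symbol
```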

You can use whatever names you like for properties in metadata, but
there are some that the system applies automatically, and an informal
standard of sorts, which is documented in [docs/archive-schema.wiki].

You can produce a manifest file by hand, or use the Ugarit Manifest
Maker to produce one for you. To use the latter, install it like so:

      $ chicken-install ugarit-manifest-maker

And then running it, giving it any number of file and directory names
on the command line. When given directories, it will recursively scan
them to find all the files contained therein and put them in the
manifest; it will not put directories in the manifest, although it is
perfectly legal for you to do so when writing a manifest by hand. This
is because the manifest maker can't do much useful analysis on a
directory to suggest default metadata for it (so there isn't much
point in using it), and it's far more useful for it to make it easy
for you to import a large number of files individually by referencing
the directory containing them.

The manifest is sent to standard output, so you need to redirect it to
a file, like so:

      $ ugarit-manifest-maker ~/music > music.manifest

You can specify command-line options, as well. <code>-e PATTERN</code>
or <code>--exclude=PATTERN</code> introduces a glob pattern for files
to exclude from the manifest, and <code>-D KEY=VALUE</code> or
<code>--define=KEY=VALUE</code> provides a property to be added to
every file in the manifest (as opposed to an import property, that is
part of the metadata of the overall import). Note that
<code>VALUE</code> must be double-quoted if it's a string, as per
Scheme value syntax.

One might use this like so:

      $ ugarit-manifest-maker -e '*.txt' -D rating=5 ~/favourite-music > music.manifest

The manifest maker simplifies the writing of manifests for files, by
listing the files in manifest format along with useful metadata
extracted from the filename and the file itself. For supported file
types (currently, MP3 and OGG music files), it will even look inside
the file to extract metadata.

The manifest file it generates will contain lots of comments
mentioning things it couldn't automatically analyse (such as unknown
OGG/ID3 tags, or unknown types of files); and for metadata properties
it thinks might be relevant but can't automatically provide, it
suggests them with an empty property declaration, commented out. The
idea is that, after generating a manifest, you read it by hand in a
text editor to attempt to improve it.

<h4>The format of a manifest file</h4>

Manifest files have a relatively simple format. They are based on
Scheme s-expressions, so can contain comments. From any semicolon (not
in a string or otherwise quoted) to the end of the line is a comment,
and <code>#;</code> in front of something comments out that something.

Import metadata properties are specified like so:

     (KEY = VALUE)

...where, as usual, <code>VALUE</code> must be double-quoted if it's a
string.

Files to import, with their metadata, are specified like so:

     (object "PATH OF FILE TO IMPORT"
        (KEY = VALUE)
        (KEY = VALUE)...
     )

The closing parenthesis need not be on a line of its own; it's
conventionally placed after the closing parenthesis of the final
property.

Ugarit, when importing the files in the manifest, will add the
following properties if they are not already specified:

<dl>
<dt><code>import-path</code></dt>
<dd>The path the file was imported from</dd>

<dt><code>dc:format</code></dt>
<dd>A guess at the file's MIME type, based on the extension</dd>

<dt><code>mtime</code></dt>
<dd>The file's modification time (as the number of seconds since the
UNIX epoch)</dd>

<dt><code>ctime</code></dt>
<dd>The file's change time (as the number of seconds since the UNIX
epoch)</dd>

<dt><code>filename</code></dt>
<dd>The name of the file, stripped of any directory components, and
including the extension.</dd>

</dl>

The following properties are placed in the import metadata,
automatically:

<dl>
<dt><code>hostname</code></dt>
<dd>The hostname the import was performed on.</dd>

<dt><code>manifest-path</code></dt>
<dd>The path to the manifest file used for the import.</dd>

<dt><code>mtime</code></dt>
<dd>The time (in seconds since the UNIX epoch) at which the import was
committed.</dd>

<dt><code>stats</code></dt>
<dd>A Scheme alist of statistics about the import (number of
files/blocks uploaded, etc).</dd>
</dl>

So, to wrap that all up, here's a sample import manifest file:

<verbatim>
(notes = "A bunch of old CDs I've finally ripped")

(object "/home/alaric/newrip/track01.mp3"
  (filename = "track01.mp3")
  (dc:format = "audio/mpeg")

  (dc:publisher = "Go! Beat Records")
  (dc:created = "1994")
  (dc:contributor = "Portishead")
  (dc:subject = "Trip-Hop")
  (superset:size = 1)
  (superset:index = 1)
  (set:title = "Dummy")
  (set:size = 11)
  (set:index = 1)
  (dc:creator = "Portishead")
  (dc:title = "Wandering Star")

  (mtime = 1428962299.0)
  (ctime = 1428962299.0)
  (file-size = 4703055))

;;... and so on, for ten more MP3s on this CD, then several other CDs...
</verbatim>

<h4>Actually importing a manifest</h4>

Well, when you finally have a manifest file, importing it is easy:

      $ ugarit import <ugarit.conf> <archive tag> <manifest path>

<h4>How do I change the metadata of an already-imported file?</h4>

That's easy; the "current" metadata of a file is the metadata of its
most recent import. Just import the file again, in a new manifest, with new
metadata, and it will overwrite the old. However, the old metadata is
still preserved in the archive's history; tags forked from the archive
tag before the second import will still see the original state of the
archive, by design.

<h3>Exploring</h3>

Archives are visible in the explore interface. For instance, an import
of some music I did looks like this:

<pre>
> <b>ls</b>
localhost-etc/ &lt;tag>
archive-tag/ &lt;tag>
> <b>cd archive-tag</b>
/archive-tag> <b>ls</b>
history/ &lt;archive-history>
/archive-tag> <b>cd history</b>
/archive-tag/history> <b>ls</b>
2015-06-12 22:53:13/ &lt;import>
/archive-tag/history> <b>cd 2015-06-12 22:53:13</b>
/archive-tag/history/2015-06-12 22:53:13> <b>ls</b>
log.sexpr &lt;file>
properties.sexpr &lt;inline>
manifest/ &lt;import-manifest>
/archive-tag/history/2015-06-12 22:53:13> <b>cat properties.sexpr</b>
((stats (blocks-stored . 2046)
        (bytes-stored . 1815317503)
        (blocks-skipped . 9)
        (bytes-skipped . 8388608)
        (file-cache-hits . 0)
        (file-cache-bytes . 0))
 (log . "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
 (mtime . 1434135993.0)
 (contents . "fcdd5b996914fdcac1e8a6cfbc67663e08f6eaf0cc952e21")
 (hostname . "ahe")
 (notes . "A bunch of music, imported as a demo")
 (manifest-path . "/home/alaric/tmp/test.manifest"))
/archive-tag/history/2015-06-12 22:53:13> <b>cd manifest</b>
/archive-tag/history/2015-06-12 22:53:13/manifest> <b>ls</b>
1d4269099189234eefeb80b95370eaf280730cf4d591004d:03 The Lemon Song.mp3 &lt;file>
7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3 &lt;file>
64092fa12c2800dda474b41e5ebe8c948f39a59ee91c120b:09 How Many More Times.mp3 &lt;file>
1d79148d1e1e8947c50b44cf2d5690588787af328e82eeef:2-07 Going to California.mp3 &lt;file>
e3685148d0d12213074a9fdb94a00e05282aeabe77fa60d5:1-01 You Shook Me.mp3 &lt;file>
d73904f371af8d7ca2af1076881230f2dc1c2cf82416880a:03 Strangers.mp3 &lt;file>
9c5a0efb7d397180a1e8d42356d8f04c6c26a83d3b05d34a:09 Uptight.mp3 &lt;file>
01a069aec2e731e18fcdd4ecb0e424f346a2f0e16910f5e9:07 Numb.mp3 &lt;file>
7ea1ab7fbd525c40e21d6dd25130e8c70289ad56c09375b0:08 She.mp3 &lt;file>
009dacd8f3185b7caeb47050002e584ab86d08cf9e9aceec:1-03 Communication Breakdown.mp3 &lt;file>
26d264d629e22709f664ed891741f690900d45cd4fd44326:1-03 Dazed and Confused.mp3 &lt;file>
d879761195faf08e4e95a5a2398ea6eefb79920710bfeab6:1-10 Band Introduction _ How Many More Times.mp3 &lt;file>
83244601db42677d110fc8522c6a3cbbc1f22966a779f876:06 All My Love.mp3 &lt;file>
5eebee9a2ad79d04e4f69e9e2a92c4e0a8d5f21e670f89da:07 Tangerine.mp3 &lt;file>
dd6f1203b5973ecd00d2c0cee18087030490230727591746:2-08 That's the Way.mp3 &lt;file>
c0acea15aa27a6dd1bcaff1c13d4f3d741a40a46abeca3fc:04 The Crunge.mp3 &lt;file>
ea7727ad07c6c82e5c9c7218ee1b059cd78264c131c1438d:1-02 I Can't Quit You Baby.mp3 &lt;file>
10fda5f46b8f505ca965bcaf12252eedf5ab44514236f892:14 F.O.D..mp3 &lt;file>
a99ca9af5a83bde1c676c388dc273051defa88756df26e95:1-03 Good Times Bad Times.mp3 &lt;file>
b5d7cfe9808c7fc0dedbd656d44e4c56159cbd3c2ed963bb:1-15 Stairway to Heaven.mp3 &lt;file>
79c87e3c49ffdac175c95aae071f63d3a9efdf2ddb84998c:08.Batmilk.ogg &lt;file>
-- Press q then enter to stop or enter for more...
q
/archive-tag/history/2015-06-12 22:53:13/manifest> <b>ls -ll 7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3</b>
-r--------     -     - [2015-04-13 21:46:39] -/-: 7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3
key: #f
contents: "7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382"
import-path: "/home/alaric/archive/sorted-music/Led Zeppelin/Led Zeppelin/04 Dazed and Confused.mp3"
filename: "04 Dazed and Confused.mp3"
dc:format: "audio/mpeg"
dc:publisher: "Atlantic"
dc:subject: "Classic Rock"
dc:title: "Dazed and Confused"
dc:creator: "Led Zeppelin"
dc:created: "1982"
dc:contributor: "Led Zeppelin"
set:title: "Led Zeppelin"
set:index: 4
set:size: 9
superset:index: 1
superset:size: 1
ctime: 1428957999.0
file-size: 15448903
</pre>

<h3>Searching</h3>

However, the explore interface to an archive is far from pleasant. You
need to go to the correct import, and find your file by name, and then
identify it by a long name composed of its hash and the original
filename in order to view its properties or extract it.

I hope to add property-based searching to explore mode in future
(which is why you need to go into a <code>history</code> directory
within the archive directory, as other ways of exploring the archive
will appear alongside). This will be particularly useful when the
explore-mode virtual filesystem is mounted over 9P!

However, even that interface, being constrained to look like a
filesystem, will be limited. The <code>ugarit</code> command-line tool
provides a very powerful search interface that exposes the full power
of the archive metadata.

<h4>Metadata filters</h4>

Files (and directories) in an archive can be searched for using
"metadata filters", which are descriptions of what you're looking for
that the computer can understand. They are represented as Scheme
s-expressions, and can be made up of the following components:

<dl>
<dt><code>#t</code></dt>
<dd>This filter matches everything. It's not very useful.</dd>

<dt><code>#f</code></dt>
<dd>This filter matches nothing. It's not very useful.</dd>

<dt><code>(and FILTER FILTER...)</code></dt>
<dd>This filter matches files for which all of the inner filters match.</dd>

<dt><code>(or FILTER FILTER...)</code></dt>
<dd>This filter matches files for which any of the inner filters match.</dd>

<dt><code>(not FILTER)</code></dt>
<dd>This filter matches files which do not match the inner filter.</dd>

<dt><code>(= ($ PROP) VALUE)</code></dt>
<dd>This filter matches files which have the given
<code>PROP</code>erty equal to that <code>VALUE</code> in their metadata.</dd>

<dt><code>(= key HASH)</code></dt>
<dd>This filter matches the file with the given hash.</dd>

<dt><code>(= ($import PROP) VALUE)</code></dt>
<dd>This filter matches files which have the given
<code>PROP</code>erty equal to that <code>VALUE</code> in the metadata
of the import that last imported them.</dd>
</dl>
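
These combine as you'd expect. For instance, a filter for files
credited to Led Zeppelin but excluding a hypothetical
<code>dc:subject</code> of "Live" might be written as (a sketch; the
property values here are invented):

```scheme
; Hypothetical compound filter: creator is Led Zeppelin, but exclude
; anything whose dc:subject property is "Live".
(and (= ($ dc:creator) "Led Zeppelin")
     (not (= ($ dc:subject) "Live")))
```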

<h4>Searching an archive</h4>

For a start, you can search for files matching a given metadata filter
in a given archive. This is done with:

      $ ugarit search <ugarit.conf> <archive tag> <filter>

For instance, let's look for music by Led Zeppelin:

      $ ugarit search ugarit.conf music '(or
        (= ($ dc:creator) "Led Zeppelin")
        (= ($ dc:contributor) "Led Zeppelin"))'

The result looks like the explore-mode view of an archive manifest,
listing the file's hash followed by its title and extension:

<verbatim>
7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3
834a1619a59835e0c27b22801e3c829b40be583dadd19770:2-08 No Quarter.mp3
9e8bc4954838bd9c671f275eb48595089257185750d63894:1-12 I Can't Quit You Baby.mp3
6742b3bebcdd9cae5ec5403c585935403fa74d16ed076cf2:02 Friends (1).mp3
07d161f4bd684e283f7f2cf26e0b732157a8e95ef66939c3:05 Carouselambra.mp3
[...]
</verbatim>

What of all our lovely metadata? You can view it by adding the word
"verbose" to the end of the command line; this final argument selects
an alternate output format:

      $ ugarit search ugarit.conf music '(or
        (= ($ dc:creator) "Led Zeppelin")
        (= ($ dc:contributor) "Led Zeppelin"))' verbose

Now the output looks like:

<verbatim>
object a444ff6ef807b080b536155f58d246d633cab4a0eabef5bf
        (ctime = 1428958660.0)
        (dc:contributor = "Led Zeppelin")
        (dc:created = "2008")
        (dc:creator = "Led Zeppelin")
[... all the usual file properties omitted ...]
        import a43f7a7268ee8b18381c20d7573add5dbf8781f81377279c
                (stats = ((blocks-stored . 2046) (bytes-stored . 1815317503) (blocks-skipped . 9) (bytes-skipped . 8388608) (file-cache-hits . 0) (file-cache-bytes . 0)))
                (log = "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
[... all the usual import properties omitted ...]
object b4cadf48b2c07ccf0303fc4064b292cb222980b0d4223641
        (ctime = 1428958673.0)
        (dc:contributor = "Led Zeppelin")
        (dc:created = "2008")
        (dc:creator = "Led Zeppelin")
        (dc:creator = "Jimmy Page/John Paul Jones/Robert Plant")
[...and so on...]
</verbatim>

As you can see, it lists the hash of each file, its metadata, the hash
of the import that last imported it, and the metadata of that import.

That's quite verbose, so you'd probably want to feed it as input to
another program that does something nicer with it; but this format is
laid out for human reading, not for machine parsing. Thankfully, we
have other formats for that: <code>alist</code> and
<code>alist-with-imports</code>.

They look like this:

      $ ugarit search ugarit.conf music '(or
        (= ($ dc:creator) "Led Zeppelin")
        (= ($ dc:contributor) "Led Zeppelin"))' alist

This outputs one Scheme s-expression list per match, the first element
of which is the hash as a string, the rest of which is an alist of properties:

<verbatim>
("7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382"
 (ctime . 1428957999.0)
 (dc:contributor . "Led Zeppelin")
 (dc:created . "1982")
 (dc:creator . "Led Zeppelin")
[... elided file properties ...]
 (superset:index . 1)
 (superset:size . 1))
("77c960d09eb21ed72e434ddcde0bd3781a4f3d6ee7a6eb66"
 (ctime . 1428958981.0)
 (dc:contributor . "Led Zeppelin")
[...]
</verbatim>

      $ ugarit search ugarit.conf music '(or
        (= ($ dc:creator) "Led Zeppelin")
        (= ($ dc:contributor) "Led Zeppelin"))' alist-with-imports

This outputs one s-expression list per match, with four elements. The
first is the key string, the second is an alist of file properties,
the third is the import's hash, and the last is an alist of the
import's properties. It looks like:

<verbatim>
("64fa08a0080aee6ef501c408fd44dfcc634cfcafd8006fc4"
 ((ctime . 1428958683.0)
  (dc:contributor . "Led Zeppelin")
  (dc:created . "2008")
  (dc:creator . "Led Zeppelin")
[... elided file properties ...]
  (superset:index . 1)
  (superset:size . 1))
 "a43f7a7268ee8b18381c20d7573add5dbf8781f81377279c"
 ((stats (blocks-stored . 2046)
         (bytes-stored . 1815317503)
[... elided manifest properties ...]
  (manifest-path . "test.manifest")))
("4cd56f916a63399b252976e842dcae0b87f058b5a60c93a4"
 ((ctime . 1428958437.0)
  (dc:contributor . "Led Zeppelin")
[...]
</verbatim>

And finally, you might just want to get the hashes of matching files
(which are particularly useful for extraction operations, which we'll
come to next). To do this, specify a format of "keys", which outputs
one line per match, containing just the hash:

      $ ugarit search ugarit.conf music '(or
        (= ($ dc:creator) "Led Zeppelin")
        (= ($ dc:contributor) "Led Zeppelin"))' keys

<verbatim>
ce6f6484337de772de9313038cb25d1b16e28028136cc291
6af5c664cbfa1acb22a377e97aee35d94c0fc003d239dd0c
92e91e79b384478b5aab31bf1b2ff9e25e7e2c4b48575185
6ddb9a41d4968468a904f05ecf7e0e73d2c7c7ad76bc394b
a074dddcef67cd93d92c6ffce845894aa56594674023f6e1
4f65f735bbb00a6fda4bc887b370b3160f55e5e07ec37ffa
97cc8b8ba70c39387fc08ef62311b751aea4340d636eb421
72358dbe3eb60da42eadcf6de325b2a6686f4e17ea41fa60
[...]
</verbatim>

However, to write filter expressions, you need to know what properties
are available to search on. You might remember them, or go for
standard properties, or look at existing files in verbose mode to find
some; but you can also just ask Ugarit what properties it has in an
archive, like so:

      $ ugarit search-props <ugarit.conf> <archive tag>

You can even ask what properties are available for files matching an
existing filter:

      $ ugarit search-props <ugarit.conf> <archive tag> <filter>

This is useful if you're interested in further narrowing down a
filter, and so only care about properties that files already matching
that filter have.

For a bunch of music files imported with the Ugarit Manifest Maker,
you can expect to see something like this:

<verbatim>
ctime
dc:contributor
dc:created
dc:creator
dc:format
dc:publisher
dc:subject
dc:title
file-size
filename
import-path
mtime
set:index
set:size
set:title
superset:index
superset:size
</verbatim>

Now that you know what properties to search, you'll want to know what
values to look for. Again, Ugarit has a command to query the available
values of any given property:

      $ ugarit search-values <ugarit.conf> <archive tag> <property>

And you can limit that just to files matching a given filter:

      $ ugarit search-values <ugarit.conf> <archive tag> <filter> <property>

The resulting list of values is ordered by popularity, so the most
widely-used values are listed first. Let's see what genres of music
were in the sample of music files I imported:

      $ ugarit search-values test.conf archive-tag dc:subject

The result is:

<verbatim>
Classic Rock
Alternative & Punk
Electronic
Trip-Hop
</verbatim>

OK, let's now use a filter to find out which artists
(<code>dc:creator</code>) I have that made Trip-Hop music (what even
IS that?):

      $ ugarit search-values test.conf archive-tag \
         '(= ($ dc:subject) "Trip-Hop")' \
         dc:creator

The result is:

<verbatim>
Portishead
</verbatim>

Ah, OK, now I know what "Trip-Hop" is.

<h3>Extracting</h3>

All this searching is lovely, but what it gets us, in the end, is a
bunch of file hashes. Perhaps we might want to actually play some
music, or look at a photo, or something. To do that, we need to
extract from the archive.

We've already seen the contents of an archive in the explore mode
virtual filesystem, so we could go into the archive history, find the
import, go into the manifest, pick the file out there, and use
<code>get</code> to extract it, but that would be yucky. Thankfully,
we have a command-line interface to get things from archives, in one
of two ways.

Firstly, we can extract a file (or a directory tree) from an archive,
out into the local filesystem:

      $ ugarit archive-extract <ugarit.conf> <archive tag> <hash> <target>

The "target" is the name to give it in the local filesystem. We could
pull out that Led Zeppelin song from our search results above, like so:

      $ ugarit archive-extract test.conf archive-tag \
         ce6f6484337de772de9313038cb25d1b16e28028136cc291 foo.mp3

We now have a foo.mp3 file in the current directory.

However, sometimes it would be nicer to have it streamed to standard
output, which can be done like so:

      $ ugarit archive-stream <ugarit.conf> <archive tag> <hash>

This lets us write a command such as:

      $ ugarit archive-stream test.conf archive-tag \
         ce6f6484337de772de9313038cb25d1b16e28028136cc291 | mpg123 -

...to play it in real time.
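
Because the "keys" search format emits one hash per line, it combines
naturally with extraction in a shell loop. As a sketch (assuming the
<code>ugarit.conf</code> and <code>music</code> archive tag from the
search examples above, and hypothetical output filenames):

      $ ugarit search ugarit.conf music \
          '(= ($ dc:creator) "Portishead")' keys | \
        while read hash; do
          ugarit archive-extract ugarit.conf music "$hash" "$hash.mp3"
        done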

<h2>Storage administration</h2>

Each backend offers a number of commands for administering the storage
underlying vaults. These are accessible via the
<code>ugarit-storage-admin</code> command-line interface.

To use it, run it with the following command:

      $ ugarit-storage-admin '<vault identifier>'

The available commands differ between backends, but all backends
support the <code>info</code> and <code>help</code> commands, which
give basic information about the vault, and list all available
commands, respectively. Some offer a <code>stats</code> command that
examines the vault state to give interesting statistics, but which may
be a time-consuming operation.

<h3>Administering <code>splitlog</code> storages</h3>

The splitlog backend offers a wide selection of administrative
commands. See the <code>help</code> command on a splitlog vault for
details. The following commands are available:

<dl>

<dt><code>help</code></dt>
<dd>List the available commands.</dd>

<dt><code>info</code></dt>
<dd>List some basic information about the storage.</dd>

<dt><code>stats</code></dt>
<dd>Examine the metadata to provide overall statistics about the
archive. This may be a time-consuming operation on large
storages.</dd>

<dt><code>set-block-size! BYTES</code></dt>
<dd>Sets the block size to the given number of bytes. This will affect
new blocks written to the storage, and leave existing blocks
untouched, even if they are larger than the new block size.</dd>

<dt><code>set-max-logfile-size! BYTES</code></dt>
<dd>Sets the size at which a log file is finished and a new one
started (existing log files will be untouched; this only affects new
log files).</dd>

<dt><code>set-commit-interval! UPDATES</code></dt>
<dd>Sets the frequency of automatic syncing of the storage
state to disk. Lowering this harms performance when writing to the
storage, but decreases the number of in-progress block writes that
can fail in a crash.</dd>

<dt><code>write-protect!</code></dt>
<dd>Disables updating of the storage.</dd>

<dt><code>write-unprotect!</code></dt>
<dd>Re-enables updating of the storage.</dd>

<dt><code>reindex!</code></dt>
<dd>Reindex the storage, rebuilding the block and tag state from the
contents of the log. If the metadata file is damaged or lost,
reindexing can rebuild it (although any configuration changes made
via other admin commands will need manually repeating as they are
not logged).</dd>
</dl>

<h3>Administering <code>sqlite</code> storages</h3>

The sqlite backend has a similar administrative interface to the
splitlog backend, except that it does not have log files, so lacks the
<code>set-max-logfile-size!</code> and <code>reindex!</code> commands.

<h3>Administering <code>cache</code> storages</h3>

The cache backend provides a minimalistic interface:

<dl>

<dt><code>help</code></dt>
<dd>List the available commands.</dd>

<dt><code>info</code></dt>
<dd>List some basic information about the storage.</dd>

<dt><code>stats</code></dt>
<dd>Report on how many entries are in the cache.</dd>

<dt><code>clear!</code></dt>
<dd>Clears the cache, dropping all the entries in it.</dd>

</dl>

<h2><code>.ugarit</code> files</h2>

By default, Ugarit will vault everything it finds in the filesystem
tree you tell it to snapshot. However, this might not always be
desired; so we provide the facility to override this with <code>.ugarit</code>
files, or global rules in your <code>.conf</code> file.

Note: The syntax of these files is provisional, as I want to
experiment with usability, as the current syntax is ugly. So please
don't be surprised if the format changes in incompatible ways in
subsequent versions!

In quick summary, if you want to ignore all files or directories
matching a glob in the current directory and below, put the following
in a <code>.ugarit</code> file in that directory:

      (* (glob "*~") exclude)

You can write quite complex expressions as well as just globs. The
full set of rules is:

  *  <code>(glob "<em>pattern</em>")</code> matches files and directories whose
  names match the glob pattern

  *  <code>(name "<em>name</em>")</code> matches files and directories with exactly
  that name (useful for files called <code>*</code>...)

  *  <code>(modified-within <em>number</em> seconds)</code> matches files and
  directories modified within the given number of seconds

  *  <code>(modified-within <em>number</em> minutes)</code> matches files and
  directories modified within the given number of minutes

  *  <code>(modified-within <em>number</em> hours)</code> matches files and directories
  modified within the given number of hours

  *  <code>(modified-within <em>number</em> days)</code> matches files and directories
  modified within the given number of days

  *  <code>(not <em>rule</em>)</code> matches files and directories that do not match
  the given rule

  *  <code>(and <em>rule</em> <em>rule...</em>)</code> matches files and directories that match
  all the given rules

  *  <code>(or <em>rule</em> <em>rule...</em>)</code> matches files and directories that match
  any of the given rules
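
These rules compose, so a single <code>.ugarit</code> entry can
express quite precise policies. As a purely illustrative example, the
following would exclude editor backup files, and anything that hasn't
been modified in the last 30 days:

      (* (or (glob "*~")
             (not (modified-within 30 days)))
         exclude)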

Also, you can override a previous exclusion with an explicit include
in a lower-level directory:

    (* (glob "*~") include)

You can bind rules to specific directories, rather than to "this
directory and all beneath it", by specifying an absolute or relative
path instead of the `*`:

    ("/etc" (name "passwd") exclude)

If you use a relative path, it's taken relative to the directory of
the <code>.ugarit</code> file.

You can also put some rules in your <code>.conf</code> file, although relative
paths are illegal there, by adding lines of this form to the file:

    (rule * (glob "*~") exclude)
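
Absolute paths work in the configuration file too, so (as a sketch,
assuming the same rule forms as in <code>.ugarit</code> files) a
global rule scoped to a directory might look like:

    (rule "/etc" (name "passwd") exclude)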

<h1>Questions and Answers</h1>

<h2>What happens if a snapshot is interrupted?</h2>

Nothing! Any blocks that have been uploaded remain in the vault, but
the snapshot is only added to the tag once the entire filesystem has
been snapshotted. So just start the snapshot again. Any files that
have already been uploaded will not need to be uploaded again, so the
second snapshot should proceed quickly to the point where it failed
before, and continue from there.

Unless the vault ends up with a partially-uploaded corrupted block
due to being interrupted during upload, you'll be fine. The filesystem
backend has been written to avoid this by writing the block to a file
with the wrong name, then renaming it to the correct name when it's
entirely uploaded.

Actually, there is *one* caveat: blocks that were uploaded, but never
make it into a finished snapshot, will be marked as "referenced" but
there's no snapshot to delete to un-reference them, so they'll never
be removed when you delete snapshots. (Not that snapshot deletion is
implemented yet, mind). If this becomes a problem for people, we could
write a "garbage collect" tool that regenerates the reference counts
in a vault, leading to unused blocks (with a zero refcount) being
unlinked.

<h2>Should I share a single large vault between all my filesystems?</h2>

I think so. Using a single large vault means that blocks shared
between servers - eg, software installed from packages and that sort
of thing - will only ever need to be uploaded once, saving storage
space and upload bandwidth. However, do not share a vault between
servers that do not mutually trust each other, as they can all update
the same tags, so can meddle with each other's snapshots - and read
each other's snapshots.

<h3>CAVEAT</h3>

It's not currently safe to run multiple concurrent snapshots against
the same splitlog backend; this will be fixed soon, however.

<h1>Security model</h1>

I have designed and implemented Ugarit to be able to handle cases
where the actual vault storage is not entirely trusted.

However, security involves tradeoffs, and Ugarit is configurable in
ways that affect its resistance to different kinds of attacks. Here I
will list different kinds of attack and explain how Ugarit can deal
with them, and how you need to configure it to gain that
protection.

<h2>Vault snoopers</h2>

This might be somebody who can intercept Ugarit's communication with
the vault at any point, or who can read the vault itself at their
leisure.

Ugarit's splitlog backend creates files with "rw-------" permissions
out of the box to try and prevent this. This is a pain for people who
want to share vaults between UIDs, but we can add a configuration
option to override this if that becomes a problem.

<h3>Reading your data</h3>

If you enable encryption, then all the blocks sent to the vault are
encrypted using a secret key stored in your Ugarit configuration
file. As long as that configuration file is kept safe, and the AES
algorithm is secure, then attackers who can snoop the vault cannot
decode your data blocks. Enabling compression will also help, as the
blocks are compressed before being encrypted, which is thought to make
cryptographic analysis harder.

Recommendations: Use compression and encryption when there is a risk
of vault snooping. Keep your Ugarit configuration file safe using
UNIX file permissions (make it readable only by root), and maybe store
it on a removable device that's only plugged in when
required. Alternatively, use the "prompt" passphrase option, and be
prompted for a passphrase every time you run Ugarit, so it isn't
stored on disk anywhere.

<h3>Looking for known hashes</h3>

A block is identified by the hash of its content (before compression
and encryption). If an attacker was trying to find people who own a
particular file (perhaps a piece of subversive literature), they could
search Ugarit vaults for its hash.

However, Ugarit has the option to "key" the hash with a "salt" stored
in the Ugarit configuration file. This means that the hashes used are
actually a hash of the block's contents *and* the salt you supply. If
you do this with a random salt that you keep secret, then attackers
can't check your vault for known content just by comparing the hashes.

Recommendations: Provide a secret string to your hash function in your
Ugarit configuration file. Keep the Ugarit configuration file safe, as
per the advice in the previous point.

<h2>Vault modifiers</h2>

These folks can modify Ugarit's writes into the vault, its reads
back from the vault, or can modify the vault itself at their leisure.

Modifying an encrypted block without knowing the encryption key can at
worst be a denial of service, corrupting the block in an unknown
way. An attacker who knows the encryption key could replace a block
with valid-seeming but incorrect content. In the worst case, this
could exploit a bug in the decompression engine, causing a crash or
even an exploit of the Ugarit process itself (thereby gaining the
powers of a process inspector, as documented below). We can but hope
that the decompression engine is robust. Exploits of the decryption
engine, or other parts of Ugarit, are less likely due to the nature of
the operations performed upon them.

However, if a block is modified, then when Ugarit reads it back, the
hash will no longer match the hash Ugarit requested, which will be
detected and an error reported. The hash is checked after
decryption and decompression, so this check does not protect us
against exploits of the decompression engine.

This protection is only afforded when the hash Ugarit asks for is not
tampered with. Most hashes are obtained from within other blocks,
which are therefore safe unless that block has been tampered with; the
nature of the hash tree conveys the trust in the hashes up to the
root. The root hashes are stored in the vault as "tags", which a
vault modifier could alter at will. Therefore, the tags cannot be
trusted if somebody might modify the vault. This is why Ugarit
prints out the snapshot hash and the root directory hash after
performing a snapshot, so you can record them securely outside of the
vault.

The most likely threat posed by vault modifiers is that they could
simply corrupt or delete all of your vault, without needing to know
any encryption keys.

Recommendations: Secure your vaults against modifiers, by whatever
means possible. If vault modifiers are still a potential threat,
write down a log of your root directory hashes from each snapshot, and keep
it safe. When extracting your backups, use the <code>ls -ll</code> command in the
interface to check the "contents" hash of your snapshots, and check
they match the root directory hash you expect.

<h2>Process inspectors</h2>

These folks can attach debuggers or similar tools to running
processes, such as Ugarit itself.

Ugarit backend processes only see encrypted data, so people who can
attach to that process gain the powers of vault snoopers and
modifiers, and the same conditions apply.

People who can attach to the Ugarit process itself, however, will see
the original unencrypted content of your filesystem, and will have
full access to the encryption keys and hashing keys stored in your
Ugarit configuration. When Ugarit is running with sufficient
permissions to restore backups, they will be able to intercept and
modify the data as it comes out, and probably gain total write access
to your entire filesystem in the process.

Recommendations: Ensure that Ugarit does not run under the same user
ID as untrusted software. In many cases it will need to run as root in
order to gain unfettered access to read the filesystems it is backing
up, or to restore the ownership of files. However, when all the files
it backs up are world-readable, it could run as an untrusted user for
backups, and where file ownership is trivially reconstructible, it can
do restores as a limited user, too.

<h2>Attackers in the source filesystem</h2>

These folks create files that Ugarit will back up one day. By having
write access to your filesystem, they already have some level of
power, and standard Unix security practices such as storage quotas
should be used to control them. They may be people with logins on your
box or, more subtly, people who can cause servers to write files;
somebody who sends an email to your mailserver will probably cause
that message to be written to queue files, as will people who can
upload files via any means.

Such attackers might use up your available storage by creating large
files. This creates a problem in the actual filesystem, but that
problem can be fixed by deleting the files. If those files get
stored into Ugarit, then they are a part of that snapshot. If you
are using a backend that supports deletion, then (when I implement
snapshot deletion in the user interface) you could delete that entire
snapshot to recover the wasted space, but that is a rather serious
operation.

More insidiously, such attackers might attempt to abuse a hash
collision in order to fool the vault. If they have a way of creating
a file that, for instance, has the same hash as your shadow password
file, then Ugarit will think that it already has that file when it
attempts to snapshot it, and store a reference to the existing
file. If that snapshot is restored, then they will receive a copy of
your shadow password file. Similarly, if they can predict a future
hash of your shadow password file, and create a shadow password file
of their own (perhaps one giving them a root account with a known
password) with that hash, they can then wait for the real shadow
password file to have that hash. If the system is later restored from
that snapshot, then their chosen content will appear in the shadow
password file. However, doing this requires a very fundamental break
of the hash function being used.

Recommendations: Think carefully about who has write access to your
filesystems, directly or indirectly via a network service that stores
received data to disk. Enforce quotas where appropriate, and consider
not backing up "queue directories" where untrusted content might
appear; migrate incoming content that passes acceptance tests to an
area that is backed up. If necessary, the queue might be backed up to
a non-snapshotting system, such as rsyncing to another server, so that
any excessive files that appear in there are removed from the backup
in due course, while still affording protection.

<h1>Acknowledgements</h1>

The Ugarit implementation contained herein is the work of Alaric
Snell-Pym and Christian Kellermann, with advice, ideas, encouragement
and guidance from many.






<center><img src="https://www.kitten-technologies.co.uk/project/ugarit/doc/trunk/artwork/logo.png" /></center>

<h1>Introduction</h1>

Ugarit is a backup/archival system based around content-addressable
storage. [./docs/intro.wiki|Learn more...]

<h1>News</h1>

<p>Development priorities are: Performance, better error handling, and
fixing bugs! After I've cleaned house a little, I'll be focussing on
replicated backend storage (ticket [f1f2ce8cdc]), as I now have a
cluster of storage devices at home.</p>

<ul>

<li>2015-06-12: [./docs/release-2.0.wiki|Version 2.0] is released, containing rudimentary archive
mode, plus many minor improvements! See the release notes at the
bottom for more details.</li>

<li>2014-11-02: Chicken itself has gained
[http://code.call-cc.org/cgi-bin/gitweb.cgi?p=chicken-core.git;a=commit;h=a0ce0b4cb4155754c1a304c0d8b15276b11b8cd2|significantly
faster byte-vector I/O]. This is only on the trunk at the time of
writing; I look forward to it being in a formal release, as it sped up
Ugarit snapshot benchmarks (dumping a 256MiB file into an sqlite
backend) by a factor of twenty-something.</li>

<li>2014-02-21: User [http://rmm.meta.ph/|Rommel Martinez] has written
[http://rmm.meta.ph/blog/2014/02/21/an-introduction-to-ugarit/|An introduction to Ugarit]!</li>

</ul>
<h1>Documentation</h1>
  *  [./docs/intro.wiki|Introduction to Ugarit]
  *  [./docs/installation.wiki|Installation and Configuration]
  *  [./docs/commands.wiki|Command reference]
  *  [./docs/storage-admin.wiki|Storage backend administration]
  *  [./docs/dot-ugarit.wiki|Fine-tuning snapshots with <code>.ugarit</code> files]
  *  [./docs/archive-schema.wiki|Archive metadata schema]
  *  [./docs/faq.wiki|Frequently Asked Questions]
  *  [./docs/security.wiki|Security guide]

<h1>Acknowledgements</h1>

The Ugarit implementation contained herein is the work of Alaric
Snell-Pym and Christian Kellermann, with advice, ideas, encouragement
and guidance from many.

Thanks to the early adopters who brought me useful feedback, too!

And I'd like to thank my wife for putting up with me spending several
evenings and weekends and holiday days working on this thing...

<h1>Version history</h1>

  *  2.0: Archival mode [dae5e21ffc], and to support its integration
     into Ugarit, implemented typed tags [08bf026f5a], displaying tag
     types in the VFS [30054df0b6], refactoring the Ugarit internals
     [5fa161239c], made the storage of logs in the vault better
     [68bb75789f], made it possible to view logs from within the VFS
     [4e3673e0fe], supported hidden tags [cf5ef4691c], recording
     configuration information in the vault (and providing instant
     notification if your vault hashing/encryption setup is incorrect,
     thanks to a clever idea by Andy Bennett) [0500d282fc], rearranged
     how local caching is handled [b5911d321a], and added support for
     the history of a snapshot or archive tag to have arbitrary
     branches and merges [a987e28fef], which (as a side-effect)
     improved the performance of running "ls" in long snapshot
     histories [fcf8bc942a]. Also added an sqlite backend
     [8719dfb84f], which makes testing easier but is useful in its own
     right as it's fully-featured and crash-safe, while storing the
     vault in a single file; and improved the appearance of the
     explore mode ls command, as the VFS layout has become more
     complex with the new log/properties views and all the archive
     mode stuff.

  *  1.0.9:  More humane display of sizes in explore's directory
     listings, using low-level I/O to reduce CPU usage. Myriad small
     bug fixes and some internal structural improvements.

  *  1.0.8: Bug fixes to work with the latest chicken master, and
     increased unit test coverage to test stuff that wasn't working
     due to chicken bugs. Looking good!

  *  1.0.7: Fixed bug with directory rules (errors arose when files
     were skipped). I need to improve the test suite coverage of
     high-level components to stop this happening!

  *  1.0.6: Fixed missing features from v1.0.5 due to a fluffed merge
     (whoops), added tracking of directory sizes (files+bytes) in the
     vault on snapshot and the use of this information to display
     overall percentage completion when extracting. Directory sizes
     can be seen in the explore interface when doing "ls -l" or "ls -ll".

  *  1.0.5: Changed the VFS layout slightly, making the existence of
     snapshot objects explicit (when you go into a tag, then go into a
     snapshot, you now need to go into "contents" to see the actual
     file tree; the snapshot object itself now exists as a node in the
     tree). Added traverse-vault-* functions to the core API, and tests
     for same, and used traverse-vault-node to drive the cd and get
     functions in the interactive explore mode (speeding them up in the
     process!). Added "extract" command. Added a progress reporting
     callback facility for snapshots and extractions, and used it to
     provide progress reporting in the front-end, every 60 seconds or
     so by default, not at all with -q, and every time something
     happens with -v. Added tab completion in explore mode.

  *  1.0.4: Resurrected support for compression and encryption and SHA2
  hashes, which had been broken by the failure of the
  <code>autoload</code> egg to continue to work as it used to. Tidying
  up error and ^C handling somewhat.

  *  1.0.3: Installed sqlite busy handlers to retry when the database is
   locked due to concurrent access (affects backend-fs, backend-cache,
   and the file cache), and gained an EXCLUSIVE lock when locking a
   tag in backend-fs; I'm not clear if it's necessary, but it can't
   hurt.

   BUGFIX: Logging of messages from storage backends wasn't
   happening correctly in the Ugarit core, leading to errors when the
   cache backend (which logs an info message at close time) was closed
   and the log message had nowhere to go.

  *  1.0.2: Made the file cache also commit periodically, rather than on
  every write, in order to improve performance. Counting blocks and
  bytes uploaded / reused, and file cache bytes as well as hits;
  reporting same in snapshot UI and logging same to snapshot
  metadata. Switched to the <code>posix-extras</code> egg and ditched our own
  <code>posixextras.scm</code> wrappers. Used the <code>parley</code> egg in the <code>ugarit
  explore</code> CLI for line editing. Added logging infrastructure,
  recording of snapshot logs in the snapshot. Added recovery from
  extraction errors. Listed lock state of tags in explore
  mode. Backend protocol v2 introduced (retaining v1 for
  compatibility) allowing for an error on backend startup, and logging
  nonfatal errors, warnings, and info on startup and all protocol
  calls. Added <code>ugarit-archive-admin</code> command line interface to
  backend-specific administrative interfaces. Configuration of the
  splitlog backend (write protection, adjusting block size and logfile
  size limit and commit interval) is now possible via the admin
  interface. The admin interface also permits rebuilding the metadata
  index of a splitlog vault with the <code>reindex!</code> admin command.

  BUGFIX: Made file cache check the file hashes it finds in the
    cache actually exist in the vault, to protect against the case
    where a crash of some kind has caused unflushed changes to be
    lost; the file cache may well have committed changes that the
    backend hasn't, leading to references to nonexistent blocks. Note
    that we assume that vaults are sequentially safe, e.g. if the
    final indirect block of a large file made it, all the partial
    blocks must have made it too.

  BUGFIX: Added an explicit <code>flush!</code> command to the backend
    protocol, and put explicit flushes at critical points in higher
    layers (<code>backend-cache</code>, the vault abstraction in the Ugarit
    core, and when tagging a snapshot) so that we ensure the blocks we
    point at are flushed before committing references to them in the
    <code>backend-cache</code> or file caches, or into tags, to ensure crash
    safety.

  BUGFIX: Made the splitlog backend never exceed the file size limit
    (except when passed blocks that, plus a header, are larger than
    it), rather than letting a partial block hang over the 'end'.

  BUGFIX: Fixed tag locking, which was broken all over the
    place. Concurrent snapshots to the same tag should now block for
    one another, although why you'd want to *do* that is questionable.

  BUGFIX: Fixed generation of non-keyed hashes, which was
    incorrectly appending the type to the hash without an outer
    hash. This breaks backwards compatibility, but nobody was using
    the old algorithm, right? I'll introduce it as an option if
    required.

  *  1.0.1: Consistency check on read blocks by default. Removed warning
  about deletions from backend-cache; we need a new mechanism to
  report warnings from backends to the user. Made backend-cache and
  backend-fs/splitlog commit periodically rather than after every
  insert, which should speed up snapshotting a lot, and reused the
  prepared statements rather than re-preparing them all the
  time.

  BUGFIX: splitlog backend now creates log files with
  "rw-------" rather than "rwx------" permissions; and all sqlite
  databases (splitlog metadata, cache file, and file-cache file) are
  created with "rw-------" rather than "rw-r--r--".

  *  1.0: Migrated from gdbm to sqlite for metadata storage, removing the
  GPL taint. Unit test suite. backend-cache made into a separate
  backend binary. Removed backend-log.

  BUGFIX: file caching uses mtime *and*
  size now, rather than just mtime. Error handling so we skip objects
  that we cannot do something with, and proceed to try the rest of the
  operation.

  *  0.8: decoupling backends from the core and into separate binaries,
  accessed via standard input and output, so they can be run over SSH
  tunnels and other such magic.

  *  0.7: file cache support, sorting of directories so they're archived
  in canonical order, autoloading of hash/encryption/compression
  modules so they're not required dependencies any more.

  *  0.6: .ugarit support.

  *  0.5: Keyed hashing so attackers can't tell what blocks you have,
  markers in logs so the index can be reconstructed, sha2 support, and
  passphrase support.

  *  0.4: AES encryption.

  *  0.3: Added splitlog backend, and fixed a .meta file typo.

  *  0.2: Initial public release.

  *  0.1: Internal development release.
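The keyed hashing added in 0.5 can be sketched as follows; HMAC-SHA256 is used here purely for illustration and is not necessarily Ugarit's exact construction. The point is that, without the key, an attacker who can read the vault cannot hash a known block and check whether you have stored it.

```python
import hashlib
import hmac

# Sketch of the idea behind keyed hashing (HMAC is illustrative only,
# not necessarily Ugarit's exact scheme). Plain hashes of blocks are
# guessable by anyone; keyed hashes are unpredictable without the key.
secret = b"derived-from-your-passphrase"  # hypothetical key
block = b"some file data"

plain_id = hashlib.sha256(block).hexdigest()                    # guessable
keyed_id = hmac.new(secret, block, hashlib.sha256).hexdigest()  # not

assert plain_id != keyed_id
```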







  *  2015-06-12: [./docs/release-2.0.wiki|Version 2.0]
  *  [./docs/release-old.wiki|Previous Releases]

Changes to RELEASE.wiki.

The tip of trunk is "what's live"; the documentation, ugarit.setup,
and ugarit.release-info from there are what get served to the public
at the canonical URLs.

Do not merge documentation changes onto the trunk until you're
releasing, or the live docs will be ahead of the available version!

How to do a release:

  *  Merge desired changes onto the trunk
  *  Update ugarit.setup to set the new version
  *  Install and test to make sure you didn't break it!

  *  Commit, and tag the commit with the version number
  *  Run ../kitten-technologies/bin/generate-download-page to
     update DOWNLOAD.wiki
  *  Update ugarit.release-info to refer to the new release
  *  Commit again
  *  Announce on Google Plus etc.

See also:

http://www.kitten-technologies.co.uk/project/kitten-technologies/doc/trunk/README.wiki

In future, expand this with a way of tagging a pre-release beta in
Fossil for fossil followers to try out, before we tag it for henrietta.

Changes to docs/archive-schema.wiki.

<h1>Ugarit Archive Metadata Schema</h1>

Any symbol can be used as an archive metadata property name, but here are some
suggested ones, defined for the sake of interoperability.

Where possible, we have used the
[http://dublincore.org/documents/2001/04/12/usageguide/generic.shtml|Dublin
Core] vocabulary, as it's a good fit for the kinds of things archive
mode is designed for. Properties imported from Dublin Core are
identified with a <code>dc:</code> prefix.

Some of these properties are automatically applied by the import
process. However, if these properties are specified in the import
manifest file, then the specified value from the manifest overrides
the default.

<h2>Import properties</h2>

These are properties applied to an import object, rather than to an
individual object in an archive.

<h3>Internal</h3>

These properties are all provided by the system itself, and must not
be specified in an import manifest.

<dl>
<dt><code>previous</code> (hash)</dt>
<dd>The hash of a previous import. If there is
no instance of this property, then this is the first import in a
sequence. If there are more than one instances, then this is a
merge.</dd>

<dt><code>contents</code> (hash)</dt>
<dd>The hash of the imported archive manifest. This is probably not of
much interest beyond the Ugarit internals.</dd>

<dt><code>mtime</code> (number)</dt>
<dd>The UNIX timestamp of the import.</dd>

<dt><code>log</code> (hash)</dt>
<dd>The hash of the import log file.</dd>

<dt><code>stats</code> (alist)</dt>
<dd>An alist of import statistics.</dd>

<dt><code>manifest-path</code></dt>
<dd>The path to the manifest filename that was used for the import.</dd>

<dt><code>hostname</code></dt>
<dd>The hostname on which the import was performed.</dd>

</dl>

<h2>Core object properties</h2>

These object properties apply usefully to almost anything in an archive.

<dt><code>import-path</code></dt>
<dd>The path the file was imported from, as taken from the import
manifest file. (DEFAULT: The path from the manifest file)</dd>

<dt><code>filename</code></dt>
<dd>The name of the file, including the extension (if applicable), but
not any directory path. This is usually the name the file had when it
was imported (eg, the latter part of <code>import-path</code>), but if
it was imported from some temporary file name while the system knows
of a "proper" filename other than that, they may differ. (DEFAULT: The
import path, minus any directory path)</dd>

<dt><code>dc:format</code></dt>
<dd>The MIME type of the file. (DEFAULT: A MIME type guessed from the
file extension)</dd>

<dt><code>file-size</code></dt>
<dd>The size of the file. If it's a directory, then this is the sum of
the sizes of the files within it, not including any directory
metadata.</dd>

<dt><code>mtime</code> (number)</dt>
<dd>The mtime of the file when it was imported, as a UNIX
timestamp.</dd>

<dt><code>ctime</code></dt>
<dd>The ctime of the file when it was imported, as a UNIX
timestamp.</dd>

<dt><code>dc:title</code></dt>
<dd>The title of the object. This should be a proper human-readable
title, not just a filename, where possible.</dd>

<dt><code>dc:description</code></dt>
<dd>A longer description of the object.</dd>

<h2>Object properties for music</h2>

Music files should put the song title in <code>dc:title</code>.

<dt><code>dc:creator</code></dt>
<dd>The creator of the piece, generally the artist name.</dd>

<dt><code>dc:contributor</code></dt>
<dd>Some other contributor to the piece, other than the artist.</dd>

<dt><code>dc:publisher</code></dt>
<dd>The name of the publisher.</dd>

<dt><code>dc:created</code></dt>
<dd>The creation date, in <code>YYYY-MM-DD</code> form.</dd>

<dt><code>dc:subject</code></dt>
<dd>The name of the genre.</dd>

<dt><code>set:title</code></dt>
<dd>The title of the album.</dd>

<dt><code>set:index</code></dt>
<dd>Track number within the album.</dd>

<dt><code>set:size</code></dt>
<dd>Track count within the album.</dd>

<dt><code>superset:index</code></dt>
<dd>For multi-disk albums, the disk number.</dd>

<dt><code>superset:size</code></dt>
<dd>For multi-disk albums, the number of disks.</dd>
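
Put together, a music track's metadata might look like the following. The property names are the ones defined above; the s-expression rendering mirrors the <code>properties.sexpr</code> display seen in the command reference and is illustrative only, not a specification of the import-manifest syntax.

```python
# Property names are from the schema above; the alist rendering is
# illustrative only - not a statement of the actual manifest syntax.
track = [
    ("dc:title", "Example Song"),
    ("dc:creator", "Example Artist"),
    ("set:title", "Example Album"),
    ("set:index", 3),
    ("set:size", 12),
    ("dc:subject", "Folk"),
    ("dc:created", "2015-01-01"),
]

def to_sexpr(props):
    """Render a property alist in properties.sexpr style."""
    def atom(v):
        return '"%s"' % v if isinstance(v, str) else str(v)
    return "(" + "\n ".join("(%s . %s)" % (k, atom(v)) for k, v in props) + ")"

print(to_sexpr(track))
```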

<h2>Object properties for photographs</h2>

Use <code>dc:description</code> for a description of the photo.

<dt><code>dc:creator</code></dt>
<dd>The name of the photographer.</dd>

<dt><code>dc:subject</code></dt>
<dd>Something in the photograph (names of photographed people or
things, or more general keywords)</dd>

<dt><code>dc:spatial</code></dt>
<dd>The name of the place the photo was taken, or coordinates as a
[https://en.wikipedia.org/wiki/Geo_URI|geo: URL].</dd>

<dt><code>dc:temporal</code></dt>
<dd>The name of the event the photograph was from.</dd>

<dt><code>dc:created</code></dt>
<dd>The creation timestamp of the photo, in YYYY-MM-DD format,
optionally with a 24-hour UTC HH:MM:SS time.</dd>

<h2>Object properties for PDF/PS/ebooks</h2>

Use <code>dc:title</code> for the title of the work.

<dt><code>dc:creator</code></dt>
<dd>The name of the author.</dd>

<dt><code>dc:subject</code></dt>
<dd>A subject or keyword.</dd>

<dt><code>dc:created</code></dt>
<dd>The creation date in YYYY-MM-DD format.</dd>

<dt><code>dc:publisher</code></dt>
<dd>The name of the publisher.</dd>

<dt><code>dc:identifier</code></dt>
<dd>An ISBN, ISSN, or similar identifier, in
[https://en.wikipedia.org/wiki/Uniform_resource_name|URN format] (eg:
<code>urn:isbn:0451450523</code>).</dd>

<dt><code>dc:source</code></dt>
<dd>The original URL the thing was downloaded from.</dd>

<h2>Other useful Dublin Core properties</h2>

<dt><code>dc:alternative</code></dt>
<dd>An alternative title.</dd>


<dt><code>dc:extent</code></dt>
<dd>Size, duration, etc. Not the size of the file in bytes, but the
duration of a recording, the size of an image in pixels, etc.</dd>

<dt><code>dc:language</code></dt>
<dd>The language of the object. <code>en</code>, <code>en-GB</code>,
<code>jbo</code>, etc.</dd>

<dt><code>dc:license</code></dt>
<dd>A description of the license the file is under.</dd>

<dt><code>dc:accessRights</code></dt>
<dd>A space-separated list of names of groups that should be allowed to
access the object, under some means of publishing all or part of an
archive. <code>public</code> should refer to unrestricted access.</dd>

<h2>Please contribute!</h2>

The above are the conventions I have started to settle towards with
the kinds of things I am using Ugarit archives for. If you use it for
something else, please drop me a line and I'll be glad to help you
choose a good schema, and publish the results here for others to share!

Added docs/commands.wiki.

<h1>Ugarit command-line reference</h1>

<h2>Your first backup</h2>

Think of a tag to identify the filesystem you're backing up. If it's
<code>/home</code> on the server <samp>gandalf</samp>, you might call it <samp>gandalf-home</samp>. If
it's the entire filesystem of the server <samp>bilbo</samp>, you might just call
it <samp>bilbo</samp>.

Then from your shell, run (as root):

<pre># ugarit snapshot <ugarit.conf> <nowiki>[-c] [-a]</nowiki> <tag> <path to root of filesystem></pre>

For example, if we have a <code>ugarit.conf</code> in the current directory:

<pre># ugarit snapshot ugarit.conf -c localhost-etc /etc</pre>

Specify the <code>-c</code> flag if you want to store ctimes in the vault;
since it's impossible to restore ctimes when extracting from a
vault, doing this is useful only for informational purposes, so it's
not done by default. Similarly, atimes aren't stored in the vault
unless you specify <code>-a</code>, because otherwise there would be a lot of
directory blocks uploaded on every snapshot, as the atime of every
file will have been changed by the previous snapshot - with <code>-a</code>
specified, every directory in your filesystem will be uploaded on
every snapshot! Ugarit will happily restore atimes if they are found in
a vault; their storage is made optional simply because uploading
them is costly and rarely useful.

<h2>Exploring the vault</h2>

Now you have a backup, you can explore the contents of the
vault. This need not be done as root, as long as you can read
<code>ugarit.conf</code>; however, if you want to extract files, run it as root
so the uids and gids can be set.

<pre>$ ugarit explore ugarit.conf</pre>

This will put you into an interactive shell exploring a virtual
filesystem. The root directory contains an entry for every tag; if you
type <code>ls</code> you should see your tag listed, and within that
tag, you'll find a list of snapshots, in descending date order, with a
special entry <code>current</code> for the most recent
snapshot. Within a snapshot, you'll find the root directory of your
snapshot under <code>contents</code>, and the details of the snapshot itself in
<code>properties.sexpr</code>, and will be able to <code>cd</code> into
subdirectories, and so on:

<pre>> <b>ls</b>
localhost-etc/ <tag>
> <b>cd localhost-etc</b>
/localhost-etc> <b>ls</b>
current/ <snapshot>
2015-06-12 22:49:34/ <snapshot>
2015-06-12 22:49:25/ <snapshot>
/localhost-etc> cd current
/localhost-etc/current> ls
log.sexpr <file>
properties.sexpr <inline>
contents/ <dir>
/localhost-etc/current> <b>cat properties.sexpr</b>
((previous . "a140e6dbe0a7a38f8b8c381323997c23e51a39e2593afb61")
 (mtime . 1434102574.0)
 (contents . "34eccf1f5141187e4209cfa354fdea749a0c3c1c4682ec86")
 (stats (blocks-stored . 12)
  (bytes-stored . 16889)
  (blocks-skipped . 50)
  (bytes-skipped . 6567341)
  (file-cache-hits . 0)
  (file-cache-bytes . 0))
 (log . "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
 (hostname . "ahe")
 (source-path . "/etc")
 (notes)
 (files . 112)
 (size . 6563588))
/localhost-etc/current> <b>cd contents</b>
/localhost-etc/current/contents> <b>ls</b>
zoneinfo <symlink>
vconsole.conf <symlink>
udev/ <dir>
tmpfiles.d/ <dir>
systemd/ <dir>
sysctl.d/ <dir>
sudoers.tmp~ <file>
sudoers <file>
subuid <file>
subgid <file>
static <symlink>
ssl/ <dir>
ssh/ <dir>
shells <symlink>
shadow- <file>
shadow <file>
services <symlink>
samba/ <dir>
rpc <symlink>
resolvconf.conf <symlink>
resolv.conf <file>
-- Press q then enter to stop or enter for more...
<b>q</b>
/localhost-etc/current/contents> <b>ls -ll resolv.conf</b>
-rw-r--r--     0     0 <nowiki>[2015-05-23 23:22:41]</nowiki> 78B/-: resolv.conf
key: #f
contents: "e33ea1394cd2a67fe6caab9af99f66a4a1cc50e8929d3550"
size: 78
ctime: 1432419761.0</pre>

As well as exploring around, you can also extract files or directories
(or entire snapshots) by using the <code>get</code> command. Ugarit
will do its best to restore the metadata of files, subject to the
rights of the user you run it as.

Type <code>help</code> to get help in the interactive shell.

The interactive shell supports command-line editing, history and tab
completion for your convenience.

<h2>Extracting things directly</h2>

As well as using the interactive explore mode, it is also possible to
directly extract something from the vault, given a path.

Given the sample vault from the previous example, it would be possible
to extract the <code>resolv.conf</code> file with the following
command:

<pre>$ ugarit extract ugarit.conf /localhost-etc/current/contents/resolv.conf</pre>

<h2>Forking tags</h2>

As mentioned above, you can fork a tag, creating two tags that
refer to the same snapshot and its history but that can then have
their own subsequent history of snapshots applied to each
independently, with the following command:

<pre>$ ugarit fork <ugarit.conf> <existing tag> <new tag></pre>

<h2>Merging tags</h2>

And you can also merge two or more tags into one. It's possible to
merge a bunch of tags to make an entirely new tag, or you can merge a
tag into an existing tag, by having the "output" tag also be one of
the "input" tags.

The command to do this is:

<pre>$ ugarit merge <ugarit.conf> <output tag> <input tags...></pre>

For instance, to import your classical music collection into your main
musical collection, you might do:

<pre>$ ugarit merge ugarit.conf my-music my-music classical-music</pre>

Or if you want to create a new all-music archive from the archives
bobs-music and petes-music, you might do:

<pre>$ ugarit merge ugarit.conf all-music bobs-music petes-music</pre>

<h2>Archive operations</h2>

<h3>Importing</h3>

To import some files into an archive, you must create a manifest file
listing them, and their metadata. The manifest can also list
metadata for the import as a whole, perhaps naming the source of the
files, or the reason for importing them.

The metadata for a file (or an import) is a series of named
properties. The value of a property can be any Scheme value, written
in Scheme syntax (with strings double-quoted unless they are to be
interpreted as symbols), but strings and numbers are the most useful
types.

You can use whatever names you like for properties in metadata, but
there are some that the system applies automatically, and an informal
standard of sorts, which is documented in [docs/archive-schema.wiki].

You can produce a manifest file by hand, or use the Ugarit Manifest
Maker to produce one for you. You do this by installing it like so:

<pre>$ chicken-install ugarit-manifest-maker</pre>

And then running it, giving it any number of file and directory names
on the command line. When given directories, it will recursively scan
them to find all the files contained therein and put them in the
manifest; it will not put directories in the manifest, although it is
perfectly legal for you to do so when writing a manifest by hand. This
is because the manifest maker can't do much useful analysis on a
directory to suggest default metadata for it (so there isn't much
point in listing one), and it's far more useful for the tool to make
it easy to import a large number of files individually by referencing
the directory containing them.

The manifest is sent to standard output, so you need to redirect it to
a file, like so:

<pre>$ ugarit-manifest-maker ~/music > music.manifest</pre>

You can specify command-line options, as well. <code>-e PATTERN</code>
or <code>--exclude=PATTERN</code> introduces a glob pattern for files
to exclude from the manifest, and <code>-D KEY=VALUE</code> or
<code>--define=KEY=VALUE</code> provides a property to be added to
every file in the manifest (as opposed to an import property, that is
part of the metadata of the overall import). Note that
<code>VALUE</code> must be double-quoted if it's a string, as per
Scheme value syntax.

One might use this like so:

<pre>$ ugarit-manifest-maker -e '*.txt' -D rating=5 ~/favourite-music > music.manifest</pre>

The manifest maker simplifies the writing of manifests for files, by
listing the files in manifest format along with useful metadata
extracted from the filename and the file itself. For supported file
types (currently, MP3 and OGG music files), it will even look inside
the file to extract metadata.

The manifest file it generates will contain lots of comments
mentioning things it couldn't automatically analyse (such as unknown
OGG/ID3 tags, or unknown types of files); and for metadata properties
it thinks might be relevant but can't automatically provide, it
suggests them with an empty property declaration, commented out. The
idea is that, after generating a manifest, you read it by hand in a
text editor to attempt to improve it.

<h4>The format of a manifest file</h4>

Manifest files have a relatively simple format. They are based on
Scheme s-expressions, so can contain comments. From any semicolon (not
in a string or otherwise quoted) to the end of the line is a comment,
and <code>#;</code> in front of something comments out that something.
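
For example, all three comment forms might appear together; this is a
sketch, with the <code>notes</code> and <code>rating</code> property
names chosen purely for illustration:

<verbatim>
; This whole line is a comment.
(notes = "Some files")  ; a trailing comment after a property
#;(rating = 5)          ; this property is commented out entirely
</verbatim>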

Import metadata properties are specified like so:

<pre>(KEY = VALUE)</pre>

...where, as usual, <code>VALUE</code> must be double-quoted if it's a
string.

Files to import, with their metadata, are specified like so:

<pre>(object "PATH OF FILE TO IMPORT"
  (KEY = VALUE)
  (KEY = VALUE)...
)</pre>

The closing parenthesis need not be on a line of its own; it's
conventionally placed after the closing parenthesis of the final
property.

Ugarit, when importing the files in the manifest, will add the
following properties if they are not already specified:

<dl>
<dt><code>import-path</code></dt>
<dd>The path the file was imported from</dd>

<dt><code>dc:format</code></dt>
<dd>A guess at the file's MIME type, based on the extension</dd>

<dt><code>mtime</code></dt>
<dd>The file's modification time (as the number of seconds since the
UNIX epoch)</dd>

<dt><code>ctime</code></dt>
<dd>The file's change time (as the number of seconds since the UNIX
epoch)</dd>

<dt><code>filename</code></dt>
<dd>The name of the file, stripped of any directory components, and
including the extension.</dd>

</dl>

The following properties are automatically placed in the import
metadata:

<dl>
<dt><code>hostname</code></dt>
<dd>The hostname the import was performed on.</dd>

<dt><code>manifest-path</code></dt>
<dd>The path to the manifest file used for the import.</dd>

<dt><code>mtime</code></dt>
<dd>The time (in seconds since the UNIX epoch) at which the import was
committed.</dd>

<dt><code>stats</code></dt>
<dd>A Scheme alist of statistics about the import (number of
files/blocks uploaded, etc).</dd>
</dl>

So, to wrap that all up, here's a sample import manifest file:

<verbatim>
(notes = "A bunch of old CDs I've finally ripped")

(object "/home/alaric/newrip/track01.mp3"
  (filename = "track01.mp3")
  (dc:format = "audio/mpeg")

  (dc:publisher = "Go! Beat Records")
  (dc:created = "1994")
  (dc:contributor = "Portishead")
  (dc:subject = "Trip-Hop")
  (superset:size = 1)
  (superset:index = 1)
  (set:title = "Dummy")
  (set:size = 11)
  (set:index = 1)
  (dc:creator = "Portishead")
  (dc:title = "Wandering Star")

  (mtime = 1428962299.0)
  (ctime = 1428962299.0)
  (file-size = 4703055))

;;... and so on, for ten more MP3s on this CD, then several other CDs...
</verbatim>

<h4>Actually importing a manifest</h4>

Well, when you finally have a manifest file, importing it is easy:

<pre>$ ugarit import <ugarit.conf> <archive tag> <manifest path></pre>

<h4>How do I change the metadata of an already-imported file?</h4>

That's easy; the "current" metadata of a file is the metadata of its
most recent import. Just import the file again, in a new manifest, with new
metadata, and it will overwrite the old. However, the old metadata is
still preserved in the archive's history; tags forked from the archive
tag before the second import will still see the original state of the
archive, by design.

<h3>Exploring</h3>

Archives are visible in the explore interface. For instance, an import
of some music I did looks like this:

<pre>> <b>ls</b>
localhost-etc/ &lt;tag>
archive-tag/ &lt;tag>
> <b>cd archive-tag</b>
/archive-tag> <b>ls</b>
history/ &lt;archive-history>
/archive-tag> <b>cd history</b>
/archive-tag/history> <b>ls</b>
2015-06-12 22:53:13/ &lt;import>
/archive-tag/history> <b>cd 2015-06-12 22:53:13</b>
/archive-tag/history/2015-06-12 22:53:13> <b>ls</b>
log.sexpr &lt;file>
properties.sexpr &lt;inline>
manifest/ &lt;import-manifest>
/archive-tag/history/2015-06-12 22:53:13> <b>cat properties.sexpr</b>
((stats (blocks-stored . 2046)
        (bytes-stored . 1815317503)
        (blocks-skipped . 9)
        (bytes-skipped . 8388608)
        (file-cache-hits . 0)
        (file-cache-bytes . 0))
 (log . "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
 (mtime . 1434135993.0)
 (contents . "fcdd5b996914fdcac1e8a6cfbc67663e08f6eaf0cc952e21")
 (hostname . "ahe")
 (notes . "A bunch of music, imported as a demo")
 (manifest-path . "/home/alaric/tmp/test.manifest"))
/archive-tag/history/2015-06-12 22:53:13> <b>cd manifest</b>
/archive-tag/history/2015-06-12 22:53:13/manifest> <b>ls</b>
1d4269099189234eefeb80b95370eaf280730cf4d591004d:03 The Lemon Song.mp3 &lt;file>
7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3 &lt;file>
64092fa12c2800dda474b41e5ebe8c948f39a59ee91c120b:09 How Many More Times.mp3 &lt;file>
1d79148d1e1e8947c50b44cf2d5690588787af328e82eeef:2-07 Going to California.mp3 &lt;file>
e3685148d0d12213074a9fdb94a00e05282aeabe77fa60d5:1-01 You Shook Me.mp3 &lt;file>
d73904f371af8d7ca2af1076881230f2dc1c2cf82416880a:03 Strangers.mp3 &lt;file>
9c5a0efb7d397180a1e8d42356d8f04c6c26a83d3b05d34a:09 Uptight.mp3 &lt;file>
01a069aec2e731e18fcdd4ecb0e424f346a2f0e16910f5e9:07 Numb.mp3 &lt;file>
7ea1ab7fbd525c40e21d6dd25130e8c70289ad56c09375b0:08 She.mp3 &lt;file>
009dacd8f3185b7caeb47050002e584ab86d08cf9e9aceec:1-03 Communication Breakdown.mp3 &lt;file>
26d264d629e22709f664ed891741f690900d45cd4fd44326:1-03 Dazed and Confused.mp3 &lt;file>
d879761195faf08e4e95a5a2398ea6eefb79920710bfeab6:1-10 Band Introduction _ How Many More Times.mp3 &lt;file>
83244601db42677d110fc8522c6a3cbbc1f22966a779f876:06 All My Love.mp3 &lt;file>
5eebee9a2ad79d04e4f69e9e2a92c4e0a8d5f21e670f89da:07 Tangerine.mp3 &lt;file>
dd6f1203b5973ecd00d2c0cee18087030490230727591746:2-08 That's the Way.mp3 &lt;file>
c0acea15aa27a6dd1bcaff1c13d4f3d741a40a46abeca3fc:04 The Crunge.mp3 &lt;file>
ea7727ad07c6c82e5c9c7218ee1b059cd78264c131c1438d:1-02 I Can't Quit You Baby.mp3 &lt;file>
10fda5f46b8f505ca965bcaf12252eedf5ab44514236f892:14 F.O.D..mp3 &lt;file>
a99ca9af5a83bde1c676c388dc273051defa88756df26e95:1-03 Good Times Bad Times.mp3 &lt;file>
b5d7cfe9808c7fc0dedbd656d44e4c56159cbd3c2ed963bb:1-15 Stairway to Heaven.mp3 &lt;file>
79c87e3c49ffdac175c95aae071f63d3a9efdf2ddb84998c:08.Batmilk.ogg &lt;file>
-- Press q then enter to stop or enter for more...
<b>q</b>
/archive-tag/history/2015-06-12 22:53:13/manifest> <b>ls -ll 7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3</b>
-r--------     -     - <nowiki>[2015-04-13 21:46:39]</nowiki> -/-: 7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3
key: #f
contents: "7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382"
import-path: "/home/alaric/archive/sorted-music/Led Zeppelin/Led Zeppelin/04 Dazed and Confused.mp3"
filename: "04 Dazed and Confused.mp3"
dc:format: "audio/mpeg"
dc:publisher: "Atlantic"
dc:subject: "Classic Rock"
dc:title: "Dazed and Confused"
dc:creator: "Led Zeppelin"
dc:created: "1982"
dc:contributor: "Led Zeppelin"
set:title: "Led Zeppelin"
set:index: 4
set:size: 9
superset:index: 1
superset:size: 1
ctime: 1428957999.0
file-size: 15448903
</pre>

<h3>Searching</h3>

However, the explore interface to an archive is far from pleasant. You
need to go to the correct import, find your file by name, and then
refer to it by a long name composed of its hash and the original
filename in order to view its properties or extract it.

I hope to add property-based searching to explore mode in future
(which is why you need to go into a <code>history</code> directory
within the archive directory, as other ways of exploring the archive
will appear alongside). This will be particularly useful when the
explore-mode virtual filesystem is mounted over 9P!

However, even that interface, being constrained to look like a
filesystem, will be limited. The <code>ugarit</code> command-line tool
provides a very powerful search interface that exposes the full power
of the archive metadata.

<h4>Metadata filters</h4>

Files (and directories) in an archive can be searched for using
"metadata filters", which are descriptions of what you're looking for
that the computer can understand. They are represented as Scheme
s-expressions, and can be made up of the following components:

<dl>
<dt><code>#t</code></dt>
<dd>This filter matches everything. It's not very useful.</dd>

<dt><code>#f</code></dt>
<dd>This filter matches nothing. It's not very useful.</dd>

<dt><code>(and FILTER FILTER...)</code></dt>
<dd>This filter matches files for which all of the inner filters match.</dd>

<dt><code>(or FILTER FILTER...)</code></dt>
<dd>This filter matches files for which any of the inner filters match.</dd>

<dt><code>(not FILTER)</code></dt>
<dd>This filter matches files which do not match the inner filter.</dd>

<dt><code>(= ($ PROP) VALUE)</code></dt>
<dd>This filter matches files which have the given
<code>PROP</code>erty equal to that <code>VALUE</code> in their metadata.</dd>

<dt><code>(= key HASH)</code></dt>
<dd>This filter matches the file with the given hash.</dd>

<dt><code>(= ($import PROP) VALUE)</code></dt>
<dd>This filter matches files which have the given
<code>PROP</code>erty equal to that <code>VALUE</code> in the metadata
of the import that last imported them.</dd>
</dl>
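
These components nest, so quite specific queries can be built up. For
instance, a filter for MP3 files by Portishead that were last imported
on the host "ahe" (a sketch using properties that appear elsewhere in
this guide) would be:

<verbatim>
(and (= ($ dc:creator) "Portishead")
     (= ($ dc:format) "audio/mpeg")
     (= ($import hostname) "ahe"))
</verbatim>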

<h4>Searching an archive</h4>

For a start, you can search for files matching a given metadata filter
in a given archive. This is done with:

<pre>$ ugarit search <ugarit.conf> <archive tag> <filter></pre>

For instance, let's look for music by Led Zeppelin:

<pre>$ ugarit search ugarit.conf music '(or
   (= ($ dc:creator) "Led Zeppelin")
   (= ($ dc:contributor) "Led Zeppelin"))'</pre>

The result looks like the explore-mode view of an archive manifest,
listing the file's hash followed by its title and extension:

<verbatim>
7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382:04 Dazed and Confused.mp3
834a1619a59835e0c27b22801e3c829b40be583dadd19770:2-08 No Quarter.mp3
9e8bc4954838bd9c671f275eb48595089257185750d63894:1-12 I Can't Quit You Baby.mp3
6742b3bebcdd9cae5ec5403c585935403fa74d16ed076cf2:02 Friends (1).mp3
07d161f4bd684e283f7f2cf26e0b732157a8e95ef66939c3:05 Carouselambra.mp3
[...]
</verbatim>

What of all our lovely metadata? You can view it by adding the word
"verbose" to the end of the command line; this final argument selects
one of several alternate output formats:

<pre>$ ugarit search ugarit.conf music '(or
   (= ($ dc:creator) "Led Zeppelin")
   (= ($ dc:contributor) "Led Zeppelin"))' verbose</pre>

Now the output looks like:

<verbatim>
object a444ff6ef807b080b536155f58d246d633cab4a0eabef5bf
        (ctime = 1428958660.0)
        (dc:contributor = "Led Zeppelin")
        (dc:created = "2008")
        (dc:creator = "Led Zeppelin")
[... all the usual file properties omitted ...]
        import a43f7a7268ee8b18381c20d7573add5dbf8781f81377279c
                (stats = ((blocks-stored . 2046) (bytes-stored . 1815317503) (blocks-skipped . 9) (bytes-skipped . 8388608) (file-cache-hits . 0) (file-cache-bytes . 0)))
                (log = "b2a920f962c12848352f33cf32941e5313bcc5f209219c1a")
[... all the usual import properties omitted ...]
object b4cadf48b2c07ccf0303fc4064b292cb222980b0d4223641
        (ctime = 1428958673.0)
        (dc:contributor = "Led Zeppelin")
        (dc:created = "2008")
        (dc:creator = "Led Zeppelin")
        (dc:creator = "Jimmy Page/John Paul Jones/Robert Plant")
[...and so on...]
</verbatim>

As you can see, it lists the hash of each file, its metadata, the hash
of the import that last imported it, and the metadata of that import.

That's quite verbose, so you'd probably be wanting to take that as
input to another program to do something nicer with it. But it's laid
out for human reading, not for machine parsing. Thankfully, we have
other formats for that, <code>alist</code> and
<code>alist-with-imports</code>.

Try this:

<pre>$ ugarit search ugarit.conf music '(or
   (= ($ dc:creator) "Led Zeppelin")
   (= ($ dc:contributor) "Led Zeppelin"))' alist</pre>

This outputs one Scheme s-expression list per match, the first element
of which is the hash as a string, the rest of which is an alist of properties:

<verbatim>
("7cb253a4886b3e0051ea8cc0e78fb3a0160307a2c37c8382"
 (ctime . 1428957999.0)
 (dc:contributor . "Led Zeppelin")
 (dc:created . "1982")
 (dc:creator . "Led Zeppelin")
[... elided file properties ...]
 (superset:index . 1)
 (superset:size . 1))
("77c960d09eb21ed72e434ddcde0bd3781a4f3d6ee7a6eb66"
 (ctime . 1428958981.0)
 (dc:contributor . "Led Zeppelin")
[...]
</verbatim>

<pre>$ ugarit search ugarit.conf music '(or
   (= ($ dc:creator) "Led Zeppelin")
   (= ($ dc:contributor) "Led Zeppelin"))' alist-with-imports</pre>

This outputs one s-expression list per match, with four
elements. The first is the key string, the second is an alist of file
properties, the third is the import's hash, and the last is an alist
containing the import's properties. It looks like:

<verbatim>
("64fa08a0080aee6ef501c408fd44dfcc634cfcafd8006fc4"
 ((ctime . 1428958683.0)
  (dc:contributor . "Led Zeppelin")
  (dc:created . "2008")
  (dc:creator . "Led Zeppelin")
[... elided file properties ...]
  (superset:index . 1)
  (superset:size . 1))
 "a43f7a7268ee8b18381c20d7573add5dbf8781f81377279c"
 ((stats (blocks-stored . 2046)
         (bytes-stored . 1815317503)
[... elided manifest properties ...]
  (manifest-path . "test.manifest")))
("4cd56f916a63399b252976e842dcae0b87f058b5a60c93a4"
 ((ctime . 1428958437.0)
  (dc:contributor . "Led Zeppelin")
[...]
</verbatim>

And finally, you might just want to get the hashes of matching files
(which are particularly useful for extraction operations, which we'll
come to next). To do this, specify a format of "keys", which outputs
one line per match, containing just the hash:

<pre>$ ugarit search ugarit.conf music '(or
   (= ($ dc:creator) "Led Zeppelin")
   (= ($ dc:contributor) "Led Zeppelin"))' keys</pre>

<verbatim>
ce6f6484337de772de9313038cb25d1b16e28028136cc291
6af5c664cbfa1acb22a377e97aee35d94c0fc003d239dd0c
92e91e79b384478b5aab31bf1b2ff9e25e7e2c4b48575185
6ddb9a41d4968468a904f05ecf7e0e73d2c7c7ad76bc394b
a074dddcef67cd93d92c6ffce845894aa56594674023f6e1
4f65f735bbb00a6fda4bc887b370b3160f55e5e07ec37ffa
97cc8b8ba70c39387fc08ef62311b751aea4340d636eb421
72358dbe3eb60da42eadcf6de325b2a6686f4e17ea41fa60
[...]
</verbatim>

However, to write filter expressions, you need to know what properties
you have available to search on. You might remember, or go for
standard properties, or look at existing files in verbose mode to find
some; but you can also just ask Ugarit what properties it has in an
archive, like so:

<pre>$ ugarit search-props <ugarit.conf> <archive tag></pre>

You can even ask what properties are available for files matching an
existing filter:

<pre>$ ugarit search-props <ugarit.conf> <archive tag> <filter></pre>

This is useful if you're interested in further narrowing down a
filter, and so only care about properties that files already matching
that filter have.

For a bunch of music files imported with the Ugarit Manifest Maker,
you can expect to see something like this:

<verbatim>
ctime
dc:contributor
dc:created
dc:creator
dc:format
dc:publisher
dc:subject
dc:title
file-size
filename
import-path
mtime
set:index
set:size
set:title
superset:index
superset:size
</verbatim>

Now you know what properties to search, next you'll be wanting to know
what values to look for. Again, Ugarit has a command to query the
available values of any given property:

<pre>$ ugarit search-values <ugarit.conf> <archive tag> <property></pre>

And you can limit that just to files matching a given filter:

<pre>$ ugarit search-values <ugarit.conf> <archive tag> <filter> <property></pre>

The resulting list of values is ordered by popularity, so the most
widely-used values will be listed first. Let's see what genres of
music were in my sample of music files I imported:

<pre>$ ugarit search-values test.conf archive-tag dc:subject</pre>

The result is:

<verbatim>
Classic Rock
Alternative & Punk
Electronic
Trip-Hop
</verbatim>

Ok, let's now use a filter to find out what artists
(<code>dc:creator</code>) I have that made Trip-Hop music (what even
IS that?):

<pre>$ ugarit search-values test.conf archive-tag \
    '(= ($ dc:subject) "Trip-Hop")' \
    dc:creator</pre>

The result is:

<verbatim>Portishead</verbatim>

Ah, OK, now I know what "Trip-Hop" is.

<h3>Extracting</h3>

All this searching is lovely, but what it gets us, in the end, is a
bunch of file hashes. Perhaps we might want to actually play some
music, or look at a photo, or something. To do that, we need to
extract from the archive.

We've already seen the contents of an archive in the explore mode
virtual filesystem, so we could go into the archive history, find the
import, go into the manifest, pick the file out there, and use
<code>get</code> to extract it, but that would be yucky. Thankfully,
we have a command-line interface to get things from archives, in one
of two ways.

Firstly, we can extract a file (or a directory tree) from an archive,
out into the local filesystem:

<pre>$ ugarit archive-extract <ugarit.conf> <archive tag> <hash> <target></pre>

The "target" is the name to give it in the local filesystem. We could
pull out that Led Zeppelin song from our search results above, like so:

<pre>$ ugarit archive-extract test.conf archive-tag \
    ce6f6484337de772de9313038cb25d1b16e28028136cc291 foo.mp3</pre>

We now have a foo.mp3 file in the current directory.

However, sometimes it would be nicer to have it streamed to standard
output, which can be done like so:

<pre>$ ugarit archive-stream <ugarit.conf> <archive tag> <hash></pre>

This lets us write a command such as:

<pre>$ ugarit archive-stream test.conf archive-tag \
    ce6f6484337de772de9313038cb25d1b16e28028136cc291 | mpg123 -</pre>

...to play it in real time.
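
Since the "keys" search format emits one hash per line, it combines
naturally with these extraction commands. As a sketch (assuming a
POSIX shell and the example vault from above), you could extract every
matching file, naming each after its hash:

<pre>$ ugarit search test.conf archive-tag \
    '(= ($ dc:creator) "Led Zeppelin")' keys | \
  while read hash; do
    ugarit archive-extract test.conf archive-tag "$hash" "$hash.mp3"
  done</pre>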

Added docs/dot-ugarit.wiki.

<h1><code>.ugarit</code> files</h1>

By default, Ugarit will vault everything it finds in the filesystem
tree you tell it to snapshot. However, this might not always be
desired; so we provide the facility to override this with <code>.ugarit</code>
files, or global rules in your <code>.conf</code> file.

Note: All of this only applies to snapshots. Archive mode imports are
not affected by <code>.ugarit</code> files, or global rules.

Note: The syntax of these files is provisional, as I want to
experiment with usability, as the current syntax is ugly. So please
don't be surprised if the format changes in incompatible ways in
subsequent versions!

In quick summary, if you want to ignore all files or directories
matching a glob in the current directory and below, put the following
in a <code>.ugarit</code> file in that directory:

<pre>(* (glob "*~") exclude)</pre>

You can write quite complex expressions as well as just globs. The
full set of rules is:

  *  <code>(glob "<em>pattern</em>")</code> matches files and directories whose names
  match the glob pattern

  *  <code>(name "<em>name</em>")</code> matches files and directories with exactly that
  name (useful for files called <code>*</code>...)

  *  <code>(modified-within <em>number</em> seconds)</code> matches files and
  directories modified within the given number of seconds

  *  <code>(modified-within <em>number</em> minutes)</code> matches files and
  directories modified within the given number of minutes

  *  <code>(modified-within <em>number</em> hours)</code> matches files and directories
  modified within the given number of hours

  *  <code>(modified-within <em>number</em> days)</code> matches files and directories
  modified within the given number of days

  *  <code>(not <em>rule</em>)</code> matches files and directories that do not match
  the given rule

  *  <code>(and <em>rule</em> <em>rule...</em>)</code> matches files and directories that match
  all the given rules

  *  <code>(or <em>rule</em> <em>rule...</em>)</code> matches files and directories that match
  any of the given rules
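
Rules combine, too. For instance, to exclude editor backup files as
well as anything not modified within the last thirty days (a sketch
built from the rules listed above):

<pre>(* (or (glob "*~")
       (not (modified-within 30 days))) exclude)</pre>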

Also, you can override a previous exclusion with an explicit include
in a lower-level directory:

<pre>(* (glob "*~") include)</pre>

You can bind rules to specific directories, rather than to "this
directory and all beneath it", by specifying an absolute or relative
path instead of the <code>*</code>:

<pre>("/etc" (name "passwd") exclude)</pre>

If you use a relative path, it's taken relative to the directory of
the <code>.ugarit</code> file.

You can also put some rules in your <code>.conf</code> file, although relative
paths are illegal there, by adding lines of this form to the file:

<pre>(rule * (glob "*~") exclude)</pre>

Added docs/faq.wiki.

<h1>Questions and Answers</h1>

<h2>What happens if a snapshot is interrupted?</h2>

Nothing! Whatever blocks have been uploaded will remain in the vault,
but the snapshot is only added to the tag once the entire filesystem
has been snapshotted. So just start the snapshot again. Any files that
have already been uploaded will then not need to be uploaded again, so
the second snapshot should proceed quickly to the point where it
failed before, and continue from there.

Unless the vault ends up with a partially-uploaded corrupted block
due to being interrupted during upload, you'll be fine. The filesystem
backend has been written to avoid this by writing the block to a file
with the wrong name, then renaming it to the correct name when it's
entirely uploaded.

Actually, there is <em>one</em> caveat: blocks that were uploaded, but never
make it into a finished snapshot, will be marked as "referenced" but
there's no snapshot to delete to un-reference them, so they'll never
be removed when you delete snapshots. (Not that snapshot deletion is
implemented yet, mind). If this becomes a problem for people, we could
write a "garbage collect" tool that regenerates the reference counts
in a vault, leading to unused blocks (with a zero refcount) being
unlinked.

<h2>Should I share a single large vault between all my filesystems?</h2>

I think so. Using a single large vault means that blocks shared
between servers - eg, software installed from packages and that sort
of thing - will only ever need to be uploaded once, saving storage
space and upload bandwidth. However, do not share a vault between
servers that do not mutually trust each other, as they can all update
the same tags, so can meddle with each other's snapshots - and read
each other's snapshots.

<h3>CAVEAT</h3>

It's not currently safe to have multiple concurrent snapshots to the
same split log backend; this will soon be fixed, however.

Added docs/installation.wiki.

<h1>Installation</h1>

Install [http://www.call-with-current-continuation.org/|Chicken Scheme] using their [http://wiki.call-cc.org/man/4/Getting%20started|installation instructions].

Ugarit can then be installed by typing (as root):

    chicken-install ugarit

See the [http://wiki.call-cc.org/manual/Extensions#chicken-install-reference|chicken-install manual] for details if you have any trouble, or wish to install into your home directory.

<h1>Setting up a vault</h1>

First, you need to know the vault identifier for the place you'll
be storing your backups. This depends on your backend. The vault
identifier is actually the command line used to invoke the backend for
a particular vault; communication with the vault happens over standard
input and output, which makes it easy to tunnel over ssh.

<h2>Local filesystem backends</h2>

These backends use the local filesystem to store the vaults. Of
course, the "local filesystem" on a given server might be an NFS mount
or mounted from a storage-area network.

<h3>Logfile backend</h3>

The logfile backend works much like the original Venti system. It's
append-only: you won't be able to delete old snapshots from a logfile
vault, even once deletion is implemented. It stores the vault in two
sets of files: one is a log of data blocks, split at a specified
maximum size, and the other is the metadata, an SQLite database used
to track the location of blocks in the log files, the contents of
tags, and a count of the logs so a filename can be chosen for a new one.

To set up a new logfile vault, just choose where to put the two
parts. Ideally, put the metadata file on a different physical disk
from the logs directory, to reduce seeking. If you only have one disk,
you can put the metadata file in the log directory ("metadata" is a
good name).

You can then refer to it using the following vault identifier:

      "backend-fs splitlog ...log directory... ...metadata file..."
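As a concrete sketch (the paths here are hypothetical placeholders; a
temporary directory stands in for a real disk, and it is assumed the
backend creates the metadata file on first use, as "just choose where
to put the two parts" suggests):

```shell
# Hypothetical layout; pick locations on whatever disks suit you.
base=$(mktemp -d)
mkdir -p "$base/logs"            # the log directory for data blocks
# The vault identifier is simply the backend command line:
VAULT="backend-fs splitlog $base/logs $base/metadata"
echo "$VAULT"
```

Only the directories need to exist beforehand; the identifier is just
a string you paste into your configuration.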

<h3>SQLite backend</h3>

The sqlite backend works a bit like a
[http://www.fossil-scm.org/|Fossil] repository; the storage is
implemented as a single file, which is actually an SQLite database
containing blocks as blobs, along with tags and configuration data in
their own tables.

It supports unlinking objects, and the use of a single file to store
everything is convenient; but storing everything in a single file with
random access is slightly riskier than the simple structure of an
append-only log file; it is less tolerant of corruption, which can
easily render the entire storage unusable. Also, that one file can get
very large.

SQLite has internal limits on the size of a database, but they're
quite large - you'll probably hit a size limit at about 140
terabytes.

To set up an SQLite storage, just choose a place to put the file. I
usually use an extension of <code>.vault</code>; note that SQLite will
also create temporary files alongside it with additional extensions.

Then refer to it with the following vault identifier:

      "backend-sqlite ...path to vault file..."
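A sketch of setting one up (the path is a placeholder; SQLite creates
the <code>.vault</code> file itself when the backend first opens it):

```shell
# Only the containing directory needs to exist in advance.
mkdir -p "$HOME/backups"
VAULT="backend-sqlite $HOME/backups/home.vault"
echo "$VAULT"
```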

<h3>Filesystem backend</h3>

The filesystem backend creates vaults by storing each block or tag
in its own file, in a directory. To keep the objects-per-directory
count down, it'll split the files into subdirectories. Because of
this, it uses a stupendous number of inodes (more than the filesystem
being backed up). Only use it if you don't mind that; splitlog is much
more efficient.

To set up a new filesystem-backend vault, just create an empty
directory that Ugarit will have write access to when it runs. It will
probably run as root in order to be able to access the contents of
files that aren't world-readable (although that's up to you), so
unless you access your storage via ssh, or use sudo to run the backend
as another user, be careful of NFS mounts that have
<code>maproot=nobody</code> set!

You can then refer to it using the following vault identifier:

      "backend-fs fs ...path to directory..."

<h2>Proxying backends</h2>

These backends wrap another vault identifier which the actual
storage task is delegated to, but add some value along the way.

<h3>SSH tunnelling</h3>

It's easy to access a vault stored on a remote server. The caveat
is that the backend then needs to be installed on the remote server!
Since vaults are accessed by running the supplied command, and then
talking to them via stdin and stdout, the vault identifier need
only be:

      "ssh ...hostname... '...remote vault identifier...'"
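Because the remote identifier itself contains spaces, the quoting is
the fiddly part. A sketch (hostname and paths are placeholders):

```shell
REMOTE="backend-fs splitlog /mnt/ugarit-logs /mnt/ugarit-metadata"
# Single quotes keep the whole remote identifier as one ssh argument:
VAULT="ssh backup@example.org '$REMOTE'"
echo "$VAULT"
```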

<h3>Cache backend</h3>

The cache backend is used to cache a list of which blocks exist in the
proxied backend, so that it can answer queries about the existence of
a block rapidly, even when the proxied backend is on the end of a
high-latency link (e.g., the Internet). This should speed up snapshots,
as existing files are identified by asking the backend if the vault
already has them.

The cache backend works by storing the cache in a local sqlite
file. Given a place for it to store that file, usage is simple:

      "backend-cache ...path to cachefile... '...proxied vault identifier...'"

The cache file will be automatically created if it doesn't already
exist, so make sure there's write access to the containing directory.
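Putting the two proxying backends together - a local cache file in
front of a remote vault - might look like this sketch (paths and host
are placeholders, and the exact quoting of the proxied identifier may
need adjusting for your shell):

```shell
CACHE="$HOME/.ugarit/remote.cache"
mkdir -p "$(dirname "$CACHE")"    # the cache file needs a writable directory
REMOTE="ssh backup@example.org 'backend-fs splitlog /mnt/logs /mnt/meta'"
VAULT="backend-cache $CACHE \"$REMOTE\""
echo "$VAULT"
```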

 - WARNING - WARNING - WARNING - WARNING - WARNING - WARNING -

If you use a cache on a vault shared between servers, make sure
that you either:

  *  Never delete things from the vault

or

  *  Make sure all access to the vault is via the same cache

If a block is deleted from a vault, and a cache on that vault is
not aware of the deletion (as it did not go "through" the caching
proxy), then the cache will record that the block exists in the
vault when it does not. This will mean that if a snapshot is made
through the cache that would use that block, then it will be assumed
that the block already exists in the vault when it does
not. Therefore, the block will not be uploaded, and a dangling
reference will result!

Some setups which *are* safe:

  *  A single server using a vault via a cache, not sharing it with
   anyone else.

  *  A pool of servers using a vault via the same cache.

  *  A pool of servers using a vault via one or more caches, and
   maybe some not via the cache, where nothing is ever deleted from
   the vault.

  *  A pool of servers using a vault via one cache, and maybe some
   not via the cache, where deletions are only performed on servers
   using the cache, so the cache is always aware.

<h1>Writing a <code>ugarit.conf</code></h1>

<code>ugarit.conf</code> should look something like this:

<verbatim>(storage <vault identifier>)
(hash tiger "<salt>")
[(double-check)]
[(compression [deflate|lzma])]
[(encryption aes <key>)]
[(cache "<path>")|(file-cache "<path>")]
[(rule ...)]</verbatim>

<h2>Hashing</h2>

The hash line chooses a hash algorithm. Currently Tiger-192
(<code>tiger</code>), SHA-256 (<code>sha256</code>), SHA-384
(<code>sha384</code>) and SHA-512 (<code>sha512</code>) are supported.
If you omit the line entirely, Tiger will still be used, but as a
simple hash of the block with the block type appended, which reveals
to attackers what blocks you have (the hash is of the unencrypted
block, and the hash itself is not encrypted). That is useful for
development and testing, or with trusted vaults, but not advised for
vaults that attackers may snoop on. Providing a salt string produces a
hash function that hashes the block, the type of block, and the salt
string, yielding hashes that attackers who can snoop the vault cannot
use to find known blocks (see the "Security model" section below for
more details).

I would recommend that you create a salt string from a secure entropy
source, such as:

<pre>dd if=/dev/random bs=1 count=64 | base64 -w 0</pre>

Whichever hash function you use, you will need to install the required
Chicken egg with one of the following commands:

<pre>chicken-install -s tiger-hash  # for tiger
chicken-install -s sha2        # for the SHA hashes</pre>

<h2>Compression</h2>

<code>lzma</code> is the recommended compression option for
low-bandwidth backends or when space is tight, but it's very slow to
compress; deflate or no compression at all are better for fast local
vaults. To have no compression at all, just remove the
<code>(compression ...)</code> line entirely. Likewise, to use
compression, you need to install a Chicken egg:

<pre>chicken-install -s z3       # for deflate
chicken-install -s lzma     # for lzma</pre>

WARNING: The lzma egg is currently rather difficult to install, and
needs rewriting to fix this problem.

<h2>Encryption</h2>

Likewise, the <code>(encryption ...)</code> line may be omitted to have no
encryption; the only currently supported algorithm is aes (in CBC
mode) with a key given in hex, as a passphrase (hashed to get a key),
or a passphrase read from the terminal on every run. The key may be
16, 24, or 32 bytes for 128-bit, 192-bit or 256-bit AES. To specify a
hex key, just supply it as a string, like so:

<pre>(encryption aes "00112233445566778899AABBCCDDEEFF")</pre>

...for 128-bit AES,

<pre>(encryption aes "00112233445566778899AABBCCDDEEFF0011223344556677")</pre>

...for 192-bit AES, or

<pre>(encryption aes "00112233445566778899AABBCCDDEEFF00112233445566778899AABBCCDDEEFF")</pre>

...for 256-bit AES.

Alternatively, you can provide a passphrase, and specify how large a
key you want it turned into, like so:

<pre>(encryption aes ([16|24|32] "We three kings of Orient are, one in a taxi one in a car, one on a scooter honking his hooter and smoking a fat cigar. Oh, star of wonder, star of light; star with royal dynamite"))</pre>

I would recommend that you generate a long passphrase from a secure
entropy source, such as:

<pre>dd if=/dev/random bs=1 count=64 | base64 -w 0</pre>

Finally, the extra-paranoid can request that Ugarit prompt for a
passphrase on every run and hash it into a key of the specified
length, like so:

<pre>(encryption aes ([16|24|32] prompt))</pre>

(note the lack of quotes around <code>prompt</code>, distinguishing it from a passphrase)

Please read the [./security.wiki|Security model] documentation for
details on the implications of different encryption setups.

Again, as it is an optional feature, to use encryption, you must
install the appropriate Chicken egg:

<pre>chicken-install -s aes</pre>

<h2>Caching</h2>

Ugarit can use a local cache to speed up various operations. If a path
to a file is provided through the <code>cache</code> or
<code>file-cache</code> directives, then a file will be created at
that location and used as a cache. If not, then a default path of
<code>~/.ugarit-cache</code> will be used instead.

WARNING: If you use multiple different vaults from the same UNIX
account, and the same tag names are used in those different vaults,
and you use the default cache path (or explicitly specify cache paths
that point to the same file), you will get a somewhat confused
cache. The effects of this will be annoying (searches finding things
that then can't be fetched) rather than damaging, but it's still best
avoided!

The cache is used to cache snapshot records and archive import
records. This is used by operations that extract snapshot history and
archive objects; snapshots are stored in a linked list of snapshot
objects, each referring to the previous snapshot. Therefore, reading
the history of a snapshot tag requires reading many objects from the
storage, which can be time-consuming for a remote storage! Similarly,
archives are represented as a linked list of imports, and searching
for an object in the archive can involve traversing the chain of
imports until a match is found (and then searching on until the end to
see if any further matches can be found!). The cache is even more
important for archive imports, as it not only keeps a local copy of
all the import information, it also records the "current" metadata of
every object in the archive (so that we don't need to search through
superseded previous versions of the metadata of an object when looking
for something), and uses B-tree indexes to enable fast searching of
the cached metadata.

If you configure the cache path with <code>file-cache</code> rather
than just <code>cache</code>, then as well as the snapshot/archive
metadata caching, you will also enable file hash caching.

This significantly speeds up subsequent snapshots of a filesystem
tree. The file cache maps filenames to (mtime,size,hash) tuples; as it
scans the filesystem, if it finds a file in the cache and the mtime
and size have not changed, it will assume it is already stored under
the specified hash. This saves it from having to read the entire file
to hash it and then check if the hash is present in the vault. In
other words, if only a few files have changed since the last snapshot,
then snapshotting a directory tree becomes an O(N) operation, where N
is the number of files, rather than an O(M) operation, where M is the
total size of files involved.
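The per-file decision can be sketched as follows (illustrative only:
Ugarit's real cache is an SQLite file, and <code>stat -c</code> is the
GNU form of the command):

```shell
# Build a throwaway file and take its (mtime, size) signature.
tmp=$(mktemp)
printf 'some file contents\n' > "$tmp"
sig() { stat -c '%Y %s' "$1"; }   # (mtime, size) signature, GNU stat
cached="$(sig "$tmp")"            # pretend this came from the file cache
if [ "$(sig "$tmp")" = "$cached" ]; then
  decision="unchanged: reuse cached hash, skip reading the file"
else
  decision="changed: re-read, re-hash and upload"
fi
echo "$decision"
```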

WARNING: If you use a file cache, and a file is cached in it but then
subsequently deleted from the vault, Ugarit will fail to re-upload it
at the next snapshot. If you are using a file cache and you go
deleting things from your vault (should that be implemented in
future), you'll want to flush the cache afterwards. We might implement
automatic removal of deleted files from the local cache, but file
caches on other Ugarit installations that use the same vault will not
be aware of the deletion.

<h2>Other options</h2>

<code>double-check</code>, if present, causes Ugarit to perform extra
internal consistency checks during backups, which will detect bugs but
may slow things down.

<h2>Example</h2>

For example:

<pre>(storage "ssh ugarit@spiderman 'backend-fs splitlog /mnt/ugarit-data /mnt/ugarit-metadata/metadata'")
(hash tiger "i3HO7JeLCSa6Wa55uqTRqp4jppUYbXoxme7YpcHPnuoA+11ez9iOIA6B6eBIhZ0MbdLvvFZZWnRgJAzY8K2JBQ")
(encryption aes (32 "FN9m34J4bbD3vhPqh6+4BjjXDSPYpuyskJX73T1t60PP0rPdC3AxlrjVn4YDyaFSbx5WRAn4JBr7SBn2PLyxJw"))
(compression lzma)
(file-cache "/var/ugarit/cache")</pre>

Be careful to put a set of parentheses around each configuration
entry. White space isn't significant, so feel free to indent things
and wrap them over lines if you want.

Keep copies of this file safe - you'll need it to do extractions!
Print a copy out and lock it in your fire safe! Ok, currently, you
might be able to recreate it if you remember where you put the
storage, but encryption keys and hash salts are harder to remember...

Added docs/intro.wiki.

<h1>About Ugarit</h1>

<h2>What's content-addressible storage?</h2>

Traditional backup systems work by storing copies of your files
somewhere. Perhaps they go onto tapes, or perhaps they're in archive
files written to disk. They will either be full dumps, containing a
complete copy of your files, or incrementals or differentials, which
only contain files that have been modified since some point. This
saves making repeated copies of unchanging files, but it means that to
do a full restore, you need to start by extracting the last full dump
then applying one or more incrementals, or the latest differential,
to get the latest state.

Not only do differentials and incrementals let you save space, they
also give you a history: you can restore from a previous point in
time, which is invaluable if the file you want to restore was deleted
a few backup cycles ago!

This technology was developed when the best storage technology for
backups was magnetic tape, because each dump is written sequentially
(and restores are largely sequential, unless you're skipping bits to
pull out specific files).

However, these days, random-access media such as magnetic disks and
SSDs are cheap enough to compete with magnetic tape for long-term bulk
storage (especially when one considers the cost of a tape drive or
two). And having fast random access means we can take advantage of
different storage techniques.

A content-addressible store is a key-value store, except that the keys
are always computed from the values. When a given object is stored, it
is hashed, and the hash used as the key. This means you can never
store the same object twice; the second time you'll get the same hash,
see the object is already present, and re-use the existing
copy. Therefore, you get deduplication of your data for free.
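The key-from-value idea can be tried with any cryptographic hash tool;
here <code>sha256sum</code> stands in for whatever hash Ugarit is
configured with:

```shell
# Hashing the same content twice always yields the same key,
# so storing identical data a second time is a no-op.
key1=$(printf 'hello' | sha256sum | cut -d' ' -f1)
key2=$(printf 'hello' | sha256sum | cut -d' ' -f1)
echo "$key1"
test "$key1" = "$key2" && echo "duplicate detected: already stored"
```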

But, I hear you ask, how do you find things again, if you can't choose
the keys?

When an object is stored, you need to record the key so you can find
it again later. In Ugarit, everything is stored in a tree-like
directory structure. Files are uploaded and their hashes obtained, and
then a directory object is constructed containing a list of the files
in the directory, and listing the key of the Ugarit objects storing
the contents of each file. This directory object itself has a hash,
which is stored inside the directory entry in the parent directory,
and so on up to the root. The root of a tree stored in a Ugarit vault
has no parent directory to contain it, so at that point, we store the
key of the root in a named "tag" that we can look up by name when we
want it.

Therefore, everything in a Ugarit vault can be found by starting with
a named tag and retrieving the object whose key it contains, then
finding keys inside that object and looking up the objects they refer
to, until we find the object we want.

When you use Ugarit to back up your filesystem, it uploads a complete
snapshot of every file in the filesystem, like a full dump. But
because the vault is content-addressed, it automatically avoids
uploading anything it already has a copy of, so all we upload is an
incremental dump - but in the vault, it looks like a full dump, and so
can be restored on its own without having to restore a chain of incrementals.

Also, the same storage can be shared between multiple systems that all
back up to it - and the incremental upload algorithm will mean that
any files shared between the servers will only need to be uploaded
once. If you back up a complete server, then go and back up another
that is running the same distribution, then all the files in <tt>/bin</tt>
and so on that are already in the storage will not need to be backed
up again; the system will automatically spot that they're already
there, and not upload them again.

As well as storing backups of filesystems, Ugarit can also be used as
the primary storage for read-only files, such as music and photos. The
principle is exactly the same; the only difference is in how the files
are organised - rather than as a directory structure, the files are
referenced from metadata objects that specify information about the
file (so it can be found) and a reference to the contents. Sets of
metadata objects are pointed to by tags as well, so they can also be
found.

<h2>So what's that mean in practice?</h2>

<h3>Backups</h3>
You can run Ugarit to back up any number of filesystems to a shared
storage area (known as a <i>vault</i>), and on every backup, Ugarit
will only upload files or parts of files that aren't already in the
vault - be they from the previous snapshot, earlier snapshots,
snapshots of entirely unrelated filesystems, etc. Every time you do a
snapshot, Ugarit builds a complete directory tree of the
snapshot in the vault - but it reuses any parts of files, files, or
entire directories that already exist anywhere in the vault, and
only uploads what doesn't already exist.

The support for parts of files means that, in many cases, gigantic
files like database tables and virtual disks for virtual machines will
not need to be uploaded entirely every time they change, as only the
changed sections will be identified and uploaded.

Because a complete directory tree exists in the vault for any
snapshot, the extraction algorithm is incredibly simple - and,
therefore, incredibly reliable and fast. Simple, reliable, and fast
are just what you need when you're trying to reconstruct the
filesystem of a live server.

Also, it means that you can do lots of small snapshots. If you run a
snapshot every hour, then only a megabyte or two might have changed in
your filesystem, so you only upload a megabyte or two - yet you end up
with a complete history of your filesystem at hourly intervals in the
vault.

Conventional backup systems usually store a full backup and then
incrementals in their archives, meaning that a restore involves
reading the full backup and then reading and applying every
incremental since - so to do a restore, you have to download *every
version* of the filesystem you've ever uploaded, or you have to do
periodic full backups (even though most of your filesystem won't have
changed since the last full backup) to reduce the number of
incrementals required for a restore. Better results are had from
systems that use a special backup server to look after the archive
storage, which accept incremental backups and apply them to the
snapshot they keep in order to maintain a most-recent snapshot that
can be downloaded in a single run; but they then restrict you to using
dedicated servers as your archive stores, ruling out cheaply scalable
solutions like Amazon S3, or just backing up to a removable USB or
eSATA disk you attach to your system whenever you do a backup. And
dedicated backup servers are complex pieces of software; can you rely
on something complex for the fundamental foundation of your data
security system?

<h3>Archives</h3>

You can also use Ugarit as the primary storage for read-only
files. You do this by creating an archive in the vault, and importing
batches of files into it along with their metadata (arbitrary
attributes, such as "author", "creation date" or "subject").

Just as you can keep snapshots of multiple systems in a Ugarit vault,
you can also keep multiple separate archives, each identified by a
named tag.

However, as it's all within the same vault, the usual de-duplication
rules apply. The same file may be in multiple archives, with different
metadata in each, as the file contents and metadata are stored
separately (and associated only within the context of each
archive). And, of course, the same file may appear in snapshots and in
archives; perhaps a file was originally downloaded into your home
directory, where it was backed up into Ugarit snapshots, and then you
imported it into your archive. The archive import would not have had
to re-upload the file, as its contents would have already been found
in the vault, so all that needs to be uploaded is the metadata.

Although we have mainly spoken of storing files in archives, the
objects in archives can be files or directories full of files, as
well. This is useful for storing MacOS-style files that are actually
directories, or for archiving things like completed projects for
clients, which can be entire directory structures.

<h2>System Requirements</h2>

Ugarit should run on any POSIX-compliant system that can run
[http://www.call-with-current-continuation.org/|Chicken Scheme]. It
stores and restores all the file attributes reported by the <code>stat</code>
system call - POSIX mode permissions, UID, GID, mtime, and optionally
atime and ctime (although the ctime cannot be restored due to POSIX
restrictions). Ugarit will store files, directories, device and
character special files, symlinks, and FIFOs.

Support for extended filesystem attributes - ACLs, alternative
streams, forks and other metadata - is possible, due to the extensible
directory entry format; support for such metadata will be added as
required.

Currently, only local filesystem-based vault storage backends are
complete: these are suitable for backing up to a removable hard disk
or a filesystem shared via NFS or other protocols. However, the
backend can be accessed via an SSH tunnel, so a remote server you are
able to install Ugarit on to run the backends can be used as a remote
vault.

However, the next backends to be implemented will be one for Amazon
S3 and an SFTP backend for storing vaults anywhere you can ssh
to. Other backends will be implemented on demand; a vault can, in
principle, be stored on anything that can store files by name, report
on whether a file already exists, and efficiently download a file by
name. This rules out magnetic tapes due to their requirement for
sequential access.

Although we need to trust that a backend won't lose data (for now), we
don't need to trust the backend not to snoop on us, as Ugarit
optionally encrypts everything sent to the vault.

<h2>Terminology</h2>

A Ugarit backend is the software module that handles backend
storage. An actual storage area - managed by a backend - is called a
storage, and is used to implement a vault; currently, every storage is
a valid vault, but the planned future introduction of a distributed
storage backend will enable multiple storages (which are not,
themselves, valid vaults as they only contain some subset of the
information required) to be combined into an aggregate storage, which
then holds the actual vault. Note that the contents of a storage are
purely a set of blocks, and a series of named tags containing
references to them; the storage does not know the details of
encryption and hashing, so cannot make any sense of its contents.

For example, if you use the recommended "splitlog" filesystem backend,
your vault might be <samp>/mnt/bigdisk</samp> on the server
<samp>prometheus</samp>. The backend (which is compiled along with the
other filesystem backends in the <code>backend-fs</code> binary) must
be installed on <samp>prometheus</samp>, and Ugarit clients all over
the place may then use it via ssh to <samp>prometheus</samp>. However,
even with the filesystem backends, the actual storage might not be on
<samp>prometheus</samp> where the backend runs -
<samp>/mnt/bigdisk</samp> might be an NFS mount, or a mount from a
storage-area network. This ability to delegate via SSH is particularly
useful with the "cache" backend, which reduces latency by storing a
cache of what blocks exist in a backend, thereby making it quicker to
identify already-stored files; a cluster of servers all sharing the
same vault might all use SSH tunnels to access an instance of the
"cache" backend on one of them (using some local disk to store the
cache), which proxies the actual vault storage to a vault on the other
end of a high-latency Internet link, again via an SSH tunnel.

A vault is where Ugarit stores backups (as chains of snapshots) and
archives (as chains of archive imports). Backups and archives are
identified by tags, which are the top-level named entry points into a
vault. A vault is based on top of a storage, along with a choice of
hash function, compression algorithm, and encryption that are used to
map the logical world of snapshots and archive imports into the
physical world of blocks stored in the storage.

A snapshot is a copy of a filesystem tree in the vault, with a header
block that gives some metadata about it. A backup consists of a number
of snapshots of a given filesystem.

An archive import is a set of filesystem trees, each along with
metadata about it. Whereas a backup is organised around a series of
timed snapshots, an archive is organised around the metadata; the
filesystem trees in the archive are identified by their properties.

<h2>So what, exactly, is in a vault?</h2>

A Ugarit vault contains a load of blocks, each up to a maximum size
(usually 1MiB, although other backends might impose smaller
limits). Each block is identified by the hash of its contents; this is
how Ugarit avoids ever uploading the same data twice, by checking to
see if the data to be uploaded already exists in the vault by
looking up the hash. The contents of the blocks are compressed and
then encrypted before upload.

Every file uploaded is, unless it's small enough to fit in a single
block, chopped into blocks, and each block uploaded. This way, the
entire contents of your filesystem can be uploaded - or, at least,
only the parts of it that aren't already there! The blocks are then
tied together to create a snapshot by uploading blocks full of the
hashes of the data blocks, and directory blocks are uploaded listing
the names and attributes of files in directories, along with the
hashes of the blocks that contain the files' contents. Even the blocks
that contain lists of hashes of other blocks are subject to checking
for pre-existence in the vault; if only a few MiB of your
hundred-GiB filesystem has changed, then even the index blocks and
directory blocks are re-used from previous snapshots.

Once uploaded, a block in the vault is never again changed. After all,
if its contents changed, its hash would change, so it would no longer
be the same block! However, every block has a reference count,
tracking the number of index blocks that refer to it. This means that
the vault knows which blocks are shared between multiple snapshots (or
shared *within* a snapshot - if a filesystem has more than one copy of
the same file, still only one copy is uploaded), so that if a given
snapshot is deleted, then the blocks that only that snapshot is using
can be deleted to free up space, without corrupting other snapshots by
deleting blocks they share. Keep in mind, however, that not all
storage backends may support this - there are certain advantages to
being an append-only vault. For a start, you can't delete something by
accident! The supplied fs and sqlite backends support deletion, while
the splitlog backend does not yet. However, the actual snapshot
deletion command in the user interface hasn't been implemented yet
either, so it's a moot point for now...
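Reference counting of shared blocks amounts to something like the following (a minimal sketch, ignoring persistence and the index blocks that actually hold the references):

```python
# Toy reference counting: each block tracks how many referrers it has,
# and is deleted only when the last reference goes away - so deleting
# one snapshot never corrupts another snapshot that shares its blocks.
blocks = {}     # hash -> block data
refcounts = {}  # hash -> number of referrers

def ref(key: str) -> None:
    refcounts[key] = refcounts.get(key, 0) + 1

def unref(key: str) -> None:
    refcounts[key] -= 1
    if refcounts[key] == 0:  # no snapshot uses this block any more
        del blocks[key]
        del refcounts[key]
```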

Finally, the vault contains objects called tags. Unlike the blocks,
the tags' contents can change, and they have meaningful names rather
than being identified by hash. Tags identify the top-level blocks of
snapshots within the system, from which (by following the chain of
hashes down through the index blocks) the entire contents of a
snapshot may be found. Unless you happen to have recorded the hash of
a snapshot somewhere, the tags are where you find snapshots from when
you want to do a restore.

Whenever a snapshot is taken, as soon as Ugarit has uploaded all the
files, directories, and index blocks required, it looks up the tag you
have identified as the target of the snapshot. If the tag already
exists, then the snapshot it currently points to is recorded in the
new snapshot as the "previous snapshot"; then the snapshot header
containing the previous snapshot hash, along with the date and time
and any comments you provide for the snapshot, and is uploaded (as
another block, identified by its hash). The tag is then updated to
point to the new snapshot.
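The tag-and-chain arrangement can be modelled as follows; an illustrative sketch in which the tags are the only mutable objects, and the header field names are invented for the example:

```python
import hashlib
import json

store = {}  # immutable blocks: hash -> contents
tags = {}   # mutable tags: name -> hash of the newest snapshot block

def put_block(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

def snapshot(tag: str, root_hash: str, comment: str) -> str:
    header = {"contents": root_hash,
              "previous": tags.get(tag),  # None for the first snapshot
              "comment": comment}
    new = put_block(json.dumps(header, sort_keys=True).encode())
    tags[tag] = new  # the tag now points at the new snapshot
    return new
```

Each snapshot header records its predecessor's hash, so following the "previous" links from the tag walks the whole chronological chain.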

This way, each tag actually identifies a chronological chain of
snapshots. Normally, you would use a tag to identify a filesystem
being backed up; you'd keep snapshotting the filesystem to the same
tag, resulting in all the snapshots of that filesystem hanging from
the tag. But if you wanted to remember any particular snapshot
(perhaps if it's the snapshot you take before a big upgrade or other
risky operation), you can duplicate the tag, in effect 'forking' the
chain of snapshots much like a branch in a version control system.

Archive imports cause the creation of one or more archive metadata
blocks, each of which lists the hashes of files or filesystem trees in
the archive, along with their metadata. Each import then has a single
archive import block pointing to the sequence of metadata blocks, and
pointing to the previous archive import block in that archive. The
same filesystem tree can be imported more than once to the same
archive, and the "latest" metadata always wins.

Generally, you should create lots of small archives for different
categories of things - such as one for music, one for photos, and so
on. You might well create separate archives for the music collections
of different people in your household, unless they overlap, and
another for Christmas music so it doesn't crop up in random shuffle
play! It's easy to merge archives if you over-compartmentalise them,
but harder to split an archive if you find it too cluttered with
unrelated things.

I've spoken of archive imports, and backup snapshots, each having a
"previous" reference to the last import or snapshot in the chain, but
it's actually more complex than that: they have an arbitrary list of
zero or more previous objects. As such, it's possible for several
imports or snapshots to have the same "previous", known as a "fork",
and it's possible to have an import or snapshot that merges multiple
previous ones.

Forking is handy if you want to basically duplicate an archive,
creating two new archives with the same contents to begin with, but
each then capable of diverging thereafter. You might do this to keep
the state of an archive before doing a big import, so you can go back
to the original state if you regret the import, for instance.

Forking a backup tag is a more unusual operation, but also
useful. Perhaps you have a server running many stateful services, and
the hardware becomes overloaded, so you clone the basic setup onto
another server, and run half of the services on the original and half
on the new one; if you fork the backup tag of the original server to
create a backup tag for the new server, then both servers' snapshot
histories will share the original server's history.

Merging is most useful for archives; you might merge several archives
into one, as mentioned.

And, of course, you can merge backup tags, as well. If your earlier
splitting of one server into two doesn't work out (perhaps your
workload reduces, or you can now afford a single, more powerful,
server to handle everything in one place), you might rsync back the
service state from the two servers onto the new server, so it's all
merged in the new server's filesystem. To preserve this in the
snapshot history, you can merge the two backup tags of the two servers
to create a backup tag for the single new server, which will
accurately reflect the history of the filesystem.

Also, tags might fork by accident. I plan to introduce a distributed
storage backend, which will replicate blocks and tags across multiple
storages to create a single virtual storage to build a vault on top
of. In the event of the network of actual storages suffering a
failure, it may be that snapshots and imports are only applied to some
of the storages - and then subsequent snapshots and imports only get
applied to some other subset of the storages. When the network is
repaired and all the storages are again visible, they will have
diverged, inconsistent states for their tags, and the distributed
storage system will resolve the situation by keeping the majority
state as the state of the tag on all the backends, while preserving
any other states by creating new tags with the original name plus a
suffix. These can then be merged to "heal" the conflict.

Added docs/release-2.0.wiki.

<h1>Ugarit 2.0 release notes</h1>

<h2>What's new?</h2>

Added archival mode [dae5e21ffc] and, to support its integration into
Ugarit, implemented typed tags [08bf026f5a], displayed tag types in
the VFS [30054df0b6], refactored the Ugarit internals [5fa161239c],
improved the storage of logs in the vault [68bb75789f], made it
possible to view logs from within the VFS [4e3673e0fe], supported
hidden tags [cf5ef4691c], recorded configuration information in the
vault (providing instant notification if your vault
hashing/encryption setup is incorrect, thanks to a clever idea by Andy
Bennett) [0500d282fc], rearranged how local caching is handled
[b5911d321a], and added support for the history of a snapshot or
archive tag to have arbitrary branches and merges [a987e28fef], which
(as a side-effect) improved the performance of running "ls" in long
snapshot histories [fcf8bc942a]. Also added an sqlite backend
[8719dfb84f], which makes testing easier but is useful in its own
right, as it's fully-featured and crash-safe while storing the vault
in a single file; and improved the appearance of the explore mode "ls"
command, as the VFS layout has become more complex with the new
log/properties views and all the archive mode stuff.

<h2>Upgrading</h2>

Ugarit 2.0 uses a new format for tags and logs, as well as the whole
new concept of archive tags. As such, the vault format has
changed. Ugarit 2.0 will read a vault created by prior versions of
Ugarit, and will silently upgrade it when it adds things to the vault
(by using the new format for new things, and keeping the old format for
old things). As such, when you upgrade to Ugarit 2.0 and start using
it on an existing vault, older versions of Ugarit will not be able to
read things that Ugarit 2.0 has added to the vault.

Added docs/release-old.wiki.

<h1>Ugarit v1.* release history</h1>


  *  1.0.9:  More humane display of sizes in explore's directory
     listings, using low-level I/O to reduce CPU usage. Myriad small
     bug fixes and some internal structural improvements.

  *  1.0.8: Bug fixes to work with the latest chicken master, and
     increased unit test coverage to test stuff that wasn't working
     due to chicken bugs. Looking good!

  *  1.0.7: Fixed bug with directory rules (errors arose when files
     were skipped). I need to improve the test suite coverage of
     high-level components to stop this happening!

  *  1.0.6: Fixed missing features from v1.0.5 due to a fluffed merge
     (whoops), added tracking of directory sizes (files+bytes) in the
     vault on snapshot and the use of this information to display
     overall percentage completion when extracting. Directory sizes
     can be seen in the explore interface when doing "ls -l" or "ls -ll".

  *  1.0.5: Changed the VFS layout slightly, making the existence of
     snapshot objects explicit (when you go into a tag, then go into a
     snapshot, you now need to go into "contents" to see the actual
     file tree; the snapshot object itself now exists as a node in the
     tree). Added traverse-vault-* functions to the core API, and tests
     for same, and used traverse-vault-node to drive the cd and get
     functions in the interactive explore mode (speeding them up in the
     process!). Added "extract" command. Added a progress reporting
     callback facility for snapshots and extractions, and used it to
     provide progress reporting in the front-end, every 60 seconds or
     so by default, not at all with -q, and every time something
     happens with -v. Added tab completion in explore mode.

  *  1.0.4: Resurrected support for compression and encryption and SHA2
  hashes, which had been broken by the failure of the
  <code>autoload</code> egg to continue to work as it used to. Tidying
  up error and ^C handling somewhat.

  *  1.0.3: Installed sqlite busy handlers to retry when the database is
   locked due to concurrent access (affects backend-fs, backend-cache,
   and the file cache), and gained an EXCLUSIVE lock when locking a
   tag in backend-fs; I'm not clear if it's necessary, but it can't
   hurt.

   BUGFIX: Logging of messages from storage backends wasn't
   happening correctly in the Ugarit core, leading to errors when the
   cache backend (which logs an info message at close time) was closed
   and the log message had nowhere to go.

  *  1.0.2: Made the file cache also commit periodically, rather than on
  every write, in order to improve performance. Counting blocks and
  bytes uploaded / reused, and file cache bytes as well as hits;
  reporting same in snapshot UI and logging same to snapshot
  metadata. Switched to the <code>posix-extras</code> egg and ditched our own
  <code>posixextras.scm</code> wrappers. Used the <code>parley</code> egg in the <code>ugarit
  explore</code> CLI for line editing. Added logging infrastructure,
  recording of snapshot logs in the snapshot. Added recovery from
  extraction errors. Listed lock state of tags in explore
  mode. Backend protocol v2 introduced (retaining v1 for
compatibility) allowing for an error on backend startup, and logging
  nonfatal errors, warnings, and info on startup and all protocol
  calls. Added <code>ugarit-archive-admin</code> command line interface to
  backend-specific administrative interfaces. Configuration of the
  splitlog backend (write protection, adjusting block size and logfile
  size limit and commit interval) is now possible via the admin
  interface. The admin interface also permits rebuilding the metadata
  index of a splitlog vault with the <code>reindex!</code> admin command.

  BUGFIX: Made file cache check the file hashes it finds in the
    cache actually exist in the vault, to protect against the case
    where a crash of some kind has caused unflushed changes to be
    lost; the file cache may well have committed changes that the
backend hasn't, leading to references to nonexistent blocks. Note
that we assume that vaults are sequentially safe, e.g. if the
    final indirect block of a large file made it, all the partial
    blocks must have made it too.

  BUGFIX: Added an explicit <code>flush!</code> command to the backend
    protocol, and put explicit flushes at critical points in higher
    layers (<code>backend-cache</code>, the vault abstraction in the Ugarit
    core, and when tagging a snapshot) so that we ensure the blocks we
    point at are flushed before committing references to them in the
    <code>backend-cache</code> or file caches, or into tags, to ensure crash
    safety.

  BUGFIX: Made the splitlog backend never exceed the file size limit
    (except when passed blocks that, plus a header, are larger than
    it), rather than letting a partial block hang over the 'end'.

  BUGFIX: Fixed tag locking, which was broken all over the
    place. Concurrent snapshots to the same tag should now block for
    one another, although why you'd want to *do* that is questionable.

  BUGFIX: Fixed generation of non-keyed hashes, which was
    incorrectly appending the type to the hash without an outer
hash. This breaks backwards compatibility, but nobody was using
    the old algorithm, right? I'll introduce it as an option if
    required.

  *  1.0.1: Consistency check on read blocks by default. Removed warning
  about deletions from backend-cache; we need a new mechanism to
  report warnings from backends to the user. Made backend-cache and
  backend-fs/splitlog commit periodically rather than after every
  insert, which should speed up snapshotting a lot, and reused the
  prepared statements rather than re-preparing them all the
  time.

  BUGFIX: splitlog backend now creates log files with
  "rw-------" rather than "rwx------" permissions; and all sqlite
  databases (splitlog metadata, cache file, and file-cache file) are
created with "rw-------" rather than "rw-r--r--".

  *  1.0: Migrated from gdbm to sqlite for metadata storage, removing the
  GPL taint. Unit test suite. backend-cache made into a separate
  backend binary. Removed backend-log.

  BUGFIX: file caching uses mtime *and*
  size now, rather than just mtime. Error handling so we skip objects
  that we cannot do something with, and proceed to try the rest of the
  operation.

  *  0.8: decoupling backends from the core and into separate binaries,
  accessed via standard input and output, so they can be run over SSH
  tunnels and other such magic.

  *  0.7: file cache support, sorting of directories so they're archived
  in canonical order, autoloading of hash/encryption/compression
  modules so they're not required dependencies any more.

  *  0.6: .ugarit support.

  *  0.5: Keyed hashing so attackers can't tell what blocks you have,
  markers in logs so the index can be reconstructed, sha2 support, and
  passphrase support.

  *  0.4: AES encryption.

  *  0.3: Added splitlog backend, and fixed a .meta file typo.

  *  0.2: Initial public release.

  *  0.1: Internal development release.

Added docs/security.wiki.

<h1>Security model</h1>

I have designed and implemented Ugarit to be able to handle cases
where the actual vault storage is not entirely trusted.

However, security involves tradeoffs, and Ugarit is configurable in
ways that affect its resistance to different kinds of attacks. Here I
will list different kinds of attack and explain how Ugarit can deal
with them, and how you need to configure it to gain that
protection.

<h2>Vault snoopers</h2>

This might be somebody who can intercept Ugarit's communication with
the vault at any point, or who can read the vault itself at their
leisure.

Ugarit's splitlog backend creates files with "rw-------" permissions
out of the box to try and prevent this. This is a pain for people who
want to share vaults between UIDs, but we can add a configuration
option to override this if that becomes a problem.

<h3>Reading your data</h3>

If you enable encryption, then all the blocks sent to the vault are
encrypted using a secret key stored in your Ugarit configuration
file. As long as that configuration file is kept safe, and the AES
algorithm is secure, then attackers who can snoop the vault cannot
decode your data blocks. Enabling compression will also help, as the
blocks are compressed before encrypting, which is thought to make
cryptographic analysis harder.

Recommendations: Use compression and encryption when there is a risk
of vault snooping. Keep your Ugarit configuration file safe using
UNIX file permissions (make it readable only by root), and maybe store
it on a removable device that's only plugged in when
required. Alternatively, use the "prompt" passphrase option, and be
prompted for a passphrase every time you run Ugarit, so it isn't
stored on disk anywhere.

<h3>Looking for known hashes</h3>

A block is identified by the hash of its content (before compression
and encryption). If an attacker was trying to find people who own a
particular file (perhaps a piece of subversive literature), they could
search Ugarit vaults for its hash.

However, Ugarit has the option to "key" the hash with a "salt" stored
in the Ugarit configuration file. This means that the hashes used are
actually a hash of the block's contents *and* the salt you supply. If
you do this with a random salt that you keep secret, then attackers
can't check your vault for known content just by comparing the hashes.
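A keyed hash can be sketched like so. This uses HMAC-SHA256 purely as an illustration (Ugarit's keyed hashing is built on its configured hash function, such as Tiger or SHA2, and the salt shown is obviously hypothetical):

```python
import hashlib
import hmac

block = b"contents of some well-known file"

# Anyone can compute this, so an attacker can search a vault for it:
public_hash = hashlib.sha256(block).hexdigest()

# With a secret salt mixed in, the resulting hash is useless to an
# attacker who doesn't know the salt:
keyed_hash = hmac.new(b"hypothetical secret salt", block,
                      hashlib.sha256).hexdigest()
```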

Recommendations: Provide a secret string to your hash function in your
Ugarit configuration file. Keep the Ugarit configuration file safe, as
per the advice in the previous point.

<h2>Vault modifiers</h2>

These folks can modify Ugarit's writes into the vault, its reads
back from the vault, or can modify the vault itself at their leisure.

Modifying an encrypted block without knowing the encryption key can at
worst be a denial of service, corrupting the block in an unknown
way. An attacker who knows the encryption key could replace a block
with valid-seeming but incorrect content. In the worst case, this
could exploit a bug in the decompression engine, causing a crash or
even an exploit of the Ugarit process itself (thereby gaining the
powers of a process inspector, as documented below). We can but hope
that the decompression engine is robust. Exploits of the decryption
engine, or other parts of Ugarit, are less likely due to the nature of
the operations performed upon them.

However, if a block is modified, then when Ugarit reads it back, the
hash will no longer match the hash Ugarit requested, which will be
detected and an error reported. The hash is checked after
decryption and decompression, so this check does not protect us
against exploits of the decompression engine.
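The read-side check amounts to re-hashing what came back and comparing it against the hash that was asked for; a minimal sketch using plain SHA-256, with the decryption and decompression steps elided:

```python
import hashlib

def fetch_block(store: dict, key: str) -> bytes:
    # In Ugarit, the block would be decrypted and decompressed first;
    # the hash check happens on the recovered plaintext.
    data = store[key]
    if hashlib.sha256(data).hexdigest() != key:
        raise IOError("block %s is corrupt or has been tampered with" % key)
    return data
```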

This protection is only afforded when the hash Ugarit asks for is not
tampered with. Most hashes are obtained from within other blocks,
which are therefore safe unless that block has been tampered with; the
nature of the hash tree conveys the trust in the hashes up to the
root. The root hashes are stored in the vault as "tags", which a
vault modifier could alter at will. Therefore, the tags cannot be
trusted if somebody might modify the vault. This is why Ugarit
prints out the snapshot hash and the root directory hash after
performing a snapshot, so you can record them securely outside of the
vault.

The most likely threat posed by vault modifiers is that they could
simply corrupt or delete all of your vault, without needing to know
any encryption keys.

Recommendations: Secure your vaults against modifiers, by whatever
means possible. If vault modifiers are still a potential threat,
write down a log of your root directory hashes from each snapshot, and keep
it safe. When extracting your backups, use the <code>ls -ll</code> command in the
interface to check the "contents" hash of your snapshots, and check
they match the root directory hash you expect.

<h2>Process inspectors</h2>

These folks can attach debuggers or similar tools to running
processes, such as Ugarit itself.

Ugarit backend processes only see encrypted data, so people who can
attach to that process gain the powers of vault snoopers and
modifiers, and the same conditions apply.

People who can attach to the Ugarit process itself, however, will see
the original unencrypted content of your filesystem, and will have
full access to the encryption keys and hashing keys stored in your
Ugarit configuration. When Ugarit is running with sufficient
permissions to restore backups, they will be able to intercept and
modify the data as it comes out, and probably gain total write access
to your entire filesystem in the process.

Recommendations: Ensure that Ugarit does not run under the same user
ID as untrusted software. In many cases it will need to run as root in
order to gain unfettered access to read the filesystems it is backing
up, or to restore the ownership of files. However, when all the files
it backs up are world-readable, it could run as an untrusted user for
backups, and where file ownership is trivially reconstructible, it can
do restores as a limited user, too.

<h2>Attackers in the source filesystem</h2>

These folks create files that Ugarit will back up one day. By having
write access to your filesystem, they already have some level of
power, and standard Unix security practices such as storage quotas
should be used to control them. They may be people with logins on your
box, or, more subtly, people who can cause servers to write files;
somebody who sends an email to your mailserver will probably cause
that message to be written to queue files, as will people who can
upload files via any means.

Such attackers might use up your available storage by creating large
files. This creates a problem in the actual filesystem, but that
problem can be fixed by deleting the files. If those files get
stored into Ugarit, then they are a part of that snapshot. If you
are using a backend that supports deletion, then (when I implement
snapshot deletion in the user interface) you could delete that entire
snapshot to recover the wasted space, but that is a rather serious
operation.

More insidiously, such attackers might attempt to abuse a hash
collision in order to fool the vault. If they have a way of creating
a file that, for instance, has the same hash as your shadow password
file, then Ugarit will think that it already has that file when it
attempts to snapshot it, and store a reference to the existing
file. If that snapshot is restored, then they will receive a copy of
your shadow password file. Similarly, if they can predict a future
hash of your shadow password file, and create a shadow password file
of their own (perhaps one giving them a root account with a known
password) with that hash, they can then wait for the real shadow
password file to have that hash. If the system is later restored from
that snapshot, then their chosen content will appear in the shadow
password file. However, doing this requires a very fundamental break
of the hash function being used.

Recommendations: Think carefully about who has write access to your
filesystems, directly or indirectly via a network service that stores
received data to disk. Enforce quotas where appropriate, and consider
not backing up "queue directories" where untrusted content might
appear; migrate incoming content that passes acceptance tests to an
area that is backed up. If necessary, the queue might be backed up to
a non-snapshotting system, such as rsyncing to another server, so that
any excessive files that appear in there are removed from the backup
in due course, while still affording protection.

Added docs/storage-admin.wiki.

<h1>Storage administration</h1>

Each backend offers a number of administrative commands for
administering the storage underlying vaults. These are accessible via
the <code>ugarit-storage-admin</code> command line interface.

To use it, run it with the following command:

<pre>$ ugarit-storage-admin '<vault identifier>'</pre>

The available commands differ between backends, but all backends
support the <code>info</code> and <code>help</code> commands, which
give basic information about the vault, and list all available
commands, respectively. Some offer a <code>stats</code> command that
examines the vault state to give interesting statistics, which may be
a time-consuming operation.

<h2>Administering <code>splitlog</code> storages</h2>

The splitlog backend offers a wide selection of administrative
commands. See the <code>help</code> command on a splitlog vault for
details. The following commands are available:

<dl>

<dt><code>help</code></dt>
<dd>List the available commands.</dd>

<dt><code>info</code></dt>
<dd>List some basic information about the storage.</dd>

<dt><code>stats</code></dt>
<dd>Examine the metadata to provide overall statistics about the
storage. This may be a time-consuming operation on large
storages.</dd>

<dt><code>set-block-size! BYTES</code></dt>
<dd>Sets the block size to the given number of bytes. This will affect
new blocks written to the storage, and leave existing blocks
untouched, even if they are larger than the new block size.</dd>

<dt><code>set-max-logfile-size! BYTES</code></dt>
<dd>Sets the size at which a log file is finished and a new one
started (likewise, existing log files will be untouched; this will
only affect new log files).</dd>

<dt><code>set-commit-interval! UPDATES</code></dt>
<dd>Sets the frequency of automatic synching of the storage
state to disk. Lowering this harms performance when writing to the
storage, but decreases the number of in-progress block writes that
can fail in a crash.</dd>

<dt><code>write-protect!</code></dt>
<dd>Disables updating of the storage.</dd>

<dt><code>write-unprotect!</code></dt>
<dd>Re-enables updating of the storage.</dd>

<dt><code>reindex!</code></dt>
<dd>Reindex the storage, rebuilding the block and tag state from the
contents of the log. If the metadata file is damaged or lost,
reindexing can rebuild it (although any configuration changes made
via other admin commands will need manually repeating as they are
not logged).</dd>
</dl>

<h2>Administering <code>sqlite</code> storages</h2>

The sqlite backend has a similar administrative interface to the
splitlog backend, except that it does not have log files, so lacks the
<code>set-max-logfile-size!</code> and <code>reindex!</code> commands.

<h2>Administering <code>cache</code> storages</h2>

The cache backend provides a minimalistic interface:

<dl>

<dt><code>help</code></dt>
<dd>List the available commands.</dd>

<dt><code>info</code></dt>
<dd>List some basic information about the storage.</dd>

<dt><code>stats</code></dt>
<dd>Report on how many entries are in the cache.</dd>

<dt><code>clear!</code></dt>
<dd>Clears the cache, dropping all the entries in it.</dd>

</dl>

Changes to ugarit-api.scm.

                        ('double-check
                         (set! (job-check-correctness? (current-job)) #t))
                        ('store-atime
                         (set! (job-store-atime? (current-job)) #t))
                        ('store-ctime
                         (set! (job-store-ctime? (current-job)) #t))
                        (('storage command-line)
                         (set! *storage* 
                               (with-backend-logging 
                                (import-storage command-line))))
                        (('hash . conf) (set! *hash* conf))
                        (('compression . conf) (set! *compression* conf))
                        (('encryption . conf) (set! *crypto* conf))
                        (('cache path)
                         (set! *cache-path* path))
                        (('file-cache path)

                        ('double-check
                         (set! (job-check-correctness? (current-job)) #t))
                        ('store-atime
                         (set! (job-store-atime? (current-job)) #t))
                        ('store-ctime
                         (set! (job-store-ctime? (current-job)) #t))
                        (('storage command-line)
                         (set! *storage*
                               (with-backend-logging
                                (import-storage command-line))))
                        (('hash . conf) (set! *hash* conf))
                        (('compression . conf) (set! *compression* conf))
                        (('encryption . conf) (set! *crypto* conf))
                        (('cache path)
                         (set! *cache-path* path))
                        (('file-cache path)

Changes to ugarit.release-info.

(repo fossil "https://www.kitten-technologies.co.uk/project/{egg-name}")
(uri zip "https://www.kitten-technologies.co.uk/project/{egg-name}/zip/{egg-name}.zip?uuid={egg-release}")
(release "1.0")
(release "1.0.1")
(release "1.0.2")
(release "1.0.3")
(release "1.0.4")
(release "1.0.5")
(release "1.0.6")
(release "1.0.7")
(release "1.0.9")

(repo fossil "https://www.kitten-technologies.co.uk/project/{egg-name}")
(uri zip "https://www.kitten-technologies.co.uk/project/{egg-name}/zip/{egg-name}.zip?uuid={egg-release}")
(release "1.0")
(release "1.0.1")
(release "1.0.2")
(release "1.0.3")
(release "1.0.4")
(release "1.0.5")
(release "1.0.6")
(release "1.0.7")
(release "1.0.9")
(release "2.0")

Changes to ugarit.setup.

(use posix)

(define *version* "1.0.9")

(define (newer file1 file2)
  (or
   (not (get-environment-variable "UGARIT_FAST_BUILD"))
   (not (file-exists? file2))
   (> (file-modification-time file1)
      (file-modification-time file2))))

(use posix)

(define *version* "2.0")

(define (newer file1 file2)
  (or
   (not (get-environment-variable "UGARIT_FAST_BUILD"))
   (not (file-exists? file2))
   (> (file-modification-time file1)
      (file-modification-time file2))))