Magic Pipes
Documentation

Commands to add/modify

Error handling in mpls

At present, hitting an unreadable directory produces an error like this:

[alaric@ahusai magic-pipes]$ mpls -R / | mpfilter '(lambda (de) (and (dirent-filename de) (string=? (dirent-filename de) "magic-pipes.scm")))' | mpmap dirent-path

Error: (directory) cannot open directory - Permission denied: "/home/temp"
...

It would be nicer to skip files that produce errors (and report them to stderr).
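A minimal sketch of the skipping behaviour, assuming CHICKEN 5 module names. directory/skip-errors is a hypothetical helper, not something mpls already contains, and the "mpls: skipping" message format is made up for illustration:

```scheme
(import (chicken file) (chicken condition) (chicken format))

;; Hypothetical helper: list a directory, but if it cannot be read,
;; report the problem on stderr and return '() instead of aborting.
(define (directory/skip-errors path)
  (handle-exceptions exn
      (begin
        (fprintf (current-error-port) "mpls: skipping ~a: ~a~%"
                 path
                 (get-condition-property exn 'exn 'message "unknown error"))
        '())
    (directory path)))
```

A recursive walk built on this would simply see unreadable directories as empty and carry on.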

mpsqlite "create/extend table mode"

A variation on the insert/update/replace modes that creates the table if required, and creates any columns needed for things found in the alist but not in the table. Can be used on a nonexistent table to start it from scratch.
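One way the schema step could work is sketched below. The procedure names are hypothetical, alist keys are assumed to be symbols, column names are neither quoted nor typed, and the existing columns are passed in as a list rather than read from the database:

```scheme
(import (chicken string))

;; Keys of the alist that are not yet columns of the table.
(define (missing-columns alist existing)
  (let loop ((keys (map car alist)) (acc '()))
    (cond ((null? keys) (reverse acc))
          ((memq (car keys) existing) (loop (cdr keys) acc))
          (else (loop (cdr keys) (cons (car keys) acc))))))

;; SQL to run before inserting `alist` into `table`.
;; `existing` is #f if the table does not exist yet,
;; otherwise a list of its column names as symbols.
(define (schema-statements table alist existing)
  (if (not existing)
      (list (conc "CREATE TABLE " table " ("
                  (string-intersperse
                   (map symbol->string (map car alist)) ", ")
                  ")"))
      (map (lambda (col) (conc "ALTER TABLE " table " ADD COLUMN " col))
           (missing-columns alist existing))))
```

For example, (schema-statements "people" '((name . "x") (age . 3)) '(name)) yields a single ALTER TABLE statement adding the age column.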

mpsqlite "direct input mode"

When running mpsqlite in output mode, rather than specifying an existing sqlite DB, it would be nice to accept a stream of alists on standard input to build into an in-memory table (schema specified how?) that can then be queried.

Or take a list of filenames (or fds) to read input from, each into its own table.

mpcat -r <read mode> -f <filter> path...

If read mode is raw, or unspecified as that's the default:

Takes a list of filenames on the command line, and calls the user-supplied filter procedure with each file opened for reading as (current-input-port) in turn, with the dirent of the file as the sole argument. The default procedure is (lambda (de) (cons (dirent-path de) (read-string))), making a list of files into an alist.

If read mode is csv, json or xml:

Takes a list of filenames on the command line, and calls the user-supplied filter procedure with two arguments for each file: the dirent of the file, and the result of reading the file as CSV, JSON, or XML.

The default filter is (lambda (de content) (cons (dirent-path de) content)).

mpflatten (FIXME)

Reads s-expressions from standard input, and if they're lists, writes the elements out in turn. Otherwise, just writes the s-expression.
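The behaviour described above is small enough to sketch in full. The procedure names are hypothetical. Note that under this reading an empty list produces no output at all, and an improper pair such as (1 . 2) is written unchanged:

```scheme
;; Write each element of a list on its own line;
;; write anything that is not a list unchanged.
(define (flatten-one x)
  (if (list? x)
      (for-each (lambda (e) (write e) (newline)) x)
      (begin (write x) (newline))))

;; Read s-expressions from the current input port until EOF.
(define (mpflatten-loop)
  (let loop ((x (read)))
    (unless (eof-object? x)
      (flatten-one x)
      (loop (read)))))
```

Only one level is flattened: given (a (b c)), the outputs are a and (b c).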

mpsort (FIXME)

mpsort -c -r -p <integer> <expr> [<expr>]

The first expression must produce a two-argument comparison procedure, and defaults to "smart<" if none is present. The second expression must produce a single-argument key extraction procedure, which defaults to the identity.

Reads in all the s-expressions from the input, sorts them by applying the comparison procedure to the results of applying the extraction procedure to each of them, then writes them out in sorted order.

If (-c) is specified, then the extraction procedure is assumed to be expensive, and its result computed and cached at load time.

If (-r) is specified, then the sort order is reversed.
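A rough sketch of how the core sort could treat (-c) and (-r), assuming the sort procedure from CHICKEN 5's (chicken sort) module. sort-by-key and its argument names are hypothetical:

```scheme
(import (chicken sort))

;; `less?` and `key` stand for the procedures that the two
;; command-line expressions evaluate to.
(define (sort-by-key items less? key cache? reverse?)
  (let ((before? (if reverse?
                     (lambda (a b) (less? b a)) ; -r: swap the arguments
                     less?)))
    (if cache?
        ;; -c: pair each item with its key up front, so `key`
        ;; is called exactly once per item, then strip the keys.
        (map cdr
             (sort (map (lambda (x) (cons (key x) x)) items)
                   (lambda (a b) (before? (car a) (car b)))))
        ;; Without -c, `key` is called again for every comparison.
        (sort items (lambda (a b) (before? (key a) (key b)))))))
```

For example, (sort-by-key '("ccc" "a" "bb") < string-length #t #f) gives ("a" "bb" "ccc").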

Provide smart< and smart> procedures, which compare things in a type-agnostic way: < for numbers, string<? for strings, recursive testing for pairs and vectors.
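A possible shape for these procedures is sketched below. How values of different types should compare is not decided above, so the type-rank ordering here (empty list, then numbers, strings, pairs, vectors, everything else) is purely an assumption, as is comparing vectors by converting them to lists:

```scheme
;; Assumed ordering between types; not part of the notes above.
(define (type-rank x)
  (cond ((null? x) 0)
        ((real? x) 1)    ; < is only defined on real numbers
        ((string? x) 2)
        ((pair? x) 3)
        ((vector? x) 4)
        (else 5)))

(define (smart< a b)
  (let ((ra (type-rank a)) (rb (type-rank b)))
    (cond ((< ra rb) #t)
          ((> ra rb) #f)
          ;; From here on, a and b have the same rank.
          ((real? a) (< a b))
          ((string? a) (string<? a b))
          ((pair? a)
           ;; Lexicographic: decide on the cars, else recurse on the cdrs.
           (cond ((smart< (car a) (car b)) #t)
                 ((smart< (car b) (car a)) #f)
                 (else (smart< (cdr a) (cdr b)))))
          ((vector? a) (smart< (vector->list a) (vector->list b)))
          (else #f))))   ; otherwise treated as equal

(define (smart> a b) (smart< b a))
```

Ranking the empty list lowest makes a list sort before any longer list that it is a prefix of.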

As usual, the procedures have no access to current input or output ports, but can write to the error port.

If (-p) is specified, then rather than sorting in memory, we start the specified number of threads, each of which reads s-expressions from a bounded FIFO and sends them to a child mpsort process. A master thread reads s-expressions from standard input and round-robins them to the FIFOs, skipping any FIFOs that are "full" and blocking if they all are. Each child process also has a reader thread that reads its sorted output into another FIFO, and a final output thread merges the sorted FIFOs into a single sorted stream on standard output. #!eof is used as a marker in the FIFOs to record the actual end of the input, to distinguish EOF from a FIFO that is empty only because the source has not produced anything yet.

Or do we make a separate mpmerge tool that takes a list of filenames on the command line along with extract and compare procedure expressions, and invoke that using a set of FIFOs which the sub-mpsorts feed out to?

Is it worth having an option to go multi-machine by running mpsort from inetd (perhaps in parallel mode to use multiple cores) on remote machines and parallelising via TCP rather than running a child process? That would be kind of cool and not too hard.

Or for huge sorts (where there's not enough memory available), we could have a flag that splits the input into temporary files of up to a certain size, sorts them individually one by one, then merges the results together.

mpgroup (FIXME)

mpgroup -a -t -f -l <expr>

The expression must be a single-argument procedure. It is applied to each input s-expression to obtain a "key" for each input s-expression.

As usual, the procedure has no access to current input or output ports, but can write to the error port.

If (-a) is specified, then the s-expressions are accumulated in memory by their keys, into a hash table. If (-f) is specified, then only the first s-expression for each key is kept; if (-l) is specified, then only the last is kept. At the end, the hash table is written out; if (-t) is specified, it is written as one list per key, the first element being the key value and the rest being the s-expressions with that key. If (-t) is not specified, then it is just one list per key, but without the key as the first element. The order in which the keys are listed is undefined, but if neither (-f) nor (-l) is specified, the s-expressions within a key are in the order they were read.
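A sketch of the accumulating mode in its (-t) form, with neither (-f) nor (-l). To stay dependency-free it uses an association list where the real tool would use a hash table, compares keys with equal? (an assumption), and group-all is a hypothetical name:

```scheme
;; Returns one (key item ...) list per distinct key.
(define (group-all items key)
  (let loop ((items items) (groups '()))
    (if (null? items)
        ;; Items were pushed in reverse, so restore reading order.
        (map (lambda (g) (cons (car g) (reverse (cdr g))))
             (reverse groups))
        (let* ((k (key (car items)))
               (g (assoc k groups)))
          (if g
              (begin
                ;; Known key: push the item onto that group.
                (set-cdr! g (cons (car items) (cdr g)))
                (loop (cdr items) groups))
              ;; New key: start a group.
              (loop (cdr items)
                    (cons (list k (car items)) groups)))))))
```

This happens to list keys in the order they were first seen, which the description above does not promise. Dropping the car of each result gives the form without (-t).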

If (-a) is not specified, then the s-expressions are not accumulated and spat out in a single batch; instead, they are output in the same order that they were read in, but grouped into lists of s-expressions having the same key in a contiguous run. If (-t) is specified, the key value is prepended to the list. If (-f) is specified, then only the first s-expression in each run of the same key value is listed (and if (-t) is not specified, then it is output as-is rather than as a single-element list). Likewise, if (-l) is specified, then only the last s-expression in each run of the same key value is listed, and unless (-t) is specified, it is written as-is without a single-element list enclosing it.
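The run-based mode can be sketched as follows, again in its (-t) form and without (-f) or (-l). This version works on a list in memory for brevity, whereas the real tool would emit each run as soon as the key changes. group-runs is a hypothetical name and keys are compared with equal? as an assumption:

```scheme
;; Returns one (key item ...) list per contiguous run of equal keys.
(define (group-runs items key)
  (let loop ((items items) (runs '()))
    (if (null? items)
        (reverse (map (lambda (r) (cons (car r) (reverse (cdr r))))
                      runs))
        (let ((k (key (car items))))
          (if (and (pair? runs) (equal? k (caar runs)))
              ;; Same key as the current run: extend it.
              (loop (cdr items)
                    (cons (cons k (cons (car items) (cdar runs)))
                          (cdr runs)))
              ;; Key changed: start a new run.
              (loop (cdr items)
                    (cons (list k (car items)) runs)))))))
```

Unlike the accumulating mode, the same key can appear in several output lists: (group-runs '(1 3 2 4 5) odd?) gives three runs, two of them with key #t.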

mpmerge, mpjoin, mpcogroup?

Do we need these more advanced operators from the database world, or can they be done in other ways?

mpmerge would need to accept a list of file names and read from them all (possibly including standard input as well), comparing already-sorted input elements using a supplied comparison expression similar to mpsort, and output the results in merged order.
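The merging step itself is straightforward if the inputs are already lists in memory. A sketch, assuming the merge procedure from CHICKEN 5's (chicken sort) module; merge-sorted is a hypothetical name, and a real mpmerge would read lazily from ports instead of holding everything in lists:

```scheme
(import (chicken sort))

;; Merge any number of already-sorted lists into one sorted list.
(define (merge-sorted lists less?)
  (let loop ((ls lists) (acc '()))
    (if (null? ls)
        acc
        (loop (cdr ls) (merge acc (car ls) less?)))))
```

For example, (merge-sorted '((1 4 7) (2 5) (3 6 8)) <) gives (1 2 3 4 5 6 7 8).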

mpcogroup would also accept a list of input file names (possibly including standard input) and, for each, an expression mapping an s-expression to a join key value. For each distinct join key value in the entire input, it would output a list starting with the join key value, followed by a (possibly empty) list of matching s-expressions from each input file in order.

mpjoin would work much like mpcogroup, except that the output would consist of the cross product of each group. Each s-expression in the output would be a list with the join key value followed by one element per input file, containing an s-expression from each file that produced the same join expression.
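Given one mpcogroup-style group (a key plus one list of matches per input file), the mpjoin output for that key could be computed as sketched here. Both procedure names are hypothetical:

```scheme
;; All ways of picking one element from each list, in order.
(define (cross-product lists)
  (if (null? lists)
      '(())
      (let ((tails (cross-product (cdr lists))))
        (apply append
               (map (lambda (x)
                      (map (lambda (t) (cons x t)) tails))
                    (car lists))))))

;; One output row per combination, each prefixed with the join key.
(define (join-rows key groups)
  (map (lambda (row) (cons key row))
       (cross-product groups)))
```

With this definition a key that has no matches in one of the files produces no rows at all, which corresponds to an inner join. Whether mpjoin should also offer outer-join behaviour is left open.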

mpcogroup/mpjoin might build up a hash table internally, then if it reaches a certain limiting size, write it to a temporary sqlite file and then continue writing into that until it's time to generate output.

mptree (FIXME)

mptree <id-expr> <parent-expr> <children-expr> <output-expr>

Reads input s-expressions and organises them into a tree. For each s-expression, the single-argument procedures that the first three expressions evaluate to are called, yielding an identifier for the s-expression, an identifier for its parent (or #f if it cannot be obtained), and a list of identifiers of its children (or '() or #f if they cannot be obtained). Using whatever information is available, parent/child relationships are found between the s-expressions, forming one or more trees. If conflicts arise (multiple parents for the same s-expression), an error is signalled and processing stops. If no errors occur, then a set (hopefully a singleton) of roots (s-expressions with no parents) is found, each at the head of a nice tree.

If the output expression is supplied, then it is applied to each tree in turn (in some arbitrary order). The trees are represented by "node" record instances, which have the following accessors:

(These come from a magic-pipes-runtime-tree module which is automatically loaded.)

If no output expression is supplied, a default one is used which renders the nodes as s-expressions with the node data as the first element and the children thereafter, indented neatly to show the structure.
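A much-reduced sketch of the tree building, producing the nested-list shape of the default output (node data first, children after). It uses only the id and parent procedures, ignores the children procedure, does no conflict detection, and silently drops items whose parent is not in the input. build-trees is a hypothetical name and identifiers are compared with equal? as an assumption:

```scheme
(define (build-trees items id parent)
  ;; Items satisfying `pred`, in input order.
  (define (keep pred)
    (let loop ((xs items) (acc '()))
      (cond ((null? xs) (reverse acc))
            ((pred (car xs)) (loop (cdr xs) (cons (car xs) acc)))
            (else (loop (cdr xs) acc)))))
  ;; A node is its data followed by the subtrees of its children.
  (define (subtree x)
    (cons x
          (map subtree
               (keep (lambda (y) (equal? (parent y) (id x)))))))
  ;; Roots are the items with no parent identifier.
  (map subtree (keep (lambda (x) (not (parent x))))))
```

This rescans the whole input for every node, so it is quadratic; a real implementation would index the items by identifier first.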

mprandom (FIXME)

Take random samples of the input: either pick any s-expression with a given chance, or read all the s-expressions into RAM and pick N at random.
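The first variant (keep each s-expression with a given chance) might look like this, assuming pseudo-random-real from CHICKEN 5's (chicken random) module. sample-with-chance is a hypothetical name and it works on a list for brevity, although this variant could equally run as a streaming filter:

```scheme
(import (chicken random))

;; Keep each item independently with probability `chance` (0 to 1).
(define (sample-with-chance items chance)
  (let loop ((xs items) (acc '()))
    (cond ((null? xs) (reverse acc))
          ((< (pseudo-random-real) chance)
           (loop (cdr xs) (cons (car xs) acc)))
          (else (loop (cdr xs) acc)))))
```

The number of items kept varies from run to run; only the pick-N variant gives a fixed sample size.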

mpshuffle (FIXME)

Read input s-expressions into a list, shuffle, and output the result.
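The shuffle step could be a Fisher-Yates shuffle over a vector, sketched here assuming pseudo-random-integer from CHICKEN 5's (chicken random) module. shuffle is a hypothetical name:

```scheme
(import (chicken random))

;; Return the items of `items` in a random order.
(define (shuffle items)
  (let ((v (list->vector items)))
    ;; Walk from the last index down, swapping each slot with a
    ;; randomly chosen slot at or below it.
    (do ((i (- (vector-length v) 1) (- i 1)))
        ((< i 1) (vector->list v))
      (let* ((j (pseudo-random-integer (+ i 1)))
             (tmp (vector-ref v i)))
        (vector-set! v i (vector-ref v j))
        (vector-set! v j tmp)))))
```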

mphead (FIXME)

mptail (FIXME)

mpsxpath (FIXME)

mpps (FIXME)

mplookup-set (FIXME)

mplookup-set <map file> <expr1> <expr2>

In the given map file (which, if nonexistent, is created), set expr1 to map to expr2.

If the exprs are omitted, then sexprs are read from standard input, and must be pairs, the first element of which is treated as expr1 and the second as expr2, and are all set into the map in order.

mplookup-delete (FIXME)

mplookup-delete <map file> <expr>

Deletes the given mapping from the given map file. If the expression is omitted, then expressions are read from stdin and removed from the map file. If the map file does not exist, an error is raised.

mplookup-dump (FIXME)

mplookup-dump <map file>

Spits out the contents of the map file as a sequence of pairs, with the car being the key and the cdr the value. This can be piped into mplookup-set to effect map file format conversions.

mpfork (FIXME)

mpfork -x <integer>...

Runs the given list of shell commands in parallel, distributing input s-expressions to them atomically, and atomically merging their output s-expressions to standard output. If any commands terminate before their input is closed, mpfork terminates with an error.

(-x) specifies a multiplier factor; subsequent commands are "repeated" that many times. (-x) defaults to 1, in practice.

Implementation: a pair of threads is spawned for each command, one for input and one for output (standard error is left untouched). Each thread has a single-sexpr buffer.

A master input thread reads s-exprs from standard input and places them in the first empty input buffer in the list of command input threads, in round-robin fashion, blocking if none are available.

A master output thread blocks until at least one output buffer is full, then scans in round-robin fashion to find and empty it to standard output.

Once input is closed, all the subprocess standard inputs are closed; and once all the subprocesses have terminated, mpfork terminates.

Runtime library

More dirent utilities

A procedure to canonicalise the pathname of a dirent.

Useful UNIX information procedures in runtime library

uid->username (see posix unit)
username->uid
gid->groupname
groupname->gid
ip->hostnames (see hostinfo egg)
hostname->ips
get-environment-variable (aliased to $)

Infrastructure

Safe reader

Currently, feeding sexprs from untrusted sources into magic pipes scripts runs the risk of people using unsafe Chicken reader features to execute arbitrary code. I should find a way to have a safe reader.

Test suite

It could be a shell script that feeds in known inputs and compares the results with expected outputs.

for script in tests/*.sh
do
  # Fixture names are derived from the script name:
  # foo.sh reads foo.in, writes foo.out, and is checked against foo.expected
  input="${script%.sh}.in"
  output="${script%.sh}.out"
  expected="${script%.sh}.expected"
  "$script" < "$input" > "$output"
  # diff exits non-zero when the files differ
  if ! diff "$output" "$expected"
  then
    echo "TEST $script FAILED"
    exit 1
  fi
done