Thu, 30 Jan 2014
Twingler, a generic data structure munger
(Like everything else in this section, these are notes for a project that was never completed.)

Introduction

These incomplete notes from 1997-2001 grapple with the problem of transforming data structures in a language like Perl, Python, Java, JavaScript, or even Haskell. A typical problem is to take an input of this type:
and to transform it to an output of this type:
One frequently writes code of this sort, and it should be possible to specify the transformation with some sort of high-level declarative syntax that is easier to read and write than the following gibberish:
This is especially horrible in Perl, but it is bad in any language. Here it is in a hypothetical language with a much less crusty syntax:
You still can't see what is really going on without executing the code in your head. It is hard for a beginner to write, and hard for anyone to understand.

Original undated notes from around 1997–1998

Consider this data structure DS1:
This could be transformed several ways:
Basic idea: Transform original structure of nesting depth N into an N-dimensional table. If Nth nest is a hash, index table ranks by hash keys; if an array, index by numbers. So for example, DS1 becomes
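As a rough sketch of this idea in Python (the structure `nested` and the function `to_table` are invented for illustration and are not DS1 or anything from the notes), a nested structure can be flattened into a table keyed by tuples of indices, with one tuple position per nesting level. Hash levels contribute their keys and array levels contribute their numeric positions:

```python
def to_table(ds, index=()):
    """Flatten nested dicts/lists into {index_tuple: leaf_value}.

    Each nesting level adds one coordinate to the index tuple:
    a dict key for a hash level, an integer position for an array level.
    Empty containers contribute no entries.
    """
    if isinstance(ds, dict):
        items = ds.items()
    elif isinstance(ds, list):
        items = enumerate(ds)
    else:
        # A leaf: the accumulated index is its full table coordinate.
        return {index: ds}
    table = {}
    for key, value in items:
        table.update(to_table(value, index + (key,)))
    return table


# Hypothetical two-level structure: a hash of arrays.
nested = {"fruit": ["apple", "pear"], "veg": ["leek"]}
table = to_table(nested)
print(table)
# {('fruit', 0): 'apple', ('fruit', 1): 'pear', ('veg', 0): 'leek'}
```

Once every leaf has coordinates like this, a transformation amounts to a rule for rearranging the coordinates, which is what the notation ideas that follow try to express.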
Or maybe hashes should be handled a little differently? The original basic idea was more about DS2 and transformed it into
Maybe the rule is: For hashes, use a boolean table indexed by keys and values; for arrays, use a string table indexed by integers.

Notation idea: Assign names to the dimensions of the table, say X and Y. Then denote transformations by:
The (...) are supposed to indicate a chaining of elements within the larger structure. But maybe this isn't right. At the bottom: How do we say whether
turns into
or [ X => [Y, Z] ] (accumulation)

Consider
Note that:
Brackets and braces just mean brackets and braces. Variables at the same level of nesting imply a loop over the cartesian join. Variables subnested imply a nested loop. So:
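One possible reading of these two rules, sketched in Python. The data and the function names `same_level` and `subnested` are hypothetical, and the notes may have intended something different; the point is only the difference between a loop over a cartesian product and a nested loop:

```python
from itertools import product

# Hypothetical table: X ranges over the keys, Y over the values under each key.
table = {"a": [1, 2], "b": [3]}


def same_level(xs, ys):
    """Reading of [X, Y]: one entry per pair in the cartesian product."""
    return [[x, y] for x, y in product(xs, ys)]


def subnested(tbl):
    """Reading of [X, [Y]]: outer loop over X, inner loop over that X's Ys."""
    result = []
    for x, ys in tbl.items():
        inner = []
        for y in ys:
            inner.append(y)
        result.append([x, inner])
    return result


print(same_level(["a", "b"], [1, 2]))
# [['a', 1], ['a', 2], ['b', 1], ['b', 2]]
print(subnested(table))
# [['a', [1, 2]], ['b', [3]]]
```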
But
Hmmm. Maybe there's a better syntax for this. Well, with this plan:
It seems pretty flexible. You could just as easily write
and you'd get
If there's a `count' function, you can get
or maybe we'll just overload

Question: How to invert this process? That's important so that you can ask it to convert one data structure to another. Also, then you could write something like
and omit the X's and Y's. Real example: From proddir. Given
For example:
Turn this into
Something interesting happened here. Suppose we have
And we ask for

In the example above, why didn't we get
If the outer iteration was supposed to be over all id-name-desc triples? Maybe we need
Then you could say
to indicate that you want to uniq a list. But maybe the old notation already allowed this:
It's still unclear how to write the example above, which has unique key-triples. But it's in a hash, so it gets uniqed on ID anyway; maybe that's all we need.

1999-10-23

Rather than defining some bizarre metalanguage to describe the transformation, it might be easier all around if the user just enters a sample input, a sample desired output, and lets the twingler figure out what to do. Certainly the parser and internal representation will be simpler. For example:
should be enough for it to figure out that the code is:
Advantage: After generating the code, it can run it on the sample input to make sure that the output is correct; otherwise it has a bug. Input grammar:
Simple enough. Note that (...) lines are not allowed. They are only useful at the top level. A later version can allow them. It can replace the outer (...) with [...] or {...} as appropriate when it sees the first top-level separator. (If there is a => at the top level, it is a hash, otherwise an array.)

Idea for code generation: Generate pseudocode first. Then translate to Perl. Then you can insert a peephole optimizer later. For example
could be optimized to
add into hash: as key, add into value, replace value
add into array: at end only

How do we analyze something like:
Idea: Analyze structure of input. Analyze structure of output and figure out an expression to deposit each kind of output item. Iterate over input items. Collect all input items into variables. Deposit items into output in appropriate places. For an input array, tag the items with index numbers. See where the indices go in the output. Try to discern a pattern. The above example:
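A minimal sketch of this tagging step in Python, under stated assumptions: the input is an array of arrays, the output is a hash of arrays, and the sample data and all function names are invented for illustration rather than taken from the notes. Each input item is tagged with its 1-based position inside its row, and the tags are then matched against the role each item plays in the output:

```python
from collections import defaultdict


def tag_positions(rows):
    """Map each item to the set of 1-based positions it occupies in its row."""
    tags = defaultdict(set)
    for row in rows:
        for pos, item in enumerate(row, start=1):
            tags[item].add(pos)
    return dict(tags)


def output_roles(out):
    """Map each item of a hash of arrays to its roles: 'key' or 'element'."""
    roles = defaultdict(set)
    for key, values in out.items():
        roles[key].add("key")
        for value in values:
            roles[value].add("element")
    return dict(roles)


def discern_pattern(rows, out):
    """For each input position, collect the output roles of the items seen there."""
    roles = output_roles(out)
    pattern = defaultdict(set)
    for item, positions in tag_positions(rows).items():
        for pos in positions:
            pattern[pos] |= roles.get(item, set())
    return dict(pattern)


# Hypothetical sample pair.
sample_in = [["A", "B"], ["C", "B"], ["D", "E"]]
sample_out = {"B": ["A", "C"], "E": ["D"]}
print(discern_pattern(sample_in, sample_out))
# {1: {'element'}, 2: {'key'}}
```

A position that ends up with exactly one role gives a usable rule. A position that collects both roles signals that this simple tagging is not enough to explain the sample.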
OK—2s are keys, 1s are array elements. A different try fails:
Now consider:
A,C,D get 1; B,E get 2. This works again. 1s are keys, 2s are values.

I need a way of describing an element of a nested data structure as a simple descriptor so that I can figure out the mappings between descriptors. For arrays and nested arrays, it's pretty easy: Use the sequence of numeric indices. What about hashes? Just K/V? Or does V need to be qualified with the key perhaps? Example above:
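One way such descriptors might be computed, sketched in Python. The function `describe`, the tuple encoding, and the sample data are assumptions made for illustration. Array levels contribute a 1-based index, hash keys are labeled `"K"`, and hash values are labeled `("V", key)`, which is the "qualified with the key" variant raised above:

```python
def describe(ds, path=()):
    """Yield (descriptor, leaf) pairs for a nested structure of lists and dicts."""
    if isinstance(ds, list):
        for i, item in enumerate(ds, start=1):
            yield from describe(item, path + (i,))
    elif isinstance(ds, dict):
        for key, value in ds.items():
            # The key itself is a leaf, labeled K at this level.
            yield (path + ("K",), key)
            # Everything under the value carries the key it belongs to.
            yield from describe(value, path + (("V", key),))
    else:
        yield (path, ds)


# Hypothetical input and output structures.
top = [["A", "B"], ["C", "B"]]
bottom = {"B": ["A", "C"]}

print(list(describe(top)))
# [((1, 1), 'A'), ((1, 2), 'B'), ((2, 1), 'C'), ((2, 2), 'B')]
print(list(describe(bottom)))
# [(('K',), 'B'), ((('V', 'B'), 1), 'A'), ((('V', 'B'), 2), 'C')]
```

Dropping the key from the `("V", key)` label gives the plain K/V variant, which loses the information about which key a value sits under.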
Now try to find a mapping from the top set of labels to the bottom.
Problem with this:
is unresolvable. Still, maybe this works well enough in most common cases. Let's consider:
etc. Conclusion: How to reverse? Simpler reverse example:
Conclusion: What if V items have the associated key too?
Now there's enough information to realize that B and C stay with the A, if we're smart enough to figure out how to use it.

2001-07-28

Sent to Nyk Cowham

2001-08-24

Sent to Timur Shtatland

2001-10-28

Here's a great example. The output from
we want to transform this into