Archive:
In this section:
Subtopics:
Comments disabled |
Thu, 30 Jan 2014
Twingler, a generic data structure munger
(Like everything else in this section, these are notes for a project that was never completed.) IntroductionThese incomplete notes from 1997-2001 are grappling with the problem of transforming data structures in a language like Perl, Python, Java, Javascript, or even Haskell. A typical problem is to take an input of this type:
and to transform it to an output of this type:
One frequently writes code of this sort, and it should be possible to specify the transformation with some sort of high-level declarative syntax that is easier to read and write than the following gibberish:
This is especially horrible in Perl, but it is bad in any language. Here it is in a hypothetical language with a much less crusty syntax:
You still can't see what it really going on without executing the code in your head. It is hard for a beginner to write, and hard to anyone to understand. Original undated notes from around 1997–1998Consider this data structure DS1:
This could be transformed several ways:
Basic idea: Transform original structure of nesting depth N into an N-dimensional table. If Nth nest is a hash, index table ranks by hash keys; if an array, index by numbers. So for example, DS1 becomes
Or maybe hashes should be handled a little differently? The original basic idea was more about DS2 and transformed it into
Maybe the rule is: For hashes, use a boolean table indexed by keys and values; for arrays, use a string table index by integers. Notation idea: Assign names to the dimensions of the table, say X and Y. Then denote transformations by:
The (...) are supposed to incdicate a chaining of elements within the larger structure. But maybe this isn't right. At the bottom: How do we say whether
turns into
or [ X => [Y, Z] ] (accumulation) Consider
Note that:
Brackets and braces just mean brackets and braces. Variables at the same level of nesting imply a loop over the cartesian join. Variables subnested imply a nested loop. So:
But
Hmmm. Maybe there's a better syntax for this. Well, with this plan:
It seems pretty flexible. You could just as easily write
and you'd get
If there's a `count' function, you can get
or maybe we'll just overload Question: How to invert this process? That's important so that you can ask it to convert one data structure to another. Also, then you could write something like
and omit the X's and Y's. Real example: From proddir. Given
For example:
Turn this into
Something interesting happened here. Suppose we have
And we ask for In the example above, why didn't we get
If the outer iteration was supposed to be over all id-name-desc triples? Maybe we need
Then you could say
to indicate that you want to uniq a list. But maybe the old notation already allowed this:
It's still unclear how to write the example above, which has unique key-triples. But it's in a hash, so it gets uniqed on ID anyway; maybe that's all we need. 1999-10-23Rather than defining some bizarre metalanguage to describe the transformation, it might be easier all around if the user just enters a sample input, a sample desired output, and lets the twingler figure out what to do. Certainly the parser and internal representation will be simpler. For example:
should be enough for it to figure out that the code is:
Advantage: After generating the code, it can run it on the sample input to make sure that the output is correct; otherwise it has a bug. Input grammar:
Simple enough. Note that (...) lines are not allowed. They are only useful at the top level. A later version can allow them. It can replace the outer (...) with [...] or {...] as appropirate when it sees the first top-level separator. (If there is a => at the top level, it is a hash, otherwise an array.) Idea for code generation: Generate pseudocode first. Then translate to Perl. Then you can insert a peephole optimizer later. For example
could be optimized to
add into hash: as key, add into value, replace value add into array: at end only How do we analyze something like:
Idea: Analyze structure of input. Analyze structure of output and figure out an expression to deposit each kind of output item. Iterate over input items. Collect all input items into variables. Deposit items into output in appropriate places. For an input array, tag the items with index numbers. See where the indices go in the output. Try to discern a pattern. The above example:
OK—2s are keys, 1s are array elements. A different try fails:
Now consider:
A,C,D get 1; B,E get 2. this works again. 1s are keys, 2s are values. I need a way of describing an element of a nested data structure as a simple descriptor so that I can figure out the mappings between descriptors. For arrays and nested arrays, it's pretty easy: Use the sequence of numeric indices. What about hashes? Just K/V? Or does V need to be qualified with the key perhaps? Example above:
Now try to find a mapping from the top set of labels to the bottom.
Problem with this:
is unresolvable. Still, maybe this works well enough in most common cases. Let's consider:
etc. Conclusion: How to reverse? Simpler reverse example:
Conclusion: What if V items have the associated key too?
Now there's enough information to realize that B and C stay with the A, if we're smart enough to figure out how to use it. 2001-07-28Sent to Nyk Cowham 2001-08-24Sent to Timur Shtatland 2001-10-28Here's a great example. The output from
we want to transform this into
[Other articles in category /notes] permanent link Sun, 19 Jan 2014
Notes on a system for abbreviating SQL queries
(This post inaugurates a new section on my blog, for incomplete notes. It often happens that I have some idea, usually for software, and I write up a bunch of miscellaneous notes about it, and then never work on it. I'll use this section to post some of those notes, mainly just because I think they might be interesting, but also in the faint hope that someone might get interested and do something with it.) Why are simple SQL queries so verbose?For example:
(This is 208 characters.) I guess about two-thirds of this is unavoidable, but those join-using clauses ought to be omittable, or inferrable, or abbreviatable, or something.
(Only 94 characters.)
Then the question arises of how to join the batches to the clients. This is the only really interesting part of this project, and the basic rule is that it shouldn't do anything really clever. There is a graph, which the program can figure out from looking at the foreign key constraints. And the graph should clearly have a short path from batches through products to clients.
If something is truly ambiguous, we can issue an intelligent request for clarification:
Overview
Can 1 and 2 really be separated? They can in the example above, but maybe not in general. I think separating 3 and putting it at the end is a good idea: don't
try to use field name abbreviations to disambiguate and debreviate
table names. Only go the other way. But this means that we can't
debreviate What if something like About abbreviationsAbbreviations for
There is a tradeoff here: the more different kinds of abbreviations you accept, the more likely there are to be ambiguities. About table inferenceThere could also be a preferences file that lists precedences for
tables and fields: if it lists About join inferenceShort join paths are preferred to long join paths. If it takes a long time to generate the join graph, cache it. Build it automatically on the first run, and then rebuild it on request later on. More examples(this section blank) Implementation notesMaybe convert the input to a Note that this requires that the input be valid SQL. Your original idea for the abbreviated SQL began with
rather than
but the original version would probably be ruled out by this implementation. In this case that is not a big deal, but this choice of implementation might rule out more desirable abbreviations in the future. Correcting dumb mistakes in the SQL language design might be in Quirky's purview. For example, suppose you do
Application notesRJBS said he would be reluctant to use the abbreviated version of a query in a program. I agree: it would be very foolish to do so, because adding a table or a field might change the meaning of an abbreviated SQL query that was written into a program ten years ago and has worked ever since. This project was never intended to abbreviate queries in program source code. Quirky is mainly intended for one-off queries. I picture it going into an improved replacement for the MySQL command-line client. It might also find use in throwaway programs. I also picture a command-line utility that reads your abbreviated query and prints the debreviated version for inserting into your program. Miscellaneous notes(In the original document this section was blank. I have added here some notes I made in pen on a printout of the foregoing, on an unknown date.) Maybe also abbreviate Since debreviation is easier [than join inference] do it first! Another idea: " [Other articles in category /notes] permanent link Fri, 10 Jan 2014
DateTime::Moonpig, a saner interface to DateTime
(This article was previously published at the Perl Advent Calendar on 2013-12-23.) The DateTime suite is an impressive tour de force, but I hate its interface. The methods it provides are usually not the ones you want, and the things it makes easy are often things that are not useful. MutatorsThe most obvious example is that it has too many mutators. I believe
that date-time values are a kind of number, and should be treated like
numbers. In particular they should be immutable. Rik Signes has
a hair-raising story
about an accidental mutation that caused a hard to diagnose bug,
because the DateTime::DurationBut the most severe example, the one that drives me into a rage, is
that the For example, suppose you would like to know how much time elapses between 1969-04-02 02:38:17 EST and 2013-12-25 21:00:00 EST. You can set up the two DateTime objects for the time, and subtract them using the overloaded minus operator:
Internally this invokes You might want to know how long this interval is, so you can compare
it to similar intervals. So you might want to know how many seconds
this is. It happens that the two times are exactly 1,411,669,328
seconds apart, but there's no way to get the It seems like there are methods that will get you the actual
elapsed time in seconds, but none of them will do it. For example,
And indeed, no method can tell you how long the duration really is, because the subtraction has thrown away all the information about how long the days and months and years were—days, months and years vary in length—so it simply doesn't know how much time this object actually represents. Similarly if you want to know how many days there are between the two dates, the DateTime::Duration object won't tell you because it can't tell you. If you had the elapsed seconds difference, you could convert it to the correct number of days simply by dividing by 86400 and rounding off. This works because, even though days vary in length, they don't vary by much, and the variations cancel out over the course of a year. If you do this you find that the elapsed number of days is approximately 16338.7653, which rounds off to 16338 or 16339 depending on how you want to treat the 18-hour time-of-day difference. This result is not quite exact, but the error is on the order of 0.000002%. So the elapsed seconds are useful, and you can compute other useful values with them, and get useful answers. In contrast, DateTime::Duration's answer of "536 months and 23 days" is completely useless because months vary in length by nearly 10% and DateTime has thrown away the information about how long the months were. The best you can do to guess the number of days from this is to multiply the 536 months by 30.4375, which is the average number of days in a month, and add 23. This is clumsy, and gets you 16337.5 days—which is close, but wrong. To get what I consider a useful answer out of the DateTime objects you must not use the overloaded subtraction operator; instead you must do this:
What's DateTime::Moonpig for?
DateTime::Moonpig introduces five main changes to the interface of DateTime:
Under the covers, it is all just DateTime objects, which seem to do
what one needs. Other than the mutators, all the many DateTime
methods work just the same; you are even free to use
When I first started this module, I thought it was likely to be a failed experiment. I expected that the Moonpig::DateTime objects would break once in a while, or that some operation on them would return a DateTime instead of a Moonpig::DateTime, which would cause some later method call to fail. But to my surprise, it worked well. It has been in regular use in Moonpig for several years. I recently split it out of Moonpig, and released it to CPAN. I will be interested to find out if it works well in other contexts. I am worried that disabling the mutators has left a gap in functionality that needs to be filled by something else. I will be interested to hear reports from people who try. [Other articles in category /prog/perl] permanent link Sat, 04 Jan 2014There is a famous mistake of Augustin-Louis Cauchy, in which he is supposed to have "proved" a theorem that is false. I have seen this cited many times, often in very serious scholarly literature, and as often as not Cauchy's purported error is completely misunderstood, and replaced with a different and completely dumbass mistake that nobody could have made. The claim is often made that Cauchy's Course d'analyse of 1821 contains a "proof" of the following statement: a convergent sequence of continuous functions has a continuous limit. For example, the Wikipedia article on "uniform convergence" claims:
The nLab article on "Cauchy sum theorem" states:
Cauchy never claimed to have proved any such thing, and it beggars belief that Cauchy could have made such a claim, because the counterexamples are so many and so easily located. For example, the sequence !! f_n(x) = x^n!! on the interval !![-1,1]!! is a sequence of continuous functions that converges everywhere on !![0,1]!! to a discontinuous limit. You would have to be a mathematical ignoramus to miss this, and Cauchy wasn't. Another simple example, one that converges everywhere in !!\mathbb R!!, is any sequence of functions !!f_n!! that are everywhere zero, except that each has a (continuous) bump of height 1 between !!-\frac1n!! and !!\frac1n!!. As !!n\to\infty!!, the width of the bump narrows to zero, and the limit function !!f_\infty!! is everywhere zero except that !!f_\infty(0)=1!!. Anyone can think of this, and certainly Cauchy could have. A concrete example of this type is $$f_n(x) = e^{-x^{2}/n}$$ which converges to 0 everywhere except at !! x=0 !!, where it converges to 1. Cauchy's controversial theorem is not what Wikipedia or nLab claim. It is that that the pointwise limit of a convergent series of continuous functions is always continuous. Cauchy is not claiming that $$f_\infty(x) = \lim_{i\to\infty} f_i(x)$$ must be continuous if the limit exists and the !!f_i!! are continuous. Rather, he claims that $$S(x) = \sum_{i=1}^\infty f_i(x)$$ must be continuous if the sum converges and the !!f_i!! are continuous. This is a completely different claim. It premise, that the sum converges, is much stronger, and so the claim itself is much weaker, and so much more plausible. Here the counterexamples are not completely trivial. Probably the best-known counterexample is that a square wave (which has a jump discontinuity where the square part begins and ends) can be represented as a Fourier series. (Cauchy was aware of this too, but it was new mathematics in 1821. Lakatos and others have argued that the theorem, understood in the way that continuity was understood in 1821, is not actually erroneous, but that the idea of continuity has changed since then. One piece of evidence strongly pointing to this conclusion is that nobody complained about Cauchy's controversial theorem until 1847. But had Cauchy somehow, against all probability, mistakenly claimed that a sequence of continuous functions converges to a continuous limit, you can be sure that it would not have taken the rest of the mathematical world 26 years to think of the counterexample of !!x^n!!.) The confusion about Cauchy's controversial theorem arises from a perennially confusing piece of mathematical terminology: a convergent sequence is not at all the same as a convergent series. Cauchy claimed that a convergent series of continuous functions has a continuous limit. He did not ever claim that a convergent sequence of continuous functions had a continuous limit. But I have often encountered claims that he did that, even though such such claims are extremely implausible. The claim that Cauchy thought a sequence of continuous functions converges to a continuous limit is not only false but is manifestly so. Anyone making it has at best made a silly and careless error, and perhaps doesn't really understand what they are talking about, or hasn't thought about it. [ I had originally planned to write about this controversial theorem in my series of articles about major screwups in mathematics, but the longer and more closely I looked at it the less clear it was that Cauchy had actually made a mistake. ] [Other articles in category /math] permanent link |