A myriad of markup systems
It’s hard to avoid the legions of custom markup systems out there these days. Every Wiki has its own syntactical quirks, while packages like Markdown, Textile, BBCode (in dozens of variants) and reStructuredText offer easy ways of hooking markup conversion into existing applications. When it comes to being totally over-implemented and infuriatingly inconsistent, markup systems are rapidly catching up with template packages. Never one to miss out on an opportunity to reinvent the wheel, I’ve worked on several of each ;)
My most recent markup handling attempt has just been published as part of my SitePoint article on Bookmarklets (cliché). It’s a structured markup language in a bookmarklet: activate the bookmarklet to convert the text in any textarea on a page to XHTML. The syntax is ridiculously simple, and serves my limited needs just fine:
= This is a header

Here is a paragraph.

* This is a list of items
* Another item in the list
Converts to:
<h4>This is a header</h4>
<p>Here is a paragraph.</p>
<ul>
<li>This is a list of items</li>
<li>Another item in the list</li>
</ul>
The algorithm is simple, and easily portable to any language you care to mention:
- Normalise newlines to \n, for cross-platform consistency.
- Split the text up on double newlines, to create a list of blocks.
- For each block:
  - If it starts with an equals sign, wrap it in header tags.
  - If it starts with an asterisk, split it into lines, make each a list item (stripping off the asterisk at the start of the line if required) and glue them all together inside a <ul>.
  - Otherwise, wrap it in a <p> tag provided it doesn’t have one already.
- Glue everything back together again with a couple of newlines, to make the underlying XHTML look pretty.
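The steps above can be sketched in Python. This is only an illustration of the algorithm as described, not the bookmarklet's actual JavaScript; the function name `expand_shorthand` and the choice of `<h4>` for headers (taken from the sample output) are assumptions.

```python
import re

def expand_shorthand(text):
    """Convert the simple shorthand described above to XHTML (hypothetical helper)."""
    # Normalise newlines to \n for cross-platform consistency.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split on runs of two or more newlines to get a list of blocks.
    blocks = [b.strip() for b in re.split(r"\n{2,}", text) if b.strip()]
    out = []
    for block in blocks:
        if block.startswith("="):
            # Header block: drop the equals sign and wrap in header tags.
            out.append("<h4>" + block[1:].strip() + "</h4>")
        elif block.startswith("*"):
            # List block: one item per line, stripping a leading asterisk if present.
            items = []
            for line in block.split("\n"):
                line = line.strip()
                if not line:
                    continue
                if line.startswith("*"):
                    line = line[1:].strip()
                items.append("<li>" + line + "</li>")
            out.append("<ul>\n" + "\n".join(items) + "\n</ul>")
        elif re.match(r"<p[\s>]", block, re.IGNORECASE):
            # Already wrapped in a paragraph tag: leave it alone.
            out.append(block)
        else:
            out.append("<p>" + block + "</p>")
    # Glue the blocks back together with a blank line between them.
    return "\n\n".join(out)

sample = "= This is a header\n\nHere is a paragraph.\n\n* This is a list of items\n* Another item in the list"
print(expand_shorthand(sample))
```

Run against the sample input shown earlier, this prints the same header, paragraph and list markup, with a blank line between blocks.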
The bookmarklet comes in two flavours: Expand HTML Shorthand (the full version) and Expand HTML Shorthand IE, which loses header support in order to fit within IE’s crippling 508 character limit. A more capable bookmarklet could be built using the import-script-stub method described in my article, but the implementation of such a thing is left as an exercise for the reader (I’ve always wanted to say that).
Incidentally, there’s a very common bug in markup systems that allow inline styles that proves extremely difficult to fix: that of improperly nested tags. Say you have a system where *text* is bold and _text_ is italic; what happens when the user enters _italic*italic-bold_bold*? Most systems (and that includes Markdown, Textile and my home-rolled Python solution) use naive regular expressions for inline markup processing and will output badly formed XHTML: <em>italic<strong>italic-bold</em>bold</strong>. Truly solving this problem requires a context-sensitive parser, which involves an unpleasantly large amount of effort to fix what looks like a simple bug.
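To make the bug concrete, here is a small Python sketch. `naive_inline` is a stand-in for the regular-expression approach (it is not the code of any of the packages mentioned), and `nested_inline` is one possible way to keep tags properly nested by tracking open tags on a stack. Both names are hypothetical, and the stack version is deliberately minimal: it does no HTML escaping and treats every `*` and `_` as markup.

```python
import re

def naive_inline(text):
    # Two independent substitutions; neither knows about the other's tags.
    text = re.sub(r"\*(.+?)\*", r"<strong>\1</strong>", text)
    text = re.sub(r"_(.+?)_", r"<em>\1</em>", text)
    return text

# Assumed marker-to-tag mapping, following the example in the text.
MARKERS = {"*": "strong", "_": "em"}

def nested_inline(text):
    out = []
    open_tags = []  # stack of currently open tag names
    for ch in text:
        tag = MARKERS.get(ch)
        if tag is None:
            out.append(ch)
        elif tag not in open_tags:
            # Opening marker: push the tag.
            open_tags.append(tag)
            out.append("<" + tag + ">")
        else:
            # Closing marker: close any tags opened after this one,
            # close the tag itself, then reopen the others.
            reopen = []
            while open_tags[-1] != tag:
                inner = open_tags.pop()
                out.append("</" + inner + ">")
                reopen.append(inner)
            open_tags.pop()
            out.append("</" + tag + ">")
            for inner in reversed(reopen):
                open_tags.append(inner)
                out.append("<" + inner + ">")
    # Close anything the user left open.
    while open_tags:
        out.append("</" + open_tags.pop() + ">")
    return "".join(out)

example = "_italic*italic-bold_bold*"
print(naive_inline(example))   # <em>italic<strong>italic-bold</em>bold</strong>
print(nested_inline(example))  # <em>italic<strong>italic-bold</strong></em><strong>bold</strong>
```

The stack version closes and reopens the inner tag at the point where the user's markers overlap, so the output is well formed while still rendering the same way.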
Yeah, it's fairly simple, but I've been coding HTML since it was in version 2 and XHTML since it was only a recommendation. Along with that, I also program in several programming languages. The last thing I need is another markup language to try and remember. I am just as comfortable typing in XHTML when I make an entry in my blog, and I pretty much avoid putting formatting code in others' blogs so as not to potentially trash their layout.
I think the concept is cool, but why go the extra step? Unless, of course, you don't have HTML/XHTML memorized; then it makes more sense.
Dave Giffin - 13th April 2004 05:31 - #
I use it purely as shorthand; it's faster to type a couple of newlines than it is to add four paragraph tags (a pair around each paragraph). The same is especially true of list syntax.
Simon Willison - 13th April 2004 06:04 - #
Keith - 13th April 2004 06:49 - #
Grant - 13th April 2004 07:20 - #
Chris Vincent - 13th April 2004 08:07 - #
Just remember when creating a custom markup language... What everyone really wants - I arrogantly propose - is a program that intelligently converts keystrokes into markup in a WYSI(Almost)WYG environment. (i.e. "*" starting an li element and possibly a ul element, "enter" ending it and possibly the ul element too, "enter" adding p elements without going crazy like MS Word, etcetera...) These new markup languages - which are for the purpose of being translated into (X)HTML - end up being used as a cheap replacement for the keystrokes and the WYSIWYG style without the actual program. (Or, it means more work by the writer and less by the programmer.)
I think someone needs (silly wording alert :-p) to program a program with programmable keys. Like a real-time text-to-screen/markup XSLT style sheet processor. These other markup language approaches seem cheap (and ego-stroking) yet clunky too.
Jimmy Cerra - 13th April 2004 08:35 - #
Sorry for the double post...
No, it doesn't seem hard. Simply make your parser find the beginning and end of a tag before recursively calling itself. IE:
Isn't that simple? Perhaps you should be using text-to-xml XSLT style sheets instead of clunky RegExp (like Norman Walsh - at least I think that is what he does).
Why not, instead of using a RegExp,
Jimmy Cerra - 13th April 2004 09:07 - #
kevin - 13th April 2004 09:15 - #
Simon Willison - 13th April 2004 09:45 - #
You can decide whether this is by design or chance. ;)
Phil Wilson - 13th April 2004 10:42 - #
I've built my own text markup system and what I've done to solve improperly nested markup is to implement a recursive algorithm, using regular expressions, in a simple loop (with a FIFO buffer).
I then realized that I can use the same technique to see if a piece of xml is well-formed [1] using just regular expressions. The code only occupies 160 lines of readable Perl (comments and all).
[1] Well-formed in the sense that all tags are properly closed and properly nested, all attribute values are quoted, and there are no duplicate or minimized attributes. I've ignored the other criteria for well-formedness, like those pertaining to XML entities.
Eugene - 13th April 2004 13:44 - #
Pat - 13th April 2004 14:50 - #
Eugene - 13th April 2004 15:40 - #
As I am now using Markdown on my own blog, I was surprised to read that Markdown failed your test. You may be using an older beta, although I thought that Markdown had correctly handled improper nesting since the first public release.
I tested it against the current release (1.0 beta 4) via the Dingus (a web-based Markdown tester). It correctly converted:
into:
Although I can write XHTML fluently, I find authoring blog entries in XHTML adds too much friction to the process. Instead of typing what I'm thinking, I end up thinking about what I'm typing. Unless I'm blogging about XHTML, I don't want to think about closing tags, etc, while writing. To me, writing prose (blog entries) is a much different exercise than coding, and writing XHTML is coding. I wouldn't write my entire blog structure (headers, sidebars, footers, etc) in a language like Markdown, but it works very well for composing posts.
Prior to Markdown, I used Textile on my blog. While at first it seemed to reduce composition friction, I found after a couple of months that it really was just a different type of friction. There were too many things that didn't work quite the way I expected, or that I needed to test. I often had problems with blockquotes and code blocks, for example.
In this regard, Markdown performs much better for me. By adopting the idioms of plain-text email, the syntax is nearly second nature. More intuitive handling of block-level elements such as blockquotes and code blocks means that I spend less time thinking about how to express layout and semantics, and more time expressing my thoughts. The friction has dropped considerably, although you'd never know from my infrequent posting schedule of late. Develop a markup system that adds a few more hours to the day, and you'll get my business. ;)
Jason Clark - 13th April 2004 15:41 - #
I think in these days of easy in-browser WYSIWYG editors that you have to think carefully about why or if you need a custom markup language.
I chose Restructured Text for the wiki I'm working on because it has some real advantages when writing about code -- especially since it is usable in non-wiki environments as well (e.g., in Python docstrings, extracted with epydoc), and as a text-based markup language it integrates with my text-based programs and text-based programming environments (Emacs).
But if there weren't these specific reasons for using ReST, I'm pretty sure I'd stick to a WYSIWYG and make (X)HTML my canonical form. (And I'm still planning on adding that as an option)
Ian Bicking - 13th April 2004 17:05 - #
mike - 13th April 2004 19:33 - #
Jimmy's approach is basically a recursive descent parser that does some default behavior rather than throwing an error upon exceptions.
Probably just the thing for this problem.
With all this talk of RegEx HTML parsing, I was wondering where I might find a description of the proper phases of HTML parsing?
I suppose it can't be strictly phased, else entity substitution, such as "&lt;", could change the meaning of a document. Or entity substitution could come last?
I know I could just go look at some code. I guess I'm looking more for editorial on why a "proper" parser might be implemented in the way it is.
Anyway, I imagine there are lots of RegEx hacks running around with various partial implementations of HTML parsing, but imagining a general-purpose HTML parser made this way smacks of "two problems".
Jeremy Dunck - 13th April 2004 20:52 - #
Jeremy, I take it you have never heard of the desperate perl hacker? :-)
Jimmy Cerra - 14th April 2004 00:52 - #
asdasf - 14th April 2004 05:05 - #
Danny - 14th April 2004 11:40 - #
I don't like these markup substitutes. I already know XHTML. XHTML works. XHTML is regular. Lots of other people know XHTML. Validators exist for XHTML. You can throw together a subset of XHTML suitable for comments easily with XHTML 1.1 modules. Lots of software already exists for XHTML. Lots of other specifications like XPath and XSLT can be used with XHTML.
It's going to take more than shaving off a few milliseconds by using newlines instead of paragraph tags to make me want to throw all that away.
Jim Dabell - 14th April 2004 13:16 - #
There is clearly something I don't get about Perl. Lots of smart people write tons of useful stuff using it-- and I can't stand doing something trivial with it.
Start permathread now: If the power of CPAN were available for Python or Ruby, would Perl stand a chance?
If Python or Ruby provided seamless interop with Perl?
Jeremy Dunck - 14th April 2004 16:23 - #
Yes, Perl would still stand a chance. Programming languages "die" for one of two reasons: 1) everyone switches to another language because they prefer it, or 2) the language becomes irrelevant. For example, if C was written to only run on PDP-11s it wouldn't still be around.
They don't die off because some particular feature is added to another language.
CPAN is one of the biggest reasons people love Perl, but it is not the only reason. Just like you cannot stand to code in Perl, I for example can't stand coding in Python. It, like most things, is a matter of personal choice and has very little to do with the underlying tech, feature set, performance, etc, etc. Actually it tends to have more to do with marketing and hype than anything else. A la Java.
One thing to think about is that the "power" of CPAN is available to Python. They already have a start with the Vaults of Parnassus. I think the biggest Python problem is its naming conventions. For example, if you create a module in Perl that allows you to treat ASCII files as databases, it would be called something along the lines of Database::Ascii or Database::AsciiFile. On the Vault site they have one, but it's named FLAD. FLAD tells you nothing about the software or what it does; Database::Ascii at least hints at it. One of the biggest problems I see for Python going forward is people trying to remember, let alone talk amongst each other about, the different third-party add-ons. Cute names are not only annoying, but they make the underlying product seem inferior from the start.
Which sounds more professional to you "I'm going to make some graphs with GNU Plot." or "I'm going to make some graphs with Biggles."? :)
Frank Wiles - 14th April 2004 22:56 - #
Of course I'm talking from a position of ignorance about Perl, but I agree that CPAN's module naming is darned useful. Last night, I searched CPAN for modules named "test", and came up with several that would suit my need.
As far as sounding professional, I'm an IT drone, and GNU-anything is likely to freak people out. Pointy-hair would say "not VB or Java?" while twitching.
Jeremy Dunck - 14th April 2004 23:57 - #
Frank Wiles - 15th April 2004 15:54 - #
You know, I don't really see any problems with lots of text to HTML formats popping up. As long as they're properly specced, and all lead to valid HTML, interop shouldn't be hard at all, and neither should parsing one, once you've already got a parser for another one.
It could just be me, but none of the existing text to HTML formats conceptually work for me, yet, so I'm quite happy to see more flourish.
Lach - 16th April 2004 01:15 - #
I use bbcode. It works out pretty well for me, especially with speedy regex. I preferred not to keep my XHTML stored in the database but instead to store the bbcode, so that I can output posts to multiple formats without having it all in XHTML format.
And on that invalid bbcode issue with the [b][i]text[/b][/i], I simply ignore any tags that get parsed late (I think it's greedy). Yes, it's lazy and reduces functionality, but I offer other tags such as [bu]text[/bu] and [bi]text[/bi] for people who just can't live without it.
The in-browser WYSIWYG editors he suggested only work in IE browsers? Ian (I feel like I've seen this happen before... déjà vu)
owen - 16th April 2004 01:18 - #