Curl Global Community


Serialized Scurl data: an Eng-Nihongo dictionary - RobertShiplett - 08-09-2012

Over at my site, kanji.aule-browser.com, the Japanese-English Edict2 dataset is nearing completion in Curl (it is already complete in plain old HTML). At this point only 6 files remain to process: 1 for hiragana dictionary entries, 3 for katakana, and 2 others.

The first issue in using Curl instead of JSON was the compiler stopping after about 15,000 lines of code.

This was addressed by serializing batches of some 14,000 records each as arrays of dictionary entry objects. The data is actually in three layers: very light (exclusion: if there is no match, try the next set), lighter (establish inclusion and get the JMDict ID), and full data (the record is accessed directly by ID, using the lighter record).
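To make the batching and the three layers concrete, here is a rough sketch in Python rather than Curl. It is not the applet's actual code: the field names (`id`, `headword`), the choice of first characters as the very light exclusion key, and JSON as the serialization format are all assumptions for illustration.

```python
import json

BATCH_SIZE = 14_000  # the post mentions "some 14,000 records" per batch


def split_into_batches(entries, size=BATCH_SIZE):
    """Cut the full entry list into consecutive batches of at most `size`."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def build_layers(batch):
    """Build the three layers for one batch of entry dicts."""
    # Very light: first characters of all headwords (an assumed exclusion key).
    very_light = sorted({e["headword"][0] for e in batch})
    # Lighter: headword -> JMDict ID, enough to confirm inclusion.
    # (A repeated headword would overwrite the earlier one in this sketch.)
    lighter = {e["headword"]: e["id"] for e in batch}
    # Full: JMDict ID (as a string, since JSON keys are strings) -> record.
    full = {str(e["id"]): e for e in batch}
    return {"very_light": very_light, "lighter": lighter, "full": full}


def serialize_batches(entries, size=BATCH_SIZE):
    """Return one JSON string per layer per batch, so each loads on its own."""
    return [
        {name: json.dumps(layer, ensure_ascii=False)
         for name, layer in build_layers(batch).items()}
        for batch in split_into_batches(entries, size)
    ]
```

Keeping each layer as a separate serialized blob is what lets a client fetch the small exclusion data first and the full records only when needed.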

A few other minor optimizations were used to keep the Curl objects small and light. In addition, the Edict format permits multiple fields in a single entry, a challenge we have now addressed (just try out the page).
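For readers unfamiliar with the multiple-field problem, here is a small Python sketch of splitting one Edict2-style line. It assumes a line shaped like `headword;headword [reading;reading] /gloss/gloss/EntL…X/`, with the bracketed readings optional; the function name and the returned keys are made up for illustration, and this is not how the Curl page does it.

```python
import re

# Headwords, optional bracketed readings, then slash-delimited fields.
LINE_RE = re.compile(r"^(?P<heads>\S+)(?: \[(?P<reads>[^\]]+)\])? /(?P<body>.*)/$")
# The entry ID field is assumed to look like EntL<digits> with an optional X.
ENT_RE = re.compile(r"^EntL(\d+)X?$")


def parse_edict2_line(line):
    """Split one line into headwords, readings, glosses and entry ID."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized line: %r" % line)
    headwords = match["heads"].split(";")
    readings = match["reads"].split(";") if match["reads"] else []
    entry_id = None
    glosses = []
    for field in match["body"].split("/"):
        id_match = ENT_RE.match(field)
        if id_match:
            entry_id = int(id_match.group(1))  # keep only the numeric part
        elif field:
            glosses.append(field)
    return {"headwords": headwords, "readings": readings,
            "glosses": glosses, "id": entry_id}
```

Each headword can then become its own lookup key pointing at the same record, which is one way to keep the stored objects small.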

You can see single-kanji and compound-kanji versions in action at Edict2 Kanji in Curl.

If you select compound kanji from the HTML pages below, you can observe the selective file loading of the Curl application driving the finder at the top of the page. By the end of the day tomorrow, the applet will choose which of 15 files to search (occasionally exclusion is imperfect, and 2 or more "lighter" files must be loaded and searched before a perfect match is found).
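The selective loading can be pictured roughly as follows. This is a Python sketch, not the applet's code: `load_layer` is a hypothetical callback that fetches and deserializes one file, the layer names are invented, and the exclusion test on the first character is an assumption.

```python
def find_entry(word, batch_names, load_layer):
    """Look up `word`, loading as few files as possible.

    `load_layer(batch_name, layer_name)` is a hypothetical callback that
    returns one deserialized layer: "very_light" (a set of characters),
    "lighter" (headword -> ID) or "full" (ID -> record).
    """
    for name in batch_names:
        # Cheap exclusion test: skip the batch if it cannot contain the word.
        if word[0] not in load_layer(name, "very_light"):
            continue
        # Exclusion is imperfect: the test can pass for a batch that lacks
        # the word, so inclusion is confirmed against the lighter layer.
        entry_id = load_layer(name, "lighter").get(word)
        if entry_id is None:
            continue  # false positive, try the next candidate batch
        # Only now is the heavy file for this batch loaded.
        return load_layer(name, "full")[entry_id]
    return None
```

In the worst case several lighter files are loaded before a match is found, but the full data file is only ever loaded for the batch that actually holds the record.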

Because Edict2 is subject to updates with new words, expressions, acronyms and corrections, and as Unicode continues to evolve, the use of a no-index, in-memory DB is very attractive. There are no tables to maintain. XML is converted to Curl, serialized, and away we go again (after some tests pass, of course!)
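As an illustration of that rebuild cycle, here is a Python sketch (Python standing in for Curl, JSON standing in for Curl serialization). It assumes JMDict-style XML with `entry`, `ent_seq`, `keb`, `reb` and `gloss` elements; the function names and the sanity check are invented for the example.

```python
import json
import xml.etree.ElementTree as ET


def records_from_xml(xml_text):
    """Convert JMDict-style XML into a plain list of record dicts."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": int(entry.findtext("ent_seq")),
            "headwords": [k.text for k in entry.iter("keb")],
            "readings": [r.text for r in entry.iter("reb")],
            "glosses": [g.text for g in entry.iter("gloss")],
        }
        for entry in root.iter("entry")
    ]


def rebuild(xml_text):
    """Re-run the whole pipeline after a dictionary update."""
    records = records_from_xml(xml_text)
    # Stand-in for "some tests pass": every record needs an ID and a reading.
    if not all(r["id"] > 0 and r["readings"] for r in records):
        raise ValueError("sanity check failed, not publishing this build")
    return json.dumps(records, ensure_ascii=False)


def scan(records, word):
    """No index to maintain: a plain linear scan over in-memory records."""
    return [r for r in records
            if word in r["headwords"] or word in r["readings"]]
```

Because nothing but the serialized records is kept, an update means re-running `rebuild` on the new XML; there is no schema or index to migrate.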

The next project is to use JMDict and produce versions in French, German and other languages in which I am adequately conversant.

The next step in this project is to use Curl CSPD for mnemonics. [More to follow, soon]