Serialized Scurl data: an Eng-Nihongo dictionary
08-09-2012, 10:56 AM,
Over at my site, the Japanese-English dataset of Edict2 is nearing completion in Curl (it is already complete in plain old HTML). At this point only 6 files remain to process: one for hiragana dictionary entries, 3 for katakana, and 2 others.

The first issue in using Curl instead of JSON was the compiler stopping after about 15,000 lines of code.

This was addressed by serializing batches of some 14,000 records each as arrays of dictionary entry objects. There are actually three layers: very light (exclusion: try the next set), lighter (establish inclusion and get the JMDict ID), and full data (the record is accessed directly by ID, using the lighter record).
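To make the three layers concrete, here is a rough sketch in Python rather than Curl. It is not the code behind the applet: the `Batch` class, the field names (`headword`, `id`, `gloss`) and the first-character exclusion test are all assumptions made up for illustration. Only the batch size of roughly 14,000 and the layer roles come from the description above.

```python
from dataclasses import dataclass

BATCH_SIZE = 14_000  # approximate batch size mentioned in the post


@dataclass
class Batch:
    # "Very light" layer: a cheap test that can rule a batch out.
    # Using the set of first characters is an assumption for this sketch.
    first_chars: frozenset
    # "Lighter" layer: headword -> entry ID (establishes inclusion).
    index: dict
    # "Full data" layer: entry ID -> complete record.
    records: dict


def build_batches(entries, batch_size=BATCH_SIZE):
    """Split a list of entry dicts into batches with all three layers."""
    batches = []
    for start in range(0, len(entries), batch_size):
        chunk = entries[start:start + batch_size]
        batches.append(Batch(
            first_chars=frozenset(e["headword"][0] for e in chunk),
            index={e["headword"]: e["id"] for e in chunk},
            records={e["id"]: e for e in chunk},
        ))
    return batches


def lookup(batches, word):
    """Return the full record for `word`, or None if no batch holds it."""
    if not word:
        return None
    for batch in batches:
        if word[0] not in batch.first_chars:
            continue  # very light layer: excluded, try the next set
        entry_id = batch.index.get(word)
        if entry_id is None:
            continue  # exclusion was imperfect, keep looking
        return batch.records[entry_id]  # full data, accessed by ID
    return None
```

With a handful of hypothetical entries and a small `batch_size`, `lookup(batches, "猫")` returns the full record for that headword, while a word whose first character matches a batch but which is not in its index falls through to the next batch.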

A few other minor optimizations were used to keep the Curl objects small and light. In addition, the Edict format permits multiple fields in a single entry - a challenge we have now addressed (just try out the page.)
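For readers unfamiliar with the multi-field issue, the sketch below (Python, not the Curl code on the page) splits one line into its headwords, readings and slash-separated fields. It assumes the commonly described EDICT2 layout `KANJI-1;KANJI-2 [KANA-1;KANA-2] /gloss/gloss/.../EntL.../`; the function name, the returned keys and the sample line with its entry ID are invented for illustration.

```python
import re

# Assumed line shape: headwords, optional [readings], then /field/field/.../
LINE_RE = re.compile(r"^(?P<head>\S+)(?: \[(?P<kana>[^\]]+)\])? /(?P<body>.*)/$")


def parse_edict2_line(line):
    """Split one EDICT2-style line into headwords, readings and fields."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError("not in the assumed EDICT2 layout: %r" % line)
    headwords = m.group("head").split(";")
    readings = m.group("kana").split(";") if m.group("kana") else []
    fields = m.group("body").split("/")
    # The last field is assumed to be the entry ID when it starts with "EntL".
    entry_id = fields.pop() if fields and fields[-1].startswith("EntL") else None
    return {
        "headwords": headwords,
        "readings": readings,
        "glosses": fields,
        "entry_id": entry_id,
    }


# Made-up sample line; the ID is a placeholder, not a real JMDict ID.
sample = "日本語 [にほんご;にっぽんご] /(n) Japanese (language)/EntL0000001X/"
parsed = parse_edict2_line(sample)
```

This does not handle markers attached to individual headwords or readings, such as priority tags; those would need a further pass.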

You can see single-kanji and compound-kanji versions in action at Edict2 Kanji in Curl.

If you select compound kanji from the HTML pages below, you can observe the selective file loading of the Curl application driving the finder at the top of the page. By the end of day tomorrow, the applet will choose which of 15 files to search (occasionally exclusion is imperfect and 2 or more "lighter" files must be loaded and searched before a perfect match is found).
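The selective loading can be pictured roughly as follows. This is a Python sketch under stated assumptions, not the applet's Curl code: `find_entry`, the manifest of `(file_name, may_contain)` pairs and the `load_lighter` callback are hypothetical names. The point is only that a file is fetched when its exclusion test cannot rule it out, and that more than one "lighter" file may be loaded before a match turns up.

```python
def find_entry(word, manifest, load_lighter, cache=None):
    """Search only the files whose exclusion test does not rule `word` out.

    manifest     -- list of (file_name, may_contain) pairs, where may_contain
                    is a cheap predicate that can give false positives
    load_lighter -- callable returning a headword -> entry ID mapping for a
                    file name (stands in for loading a serialized file)
    Returns (entry_id or None, list of files loaded during this call).
    """
    cache = {} if cache is None else cache
    loaded = []
    for file_name, may_contain in manifest:
        if not may_contain(word):
            continue  # excluded without loading anything
        if file_name not in cache:
            cache[file_name] = load_lighter(file_name)
            loaded.append(file_name)
        entry_id = cache[file_name].get(word)
        if entry_id is not None:
            return entry_id, loaded
    return None, loaded
```

Passing the same `cache` dictionary across calls keeps already loaded files in memory, so a repeat search touches the network or disk only for files not seen before.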

Because Edict2 is subject to updates with new words, expressions, acronyms and corrections, and as Unicode continues to evolve, the use of a no-index in-memory DB is very attractive. There are no tables to maintain. XML is converted to Curl, serialized, and away we go again (after some tests pass, of course!).

The next project is to use JMDict and produce versions in French, German and other languages in which I am adequately conversant.

The next step in this project is to use Curl CSPD for mnemonics. [More to follow, soon]

Robert Shiplett, Curlr
Fredericton NB

