lazarus-ccr/components/csvdocument/doc/todo.txt
vvzh 5bdaf55641 csvdocument: added todo
git-svn-id: https://svn.code.sf.net/p/lazarus-ccr/svn@2523 8e941d3f-bd1b-0410-a28a-d453659cc2b4
2012-09-19 17:13:23 +00:00


=== TODO ===
* Write more tests for different CSV format variations, especially those used by Excel and Calc.
* Optimize TCSVDocument.LoadFromStream / SaveToStream by replacing "Cells[CurrentCol, CurrentRow]" usage
with direct manipulation of the underlying data structures. This would eliminate redundant sanity checks.
=== Warning about speed optimizations ===
An attempt to speed up buffer operations (FCellBuffer, FWhitespaceBuffer)
by preallocating memory with a straightforward string builder implementation
resulted in about a 25% slowdown compared with the current implementation based
on string concatenation. This was measured on Linux and was not tested on other
platforms. These changes were not committed.
Using the TStrBuf object (http://freepascal-bits.blogspot.com/2010/02/simple-string-buffer.html)
for the same purpose showed neither a noticeable performance improvement nor a slowdown,
with the following results on a 5.4 MB CSV file:
Without StrBuf: 2392, 2363, 2544, 2441, 2422, 2407, 2467 ms
With StrBuf: 2423, 2437, 2404, 2471, 2405 ms
This happened on Linux too and was not tested on other platforms.
These changes were not committed either.
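As a quick check of the "neither improvement nor slowdown" reading, the timings above can be summarized with a few lines of Python. This is only arithmetic on the numbers already listed; the variable names are made up for illustration.

```python
from statistics import mean, stdev

# Timings in milliseconds, copied from the measurements above.
without_strbuf = [2392, 2363, 2544, 2441, 2422, 2407, 2467]
with_strbuf = [2423, 2437, 2404, 2471, 2405]

mean_without = mean(without_strbuf)
mean_with = mean(with_strbuf)

# Difference between the two averages, absolute and relative.
diff_ms = mean_without - mean_with
relative_diff = diff_ms / mean_without

# Run-to-run spread of the baseline, for comparison with the difference.
spread_ms = stdev(without_strbuf)

print(f"without StrBuf: mean {mean_without:.1f} ms, stdev {spread_ms:.1f} ms")
print(f"with StrBuf:    mean {mean_with:.1f} ms")
print(f"difference:     {diff_ms:.1f} ms ({relative_diff:.2%})")
```

The difference between the averages is well under 1% and much smaller than the run-to-run spread, which is consistent with the conclusion above.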
=== Some thoughts about CSV variations ===
There are two CSV specifications:
* RFC 4180 Common Format and MIME Type for Comma-Separated Values (CSV) Files
http://tools.ietf.org/html/rfc4180
* An unofficial CSV specification
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#FileFormat
The latter (unofficial) specification mentions two CSV format features
that are not part of RFC 4180. The first of them is described as mandatory:
1) Leading and trailing space-characters adjacent to comma field separators are ignored.
Fields with leading or trailing spaces must be delimited with double-quote characters.
The second feature is optional and comprises several variations
(http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#CSVariations):
2) Embedded line-feeds in fields. This one is also escaped sometimes. Often like in C ("\n").
Embedded commas in fields. Again, an escape character is sometimes used in place of the comma.
Check if line feeds are \n
Check if embedded double quotes are \"
Check if ???
Here are some criticisms of both suggested features.
Behavior (1) is explicitly forbidden by RFC 4180: "Spaces are considered part
of a field and should not be ignored". There is a reason for this: when (1) is obeyed,
simply loading and saving a CSV document (without any modifications) results in data loss.
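The data loss can be illustrated with a minimal sketch in Python. It uses the standard csv module as a stand-in parser and simulates behavior (1) by stripping spaces from every field after parsing; the helper name round_trip is hypothetical and has nothing to do with TCSVDocument.

```python
import csv
import io

def round_trip(text, trim):
    """Parse CSV text and write it back, optionally trimming every field."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if trim:
        # Simulates behavior (1): spaces next to separators are ignored.
        rows = [[field.strip(" ") for field in row] for row in rows]
    out = io.StringIO(newline="")
    csv.writer(out).writerows(rows)
    return out.getvalue()

original = "a, b ,c\r\n"
kept = round_trip(original, trim=False)     # spaces are part of the field
trimmed = round_trip(original, trim=True)   # spaces are silently dropped

print(repr(kept))
print(repr(trimmed))
```

With trimming disabled the file survives a load/save cycle unchanged; with trimming enabled the spaces around "b" are gone after a single cycle.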
As for variations (2), there are more problems in implementing them than it seems at first glance:
* It should be clearly defined what escaping scheme should be used:
- what characters must be escaped,
- what escaped characters have special meaning (like \r and \n),
- how to include these special characters into text
i.e. how to escape escaping (like \\).
* It should be clearly defined whether/how escaping can be mixed with
traditional quotation scheme and what should take precedence.
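To make the precedence question concrete, here is a small sketch that feeds the same text to Python's csv reader twice: once with the backslash treated as an ordinary character (the RFC 4180 reading) and once with escapechar set to a backslash. The helper name parse is made up for this illustration.

```python
import csv

def parse(line, **dialect_options):
    """Parse one CSV line with Python's csv reader and return its fields."""
    return next(csv.reader([line], **dialect_options))

quoted = '"embedded \\, delimiter"'
unquoted = 'embedded \\, delimiter'

# RFC 4180 reading: the backslash is an ordinary character.
rfc_quoted = parse(quoted)
rfc_unquoted = parse(unquoted)

# Escaping reading: a backslash makes the next character literal.
esc_quoted = parse(quoted, escapechar="\\")
esc_unquoted = parse(unquoted, escapechar="\\")

for name, fields in [("rfc quoted", rfc_quoted), ("rfc unquoted", rfc_unquoted),
                     ("esc quoted", esc_quoted), ("esc unquoted", esc_unquoted)]:
    print(f"{name:13} {fields}")
```

Under the first reading the unquoted text splits into two fields and the backslash is kept as data; under the second it stays a single field and the backslash disappears. Neither result is wrong by itself, which is exactly why the scheme has to be pinned down first.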
Consider the following examples:
"quoted \"" field"
"embedded \, delimiter"
embedded \, delimiter
"embedded \\, delimiter"
\w\w\wescaped non-trimmable whitespace\w\w\w
" quoted non-trimmable whitespace "
Implementing feature (1) on the CSV parser level still has a point.
This feature requires removing outer whitespace only (whitespace outside quotes)
and keeping inner whitespace (whitespace inside quotes) intact. However, an application
that uses a CSV parser does not have access to quotes and cannot distinguish between
inner and outer whitespace. That is why this feature cannot be implemented by a client
application on top of the parser, and should therefore be implemented by the parser itself.
However it should be optional and disabled by default to prevent data loss.
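The difference between parser-level and client-level trimming can be sketched as follows. This is a deliberately simplified single-line parser written in Python for illustration only; parse_line and its trim_outer option are hypothetical names and do not reflect how TCSVDocument is implemented.

```python
def parse_line(line, trim_outer=False, delimiter=",", quote='"'):
    """Split one CSV line into fields.

    With trim_outer=True, spaces outside quotes are dropped while spaces
    inside quotes are kept. The default keeps every space (RFC 4180).
    """
    fields = []
    i, n = 0, len(line)
    while True:
        start = i
        if trim_outer:
            # Skip outer spaces so that a following quote still opens a field.
            while i < n and line[i] == " ":
                i += 1
        if i < n and line[i] == quote:
            i += 1
            buf = []
            while i < n:
                if line[i] == quote:
                    if i + 1 < n and line[i + 1] == quote:
                        buf.append(quote)   # doubled quote -> literal quote
                        i += 2
                        continue
                    i += 1                  # closing quote
                    break
                buf.append(line[i])
                i += 1
            end = line.find(delimiter, i)
            end = n if end == -1 else end
            tail = line[i:end]              # text between closing quote and separator
            if trim_outer:
                tail = tail.rstrip(" ")
            fields.append("".join(buf) + tail)
        else:
            end = line.find(delimiter, i)
            end = n if end == -1 else end
            text = line[start:end]
            fields.append(text.strip(" ") if trim_outer else text)
        i = end
        if i >= n:
            return fields
        i += 1                              # skip the separator

line = ' a ," b ",c'
strict = parse_line(line)                        # default: nothing is trimmed
client_side = [f.strip(" ") for f in strict]     # client trims after parsing
parser_side = parse_line(line, trim_outer=True)  # parser trims outer spaces only
print(strict, client_side, parser_side)
```

Once the quotes have been consumed, the client sees " a " and " b " as the same kind of value and can only trim both or neither. Only the parser knows that the spaces around "b" were quoted.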
As for variations (2), they are too ambiguous to be implemented as is. The ambiguity
can be removed to some degree by the following limitations:
- traditional quoting takes precedence over backslash-escaping;
- backslash-escaping of separators and quotation marks is forbidden to obey RFC 4180.
These limitations allow client applications to implement backslash-escaping themselves
on top of the CSV parser, effectively turning backslash-escaping into a special field syntax.
Since CSV fields as defined by RFC 4180 can transparently store any sequence of characters,
applications are not limited in defining their own subformats (such as backslash-escaping)
and store them in CSV fields. That is why there is no point in implementing variations (2)
on the parser level, unless they are made more specific and require access to CSV internals
like feature (1) does.
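A minimal sketch of such a client-side subformat, assuming Python's csv module as the RFC 4180 layer: the application escapes line breaks and backslashes before handing fields to the writer and reverses the mapping after reading. The functions escape and unescape are hypothetical application code, not part of any CSV library.

```python
import csv
import io

def escape(text):
    """Application-level encoding: backslash first, then line breaks."""
    return text.replace("\\", "\\\\").replace("\n", "\\n").replace("\r", "\\r")

def unescape(text):
    """Reverse of escape(); unknown escape sequences are kept as written."""
    mapping = {"n": "\n", "r": "\r", "\\": "\\"}
    out = []
    chars = iter(text)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        following = next(chars, "")
        out.append(mapping.get(following, "\\" + following))
    return "".join(out)

row = ["line one\nline two", "C:\\new", 'a, "b"']

# Writing: the CSV layer only ever sees ordinary characters.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow([escape(field) for field in row])
csv_text = buffer.getvalue()

# Reading: standard quoting is undone by the parser, escaping by the client.
parsed = next(csv.reader(io.StringIO(csv_text, newline="")))
restored = [unescape(field) for field in parsed]

print(csv_text, end="")
print(restored == row)
```

Separators and quotation marks are left to the normal quoting rules, in line with the limitations listed above, so the parser needs no knowledge of the backslash convention.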
Alternatively, feature (2) can be implemented as in the Python csv module, using 3 quotation
modes: "full", "minimal" and "no quotation". The "full" and "minimal" modes would instruct
the CSV writer to quote all fields or just fields containing special characters, while
the "no quotation" mode would turn quotation off completely, instructing both reader and
writer to use backslash-escaping instead.
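For reference, the three modes correspond to csv.QUOTE_ALL, csv.QUOTE_MINIMAL and csv.QUOTE_NONE in Python. The sketch below writes one row in each mode; the helper write_row is made up for the example, and the backslash escapechar is a choice made here, not a default of the module.

```python
import csv
import io

def write_row(row, **options):
    """Write a single row with the given csv options and return the text."""
    out = io.StringIO(newline="")
    csv.writer(out, **options).writerow(row)
    return out.getvalue()

row = ["plain", "a,b", 'say "hi"']

full = write_row(row, quoting=csv.QUOTE_ALL)          # every field quoted
minimal = write_row(row, quoting=csv.QUOTE_MINIMAL)   # only where needed
none = write_row(row, quoting=csv.QUOTE_NONE, escapechar="\\")  # escaping only

# In "no quotation" mode the reader must be configured the same way.
read_back = next(csv.reader(io.StringIO(none, newline=""),
                            quoting=csv.QUOTE_NONE, escapechar="\\"))

for text in (full, minimal, none):
    print(text, end="")
print(read_back == row)
```

Note that the "no quotation" mode only round-trips when reader and writer agree on the escape character, which is the same coordination problem described for variations (2) above.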
=== Links ===
http://tools.ietf.org/html/rfc4180#section-2
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#FileFormat
http://docs.python.org/library/csv.html