Auto-tools/Projects/PublicES: Difference between revisions

Line 60: Line 60:
* JSON generation is slow: The built-in JSON emitter used generators to convert data structures to a JSON string, but the PyPy optimizer is terrible at analyzing generator code.  Furthermore, the JSON libraries available to CPython are incredibly fast (Ujson is almost 2 orders of magnitude faster!).  This made the PyPy version appear inferior despite the speedup in the ETL portion of the code.  Part of the solution was to use PyPy's own JSON emitter, and to realize that PyPy's default JSON emitter (no pretty printing, no subclassing, etc.) has Ujson speeds.  The fastest solution I have found so far is to copy the data structure (with sets, Decimal, and other special types) to one with simple dicts, lists and floats, and pass that copy to the default PyPy JSON emitter[https://github.com/klahnakoski/pyLibrary/blob/61928e3c9b01b823d666bafcc68b90ab2e4199e3/tests/util/test_json_speed.py].
* Python has old-school, unintuitive routine names (strftime, mktime, randrange, etc.).  These take time to find, and more time to confirm there isn't a better library that should be used instead.  I opted to add a facade over most of them to re-envowel their names, and to isolate myself from the risk of using the wrong lib (or having it behave in unexpected ways).
* Python2.7 strings are confusing: str() can be either ASCII or UTF8 encoded, but without any typing to indicate which encoding is used.  There are also unicode() strings, which look like strings until you try to compare them: <tt>"é" != u"é"</tt>
* Python2.7 strings are confusing: str() can be either Latin1 or UTF8 encoded, but without any typing to indicate which encoding is used.  There are also unicode() strings, which look like strings until you try to compare them: <tt>"é" != u"é"</tt>
* Multithreading was necessary so we can handle multiple network requests at the same time, while keeping the code easy to read.  Python's threading library is still immature: it has no high-level threading constructs to deal with common use cases in an environment that raises exceptions.
* Python2.7 has no exception chaining - added it
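
The copy-then-emit approach from the JSON item above can be sketched roughly as follows. This is a minimal illustration, not the code in the linked test: the names <tt>scrub</tt> and <tt>to_json</tt> are hypothetical, and the standard <tt>json</tt> module stands in for whichever default emitter the interpreter provides.

```python
import json
from decimal import Decimal


def scrub(value):
    """Return a copy of value made only of dicts, lists and JSON-native scalars.

    Hypothetical helper: it handles just the special types named in the text
    (sets, Decimal) plus tuples; anything else is passed through untouched.
    """
    if isinstance(value, dict):
        return {str(k): scrub(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        return [scrub(v) for v in value]
    if isinstance(value, Decimal):
        return float(value)
    return value  # str, int, float, bool and None need no conversion


def to_json(value):
    # Default emitter settings only: no indent, no encoder subclass,
    # no `default` hook, so the fast path is not disabled.
    return json.dumps(scrub(value))
```

For example, <tt>to_json({"price": Decimal("1.5"), "tags": {"etl"}})</tt> yields a JSON object whose <tt>price</tt> is a plain number and whose <tt>tags</tt> is a list. The trade-off is an extra pass over the data in exchange for leaving the emitter in its default configuration.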

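As an illustration of the kind of construct the threading and exception-chaining items refer to, here is one possible sketch of a thread that remembers an exception raised by its worker and re-raises it, wrapped with its cause, in the thread that joins it. <tt>CheckedThread</tt>, <tt>ThreadError</tt> and <tt>join_and_check</tt> are hypothetical names chosen for this example; they are not taken from the project's code.

```python
import threading


class ThreadError(Exception):
    """Hypothetical wrapper that keeps the original exception as `cause`."""

    def __init__(self, message, cause=None):
        Exception.__init__(self, message)
        self.cause = cause  # manual chaining; works without `raise ... from`


class CheckedThread(threading.Thread):
    """Hypothetical Thread that records the worker's result or exception."""

    def __init__(self, target, *args):
        threading.Thread.__init__(self)
        self._target_fn = target
        self._target_args = args
        self.result = None
        self.error = None

    def run(self):
        try:
            self.result = self._target_fn(*self._target_args)
        except Exception as e:
            # Keep the failure instead of letting it die with the thread
            self.error = e

    def join_and_check(self):
        """Wait for the worker, then return its result or raise ThreadError."""
        self.join()
        if self.error is not None:
            raise ThreadError("worker thread failed", cause=self.error)
        return self.result
```

A caller starts the thread as usual and calls <tt>join_and_check()</tt> instead of <tt>join()</tt>; a failure in the worker then surfaces where the result is consumed, with the original exception available as <tt>cause</tt>.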

In the end we have a high-speed ETL solution that is easy to install and execute.  There are plenty of improvements that can be made, particularly by using more threads and multiple processes.  But those can wait while we deploy.


= SecReview (20 November 2013) =