Tuesday, August 12, 2014

Open issues

This is the last article for the moment. It is about things I don't know how to solve or have not yet had time to solve.

Double transactions
I've already mentioned this issue but I still don't know what is the root cause and how to mitigate it. The check to not insert already inserted transaction prevents creation of ~25 double transactions per day. Still, I see there is usually one double transaction per day (that is per ~2500 transactions). I am quite confident this is a sever side issue as __createdAt times are very close to each other.

POS does not work on iPhone
For some reason, the application does not work on iPhone. The main page is displayed, user is correctly redirected to the Google's sign in page but after successful sing in, she is redirected back to the main page which does not proceed to the next page. I've briefly tested the issue on borrowed iPhone and found out it is caused by combination of AWS and Durandal. Pure AWS page works just fine and I have not found any indicia that Durandal does not work on iPhone. I believe I just need to borrow the iPhone for longer time to fix this issue.

Built-in *.azurewebsites.net SSL certificate
The server is hosted on azurewebsites.net domain and uses built-in wildcard certificate for secure access. Although I have read several articles that using wildcard certificates is bad, I don't think it is bad in our case. It is a REST service where both endpoints are fixed. If you disagree, please let me know an example of real threat.

For reference:

Authentication token expiration
The authentication token expires approximately every month. Which is the reason why I still have not developed automatic re-authentication when the token expires. It hasn't had enough priority.

There is a way how to do it generally for all requests by using withFilter (usage example). Or I will just sign out and in every couple of days. Whatever will be easier to implement.

No tests for server side JavaScript
I have tests for client side JavaScript and for SQL queries but I do not have any tests for sever side JavaScript. The main reason is there was usually just a few lines of simple code so I did not bother to create a testing set up. But this is slowly changing so creating server side JavaScript tests climbs up to the top of my to-do list.

Saturday, August 9, 2014

Lessons, Surprises, Mistakes III. - AMS

Next set of lessons, surprises and mistakes; this time about Azure Mobile Services (AMS).

Download location
Maybe I am blind but I cannot find a download link for the actual version of MobileServices.Web.js script. WinJs versions are easily available via nuget. Digging into tutorials I have found a location with all Web versions. Comparing it with nuget page, the latest available Web version is 1.2.2 (although WinJs is already at 1.2.3):

Update 24.1.2015: There is a changelog from which you can see that at this moment, the latest Javascript SDK is 1.2.5.

Shared Scripts
There is a possibility to manually upload a script to shared folder on server:
azure mobile script upload shared/helper.js
You can then use it via require in other scripts:
var helper = require('../shared/helper');

Unfortunately, the scripts there are periodically deleted. It turned out that manual upload is not supported and you can only use this feature if you enable source control. See more info on forum.

Stuck scheduled tasks
I've managed to get the scheduled AMS task into state where it ignored all my updates. It ran the old script version no matter what I did. Even deleting the task and creating a new one with different name did not help. I asked on forum and got quickly an advice how to work around the issue by restarting the whole mobile services.

Wrong authorization
Although I require an authorized user when accessing all AMS endpoints, it took me a while to spot an error in my implementation. I did not refuse cashier operations from unknown users. You couldn't do it from UI, of course. But any script kiddie could download the POS, get the application id, authorize with any google account, and send as many transactions/shifts as wanted. The fix was simple - only allow server operations to known users.

SQL Tables Initialization
I let AMS to create my tables. It was a mistake. Only when I started to have performance issues with getting report data, I realized what types are used for the data. For example, as JavaScript does not have integers but only floats, all integer values were stored in float columns. Not very fast and index-able… I have changed type of some columns but next time I will create proper SQL init script.

SQL query optimizations
I had not written complex SQL query for a long time so the first version of the report query was quite embarrassing. There were too many inner queries and other issues. After a consultation with my friend, I speeded up the query by
  • Use null id trick to select the whole row with max value
  • Use union all instead of union because all sub-queries are distinct
  • Converting float columns to integer columns
  • Index on date column did not help because server still clustered index on __createdAt column which has very similar content to the date columns

Monday, August 4, 2014

Lessons, Surprises, Mistakes II.

Next set of lessons, surprises and mistakes; this time various issues in no particular order.

Hardware issues
We use some cheap Android Sencor Element tablets. From 10 tablets, 2 were faulty and we had to replace them. You can recognize faulty tablet by its freezing. Yet there were also more interesting manifestations (appeared only on one tablet):
  • Time speeding up by minutes per hour. That produced transactions and shifts with wrong time. Automatic time synchronization does not help as I assume it is synchronized only once per day.
  • Touch events were not registered in some areas of the screen or were shifted (touching on one place triggers action from different place).

User feedback
It's always good to observe and talk to your very end users. Based on seeing the POS in action and talking to one cashier, I realized the buttons were too small. The cashier had hard time to aim at the right button. So I've made all buttons and texts as big as possible:

Wrong development browser size
I use Chrome Resolution Test plugin to easily resize browser window to tablet's resolution (1024 x 786). But I made a stupid mistake. The POS web page on tablet spreads across the full tablet screen (minus Android menu bar) while the plugin resizes the whole browser window to the specified size. So page areas were different. To have the same area in browser and on tablet (1024 x 720), the correct resolution for the plugin is 1034 x 826.

Browser zooming
Cashiers had also problem with browser zooming triggered by double tap. For a long time, I did not know how to disable it because I thought it is as a browser functionality and looked for an option in browser. Eventually, I've found out it is a HTML5 meta viewport property and you can turn it off by:
<meta name="viewport" content="user-scalable=0" />

JavaScript caching
I was surprised that browser loads JavaScript files from cache even when you ask to bypass it (e.g. by Ctrl+F5). I have not found a way how to force reloading JavaScript files. As a fix, I append a timestamp to main.js in the release package (and index.css because it can be cached too).

Automatic updates
Uptake of a new version was slow. It was normal that an old version was used several days after the new version was released because page refresh was needed on POS and only company employees could do it. Luckily, there was no critical update needed.

I've implemented automatic updates like this: successful ping request returns the actual server version. When it comes back to a tablet, it means there is internet connection (necessary because of mobile internet - see the previous post). When the server version does not match with the client version and there is no cashier signed in, the web page is refreshed which performs the update. The server version is set manually as part of release deployment.

Friday, August 1, 2014

Lessons, Surprises, Mistakes I. - Internet Connection

This is the first of several posts about my mistakes, surprises and lessons learned. I'm always eager to see other's experiences so I can learn from them. Here are mines so you can learn from me.

Going Live
1st April morning. It kind of works. Most kiosks are online and I see signed in cashiers and some transactions. Some kiosks are offline. I guessed internet connection problems.

It was very similar during the next couple of days. There were rare moments when all kiosks were online. Most of the time at least one kiosk was offline. We discussed the cause of the issues and it was clear that the culprit was the internet connection. The mobile internet is ***. It disconnects every now and then, it drops packets, it is sloooooow. Especially in the kiosk which is a big metal box. I had to mitigate the problems.

Status emails
First step was to have better visibility of the problems. Having status page with kiosks overview was nice but it was not enough - nobody periodically checked it. That's why I've implemented status emails. AMS offers scheduled jobs and I use them to periodically check kiosk statuses and send warning email with problematic kiosks. Initially, the check was every 30 minutes but there were too many emails and nobody cared about them. Currently, the status is checked once per hour during the opening hours and a warning is sent only when kiosk does not ping for more than one hour.

Automatic router restarts
We've discovered the biggest issue in routers. They disconnect from the internet quite often. Luckily for us, routers have monitoring functionality. They can ping specified servers and if no reply comes back, they restart themselves. Once turned on (with 30 minutes threshold), the internet reliability increased dramatically.

Inability to sign in/out
Pages in Durandal are stored in separate files. That means there has to be internet connection to successfully navigate to another page (e.g. starting/ending shift). I thought (but not tested) that browser would cache all pages. But I was wrong and with such bad internet connection, some cashiers could not sign in or out - the screen just turned white because the target page could not be loaded.

The solution is to build the whole app into three files (index.html, index.css, and main.js) which are loaded at once during POS initialization. To do that, I use grunt-durandal and grunt-uncss tasks when creating release package. The only communication then is sending data to server which is backed up in local storage.

Double transactions
Another sign of bad internet connection is double transactions. POS sends data, data is written into database, but the confirmation is lost on a way back and POS sends data again. I have actually seen several triple transactions in the database. To solve this issue, I have added a check before inserting a new data into database.

It did not fully mitigated the issue though. I'm still seeing some duplicate transactions. AMS provides createdAt column and according to this column these duplicate transactions are usually milliseconds apart or even at the same time. I have no idea how these duplicate transactions are created. It can be strange network behavior, bug in AMS…

Success transaction visibility
This is a small tweak. POS displays whether the last server operation was successful or not. It's not for cashiers because they don't care. It's mainly for company employees when they check the kiosk on site to know that everything is OK.

After implementing above fixes, the reliability is quite good. It is rare to see offline kiosk. There is still some work I'd like to do though. For example, I do not properly handle authentication token expiration so somebody has to manually re-login kiosks every ~30 days otherwise they would be offline because of missing authentication.