- memory barriers are only needed on armv6 and up
- DMB on ARMv6 is "mcr 15, 0, r0, cr7, cr10, {5}", not "mcr 15, 0, r0, cr7, cr10, {4}"
- improve write barrier on armv7 by using "dmb st" instead of "dmb sy"
todo: The use of the correct barrier code should be determined during runtime.
git-svn-id: trunk@22867 -
Add simple Mul+Sub/Mul+Add into MLS/MLA optimizations
Fix some other small issues in the optimizer
Implement Interlocked* functions with proper use of LDREX/STREX
git-svn-id: branches/laksen/arm-embedded@22801 -
fpc_ansistr_incr_ref for Darwin/ARM: they don't follow the Darwin/ARM
ABI for function calls, the code already contains enough ifdefs and
I don't want to spend time on maintaining OS-specific assembler
implementations
git-svn-id: trunk@22121 -
An LDR will have two load latency cycles on most ARM implementations,
moving the
mov r4, r0
two instructions away from the corresponding ldr will avoid the stalls.
git-svn-id: trunk@22107 -
+ CPUARM_HAS_BX is defined if the CPU supports the BX* instruction
+ CPUARM_HAS_REV is defined if the CPU supports the REV instruction. Note that you still have to check for compiler versions > 2.6.0 since the assembler reader of 2.6.0 does not understand that instruction.
+ CPUARM_HAS_IDIV is defined if the CPU supports the sdiv, udiv instructions. Use of this fixes a bug where previously these instructions were only used for armv7-m, while cortex3m cpus also support them.
+ CPUARM_HAS_LDREX is defined if the CPU supports the ldrex/strex instructions. Use of this fixes a bug with armv7(-a) cpus where this path has not been used.
+ SYSTEM_HAS_KUSER_CMPXCHG is defined if the system (mainly OS) supports the kuser_cmpxchg functions. Use of this fixes a bug where ARMHF systems did not use it for synchronization (although ARMHF is armv7+ only, i.e. the LDREX path is used anyway)
git-svn-id: trunk@22081 -
Instead of directly using "swp" in InterlockedExchange, use
- kuser_cmpxchg if available (on Linux/armel)
- the fpc global mutex (fpc_system_lock) otherwise
to implement it.
git-svn-id: trunk@22062 -
Optimized to minimize load latency and icache usage. Together with the
previous fpc_ansistr_decr_ref optimization this little test program
runs about 40% faster.
program stringspeed;
procedure test(s:string);
begin
end;
var
  s:string;
  i: cardinal;
begin
  s:='abcd';
  for i:=0 to $FFFFFF do
    test(s);
end.
Even with s:='' it's about 30% faster.
git-svn-id: trunk@22035 -
As fpc_ansistr_decr_ref is a very often called procedure in typical
pascal programs this optimized version will shave off some cycles
compared to the generic one.
It tries to avoid load latencies as much as possible and also uses the
new Z-flag functionality of the InterlockedDecrement from the previous
patch. FreeMem is also invoked as a tail call.
git-svn-id: trunk@22034 -
Use movs instead of mov when setting the result in r0. This way the Z
flag will be set for the calling function which might allow some smaller
optimizations later on. It does not affect current code in any way,
because flags are not expected to be used across function calls.
git-svn-id: trunk@22033 -
We're currently using rev for armv6+, but FPC 2.6 could not handle the
instruction. So if somebody wants to build trunk it can't be for armv6+.
We'll circumvent the problem by always using the generic code when
building with FPC 2.6.
git-svn-id: trunk@22003 -
This adds some small improvements to Move_pld and Move_blended.
1.) Overlapping memory is handled as "unusual" and the code is placed at
the end of the function for better icache/bpu performance
2.) Fused the overlap check into 3 instructions with a single jump
instead of 5 instructions with 2 jumps.
3.) Use ldmia/stmia with 2 registers instead of ldr/str for faster
copying.
4.) Some code cleanup
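A fused overlap check of this kind can be illustrated in C. This is only a
sketch of one common formulation (a single unsigned subtract and compare),
not the actual assembler in Move_pld/Move_blended; the names
needs_backward_copy and sketch_move are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero when a forward (ascending) copy would overwrite source bytes
   before they are read, i.e. when src <= dest < src + count.
   If dest < src the unsigned subtraction wraps to a huge value that
   fails the "< count" test, so one compare covers both bounds. */
static int needs_backward_copy(const void *dest, const void *src, size_t count)
{
    return (uintptr_t)dest - (uintptr_t)src < (uintptr_t)count;
}

/* Byte-wise move: the common non-overlapping case comes first, the
   "unusual" overlapping case is handled last. */
static void sketch_move(void *dest, const void *src, size_t count)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (!needs_backward_copy(dest, src, count)) {
        for (size_t i = 0; i < count; i++)
            d[i] = s[i];
    } else {
        while (count--)          /* copy from the end downwards */
            d[count] = s[count];
    }
}
```

With dest = src + 2 and count = 4 the difference is 2, so the backward path
is taken; with dest below src the difference wraps and the forward path is
used.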
git-svn-id: trunk@21992 -
This corrects the handling of exception masks and ARM VFP
implementations. The old code enabled the exception when it was present
in the mask. So in fact it did the contrary of what it was supposed to
do.
VFP-Support is currently broken, this patch at least allows to build a
working VFP-native compiler. But the full build still breaks because of
some compiler options not properly being passed down to packages/ which
results in:
"Trying to use a unit which was compiled with a different FPU mode"
because somehow OPT="-Cfvfpv2" did not get passed down.
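The intended mask semantics (a masked exception must have its trap
*disabled*) can be sketched in C as follows. The FPSCR trap-enable bit
positions are those of the ARM VFP architecture; the EX_* mask bits and the
function name fpscr_with_mask are hypothetical and do not reflect the RTL's
actual exception mask type.

```c
#include <stdint.h>

/* Trap-enable bits in the VFP FPSCR register. */
#define FPSCR_IOE (1u << 8)   /* invalid operation */
#define FPSCR_DZE (1u << 9)   /* division by zero  */
#define FPSCR_OFE (1u << 10)  /* overflow          */
#define FPSCR_UFE (1u << 11)  /* underflow         */
#define FPSCR_IXE (1u << 12)  /* inexact           */
#define FPSCR_IDE (1u << 15)  /* input denormal    */
#define FPSCR_ENABLE_ALL \
    (FPSCR_IOE | FPSCR_DZE | FPSCR_OFE | FPSCR_UFE | FPSCR_IXE | FPSCR_IDE)

/* Hypothetical mask bits, for illustration only. */
#define EX_INVALID    (1u << 0)
#define EX_DENORMAL   (1u << 1)
#define EX_ZERODIVIDE (1u << 2)
#define EX_OVERFLOW   (1u << 3)
#define EX_UNDERFLOW  (1u << 4)
#define EX_PRECISION  (1u << 5)

/* Returns the FPSCR value for a given mask: every exception that is
   present in the mask gets its trap-enable bit CLEARED; all other
   FPSCR bits (rounding mode etc.) are preserved. */
static uint32_t fpscr_with_mask(uint32_t fpscr, uint32_t mask)
{
    uint32_t enable = FPSCR_ENABLE_ALL;

    if (mask & EX_INVALID)    enable &= ~FPSCR_IOE;
    if (mask & EX_DENORMAL)   enable &= ~FPSCR_IDE;
    if (mask & EX_ZERODIVIDE) enable &= ~FPSCR_DZE;
    if (mask & EX_OVERFLOW)   enable &= ~FPSCR_OFE;
    if (mask & EX_UNDERFLOW)  enable &= ~FPSCR_UFE;
    if (mask & EX_PRECISION)  enable &= ~FPSCR_IXE;

    return (fpscr & ~FPSCR_ENABLE_ALL) | enable;
}
```

The bug described above corresponds to setting the enable bit where this
sketch clears it.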
git-svn-id: trunk@21952 -
The new version is more optimized to the "common case"
We assume most of the data will be aligned; that's why the unaligned
case has been moved to the end of the function so the aligned case is
more cache- and pipeline friendly.
I've also reduced the loop unrolling for the block transfer loop,
because for large blocks we'll most likely hit the write buffer limit
anyway.
I did some measurements. The new routine is a bit slower for fewer
than 8 bytes, but beats the old one by 10-15% for 8 bytes or more.
git-svn-id: trunk@21760 -
get_caller_frame, get_caller_addr and dump_stack
with default NIL value to systemh.inc.
+ Added new get_addr function.
system.inc: Use get_addr and get_frame to call
HandleErrorAddrFrame instead of HandleErrorFrame
in several error functions.
Modify dump_stack to use frame and addr parameters.
Provide a dummy get_addr function returning nil.
i386/i386.inc, x86_64/x86_64.inc: Provide real
implementation of get_addr function.
git-svn-id: trunk@21697 -
The new version uses a pure pascal version for the 32bit case.
With the latest compiler optimizations this generates optimal
4-instruction code which can be inlined. The rev-versions for
armv6+ are gone now, the inlineable pascal-code is faster than
the call-overhead for the rev-implementation.
The 64-bit versions received an updated assembly version which saves 4
cycles total on <armv6.
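For illustration, a 32-bit byte swap that maps onto four ARM data-processing
instructions can be written with the well-known rotate/XOR formulation.
This is a C sketch, not the RTL's Pascal source, and whether the compiler
emits exactly this sequence is an assumption; the names ror32,
swap_endian32 and swap_endian64 are made up for the example.

```c
#include <stdint.h>

/* Rotate right by n bits, 1 <= n <= 31. */
static inline uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Byte swap in four operations (eor, bic, ror, eor with shifted operand). */
static inline uint32_t swap_endian32(uint32_t v)
{
    uint32_t t = v ^ ror32(v, 16);  /* eor t, v, v, ror #16   */
    t &= ~0x00FF0000u;              /* bic t, t, #0x00FF0000  */
    v = ror32(v, 8);                /* mov v, v, ror #8       */
    return v ^ (t >> 8);            /* eor v, v, t, lsr #8    */
}

/* 64-bit variant built from two 32-bit swaps with the halves exchanged. */
static inline uint64_t swap_endian64(uint64_t v)
{
    return ((uint64_t)swap_endian32((uint32_t)v) << 32)
         | swap_endian32((uint32_t)(v >> 32));
}
```

Because the function is small and has no assembler body, a compiler is free
to inline it at each call site, which is the effect the message relies on.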
git-svn-id: trunk@21511 -
Currently the ARM port uses generic functions for SwapEndian, which are
relatively slow.
This patch adds optimized functions for the 32- and 64-bit cases. The
16-bit case is still handled by a normal function: although the generated
code is far from optimal, inlining (which is not possible with
asm functions) makes it faster than the optimized function.
Some Numbers from my 1.2GHz Kirkwood (ARMv5):
                      Old       New      Result (New/Old)
SwapEndian(Integer)   12.168s   5.411s   44.47%
SwapEndian(Int64)     168.28s   9.015s    5.36%
Testcode was
begin
  I := $FFFFFFF;
  while I > 0 do
  begin
    Val2 := MySwapEndian(Val);
    Dec(I);
  end;
end.
Currently only the ARM implementation is tested. ARMv6+ includes a rev
instruction; I've implemented versions that use it, but was not able to
test them.
git-svn-id: trunk@20685 -
always points to the previous r7 on the stack (with the saved return
address coming right after it) so that the debugger and crashreporter
can use it for backtraces as specified in the ABI
o changed NR_FRAME_POINTER_REG and RS_FRAME_POINTER_REG from a symbolic
into a typed constant, and added a new method to tprocinfo that can
be used to initialize it (so it can be inited to r7/r11 depending on
the target platform)
* allow using r9 on Darwin, it was only used by the system on iOS up to
2.x, which we no longer support
* prefer using r9 and r12 before r4..r11 on Darwin, because they are
volatile and hence do not have to be saved
git-svn-id: trunk@20661 -
o new eabihf (hard float) abi
o vfpv3_d16 variant of VFP (default variant used by EABI assemblers: VFPv3
with only 16 double registers instead of 32) and pass it to GNU as
o make the odd numbered single precision floating point VFP registers
available for explicit allocation for use by the calling convention
* fixed copy/paste error in stdname of S30 register
-> use -dFPC_ARMHF to create an ARM eabi hard float compiler
(mantis #21554)
git-svn-id: trunk@20660 -
* Fix for InterLockedCompareExchange on ARMEL
InterLockedCompareExchange would not return the current data on failure.
Getting this to work correctly is a bit tricky. As kuser_cmpxchg does
not return the set value, we have to load it.
There is a tiny chance that we get rescheduled between calling
kuser_cmpxchg and loading the value. If the value changed in between
there is the possibility that we would return the Comperand without
having done an actual swap. Which might cause havoc and destruction.
So, if the exchange failed, compare the value and loop again in case
of CurrentValue == Comperand.
* Improve testing of InterLockedCompareExchange
Added a test to check for the case when Comperand is different from the
current value.
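The retry logic described above can be sketched in C. cmpxchg_helper is a
stand-in (hypothetical name, built on the GCC/Clang
__sync_bool_compare_and_swap builtin) that mimics the helper's contract:
it returns 0 when the swap happened and nonzero otherwise, without
reporting the value it saw. This is an illustration, not the RTL assembler.

```c
#include <stdint.h>

/* Stand-in for the kernel helper: returns 0 if *ptr equalled oldval and
   was replaced by newval, nonzero otherwise. */
static int cmpxchg_helper(int32_t oldval, int32_t newval, volatile int32_t *ptr)
{
    return __sync_bool_compare_and_swap(ptr, oldval, newval) ? 0 : 1;
}

/* Returns the value found in *target; the swap took place only if that
   value equals comperand. */
static int32_t compare_exchange(volatile int32_t *target,
                                int32_t newval, int32_t comperand)
{
    for (;;) {
        if (cmpxchg_helper(comperand, newval, target) == 0)
            return comperand;        /* swapped: old value was comperand */

        int32_t current = *target;   /* helper failed: load the value   */
        if (current != comperand)
            return current;          /* genuine failure, report it      */

        /* The value became comperand again between the failed helper
           call and the load. Returning it now would claim a swap that
           never happened, so try again. */
    }
}
```

A caller can therefore still rely on "result = Comperand" meaning that the
exchange was performed.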
git-svn-id: trunk@20514 -
Use "nostackframe" for:
- Sptr (broken without nostackframe)
- get_caller_addr
- get_caller_frame
Use cmp+ldrne instead of movs+beq+ldr; it's a bit more pipeline-friendly
and takes some load off the BPU.
git-svn-id: trunk@20506 -
The following functions were changed to make use of the kernel helper
kuser_cmpxchg:
InterLockedDecrement
InterLockedIncrement
InterLockedExchangeAdd
InterLockedCompareExchange
The previous implementation using a spinlock had a couple of drawbacks:
1.) The functions could not be used safely on values not completely managed
by the process itself, because the spinlock did not protect data but the
functions. For example, think about two processes using shared memory.
They would not be able to share fpc_system_lock, making it unsafe to use
these functions.
2.) With many active threads, there was a high chance that the scheduler
would interrupt a thread while fpc_system_lock was taken, which would
result in the following threads using one of these functions to spinlock till
the end of its timeslice. This could result in unwanted and unnecessary
latencies.
3.) Every function contained a pointer to fpc_system_lock, resulting in
two polluted DCache lines per call and possible latencies through dcache
misses.
The new implementation only works on Linux Kernel >= 2.6.16
The functions are implemented in a way which tries to minimize cache pollution
and load latencies.
Even without multithreading the new functions are a lot faster. I did
comparisons on my Kirkwood 1.2GHz with the following template code:
var X: longint;
begin
  X := 0;
  while X < longint(100*1000000) do
    FUNCTION(X);
  Writeln(X);
end.
Function                      New        Old
InterLockedIncrement:         0m3.696s   0m23.220s
InterLockedExchangeAdd:       0m4.034s   0m23.242s
InterLockedCompareExchange:   0m4.703s   0m24.006s
This speedup is most probably due to the reduced number of memory
accesses; the accesses of the old implementation caused lots of cache
misses.
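How increment and add can be built on a compare-and-exchange helper without
any global lock is sketched below in C. cas_helper is a stand-in
(hypothetical name, using the GCC/Clang __sync_bool_compare_and_swap
builtin) for a helper that returns 0 on success; the real functions are
written in assembler and differ in detail.

```c
#include <stdint.h>

/* Stand-in for the kernel helper: 0 if *ptr was oldval and is now newval. */
static int cas_helper(int32_t oldval, int32_t newval, volatile int32_t *ptr)
{
    return __sync_bool_compare_and_swap(ptr, oldval, newval) ? 0 : 1;
}

/* Adds delta to *target and returns the previous value. Only the target
   itself is touched, so the operation also works on shared memory. */
static int32_t interlocked_exchange_add(volatile int32_t *target, int32_t delta)
{
    int32_t old, updated;
    do {
        old = *target;
        /* unsigned arithmetic avoids signed-overflow undefined behaviour */
        updated = (int32_t)((uint32_t)old + (uint32_t)delta);
    } while (cas_helper(old, updated, target) != 0);  /* retry if raced */
    return old;
}

/* Increments *target and returns the new value. */
static int32_t interlocked_increment(volatile int32_t *target)
{
    int32_t old = interlocked_exchange_add(target, 1);
    return (int32_t)((uint32_t)old + 1u);
}
```

If another thread modifies the value between the load and the helper call,
the helper fails and the loop simply recomputes from the fresh value.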
git-svn-id: trunk@20491 -
* generate add.w instead of add for thumb-2 in case one of the registers
is > r8
* add register interferences for the "add" instruction so the register
allocator can detect invalid instruction forms (even for assembler code)
* fixed error in thumb2.inc detected by the previous change
git-svn-id: trunk@16633 -
and above, so this also works when calling thumb code (should actually
also be done for ARMv5T, but we don't have a moniker for that yet)
* use BX instead of "mov r15, r14" for simple returns from subroutines
on ARMv6+ to support returning to thumb code from ARM code (idem)
git-svn-id: trunk@14332 -
+ RTL support:
o VFP exceptions are disabled by default on Darwin,
because they cause kernel panics on iPhoneOS 2.2.1 at least
o all denormals are truncated to 0 on Darwin, because disabling
that also causes kernel panics on iPhoneOS 2.2.1 (probably
because otherwise denormals can also cause exceptions)
* set softfloat rounding mode correctly for non-wince/darwin/vfp
targets
+ compiler support: only half the number of single precision
registers is available due to limitations of the register
allocator
+ added a number of comments about why the stackframe on ARM is
set up the way it is by the compiler
+ added regtype and subregtype info to regsets, because they're
also used for VFP registers (+ support in assembler reader)
+ various generic support routines for dealing with floating point
values located in integer registers that have to be transferred to
mm registers (needed for VFP)
* renamed use_sse() to use_vectorfpu() and also use it for
ARM/vfp support
o only superficially tested for Linux (compiler compiled with -Cpvfpv6
-Cfvfpv2 works on a Cortex-A8, no testsuite run performed -- at least
the fpu exception handler still needs to be implemented), Darwin has
been tested more thoroughly
+ added ARMv6 cpu type and made it default for Darwin/ARM
+ ARMv6+ implementations of atomic operations using ldrex/strex
* don't use r9 on Darwin/ARM, as it's reserved under certain
circumstances (don't know yet which ones)
* changed C-test object files for ARM/Darwin to ARMv6 versions
* check in assembler reader that regsets are not empty, because
instructions with a regset operand have undefined behaviour in that
case
* fixed resultdef of tarmtypeconvnode.first_int_to_real in case of
int64->single type conversion
* fixed constant pool locations in case 64 bit constants are generated,
and/or when vfp instructions with limited reach are present
WARNING: when using VFP on an ARMv6 or later cpu, you *must* compile all
code with -Cparmv6 (or higher), or you will get crashes. The reason is
that storing/restoring multiple VFP registers must happen using
different instructions on pre/post-ARMv6.
git-svn-id: trunk@14317 -
-- Merging differences between repository URLs into '.':
U rtl/arm/setjump.inc
A rtl/arm/thumb2.inc
U rtl/arm/divide.inc
A rtl/embedded/arm/stm32f103.pp
U rtl/inc/system.inc
U compiler/alpha/cgcpu.pas
U compiler/sparc/cgcpu.pas
U compiler/i386/cgcpu.pas
U compiler/ncgld.pas
U compiler/powerpc/cgcpu.pas
U compiler/avr/cgcpu.pas
U compiler/aggas.pas
U compiler/powerpc64/cgcpu.pas
U compiler/x86_64/cgcpu.pas
U compiler/cgobj.pas
U compiler/psystem.pas
U compiler/aasmtai.pas
U compiler/m68k/cgcpu.pas
U compiler/ncgutil.pas
U compiler/rautils.pas
U compiler/arm/raarmgas.pas
U compiler/arm/armatts.inc
U compiler/arm/cgcpu.pas
U compiler/arm/armins.dat
U compiler/arm/rgcpu.pas
U compiler/arm/cpubase.pas
U compiler/arm/agarmgas.pas
U compiler/arm/cpuinfo.pas
U compiler/arm/armop.inc
U compiler/arm/narmadd.pas
U compiler/arm/aoptcpu.pas
U compiler/arm/armatt.inc
U compiler/arm/aasmcpu.pas
U compiler/systems/t_embed.pas
U compiler/psub.pas
U compiler/options.pas
git-svn-id: trunk@13801 -
multi-platform version of patch in r12461, which caused the i386 version
of fpc_pchar_length to return 0 in all cases, which used tabs, and did
not include a test case)
git-svn-id: trunk@12464 -
+ int_str assembler implementations for i386
+ fpc_shortstr_to_shortstr assembler implementation for ARM
+ fpc_shortstr_assign assembler implementation for ARM
+ fpc_Pchar_length assembler implementation for ARM
git-svn-id: trunk@9582 -